Tree Construction

Last updated 5 years ago

About

To generate a score suitable for clustering related assemblies, Pathogenwatch compares variant positions across all pairs of loci found in the two assemblies, excluding those removed by the previously calculated variation filter (see Core Filter). So that incomplete assemblies can also be analysed, the score over the compared region is scaled to the expected core size, which approximates the score the assembly would receive if it were complete.

Scoring Assembly Pairs

  1. Extract the substitutions for each locus. Indels are excluded because they are often the result of assembly or sequencing error, and our testing found that the noise from these events could overwhelm the true distances between closely related assemblies.

  2. In the vast majority of cases each assembly contains only a single locus for a given family, so these are trivially paired up. If there is more than one locus, the most similar loci are paired together. Unmatched loci are ignored.

  3. During the comparison, the number of nucleotides compared is tracked. At the end, this count is used to scale the score to the "expected number of nucleotides", which is calculated as the sum of the lengths of the reference sequences used to identify the core.

  4. The total number of variant sites between the pair of assemblies is calculated and then scaled to the expected number of nucleotides, as described above.
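
The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Pathogenwatch implementation: the `Locus` class, the `(position, kind, alt_base)` variant tuples, the use of the per-nucleotide difference rate as the measure of "most similar", and the linear scaling `differences * expected / compared` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    """Hypothetical record for one core locus found in an assembly."""
    start: int  # first reference position covered (inclusive)
    end: int    # last reference position covered (exclusive)
    variants: list = field(default_factory=list)  # (position, kind, alt_base)

    def substitutions(self):
        # Step 1: keep substitutions only; indels are dropped.
        return {pos: alt for pos, kind, alt in self.variants if kind == "sub"}


def compare_loci(a, b):
    """Return (differing sites, nucleotides compared) over the shared region."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    if hi <= lo:
        return 0, 0
    subs_a, subs_b = a.substitutions(), b.substitutions()
    sites = {
        pos
        for pos in subs_a.keys() | subs_b.keys()
        if lo <= pos < hi and subs_a.get(pos) != subs_b.get(pos)
    }
    return len(sites), hi - lo


def pair_loci(loci_a, loci_b):
    """Step 2: greedily pair the most similar loci; unmatched loci are ignored."""
    candidates = []
    for i, a in enumerate(loci_a):
        for j, b in enumerate(loci_b):
            differences, compared = compare_loci(a, b)
            if compared:
                # Assumption: similarity is the difference rate over the overlap.
                candidates.append((differences / compared, i, j))
    candidates.sort()
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((loci_a[i], loci_b[j]))
    return pairs


def scaled_score(assembly_a, assembly_b, expected_nucleotides):
    """Steps 3 and 4: count differences, track compared nucleotides, scale."""
    differences = compared = 0
    # Each assembly is assumed to be a dict mapping family name to a list of loci.
    for family in assembly_a.keys() & assembly_b.keys():
        for a, b in pair_loci(assembly_a[family], assembly_b[family]):
            d, n = compare_loci(a, b)
            differences += d
            compared += n
    if compared == 0:
        return None  # nothing comparable between the two assemblies
    return differences * expected_nucleotides / compared
```

Under these assumptions, two assemblies that differ at 2 sites over 150 compared nucleotides, with an expected core of 1000 nucleotides, would receive a scaled score of 2 × 1000 / 150. The sketch omits the variation filter for brevity.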

The Dendrogram

  1. A dendrogram is then constructed by writing all scaled pairwise scores to a matrix and running the neighbour-joining implementation from the APE package (Paradis et al).

  2. The resulting tree is then midpoint rooted using the phangorn package (KP Schliep).
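
To make the first step concrete, the following Python sketch writes pairwise scores into a symmetric matrix and runs a textbook neighbour-joining procedure over it. It is an illustration only and is not the APE code that Pathogenwatch calls: the function names, the dict-of-dicts matrix, and the `nodeN` labels for internal nodes are hypothetical, at least two assemblies are assumed, and midpoint rooting is not shown.

```python
from itertools import combinations


def distance_matrix(names, pair_scores):
    """Write scaled pairwise scores into a symmetric dict-of-dicts matrix."""
    dist = {a: {b: 0.0 for b in names} for a in names}
    for (a, b), score in pair_scores.items():
        dist[a][b] = dist[b][a] = float(score)
    return dist


def neighbour_joining(dist):
    """Return the unrooted tree as (parent, child, branch_length) edges."""
    dist = {a: dict(row) for a, row in dist.items()}  # work on a copy
    edges = []
    next_id = 0
    while len(dist) > 2:
        n = len(dist)
        totals = {a: sum(row.values()) for a, row in dist.items()}

        def q(pair):
            # Standard neighbour-joining criterion; the smallest value is joined.
            i, j = pair
            return (n - 2) * dist[i][j] - totals[i] - totals[j]

        i, j = min(combinations(dist, 2), key=q)
        d_ij = dist[i][j]
        len_i = 0.5 * d_ij + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        new = f"node{next_id}"
        next_id += 1
        edges.append((new, i, len_i))
        edges.append((new, j, len_j))

        # Distances from the new internal node to every remaining node.
        new_row = {
            k: 0.5 * (dist[i][k] + dist[j][k] - d_ij)
            for k in dist
            if k not in (i, j)
        }
        for k in (i, j):
            del dist[k]
        for row in dist.values():
            del row[i]
            del row[j]
        for k, d in new_row.items():
            dist[k][new] = d
        new_row[new] = 0.0
        dist[new] = new_row

    a, b = dist  # the last two nodes are joined by a single edge
    edges.append((a, b, dist[a][b]))
    return edges


if __name__ == "__main__":
    names = ["a", "b", "c", "d", "e"]
    scores = {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }
    for parent, child, length in neighbour_joining(distance_matrix(names, scores)):
        print(parent, child, length)
```

The edge list describes an unrooted tree. Rooting it at the midpoint of the longest leaf-to-leaf path, as the second step describes, would be a separate pass over these edges.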

Links referenced above:
  • Core Filter (the variation filter)
  • Paradis et al (APE package)
  • KP Schliep (phangorn package)