
Core Filter

About

Two filtering steps are applied to remove loci that can be problematic for tree building. First, paralogues are added to the filter; then loci that show unexpectedly high variance when compared to the nearest reference are removed.

Filtering Process

Paralog Filter

  • Any core gene that has more than one match in the core profile is added to the filter.
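
As a rough illustration of the rule above, the sketch below counts matches per core gene family and flags any family seen more than once. The input format (a flat list of family identifiers, one entry per match) and the function name `paralog_filter` are assumptions made for this example, not the Pathogenwatch data model.

```python
from collections import Counter


def paralog_filter(core_matches):
    """Return the set of core gene families with more than one match.

    `core_matches` is a hypothetical input: an iterable of core gene
    family identifiers, one entry per match found in the core profile.
    """
    counts = Counter(core_matches)
    return {family for family, n in counts.items() if n > 1}


# Example: family "geneA" matches twice, so it is added to the filter.
filtered = paralog_filter(["geneA", "geneB", "geneA", "geneC"])
```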

Variance Filter

Principle

The variance filter identifies and removes loci that show an unusually large (or, in more distant comparisons, unusually small) number of variant sites given the mutation rate over the rest of the genome. We assume that the distribution of mutations amongst loci should approximate a Poisson distribution, and exclude loci that fall outside a predetermined probability threshold. "Excessively" variant loci are likely to be the result of either (a) erroneous assembly, which is not unlikely when dealing with significant numbers of genomes, or (b) lateral gene transfer. In both cases, including the locus in tree building can lead to errors in branch lengths and in the neighbour-joining algorithm.

Determining the probability threshold

To determine a probability threshold for marking a locus as unexpectedly variant, the following approach is applied. We assume that the two assemblies in a given pair are equally diverged from a (close) common ancestor, and so we should observe twice as many variants as have occurred in a single genome. Thus the first calculation is 1 / (2 x core families). At this point it would be expected to use the number of comparisons in the calculation to further lower the threshold, since in carrying out many comparisons we would expect to see rare events occurring. However, this would make calculations between collections not directly comparable, so we use a fixed number of comparisons that takes into account the large number of comparisons we expect to run in Pathogenwatch. Thus the final threshold calculation is:

1 / (1000000 x 2 x C) where C is the number of core families.
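
The formula can be written as a small helper to make the arithmetic concrete. This is only a sketch: the function and parameter names are illustrative, and the value of 2000 core families in the usage line is an arbitrary example rather than a figure from any Pathogenwatch scheme.

```python
# Fixed number of comparisons used in the formula above.
FIXED_COMPARISONS = 1_000_000


def variance_threshold(core_families, comparisons=FIXED_COMPARISONS):
    """Probability threshold 1 / (comparisons x 2 x C).

    `core_families` is C, the number of core families.
    """
    if core_families <= 0:
        raise ValueError("core_families must be a positive integer")
    return 1.0 / (comparisons * 2 * core_families)


# With a hypothetical 2000 core families the threshold is 2.5e-10.
threshold = variance_threshold(2000)
```

Because the number of comparisons is fixed, the threshold depends only on the number of core families, which keeps results comparable between collections.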

It should be noted that this filter is very conservative and only removes extremely divergent alleles. In the vast majority of genomes we don't observe any filtered loci.

Creating The Variation Filter

During the reference assignment task, the number of differences at each locus and the total number of differences and nucleotides are counted. An overall mutation rate is calculated as differences / total nucleotides.

  1. Then for each locus an expected number of mutations is determined by multiplying the mutation rate by the locus length in nucleotides. If the expected number is below 1, a minimum value of 1 is used.

  2. The expected value is used as the mean of a Poisson distribution, and the cumulative probability of the observed number of mutations or more is determined. If the number of mutations is below the mean (i.e. fewer are observed than expected), the inverse is determined instead, though in practice this never occurs as the mutation rate is normally low.

  3. Loci that fail to meet the threshold are noted and excluded from further comparisons.
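
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Pathogenwatch implementation: the dictionary inputs and function names are hypothetical, the "inverse" case is interpreted here as the probability of observing that many mutations or fewer, and the totals are assumed to be summed over the loci supplied. The tail is summed directly rather than computed as one minus the cumulative distribution, since the threshold is far smaller than the rounding error that subtraction would introduce.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def poisson_upper_tail(k, mean):
    """P(X >= k); intended for k >= mean, where terms shrink from the start."""
    if k <= 0:
        return 1.0
    term = poisson_pmf(k, mean)
    total = 0.0
    i = k
    while term > 0.0:
        total += term
        i += 1
        term *= mean / i
        # Stop once remaining terms cannot change the sum.
        if i > mean and term < total * 1e-17:
            break
    return min(total, 1.0)


def tail_probability(observed, expected):
    """Upper tail if observed >= expected, otherwise the lower tail (assumption)."""
    if observed >= expected:
        return poisson_upper_tail(observed, expected)
    return min(sum(poisson_pmf(i, expected) for i in range(observed + 1)), 1.0)


def variance_filter(differences, lengths, core_families, comparisons=1_000_000):
    """Return loci whose tail probability falls below 1 / (comparisons x 2 x C).

    `differences` and `lengths` are hypothetical dicts keyed by locus name,
    holding the number of differences and the locus length in nucleotides.
    """
    rate = sum(differences.values()) / sum(lengths.values())
    threshold = 1.0 / (comparisons * 2 * core_families)
    filtered = []
    for locus, observed in differences.items():
        expected = max(1.0, rate * lengths[locus])  # minimum expected value of 1
        if tail_probability(observed, expected) < threshold:
            filtered.append(locus)
    return filtered


# Example: 99 loci with one difference each and one locus with 60.
diffs = {f"locus_{i}": 1 for i in range(99)}
diffs["locus_99"] = 60
lengths = {name: 1000 for name in diffs}
flagged = variance_filter(diffs, lengths, core_families=100)
```

In this toy example only the locus with 60 differences should fall below the threshold, which is consistent with the filter being very conservative.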


Last updated 5 years ago
