Two filtering steps are applied to remove loci that can be problematic for tree building. First, paralogues are added to the filter. Then, loci that show unexpectedly high variation when compared to the nearest reference are removed.
- Any core gene that has more than one match in the core profile is added to the filter.
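As a rough illustration of the paralogue rule, the sketch below flags any core family that is matched more than once. The function name and the input (a flat list of matched family identifiers for one assembly) are assumptions made for this example, not the actual Pathogenwatch data model.

```python
from collections import Counter


def find_paralogues(matched_family_ids):
    """Return the set of core families with more than one match.

    `matched_family_ids` is a hypothetical flat list containing one
    entry per match found in the core profile of a single assembly.
    """
    counts = Counter(matched_family_ids)
    return {family for family, n in counts.items() if n > 1}


# Family "gyrA" is matched twice, so it would be added to the filter.
paralogues = find_paralogues(["gyrA", "recA", "gyrA", "rpoB"])
```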
The variation filter is used to identify and remove loci that show an unusually large (or, in more distant comparisons, unusually small) number of variant sites given the mutation rate over the rest of the genome. For this we assume that the distribution of mutations amongst loci should approximate a Poisson distribution, and exclude loci that fall outside a predetermined probability threshold. "Excessively" variant loci are likely to be so due to either (a) erroneous assembly, which is not unlikely when dealing with significant numbers of genomes, or (b) lateral gene transfer. In both cases, including the locus in tree building can distort branch lengths and mislead the neighbour-joining algorithm.
To determine a probability threshold for marking a locus as unexpectedly variant, the following approach is applied. We assume that the two assemblies in a given pair are equally diverged from a (close) common ancestor, and so we should observe twice as many variants as have occurred in a single genome. Thus the first calculation is 1 / (2 x core families). At this point it would be expected to use the number of comparisons in the calculation to further lower the threshold, since in carrying out many comparisons we would expect to see rare events occurring. However, this would make calculations between collections not directly comparable, so we use a fixed number of comparisons that takes into account the large number of comparisons we expect to run in Pathogenwatch. Thus the final threshold calculation is:
1 / (1,000,000 x 2 x C), where C is the number of core families.
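The threshold formula can be written as a one-line function. This is a minimal sketch: the names `FIXED_COMPARISONS` and `variation_threshold` are invented for illustration, and only the constants come from the formula above.

```python
# Fixed number of comparisons from the formula above (name is illustrative).
FIXED_COMPARISONS = 1_000_000


def variation_threshold(core_families):
    """Probability threshold 1 / (1,000,000 x 2 x C) for C core families."""
    if core_families <= 0:
        raise ValueError("core_families must be a positive integer")
    return 1.0 / (FIXED_COMPARISONS * 2 * core_families)


# Example with a hypothetical core of 2,000 families.
threshold = variation_threshold(2000)
```

For a hypothetical core of 2,000 families this works out to 1 / 4,000,000,000, or 2.5 x 10^-10, which shows how strict the cut-off is once the fixed number of comparisons is included.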
During the reference assignment task, the number of differences at each locus and the total number of differences and nucleotides are counted. An overall mutation rate is calculated as differences / total nucleotides.
1. Then, for each locus, an expected number of mutations is determined by multiplying the mutation rate by the locus length in nucleotides. If the expected number is below 1, a minimum value of 1 is used.
2. The expected value is used as the mean of a Poisson distribution, and the cumulative probability of observing the observed number of mutations or more is determined. If the observed number of mutations is below the mean, the inverse is calculated instead, i.e. the probability of observing that number or fewer, though in practice this never occurs as the mutation rate is normally low.
3. Loci whose probability falls below the threshold are noted and excluded from further comparisons.
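The steps above can be sketched in a few lines of Python using only the standard library. This is an illustrative reading of the procedure, not the Pathogenwatch implementation: the function names, the `{locus: (differences, length)}` input shape, and the choice to exclude a locus when its tail probability is strictly below the threshold are all assumptions. The upper tail is summed directly rather than computed as 1 minus the lower tail, because the thresholds involved are too small for that subtraction to stay accurate.

```python
import math


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), summed term by term from k upwards."""
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space to avoid overflow in k!.
    term = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    total, i = 0.0, k
    # Keep adding terms until they are negligible and we are past the mean.
    while term > total * 1e-17 or i <= mean:
        total += term
        i += 1
        term *= mean / i
    return min(total, 1.0)


def poisson_lower_tail(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    total = sum(
        math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))
        for i in range(k + 1)
    )
    return min(total, 1.0)


def find_variant_loci(loci, threshold):
    """Return names of loci whose tail probability is below `threshold`.

    `loci` maps a locus name to (observed differences, length in nucleotides);
    this input shape is hypothetical.
    """
    total_diffs = sum(diffs for diffs, _ in loci.values())
    total_nt = sum(length for _, length in loci.values())
    rate = total_diffs / total_nt  # overall mutation rate

    excluded = []
    for name, (observed, length) in loci.items():
        expected = max(1.0, rate * length)  # step 1: floor the mean at 1
        if observed >= expected:            # step 2: tail probability
            p = poisson_upper_tail(observed, expected)
        else:
            p = poisson_lower_tail(observed, expected)
        if p < threshold:                   # step 3: note and exclude
            excluded.append(name)
    return excluded


# Toy data: 99 loci with one difference each, plus one heavily variant locus.
example = {f"locus{i}": (1, 1000) for i in range(99)}
example["outlier"] = (30, 1000)
flagged = find_variant_loci(example, threshold=5e-9)
```

In this toy data set the overall rate is 129 / 100,000, so each 1,000-nucleotide locus expects about 1.29 mutations. Observing 30 is far beyond the threshold, so only the outlier locus is flagged.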