To generate a score suitable for clustering related assemblies, Pathogenwatch takes each pair of assemblies and compares variant positions from all pairs of loci found in the two assemblies, bar those excluded by the previously calculated variation filter (see Core Filter). So that incomplete assemblies can also be analysed, the score over the partial region is scaled to an expected core size, giving an approximation of what the score would be if the assembly were complete.
Scoring Assembly Pairs
Substitutions are extracted for each locus. Indels are excluded because they are often the result of assembly or sequencing error, and our testing found that the noise from these events could overwhelm the true distances between closely related assemblies.
In the vast majority of cases there will only be a single locus for the family in both assemblies, so these are trivially paired up. If there is more than one locus then the most similar loci are paired together. Unmatched loci are ignored.
During the comparison the number of compared nucleotides is tracked. At the end this is used to scale the score to the "expected number of nucleotides". The expected number of nucleotides is calculated as the sum of the lengths of the reference sequences used to identify the core.
The total number of variant sites between the pair of assemblies is calculated and then scaled to the expected number of nucleotides, as described above.
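The scoring steps above can be illustrated with a minimal Python sketch. It is not the Pathogenwatch implementation: the `Locus` structure, the function names, and the choice to count compared nucleotides as the overlap of the two loci in reference coordinates are all assumptions made for illustration. Loci are assumed to be already paired, and positions removed by the variation filter are passed in as a plain set.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    """Hypothetical record for one matched core locus in an assembly."""
    start: int  # first reference position covered (inclusive)
    end: int    # last reference position covered (inclusive)
    # Substitutions only (no indels): reference position -> observed base.
    substitutions: dict = field(default_factory=dict)


def compare_loci(a, b, excluded=frozenset()):
    """Return (differing variant sites, compared nucleotides) for one locus pair."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    if lo > hi:
        return 0, 0  # the two loci share no reference positions
    # Candidate sites are positions where either assembly has a substitution,
    # minus anything the variation filter excluded.
    sites = (set(a.substitutions) | set(b.substitutions)) - set(excluded)
    differences = sum(
        1
        for pos in sites
        if lo <= pos <= hi
        and a.substitutions.get(pos) != b.substitutions.get(pos)
    )
    return differences, hi - lo + 1


def scaled_score(pairs, expected_nucleotides):
    """Scale the raw difference count to the expected number of nucleotides.

    `pairs` is an iterable of (locus_a, locus_b, excluded_positions).
    Returns None when nothing could be compared.
    """
    total_differences = 0
    total_compared = 0
    for a, b, excluded in pairs:
        differences, compared = compare_loci(a, b, excluded)
        total_differences += differences
        total_compared += compared
    if total_compared == 0:
        return None
    return total_differences * expected_nucleotides / total_compared
```

For example, one difference found across 50 compared nucleotides, with an expected core size of 1000 nucleotides, scales to a score of 20. Whether filtered positions should also be subtracted from the compared-nucleotide count is not stated in the text; this sketch leaves them in.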
A dendrogram is then constructed by writing all scaled pairwise scores to a matrix and running the neighbour-joining implementation from the APE package (Paradis et al.).
The resulting tree is then midpoint rooted using the phangorn package (KP Schliep).
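As a rough illustration of the matrix-writing step, the sketch below assembles a symmetric distance matrix from pairwise scores. The function name and the input layout (a dictionary keyed by assembly-name pairs) are hypothetical. Neighbour joining and midpoint rooting themselves are left to APE and phangorn, as described above, and are not reimplemented here.

```python
def build_distance_matrix(names, scores):
    """Build a symmetric matrix with a zero diagonal from pairwise scores.

    names  -- ordered list of assembly names (row and column order)
    scores -- dict mapping (name_a, name_b) to the scaled pairwise score
    """
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for (name_a, name_b), score in scores.items():
        i, j = index[name_a], index[name_b]
        # Each score is stored in both triangles so the matrix stays symmetric.
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix
```

Each pair only needs to be scored once, since the score for (A, B) is the same as for (B, A).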