To determine a probability threshold for marking a locus as unexpectedly variant, the following approach is applied. We assume that the two assemblies in a given pair have diverged equally from a (close) common ancestor, so we should observe twice as many variants between them as have occurred in either genome alone. The first calculation is therefore 1 / (2 x core families). At this point it would be natural to lower the threshold further using the number of comparisons, since rare events are expected to occur when many comparisons are carried out. However, this would make calculations from different collections not directly comparable, so we instead use a fixed number of comparisons that reflects the large number of comparisons we expect to run in Pathogenwatch. Thus the final threshold calculation is: