The process aims to construct a library of non-overlapping universal expressed sequence regions. A set of reference assemblies is used to identify these elements and to define the expected mutation rate for the species, which is then used by the filter step.
For each targeted species a core was either:
- 1.All families with matching starts and ends (flat edges to the alignment, “perfect rectangles”) were used without alteration. This is the case for the vast majority of core families identified.
- 2.The remaining families were aligned with MAFFT and processed to produce straight edges.
- For the majority of these families, variation in the gene boundaries was due to different potential start sites being used, while a much smaller number were due to alternative stop codons. In general, a consensus-based approach was used to determine the likely start or stop site.
- If there was uncertainty that a region was found in all genes, a representative sequence was searched against the original assemblies to confirm that it was present in all of them.
- 3.A representative from each family is selected by identifying the sequence with the greatest number of nucleotides in agreement with the consensus according to the alignment.
- 4.These are searched against each reference genome using BLAST (60% identity threshold, E-value 1e-35, 80% gene coverage) and the location of each match extracted.
- 1.Families with multiple matches are removed at this stage.
- 2.Overlapping complete matches (on either strand) are fused into a single segment that covers both genes. This is done in each reference genome, and the resulting multi-gene segments are searched against each genome and the cross-hits resolved. Segments with partial matches in other genomes are also removed at this stage.
- 5.A database is constructed from the fused gene set and searched against the reference genomes again, using BLAST as before. Any segments that do not meet the following criteria are excluded:
- 1.A complete copy is found in every reference genome.
- 2.There are no extra partial matches above the gathering thresholds.
- 6.This merged and reduced set of DNA segments defines the core sequence database used in Pathogenwatch.
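Two of the steps above, choosing a family representative by agreement with the consensus and fusing overlapping matches into a single segment, can be illustrated with a minimal Python sketch. This is not the Pathogenwatch implementation: the function names, the simple column-majority consensus, the exclusion of gap characters from the agreement count, and the assumption that all matches lie on one contig with 1-based inclusive coordinates are all illustrative choices.

```python
from collections import Counter


def consensus(aligned):
    """Column-wise majority consensus of equal-length aligned sequences.

    Assumption: a plain majority vote per column; ties go to the
    character seen first in that column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def pick_representative(aligned):
    """Return the index of the sequence that agrees with the consensus
    at the greatest number of nucleotide (non-gap) positions."""
    cons = consensus(aligned)

    def agreement(seq):
        return sum(1 for a, b in zip(seq, cons) if a == b and a != "-")

    # max() returns the first index when several sequences tie.
    return max(range(len(aligned)), key=lambda i: agreement(aligned[i]))


def fuse_overlaps(matches):
    """Merge overlapping (start, end) matches into fused segments.

    Assumption: all matches are on the same contig and strand is
    ignored, since overlaps on either strand are fused.
    """
    fused = []
    for start, end in sorted(matches):
        if fused and start <= fused[-1][1]:
            # Overlaps the previous segment: extend it.
            fused[-1][1] = max(fused[-1][1], end)
        else:
            fused.append([start, end])
    return [tuple(segment) for segment in fused]


if __name__ == "__main__":
    family = ["ATGCA", "ATGCT", "ATGGA"]
    print(consensus(family), pick_representative(family))
    print(fuse_overlaps([(450, 900), (100, 500), (1200, 1500)]))
```

In this toy example the first sequence matches the consensus at every position and is chosen as the representative, and the first two matches are fused into one segment spanning both genes.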
- 1.The query assembly is searched against the species' BLAST database using blastn with species-specific parameters. By default these are:
-evalue 1e-35 -perc_identity [gathering threshold].
- 2.Hits below 80% of the core gene length are removed as fragments.
- 3.For each allele, the SHA-1 checksum and variants relative to the representative sequence are determined. Variants are segregated into substitutions, insertions and deletions, with adjacent variants of the same type merged into a single mutation.
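The query steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Pathogenwatch code: the function names and file paths are hypothetical, the tabular output format (`-outfmt 6`) is an assumed choice, upper-casing the sequence before hashing is an assumed normalisation, and the variant caller expects the representative and allele to be supplied as an already gapped pairwise alignment of equal length.

```python
import hashlib


def blastn_args(query_fasta, db_path, perc_identity):
    """Build a blastn command line with the default parameters.

    query_fasta and db_path are hypothetical paths; perc_identity is
    the species' gathering threshold.
    """
    return [
        "blastn", "-query", query_fasta, "-db", db_path,
        "-evalue", "1e-35", "-perc_identity", str(perc_identity),
        "-outfmt", "6",  # assumption: tabular output for parsing
    ]


def is_fragment(hit_length, gene_length, min_coverage=0.8):
    """True if the hit covers less than 80% of the core gene length."""
    return hit_length < min_coverage * gene_length


def allele_checksum(sequence):
    """SHA-1 hex digest of the allele sequence.

    Assumption: the sequence is upper-cased before hashing.
    """
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def call_variants(ref_aln, allele_aln):
    """List variants between two gapped, equal-length aligned strings.

    Adjacent variants of the same type are merged into one mutation.
    'pos' is the 1-based reference position of the first affected base;
    for insertions it is the last reference base before the insertion.
    """
    variants = []
    prev_kind = None
    ref_pos = 0  # reference bases consumed so far
    for r, a in zip(ref_aln, allele_aln):
        if r != "-":
            ref_pos += 1
        if r == a:
            kind = None
        elif r == "-":
            kind = "insertion"
        elif a == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        if kind is not None:
            ref_base, alt_base = r.replace("-", ""), a.replace("-", "")
            if kind == prev_kind:
                # Same type as the previous column: extend that mutation.
                variants[-1]["ref"] += ref_base
                variants[-1]["alt"] += alt_base
            else:
                variants.append(
                    {"type": kind, "pos": ref_pos, "ref": ref_base, "alt": alt_base}
                )
        prev_kind = kind
    return variants


if __name__ == "__main__":
    print(" ".join(blastn_args("query.fasta", "species_core", 80)))
    print(is_fragment(790, 1000), is_fragment(800, 1000))
    print(allele_checksum("ATGCCGTA"))
    for variant in call_variants("ATGCC-GTA", "ATTTCAG-A"):
        print(variant)
```

In the example alignment, the two neighbouring substitutions are reported as a single two-base substitution, followed by a one-base insertion and a one-base deletion.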