Core Assignment
The process aims to construct a library of non-overlapping, universally present expressed sequence regions. A set of reference assemblies is used to identify these elements and to define the expected mutation rate for the species, which is used by the filter step.
For each targeted species, a core was either:
Provided by a collaborator, in which case BLAST matches were extracted from each reference genome and aligned using MAFFT.
Built in-house using Roary, in which case the alignments were used directly.
All families with matching starts and ends (flat edges to the alignment, “perfect rectangles”) were used without alteration. This is the case for the vast majority of core families identified.
The remaining families were aligned with MAFFT and processed to produce straight edges.
For the majority of these families, variation in the gene boundaries was due to different potential start sites being used, while a much smaller number were due to alternative stop codons. In general, a consensus-based approach was used to determine the likely start or stop site.
If there was uncertainty that a region was found in all genes, a representative sequence was searched against the original assemblies to confirm that it was present in all of them.
A representative from each family is selected by identifying the sequence with the greatest number of nucleotides in agreement with the consensus, according to the alignment.
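The selection rule above can be illustrated with a minimal Python sketch. It assumes the family alignment is already available as equal-length gapped strings, takes a simple column-wise majority as the consensus, and does not count gap characters as agreement. The function names and the tie-breaking behaviour are illustrative assumptions, not the actual Pathogenwatch implementation.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Column-wise majority character of equal-length aligned sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


def pick_representative(aligned):
    """Return (sequence id, score) for the best match to the consensus.

    `aligned` maps a sequence id to its gapped, aligned sequence.
    The score is the number of positions where the sequence carries the
    same nucleotide as the consensus (gaps are not counted as agreement,
    which is an assumption made for this sketch).
    """
    cons = consensus(aligned.values())

    def score(seq):
        return sum(1 for a, b in zip(seq, cons) if a == b and a != "-")

    best = max(aligned, key=lambda seq_id: score(aligned[seq_id]))
    return best, score(aligned[best])
```

With the hypothetical alignment `{"a": "ATGCCA", "b": "ATGCCT", "c": "ATG-CT"}`, sequence `b` is identical to the majority consensus and would be chosen.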
These are searched against each reference genome using BLAST (60% identity threshold, E-value 1e-35, 80% gene coverage) and the location of each match extracted.
Families with multiple matches are removed at this stage.
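As a rough illustration of the two steps above, the following Python sketch applies the stated thresholds to a list of hit records and then discards families with more than one passing match. The record field names (`family`, `genome`, `pident`, `evalue`, `qstart`, `qend`, `qlen`) are hypothetical, coverage is assumed to be the aligned fraction of the representative gene, and "multiple matches" is assumed to mean more than one passing hit within a single reference genome.

```python
from collections import defaultdict

MIN_IDENTITY = 60.0   # percent identity threshold from the text
MAX_EVALUE = 1e-35    # E-value threshold from the text
MIN_COVERAGE = 0.80   # fraction of the gene that must be covered


def passes(hit):
    """True if a hit meets the identity, E-value and coverage thresholds."""
    coverage = (hit["qend"] - hit["qstart"] + 1) / hit["qlen"]
    return (
        hit["pident"] >= MIN_IDENTITY
        and hit["evalue"] <= MAX_EVALUE
        and coverage >= MIN_COVERAGE
    )


def single_copy_families(hits):
    """Return families that never match more than once in any one genome."""
    counts = defaultdict(int)
    for hit in hits:
        if passes(hit):
            counts[(hit["family"], hit["genome"])] += 1
    families = {family for family, _ in counts}
    multi_copy = {family for (family, _), n in counts.items() if n > 1}
    return families - multi_copy
```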
Overlapping complete matches (on either strand) are fused into a single segment that covers both genes. This is done in each reference genome, and the resulting multi-gene segments are then searched against every reference genome and the cross-hits resolved. Segments with partial matches in other genomes are also removed at this stage.
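The fusion step is essentially an interval merge. The sketch below shows the idea for matches on a single contig, assuming 1-based inclusive coordinates that have already been normalised so that start is less than or equal to end regardless of strand. The tuple layout and function name are assumptions made for illustration, and the cross-genome resolution described above is not covered.

```python
def fuse_overlaps(matches):
    """Merge overlapping matches into multi-gene segments.

    `matches` is an iterable of (start, end, family) tuples on one contig.
    Returns a list of (start, end, [families]) sorted by start position.
    """
    segments = []
    for start, end, family in sorted(matches):
        if segments and start <= segments[-1][1]:
            # Overlaps the previous segment: extend it and record the gene.
            prev_start, prev_end, families = segments[-1]
            segments[-1] = (prev_start, max(prev_end, end), families + [family])
        else:
            segments.append((start, end, [family]))
    return segments
```

For example, hypothetical matches at 100-500 and 450-900 would be fused into one segment spanning 100-900, while a match at 1200-1500 would remain a segment of its own.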
A database is constructed from the fused gene set and searched against the reference genomes again, using BLAST as before. Any segments that do not meet the following criteria are excluded:
A complete copy is found in every reference genome
There are no extra partial matches above the gathering thresholds.
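The two criteria can be expressed as a small predicate. This sketch assumes each hit for a segment has already passed the gathering thresholds and carries a hypothetical `coverage` field (the aligned fraction of the segment length), and it treats a hit as complete only when that fraction reaches a cutoff of 1.0. Both the field name and the cutoff are assumptions.

```python
def meets_core_criteria(hits, genomes, complete=1.0):
    """Check one segment against both criteria.

    `hits` is a list of dicts with "genome" and "coverage" keys, all of
    which are assumed to be above the gathering thresholds already.
    `genomes` lists every reference genome identifier.
    """
    complete_in = {h["genome"] for h in hits if h["coverage"] >= complete}
    has_partial = any(h["coverage"] < complete for h in hits)
    # A complete copy in every genome, and no partial matches anywhere.
    return complete_in >= set(genomes) and not has_partial
```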
This merged and reduced set of DNA segments defines the core sequence database used in Pathogenwatch.
Any candidate core segments with potential paralogs or pseudogene copies are identified and removed. Such regions can often fail to assemble well and produce artificial variance in genome comparisons.
The query assembly is searched against the species' BLAST database using blastn with species-specific parameters. By default these are -evalue 1e-35 -perc_identity [gathering threshold].
Hits below 80% of the core gene length are removed as fragments.
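A minimal Python sketch of the search and fragment filter is given below. It only assembles the blastn argument list with the default parameters quoted above and does not run the search. The tabular output format (`-outfmt 6`), the file paths, and the hit field names (`length`, `gene_length`) are assumptions for illustration rather than details taken from the pipeline.

```python
def blastn_command(query_fasta, db_path, pid_threshold, evalue="1e-35"):
    """Build a blastn argument list using the default parameters."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db_path,
        "-evalue", evalue,
        "-perc_identity", str(pid_threshold),
        "-outfmt", "6",  # assumption: tabular output for easy parsing
    ]


def drop_fragments(hits, min_fraction=0.80):
    """Keep hits spanning at least `min_fraction` of the core gene length.

    Each hit is a dict with the matched length ("length") and the full
    length of the core gene it matched ("gene_length").
    """
    return [h for h in hits if h["length"] >= min_fraction * h["gene_length"]]
```

The resulting list could be passed to `subprocess.run` on a system where BLAST+ is installed, with the PID threshold taken from the species table.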
For each allele, the SHA-1 checksum and variants relative to the representative sequence are determined. Variants are segregated into substitutions, insertions and deletions, with adjacent variants of the same type merged into a single mutation.
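The allele description can be sketched as follows, assuming the allele has already been aligned to the representative sequence as a pair of equal-length gapped strings. The checksum is taken over the upper-cased allele sequence, which is an assumption about what exactly is hashed, and the variant tuple layout and position conventions are likewise illustrative.

```python
import hashlib


def allele_checksum(sequence):
    """SHA-1 hex digest of the (upper-cased) allele sequence."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def call_variants(ref_aln, allele_aln):
    """List variants as (type, ref_position, ref_bases, allele_bases).

    Positions are 1-based on the ungapped representative sequence. For an
    insertion, the position is the last reference base before the inserted
    bases. Adjacent variants of the same type are merged into one mutation.
    """
    variants = []
    ref_pos = 0       # reference bases consumed so far
    prev_kind = None  # variant type of the previous alignment column
    for r, a in zip(ref_aln, allele_aln):
        if r != "-":
            ref_pos += 1
        if r == a:
            kind = None
        elif r == "-":
            kind = "insertion"
        elif a == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        if kind is not None:
            r_base, a_base = r.replace("-", ""), a.replace("-", "")
            if kind == prev_kind:
                k, pos, ref_bases, alt_bases = variants[-1]
                variants[-1] = (k, pos, ref_bases + r_base, alt_bases + a_base)
            else:
                variants.append((kind, ref_pos, r_base, a_base))
        prev_kind = kind
    return variants
```

For the hypothetical pair `ATGC-ATTGA` (representative) and `ATCCGA--GA` (allele), this yields one substitution at position 3, one insertion after position 4, and a single merged two-base deletion starting at position 6.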
| Species Name | Reference Genomes | Core Size | Core Nt Size | PID Threshold |
|---|---|---|---|---|
| Klebsiella pneumoniae | 13 | 1972 | 2172367 | 90.0 |
| Klebsiella quasipneumoniae | 56 | 1810 | 1974880 | |
| Klebsiella variicola | 41 | 1783 | 1942295 | |
| Staphylococcus aureus | 20 | 1625 | 1742498 | 80.0 |
| Salmonella Typhi | 19 | 3284 | 3594178 | 90.0 |
| Neisseria gonorrhoeae | 14 | 1542 | 1470119 | 80.0 |
| Vibrio cholerae | 17 | 2736 | 3075173 | |
| Zika virus | 1 | 10 | 10215 | 80.0 |
| Streptococcus equi | 1 | 1286 | 1441721 | 80.0 |
| Renibacterium salmoninarum | 5 | 2500 | 2667272 | 80.0 |
| Candida auris | 5 | 4173 | 6249130 | 70.0 |