Core Assignment
The process aims to construct a library of non-overlapping universal expressed sequence regions. A set of reference assemblies is used to identify these elements and to define the expected mutation rate for the species, which is then used by the filter step.
Species Name | Reference Genomes | Core Size | Core Nt Size | PID Threshold |
Klebsiella pneumoniae | 13 | 1972 | 2172367 | 90.0 |
Klebsiella quasipneumoniae | 56 | 1810 | 1974880 | |
Klebsiella variicola | 41 | 1783 | 1942295 | |
Staphylococcus aureus | 20 | 1625 | 1742498 | 80.0 |
Salmonella Typhi | 19 | 3284 | 3594178 | 90.0 |
Neisseria gonorrhoeae | 14 | 1542 | 1470119 | 80.0 |
Vibrio cholerae | 17 | 2736 | 3075173 | |
Zika virus | 1 | 10 | 10215 | 80.0 |
Streptococcus equi | 1 | 1286 | 1441721 | 80.0 |
Renibacterium salmoninarum | 5 | 2500 | 2667272 | 80.0 |
Candida auris | 5 | 4173 | 6249130 | 70.0 |
For each targeted species a core was either:
- Provided by a collaborator, with BLAST matches extracted from each reference genome and aligned using MAFFT
- 1. All families with matching starts and ends (flat edges to the alignment, “perfect rectangles”) were used without alteration. This is the case for the vast majority of core families identified.
- 2. The remaining families were aligned with MAFFT and processed to produce straight edges.
  - For the majority of these families, variation in the gene boundaries was due to different potential start sites being used, while a much smaller number were due to alternative stop codons. In general, a consensus-based approach was used to determine the likely start or stop site.
  - If there was uncertainty about whether a region was found in all genes, a representative sequence was searched against the original assemblies to confirm that it was present in all of them.
- 3. A representative from each family is selected by identifying the sequence with the greatest number of nucleotides in agreement with the consensus of the alignment.
- 4. These representatives are searched against each reference genome using BLAST (60% identity threshold, E-value 1e-35, 80% gene coverage) and the location of each match is extracted.
  - 1. Families with multiple matches are removed at this stage.
  - 2. Overlapping complete matches (on either strand) are fused into a single segment that covers both genes. This is done in each reference genome, and the resulting multi-gene segments are searched against each other and the cross-hits resolved. Segments with partial matches in other genomes are also removed at this stage.
- 5. A database is constructed using the fused gene set and searched against the reference genomes again using BLAST as before. Any segments that do not meet the following criteria are excluded:
  - 1. A complete copy is found in every reference genome.
  - 2. There are no extra partial matches above the gathering thresholds.
- 6. This merged and reduced set of DNA segments defines the core sequence database used in Pathogenwatch.
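Steps 3 and 4.2 above can be illustrated with a minimal Python sketch. It is not the Pathogenwatch implementation: the function names, the majority-vote consensus, and the use of inclusive `(start, end)` coordinates with strand ignored are all assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned):
    """Majority character in each column of equal-length aligned sequences (assumed rule)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def pick_representative(aligned):
    """Return the sequence with the most non-gap positions agreeing with the consensus."""
    cons = consensus(aligned)

    def agreement(seq):
        return sum(1 for base, ref in zip(seq, cons) if base == ref and base != "-")

    return max(aligned, key=agreement)


def fuse_overlaps(matches):
    """Merge overlapping (start, end) matches into single segments.

    Coordinates are assumed inclusive; strand is ignored, so matches on
    either strand are fused if their coordinates overlap.
    """
    fused = []
    for start, end in sorted(matches):
        if fused and start <= fused[-1][1]:
            # Overlaps the previous segment: extend it to cover both genes.
            fused[-1][1] = max(fused[-1][1], end)
        else:
            fused.append([start, end])
    return [tuple(segment) for segment in fused]


family = ["ATGTA", "-TGCA", "ATGCA"]
representative = pick_representative(family)
segments = fuse_overlaps([(450, 900), (100, 500), (1200, 1500)])
```

Here the third sequence matches the consensus at every column, so it is chosen as the representative, and the first two matches are fused into one segment spanning 100 to 900.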
Any candidate core segments with potential paralogs or pseudogene copies are identified and removed. Such regions can often fail to assemble well and produce artificial variance in genome comparisons.
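As a rough illustration of this removal, the sketch below keeps a segment only when every reference genome holds exactly one match and that match is complete. The hit representation, a `(segment, genome, is_complete)` tuple that has already passed the gathering thresholds, is a hypothetical data model and not the format used by Pathogenwatch.

```python
from collections import defaultdict


def select_core(hits, genomes):
    """Return segments with exactly one complete match in every genome.

    hits: iterable of (segment, genome, is_complete) tuples (hypothetical format),
          assumed to contain only matches above the gathering thresholds.
    genomes: names of all reference genomes.
    """
    per_segment = defaultdict(lambda: defaultdict(list))
    for segment, genome, is_complete in hits:
        per_segment[segment][genome].append(is_complete)

    core = []
    for segment, by_genome in per_segment.items():
        # A missing genome gives None; extra or partial matches give a
        # list other than [True]. Either case excludes the segment.
        if all(by_genome.get(genome) == [True] for genome in genomes):
            core.append(segment)
    return sorted(core)


genomes = ["g1", "g2"]
hits = [
    ("segA", "g1", True), ("segA", "g2", True),   # single complete copy everywhere
    ("segB", "g1", True), ("segB", "g1", False),  # extra partial match in g1
    ("segB", "g2", True),
    ("segC", "g1", True),                         # absent from g2
]
core = select_core(hits, genomes)
```

In this example only `segA` survives: `segB` has a possible paralog or pseudogene copy in `g1`, and `segC` is missing from `g2`.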
- 1. The query assembly is searched against the species' BLAST database using blastn with species-specific parameters. By default these are: -evalue 1e-35 -perc_identity [gathering threshold].
- 2. Hits below 80% of the core gene length are removed as fragments.
- 3. For each allele, the SHA-1 checksum and variants relative to the representative sequence are determined. Variants are segregated into substitutions, insertions and deletions, with adjacent variants of the same type merged into a single mutation.
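The three steps above might be sketched in Python as follows. This is an illustration under stated assumptions, not the Pathogenwatch code: the function names are hypothetical, `-outfmt 6` (tabular output) is an added choice that the text does not specify, the checksum is taken over the sequence string exactly as given, and variants are assumed to arrive as per-column events `(alignment_column, kind, base)` with kind `"S"`, `"I"` or `"D"`.

```python
import hashlib


def blastn_command(query_fasta, database, gathering_threshold):
    """Build a blastn argument list with the default parameters from the text."""
    return [
        "blastn", "-query", query_fasta, "-db", database,
        "-evalue", "1e-35", "-perc_identity", str(gathering_threshold),
        "-outfmt", "6",  # assumption: tabular output, not stated in the text
    ]


def is_fragment(hit_length, core_length, min_fraction=0.8):
    """True if the hit covers less than 80% of the core gene length."""
    return hit_length < min_fraction * core_length


def allele_checksum(sequence):
    """SHA-1 hex digest of the allele sequence, hashed as given."""
    return hashlib.sha1(sequence.encode("ascii")).hexdigest()


def merge_adjacent(events):
    """Merge same-type variants in consecutive alignment columns into one mutation."""
    merged = []
    for column, kind, base in sorted(events):
        last = merged[-1] if merged else None
        if last and last["kind"] == kind and last["end"] + 1 == column:
            last["end"] = column
            last["bases"] += base
        else:
            merged.append({"kind": kind, "start": column, "end": column, "bases": base})
    return merged


events = [(10, "S", "A"), (11, "S", "C"), (12, "D", "G"), (20, "S", "T")]
mutations = merge_adjacent(events)
```

With these events, the substitutions at columns 10 and 11 become a single two-base substitution, while the deletion at column 12 and the substitution at column 20 stay separate because they differ in type or are not adjacent.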