Core Assignment

Constructing A Core Library

The process aims to construct a library of non-overlapping, universally present expressed sequence regions. A set of reference assemblies is used to identify these elements and to define the expected mutation rate for the species, which is used by the filter step.

| Species Name               | Reference Genomes | Core Size | Core Nt Size | PID Threshold |
|----------------------------|-------------------|-----------|--------------|---------------|
| Klebsiella pneumoniae      | 13                | 1972      | 2172367      | 90.0          |
| Klebsiella quasipneumoniae | 56                | 1810      | 1974880      |               |
| Klebsiella variicola       | 41                | 1783      | 1942295      |               |
| Staphylococcus aureus      | 20                | 1625      | 1742498      | 80.0          |
| Salmonella Typhi           | 19                | 3284      | 3594178      | 90.0          |
| Neisseria gonorrhoeae      | 14                | 1542      | 1470119      | 80.0          |
| Vibrio cholerae            | 17                | 2736      | 3075173      |               |
| Zika virus                 | 1                 | 10        | 10215        | 80.0          |
| Streptococcus equi         | 1                 | 1286      | 1441721      | 80.0          |
| Renibacterium salmoninarum | 5                 | 2500      | 2667272      | 80.0          |
| Candida auris              | 5                 | 4173      | 6249130      | 70.0          |

Generating An Initial Start Set

For each targeted species a core was either:

  • Provided by a collaborator, in which case BLAST matches were extracted from each reference genome and aligned using MAFFT; or

  • Built in-house using ROARY, with the alignments used directly.

Processing the Alignments

  1. All families with matching starts and ends (flat edges to the alignment, “perfect rectangles”) were used without alteration. This is the case for the vast majority of core families identified.

  2. The remaining families were aligned with MAFFT and processed to produce straight edges.

    • For the majority of these families, variation in the gene boundaries was due to different potential start sites being used, while a much smaller number were due to alternative stop codons. In general, a consensus-based approach was used to determine the likely start or stop site.

    • If there was uncertainty that a region was found in all genes, a representative sequence was searched against the original assemblies to confirm that it was present in all of them.

  3. A representative from each family is selected by identifying the sequence with the greatest number of nucleotides in agreement with the consensus of the alignment.

  4. These are searched against each reference genome using BLAST (60% identity threshold, E-value 1e-35, 80% gene coverage) and the location of each match extracted.

    1. Families with multiple matches are removed at this stage.

    2. Overlapping complete matches (on either strand) are fused into a single segment that covers both genes. This is done in each reference genome, and the resulting multi-gene segments are then searched against each genome and the cross-hits resolved. Segments with partial matches in other genomes are also removed at this stage.

  5. A database is constructed from the fused gene set and searched against the reference genomes again, using BLAST as before. Any segments that do not meet the following criteria are excluded:

    1. A complete copy is found in every reference genome

    2. There are no extra partial matches above the gathering thresholds.

  6. This merged and reduced set of DNA segments defines the core sequence database used in Pathogenwatch.
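The representative selection in step 3 can be illustrated with a minimal sketch. It assumes each family is available as a mapping from a sequence name to its gapped alignment row, builds a simple column-wise majority consensus, and counts non-gap positions that agree with it. The function names, the majority-vote consensus and the handling of gaps and ties are assumptions made for illustration, not the Pathogenwatch implementation.

```python
from collections import Counter


def consensus(rows):
    """Column-wise majority character of equal-length alignment rows.

    Ties are broken by whichever character appears first in the column
    (an arbitrary choice made for this sketch).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def pick_representative(family):
    """Return the name of the row agreeing with the consensus at the most
    nucleotide (non-gap) positions. `family` maps name -> aligned sequence."""
    cons = consensus(list(family.values()))

    def agreement(seq):
        # Count only real nucleotides; a gap matching a gap is not counted.
        return sum(1 for a, c in zip(seq, cons) if a == c and a != "-")

    return max(family, key=lambda name: agreement(family[name]))


# Hypothetical three-member family, already aligned.
family = {
    "refA": "ATGACC-TAA",
    "refB": "ATGACCGTAA",
    "refC": "ATGTCCGTAA",
}
representative = pick_representative(family)
```

Here `refB` is selected because it matches the consensus at every column, while the other two rows each differ at one position.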

Any candidate core segments with potential paralogs or pseudogene copies are identified and removed. Such regions can often fail to assemble well and produce artificial variance in genome comparisons.
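As a rough sketch of this filtering, the snippet below keeps only families that have exactly one match in every reference genome, with that match being complete. It assumes the hits have already passed the identity and E-value thresholds, and it treats "complete" as covering at least 80% of the representative, borrowing the gene coverage figure quoted above. The `Hit` record and the function name are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    family: str      # core family (or fused segment) identifier
    genome: str      # reference genome the match was found in
    coverage: float  # fraction of the representative covered by the match


def single_copy_families(hits, genomes, min_coverage=0.8):
    """Return families with exactly one complete match in every genome.

    Any additional hit, complete or partial, excludes the family, which
    removes candidates with potential paralogs or pseudogene copies.
    """
    by_pair = defaultdict(list)
    for hit in hits:
        by_pair[(hit.family, hit.genome)].append(hit)

    kept = set()
    for family in {hit.family for hit in hits}:
        matches_per_genome = [by_pair.get((family, g), []) for g in genomes]
        if all(len(m) == 1 and m[0].coverage >= min_coverage
               for m in matches_per_genome):
            kept.add(family)
    return kept


hits = [
    Hit("famA", "g1", 1.0), Hit("famA", "g2", 0.95),   # one full copy each
    Hit("famB", "g1", 1.0), Hit("famB", "g2", 1.0),
    Hit("famB", "g2", 0.4),                            # extra partial copy
    Hit("famC", "g1", 1.0),                            # missing from g2
]
core = single_copy_families(hits, ["g1", "g2"])
```

In this toy input only `famA` survives: `famB` has an extra partial match in `g2`, and `famC` is absent from `g2`.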

Profiling A New Assembly

  1. The query assembly is searched against the species' BLAST database using blastn with species-specific parameters. By default these are -evalue 1e-35 -perc_identity [gathering threshold].

  2. Hits below 80% of the core gene length are removed as fragments.

  3. For each allele, the SHA-1 checksum and variants relative to the representative sequence are determined. Variants are segregated into substitutions, insertions and deletions, with adjacent variants of the same type merged into a single mutation.
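Steps 2 and 3 can be sketched as follows. The example assumes each hit is available as a pair of gapped, equal-length strings (representative and query). The function names, the variant record layout, the upper-casing before hashing and the treatment of a hit at exactly 80% as retained are all assumptions for illustration.

```python
import hashlib


def is_fragment(hit_length, core_length, min_fraction=0.8):
    """True if the hit covers less than min_fraction of the core gene."""
    return hit_length / core_length < min_fraction


def allele_checksum(sequence):
    """SHA-1 hex digest of the allele sequence (upper-cased first)."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def call_variants(ref_aln, query_aln):
    """List variants between two gapped rows, merging adjacent variants
    of the same type into a single mutation.

    `pos` is the 1-based reference position of the first affected base;
    for insertions it is the reference base preceding the insertion.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")
    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    for col, (r, q) in enumerate(zip(ref_aln, query_aln)):
        if r != "-":
            ref_pos += 1
        if r == q:
            continue
        if r == "-":
            kind = "insertion"
        elif q == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        ref_base = "" if r == "-" else r
        alt_base = "" if q == "-" else q
        last = variants[-1] if variants else None
        if last and last["type"] == kind and last["end_col"] == col - 1:
            # Extend the previous variant instead of starting a new one.
            last["ref"] += ref_base
            last["alt"] += alt_base
            last["end_col"] = col
        else:
            variants.append({"type": kind, "pos": ref_pos, "ref": ref_base,
                             "alt": alt_base, "end_col": col})
    for variant in variants:
        del variant["end_col"]  # internal bookkeeping only
    return variants


ref_row = "ATGCC--TTAGA"
query_row = "ATGTTGGTTA-A"
variants = call_variants(ref_row, query_row)
checksum = allele_checksum(query_row.replace("-", ""))
```

For this toy alignment, the two adjacent substitutions are reported as one substitution (`CC` to `TT` at position 4), the two inserted bases as one insertion (`GG` after position 5), and the single missing base as a deletion (`G` at position 9).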
