Speciator

Species assignment tool in Pathogenwatch

About

Speciator is an in-house tool for assigning a species to an assembled genome. It combines the approach developed by Anthony Underwood (bactinspector) for searching the NCBI RefSeq database using mash with the curated library developed by Kat Holt et al. for Kleborate and Bacsort. Speciator accurately assigns a species to the majority of genome assemblies in just a few seconds.

Library Structure

Curated Library

A manually constructed set of reference assemblies that provides a very accurate assignment of species. The library is based on the Kleborate library with some in-house modifications; for a description, please see the source notes here. It currently gives the best coverage of Klebsiella and other Enterobacteriaceae species, as well as SARS-CoV-2. Signatures: 2,557. Species: 302.

Genus Finder Library

A library of references, derived from the NCBI RefSeq genome database, used for identifying the genus of an uploaded genome. For each species in a genus, one reference is randomly selected and added to the library. Signatures: 35,654.
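The random selection described above can be illustrated with a minimal Python sketch. It is not the actual build script: the function name, the (accession, genus, species) record layout and the optional seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def pick_genus_finder_references(assemblies, seed=None):
    """Pick one reference assembly at random for each (genus, species) pair.

    `assemblies` is an iterable of (accession, genus, species) tuples.
    This record layout is hypothetical, not the real RefSeq metadata format.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for accession, genus, species in assemblies:
        by_species[(genus, species)].append(accession)
    # Sort keys and candidates so that a fixed seed gives a reproducible pick.
    return {
        key: rng.choice(sorted(accessions))
        for key, accessions in sorted(by_species.items())
    }
```

Passing a fixed seed makes the selection repeatable between builds; whether the real library build does this is not stated here.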

Virus/Fungus/Genus-specific Libraries

A set of libraries representing viruses, fungi and each bacterial genus, constructed from the reference assemblies available in RefSeq (March 2020). Signatures: 196,277. Genera: 2,842 + Virus/Fungus. Species: 39,268. NB: the species count includes a significant number of singleton species that are likely to be merged into another species on review.

"No Genus" Library

A number of RefSeq assemblies have not yet been assigned to a species within a known genus. This can be for a variety of reasons, including newly identified species that have yet to be classified. They can also include assemblies that in fact belong to a known species, where this has not yet been recognised in the database. These assemblies are collected into a single library, built in the same fashion as a genus library. Signatures: 2,845.

Method

Mash is used for all searches between query assemblies and reference libraries, with a k-mer size (-k) of 21 and a sketch size (-s) of 1000.
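As a rough illustration of these parameters, the Python sketch below builds the corresponding mash command lines and parses the tab-separated output of mash dist (reference, query, distance, p-value, shared hashes). How Speciator actually invokes mash is not described here, so the helper names and file paths are hypothetical; only the -k, -s and -d values come from the text.

```python
KMER_SIZE = 21      # -k, as stated in the text
SKETCH_SIZE = 1000  # -s, as stated in the text


def mash_sketch_command(fasta_path, output_prefix):
    """Build the argument list for sketching an assembly (paths are hypothetical)."""
    return ["mash", "sketch", "-k", str(KMER_SIZE), "-s", str(SKETCH_SIZE),
            "-o", output_prefix, fasta_path]


def mash_dist_command(library_sketch, query_sketch, max_distance):
    """Build the argument list for a search that reports hits within max_distance."""
    return ["mash", "dist", "-d", str(max_distance), library_sketch, query_sketch]


def parse_mash_dist(output):
    """Parse mash dist output into (reference, distance) pairs, nearest first."""
    hits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        reference, _query, distance, _p_value, _shared = line.split("\t")
        hits.append((reference, float(distance)))
    return sorted(hits, key=lambda hit: hit[1])
```

The command lists could be handed to subprocess.run; they are returned rather than executed so that the example does not require mash to be installed.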

  1. The query genome is searched against the curated library with a distance threshold (-d) of 0.04 (Kleborate default) and the nearest match used to assign the species.

  2. If no match is found, the assembly is assigned to a kingdom or bacterial genus by searching the Genus Finder library with a distance threshold (-d) of 0.15 and the top 20 matches used to identify the genus.

  3. The selected genus (-d 0.05) or kingdom (-d 0.075) library is searched and the top 20 matches used to identify the species.

  4. If no genus is identified in step 2 or no species is identified in step 3 then a final search is carried out against the No Genus library with a distance threshold (-d) of 0.05 and the top 20 matches used to identify the species.

  5. If no species is assigned in the previous steps, then the assembly is considered "unclassified".

This assignment is then passed on to downstream tools that require a species identifier, such as MLST, or which are species-specific, such as Genotyphi.
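The five steps above can be sketched in Python as a cascade of searches. This is an illustrative outline rather than Speciator's implementation: the search callable, the library names, the kingdom labels and in particular the rule for turning the top 20 matches into a single label (a simple majority vote here) are assumptions. Only the distance thresholds and the order of the steps come from the text.

```python
from collections import Counter

CURATED_D = 0.04       # step 1
GENUS_FINDER_D = 0.15  # step 2
GENUS_D = 0.05         # step 3, bacterial genus library
KINGDOM_D = 0.075      # step 3, kingdom (virus/fungus) library
NO_GENUS_D = 0.05      # step 4
TOP_N = 20
KINGDOMS = {"Viruses", "Fungi"}  # hypothetical labels


def most_common_label(hits, top_n=TOP_N):
    """Return the most frequent label among the nearest top_n hits.

    Majority voting is an assumed rule; ties go to the nearer hit.
    """
    nearest = sorted(hits, key=lambda hit: hit[1])[:top_n]
    if not nearest:
        return None
    return Counter(label for label, _ in nearest).most_common(1)[0][0]


def assign_species(query, search):
    """Run the cascade; search(library, query, max_distance) returns
    a list of (label, distance) hits within max_distance."""
    # Step 1: curated library, nearest match wins.
    hits = search("curated", query, CURATED_D)
    if hits:
        return min(hits, key=lambda hit: hit[1])[0]
    # Step 2: find a kingdom or bacterial genus.
    group = most_common_label(search("genus_finder", query, GENUS_FINDER_D))
    # Step 3: search the library for that kingdom or genus.
    if group is not None:
        threshold = KINGDOM_D if group in KINGDOMS else GENUS_D
        species = most_common_label(search(group, query, threshold))
        if species is not None:
            return species
    # Step 4: fall back to the No Genus library.
    species = most_common_label(search("no_genus", query, NO_GENUS_D))
    # Step 5: nothing matched.
    return species if species is not None else "unclassified"
```

Passing the search function in as an argument keeps the cascade logic separate from how mash is run, which also makes it straightforward to exercise with stub data.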

Validation

Pathogenwatch Public Collections

All public collections in Pathogenwatch contain only verified members of the specified species, including more than 10,000 assemblies in Staphylococcus aureus, Salmonella Typhi and Neisseria gonorrhoeae. Furthermore, we have tested against examples from a range of other species including Candida auris, Zika virus, Renibacterium salmoninarum, and several thousand SARS-CoV-2 versus non-CoV-2 SARS assemblies. Speciator identifies the correct species for these assemblies with 100% accuracy, without any specific interventions in the software to achieve this.

SPARK Klebsiella/Raoultella Collection

We were kindly given the opportunity to test our species assignments against the curated manual assignments of the SPARK collection of diverse Klebsiella. Speciator gets 100% of assignments correct.

EuSCAPE non-Kpn assemblies

Speciator now gets 100% of the non-Kpn part of the EuSCAPE Klebsiella survey correct as well. The previous version was able to capture the K. pneumoniae part of the collection but was nearer 50% correct on the others.

Caveats

  • Not all species are well defined, and references in RefSeq can be incorrectly classified or have an out-of-date classification.

  • Contaminated samples (assemblies containing more than one species) will get a single species assignment. This tool is not suitable for metagenomics in any way.

  • There are many species that we haven't been able to test in depth and there's no gold standard data set covering the more unusual species to any depth.

  • The discrimination of Escherichia coli and Shigella species has not been tested and should not be relied on currently.
