cgMLST schemes are built around a community-agreed set of gene loci present in all strains of the species. A database of validated allele sequences is maintained for each locus, and a code is assigned to each allele. An "ST" code is then generated from each unique combination of alleles. The schemes supported by Pathogenwatch are provided by PubMLST, the Pasteur Institute, Enterobase, and the cgMLST.org Nomenclature Server, while an in-house search tool is used to assign cgMLST profiles rapidly and accurately.
The assembly is first searched for exact matches to known alleles. A representative set of alleles for each locus is then searched for using BLAST. The results of these searches are combined and filtered based on the similarity and length of each match. Novel alleles are hashed using the SHA-1 algorithm, and the resulting hash is used as their unique identifier. Profiles are assigned based on the combination of alleles detected. Novel profiles are likewise given a unique identifier using the SHA-1 hash algorithm.
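The identifier step can be illustrated with a minimal Python sketch. It is not the Pathogenwatch implementation: the function names `allele_id` and `profile_id`, the lowercase normalisation of sequences, and the underscore-joined profile key are all assumptions made for illustration. Only the use of SHA-1 for novel alleles and novel profiles comes from the description above.

```python
import hashlib


def allele_id(sequence, known_alleles):
    """Return the scheme code for a known allele, or a SHA-1 digest for a novel one."""
    # Assumption: sequences are normalised to lowercase before lookup and hashing.
    seq = sequence.strip().lower()
    if seq in known_alleles:
        return str(known_alleles[seq])
    # Novel allele: the hex digest of the sequence serves as its identifier.
    return hashlib.sha1(seq.encode("ascii")).hexdigest()


def profile_id(allele_ids, known_profiles):
    """Return the known ST code for a profile, or a SHA-1 digest for a novel one."""
    # Assumption: a profile is keyed by its allele identifiers, in a fixed
    # locus order, joined with underscores.
    key = "_".join(allele_ids)
    if key in known_profiles:
        return str(known_profiles[key])
    return hashlib.sha1(key.encode("ascii")).hexdigest()


# Hypothetical two-locus scheme: one known allele per locus, one known profile.
locus_a = {"atgc": 7}
locus_b = {"ggcc": 3}
profiles = {"7_3": 12}

ids = [allele_id("ATGC", locus_a), allele_id("GGCA", locus_b)]
print(ids[0])                     # code of the known allele at the first locus
print(profile_id(ids, profiles))  # 40-character hash, since the profile is novel
```

Because the identifier is derived from the sequence itself, the same novel allele receives the same identifier in every assembly without any central registration step.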
The cgMLST results are not displayed directly, but can be downloaded from both the collection and genome selection download menus (for more details see "Downloads"). The results also serve as the basis for the cgMLST clustering method, which quickly finds closely related assemblies.