Klebsiella LIN Codes

The Pathogenwatch LIN code tool infers Klebsiella lineage codes based on references from the Pasteur/PubMLST resource.

About

Introduction

LIN (Life Identification Number) codes are a system developed by Vinatzer et al (2020) for creating stable nomenclatures for taxonomic trees. The system has been applied to the Klebsiella pneumoniae (Kpn) complex through clustering of core genome MLST (cgMLST) profiles, providing a mechanism for identifying and referencing Kpn complex lineages (Hennart et al, 2022). The resulting reference database of cgMLST profiles and associated LIN codes is maintained and provided via the BIGSdb resource at the Pasteur Institute. These profiles are imported into Pathogenwatch and compared against query genome cgMLST profiles. Depending on the level of similarity between the query and the closest reference, a partial or complete LIN code is inferred and the "Clonal group" and "Sublineage" assignments are highlighted.

For specific information about the Klebsiella LIN code scheme, please visit the Pasteur Institute's reference documentation: https://bigsdb.pasteur.fr/klebsiella/cgmlst-lincodes/

Lineage annotations

LIN code

The LIN code is a hierarchical code consisting of 10 levels, with each level indicating a division of the previous level. So the topmost level is the most general and approximates to the species level within the Klebsiella pneumoniae complex. The bottom level indicates genomes that have identical core genomes, providing a fine-grained view of the population.

LIN codes inferred by Pathogenwatch will always be incomplete for novel core genome profiles. Such profiles are assigned a placeholder cgST consisting of a unique combination of letters and numbers. In the genome report, the first four characters are shown prefixed with an asterisk to indicate a novel cgST. Novel cgSTs will be replaced with standard numeric codes when a representative profile is made available.

To infer the LIN code for a novel cgST, the similarity metric is used to identify the maximum level of similarity according to the thresholds shown in the table below. The code is then inferred up to this level, while subsequent levels are marked with "*".
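The masking step can be illustrated with a minimal Python sketch. It assumes LIN codes are written as ten underscore-separated integers; the function name `mask_lin_code` and the example reference code are hypothetical and are not taken from the Pathogenwatch implementation.

```python
def mask_lin_code(reference_lin: str, inferred_levels: int,
                  total_levels: int = 10, sep: str = "_") -> str:
    """Keep the first `inferred_levels` levels and mark the rest with "*"."""
    parts = reference_lin.split(sep)
    if len(parts) != total_levels:
        raise ValueError(f"expected {total_levels} levels, got {len(parts)}")
    if not 0 <= inferred_levels <= total_levels:
        raise ValueError("inferred_levels out of range")
    kept = parts[:inferred_levels]
    masked = ["*"] * (total_levels - inferred_levels)
    return sep.join(kept + masked)


# Hypothetical LIN code of the closest reference, for illustration only.
reference = "0_0_105_6_0_0_1_2_0_3"
print(mask_lin_code(reference, 9))  # 0_0_105_6_0_0_1_2_0_*
print(mask_lin_code(reference, 4))  # 0_0_105_6_*_*_*_*_*_*
```

Passing 0 as the number of inferred levels yields a code made up entirely of asterisks.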

Sublineage and clonal group

The sublineage and clonal group assignments are based on the third and fourth levels of the LIN code respectively, and they represent the deepest branches in the Kpn complex lineages. The sublineage and clonal group names are based on the dominant MLST within that group and provide a stable naming scheme that is more coherent than MLST by itself.
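Because each level subdivides the previous one, genomes in the same sublineage share the first three levels of the LIN code, and genomes in the same clonal group share the first four. The sketch below extracts those prefixes from a (possibly partial) code. It assumes underscore-separated levels and "*" for levels that were not inferred; `lineage_prefixes` is a hypothetical helper, and how the prefixes are mapped to sublineage and clonal group names is not shown here.

```python
from typing import Optional, Tuple


def lineage_prefixes(lin_code: str, sep: str = "_") -> Tuple[Optional[str], Optional[str]]:
    """Return the (sublineage, clonal group) prefixes, or None where not inferred."""
    parts = lin_code.split(sep)

    def prefix(n_levels: int) -> Optional[str]:
        head = parts[:n_levels]
        # A prefix is only usable if every one of its levels was inferred.
        if len(head) < n_levels or "*" in head:
            return None
        return sep.join(head)

    return prefix(3), prefix(4)


# Hypothetical partial LIN codes, for illustration only.
print(lineage_prefixes("0_0_105_6_0_0_1_2_0_*"))  # ('0_0_105', '0_0_105_6')
print(lineage_prefixes("0_0_105_*_*_*_*_*_*_*"))  # ('0_0_105', None)
```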

Method

Software components:

  1. A library of cgMLST profiles with linked cgST codes, LIN codes, clonal group codes, and sublineage codes.

  2. A tool that takes a cgMLST profile as input and finds the nearest neighbour(s) according to the similarity calculation described below. It then infers a LIN code, as well as the sublineage and clonal group, according to the similarity of the best match.

Similarity calculation

For each pair of profiles, the allele codes are compared locus by locus, and the similarity between the two profiles is calculated as a percentage using the following formula:

Similarity (%) = 100 x ( Number of identical loci ) / ( Total loci [629] - Missing loci )

Note: There are 629 loci in the Kpn cgMLST scheme. If 30 or more loci are missing between the two profiles then the score is invalid and not reported.
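The calculation can be sketched in Python as follows. This is an illustration rather than the Pathogenwatch code: profiles are assumed to be dictionaries mapping locus name to allele code, with `None` for a missing locus, and a locus is assumed to count as missing if either profile lacks an allele for it. The names `similarity`, `TOTAL_LOCI` and `MAX_MISSING`, and the locus names, are hypothetical.

```python
from typing import Mapping, Optional, Sequence

TOTAL_LOCI = 629   # loci in the Kpn cgMLST scheme
MAX_MISSING = 30   # a score with this many (or more) missing loci is invalid

Profile = Mapping[str, Optional[str]]


def similarity(query: Profile, reference: Profile,
               loci: Sequence[str]) -> Optional[float]:
    """Percentage of identical loci among loci present in both profiles."""
    identical = 0
    missing = 0
    for locus in loci:
        q, r = query.get(locus), reference.get(locus)
        if q is None or r is None:
            missing += 1      # assumption: missing in either profile counts
        elif q == r:
            identical += 1
    if missing >= MAX_MISSING:
        return None           # score is invalid and not reported
    return 100.0 * identical / (len(loci) - missing)


loci = [f"locus_{i}" for i in range(TOTAL_LOCI)]    # hypothetical locus names
reference = {locus: "1" for locus in loci}
query = dict(reference, locus_0="2", locus_1=None)  # one mismatch, one missing
print(similarity(query, reference, loci))           # 100 * 627 / 628
```

Excluding missing loci from the denominator means an incomplete profile is not penalised for loci that could not be called, up to the 30-locus limit.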

It is possible for more than one reference profile to be equidistant to the query profile. In this case all the nearest cgST codes are reported. This has no impact on the inferred LIN code.
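Tie handling can be sketched as below, assuming the similarity to every reference has already been computed and invalid scores are stored as `None`. The function name `nearest_references` and the cgST identifiers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def nearest_references(
    scores: Dict[str, Optional[float]]
) -> Tuple[Optional[float], List[str]]:
    """Return the best similarity and every reference cgST that achieves it."""
    valid = {cgst: s for cgst, s in scores.items() if s is not None}
    if not valid:
        return None, []
    best = max(valid.values())
    # Report all equidistant references, not just the first one found.
    tied = sorted(cgst for cgst, s in valid.items() if s == best)
    return best, tied


# Hypothetical scores for three reference cgSTs.
scores = {"101": 99.5, "202": 99.5, "303": 97.0, "404": None}
print(nearest_references(scores))  # (99.5, ['101', '202'])
```

Exact float comparison is adequate here only because tied scores are produced by the same arithmetic on the same counts; comparing the underlying locus counts would be a more robust alternative.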

Inferring the lineage codes

The similarity calculated above is used to assign the query to a similarity level (bin) from 1 to 10, using the thresholds in the table below. The query profile's LIN code is inferred up to one level below the assigned bin. For example, an assignment to bin 10 (1 allele different) results in a 9-part LIN code with a single asterisk at the end.

Level              Similarity threshold (%)   Identity*
1                  <3.0207                    <19
2                  <6.9952                    <44
3 (Sublineage)     <69.7933                   <439
4 (Clonal group)   <93.1638                   <586
5                  <98.4102                   <619
6                  <98.8871                   <622
7                  <99.3641                   <625
8                  <99.6820                   <627
9                  <99.8410                   <628
10                 <100                       <629

Notes: * Identity is the equivalent number of identical loci between the two profiles if both have an allele assigned to every locus. There are 629 loci in total in the Klebsiella pneumoniae complex cgMLST scheme.

If the similarity score falls in bin 1, no part of the LIN code is inferred. This is likely to indicate an issue with the assembly or, possibly, with the species assignment.
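The mapping from a similarity score to the number of LIN code levels that can be inferred can be sketched with a sorted threshold list and a binary search. The thresholds are the rounded percentages from the table, so a score lying exactly on a boundary may fall on either side compared with the real tool. Treating a similarity of 100% as allowing all ten levels is an assumption, and `inferred_levels` is a hypothetical name.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of bins 1 to 10, as percentages, from the table.
THRESHOLDS = [3.0207, 6.9952, 69.7933, 93.1638, 98.4102,
              98.8871, 99.3641, 99.6820, 99.8410, 100.0]


def inferred_levels(similarity: float) -> int:
    """Number of leading LIN code levels that can be inferred (0 to 10).

    A score in bin b gives b - 1 levels. That equals the number of
    thresholds less than or equal to the score, which bisect_right returns.
    """
    if not 0.0 <= similarity <= 100.0:
        raise ValueError("similarity must be a percentage between 0 and 100")
    return bisect_right(THRESHOLDS, similarity)


print(inferred_levels(100.0 * 628 / 629))  # 9  (bin 10, one allele different)
print(inferred_levels(95.0))               # 4  (bin 5, clonal group resolved)
print(inferred_levels(2.0))                # 0  (bin 1, nothing inferred)
```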

Citation

If you make use of the lineage assignments from this tool, please cite the foundational work: Hennart et al (2022). A Dual Barcoding Approach to Bacterial Strain Nomenclature: Genomic Taxonomy of Klebsiella pneumoniae Strains. Molecular Biology and Evolution 39(7).
