# cgMLST Clustering & Context Searching

## About

cgMLST clustering helps to identify similar sequences which could be indicative of a transmission event or outbreak.

The sequence typing results from the [cgMLST tool](/pathogenwatch/technical-descriptions-of-analysis-tools/lineage-and-genotyping/cgmlst.md) are the input to [context searches](/pathogenwatch/how-to-use-pathogenwatch/collections/using-the-icv/context-search.md), which look for genomes closely related to a selected genome.

Pathogenwatch provides a tool for calculating distances between cgMLST profiles and clustering them using single-linkage clustering. The way the cgMLST profiles are constructed has the biggest impact on the structure of the clusters. The Pathogenwatch profiles have been shown to be functionally similar to those from EnteroBase HierCC, with just a small percentage of profiles showing significant divergence.

## Context Search Methods <a href="#methods" id="methods"></a>

In a collection Context Search, cgMLST profiles are calculated for all genomes in the collection.

Pairwise distances are calculated for all assembled genomes sharing a given cgMLST scheme. The distance between two genomes is the number of loci in the scheme at which their alleles differ, ignoring any loci that are missing (possibly due to sequencing or assembly errors). These pairwise distances are then used in single-linkage clustering to determine how closely genomes are related.
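The distance and clustering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the Pathogenwatch implementation: the profile layout (a dictionary from locus name to allele identifier, with `None` for a missing locus), the function names `allele_distance` and `single_linkage_clusters`, and the example genomes are all assumptions made for this sketch. It also assumes a locus is ignored when it is missing in either of the two profiles.

```python
from itertools import combinations


def allele_distance(profile_a, profile_b):
    """Count loci where both profiles have an allele call and the calls differ."""
    distance = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue  # locus missing in either profile: ignored (assumption)
        if allele_a != allele_b:
            distance += 1
    return distance


def single_linkage_clusters(profiles, threshold):
    """Merge genomes into one cluster whenever any pair is within `threshold`."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allele_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-locus profiles, for illustration only.
profiles = {
    "g1": {"l1": 1, "l2": 1, "l3": 1, "l4": 1},
    "g2": {"l1": 1, "l2": 2, "l3": None, "l4": 1},
    "g3": {"l1": 1, "l2": 2, "l3": 5, "l4": 3},
    "g4": {"l1": 9, "l2": 8, "l3": 7, "l4": 6},
}

print(allele_distance(profiles["g1"], profiles["g2"]))  # 1 (l3 is ignored)
print(single_linkage_clusters(profiles, threshold=1))
# [['g1', 'g2', 'g3'], ['g4']]
```

In this example `g1` and `g3` differ at three loci, yet they share a cluster at a threshold of 1 because each is within one difference of `g2`. This chaining is characteristic of single-linkage clustering.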

The threshold defined in the [Context Search panel](/pathogenwatch/how-to-use-pathogenwatch/collections/using-the-icv/context-search.md) is the number of allele differences allowed between the selected genome and other genomes. The context search feature will return genomes that are within the threshold distance and also meet the filtering criteria set in the Folders / Location / Time settings.
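As a rough sketch of how the threshold and the filters combine, the following Python reads the threshold literally as a cutoff on the direct distance between the selected genome and each other genome. Whether Pathogenwatch also extends hits through single-linkage chaining is not stated here, so treat that as an open assumption. The function name `context_search`, the `distances` and `metadata` layouts, and the `country` and `year` fields are hypothetical stand-ins for the Folders / Location / Time settings.

```python
def context_search(query, distances, threshold, keep=lambda name: True):
    """Return (genome, distance) pairs within `threshold` of `query` that pass `keep`."""
    hits = []
    for name, distance in distances[query].items():
        if name != query and distance <= threshold and keep(name):
            hits.append((name, distance))
    # Closest genomes first; ties broken by name for a stable order.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical allele distances from the selected genome to other genomes.
distances = {"query": {"a": 2, "b": 7, "c": 5, "d": 12}}

# Hypothetical metadata standing in for the location and time filters.
metadata = {
    "a": {"country": "UK", "year": 2019},
    "b": {"country": "UK", "year": 2015},
    "c": {"country": "FR", "year": 2020},
    "d": {"country": "UK", "year": 2020},
}


def uk_since_2018(name):
    record = metadata[name]
    return record["country"] == "UK" and record["year"] >= 2018


print(context_search("query", distances, threshold=10))
# [('a', 2), ('c', 5), ('b', 7)]
print(context_search("query", distances, threshold=10, keep=uk_since_2018))
# [('a', 2)]
```

Here `d` is excluded by the threshold alone, while `b` and `c` fall within the threshold but are removed by the time and location filters respectively.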

## Validation of cgMLST Single-Linkage Clustering

The full validation report can be found [here](https://docs.google.com/document/d/1zsGsuAuYUCD2Y-bdEeH83judI3NkAmgymzR1TAlOTsA/edit#heading=h.hauym2ifhu4c).

## How to cite

The cgMLST clustering tool was first described in:

Sánchez-Busó L, Yeats CA, Taylor B, et al. A community-driven resource for genomic epidemiology and antimicrobial resistance prediction of Neisseria gonorrhoeae at Pathogenwatch. *Genome Med*. 2021;13(1):61. Published 2021 Apr 19. doi:10.1186/s13073-021-00858-2

The software is available under an open-source licence from <https://github.com/pathogenwatch-oss/cgmlst-clustering>.

