# Tree Construction

## About

To generate a score suitable for clustering related assemblies, Pathogenwatch compares variant positions across all pairs of loci found in the two assemblies, except those excluded by the previously calculated variation filter (see [Core Filter](https://cgps.gitbook.io/pathogenwatch/technical-descriptions/core-genome-tree/core-filter)). So that incomplete assemblies can also be analysed, the score over the compared region is scaled to an expected core size, which approximates what the score would be if the assembly were complete.

## Scoring Assembly Pairs

1. Substitutions are extracted for each locus. Indels are excluded because they are often the result of assembly or sequencing error, and our testing found that the noise from these events could overwhelm the true distances between closely related assemblies.
2. In the vast majority of cases each assembly has only a single locus for a family, so these are trivially paired up. If there is more than one locus, the most similar loci are paired together. Unmatched loci are ignored.
3. During the comparison, the number of compared nucleotides is tracked. At the end, this is used to scale the score to the "expected number of nucleotides", which is calculated as the sum of the lengths of the reference sequences used to identify the core.
4. The total number of variant sites between the pair of assemblies is calculated and then scaled using the expected number of nucleotides, as described above.
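
The steps above can be illustrated with a minimal Python sketch. It is not the Pathogenwatch implementation: the `Locus` structure, the function names, and the `excluded_families` parameter are hypothetical, and the scaling is assumed to be a simple linear ratio of expected to compared nucleotides. For brevity it also assumes a single locus per family in each assembly, so the "most similar loci" pairing from step 2 is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Locus:
    """Hypothetical record: one core locus aligned to its family reference."""
    start: int  # first reference position covered (inclusive)
    end: int    # last reference position covered (inclusive)
    # Reference position -> substituted base (indels are not recorded).
    substitutions: Dict[int, str] = field(default_factory=dict)


def compare_loci(a: Locus, b: Locus) -> Tuple[int, int]:
    """Return (variant_sites, compared_nucleotides) over the shared region."""
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    if lo > hi:
        return 0, 0  # the two loci do not overlap on the reference
    positions = set(a.substitutions) | set(b.substitutions)
    diffs = sum(
        1
        for p in positions
        if lo <= p <= hi and a.substitutions.get(p) != b.substitutions.get(p)
    )
    return diffs, hi - lo + 1


def scaled_score(
    assembly_a: Dict[str, Locus],
    assembly_b: Dict[str, Locus],
    expected_nucleotides: int,
    excluded_families: Optional[Set[str]] = None,
) -> Optional[float]:
    """Sum variant sites over shared families and scale to the expected core size.

    Assumption: linear scaling, i.e. diffs * expected / compared.
    """
    excluded = excluded_families or set()
    total_diffs = 0
    total_compared = 0
    # Families present in only one assembly are unmatched and ignored.
    for family in (assembly_a.keys() & assembly_b.keys()) - excluded:
        diffs, compared = compare_loci(assembly_a[family], assembly_b[family])
        total_diffs += diffs
        total_compared += compared
    if total_compared == 0:
        return None  # nothing could be compared
    return total_diffs * expected_nucleotides / total_compared


# Usage: 2 variant sites over 50 compared nucleotides, expected core of 200.
a = {"fam1": Locus(1, 100, {10: "A", 20: "T"}), "fam2": Locus(1, 80)}
b = {"fam1": Locus(1, 50, {10: "A", 30: "G"})}
print(scaled_score(a, b, expected_nucleotides=200))  # 8.0
```

Tracking the compared nucleotides alongside the variant count is what allows a partial assembly to be placed on the same scale as a complete one.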

## The Dendrogram

1. A dendrogram is then constructed by writing all scaled pairwise scores to a matrix and running the neighbour-joining implementation from the APE package ([Paradis et al.](https://www.ncbi.nlm.nih.gov/pubmed/14734327)).
2. The resulting tree is then midpoint rooted using the phangorn package ([KP Schliep](https://www.ncbi.nlm.nih.gov/pubmed/21169378)).
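
As a rough illustration of the matrix step only, the following Python sketch assembles a symmetric distance matrix from pairwise scores. The function name `build_distance_matrix` and the `pair_score` callback are hypothetical, and neighbour joining and midpoint rooting are not reimplemented here; in the pipeline described above those steps are handled by APE and phangorn.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def build_distance_matrix(
    names: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> List[List[float]]:
    """Fill a symmetric matrix with one scaled score per assembly pair."""
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    for i, j in combinations(range(n), 2):
        score = pair_score(names[i], names[j])
        matrix[i][j] = score
        matrix[j][i] = score  # each pair is scored once and mirrored
    return matrix


# Usage with made-up scores for three assemblies.
scores = {("a", "b"): 1.0, ("a", "c"): 2.5, ("b", "c"): 4.0}
matrix = build_distance_matrix(["a", "b", "c"], lambda x, y: scores[(x, y)])
for row in matrix:
    print(row)
```

Because each pair is scored once and mirrored, the matrix is symmetric by construction, which is the form a distance-based method such as neighbour joining expects.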

