Each assembly is linked to the nearest reference assembly by comparing the substitutions in its core profile with those in each reference core profile. The reference assignment is then used to identify potentially unreliable loci in the query assembly, using the variation filter method described in the Core Filter section.
For some species (e.g. Salmonella Typhi), assemblies with the same reference assignment will be clustered to provide a more fine-grained view, which is useful for large collections in the Collection View.
- 1. The core profile is generated for each reference assembly.
- 2. All substitutions, excluding those with non-ATCG characters, are extracted and aggregated into a single list of variant locations per gene family.
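The aggregation in the two steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the input layout (a mapping from reference name to gene family to `{position: base}`) and the function name `build_species_profile` are assumptions made for the example.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")


def build_species_profile(reference_profiles):
    """Aggregate variant positions per gene family across all references.

    reference_profiles is a hypothetical layout:
        {reference_name: {family_id: {position: base}}}
    """
    sites = defaultdict(set)
    for families in reference_profiles.values():
        for family_id, substitutions in families.items():
            for position, base in substitutions.items():
                # Skip substitutions with non-ATCG characters (e.g. N).
                if base.upper() in VALID_BASES:
                    sites[family_id].add(position)
    # One sorted list of variant locations per gene family.
    return {family: sorted(positions) for family, positions in sites.items()}


references = {
    "refA": {"fam1": {10: "A", 25: "N"}, "fam2": {3: "T"}},
    "refB": {"fam1": {10: "G", 40: "C"}},
}
species_profile = build_species_profile(references)
```

In this example position 25 of `fam1` is dropped because its only substitution is an `N`, while position 10 appears once even though two references vary there.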
- 1. Each assembly is compared against each reference at all the sites in the species profile, excluding sites outside the boundaries of any fragment matches.
- 2. The total number of sites in common is divided by the total number of compared sites to generate a similarity score.
- 3. The query assembly is then assigned to the subgroup identified by the name of the most similar reference. If two references have the same score, alphabetical order is used.
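The scoring and assignment steps above can be sketched as follows. This is a hedged illustration under several assumptions that the text does not confirm: each profile is a dictionary with hypothetical `ranges` (matched fragment boundaries per family) and `substitutions` keys; a site is compared only when it is covered by a match in both the query and the reference; two assemblies have a site "in common" when they carry the same base there, or when neither has a substitution; and ties go to the alphabetically first reference name.

```python
def covered(ranges, family, position):
    """True if the position lies inside any matched fragment of the family."""
    return any(start <= position <= end for start, end in ranges.get(family, []))


def similarity(query, reference, species_profile):
    """Fraction of compared species-profile sites at which both profiles agree."""
    compared = shared = 0
    for family, positions in species_profile.items():
        for position in positions:
            # Exclude sites outside the boundaries of any fragment match.
            if not (covered(query["ranges"], family, position)
                    and covered(reference["ranges"], family, position)):
                continue
            compared += 1
            q = query["substitutions"].get(family, {}).get(position)
            r = reference["substitutions"].get(family, {}).get(position)
            if q == r:  # same base, or no substitution in either profile
                shared += 1
    return shared / compared if compared else 0.0


def assign_reference(query, references, species_profile):
    """Return the most similar reference name and all scores."""
    scores = {name: similarity(query, ref, species_profile)
              for name, ref in references.items()}
    # Highest score first; equal scores are resolved alphabetically (assumed).
    best = min(scores, key=lambda name: (-scores[name], name))
    return best, scores


species_profile = {"fam1": [10, 40], "fam2": [3]}
full = {"fam1": [(1, 100)], "fam2": [(1, 50)]}
references = {
    "refA": {"ranges": full, "substitutions": {"fam1": {10: "A"}, "fam2": {3: "T"}}},
    "refB": {"ranges": full, "substitutions": {"fam1": {10: "G", 40: "C"}}},
}
query = {"ranges": {"fam1": [(1, 100)]},
         "substitutions": {"fam1": {10: "G", 40: "C"}}}

best, scores = assign_reference(query, references, species_profile)
```

Here the `fam2` site is skipped because the query has no fragment match for that family, so only the two `fam1` sites contribute to each score.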