Genome Statistics

Last updated 7 years ago

About

For each genome uploaded to Pathogenwatch, a summary set of statistics is calculated. These can help provide insight into assembly quality and completeness. For instance, the highlighted assembly below has a high number of non-ATCG characters and is broken into many contigs, but the N50 is reasonably high and the core is well covered. The statistics can also be viewed on the Genome Browsing and Assembly Report pages.

The Statistics

Core

CORE MATCHES

The number of core genes matched by the core library.

% CORE FAMILIES

The percentage of core families matched. This can be useful for identifying genomes that are missing large sections or have been assigned to the wrong species, perhaps a closely related one.

% NON-CORE GENOME

The percentage of the assembly that has not been assigned to a core gene.
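
To make the two percentages above concrete, here is a minimal Python sketch. It does not reproduce the Pathogenwatch pipeline: the function names, the input shapes (sets of family identifiers, half-open `(start, end)` match coordinates per contig) and the choice to count overlapping matches only once are all assumptions made for illustration.

```python
def percent_core_families(matched_families, library_families):
    """Percentage of the library's core families with at least one match.

    Both arguments are hypothetical iterables of family identifiers.
    """
    library = set(library_families)
    if not library:
        return 0.0
    return 100.0 * len(set(matched_families) & library) / len(library)


def covered_bases(intervals):
    """Bases covered by half-open (start, end) intervals on one contig.

    Overlapping regions are counted once (an assumption of this sketch).
    """
    covered = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start >= current_end:
            # Interval starts after everything seen so far: count it fully.
            covered += end - start
            current_end = end
        elif end > current_end:
            # Interval overlaps the previous one: count only the new part.
            covered += end - current_end
            current_end = end
    return covered


def percent_non_core(contig_lengths, core_hits):
    """Percentage of the assembly not covered by any core gene match.

    contig_lengths: {contig_name: length}
    core_hits: {contig_name: [(start, end), ...]}
    """
    total = sum(contig_lengths.values())
    if total == 0:
        return 0.0
    core = sum(covered_bases(core_hits.get(name, [])) for name in contig_lengths)
    return 100.0 * (total - core) / total
```

For example, an assembly that matches 2 of 4 library families gives 50% core families, and a single 100-base contig with 25 bases covered by core matches gives 75% non-core genome.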

Genome Statistics

LENGTH

The length of the genome in nucleotide pairs, calculated by summing the lengths of the individual contigs.

N50

The N50 (see Wikipedia) is the length of the shortest contig in the smallest set of largest contigs that together cover at least half of the assembly. Better assemblies, in which the genome has been assembled into a small number of long contigs, will have a larger N50. The closer the N50 comes to the size of a gene, the more likely it is that core genes have been only partially or incorrectly assembled.

NO. CONTIGS

The number of contigs in the assembly. Ideally this would match the number of chromosomes and plasmids in the genome, though tens or hundreds of contigs are more typical. An assembly with a well-formed core can still contain many small contigs, so it is best to use this number in conjunction with the N50 when judging quality.

NON-ATCG

The number of non-ATCG characters in the assembly. 'N', for an uncertain nucleotide, is a common occurrence. Again, the ideal is for there to be none present. Their impact is minimal for most analyses, but more than a few hundred could indicate an issue with sequencing or assembly.

%GC CONTENT

The percentage of the nucleotides that are either guanine or cytosine. Most species show little variance in their GC-AT ratio over the whole genome, so a significant deviation from the expected value might indicate contamination or missing parts of the assembly.

Statistics in the Core section are produced by the core genome tree building method.
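
As a rough illustration of how the genome statistics relate to the assembly file, the following Python sketch computes them from FASTA text. It is not the Pathogenwatch implementation: the function names (`read_fasta`, `n50`, `assembly_stats`), the example sequences, and the choice to divide the GC count by the total length (including non-ATCG characters) are all assumptions made for the example.

```python
from io import StringIO

# Hypothetical three-contig assembly used only for illustration.
EXAMPLE_FASTA = """>contig1
ATGCATGCNN
>contig2
GGCC
>contig3
AT
"""


def read_fasta(handle):
    """Yield one sequence string per FASTA record (headers are discarded)."""
    chunks = []
    seen_header = False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield "".join(chunks)
            chunks = []
            seen_header = True
        elif line and seen_header:
            chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def n50(lengths):
    """Length of the contig at which the running total, taken over contigs
    sorted longest first, reaches at least half of the assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_stats(contigs):
    """Summarise an iterable of contig sequences."""
    contigs = [c.upper() for c in contigs]
    lengths = [len(c) for c in contigs]
    total = sum(lengths)
    gc = sum(c.count("G") + c.count("C") for c in contigs)
    atcg = sum(c.count(base) for c in contigs for base in "ATCG")
    return {
        "length": total,          # sum of contig lengths
        "n50": n50(lengths),
        "contigs": len(contigs),
        "non_atcg": total - atcg,  # e.g. 'N' characters
        # Assumption: GC is expressed as a share of the total length.
        "gc_content": round(100 * gc / total, 1) if total else 0.0,
    }


if __name__ == "__main__":
    print(assembly_stats(read_fasta(StringIO(EXAMPLE_FASTA))))
```

In the example, the longest contig alone (10 of 16 bases) already covers half of the assembly, so the N50 equals its length. Breaking the same sequence into more, shorter contigs would lower the N50 while leaving the total length unchanged, which is why the two numbers are best read together.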
