Genome Statistics

About

For each genome uploaded to Pathogenwatch, a summary set of statistics is calculated. These can provide insight into assembly quality and completeness. For instance, the highlighted assembly below has a high number of non-ATCG characters and is broken into many contigs, but the N50 is reasonably high and the core is well covered. The statistics can also be viewed in the Assembly Report in the Genome Browsing pages.

The Statistics

Core

Statistics produced by the core genome tree building method.

CORE MATCHES

The number of core genes matched by the core library.

% CORE FAMILIES

The percentage of core families matched. This can be useful for identifying genomes that are missing large sections or have been assigned to the wrong species, perhaps a closely related one.

% NON-CORE GENOME

The percentage of the assembly that has not been assigned to a core gene.

Genome Statistics

LENGTH

The length of the genome in base pairs, calculated by summing the lengths of the individual contigs.
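As a rough illustration of that sum, the sketch below parses FASTA-formatted text into contigs and totals their lengths. The function names `read_contigs` and `assembly_length` are hypothetical, chosen for this example only; this is not Pathogenwatch's own code.

```python
def read_contigs(fasta_text):
    """Return a list of contig sequences from FASTA-formatted text."""
    contigs = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new contig; store the previous one.
            if current:
                contigs.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        contigs.append("".join(current))
    return contigs


def assembly_length(contigs):
    """Total assembly length: the sum of the individual contig lengths."""
    return sum(len(contig) for contig in contigs)


# Two small made-up contigs, purely for illustration.
example = ">contig1\nATGCATGC\nATGC\n>contig2\nGGCC\n"
contigs = read_contigs(example)
print(len(contigs), assembly_length(contigs))
```

The same list of contigs also gives the contig count directly, via `len(contigs)`.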

N50

The N50 (Wikipedia) is the length of the shortest contig in the smallest set of largest contigs that together cover at least half of the assembly. Better assemblies, in which the core genome has been assembled into a small number of long contigs, will have a larger N50. The closer the N50 comes to the size of a gene, the more likely it is that core genes have been only partially or incorrectly assembled.
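A common way to compute the N50 is to sort the contig lengths from longest to shortest and accumulate them until at least half of the total assembly length is reached; the length of the contig at that point is the N50. A minimal sketch, using a hypothetical function name `n50` and made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the contig length at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs supplied


# Total length is 300; the two longest contigs (80 + 70) reach 150,
# so the N50 is the length of the second of them.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```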

NO. CONTIGS

The number of contigs in the assembly. Ideally this would match the number of chromosomes and plasmids in the genome, though tens or hundreds of contigs are more typical. An assembly with a well-formed core can still contain many small contigs, so it is best to use this number in conjunction with the N50 when making quality judgements.

NON-ATCG

This is the number of non-ATCG characters in the assembly. The most common is 'N', which denotes an uncertain nucleotide. Again, the ideal is for there to be none present. While their impact is minimal for most analyses, more than a few hundred could indicate an issue with sequencing or assembly.
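Counting these characters is straightforward. The sketch below assumes the contigs are available as plain strings and treats lower-case bases as valid; both the function name `count_non_atcg` and the case handling are assumptions for illustration, not a description of how Pathogenwatch does it.

```python
def count_non_atcg(contigs):
    """Count characters other than A, T, C or G across all contigs.

    Assumption: case is ignored, so 'a', 't', 'c' and 'g' count as valid.
    """
    return sum(
        1
        for contig in contigs
        for base in contig.upper()
        if base not in "ATCG"
    )


# 'N', 'N' and the ambiguity code 'R' are counted.
print(count_non_atcg(["ATGCNNAT", "ggcRa"]))  # prints 3
```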

%GC CONTENT

The percentage of the nucleotides that are either guanine or cytosine. Most species show little variation in genome-wide GC content, so a significant deviation from the value expected for the species might indicate contamination or missing parts of the assembly.
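The calculation can be sketched as follows. One assumption here is that ambiguous characters such as 'N' are excluded from the denominator; the source does not say how they are handled, and the function name `gc_content` is hypothetical.

```python
def gc_content(contigs):
    """Percentage of unambiguous bases (A, T, C, G) that are G or C.

    Assumption: non-ATCG characters are left out of the denominator.
    """
    sequence = "".join(contigs).upper()
    gc = sequence.count("G") + sequence.count("C")
    atcg = gc + sequence.count("A") + sequence.count("T")
    return 100.0 * gc / atcg if atcg else 0.0


# 6 of the 8 bases are G or C.
print(gc_content(["ATGC", "GGCC"]))  # prints 75.0
```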
