Uploading Genomes

Description of the upload page and file formats.

Introduction

We provide, free of charge, the ability to analyse large numbers of new microbial assemblies in Pathogenwatch. To upload your own microbial pathogen assembly data, click on the 'Uploads' tab on the top right and follow the onscreen instructions. You will have the choice to upload three types of file:

  1. One or more FASTAs each containing a single genome (e.g. bacterial genomes);

  2. One or more FASTAs containing one genome per record (e.g. a FASTA of multiple viral genomes);

  3. Pairs of read files in FASTQ format.

File Formats

Single genome FASTAs

Sequences must be represented in standard IUPAC code (e.g. ATCGATCGNA). Each record represents a single contig in the assembly. By default, the file name is used to name the genome and to link it to a record in an accompanying metadata CSV. More than one file can be uploaded at a time, though we recommend small batches on slow or unstable internet connections.
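As a rough illustration of the format rules above, the following Python sketch checks a FASTA before upload. It is not part of Pathogenwatch: the function names are hypothetical, and the accepted character set (the IUPAC nucleotide codes without gap characters) is an assumption rather than the exact rule the service applies.

```python
# Hypothetical pre-upload check; not the validation Pathogenwatch itself runs.
# Assumption: only IUPAC nucleotide codes are allowed, with no gap characters.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples, one per FASTA record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_contigs(text):
    """Return the headers of contigs containing non-IUPAC characters."""
    return [
        header
        for header, sequence in parse_fasta(text)
        if not set(sequence.upper()) <= IUPAC_NUCLEOTIDES
    ]


example = ">contig1\nATCGATCGNA\n>contig2\nATCGXTCG\n"
print(invalid_contigs(example))  # prints ['contig2']
```

Each record returned by parse_fasta corresponds to one contig of the assembly, so an empty result from invalid_contigs suggests the file is at least well formed.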

Multi-genome FASTAs

Sequences must be represented in standard IUPAC code (e.g. ATCGATCGNA). Each record represents an assembled or complete viral genome. By default, the record header is used to name each genome and to link it to a record in an accompanying metadata CSV. More than one file can be uploaded at a time, though we ask users not to submit thousands of genomes at once, as this will impact other users.
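Because each record header names a genome and links it to the metadata, it can help to list the headers before uploading. The sketch below is only illustrative: the function names are hypothetical, and treating the text up to the first whitespace as the identifier is an assumption, not documented Pathogenwatch behaviour.

```python
def record_identifiers(fasta_text):
    """Return one identifier per record in a multi-genome FASTA."""
    identifiers = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the identifier is the header up to the first whitespace.
            identifiers.append(header.split()[0] if header else "")
    return identifiers


def duplicated(identifiers):
    """Return identifiers that occur more than once, in first-repeat order."""
    seen, repeats = set(), []
    for identifier in identifiers:
        if identifier in seen and identifier not in repeats:
            repeats.append(identifier)
        seen.add(identifier)
    return repeats


multi_fasta = ">virusA sample 1\nACGT\n>virusB\nACGTN\n>virusA\nAC\n"
ids = record_identifiers(multi_fasta)
print(ids)              # prints ['virusA', 'virusB', 'virusA']
print(duplicated(ids))  # prints ['virusA']
```

Repeated identifiers are worth resolving first, since two genomes with the same header cannot be told apart when matching metadata rows.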

Sequence Reads

You can also upload a limited number of pairs of FASTQ files for assembly using our in-house assembly pipeline. The default genome name is taken from the shared part of the filename of the FASTQs. For more details about this pipeline, please see the technical documentation.
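To preview what the shared part of two FASTQ file names might look like, here is a minimal sketch. The exact rule Pathogenwatch uses is not described here, so the approach below (longest common prefix with a trailing read marker removed) and the function name shared_name are assumptions for illustration only.

```python
import os
import re


def shared_name(file_a, file_b):
    """Guess a genome name from the common prefix of two FASTQ file names."""
    prefix = os.path.commonprefix(
        [os.path.basename(file_a), os.path.basename(file_b)]
    )
    # Assumption: a trailing "_R", "_", "." or "-" belongs to the read marker
    # (e.g. "_R1" / "_R2" or "_1" / "_2") rather than to the sample name.
    return re.sub(r"[._-]*R?$", "", prefix)


print(shared_name("sample1_R1.fastq.gz", "sample1_R2.fastq.gz"))  # prints sample1
print(shared_name("reads/isolate7_1.fastq", "reads/isolate7_2.fastq"))  # prints isolate7
```

If the guessed name is not what you want, a displayname column in the metadata CSV can provide a different default name.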

Metadata

Metadata files are accepted in CSV format, with a .csv file extension.

  • One row per assembly.

  • Rows are linked to the FASTA file by a column titled filename. For multi-genome FASTAs (e.g. viral genomes), put the identifier from each record header in this column.

  • Provide a default name for an assembly with the column displayname.

  • Geographical location is provided by columns titled latitude and longitude.

  • Sample timestamps are recorded as three separate columns: year, month, day.

  • Literature references can be provided as DOI system identifiers (e.g. ) or PubMed identifiers (e.g. ) in a column called literaturelink, or in two columns called doi and pmid respectively. If a column called literaturelink is provided, any columns called doi or pmid are added to general user metadata instead and are not treated as literature references.

We strongly recommend including at least when and where the sample was taken.
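The column layout above can be sketched with Python's standard csv module. The column names come from the list above; the sample row, the file name and the write_metadata helper are made up for illustration, and whether the filename value should include the file extension is an assumption to check against your own upload.

```python
import csv
import io

# Column names taken from the list above; literaturelink, doi and pmid are
# optional and left out of this sketch.
FIELDNAMES = [
    "filename", "displayname", "latitude", "longitude", "year", "month", "day",
]

# Made-up example row: one row per assembly.
rows = [
    {
        "filename": "sample1.fasta",  # assumption: full name of the uploaded file
        "displayname": "Sample 1",
        "latitude": 51.5,
        "longitude": -0.12,
        "year": 2019,
        "month": 3,
        "day": 14,
    },
]


def write_metadata(rows, handle):
    """Write metadata rows as CSV; missing optional values are left blank."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_metadata(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open("metadata.csv", "w", newline="") and pass the handle to write_metadata.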

Uploading Options

Enable Compression

This will compress the files prior to upload. On a fast connection this will have little impact and may even slow the upload down, but it can significantly improve upload times on a slower connection.

Upload Files Individually

If your connection regularly disconnects, this option increases the chance that each file is uploaded successfully. Enabling compression should also help in this case.

Select these options prior to dropping your files onto the page.

The Processing Screen

Task Progress

The tasks being carried out, and their individual progress, are tracked in the bottom left corner. The overall progress and current stage are tracked on the top right, which also indicates when processing is complete.

As results arrive from Speciator, and then MLST, the species and type are displayed for each submitted assembly in the animated circle on the right.

Viewing The Results

Once all tasks are complete, you can press the "View Genomes" button to view the results in your "Genomes" page. Individual uploads are tagged and listed in the bottom left corner.
