Uploading Genomes
Description of the upload page and file formats.
Last updated
Pathogenwatch lets you analyse large numbers of new microbial assemblies free of charge. To upload your own microbial pathogen assembly data, click the 'Uploads' tab at the top right and follow the on-screen instructions. You will have the choice to upload three types of file:
One or more FASTAs each containing a single genome (i.e. bacterial genomes);
One or more FASTAs containing one genome per record (e.g. a FASTA of multiple viral genomes);
Pairs of read files in FASTQ format.
For single-genome FASTAs, sequences must be represented in standard IUPAC code (e.g. ATCGATCGNA). Each record represents a single contig in the assembly. The file name is used to name the genome by default and to link to a record in an accompanying metadata CSV. More than one file can be uploaded at a time, though we recommend small batches on slow or unstable internet connections.
For multi-genome FASTAs, sequences must be represented in standard IUPAC code (e.g. ATCGATCGNA). Each record represents an assembled or complete viral genome. The record header is used to name each genome by default and to link to records in an accompanying metadata CSV. More than one file can be uploaded at a time, though we ask users not to submit thousands of genomes at once, as this will affect other users.
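Before uploading, it can help to check a FASTA locally for characters outside the IUPAC nucleotide alphabet. The following is a minimal sketch, not Pathogenwatch's own validation: the function name `check_fasta` is hypothetical, and the accepted alphabet (DNA codes only, no `U` and no gap characters) is an assumption.

```python
# Assumed alphabet: IUPAC DNA codes, including ambiguity codes.
IUPAC_NUCLEOTIDES = set("ACGTRYSWKMBDHVN")


def check_fasta(text):
    """Return (record header, invalid characters) for each record
    that contains symbols outside the assumed IUPAC alphabet."""
    problems = []
    header = None
    bad = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: report the previous one if it had problems.
            if header is not None and bad:
                problems.append((header, "".join(sorted(bad))))
            header = line[1:]
            bad = set()
        else:
            # Compare case-insensitively, since FASTA may use lower case.
            bad |= set(line.upper()) - IUPAC_NUCLEOTIDES
    if header is not None and bad:
        problems.append((header, "".join(sorted(bad))))
    return problems


example = ">contig1\nATCGATCGNA\n>contig2\nATCGXZ\n"
report = check_fasta(example)
```

Here `report` lists only `contig2`, because `X` and `Z` are not IUPAC nucleotide codes.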
You can also upload a limited number of pairs of FASTQ files for assembly using our in-house assembly pipeline. The default genome name is taken from the shared part of the filename of the FASTQs. For more details about this pipeline, please see the technical documentation.
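To preview which name a pair of read files might receive, the shared part of the two file names can be approximated locally. This is a rough sketch under assumptions: the exact rule Pathogenwatch applies is not described here, and the function `shared_name` and the `_R1`/`_R2` naming pattern are illustrative only.

```python
import os
import re


def shared_name(read1_path, read2_path):
    """Approximate the shared part of two FASTQ file names."""
    names = [os.path.basename(read1_path), os.path.basename(read2_path)]
    # Longest common leading substring of the two file names.
    prefix = os.path.commonprefix(names)
    # Drop a trailing separator and optional 'R' left over from
    # suffixes such as _R1/_R2 or _1/_2 (an assumed convention).
    return re.sub(r"[._-]R?$", "", prefix)


name = shared_name("run/sample_R1.fastq.gz", "run/sample_R2.fastq.gz")
```

With these hypothetical file names, `name` is `sample`.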
Metadata files are accepted in CSV format, with a .csv file ending.
One row per assembly.
Rows are linked to the FASTA file by a column titled filename. For multi-genome FASTAs (e.g. viral genomes), put the identifier from each record header in this column.
Provide a default name for an assembly with the column displayname.
Geographical location is provided by columns titled latitude and longitude.
Sample timestamps are recorded as three separate columns: year, month, day.
Literature references can be provided as DOI system identifiers (e.g. ) or PubMed identifiers (e.g. ) in a column called literaturelink, or in two columns called doi and pmid respectively. If a column called literaturelink is provided, any columns called doi or pmid will be added to general user metadata instead and otherwise ignored.
We strongly recommend including at least when and where the sample was taken.
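A metadata file with the columns described above can be produced with any spreadsheet tool or script. The sketch below uses Python's standard `csv` module; the helper `build_metadata_csv` and all row values (file name, coordinates, date) are hypothetical placeholders, not real samples.

```python
import csv
import io

# Column titles taken from the list above.
COLUMNS = ["filename", "displayname", "latitude", "longitude",
           "year", "month", "day"]


def build_metadata_csv(rows):
    """Serialise a list of dicts into CSV text, one row per assembly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example row: the filename must match the uploaded FASTA
# (or the record header, for multi-genome FASTAs).
rows = [
    {"filename": "sample_01.fasta", "displayname": "Sample 01",
     "latitude": 51.5, "longitude": -0.12,
     "year": 2019, "month": 6, "day": 14},
]
metadata_csv = build_metadata_csv(rows)

# To save it for upload, write the text to a file ending in .csv:
# with open("metadata.csv", "w", newline="") as handle:
#     handle.write(metadata_csv)
```

Keeping the column titles in one list makes it harder to misspell a header, which matters because the linking relies on the exact title filename.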
This option compresses the files before upload. On a fast connection it has little benefit and may even slow the upload down, but on a slower connection it can significantly improve upload times.
If your connection regularly disconnects, this option increases the chance that each file is uploaded successfully. Compression should also help in this case.
Select these options prior to dropping your files onto the page.
The tasks being carried out and their individual progress are tracked in the bottom left corner. The overall progress and current stage are tracked at the top right, where completion is also indicated.
As results arrive from Speciator, and then MLST, the species and type are displayed for each submitted assembly in the animated circle on the right.
Once all tasks are complete, you can press the "View Genomes" button to view the results in your "Genomes" page. Individual uploads are tagged and listed in the bottom left corner.