Short Read Assembly
How Pathogenwatch assembles short read Illumina data
Last updated
Primarily for the benefit of smaller labs or individual researchers without bioinformatics resources, we also provide a small-scale assembly service. Please be aware that resources are limited, and large submissions are likely to take a long time to process.
We're using an assembly pipeline developed by our , based on . The pipeline is built using Nextflow, allowing it to run locally, on a cluster, or in the cloud. Find out more on the .
The assembly pipeline is run using the Pathogenwatch runner system and is included in the user fair-share calculations. Assembly tasks are heavily weighted in these calculations due to their large resource requirements, so if you are running them you may not see other submissions running for a while.
Start by visiting the upload page: and select "Short Read Assembly". If you don't have an account then you can create one quickly using your email address or social media account.
Drag your reads onto the page and they will start uploading.
You will start to see progress as your genomes are assembled and analysed.
When your genomes have finished being analysed you can view the results as normal. You can also download your assemblies from the if you want to analyse them locally.
Assembling and analysing one genome can take 15-45 minutes, or longer depending on the species and how busy the servers are.
You can find more details of the pipeline on GitHub:
At time of writing, we run version 1.3.1 of the pipeline with the following commands:
The pipeline dependencies can be found in the Docker container bioinformant/ghru-assembly:1.3.1. We provide additional arguments to run the pipeline on our infrastructure, which will not affect the end results.
In due course we will add additional QC checks and present errors to the user.
We only support paired Illumina read data in .fastq.gz format. We welcome feedback on other sequencing platforms and suitable assembly pipelines.
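Before uploading, it can help to confirm locally that every sample has both read files and that each file really is gzip-compressed FASTQ. The following is a minimal sketch and not part of Pathogenwatch. It assumes a common naming convention (sample_1.fastq.gz / sample_2.fastq.gz, or sample_R1.fastq.gz / sample_R2.fastq.gz) that this page does not specify, and the function names are hypothetical.

```python
import gzip
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention (not defined by Pathogenwatch):
# <sample>_1.fastq.gz / <sample>_2.fastq.gz, optionally with an "R" prefix.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+?)_R?(?P<read>[12])\.fastq\.gz$")


def group_read_pairs(filenames):
    """Group file names by sample. Returns (pairs, problems)."""
    by_sample = defaultdict(dict)
    problems = []
    for name in filenames:
        match = PAIR_PATTERN.match(Path(name).name)
        if match is None:
            problems.append(f"{name}: not a recognised paired .fastq.gz name")
            continue
        by_sample[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(by_sample.items()):
        if set(reads) == {"1", "2"}:
            pairs[sample] = (reads["1"], reads["2"])
        else:
            missing = "2" if "1" in reads else "1"
            problems.append(f"{sample}: missing read {missing} file")
    return pairs, problems


def looks_like_gzipped_fastq(path):
    """Return True if the file decompresses and its first line starts with '@'."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.readline().startswith(b"@")
    except (OSError, EOFError):
        # Not gzip data, or a truncated archive.
        return False
```

Calling group_read_pairs on a list of file names returns the complete pairs plus a list of human-readable problems, so unpaired or misnamed files can be fixed before they are dragged onto the upload page. The content check only inspects the first line, so it is a quick sanity test rather than full FASTQ validation.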
Because the pipeline is based on SPAdes, we expect to receive bacterial or viral genomes. The only hard constraint is that the organism should have a predicted genome length of less than 20,000,000 bases.
We have set fair usage constraints for all our tasks in a single integrated queuing system. Assembly tasks require significant resources and so are weighted more heavily than other tasks. If you want to process more than a few (e.g. 10) genomes, your submission will progress much more quickly if they are already assembled. Since our resources are openly shared with anyone who wants to use them, users uploading assembled genomes will tend to get priority and jump ahead in the queue.
The SPAdes assembler: Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. 2020;70(1):e102. doi:10.1002/cpbi.102
The pipeline is available under an OSS licence from