Short Read Assembly

How Pathogenwatch assembles short read Illumina data

Primarily for the benefit of smaller labs and individual researchers without bioinformatics resources, we also provide a small-scale assembly service. Please be aware that resources are limited, and large submissions are likely to take a long time to process.

We use an assembly pipeline developed by our Global Health Research Unit, based on SPAdes. The pipeline is built using Nextflow, allowing it to run locally, on a cluster, or in the cloud. Find out more on the pipeline project page.

The assembly pipeline is run using the Pathogenwatch runner system and is included in the user fair-share calculations. Assembly tasks are heavily weighted in these calculations due to their large resource requirements, so if you are running them you may not see other submissions running for a while.

Usage

Start by visiting the upload page at https://pathogen.watch/upload and selecting "Short Read Assembly". If you don't have an account, you can create one quickly using your email address or a social media account.

Drag your reads onto the page and they will start uploading.

You will start to see progress as your genomes are assembled and analysed.

It may take some time before your genomes start being assembled. During this time we are automatically creating new servers.

When your genomes have finished being analysed you can view the results as normal. You can also download your assemblies from the genomes page if you want to analyse them locally.

Expected results

Assembling and analysing one genome can take 15-45 minutes. It can take longer depending on the species and how busy the servers are.

Pipeline details

You can find more details of the pipeline on GitLab: https://gitlab.com/cgps/ghru/pipelines/assembly

At the time of writing, we run version 1.3.1 of the pipeline with the following command:

nextflow run assembly.nf \
  --input_dir DIRECTORY_IN_S3 \
  --fastq_pattern '*_{1,2}.fastq.gz' \
  --output_dir DIRECTORY_IN_S3 \
  --adapter_file adapters.fas \
  --depth_cutoff 100 \
  --qc_conditions qc_conditions_nextera.yml \
  --prescreen_size_check 20000000

The pipeline dependencies can be found in the Docker image bioinformant/ghru-assembly:1.3.1. We pass additional arguments to run the pipeline on our infrastructure; these do not affect the end results.

In due course we will add additional QC checks and present errors to the user.

Constraints

We only support paired Illumina read data in .fastq.gz format. We welcome feedback on other sequencing platforms and suitable assembly pipelines.
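Because only paired reads are accepted, it can help to check file names before uploading. The following is a minimal sketch and not part of Pathogenwatch. It assumes your files follow the same naming as the pipeline's --fastq_pattern ('*_{1,2}.fastq.gz'); whether the upload page enforces exactly this naming is an assumption, and the function name pair_reads is hypothetical.

```python
import re
from collections import defaultdict

# Approximates the glob '*_{1,2}.fastq.gz' from the pipeline command.
# Assumption: the part before '_1' / '_2' is treated as the sample name.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names into complete read pairs.

    Returns (pairs, rejected): pairs maps sample name to a
    (forward, reverse) tuple; rejected is a sorted list of files that
    do not match the pattern or are missing their mate.
    """
    mates = defaultdict(dict)
    rejected = []
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match is None:
            rejected.append(name)
            continue
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in mates.items():
        if set(found) == {"1", "2"}:
            pairs[sample] = (found["1"], found["2"])
        else:
            # Only one mate present: not usable as paired data.
            rejected.extend(found.values())
    return pairs, sorted(rejected)
```

Calling pair_reads on a list of file names from a local directory (for example, the output of os.listdir) gives a quick view of which samples are complete and which files would need attention first.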

Because the pipeline is based on SPAdes, we expect to receive bacterial or viral genomes. The only hard constraint is that the organism should have a predicted genome length of less than 20,000,000 bases.

Fair Usage Policy

We have set fair usage constraints for all our tasks in a single integrated queuing system. Assembly tasks require significant resources and so are weighted more heavily than other tasks. This means that if you want to process more than a few genomes (e.g. 10), they will progress much more quickly if they are already assembled. Since our resources are openly shared with anyone who wants to use them, users uploading assembled genomes will tend to get priority and jump ahead in the queue.
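To illustrate why heavily weighted tasks fall behind in a shared queue, here is a toy sketch of weighted fair-share scheduling. It is not the Pathogenwatch scheduler: the weights in TASK_WEIGHTS, the task names, and the schedule function are all hypothetical, chosen only to show the general effect.

```python
import heapq

# Hypothetical weights; the real values used by Pathogenwatch are not
# given in this document.
TASK_WEIGHTS = {"assembly": 10, "analysis": 1}


def schedule(queues):
    """Return the run order as a list of (user, task) tuples.

    queues maps each user to their pending task types, in submission
    order. The user with the lowest accumulated weighted usage runs
    next; ties are broken alphabetically by user name.
    """
    pending = {user: list(tasks) for user, tasks in queues.items()}
    heap = [(0, user) for user in pending if pending[user]]
    heapq.heapify(heap)

    order = []
    while heap:
        used, user = heapq.heappop(heap)
        task = pending[user].pop(0)
        order.append((user, task))
        used += TASK_WEIGHTS[task]  # charge the user for this task
        if pending[user]:
            heapq.heappush(heap, (used, user))
    return order
```

In this toy model, a user who submits two assemblies alongside another user who submits three already-assembled genomes sees their second assembly run last, because the first assembly used up far more of their share.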
