💾Public data downloads
Accessing complete species metadata, analysis and FASTA downloads
About the downloads
To make the Pathogenwatch public data sets easier to access, we have exported all the metadata and analysis CSVs, along with the assembled genome FASTAs, to a public S3-compatible bucket on DigitalOcean.
The root bucket URL is https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com
File naming scheme
Species names contain characters, such as spaces, that need to be URL encoded before they can be used in a download link (for example, a space becomes %20). Examples of how to do this are given below.
Annotation files
Name format
<species name>__<tool name>.csv.gz
Example file link
https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com/Klebsiella%20pneumoniae__kleborate.csv.gz
FASTA files
Name format
<species name>__fasta.zip
Example file link
https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com/Klebsiella%20pneumoniae__fastas.zip
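The naming scheme and the URL encoding step can be combined in a small shell helper. The following is a minimal sketch rather than part of any official tooling: the function names urlencode and annotation_url are hypothetical, and it assumes that file names contain only ASCII characters.

```shell
# Root bucket URL, as given at the top of this page.
ROOT_URL="https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com"

# Hypothetical helper: percent-encode a string for use in a URL path.
# Assumes ASCII input; runs in a subshell so its variables do not leak.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    s=$rest
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;                # unreserved, keep as-is
      *) out=$out$(printf '%%%02X' "'$c") ;;         # everything else -> %XX
    esac
  done
  printf '%s\n' "$out"
)

# Hypothetical helper: build the link for <species name>__<tool name>.csv.gz
annotation_url() {
  printf '%s/%s\n' "$ROOT_URL" "$(urlencode "${1}__${2}.csv.gz")"
}

# Example (prints the link, does not download anything):
#   annotation_url "Klebsiella pneumoniae" kleborate
```

The printed link can be pasted into a browser or passed to a download tool such as curl.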
Using the downloads bucket
Via the browser
Getting the complete list of files
Click on the root bucket URL to view an XML text representation of all the available files.
Downloading an individual file
Use Ctrl-F/Cmd-F to search the page for the name of the species.
Copy the root bucket URL into a new tab, add / at the end of the URL and append the contents of the Key field (i.e. the file name between <Key> and </Key>). Your browser should then download the file automatically (tested in Chrome).
With cURL/jq/yq on the command line
Getting the complete list of files
xq is a tool for parsing XML from the yq set of tools. It can be easily installed on most systems.
curl https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com | xq '.ListBucketResult.Contents[].Key'
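The listing covers every species, so it is often convenient to narrow it down to one. The following is a minimal sketch, assuming the keys arrive one per line without surrounding quotes (for example by passing jq's -r option through xq in the command above). keys_for_species is a hypothetical helper name.

```shell
# Hypothetical helper: read bucket keys on stdin (one per line) and keep
# only those that start with "<species name>__".
keys_for_species() {
  awk -v prefix="${1}__" 'index($0, prefix) == 1'
}

# Example (needs network access, so it is left commented out):
#   curl -s https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com \
#     | xq -r '.ListBucketResult.Contents[].Key' \
#     | keys_for_species "Klebsiella pneumoniae"
```

Matching on the species name plus the double underscore avoids picking up other species whose names merely begin with the same words.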
Downloading an individual file
jq is a tool for parsing JSON files on the command line. It can also be easily installed on most systems. Here it is used only to URL encode the file name.
Substitute the name of the file you wish to download into the command below.
curl -O https://pathogenwatch-public.ams3.cdn.digitaloceanspaces.com/$( printf "Klebsiella pneumoniae__kleborate.csv.gz" | jq -sRr '@uri' )
s3cmd
The easiest tool for working with S3 buckets is the s3cmd tool. It supports browsing, downloading and syncing from S3 buckets in general.
Getting the complete list of files
s3cmd --host ams3.cdn.digitaloceanspaces.com --host-bucket "%(bucket)s.ams3.cdn.digitaloceanspaces.com" ls s3://pathogenwatch-public | sed -re 's,\s+, ,g' | cut -f 4- -d ' '
Downloading an individual file
s3cmd --host ams3.cdn.digitaloceanspaces.com --host-bucket "%(bucket)s.ams3.cdn.digitaloceanspaces.com" get "s3://pathogenwatch-public/Klebsiella pneumoniae__kleborate.csv.gz"
Downloading all the files
This will download all the files into the current directory.
s3cmd --host ams3.cdn.digitaloceanspaces.com --host-bucket "%(bucket)s.ams3.cdn.digitaloceanspaces.com" get s3://pathogenwatch-public/ --recursive
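Each s3cmd command above repeats the same two host options. One way to avoid retyping them is a small wrapper function. This is only a convenience sketch: pw_s3cmd is a hypothetical name, and the options are exactly the ones used in the commands above.

```shell
# Hypothetical wrapper: run s3cmd with the DigitalOcean host options
# from this page, passing any further arguments through unchanged.
pw_s3cmd() {
  s3cmd --host ams3.cdn.digitaloceanspaces.com \
        --host-bucket "%(bucket)s.ams3.cdn.digitaloceanspaces.com" \
        "$@"
}

# Examples (need s3cmd and network access, so they are left commented out):
#   pw_s3cmd ls s3://pathogenwatch-public
#   pw_s3cmd get "s3://pathogenwatch-public/Klebsiella pneumoniae__kleborate.csv.gz"
```

Quoting "$@" keeps file names that contain spaces intact as single arguments.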
Other
There are also libraries supporting the S3 API in most programming languages and computation platforms (e.g. Nextflow).