Skip to content

Latest commit

 

History

History
374 lines (282 loc) · 21 KB

tutorial.md

File metadata and controls

374 lines (282 loc) · 21 KB

q2-fondue Tutorial

q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere)

A QIIME 2 plugin enabling an easy download of high throughput sequencing data and the corresponding metadata.

With this easy-to-use plugin you have plenty of deposited sequencing data at your fingertips! Want to set your own data in comparison to other published datasets or start off with a meta-analysis? q2-fondue is here to help! 🙌

Why use q2-fondue?

  • incorporated provenance tracking
  • direct integration with QIIME 2 sequence analysis pipeline - see Overview of QIIME 2 Plugin Workflows
  • support for multiple user interfaces
  • no need to navigate online databases to retrieve data
  • prevention of data loss upon space exhaustion
  • support for downloading (meta)data given a publication library

This tutorial will give you an insight into working with q2-fondue and how its output artifacts can further be used.

Some Background

There are numerous databases online, where researchers can deposit or retrieve high-throughput sequencing data. However, the three biggest and most important storage and sharing platforms certainly are the European Nucleotide Archive (ENA, by the European Bioinformatics Institute (EMBL-EBI)), the Sequence Read Archive (SRA, by the National Institutes of Health (NIH), a part of the U.S. Department of Health and Human Services) and the DNA Data Bank of Japan (DDBJ, at the Research Organization of Information and System National Institute of Genetics(NIG)). Through the International Nucleotide Sequence Database Collaboration (INSDC), data submitted to any of these three organizations is immediately shared among them, enabling free and unrestricted access to deposited data from all three platforms.

Despite the effort to consolidate these colossal amounts of data, it can still be hard to navigate through variable nomenclature used by the different platforms and to actually fetch raw data from the diverse user interfaces.

Overview to INSDCs accession number nomenclature:

Accession Type Accession Format Example
Projects PRJ(E|D|N)[A-Z][0-9]+ PRJEB12345
Studies (E|D|S)RP[0-9]{6,} ERP123456
Experiments (E|D|S)RX[0-9]{6,} ERX123456
Samples (E|D|S)RS[0-9]{6,} ERS123456
Runs (E|D|S)RR[0-9]{6,} ERR123456

Since the launch of BioProject IDs in 2011, this accession number is commonly referenced in most publications, allowing access to all raw sequencing data and corresponding metadata of the entire project1. Study, experiment and sample IDs also allow access to all raw sequencing data and corresponding metadata of the entire project. The subordinate Runs, contain the actual sequencing data of individual samples. All these accession number types can be used as input for fetching metadata and sequences with q2-fondue.

Database specific accession numbers:

Prefix Name Platform
PRJNA BioProject accession number SRA (NCBI)
PRJEB EBI Project accession ENA (EMBL-EBI)
PRJD DDBJ BioProject DDBJ
SRP SRA study accession SRA (NCBI)
ERP EBI study accession ENA (EMBL-EBI)
DRP DDBJ study accession DDBJ
SRR SRA run accession SRA (NCBI)
ERR ERA run accession ENA (EMBL-EBI)
DRR DRA run accession DDBJ

Some microbiome datasets are also uploaded on Qiita, an open-source microbial study management platform. While all data deposited on Qiita is automatically deposited into ENA-EBI, one can also use the QIIME 2 plugin redbiome to query and obtain data and metadata from Qiita.

Using q2-fondue

After reading about regionally distinct microbial communities in vineyards in the publication by Bokulich et al. (2016)2, we are super curious to explore the dataset this study was based on. Luckily, with q2-fondue retrieving all this data is a cakewalk! 🍰

Installation

To install q2-fondue please follow the instructions for installing q2-fondue in an existing QIIME 2 environment (option 2) available in the README.

Getting started

First, let's move to the tutorial directory.

cd tutorial

To run q2-fondue we need a TSV file containing the accession numbers of the desired Runs, Studies, BioProjects, Experiments or Samples. This metadata file has to contain a header QIIME 2 can recognize and we can for example put id as the column name. To learn more about other options for identifiers used in QIIME 2 or learn about metadata in general, check out the QIIME 2 metadata documentation.

Get some example metadata file with accession numbers of Bokulich et al. (2016) here: https://github.com/bokulich-lab/q2-fondue/tree/main/tutorial

The metadata_file.tsv contains the BioProject accession number (PRJEB14186) and the metadata_file_runs.tsv the selected Run accession numbers (ERR1428207-ERR1428236).

Tip: one can of course also pass several BioProject accession numbers at once by having them all in the same metadata file!

Fetching sequences and corresponding metadata together

We first have to convert our metadata_file_runs.tsv file to a QIIME2 artifact of semantic type NCBIAccessionIDs.

qiime tools import \
      --type NCBIAccessionIDs \
      --input-path metadata_file_runs.tsv \
      --output-path metadata_file_runs.qza

Then, to download the raw sequencing data and associated metadata of the entire project we simply pass the metadata_file_runs.qza to qiime fondue get-all and specify the output directory.

To not overload their servers, NCBI recommends to avoid posting more than 3 URL requests per second and to run requests for very large jobs on the weekend (find more info on this in their Usage Guidelines and Requirements). Therefore NCBI requires a valid email address, enabling them to get in touch in case of an issue with downloading too much data.

Tip: We recommend always adding the --verbose flag when running q2-fondue. Depending on the amount of data we are retrieving, the download might take some time and it is easier to follow the process with the integrated download progress bar. Additionally, running with --verbose facilitates catching potential issues with internet connectivity.

qiime fondue get-all \
      --i-accession-ids metadata_file_runs.qza \
      --p-email [email protected] \
      --p-retries 3 \
      --verbose \
      --output-dir fondue-output

Note: The optional parameter --p-retries specifies the number of times q2-fondue is retrying to fetch the sequencing data and is set to 2 by default. If you notice that some of the desired data is not properly downloaded you might increase this number or try refetching manually (see Troubleshooting).

Now let's have a look at the output files!

In the fondue-output directory we can find four files:

  • metadata.qza of semantic type SRAMetadata, containing the metadata
  • paired_reads.qza of semantic type SampleData[PairedEndSequencesWithQuality]
  • single_reads.qza of semantic type SampleData[SequencesWithQuality]
  • failed_runs.qza of semantic type SRAFailedIDs, containing the IDs of samples that failed to download (see Troubleshooting)

It is important to know that q2-fondue always generates two files of semantic type SampleData, one for paired end and one for single end reads, however most of the time only one of them contains the sequencing data we want (unless we are fetching sequencing data from various BioProjects at the same time).

How can we now find out which raw sequence file we should be using? These are your options:

⇨ read the methods section of the original publication to see whether they used paired or single end sequencing.

⇨ check out the metadata file (how to unpack this is shown below!) - the column Library Layout specifies SINGLE or PAIRED end sequencing.

⇨ in the fondue-output directory type ls -lah to show the file size in kilo- (K) or megabyte (M), one of the files will contain only a few kilobyte while the other has several MB of juicy raw data!

⇨ when running qiime fondue get-all, add the --verbose flag to automatically get the UserWarning: No paired-read sequences available for these sample IDs.

In this case we will therefore continue with the single_reads.qza! 🔥

Other q2-fondue functionalities

Fetching only metadata

We might just want to gain more insight into the metadata of a specific study. Also for this action we can start with TSV a file with accession number of BioProjects, Studies, Experiments, Samples or individual Runs.

qiime tools import \
      --type NCBIAccessionIDs \
      --input-path metadata_file.tsv \
      --output-path metadata_file.qza

qiime fondue get-metadata \
      --i-accession-ids metadata_file.qza \
      --p-n-jobs 1 \
      --p-email [email protected] \
      --o-metadata output_metadata.qza \
      --o-failed-runs failed_metadata.qza \
      --verbose

Note: The parameter --p-n-jobs is the number of parallel download jobs and the default is 1. Since this specifies the number of threads, there are hardly any CPU limitations and the more is better until you run out of bandwidth. However, this action is fairly quick so feel free to stick to 1.

The output file output_metadata.qza now contains all the metadata for the requested IDs. And failed_metadata.qza list all IDs where fetching metadata failed, with their corresponding error messages.

Fetching only sequencing data

In contrast, to only get the raw sequences associated with a number of runs, execute these commands:

qiime tools import \
      --type NCBIAccessionIDs \
      --input-path metadata_file.tsv \
      --output-path metadata_file.qza

qiime fondue get-sequences \
      --i-accession-ids metadata_file.qza \
      --p-email [email protected] \
      --o-single-reads single_reads.qza \
      --o-paired-reads paired_reads.qza \
      --o-failed-runs failed_ids.qza \
      --verbose

Note: We can also add the --p-n-jobs and --p-retries parameters in this command (see get-metadata and get-all for more explanations).

Note: To fetch restricted access sequencing data with a dbGAP repository key, see the instructions in the README.md here.

Scraping IDs from a Zotero web library collection

For now we have assumed that a file exists with the accession IDs, for which we want to fetch the sequences and corresponding metadata, namely metadata_file.qza. If you want to scrape the run, study, BioProject, experiment and samples IDs with associated DOI names from an existing Zotero web library collection, you can use the scrape-collection method. Before running it, you have to set three environment variables linked to your Zotero account:

To set these environment variables run the following commands in your terminal for each of the three required variables: export ZOTERO_TYPE=<your library type> or create a .env file with the environment variable assignment. For the latter option, make sure to ignore this file in your version control (e.g., by adding it to .gitignore).

qiime fondue scrape-collection \
              --p-collection-name collection_name \
              --o-run-ids fondue-output/run_ids.qza \
              --o-study-ids fondue-output/study_ids.qza \
              --o-bioproject-ids fondue-output/bioproject_ids.qza \
              --o-experiment-ids fondue-output/experiment_ids.qza \
              --o-sample-ids fondue-output/sample_ids.qza \
              --verbose

where:

  • --p-collection-name is the name of the collection to be scraped.
  • --o-run-ids is the output artifact containing the scraped run IDs and associated DOI names.
  • --o-study-ids is the output artifact containing the scraped study IDs and associated DOI names.
  • --o-bioproject-ids is the output artifact containing the scraped BioProject IDs and associated DOI names.
  • --o-experiment-ids is the output artifact containing the scraped experiment IDs.
  • --o-sample-ids is the output artifact containing the scraped sample IDs.

To investigate which IDs were scraped from your collection, you can check out the respective output artifacts with the commands:

qiime metadata tabulate \
      --m-input-file fondue-output/run_ids.qza \
      --o-visualization fondue-output/run_ids.qzv

qiime tools view fondue-output/run_ids.qzv

Note: make sure to have q2-metadata installed in your conda environment with conda install -c qiime2 q2-metadata.

Downstream analysis in QIIME 2

Here we show how the artifacts fetched through q2-fondue enable an easy entry to the QIIME 2 analysis pipeline, which is also further described in other QIIME 2 tutorials. ✨

Investigate the metadata

While the metadata files we use in QIIME 2 commonly are in the TSV format, the semantic type SRAMetadata that q2-fondue is creating can be used in the same way.

Let's have a look at the metadata by tabulating it and visualizing the .qzv file.

qiime metadata tabulate \
      --m-input-file fondue-output/metadata.qza \
      --o-visualization fondue-output/metadata.qzv

qiime tools view fondue-output/metadata.qzv

Note: make sure to have q2-metadata installed in your conda environment with conda install -c qiime2 q2-metadata.

Exploring the sequencing data

Apart from avoiding the tedious search and manual downloading of these large piles of data, one of the biggest advantages of using q2-fondue is the fact that the output is already a QIIME 2 artifact and we don't have to import it!

The retrieved single_reads.qza file can therefore be summarized directly:

qiime demux summarize \
      --i-data fondue-output/single_reads.qza \
      --o-visualization fondue-output/single_reads.qzv

qiime tools view fondue-output/single_reads.qzv

Have a look at the overall quality in the Interactive Quality Plot and inspect the sample and feature counts. Then, we can move on to denoising with DADA2 or Deblur.

For example:

qiime dada2 denoise-single \
      --i-demultiplexed-seqs fondue-output/single_reads.qza \
      --p-trunc-len 120 \
      --o-table fondue-output/dada2_table.qza \
      --o-representative-sequences fondue-output/dada2_rep_set.qza \
      --o-denoising-stats fondue-output/dada2_stats.qza

As mentioned above, the metadata.qza file can be used directly in the following analysis! 💪

qiime feature-table summarize \
      --i-table fondue-output/dada2_table.qza \
      --m-sample-metadata-file fondue-output/metadata.qza \
      --o-visualization fondue-output/dada2_table.qzv

qiime tools view fondue-output/dada2_table.qzv

In summary, we showed how the artifacts fetched through q2-fondue enable an easy entry to the QIIME 2 analysis pipeline, which is further described in other tutorials.

Prepare downstream analysis outside of QIIME 2

All q2-fondue outputs are QIIME 2 artifacts, which consist of zip-compressed files of data and associated metadata, containing additional information on the provenance, type, and format. The data within an artifact can be easily extracted for use outside of QIIME 2, using one of several options.

Unzipping data and metadata from any Artifact (without installing QIIME 2)

Any QIIME 2 Artifact (including those output by q2-fondue) can be unzipped with any zip software, such as the zip file compression utility that is available on most computing systems. Hence, QIIME 2 and q2-fondue do not even need to be installed, e.g., to open a file that was prepared by a collaborator. The following command will output a directory containing all contents from an Artifact (e.g., data, metadata, and provenance information):

unzip mysterious-artifact.qza

Extract retrieved sample metadata as a TSV file

If QIIME 2 is installed on a system, the QIIME 2 extract utility can be used to extract all data from an Artifact. Note that QIIME 2 must be installed in the current active environment.

To extract the sample metadata (and file metadata and provenance information) from a SRAMetadata Artifact (e.g., the output of q2-fondue's get-all or get-metadata actions), use the following:

qiime tools extract \
      --input-path metadata.qza \
      --output-path metadata

This creates a metadata directory with all information on provenance tracking, and in the folder data we find the sra-metadata.tsv containing all metadata of the initially requested accession IDs.

Alternatively, the export utility can be used to extract (and optionally transform) only the data stored within an Artifact. The following will export only the sample metadata from the Artifact, and output it as a TSV in the current working directory:

qiime tools export \
      --input-path metadata.qza \
      --output-path .

Extract retrieved sequences as FASTQ files

To extract FASTQ files (and file metadata and provenance information) from a SampleData[SequencesWithQuality] or SampleData[PairedEndSequencesWithQuality] Artifact (e.g., the output of q2-fondue's get-all or get-sequences actions), use the following:

qiime tools extract \
      --input-path fondue-output/single_reads.qza \
      --output-path fondue-output/single_reads

Similarly, when extracting the sequencing data, we find the individual fastq.gz files of each Run as well as a metadata.yml and a MANIFEST file in the data directory.

As above, export can be used instead if you only want to export the data without associated file metadata or provenance information:

qiime tools export \
      --input-path fondue-output/single_reads.qza \
      --output-path .

Troubleshooting

Manually refetching missing sequencing data

Occasionally sequencing data is not completely downloaded due to, for example, server timeouts from NCBI. In this case one can simply use the generated failed_runs.qza file of semantic type SRAFailedIDs containing the IDs that failed to download the missing sequencing data.

qiime fondue get-sequences \
      --i-accession-ids fondue-output/failed_ids.qza \
      --p-email [email protected] \
      --o-single-reads refetched_single \
      --o-paired-reads refetched_paired \
      --o-failed-runs refetched_failed

Combining sequence data from multiple into a single artifact

Instead of working with multiple raw sequencing data (semantic type SampleData[SequencesWithQuality]), for example after refetching IDs that failed to download at first, we can use this powerful action to merge them into a single artifact.

 qiime fondue combine-seqs \
      --i-seqs single_reads.qza refetched_single.qza \
      --o-combined-seqs combined_seqs.qza

Merging metadata files

With q2-fondue we can also easily merge multiple metadata files into a single one!

qiime fondue merge-metadata \
      --i-metadata output_metadata_1.qza output_metadata_2.qza \
      --o-merged-metadata merged_metadata.qza

References

[1] Clark K, Pruitt K, Tatusova T, et al. BioProject. 2013 Apr 28 [Updated 2013 Nov 11]. In: The NCBI Handbook [Internet]. 2nd edition. Bethesda (MD): National Center for Biotechnology Information (US); 2013-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK169438/?report=classic

[2] Bokulich N., et al. Associations among Wine Grape Microbiome, Metabolome, and Fermentation Behavior Suggest Microbial Contribution to Regional Wine Characteristics. 2016 Jun 14. In: ASM Journals / mBio / Vol. 7, No. 3. DOI: https://doi.org/10.1128/mBio.00631-16