Running QIIME2 on ARGO

Versions installed:

QIIME2/2019.10
QIIME2/2020.11 (conda environment)

Setting up your environment

To demonstrate how to use qiime2 on ARGO, we follow the moving-pictures tutorial and run its steps on the cluster. The steps are best run from your /scratch space, which is readable and writable on all nodes.

After logging in, change to /scratch

cd $SCRATCH

Then load the miniconda3 module

module load miniconda3

If this is the first time you're using the conda environments in your shell, you need to first run

conda init
source ~/.bashrc

which will configure your shell for conda. Your command prompt will now be prefixed with (base) to show that a conda environment is active.

To see the available conda environments, run

conda env list

Change to the qiime2 environment with

conda activate qiime2-2020.11
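
If activation fails, the environment name on the cluster may differ from the one shown here. The sketch below checks a `conda env list` listing for an exact name. It assumes the environment name is the first column of each non-comment line; env_listed is a hypothetical helper, not a conda command.

```shell
# env_listed: hypothetical helper, not part of conda.
# Reads `conda env list` output on stdin and succeeds only if the first
# column of some non-comment line equals the requested environment name.
env_listed() {
    awk -v name="$1" '!/^#/ && $1 == name { found = 1 } END { exit !found }'
}

# Intended use on the cluster (requires conda on PATH):
#   conda env list | env_listed qiime2-2020.11 && conda activate qiime2-2020.11
```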

Now we follow the steps in the tutorial to download and organize the data files needed.

* Create the directory for the tutorial and move into it:

mkdir qiime2-moving-pictures-tutorial
cd qiime2-moving-pictures-tutorial

* Download the sample metadata:

wget \
  -O "sample-metadata.tsv" \
  "https://data.qiime2.org/2020.11/tutorials/moving-pictures/sample_metadata.tsv"

* Create a new directory and download a sample set of the sequence reads into it:

mkdir emp-single-end-sequences

wget \
  -O "emp-single-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2020.11/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"

wget \
  -O "emp-single-end-sequences/sequences.fastq.gz" \
  "https://data.qiime2.org/2020.11/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Running Qiime2

At this point, we are ready to run the qiime2 commands. There are two ways to do this on the cluster: in an interactive session or as a submitted slurm job.

Running qiime2 in an interactive session:

- Request a compute node using salloc

salloc

- Load miniconda3 and activate the qiime2 environment

module purge
module load miniconda3
conda activate qiime2-2020.11

- It is also necessary to redefine the environment variable TMPDIR

export TMPDIR=/tmp

- Run the qiime2 command to import the sequence data files into a QIIME 2 artifact

qiime tools import \
  --type EMPSingleEndSequences \
  --input-path emp-single-end-sequences \
  --output-path emp-single-end-sequences.qza

The command should print the following to the screen

Imported emp-single-end-sequences as EMPSingleEndDirFmt to emp-single-end-sequences.qza
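
If the import reports an error instead, one thing worth ruling out is a missing or empty download. The following is a minimal sketch, assuming it is run from the tutorial directory; check_inputs is a hypothetical helper and only tests that each file exists and is non-empty, not that its contents are valid.

```shell
# check_inputs: hypothetical helper, not part of QIIME 2 or ARGO.
# Returns 0 only if every listed file exists and is non-empty.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Intended use from the tutorial directory:
#   check_inputs sample-metadata.tsv \
#     emp-single-end-sequences/barcodes.fastq.gz \
#     emp-single-end-sequences/sequences.fastq.gz
```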

You can keep running the subsequent commands one after the other and the interactive session will persist until you type

exit
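
The TMPDIR setting used in the session above assumes that /tmp on the compute node is writable. As a hedged sketch, the helper below picks the first writable directory from a list of candidates. The name pick_tmpdir and the choice of the current directory as a fallback are assumptions for illustration, not ARGO policy.

```shell
# pick_tmpdir: hypothetical helper. Prints the first candidate directory that
# exists and is writable; returns 1 if none of them qualifies.
pick_tmpdir() {
    for d in "$@"; do
        if [ -d "$d" ] && [ -w "$d" ]; then
            echo "$d"
            return 0
        fi
    done
    return 1
}

# Example: prefer /tmp, then fall back to the current directory.
TMPDIR=$(pick_tmpdir /tmp "$PWD") && export TMPDIR
echo "TMPDIR=${TMPDIR:-unset}"
```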

Writing a slurm script and submitting a qiime2 job

The interactive method is useful for checking your work and for running a few commands on small data sets. For longer runs, the same commands can be placed in a slurm script and submitted as a batch job. For the command that was run in the interactive session, we can create a slurm script, qiime_job.slurm:

#!/bin/sh

#SBATCH --job-name=qiime2_tutorial

#SBATCH --partition=all-HiPri

## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.out  # Output file
#SBATCH --error=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.err   # Error file

#SBATCH --mem=5G        # Total memory needed per task (units: K,M,G,T)
##SBATCH --time=<D-HH:MM>  # Total time needed for job: Days-Hours:Minutes

## ----- Parallel Threads -----
## Some programs and libraries (OpenMP) implement parallelism using threads
## (light-weight sub-processes). Advantages: Less processing overhead and
## ability to share memory.  Disadvantages: All threads must run on the same
## node.  Make sure that the resources you request are feasible,
## e.g. --cpus-per-task must be <= # of cores on a node.
##SBATCH --cpus-per-task <C>   # Request extra CPUs for threads
##SBATCH --mem-per-cpu <M>     # If your threads use a lot of memory, and you
                               # plan to vary # of threads, use this not --mem=

##SBATCH --reservation=ssilayi_107 #uncomment this line if you're running before 2021-03-03T23:59:00
                                   #otherwise you can delete the line

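## Load the relevant modules needed for the job
# - This script assumes the qiime2 conda environment is already active in the
#   shell you submit from; by default sbatch passes that environment to the job.
# - To use the module install instead, uncomment the next two lines:
#module purge
#module load qiime2/2019.10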

export TMPDIR=/tmp/


## Run your program or script
qiime tools import \
  --type EMPSingleEndSequences \
  --input-path emp-single-end-sequences \
  --output-path emp-single-end-sequences.qza

In the script, we direct the output and error files to the directory in which we are currently working. We also submit to the all-HiPri partition, which has a time limit of 12 hours, more than sufficient for the import command to complete given the size of the sample data.

To submit it

sbatch qiime_job.slurm

To check on the status of the submitted job, run

sacct -X

Once it completes, the generated .out file should contain the same line that was printed in the interactive session

Imported emp-single-end-sequences as EMPSingleEndDirFmt to emp-single-end-sequences.qza

In both cases you should now have an additional emp-single-end-sequences.qza file in your directory.
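
For the batch job, both checks (the "Imported" line in the log and the presence of the artifact) can be combined. This is a minimal sketch; import_ok is a hypothetical helper, and the log file name you pass to it depends on your job name, node, and job ID.

```shell
# import_ok: hypothetical helper. Succeeds if the given log contains an
# "Imported ... to ..." line and the named artifact exists and is non-empty.
import_ok() {
    log=$1
    artifact=$2
    grep -q '^Imported .* to ' "$log" && [ -s "$artifact" ]
}

# Intended use after the job finishes (the log name is a placeholder):
#   import_ok qiime2_tutorial-NODE-JOBID.out emp-single-end-sequences.qza \
#     && echo "import looks fine"
```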

Writing batch scripts

Instead of running the qiime2 commands one at a time, we can run a sequence of commands by creating a shell script. For example, we take a series of commands from the tutorial and add them to the shell script, qiime.sh

#!/bin/bash

# importing data
echo 'Importing Data ...'

qiime tools import \
  --type EMPSingleEndSequences \
  --input-path emp-single-end-sequences \
  --output-path emp-single-end-sequences.qza


# check the UUID, type, and format of your newly-imported sequences
echo 'UUID, type, and format imported sequences ...'
qiime tools peek emp-single-end-sequences.qza


# Demultiplexing sequences

echo 'Demultiplexing sequences ...'

qiime demux emp-single \
  --i-seqs emp-single-end-sequences.qza \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details demux-details.qza

# generate a summary of the demultiplexing results
echo 'Generating summary ...'
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

# Sequence quality control and feature table construction

echo 'Sequence quality control and feature table construction ... '

qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 120 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

qiime metadata tabulate \
  --m-input-file stats-dada2.qza \
  --o-visualization stats-dada2.qzv

echo 'Renaming outputs ... '

mv rep-seqs-dada2.qza rep-seqs.qza
mv table-dada2.qza table.qza

First we make the shell script executable with

chmod +x qiime.sh
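
As written, qiime.sh keeps going after a command fails, so a failed import would be followed by less obvious errors from the later steps. One common way to stop at the first failure is a small wrapper around each step. The sketch below uses a hypothetical helper named run_step; it is not part of QIIME 2.

```shell
# run_step: hypothetical helper. Prints a progress label, runs the command
# given after it, and reports the label on stderr if the command fails.
run_step() {
    label=$1
    shift
    echo "$label ..."
    if ! "$@"; then
        echo "FAILED: $label" >&2
        return 1
    fi
}

# Intended use inside qiime.sh (the script exits at the first failed step):
#   run_step 'Importing Data' qiime tools import \
#     --type EMPSingleEndSequences \
#     --input-path emp-single-end-sequences \
#     --output-path emp-single-end-sequences.qza || exit 1
```

With this pattern the progress messages replace the separate echo lines, and the .err file names the step that failed.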

Then we update the slurm script to run the batch script with the commands in it, and save it as qiime_shell_job.slurm

#!/bin/sh

#SBATCH --job-name=qiime2_moving_pictures

#SBATCH --partition=all-HiPri

## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.out  # Output file
#SBATCH --error=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.err   # Error file

#SBATCH --mem=5G        # Total memory needed per task (units: K,M,G,T)
##SBATCH --time=<D-HH:MM>  # Total time needed for job: Days-Hours:Minutes


## Load the relevant modules needed for the job
#module purge
#module load miniconda3
#conda activate qiime2-2020.11
#conda init
#source ~/.bashrc

export TMPDIR=/tmp/


## Run your program or script
## Replaced the qiime commands with the executable script
./qiime.sh

Submit to the compute nodes with

sbatch qiime_shell_job.slurm

The shell script will print out periodic messages to show the progress of the commands. You can also check the status of the running job with

sacct -X

Parallel Qiime Jobs

Some QIIME commands can utilize multiple threads with --p-n-threads. Following the example of the script above, we can write a new script, qiime_threads.sh, that includes qiime2 commands with threading taken from the Atacama soil microbiome tutorial:

#!/bin/bash

mkdir -p atacama-tutorial
cd atacama-tutorial

wget \
  -O "sample-metadata.tsv" \
  "https://data.qiime2.org/2020.11/tutorials/atacama-soils/sample_metadata.tsv"


mkdir -p emp-paired-end-sequences

wget \
  -O "emp-paired-end-sequences/forward.fastq.gz" \
  "https://data.qiime2.org/2020.11/tutorials/atacama-soils/10p/forward.fastq.gz"

wget \
  -O "emp-paired-end-sequences/reverse.fastq.gz" \
  "https://data.qiime2.org/2020.11/tutorials/atacama-soils/10p/reverse.fastq.gz"


wget \
  -O "emp-paired-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2020.11/tutorials/atacama-soils/10p/barcodes.fastq.gz"


echo "Paired-end read analysis commands ... "

qiime tools import \
  --type EMPPairedEndSequences \
  --input-path emp-paired-end-sequences \
  --output-path emp-paired-end-sequences.qza

qiime demux emp-paired \
  --verbose \
  --m-barcodes-file sample-metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --i-seqs emp-paired-end-sequences.qza \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details demux-details.qza \
  --p-rev-comp-mapping-barcodes

qiime demux summarize \
  --verbose \
  --i-data demux.qza \
  --o-visualization demux.qzv

qiime dada2 denoise-paired \
  --verbose \
  --p-n-threads "$SLURM_CPUS_PER_TASK" \
  --i-demultiplexed-seqs demux.qza \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150

qiime feature-table summarize \
  --verbose \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file sample-metadata.tsv

qiime feature-table tabulate-seqs \
  --verbose \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

The script combines bash commands to download the necessary files with the qiime commands to be run. It takes its thread count from SLURM_CPUS_PER_TASK, which is set from the --cpus-per-task value in the submission script:

#!/bin/sh

#SBATCH --job-name=qiime2_w_threads

#SBATCH --partition=all-HiPri

## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayID, %a=arrayTaskID
#SBATCH --output=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.out  # Output file
#SBATCH --error=/scratch/%u/qiime2-moving-pictures-tutorial/%x-%N-%j.err   # Error file

#SBATCH --mem=5G        # Total memory needed per task (units: K,M,G,T)
##SBATCH --time=<D-HH:MM>  # Total time needed for job: Days-Hours:Minutes

## ----- Parallel Threads -----
## Some programs and libraries (OpenMP) implement parallelism using threads
## (light-weight sub-processes). Advantages: Less processing overhead and
## ability to share memory.  Disadvantages: All threads must run on the same
## node.  Make sure that the resources you request are feasible,
## e.g. --cpus-per-task must be <= # of cores on a node.
#SBATCH --nodes=1
#SBATCH --cpus-per-task 16   # Request extra CPUs for threads
##SBATCH --mem-per-cpu <M>     # If your threads use a lot of memory, and you
                               # plan to vary # of threads, use this not --mem=

## Load the relevant modules needed for the job
#module purge
#module load qiime2/2019.10

export TMPDIR=/tmp/
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

## Run your program or script
./qiime_threads.sh
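
Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested, so qiime_threads.sh would pass an empty thread count if run outside such a job. A hedged sketch of a default that lets the same script run in a plain interactive shell as well; the variable name THREADS is an assumption for illustration.

```shell
# Fall back to 1 thread when SLURM_CPUS_PER_TASK is unset or empty.
# THREADS is an illustrative variable name, not something QIIME 2 reads.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$THREADS"
echo "Using $THREADS thread(s)"

# The threaded command would then take the value explicitly:
#   qiime dada2 denoise-paired --p-n-threads "$THREADS" (other options as in qiime_threads.sh)
```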

To view the generated files in all cases, you can transfer the .qzv files to your local machine and use the QIIME2 viewer.
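
Before transferring, it can be handy to list the visualizations a run produced. A minimal sketch; list_qzv is a hypothetical helper that only searches a directory tree for files ending in .qzv.

```shell
# list_qzv: hypothetical helper. Prints every .qzv file under a directory
# (default: the current directory), one path per line, sorted.
list_qzv() {
    find "${1:-.}" -type f -name '*.qzv' | sort
}

# Intended use from the tutorial directory:
#   list_qzv .
```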

When finished, you can deactivate the conda environment with

conda deactivate