Case Study: Running a preconfigured metagenomics pipeline
This case study shows how to set up and run the preconfigured scripts of the UMGAP for the taxonomic analysis of a metagenomics dataset.
Introduction
The Unipept Metagenomics Analysis Pipeline (UMGAP) is accompanied by three scripts: umgap-setup.sh, umgap-analyse.sh and umgap-visualize.sh.
The setup script deals with downloading prebuilt databases and checking and linking external tools.
The analyse script runs several analyses in sequence.
The visualize script creates webpages from the results of the analyse script.
The following pipelines are preconfigured:
- UMGAP High Precision: Optimized for high precision identifications on your metagenomics reads.
- UMGAP Max Precision: Optimized for highest precision, with a small setback on sensitivity.
- UMGAP Tryptic Precision: Made for fast analyses on your laptop. Fewer results, but accurate.
- UMGAP High Sensitivity: Optimized for high sensitivity identifications on metagenomics reads.
- UMGAP Max Sensitivity: Optimized for highest sensitivity, with a small setback on precision.
- UMGAP Tryptic Sensitivity: Made for fast analyses on your laptop. Get a quick overview of your data.
Setup
The preconfigured pipelines require some databases and external tools. The setup script helps you get these in the right place. In general, you only need to run this script once, but you can also use it to verify the file locations and download newer versions of the data. If you're planning to use the tryptic precision, tryptic sensitivity, high precision or max precision pipelines, you'll need to install FragGeneScan++ first. In this case study, we will use a tryptic pipeline, so we only need the tryptic index (~12GiB) and can skip the much larger 9-mer index (~150GiB).
The following code snippet shows the interaction with the setup script. The script first asks all relevant questions up front. It then downloads the selected files in sequence, without further interaction.
    $ umgap-setup.sh -f /opt/FragGeneScanPlusPlus
    Use '/home/user/.config/unipept' as configuration directory? [y]/n y
    Created directory /home/user/.config/unipept.
    Found, tested and remembered the FragGeneScan++ location.
    Use '/home/user/.local/share/unipept' as data directory? [y]/n y
    Created directory /home/user/.local/share/unipept.
    Checking the latest version on the server.
    Latest version is 2020-12-02.
    For any type of analysis, you need a taxonomony.
    For mapping tryptic or 9-mer peptides to it, you need the respective index file of the same version.
    Would you like to download the taxonomy from 2020-12-02 (115MiB)? [y]/n y
    Would you like to download the tryptic index from 2020-12-02 (12GiB)? [y]/n y
    Would you like to download the 9-mer index from 2020-12-02 (152GiB)? [y]/n n
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  114M  100  114M    0     0  6477k      0  0:00:18  0:00:18 --:--:-- 6332k
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   12G  100   12G    0     0  5076k      0  0:34:12  0:34:12 --:--:-- 5790k
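Because the index files are large, it may be worth checking the free disk space before starting the download. The following is a minimal sketch, not part of UMGAP: the helper name has_free_gib and the 13 GiB threshold (taxonomy plus tryptic index, with a small margin) are assumptions made for this example, and the data directory is the default offered in the session above.

```shell
#!/bin/sh
# Hypothetical helper (not part of UMGAP): succeed if the filesystem
# holding directory $1 has at least $2 GiB available.
has_free_gib() {
    dir=$1
    needed_gib=$2
    # df -P prints POSIX-format output and -k reports 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $((needed_gib * 1024 * 1024)) ]
}

# Assumed data directory: the default from the setup session, falling
# back to the home directory when it does not exist yet.
data_dir="$HOME/.local/share/unipept"
[ -d "$data_dir" ] || data_dir="$HOME"

if has_free_gib "$data_dir" 13; then
    echo "enough room for the taxonomy and the tryptic index"
else
    echo "less than 13 GiB available in $data_dir" >&2
fi
```

For the 9-mer index, the threshold would have to be raised to well over 152 GiB.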
To change where the setup script stores the configuration and downloaded files, add some of the options detailed below. If you're familiar with the setup script, or you're running it within another script, you can add the -y flag to run without interaction.
- -c dir: Set a different configuration directory.
- -f dir: Link the location of your FragGeneScan++ installation.
- -d dir: Set a different location for downloading the database files. This can be on a separate disk.
- -y: Download all files of the latest version without asking for confirmation.
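As an illustration of a non-interactive run, the sketch below combines the four options listed above. The directory paths are placeholders chosen for this example, and the command is only executed when umgap-setup.sh is found on the PATH. Note that, going by its description, -y downloads all files of the latest version, which presumably includes the large 9-mer index.

```shell
#!/bin/sh
# Placeholder locations for this sketch; adjust them to your system.
config_dir="$HOME/.config/unipept"
fgs_dir="/opt/FragGeneScanPlusPlus"
data_dir="/mnt/bigdisk/unipept"

# Collect the arguments; only the documented flags -c, -f, -d and -y are used.
set -- -c "$config_dir" -f "$fgs_dir" -d "$data_dir" -y

if command -v umgap-setup.sh >/dev/null 2>&1; then
    umgap-setup.sh "$@"
else
    echo "umgap-setup.sh not found on PATH; would run: umgap-setup.sh $*" >&2
fi
```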
Analyse
When setup is complete, the analyse script will automatically use the downloaded databases and linked tools. The input data consists of 100 paired-end reads sampled from a dataset generated by Lindgreen et al. (2016) for their metabenchmark. With these paired-end reads in the A1.fq and A2.fq files, the commands are as follows.
    $ head -4 A1.fq
    @1198114###CP002480-_Acidobacteria_733918/1
    ATCGCGCACGCGGCCGATGCCCCAGAAGAGATTGACAGCGGTGGGGCGGGCGGCGGCGAGGTGGTCGCAGATCTCGGCGACCTCTGCGTTGAGGGTCGGG
    +
    AADEGGAGIIEIHFHKHJKKJKHIJJIIKCJAHBFKKFHIJIF;JIH5DE$I>CBE$E:FGEJCGABEEECCDKD?D?C$ECEEE;CEEDEEAD=?ED$D
    $ umgap-analyse.sh -1 A1.fq -2 A2.fq -t tryptic-precision -z -o tryptic-precision-output.fa.gz
    $ zcat tryptic-precision-output.fa.gz | head -2
    >1198114###CP002480-_Acidobacteria_733918/1_1_100_-
    1
It seems the pipeline cannot be more specific than root (taxon ID 1) for this first record.
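The output shown above pairs a FASTA-style header line with a line holding the assigned taxon ID. Assuming every record follows that two-line layout (an assumption based on the single record shown), the number of records per taxon ID can be tallied with standard tools. The sketch below uses a few made-up records rather than real pipeline output.

```shell
#!/bin/sh
# Illustrative records only: the read names are made up, and the
# two-line record layout is assumed from the single record shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>read_a
1
>read_b
57723
>read_c
1
EOF

# Skip header lines (starting with '>') and empty lines, count every
# remaining line as one taxon ID, then print "taxon_id,count" sorted by
# descending count.
tally=$(awk '!/^>/ && NF { count[$1]++ }
             END { for (id in count) print id "," count[id] }' "$sample" |
        sort -t, -k2,2nr -k1,1n)
echo "$tally"
rm -f "$sample"
```

On a real, compressed result, the same awk and sort commands could read from zcat tryptic-precision-output.fa.gz instead of the sample file.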
If you used a non-default configuration directory for setup, you'll need to pass the same directory here, with the -c option.
When running the pipeline on large samples, it may be useful to compress the output. Use the -z flag to GZIP the results. Should the input be GZIP-compressed, the script will detect this.
Note that all arguments can be repeated to bundle multiple analyses, each series ending with a -o option.
- -c dir: Set a different configuration directory.
- -z: Request compression of the output file.
- -1 file: Single-end FASTA or first paired-end FASTQ input file, optionally compressed.
- -2 file: Second paired-end FASTQ input file, optionally compressed.
- -t: Type of the analysis (high precision by default).
- -o file: The output file.
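To illustrate bundling, the sketch below builds one argument series per analysis type, each ending with -o, and passes them to a single umgap-analyse.sh call. Repeating -1, -2 and -z in every series is an assumption of this example, as is the tryptic-sensitivity type name, which is inferred from the output file names used in this case study. The command only runs when the script is found on the PATH.

```shell
#!/bin/sh
# Build one argument series per analysis type; each series ends with -o.
# Repeating the input files and -z in every series is an assumption.
set --
for analysis in tryptic-precision tryptic-sensitivity; do
    set -- "$@" -1 A1.fq -2 A2.fq -t "$analysis" -z -o "$analysis-output.fa.gz"
done

if command -v umgap-analyse.sh >/dev/null 2>&1; then
    umgap-analyse.sh "$@"
else
    echo "umgap-analyse.sh not found on PATH; would run: umgap-analyse.sh $*" >&2
fi
```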
Visualize
After analysis, you probably want to view the results. While the analyse script output format is easily parsed, you can also use the visualize script to create importable CSV frequency tables and interactive visualisations. The interactive visualisations can be stored locally or hosted online. The following snippet creates all three in turn.
    $ umgap-visualize.sh -t -r phylum tryptic-precision-output.fa.gz
    taxon id,taxon name,tryptic-precision-output.fa.gz
    1117,Cyanobacteria,2
    57723,Acidobacteria,4
    201174,Actinobacteria,4
    1224,Proteobacteria,5
    1,root,160
    $ umgap-visualize.sh -w tryptic-sensitivity-output.fa.gz > tryptic-sensitivity.html
    $ umgap-visualize.sh -u tryptic-sensitivity-output.fa.gz
    tryptic-sensitivity-output.fa.gz: https://bl.ocks.org/11b7809d6754b9530cf1a49d93a8d568
The CSV tables contain a record for each taxon found in the sample. A record contains, in order, the taxon ID, the taxon name (for convenience) and the number of reads assigned to this taxon or below. All records are either at the specified taxon rank or at root (unidentified).
- -t: Output a CSV frequency table on species rank.
- -w: Output an HTML webpage of an interactive visualisation.
- -u: Print a shareable URL to an online interactive visualisation.
- -r rank: Set the rank for the CSV frequency table (default: species).
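The CSV frequency table lends itself to further processing with standard tools. As a sketch, the following takes the phylum table printed in the session above, sets the root row (taxon ID 1, unidentified) aside, and reports each phylum's share of the identified reads. It assumes that taxon names contain no commas, and the variable names are chosen for this example only.

```shell
#!/bin/sh
# The table printed by the "-t -r phylum" call in the session above.
table=$(mktemp)
cat > "$table" <<'EOF'
taxon id,taxon name,tryptic-precision-output.fa.gz
1117,Cyanobacteria,2
57723,Acidobacteria,4
201174,Actinobacteria,4
1224,Proteobacteria,5
1,root,160
EOF

# Skip the header line and the root row, then print every remaining
# taxon as "name,reads,share of identified reads".
shares=$(awk -F, '
    NR > 1 && $1 != 1 { name[++n] = $2; reads[n] = $3; total += $3 }
    END {
        if (total == 0) exit
        for (i = 1; i <= n; i++)
            printf "%s,%d,%.1f%%\n", name[i], reads[i], 100 * reads[i] / total
    }' "$table")
echo "$shares"
rm -f "$table"
```

With the 15 identified reads in this table, Proteobacteria (5 reads) accounts for a third of them.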

References
- Lindgreen, S., Adair, K. L., & Gardner, P. P. (2016). An evaluation of the accuracy and speed of metagenome analysis tools. Scientific Reports, 6, 19233.