UMGAP documentation

Use the Unipept MetaGenomics Analysis Pipeline to assign taxonomic labels to your shotgun metagenomics reads. The results are available as taxonomic frequency tables and interactive visualizations.

The UMGAP is a collection of command line tools. Combine them into your own pipeline to identify (short) shotgun metagenomics reads, guided by our case studies, or use one of the preconfigured pipelines. After each read is assigned a taxon, collect the results in frequency tables and interactive visualizations. Because the tools communicate via standard input and standard output and share a consistent, easy-to-understand intermediate format, it is easy to plug your own extensions into the pipeline.

  • UMGAP High Precision: Optimized for high precision identifications on your metagenomics reads.
  • UMGAP Max Precision: Optimized for highest precision, with a small setback on sensitivity.
  • UMGAP Tryptic Precision: Made for fast analyses on your laptop. Fewer results, but accurate.
  • UMGAP High Sensitivity: Optimized for high sensitivity identifications on metagenomics reads.
  • UMGAP Max Sensitivity: Optimized for highest sensitivity, with a small setback on precision.
  • UMGAP Tryptic Sensitivity: Made for fast analyses on your laptop. Get a quick overview of your data.

Throughout this documentation, the term peptides is used for both tryptic peptides and k-mers. The term taxon ID refers to an identifier in the NCBI taxonomy; use the same taxonomy version throughout the whole pipeline.
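
As a rough illustration of the two kinds of peptides, the sketch below splits a made-up protein sequence both ways using plain shell and awk. The helper names tryptic and kmers are hypothetical, and the cleavage rule (cut after K or R unless the next residue is P) is the common textbook trypsin rule, assumed here rather than taken from the UMGAP source. In a real pipeline, the prot2tryp and prot2kmer tools do this work.

```shell
# tryptic SEQUENCE: hypothetical helper, prints one tryptic peptide per line.
# Assumed rule: cleave after K or R, except when the next residue is P.
tryptic() {
    printf '%s\n' "$1" | awk '{
        pep = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            nxt = substr($0, i + 1, 1)   # empty string at the end of the sequence
            pep = pep c
            if ((c == "K" || c == "R") && nxt != "P") { print pep; pep = "" }
        }
        if (pep != "") print pep         # trailing peptide without a cleavage site
    }'
}

# kmers K SEQUENCE: hypothetical helper, prints every overlapping k-mer.
kmers() {
    printf '%s\n' "$2" | awk -v k="$1" '{
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

# Made-up example sequence.
tryptic MKTAYRPLLK
kmers 9 MKTAYRPLLK
```

For this sequence, tryptic prints MK and TAYRPLLK (the R is followed by P, so no cut is made there), and kmers 9 prints the two overlapping 9-mers MKTAYRPLL and KTAYRPLLK.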

The UMGAP is free and open-source software under the MIT License, and all code is available on GitHub. If you run into a problem, find a bug, or have a feature request, don't hesitate to contact us by email (unipept@ugent.be) or create an issue on GitHub.

  • fastq2fasta: Interleaves a number of FASTQ files into a single FASTA output.
  • translate: Translates DNA into amino acid sequences.
  • prot2tryp: Splits each protein sequence in a FASTA format into a list of (tryptic) peptides.
  • filter: Filters peptides in a FASTA format based on specific criteria.
  • prot2kmer: Splits each protein sequence in a FASTA format into a list of k-mers.
  • pept2lca: Looks up each line of input in a given FST index and outputs the result. Lines starting with a '>' are copied. Lines for which no mapping is found are ignored.
  • prot2tryp2lca: Reads all the records in a specified FASTA file and queries the tryptic peptides in an FST for the LCAs.
  • prot2kmer2lca: Reads all the records in a specified FASTA file and queries the k-mers in an FST for the LCAs.
  • bestof: Picks the frame with the most non-root hits.
  • seedextend: Seed and extend.
  • uniq: Concatenates the data strings of all consecutive FASTA entries with the same header.
  • taxa2agg: Aggregates taxa to a single taxon.
  • snaptaxon: Snaps taxa to a specified rank or one of the specified taxa.
  • taxa2freq: Counts and reports on a list of taxon IDs.
  • taxa2tree: Visualizes the given list of taxa using the Unipept API.
  • taxonomy: Shows taxonomy info.
  • splitkmers: Splits each protein sequence in a CSV format into a list of k-mers.
  • joinkmers: Groups a CSV by equal first fields (k-mers) and aggregates the second fields (taxon IDs).
  • buildindex: Writes an FST index of stdin to stdout.
  • printindex: Prints the values in an FST index to stdout.

Installation

To use UMGAP, Rust needs to be installed on your system. We recommend the latest version, but all versions since Rust 1.35 are supported. To check if you have a suitable Rust version installed, open a terminal and run rustc --version.

$ rustc --version
rustc 1.42.0
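
If you prefer to script this check, the following sketch compares the reported version against the 1.35 minimum mentioned above. The version_ge helper is hypothetical (it is not part of Rust or UMGAP) and only looks at the major and minor numbers.

```shell
# version_ge A B: hypothetical helper, succeeds when version A >= version B.
# Only the major and minor components are compared.
version_ge() {
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
    [ "$a_major" -gt "$b_major" ] && return 0
    [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]
}

if command -v rustc >/dev/null 2>&1; then
    # `rustc --version` prints a line such as "rustc 1.42.0 ...";
    # the second field is the version number.
    installed=$(rustc --version | awk '{print $2}')
    if version_ge "$installed" 1.35; then
        echo "Rust $installed is recent enough for UMGAP"
    else
        echo "Rust $installed is older than 1.35, please update"
    fi
else
    echo "rustc not found, install Rust first"
fi
```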

Installing Rust

If the rustc --version command returns command not found, Rust is not yet installed on your system. More information on installing Rust can be found at https://www.rust-lang.org/tools/install.

The next step is to download the UMGAP source code. The easiest way to do this is by cloning our git repository using git clone https://github.com/unipept/umgap.git. Alternatively, you can click the download button on GitHub.

Next, we're ready to compile and install UMGAP. First navigate to the directory where you cloned or downloaded the code, then run cargo install --path . (the trailing dot is part of the command and refers to the current directory). The output should look like this:

$ cargo install --path .
 Installing umgap v0.3.5 (/Users/unipept/Downloads/umgap)
    Updating crates.io index
  Downloaded num-traits v0.2.11
  Downloaded error-chain v0.12.2
  ...
  Compiling structopt v0.3.13
  Compiling umgap v0.3.5 (/Users/unipept/Downloads/umgap)
 Finished release [optimized] target(s) in 1m 50s
 Installed package `umgap v0.3.5 (/Users/unipept/Downloads/umgap)` (executable `umgap`)
 warning: be sure to add `/Users/unipept/.cargo/bin` to your PATH to be able to run the installed binaries

UMGAP should now be installed in the ~/.cargo/bin directory. To run it from any location, be sure to add that directory to your $PATH.
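
One possible way to do this in a POSIX shell is sketched below. The add_to_path helper is hypothetical. It prepends the directory only when it is not already on the PATH, so it is safe to run more than once.

```shell
# add_to_path DIR: hypothetical helper, prepends DIR to PATH unless already present.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already on the PATH, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;  # prepend once and export
    esac
}

# Cargo's default install location, as reported by `cargo install`.
add_to_path "$HOME/.cargo/bin"
```

This only affects the current shell session. To make the change permanent, the same lines are usually added to a shell startup file such as ~/.profile; which file applies depends on your shell, so treat that part as an assumption.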

After successful installation, the UMGAP command should be available. To check if it was installed correctly, run umgap --version. This should print the version number:

$ umgap --version
umgap 0.3.5

More information about the installed command can be found on these pages, or by running the umgap --help command.

Updates

To update UMGAP, download the latest source code again and repeat the installation instructions. The changes between releases are listed in the changelog.

Configuration

If you want to use FragGeneScanPlusPlus as the gene predictor, it needs to be installed as well. Instructions can be found at https://github.com/unipept/FragGeneScanPlusPlus.

Run the configuration script at scripts/umgap-setup.sh to interactively configure UMGAP and download the data files required for some steps of the pipeline.

Depending on the type of analysis you are planning, you will need the tryptic index file (less powerful, but runs on any decent laptop) and/or the 9-mer index file (uses about 100 GB of disk space and as much RAM during operation; the exact size depends on the version).

To check if the configuration was successful, you can run a sample analysis with the included test data. The following command should show you a FASTA-like file with a taxon ID per header.

$ ./scripts/umgap-analyse.sh -1 testdata/A1.fq -2 testdata/A2.fq -t tryptic-sensitivity -o - | tee output.fa
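
For a quick look at the result, the sketch below tallies how often each taxon ID occurs in such a file. It assumes that every non-empty line not starting with '>' holds exactly one taxon ID. The count_taxa helper is hypothetical, and the taxa2freq tool listed above remains the pipeline's own way to build frequency tables.

```shell
# count_taxa [FILE...]: hypothetical helper, reads FASTA-like input from the
# given files (or stdin) and prints "count<TAB>taxon ID", most frequent first.
count_taxa() {
    awk '!/^>/ && NF { count[$1]++ }
         END { for (t in count) print count[t] "\t" t }' "$@" |
        sort -k1,1nr -k2,2n
}

# Made-up sample input: three reads, two of them assigned to taxon 562.
printf '>read1\n562\n>read2\n9606\n>read3\n562\n' | count_taxa

# On real output, the call would be:
#   count_taxa output.fa
```

With the sample input, taxon 562 is listed first with a count of 2, followed by taxon 9606 with a count of 1.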