UMGAP documentation
Use the Unipept MetaGenomics Analysis Pipeline to assign taxonomic labels to your shotgun metagenomics reads. The results are available as taxonomic frequency tables and interactive visualizations.
The UMGAP is a collection of command line tools. Guided by our case studies, combine them into your own pipeline to identify (short) shotgun metagenomics reads, or use one of the preconfigured pipelines. After each read is assigned a taxon, collect the results in frequency tables and interactive visualizations. Because the tools communicate via standard input and standard output and share a consistent, easy-to-understand intermediate format, it is easy to plug your own extensions into the pipeline.
- UMGAP High Precision: Optimized for high precision identifications on your metagenomics reads.
- UMGAP Max Precision: Optimized for highest precision, at a small cost in sensitivity.
- UMGAP Tryptic Precision: Made for fast analyses on your laptop. Fewer results, but accurate.
- UMGAP High Sensitivity: Optimized for high sensitivity identifications on metagenomics reads.
- UMGAP Max Sensitivity: Optimized for highest sensitivity, at a small cost in precision.
- UMGAP Tryptic Sensitivity: Made for fast analyses on your laptop. Get a quick overview of your data.
Throughout this documentation, the term peptides is used for both tryptic peptides and k-mers. The term taxon ID refers to an identifier in the NCBI taxonomy (the same taxonomy version should be used throughout the whole pipeline).
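To make the two kinds of peptides concrete, the sketch below splits one protein sequence both ways using only awk. It is an illustration, not the UMGAP implementation: the helper names kmers and tryptic are hypothetical, the example sequence is made up, and the cleavage rule is the common trypsin rule (cut after K or R unless the next residue is P), which is an assumption about what "tryptic" means here.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of UMGAP.

# kmers K SEQUENCE: print every overlapping substring of length K.
kmers() {
    awk -v k="$1" -v s="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
    }'
}

# tryptic SEQUENCE: cut after K or R, unless the next residue is P
# (assumed trypsin rule).
tryptic() {
    awk -v s="$1" 'BEGIN {
        n = length(s); pep = ""
        for (i = 1; i <= n; i++) {
            c = substr(s, i, 1)
            nxt = substr(s, i + 1, 1)
            pep = pep c
            if ((c == "K" || c == "R") && nxt != "P" && i < n) {
                print pep
                pep = ""
            }
        }
        if (pep != "") print pep
    }'
}

kmers 4 MKAARPLKK    # prints MKAA, KAAR, AARP, ARPL, RPLK, PLKK (one per line)
tryptic MKAARPLKK    # prints MK, AARPLK, K (one per line)
```

Every position in the sequence starts a k-mer, so k-mers overlap, while tryptic peptides are non-overlapping pieces of the sequence.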
The UMGAP is free and open-source software under the MIT License, and all code is available on GitHub. If you encounter an issue using these tools, find a bug, or have a feature request, don't hesitate to contact us by email (unipept@ugent.be) or to create an issue on GitHub.
- fastq2fasta
- Interleaves a number of FASTQ files into a single FASTA output
- translate
- Translates DNA into amino acid sequences
- prot2tryp
- Splits each protein sequence in FASTA format into a list of (tryptic) peptides
- filter
- Filters peptides in FASTA format based on specific criteria
- prot2kmer
- Splits each protein sequence in FASTA format into a list of k-mers
- pept2lca
- Looks up each line of input in a given FST index and outputs the result. Lines starting with a '>' are copied. Lines for which no mapping is found are ignored
- prot2tryp2lca
- Reads all the records in a specified FASTA file and queries the tryptic peptides in an FST for the LCAs
- prot2kmer2lca
- Reads all the records in a specified FASTA file and queries the k-mers in an FST for the LCAs
- bestof
- Pick the frame with the most non-root hits
- seedextend
- Seed and extend
- uniq
- Concatenates the data strings of all consecutive FASTA entries with the same header
- taxa2agg
- Aggregates taxa to a single taxon
- snaptaxon
- Snap taxa to a specified rank or one of the specified taxa
- taxa2freq
- Count and report on a list of taxon IDs
- taxa2tree
- Visualizes the given list of taxa using the Unipept API
- taxonomy
- Show taxonomy info
- splitkmers
- Splits each protein sequence in CSV format into a list of k-mers
- joinkmers
- Groups a CSV by equal first fields (k-mers) and aggregates the second fields (taxon IDs)
- buildindex
- Write an FST index of stdin on stdout
- printindex
- Print the values in an FST index to stdout
Installation
To use UMGAP, Rust needs to be installed on your system. We recommend the latest version, but all versions since Rust 1.35 are supported. To check if you have a suitable Rust version installed, open a terminal and run rustc --version:
$ rustc --version
rustc 1.42.0
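The version check can also be scripted. The sketch below compares the reported version against the 1.35 minimum mentioned above; the function name rust_version_ok is hypothetical, and the script assumes the output of rustc --version starts with "rustc MAJOR.MINOR.PATCH".

```shell
#!/bin/sh
# Hypothetical helper, not part of UMGAP: succeeds when a `rustc --version`
# line reports Rust 1.35 or newer.
rust_version_ok() {
    # $1 looks like "rustc 1.42.0" or "rustc 1.42.0 (b8cedc004 2020-03-09)"
    version=$(printf '%s\n' "$1" | awk '{ print $2 }')
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    case "$major$minor" in
        ''|*[!0-9]*) return 2 ;;   # could not parse a numeric version
    esac
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 35 ]; }
}

if command -v rustc >/dev/null 2>&1; then
    if rust_version_ok "$(rustc --version)"; then
        echo "Rust is recent enough for UMGAP"
    else
        echo "Rust is older than 1.35, please update" >&2
    fi
else
    echo "rustc not found, install Rust first" >&2
fi
```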
Installing Rust
If the rustc --version command returns command not found, Rust is not yet installed on your system. More information on installing Rust can be found at https://www.rust-lang.org/tools/install.
The next step is to download the UMGAP source code. The easiest way to do this is by cloning our git repository:
$ git clone https://github.com/unipept/umgap.git
Alternatively, you can click the download button on GitHub.
Next, we're ready to compile and install UMGAP. First navigate to the directory where you cloned or downloaded the code, then run cargo install --path . (including the trailing dot). The output should look like this:
$ cargo install --path .
Installing umgap v0.3.5 (/Users/unipept/Downloads/umgap)
Updating crates.io index
Downloaded num-traits v0.2.11
Downloaded error-chain v0.12.2
...
Compiling structopt v0.3.13
Compiling umgap v0.3.5 (/Users/bart/Downloads/umgap)
Finished release [optimized] target(s) in 1m 50s
Installed package `umgap v0.3.5 (/Users/unipept/Downloads/umgap)` (executable `umgap`)
warning: be sure to add `/Users/unipept/.cargo/bin` to your PATH to be able to run the installed binaries
UMGAP should now be installed in the ~/.cargo/bin directory. To access it from anywhere, be sure to add that directory to your $PATH.
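One way to add the directory to the search path is sketched below. It assumes a POSIX-style shell and the default Cargo location under your home directory. To make the change permanent, the same lines could go in your shell's startup file (which file that is depends on your shell).

```shell
# Prepend Cargo's bin directory to PATH unless it is already listed.
cargo_bin="$HOME/.cargo/bin"
case ":$PATH:" in
    *":$cargo_bin:"*) ;;                          # already present, nothing to do
    *) PATH="$cargo_bin:$PATH"; export PATH ;;    # otherwise put it first
esac
```

The case test avoids adding the same entry again each time the file is sourced.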
After a successful installation, the umgap command should be available. To check that it was installed correctly, run umgap --version. This should print the version number:
$ umgap --version
umgap 0.3.5
More information about the installed command can be found on these pages, or by running the umgap --help command.
Updates
To update UMGAP, download the latest source code and repeat the installation instructions. The changes between releases are listed in the changelog.
Configuration
If you want to use FragGeneScanPlusPlus as the gene predictor, it needs to be installed as well. Instructions can be found at https://github.com/unipept/FragGeneScanPlusPlus.
Run the configuration script at scripts/umgap-setup.sh to interactively configure UMGAP and download the data files required for some steps of the pipeline.
Depending on the type of analysis you are planning, you will need the tryptic index file (less powerful, but runs on any decent laptop), the 9-mer index file (uses about 100GB of disk space and as much RAM during operation; the exact size depends on the version), or both.
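Before downloading the 9-mer index, it may be worth checking that the target disk has room for it. The sketch below is one way to do so with standard tools; the function name has_free_gb is hypothetical, the current directory stands in for wherever you plan to store the index, and the 100GB figure is the approximate size quoted above. It checks disk space only, not RAM.

```shell
#!/bin/sh
# Hypothetical helper, not part of UMGAP.
# has_free_gb DIR GB: succeeds when the filesystem holding DIR has at least
# GB gibibytes available.
has_free_gb() {
    dir=$1
    needed_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

if has_free_gb . 100; then
    echo "Enough disk space for the 9-mer index"
else
    echo "Less than about 100GB free here; consider the tryptic index" >&2
fi
```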
To check if the configuration was successful, you can run a sample analysis with the included test data. The following command should show you a FASTA-like file with a taxon ID per header.
$ ./scripts/umgap-analyse.sh -1 testdata/A1.fq -2 testdata/A2.fq -t tryptic-sensitivity -o - | tee output.fa
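For a quick look at the result, the taxon IDs in output.fa can be tallied with standard tools. The sketch below assumes that each header line starts with '>' and is followed by a line holding a single taxon ID; the function name count_taxa is hypothetical. It is only a way to inspect the file and does not replace the taxa2freq tool.

```shell
#!/bin/sh
# Hypothetical helper, not part of UMGAP.
# count_taxa [FILE...]: count how often each taxon ID occurs in FASTA-like
# input, reading standard input when no file is given.
count_taxa() {
    # Skip header lines (starting with '>') and empty lines, count the rest,
    # then sort by descending count and ascending taxon ID.
    awk '!/^>/ && NF { count[$1]++ }
         END { for (id in count) print count[id], id }' "$@" |
        sort -k1,1nr -k2,2n
}

if [ -f output.fa ]; then
    count_taxa output.fa
fi
```

Each output line holds a count followed by a taxon ID, with the most frequent taxon first.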