Maps all k-mers from a FASTA stream of peptides to taxon IDs.
The umgap prot2kmer2lca command takes one or more peptides as input and outputs the lowest
common ancestors of all their k-mers. It combines the
umgap prot2kmer and
umgap pept2lca commands to allow more efficient parallel computing.
The input is given in FASTA format on standard input, with a single peptide per FASTA
header, which may be hard-wrapped with newlines. All overlapping k-mers in these peptides (k
configurable via the -k option, 9 by default) are searched for in the index (as built by
the umgap buildindex command) passed as argument. The results are printed on standard output
in FASTA format.
$ cat input.fa
>header1
DAIGDVAKAYKKAG*S
$ umgap prot2kmer2lca -k9 uniprot-2020-04-9mer.index < input.fa
>header1
571525
571525
6920
6920
1
6920
Add the -o option to print out 0 for k-mers not found in the index.
$ umgap prot2kmer2lca -o uniprot-2020-04-9mer.index < input.fa
>header1
571525
571525
6920
6920
1
6920
0
0
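To make it concrete which k-mers are looked up, the following sketch lists the overlapping 9-mers of the example peptide with awk. It is an illustration only, not how umgap implements the lookup, and the helper name kmers is hypothetical.

```shell
# Hypothetical helper (not part of umgap): print every overlapping k-mer
# of each input line, one k-mer per output line.
# Usage: kmers K < peptides.txt
kmers() {
    awk -v k="$1" '{
        # Slide a window of width k over the line, one position at a time.
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# The 16-residue example peptide yields 16 - 9 + 1 = 8 overlapping 9-mers.
printf 'DAIGDVAKAYKKAG*S\n' | kmers 9
```

The example peptide gives eight 9-mers, which matches the eight values printed with -o in the example above. The two trailing zeros there are consistent with the last two k-mers (the ones containing *) not being found in the index.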
This command also supports an alternative mode of operation. When the index is memory mapped,
searching it can take some time. With the -m flag, the complete index is loaded
into memory before operation. This, too, takes some time, but for a single large analysis the
loading time is negligible compared to the time of analysis. When processing many short files,
however, the index would need to be loaded again and again. Instead of using this command as part of a
... | umgap prot2kmer2lca index | ..., it can run in a separate (and persistent)
process, reusing the same loaded index. Run
umgap prot2kmer2lca -m -s umgap-socket index as a
service, and once the index is loaded, change your original pipeline(s) to communicate with the
socket using OpenBSD's netcat:
... | nc -NU /path/to/umgap-socket | ....
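As a minimal sketch of the many-short-files case, the helper below sends each FASTA file through an already running service and writes the reply next to the input. The function name query_socket, the .fa and .lca file extensions and the samples/ directory are assumptions made for illustration; only the nc -NU invocation and the service command come from the description above.

```shell
# Assumed setup, run once in another terminal (command as described above):
#   umgap prot2kmer2lca -m -s umgap-socket index

# Hypothetical helper: query a running service once per input file.
# Usage: query_socket SOCKET FILE...
query_socket() {
    sock=$1
    shift
    for f in "$@"; do
        # -U selects a Unix socket; -N closes the connection once the
        # input has been sent (OpenBSD netcat).
        nc -NU "$sock" < "$f" > "${f%.fa}.lca"
    done
}

# Example call (paths are placeholders):
#   query_socket /path/to/umgap-socket samples/*.fa
```

Each call opens a new connection to the same process, so the index is loaded only once regardless of how many files are processed.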
- -m / --in-memory
- Load index in memory instead of memory mapping the file contents. This makes querying significantly faster, but requires some initialization time.
- -h / --help
- Prints help information
- -o / --one-on-one
- Map unknown sequences to 0 instead of ignoring them
- -V / --version
- Prints version information
- -c / --chunksize c
- Number of reads grouped into one chunk. Bigger chunks decrease the overhead caused by multithreading. Because the output order is not necessarily the same as the input order, having a chunk size which is a multiple of 12 (all 6 translations multiplied by the two paired-end reads) will keep FASTA records that originate from the same reads together [default: 240]
- -k / --length k
- The length of the k-mers in the index [default: 9]
- -s / --socket s
- Instead of reading from stdin and writing to stdout, create a Unix socket to communicate with using OpenBSD's netcat (nc -NU socket). This is especially useful in combination with the --in-memory flag: you only have to load the index in memory once, after which you can query it without having the loading time overhead each time