The umgap splitkmers command takes tab-separated taxon IDs and protein sequences and outputs the k-mers mapped to the taxon IDs.

Usage

The input is read from standard input and should be a TSV-formatted stream in which each line holds a taxon ID and a protein sequence from that taxon. The output is written to standard output as a TSV-formatted stream of k-mers, each paired with the taxon ID of the sequence in which it occurs. The k-mer length is set with the -k option and defaults to 9.
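
The splitting step described above can be pictured as a sliding window over each sequence. The following Python sketch is only an illustration of that idea, not the umgap implementation; the function name split_kmers is hypothetical.

```python
def split_kmers(line, k=9):
    """Yield (kmer, taxon_id) pairs for one 'taxon_id<TAB>sequence' line.

    Illustrative only: split_kmers is a hypothetical helper, not part of umgap.
    """
    taxon_id, sequence = line.rstrip("\n").split("\t")
    # Slide a window of width k over the sequence, one residue at a time.
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k], taxon_id


sequence = "MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY"
pairs = list(split_kmers("654924\t" + sequence))
for kmer, taxon_id in pairs[:3]:
    print(f"{kmer}\t{taxon_id}")
```

A sequence of length n yields n - k + 1 k-mers, so sequences shorter than k produce no output in this sketch.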

This output stream is ready to be sorted, grouped by k-mer and aggregated into a searchable index, using the sort, umgap joinkmers and umgap buildindex commands.
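
To see why sorting comes before grouping, consider the Python sketch below. It only collects the taxon IDs seen for each k-mer; how umgap joinkmers actually aggregates each group is not described here, so this is an assumption-laden illustration rather than a stand-in for that command. The function name group_by_kmer and the third example line are made up for the illustration.

```python
from itertools import groupby


def group_by_kmer(lines):
    """Group 'kmer<TAB>taxon_id' lines by k-mer.

    Returns a list of (kmer, [taxon_ids]) pairs. Hypothetical helper,
    not part of umgap.
    """
    rows = sorted(line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges adjacent rows with equal keys,
    # which is why the rows are sorted first.
    return [
        (kmer, [taxon_id for _, taxon_id in group])
        for kmer, group in groupby(rows, key=lambda row: row[0])
    ]


# The third line is invented so that one k-mer occurs in two taxa.
example = ["MNAKYDTDQ\t654924", "IKLFCVLAA\t176652", "MNAKYDTDQ\t176652"]
grouped = group_by_kmer(example)
for kmer, taxon_ids in grouped:
    print(kmer, ",".join(taxon_ids))
```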

$ cat input.tsv
654924	MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
176652	MIKLFCVLAAFISINSACQSSHQQREEFTVATYHSSSICTTYCYSNCVVASQHKGLNVESYTCDKPDPYGRETVCKCTLIKCHDI
$ umgap splitkmers < input.tsv
MNAKYDTDQ	654924
NAKYDTDQG	654924
AKYDTDQGV	654924
KYDTDQGVG	654924
YDTDQGVGR	654924
...
SPSFSSRYR	654924
PSFSSRYRY	654924
MIKLFCVLA	176652
IKLFCVLAA	176652
KLFCVLAAF	176652
...

-h / --help
    Prints help information
-V / --version
    Prints version information
-k / --length k
    The k-mer length [default: 9]
-p / --prefix p
    Print only the (k-1)-mer suffixes of the k-mers starting with this character
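
One reading of the -p option, sketched in Python: keep only the k-mers whose first residue equals the given character and print them without that first residue. This is an interpretation of the one-line description above, not verified against the tool; the function name prefix_suffixes is hypothetical.

```python
def prefix_suffixes(line, prefix, k=9):
    """Yield ((k-1)-mer suffix, taxon_id) for k-mers starting with `prefix`.

    Hypothetical helper illustrating one reading of the -p option.
    """
    taxon_id, sequence = line.rstrip("\n").split("\t")
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer[0] == prefix:
            # Drop the shared first character, keeping the remaining k-1 residues.
            yield kmer[1:], taxon_id


selected = list(prefix_suffixes("654924\tMNAKYDTDQGVG", "N"))
for suffix, taxon_id in selected:
    print(f"{suffix}\t{taxon_id}")
```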