Inferring the taxonomic composition of a microbial community from a large collection of anonymous DNA sequencing reads is a challenging task in computational biology. Because existing methods for taxonomic profiling of metagenomes all assign sequence fragments to phylogenetic categories, the accuracy of their results depends largely on fragment length. This dependency complicates comparative analysis of data originating from different sequencing platforms or preprocessing pipelines. We have developed Taxy, a read length-independent method for taxonomic profiling, and we provide a freely available Matlab/Octave toolbox that includes an ultra-fast implementation of the method. Besides the platform-independent toolbox, we also provide a prototype implementation for Windows that allows the user to compare a large number of preprocessed metagenomes within a graphical environment. Our tests indicate that Taxy results compare well with taxonomic profiles obtained with other methods. In contrast to the existing methods, however, Taxy provides nearly constant profiling accuracy across read lengths and operates at an unrivaled speed. As input, DNA sequences in the form of multi-FASTA files of any size can be used for the estimation of metagenomic profiles. The analysis of a large sequence file on the Gbp scale typically requires less than a minute of processing time and can even be performed on a standard notebook.
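To illustrate why a pooled sequence signature does not require classifying individual reads, the following minimal Python sketch computes an oligonucleotide (k-mer) frequency profile from multi-FASTA text. It is not code from the Taxy toolbox: the function name `kmer_profile`, the default `k=4` and the decision to skip k-mers containing ambiguous bases are assumptions made for this example only.

```python
from collections import Counter


def kmer_profile(fasta_text, k=4):
    """Return relative k-mer frequencies pooled over all records.

    Hypothetical helper for illustration; not part of the Taxy toolbox.
    """
    counts = Counter()
    sequence_parts = []

    def flush():
        # Join the lines of the current record and count its k-mers.
        seq = "".join(sequence_parts).upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
        sequence_parts.clear()

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()  # a header line closes the previous record
        elif line:
            sequence_parts.append(line)
    flush()  # close the last record

    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Calling `kmer_profile(text, k=2)` on the contents of a multi-FASTA file returns a dictionary mapping each observed dinucleotide to its relative frequency. All reads contribute to a single pooled profile, so no per-read taxonomic assignment is involved.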
In contrast to the oligonucleotide-based Taxy method, Taxy-Pro is based on mixture model analysis of protein signatures, expressed as protein domain frequencies. The Pfam domain counts of a metagenome under study can be obtained from the CoMet webserver and are imported into Taxy-Pro as a protein signature (or "profile"). The mixture model-based estimation of metagenomic taxon abundances uses reference signatures from all domains of life as well as viruses. Furthermore, Taxy-Pro for the first time includes signatures of viral metagenomes as reference data to provide realistic estimates of the virus fraction. In a comparative study (Klingenberg et al., 2013) we showed that Taxy-Pro was the only method to achieve good profiling results across the full range of biological entities.
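The idea of a mixture model over fixed reference signatures can be sketched as follows, in Python. The sketch treats the metagenome's feature counts (for example Pfam domain counts) as draws from a weighted mixture of reference distributions and estimates the weights with a standard expectation-maximization (EM) update. This is a generic textbook estimator chosen for illustration and is not necessarily the estimator used in Taxy or Taxy-Pro; the function name `estimate_mixture_weights`, its parameters and the toy numbers in the usage note are all assumptions.

```python
def estimate_mixture_weights(counts, references, iterations=200):
    """Estimate mixture weights for fixed reference signatures via EM.

    counts:     feature counts of the metagenome (one entry per feature)
    references: list of reference signatures, each a probability
                distribution over the same features
    Hypothetical helper for illustration; not the Taxy-Pro implementation.
    """
    n_refs = len(references)
    weights = [1.0 / n_refs] * n_refs  # start from uniform weights

    for _ in range(iterations):
        new_weights = [0.0] * n_refs
        for j, c in enumerate(counts):
            if c == 0:
                continue
            # Probability of feature j under the current mixture.
            mix = sum(weights[i] * references[i][j] for i in range(n_refs))
            if mix == 0.0:
                continue  # feature not explained by any reference
            # E-step: share the count among references by responsibility.
            for i in range(n_refs):
                new_weights[i] += c * weights[i] * references[i][j] / mix
        total = sum(new_weights)
        if total == 0.0:
            break  # nothing to update; keep the current weights
        # M-step: renormalise so the weights sum to one.
        weights = [w / total for w in new_weights]

    return weights
```

As a toy check with two made-up reference signatures `[0.7, 0.2, 0.1]` and `[0.1, 0.3, 0.6]`, the counts `[250, 275, 475]` correspond exactly to a 25%/75% mixture, and the estimated weights approach those proportions. The weights are non-negative and sum to one by construction, so they can be read as relative abundances of the references.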
The platform-independent toolboxes for the Matlab/Octave programming environment can be downloaded here:
Taxy toolbox
Taxy-Pro toolbox
Taxy-Pro toolbox based on Pfam27 (including prediction of functional novelty in metagenomes)
The toolboxes contain functions for taxonomic profiling of large multi-FASTA DNA files and allow the user to easily modify and adapt the functionality. We tested the toolboxes with Matlab 7 and with the freely available GNU Octave 3.X under Windows and Linux operating systems.
Please note that the standalone Taxy Windows software is no longer maintained! The development and maintenance of different software versions for Taxy and Taxy-Pro require considerable personnel resources. Furthermore, the corresponding Matlab/Octave toolboxes are platform-independent, easily extendable and cover all important application cases. We therefore decided to concentrate on the advancement of the Taxy and Taxy-Pro algorithms and on the maintenance of the Matlab/Octave packages.
H. Klingenberg, K.P. Asshauer, T. Lingner and P. Meinicke. Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics, 29(8):973-980, 2013.
P. Meinicke, K.P. Asshauer and T. Lingner. Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics, 27(12):1618-1624, 2011.