Treephyler
Fast taxonomic profiling of metagenomes

This software is freely available at
http://www.gobics.de/fabian/software/treephyler.php

If you use Treephyler please cite:
F. Schreiber, P. Gumrich, R. Daniel and P. Meinicke (2010)
Treephyler: fast taxonomic profiling of metagenomes
Bioinformatics 26, 960-961


Installation

1) Checking system requirements

   - Perl (version 5.8 or higher)
   - BioPerl (version 1.4 or higher)
   - HMMER Version 3.0 (28 March 2010) (http://hmmer.janelia.org)
   - Pfam database (Release 24.0)
   - FastTree Version 2.0.1

Installing the Pfam database (Release 24)

   a) The following Pfam files are required:

      - Pfam-A.full    The full alignments of all curated families
      - Pfam-A.fasta   A fasta version of Pfam's underlying sequence database
      - Pfam-A.hmm     The Pfam HMM library
      - pfamseq.txt    Contains the taxonomic information

      These files can be downloaded from:
      ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam24.0/

   b) Extract all files.

Installing Treephyler

4) Get the most recent treephyler tarball from
   http://www.gobics.de/fabian/software/treephyler.php

5) Unpack the tarball in a temporary directory:

       tar -xvvf treephyler.tar

6) Configure treephyler

   All global configurations are stored in the Perl file Configurations.pm.
   Adapt all parameters in the section "System parameters" to your system.

7) Use the script treephyler.pl to prepare all Pfam-related files
   (alignments, hmms, taxonomy):

       perl treephyler.pl -m prepare -pfamA Pfam-A.full -pfamT pfamseq.txt -pfamH Pfam-A.hmm -pfamF Pfam-A.fasta

   This will put all Pfam alignments, hmms, and taxonomy entries into
   single files. The directory structure will be:

       pfam/alignments
       pfam/hmms
       pfam/taxonomy

8) Testing the installation

   Execute treephyler:

       perl treephyler.pl

   A usage message describing the available parameters should appear.


Prediction workflow

The whole prediction workflow consists of two parts: the generation of
Pfam assignments using e.g. UFO or Comet, and the taxonomic
classification using Treephyler.

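Before generating predictions, it can be worth confirming that the tools listed under "Checking system requirements" are actually reachable. The following is only a minimal sketch, not part of Treephyler. It assumes that HMMER 3 provides a binary named hmmsearch, that FastTree is installed under the name FastTree, and that BioPerl can be detected by loading the Bio::Perl module; adjust these names to your installation. It does not check the BioPerl, HMMER, or FastTree version numbers.

```shell
#!/bin/sh
# Report whether a command can be found on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

check_cmd perl
# Assumed binary names; they may differ on your system.
check_cmd hmmsearch
check_cmd FastTree

# "use 5.008" makes perl exit with an error if it is older than 5.8.
if perl -e 'use 5.008;' 2>/dev/null; then
    echo "ok: Perl 5.8 or higher"
else
    echo "problem: Perl 5.8 or higher not available"
fi

# Loading Bio::Perl fails if BioPerl is not installed.
if perl -MBio::Perl -e 1 2>/dev/null; then
    echo "ok: BioPerl can be loaded"
else
    echo "problem: BioPerl could not be loaded"
fi
```

Any line starting with "missing" or "problem" points at a requirement that should be installed or added to the PATH before running treephyler.pl.
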
UFO Predictions

Pfam predictions for input sequences can be made using the UFO web
server at www.ufo.gobics.de. You can use treephyler to translate your
dataset into protein sequences and split it into equal chunks, so that
you can easily analyze it using UFO.

Comet Predictions

If predictions were generated with Comet (http://comet.gobics.de/),
invoke treephyler with the "-comet" option. Internally, the Comet
predictions are converted into the UFO format.

Taxonomic classification using treephyler

Adapt the configuration file (Configuration.pm) to match your type of
input data (nucleotide or protein) and the computer you are using
(single-/multi-core machine or a computer cluster using a Sun Grid Engine).

With all sequences in sequences.fa and all UFO predictions in
UFO_predictions.txt you can start the analysis by typing:

    perl treephyler.pl -m analyse -i sequences.fa -a UFO_predictions.txt

If the predictions are from Comet, the command is:

    perl treephyler.pl -m analyse -i sequences.fa -a UFO_predictions.txt -comet

Treephyler will put all output files in the project directory
(specified in the configuration file Configuration.pm). You will find
the taxonomic classifications in the file taxonomic_assignments.txt in
the project directory. If you start the analysis in a parallel
environment, results will be in taxonomic_assignments1.txt,
taxonomic_assignments2.txt, taxonomic_assignments3.txt, etc. You can
put all assignments in one file using:

    cat taxonomic_assignments[0-9]*.txt > taxonomic_assignments.txt

The pattern matches only the numbered files, so the output file is never
read as its own input.


Additional methods

Splitting file into equal chunks

The following command takes the input file sequences.fa and splits it
into files containing e.g. 50,000 sequences each.

    perl treephyler.pl -m split -i sequences.fa -nsplits 50000

Translating file to protein format

The following command takes the input file sequences.fa, translates it
in all six reading frames, and saves it in sequences.fa_translated.

    perl treephyler.pl -m translate -i sequences.fa

Generating statistics

The following command generates a statistics file showing the
percentage of assigned sequences for each bacterial phylum from all
predictions in the file all_predictions.fa.

    perl treephyler.pl -m statistics -i all_predictions.fa


Appendix

Input format:

-sequences: Input sequences have to be in multiple fasta format and can
            be either nucleotide or protein sequences

-predictions: The required format (UFO) is:

    FASTA_HEADER_OF_SEQUENCE PFXXXX
    FASTA_HEADER_OF_SEQUENCE PFXXXX PFYYYY
    FASTA_HEADER_OF_SEQUENCE no assignments

  where FASTA_HEADER_OF_SEQUENCE is the fasta header that can also be
  found in the sequence file and PFXXXX is the Pfam family the sequence
  has been assigned to. This is the standard UFO output format, but you
  can adapt the output of other prediction methods to this format and
  use it as input.

Synopsis:

Parameters of treephyler.pl:

    perl treephyler.pl -m ("statistics"|"analyse"|"prepare"|"split"|"translate") -i Fastafile -a Assignment_file [-nsplits|-pfamA|-pfamT|-pfamH|-pfamF]

Required parameters

-i:     Contains the query sequences in Fasta format
-a:     Contains Pfam predictions in UFO format
-comet: Use if predictions are from the Comet server. Internally
        converts them into UFO format
-m:     Mode, available options are:

    "statistics" - Compute statistics, e.g.
        perl treephyler.pl -m statistics -i assignments.fa
    "prepare" - Extracts information about alignments, taxonomy, and
        hmms from Pfam files and saves them in the subfolder "pfam", e.g.
        perl treephyler.pl -m prepare -pfamA Pfam-A.full -pfamT pfamseq.txt -pfamH Pfam_fs -pfamF Pfam-A.fasta
    "analyse" - Start analysis, e.g.
        perl treephyler.pl -m analyse -i gletscher.fas -a Ufo.out
    "split" - Splits a file into several smaller chunks, e.g.
        perl treephyler.pl -m split -i input_sequences.fa -nsplits 50000
    "translate" - Translates sequences in all six reading frames, e.g.
        perl treephyler.pl -m translate -i input_sequences.fa
    "help" - Print this help message

Example call:

    perl treephyler.pl -m analyse -i gletscher.fas -a Ufo.out

------------------------------------------------------------------------------------------------------------
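
The prediction format described in the Appendix requires that every header in the predictions file also occurs in the sequence file. When adapting the output of another prediction method, a quick consistency check can catch mismatches early. The sketch below is not part of Treephyler; it assumes that the header is the first whitespace-delimited field in both files and that the fasta header is compared without its leading ">". The function name missing_headers and the demo data are hypothetical.

```shell
#!/bin/sh
# List headers that occur in the predictions file but not in the fasta file.
# Assumption: the header is the first whitespace-delimited field in both files.
missing_headers() {
    fasta=$1
    predictions=$2
    awk '
        # First file (fasta): remember each header without the leading ">".
        NR == FNR { if (substr($1, 1, 1) == ">") seen[substr($1, 2)] = 1; next }
        # Second file (predictions): print headers unknown to the fasta file.
        NF > 0 && !($1 in seen) { print $1 }
    ' "$fasta" "$predictions"
}

# Hypothetical demo data, not taken from a real dataset.
demo_dir=$(mktemp -d)
printf '>read1\nATGAAA\n>read2\nATGCCC\n>read3\nATGGGG\n' > "$demo_dir/sequences.fa"
printf 'read1 PF00001\nread2 PF00001 PF00002\nread3 no assignments\nread9 PF00003\n' > "$demo_dir/UFO_predictions.txt"

# Prints "read9", the only header without a matching sequence.
missing_headers "$demo_dir/sequences.fa" "$demo_dir/UFO_predictions.txt"
```

An empty result means every prediction line refers to a sequence present in the fasta file. If your fasta headers contain spaces, the first-field assumption no longer holds and the comparison has to be adapted.
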