The method article in @sec:full_taxator-tk describes a high-performance tool for the taxonomic annotation of metagenomes using phylogenetic principles. The procedure splits the input sequences (contigs) into smaller, separate homology regions (segments) and classifies each of them with a newly developed realignment placement algorithm (RPA). This algorithm calculates pairwise alignment scores to estimate phylogenetic distances and simultaneously approximates a corresponding tree structure. The alignments are non-exhaustive: they stop once a good taxon estimate has been determined or if no phylogenetic signal can be found in the input. In a final merging step, the segment predictions are combined for the full sequence to minimize the error of the predicted taxon. The corresponding computer program, taxator-tk, is implemented in C++ and utilizes parallel computation.
In metagenomics, we study microbial communities from natural environments without obtaining cultures. Using sequencing followed by computational analyses, we can estimate the abundances of taxa (taxonomic profiling) and characterize their metabolic potential by sorting nucleotide sequences into genome bins (binning) and predicting the proteins they encode. Taxonomic profiling is conceptually different from taxonomic binning: profiling only requires taxonomically informative (partial) genes, which can be obtained by amplicon sequencing, whereas binning needs to deal with all parts of a genome. Universal marker genes used for profiling are usually classified by phylogenetic placement, which uses a reference tree of the corresponding gene as a proxy for the species phylogeny. Random genome regions, as obtained by shotgun sequencing, typically lack such reference trees. Therefore, a taxonomy is used instead, and query sequences are compared to reference genomes that are annotated with the corresponding taxa. Such a comparison can be based on direct sequence matching or on nucleotide sequence composition, for instance
The workflow for the taxonomic assignment of a query sequence consists of three parts ([@fig:taxatortk_workflow]): (a) a local alignment search for homologs, (b) the core assignment algorithm and (c) a post-processing step to merge subregion annotations. The initial search can be run with different aligners and different reference sequence collections. Based on the resulting local alignments, each query sequence is split into distinct subregions (segments), omitting parts that have no similarity to any reference. This step reduces the overall number of positions for further alignments and accounts for genome rearrangements. Each segment, along with its homologous reference sequences, is processed by the core algorithm to predict a taxon. The final merging step considers all segment predictions of a query sequence and determines the final taxon for assignment.
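The segmentation and merging steps of this workflow can be illustrated with a minimal sketch. It is not the taxator-tk implementation: the function names `split_into_segments` and `lowest_common_ancestor` are hypothetical, the rule of merging overlapping alignment intervals is an assumed simplification of how segments are derived, and taking the lowest common ancestor of all segment predictions is only a deliberately conservative stand-in for the actual merging procedure.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on the query


def split_into_segments(hits: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching alignment intervals into disjoint segments.

    Query positions not covered by any hit are omitted. (Assumed rule,
    for illustration only.)
    """
    segments: List[Interval] = []
    for start, end in sorted(hits):
        if segments and start <= segments[-1][1]:
            # The hit overlaps or touches the previous segment: extend it.
            prev_start, prev_end = segments[-1]
            segments[-1] = (prev_start, max(prev_end, end))
        else:
            # The hit starts after a gap: open a new segment.
            segments.append((start, end))
    return segments


def lowest_common_ancestor(lineages: List[List[str]]) -> List[str]:
    """Return the longest shared prefix of root-to-taxon lineages.

    Used here as a conservative stand-in for the merging step (assumption).
    """
    if not lineages:
        return []
    common: List[str] = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Three local alignments on one contig yield two segments.
hits = [(100, 300), (250, 400), (900, 1200)]
print(split_into_segments(hits))  # [(100, 400), (900, 1200)]

# Two segment predictions that disagree below the phylum level.
predictions = [
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Firmicutes", "Clostridia"],
]
print(lowest_common_ancestor(predictions))  # ['Bacteria', 'Firmicutes']
```

In this sketch, the uncovered query positions between 400 and 900 are never realigned, which mirrors the reduction in aligned positions described above.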
The core realignment placement algorithm (RPA) ([@fig:taxatortk_rpa]) assigns a taxon Q to a query segment q using a limited number of pairwise alignments among q and its homologous segments obtained by local alignment to reference sequences. It aims to identify a set of segments which form a monophyletic group or subtree in the corresponding phylogeny. First, the most similar segment s is aligned to the query q and all other segments in the set (pass 1). An outgroup segment o is determined as the first sequence with distance larger than
We evaluated the performance of taxonomic assignment with taxator-tk for different datasets: (a) 7176 16S rRNA genes, (b) simulated short sequences of length 100, 500 and 1000 bp, (c) simulated contigs for a synthetic microbial community and two public benchmark datasets and (d) contigs of a microbial community from cow rumen. When possible, we applied cross-validation and evaluated different taxonomic distances between sample and reference taxa. In all cases, the reference data were a diverse collection of full and partial genome sequences with taxonomic annotation. As expected, performance was best for the 16S marker genes because they carry a clear phylogenetic signal. In practice, such sequences are best classified using phylogenetic placement because it makes use of reference phylogenies. The second evaluation with nucleotide sequences resembling individual reads, which were sampled from 1729 different species, showed that precision was high even for short sequences, but about 10% lower on average than for 16S data. The recall increased with the length of the sequences. Therefore, it is recommended to assemble reads prior to assignment with taxator-tk. For the validation with assembled contigs, we compared our results to other state-of-the-art assignment methods: CARMA, MEGAN, Kraken (all similarity-based) and PhyloPythiaS (composition-based). For the newly simulated community consisting of 49 different species and the two benchmark datasets, taxator-tk misassigned substantially fewer contigs at species and genus levels, resulting in a much better precision but a reduced recall. PhyloPythiaS, a classifier based on nucleotide composition (
For all compared methods, the bin precision decreased with the bin size. Throughout all validation experiments, taxator-tk was the most precise of the compared methods in assigning metagenome nucleotide sequences to the corresponding taxa (an example is shown in [@fig:taxatortk_precision]), which also resulted in the most realistic number of taxa. However, it assigned fewer data overall than other methods. This trade-off is a direct consequence of the algorithm design, which is tailored towards the minimization of errors. It can therefore confidently assign a core of sequences, which can be used, for instance, to train a model based on nucleotide composition or to estimate taxon abundances. The use of unstructured reference data allows assignment across all domains of life, in contrast to most methods, which use specific gene families. From a methodological point of view, we presented an alternative phylogenetic inference algorithm that runs in linear time with respect to the number of homologs and that applies to any nucleotide sequence without the need to select algorithm parameters. Besides the taxonomic annotation of metagenomes, it can be applied to any DNA or RNA sequence, for instance to detect contamination in isolate sequencing data.
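The trade-off between precision and the amount of assigned data can be made concrete with a small worked example. This sketch uses simple overall definitions (correct assignments divided by all assignments, and by all sequences, respectively), which are an assumption for illustration and may differ from the bin-averaged measures used in the evaluation; the function name `precision_recall` and the toy data are hypothetical.

```python
from typing import Dict, Tuple


def precision_recall(truth: Dict[str, str], assigned: Dict[str, str]) -> Tuple[float, float]:
    """Overall precision and recall of taxonomic assignments.

    truth:    sequence id -> true taxon, for every sequence in the sample
    assigned: sequence id -> predicted taxon, unassigned sequences omitted
    """
    correct = sum(1 for seq, taxon in assigned.items() if truth.get(seq) == taxon)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy data (hypothetical): four contigs with known genus labels.
truth = {"c1": "Bacillus", "c2": "Bacillus", "c3": "Vibrio", "c4": "Vibrio"}

# A cautious method assigns only two contigs, both correctly.
cautious = {"c1": "Bacillus", "c3": "Vibrio"}
print(precision_recall(truth, cautious))  # (1.0, 0.5)

# An eager method assigns all four contigs, one of them wrongly.
eager = {"c1": "Bacillus", "c2": "Vibrio", "c3": "Vibrio", "c4": "Vibrio"}
print(precision_recall(truth, eager))  # (0.75, 0.75)
```

The cautious method leaves half of the data unassigned but makes no errors, which is the behaviour the error-minimizing design aims for.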
\FloatBarrier