01_treeio_importing_tree.Rmd

\mainmatter


# (PART\*) Part I: Tree data input, output, and manipulation {-}

\newpage

# Importing Tree with Data {#chapter1}


```{r echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tidytree)
library(ggplot2)

library("treeio")
```

## Overview of Phylogenetic Tree Construction {#pc}


Phylogenetic trees are used to describe genealogical relationships among a group of organisms, which can be constructed based on the genetic sequences of the organisms. A rooted phylogenetic tree represents a model of evolutionary history depicted by ancestor-descendant relationships between tree nodes and clustering of `r squote('sister')` or `r squote('cousin')` organisms at a different level of relatedness, as illustrated in Figure \@ref(fig:phylogeny). In infectious disease research, phylogenetic trees are usually built from the pathogens' gene or genome sequences to show which pathogen sample is genetically closer to another sample, providing insights into the underlying unobserved epidemiologic linkage and the potential source of an outbreak.

(ref:phylogenyscap) Components of a phylogenetic tree.

(ref:phylogenycap) **Components of a phylogenetic tree.** External nodes (green circles), also called `r squote("tips")`, represent actual organisms sampled and sequenced (*e.g.*, viruses in infectious disease research). They are the `r squote("taxa")` in the terminology of evolutionary biology. The internal nodes (blue circles) represent hypothetical ancestors for the tips. The root (red circle) is the common ancestor of all species in the tree. The horizontal lines are branches and represent evolutionary changes (gray number) measured in a unit of time or genetic divergence. The bar at the bottom provides the scale of these branch lengths. 

```{r phylogeny, message=FALSE, echo=F, fig.width=6, fig.height=4, fig.cap='(ref:phylogenycap)', fig.scap='(ref:phylogenyscap)', out.extra=''}
require(ggtree)
set.seed(2017-04-22)
tr <- rtree(10)
p <- ggtree(tr) + 
  geom_nodepoint(color='#619CFF', size=3) + 
  geom_rootpoint(size=3, color='#F8766D') + 
  geom_tippoint(color='#00BA38', size=3) +
  geom_text2(aes(subset=(node != parent), 
                 label=round(branch.length, 2), x=branch), 
             vjust=-.5, color='grey50', size=3) + geom_treescale() 

d <- p$data
i <- d$isTip
d$label[i][order(d$y[i],decreasing=F)] <- paste0('virus', 10:1)
p$data <- d
p + geom_tiplab(offset = .05, size=4.5) + xlim(NA,4)
```

A phylogenetic tree can be constructed from genetic sequences using distance-based methods or character-based methods. Distance-based methods, including the unweighted pair group method with arithmetic means (UPGMA\index{UPGMA}) and the Neighbor-joining (NJ\index{NJ}), are based on the matrix of pairwise genetic distances calculated between sequences. The character-based methods, including maximum parsimony\index{maximum parsimony} (MP) [@fitch_toward_1971], maximum likelihood\index{maximum likelihood} (ML) [@felsenstein_evolutionary_1981], and Bayesian\index{Bayesian} Markov Chain Monte Carlo\index{Markov Chain Monte Carlo} (BMCMC) method [@rannala_probability_1996], are based on a mathematical model that describes the evolution of genetic characters and searches for the best phylogenetic tree according to their optimality criteria.


The MP method assumes that the evolutionary change is rare and minimizes the amount of character-state changes (*e.g.*, number of DNA substitutions). The criterion is similar to Occam's razor, that the simplest hypothesis that can explain the data is the best hypothesis. Unweighted parsimony assumes mutations across different characters (nucleotides or amino acids) are equally likely, while the weighted method assumes the unequal likelihood of mutations (*e.g.*, the third codon position is more liable than other codon positions; and the transition mutations have a higher frequency than transversion). The concept of the MP method is straightforward and intuitive, which is a probable reason for its popularity amongst biologists who care more about the research question rather than the computational details of the analysis. However, this method has several disadvantages, in particular, the tree inference can be biased by the well-known systematic error called long-branch attraction (LBA\index{LBA}) that incorrectly infer distantly related lineages as closely related [@felsenstein_cases_1978]. This is because the MP method poorly takes into consideration of many sequence evolution factors (*e.g.*, reversals and convergence) that are hardly observable from the existing genetic data.

The maximum likelihood (ML) method and Bayesian Markov Chain Monte Carlo (BMCMC) method are the two most commonly used methods in phylogenetic tree construction and are most often used in scientific publications. ML and BMCMC methods require a substitution model of sequence evolution. Different sequence data have different substitution models to formulate the evolutionary process of DNA, codon and amino acid. There are several models for nucleotide substitution, including JC69\index{JC69}, K2P\index{K2P}, F81\index{F81}, HKY\index{HKY}, and GTR\index{GTR} [@arenas_trends_2015]. These models can be used in conjunction with the rate variation across sites (denoted as +$\Gamma$)) [@yang_maximum_1994] and the proportion of invariable sites (denoted as +I) [@shoemaker_evidence_1989]. Previous research [@lemmon_importance_2004] had suggested that misspecification of substitution model might bias phylogenetic inference. Procedural testing for the best-fit substitution model is recommended.


The optimal criterion of the ML method is to find the tree that maximizes the likelihood given the sequence data. The procedure of the ML method is simple: calculating the likelihood of a tree and optimizing its topology and branches (and the substitution model parameters, if not fixed) until the best tree is found. Heuristic search, such as those implemented in `r pkg_phyml` and `r pkg_raxml`, is often used to find the best tree based on the likelihood criterion. The Bayesian method finds the tree that maximizes posterior probability by sampling trees through MCMC based on the given substitution model. One of the advantages of BMCMC is that parameter variance and tree topological uncertainty, included by the posterior clade probability, can be naturally and conveniently obtained from the sampling trees in the MCMC process. Moreover, the influence of topological uncertainty on other parameter estimates is also naturally integrated into the BMCMC phylogenetic framework.

In a simple phylogenetic tree, data associated with the tree branches/nodes could be the branch lengths (indicating genetic or time divergence) and lineage supports such as bootstrap values estimated from bootstrapping procedure or posterior clade probability summarized from the sampled trees in the BMCMC analysis.


## Phylogenetic Tree Formats {#format}


There are several file formats designed to store phylogenetic trees and the data associated with the nodes and branches. The three commonly used formats are Newick^[http://evolution.genetics.washington.edu/phylip/newick_doc.html], NEXUS [@maddison_nexus:_1997], and Phylip [@felsenstein_phylip_1989]. Some formats (*e.g.*, NHX) are extended from the Newick\index{Newick} format. Newick and NEXUS formats are supported as input by most of the software in evolutionary biology, while some of the software tools output newer standard files (*e.g.*, `r pkg_beast` and `r pkg_mrbayes`) by introducing new rules/data blocks for storing evolutionary inferences. In the other cases (*e.g.*, `r pkg_paml` and `r pkg_r8s`), output log files are only recognized by their own single software.


### Newick tree format 

The Newick\index{Newick} tree format is the standard for representing trees in computer-readable form. 


(ref:randomTreescap) A sample tree for demonstrating Newick text to encode tree structure.

(ref:randomTreecap) **A sample tree for demonstrating Newick text to encode tree structure.** Tips were aligned to the right-hand side and branch lengths were labeled on the middle of each branch.

```{r randomTree, message=FALSE, warning=F, fig.height=3.5, echo=FALSE, fig.cap="(ref:randomTreecap)", fig.scap="(ref:randomTreescap)", out.extra=''}
library(Biostrings)
library(ape)
library(phytools)
library(ggtree)
library(dplyr)
library(tidyr)
library(stringr)
linewidth <- 60
set.seed(2016-11-08)
tr <- roundBranches(rtree(5), 2)
ggtree(tr, size=1.5) + geom_tiplab(align=T, size=8, linesize=.8) + geom_label2(aes(subset=node != parent, x=branch, label=branch.length))
```

The rooted tree shown in Figure \@ref(fig:randomTree) can be represented by the following sequence of characters as a Newick tree text.

```{r message=FALSE, echo=FALSE, comment=""}
cat(write.tree(tr), '\n')
```

```{r siblingNodes, echo=FALSE}
i <- parent(tr, nodeid(tr, 't1')) # parent node of t1
tr2 <- tree_subset(tr, i, levels_back=0)
tr2$root.edge <- NULL
siblingNodes <- sub(";", "", write.tree(tr2))
```

The tree text ends with a semicolon. Internal nodes are represented by a pair of matched parentheses. Between the parentheses are descendant nodes of that node. For instance `r siblingNodes` represents the parent node of `r tr2$tip.label[1]` and `r tr2$tip.label[2]` that are the immediate descendants. Sibling nodes are separated by a comma and tips are represented by their names. A branch length (from the parent node to child node) is represented by a real number after the child node and is preceded by a colon. Singular data (*e.g.*, bootstrap values) associated with internal nodes or branches may be encoded as node labels and represented by the simple text/numbers before the colon.


Newick tree format was developed by Meacham in 1984 for the Phylogeny Inference Package or `r pkg_phylip` [@retief_phylogenetic_2000] package. Newick format is now the most widely used tree format and used by `r pkg_phylip`, `r pkg_paup` [@wilgenbusch_inferring_2003], *TREE-PUZZLE* [@schmidt_tree-puzzle:_2002], `r pkg_mrbayes`, and many other applications. Phylip tree format contains Phylip multiple sequence alignment (MSA) with a corresponding Newick tree text that was built based on the MSA sequences in the same file.

### NEXUS tree format

The NEXUS\index{NEXUS} format [@maddison_nexus:_1997] incorporates Newick tree text with related information organized into separated units known as **blocks**. A NEXUS block has the following structure:


```
#NEXUS
...
BEGIN characters;
...
END;
```

For example, the above example tree can be saved as the following NEXUS format:

```{r message=FALSE, echo=FALSE, eval=TRUE, comment=NA}
aa = capture.output(write.nexus(tr))
i = grep("^\tTREE", aa)
aa[i] = yulab.utils::str_wrap(aa[i], 60)
cat(aa, sep="\n")
```


Comments can be placed using square brackets. Some blocks can be recognized by most of the programs including `TAXA` (contains information of taxa), `DATA` (contains data matrix, *e.g.*, sequence alignment), and `TREE` (contains a phylogenetic tree, *i.e.*, Newick tree text). Notably, blocks can be very diverse and some of them are only recognized by one particular program. For example NEXUS file exported by `r pkg_paup` has a *paup* block that contains *PAUP\** commands, whereas `r pkg_figtree` exports the NEXUS file with a *figtree* block that contains visualization settings. NEXUS organizes different types of data into different blocks, whereas programs that support reading NEXUS can parse some blocks they recognized and ignore those they could not. This is a good mechanism to allow different programs to use the same format without crashing when unsupported types of data are present. Notably, most of the programs only support parsing `TAXA`, `DATA`, and `TREE` blocks; therefore, a program/platform that could generically read all kinds of data blocks from the NEXUS would be useful for phylogenetic data integration.

The `DATA` block is widely used to store sequence alignment. For this purpose, the user can store tree and sequence data in Phylip format which are essentially Phylip multiple sequence alignment and Newick tree text, respectively. It is used in `r pkg_phylip`. 

<!-- 

~~and it's also a drawback of NEXUS for most of the software can only parse the tree strucuture without support of the associated data. NEXUS is good for storing specific information for particular program, it can be used for storing annotated trees.~~

It is very difficult to share tree annotation data among different programs unless the data is a matrix which can be stored in `DATA` block. The `DATA` block is widely used to store sequence alignment. For this purpose, user can store tree in Phylip format which was used in `r pkg_phylip` package and can contains Phylip multiple sequence alignment with Newick tree text.

-->


### New Hampshire eXtended format {#nhxtext}

<!-- Although NEXUS supports additional information in separated blocks, they are mostly not recognized by different software and can't share information among different programs. In order to store annotated tree,-->

Newick, NEXUS, and phylip are mainly designed to store phylogenetic trees and basic singular data associated with internal nodes or branches. In addition to the singular data annotation at branches and nodes (mentioned above), New Hampshire eXtended (NHX\index{NHX}) format, which is based on Newick (also called New Hampshire), was developed to introduce tags to associate multiple data fields with the tree nodes (both internal nodes and tips). Tags are placed after branch length and must be wrapped between `[&&NHX` and `]` which makes it possible to be compatible with NEXUS format as it defined characters between `[` and `]` as comments. NHX is also the output format of `r pkg_phyldog` [@boussau_genome-scale_2013] and `r pkg_revbayes` [@hohna_revbayes:_2016]. A Tree Viewer (`r pkg_atv`) [@zmasek_atv:_2001] is a java tool that supports displaying annotation data stored in NHX format, but this package is no longer maintained.

Here is a sample tree from NHX definition document^[http://www.genetics.wustl.edu/eddy/forester/NHX.html]:

```{r echo=FALSE, eval=FALSE, comment=NA}
x <- "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):0.05[&&NHX:S=primates:D=Y:B=100],ADHY:0.1[&&NHX:S=nematode],ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
cat(str_wrap(x, linewidth))
```

```{r echo=FALSE, comment=NA}
x <- c("(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):0.05",
    "[&&NHX:S=primates:D=Y:B=100],ADHY:0.1[&&NHX:S=nematode],",
    "ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09",
    "[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast],ADH2:0.12[&&NHX:S=yeast],",
    "ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
    )
writeLines(x)
```

### Jplace format 

To store the Next Generation Sequencing (NGS) short reads mapped onto a phylogenetic tree (for metagenomic classification), Matsen [@matsen_format_2012] proposed jplace\index{jplace} format for such phylogenetic placements. Jplace format is based on JSON and contains four keys: `tree`, `fields`, `placements`, `metadata`, and `version`. The `tree` value contains tree text extended from Newick tree format by putting the edge label in brackets (if available) after branch length and putting the edge number in curly braces after the edge label. The `fields` value contains header information of placement data. The value of `placements` is a list of `pqueries`. Each `pquery` contains two keys: `p` for placements and `n` for name or `nm` for names with multiplicity. The value of `p` is a list of placement for `pqueries`.

Here is a jplace sample file:


```
{
	"tree": "(((((((A:4{1},B:4{2}):6{3},C:5{4}):8{5},D:6{6}):
	3{7},E:21{8}):10{9},((F:4{10},G:12{11}):14{12},H:8{13}):
	13{14}):13{15},((I:5{16},J:2{17}):30{18},(K:11{19},
	L:11{20}):2{21}):17{22}):4{23},M:56{24});",
	"placements": [
	{"p":[24, -61371.300778, 0.333344, 0.000003, 0.003887], 
	 "n":["AA"]
	},
	{"p":[[1, -61312.210786, 0.333335, 0.000001, 0.000003],
	      [2, -61322.210823, 0.333322, 0.000003, 0.000003],
	      [3, -61352.210823, 0.333322, 0.000961, 0.000003]],
	 "n":["BB"]
	},
	{"p":[[8, -61312.229128, 0.200011, 0.000001, 0.000003],
	      [9, -61322.229179, 0.200000, 0.000003, 0.000003],
	      [10, -61342.229223, 0.199992, 0.000003, 0.000003]], 
	"n":["CC"]
	}
	],
	"metadata": {"info": "a jplace sample file"},
	"version" : 2,
	"fields": ["edge_num", "likelihood", "like_weight_ratio", 
	"distal_length", "pendant_length"
	]
}
```


Jplace is the output format of `r pkg_pplacer`\index{pplacer} [@matsen_pplacer_2010] and Evolutionary Placement Algorithm (`r pkg_epa`\index{EPA}) [@berger_performance_2011]. But these two programs do not contain tools to visualize placement results. `r pkg_pplacer` provides `placeviz` to convert jplace file to phyloXML\index{phyloXML} or Newick formats which can be visualized by `r pkg_archaeopteryx`. 

### Software outputs

`r pkg_raxml`\index{RAxML} [@stamatakis_raxml_2014] can output Newick format by storing the bootstrap values as internal node labels. Another way that `r pkg_raxml` supports is to place bootstrap value inside square brackets and after branch length. This could not be supported by most of the software that supports Newick format where square brackets will be ignored. 

`r pkg_beast`\index{BEAST} [@bouckaert_beast_2014] output is based on NEXUS, and it also introduces square brackets in the tree block to store evolutionary evidence inferred by `r pkg_beast`. Inside brackets, curly braces may also be incorporated if feature values have a length of more than 1 (*e.g.*, Highest Probability Density (HPD) or range of substitution rate). These brackets are placed between node and branch length (*i.e.*, after label if exists and before the colon). The bracket is not defined in Newick format and is a reserved character for NEXUS comment. So this information will be ignored for standard NEXUS parsers. 


Here is a sample `TREE` block of the `r pkg_beast` output:


```{r echo=FALSE, comment=NA, eval=FALSE}
f <- system.file("extdata/BEAST/beast_mcc.tree", package="treeio")
x <- readLines(f)
tr <- x[42]
tr <- gsub("\\[[^\\[]+(length=\\d+\\.\\d)[^\\[]*\\]", "[&\\1]", tr)
tr <- gsub("(:\\d+\\.\\d{2})\\d*,", "\\1,", tr)
#w <- 80
#ii <- 1:floor(nchar(tr)/w)* w
#start <- c(1, ii+1)
#end <- c(ii, nchar(tr))
#x <- c(x[1:41],
#      substring(tr, start, end),
#      x[43:length(x)])
#cat(paste0(x, collapse="\n"))
cat(str_wrap(tr, linewidth))

## new version, more robust

y <- read.beast.newick(textConnection(tr))
y@phylo$edge.length = round(y@phylo$edge.length, 2)
y@data=y@data[c('node', 'length')]
y@data$length <- round(y@data$length, 2)
write.beast(y, translate=F)

```

```{r echo=FALSE, comment=NA}

x <- c("TREE * TREE1 = [&R] (((11[&length=9.47]:9.39,14[&length=6.47]:6.39)",
  "[&length=25.72]:25.44,4[&length=9.14]:8.82)[&length=3.01]:3.1,",
  "(12[&length=0.62]:0.57,(10[&length=1.6]:1.56,(7[&length=5.21]:5.19,",
  "((((2[&length=3.3]:3.26,(1[&length=1.34]:1.32,(6[&length=0.85]:0.83,",
  "13[&length=0.85]:0.83)[&length=2.5]:2.49)[&length=0.97]:0.94)",
  "[&length=0.5]:0.5,9[&length=1.76]:1.76)[&length=2.41]:2.36,",
  "8[&length=2.19]:2.11)[&length=0.27]:0.24,(3[&length=3.33]:3.31,",
  "(15[&length=5.29]:5.27,5[&length=3.29]:3.27)[&length=1.04]:1.04)",
  "[&length=1.98]:2.04)[&length=2.83]:2.84)[&length=5.39]:5.37)",
  "[&length=2.02]:2)[&length=4.35]:4.36)[&length=0];")

writeLines(x)
```


`r pkg_beast` output can contain many different evolutionary inferences, depending on the analysis models defined in *BEAUTi* for running. For example in molecular clock analysis, it contains `rate`, `length`, `height`, `posterior` and corresponding HPD and range for uncertainty estimation. `Rate` is the estimated evolutionary rate of the branch. `Length` is the length of the branch in years. `Height` is the time from node to root, while `posterior` is the Bayesian clade credibility value. The above example is the output tree of a molecular clock analysis and should contain these inferences. To save space, only the `length` estimation was shown above. Besides, Molecular Evolutionary Genetics Analysis (`r pkg_mega`) [@kumar_mega7_2016] also supports exporting trees in `r pkg_beast` compatible Nexus format (see [session 1.3.2](#mega)). 

*`r pkg_mrbayes`*\index{MrBayes} [@huelsenbeck_mrbayes_2001] is a program that uses the Markov Chain Monte Carlo\index{Markov Chain Monte Carlo} method to sample from the posterior probability distributions. Its output file annotates nodes and branches separately by two sets of square brackets. For example below, posterior clade probabilities for the node and branch length estimates for the branch:


```{r echo=F, eval=F, comment=NA}
f <- system.file("extdata/MrBayes/Gq_nxs.tre", package="treeio")
x <- readLines(f)
tr <- x[36]
tr <- gsub("\\[[^\\[]+(prob=\\d+\\.\\d)[^\\[]*\\]:", "[&\\1]:", tr) 
tr <- gsub("\\[[^\\[]+(length_mean=[^,]+)[^\\[]*\\]([,\\)]{1})", "[&\\1]\\2", tr)
tr <- gsub("\\[[^\\[]+(prob=\\d+\\.\\d)[^\\[]*\\]\\[", "[&\\1]\\[", tr)
tr <- gsub("\\[[^\\[]+(length_mean=[^,]+)[^\\[]*\\];", "[&\\1];", tr)

tr <- gsub("(\\[&length_mean=\\d+\\.\\d)\\d*(e-)0*([1-9]+)\\]", "[&\\1\\2\\3]", tr)
tr <- gsub("(\\[&length_mean=0)\\.0+e\\+0+\\]", "[&\\1]", tr)
tr <- gsub("e-0*", "e-", tr)
tr <- gsub("\\[&\\[&", "[&", tr)
tr <- gsub("(:\\d+\\.\\d{2})\\d*([e\\,\\[]{1})", "\\1\\2", tr)

#w <- 80
#ii <- 1:floor(nchar(tr)/w)* w
#start <- c(1, ii+1)
#end <- c(ii, nchar(tr))
#x <- c(x[1:35],
#      substring(tr, start, end),
#      x[37:length(x)])
#cat(paste0(x, collapse="\n"))
cat(str_wrap(tr, linewidth))
```

```{r echo=FALSE, comment=NA}
x <- c("  tree con_all_compat = [&U] (8[&prob=1.0]:2.94e-1[&length_mean=2.9e-1],",
  "10[&prob=1.0]:2.25e-1[&length_mean=2.2e-1],((((1[&prob=1.0]:1.43e-1",
  "[&length_mean=1.4e-1],2[&prob=1.0]:1.92e-1[&length_mean=1.9e-1])[&prob=1.0]:",
  "1.24e-1[&length_mean=1.2e-1],9[&prob=1.0]:2.27e-1[&length_mean=2.2e-1])",
  "[&prob=1.0]:1.72e-1[&length_mean=1.7e-1],12[&prob=1.0]:5.11e-1",
  "[&length_mean=5.1e-1])[&prob=1.0]:1.76e-1[&length_mean=1.7e-1],",
  "(((3[&prob=1.0]:5.46e-2[&length_mean=5.4e-2],(6[&prob=1.0]:1.03e-2",
  "[&length_mean=1.0e-2],7[&prob=1.0]:7.13e-3[&length_mean=7.2e-3])[&prob=1.0]:",
  "6.93e-2[&length_mean=6.9e-2])[&prob=1.0]:6.03e-2[&length_mean=6.0e-2],",
  "(4[&prob=1.0]:6.27e-2[&length_mean=6.2e-2],5[&prob=1.0]:6.31e-2",
  "[&length_mean=6.3e-2])[&prob=1.0]:6.07e-2[&length_mean=6.0e-2])[&prob=1.0]:,",
  "1.80e-1[&length_mean=1.8e-1]11[&prob=1.0]:2.37e-1[&length_mean=2.3e-1])",
  "[&prob=1.0]:4.05e-1[&length_mean=4.0e-1])[&prob=1.0]:1.16e+000",
  "[&length_mean=1.162699558201079e+000])[&prob=1.0][&length_mean=0];")

writeLines(x)
```


To save space, most of the inferences were removed and only contains `prob` for clade probability and `length_mean` for mean value of branch length. The full version of this file also contains `prob_stddev`, `prob_range`, `prob(percent)`, `prob+-sd` for probability inferences and `length_median`, `length_95%_HPD` for every branch.

The `r pkg_beast` and `r pkg_mrbayes` outputs are expected to be parsed without inferences (dropped as comments) by software that supports NEXUS. `r pkg_figtree` supports parsing `r pkg_beast`, and `r pkg_mrbayes` outputs with inferences that can be used to display or annotate on the tree. But from there, extracting these data for further analysis is still challenging.

`r pkg_hyphy`\index{HyPhy} [@pond_hyphy:_2005] could do a number of phylogenetic analyses, including ancestral sequence\index{ancestral sequences} reconstruction. For ancestral sequence reconstruction, these sequences and the Newick tree text are stored in NEXUS format as the major analysis output. It did not completely follow the NEXUS\index{NEXUS} definition and only put the ancestral node labels in `TAXA` instead of the external node label. The `MATRIX` block contains sequence alignment of ancestral nodes which cannot be referred back to the tree stored in the `TREES` block since it does not contain node labels. Here is the sample output (to save space, only the first 72bp of alignment are shown):


```{r echo=FALSE, warning=FALSE, comment=NA}
f <- system.file("extdata/HYPHY/ancseq.nex", package="treeio")
x <- readLines(f)
x[25:37] <- substring(x[25:37], 1, 73)
x[4] <- sub("on", "\n\ton",x[4])
x[11] <- sub("'Node15' ", "'Node15'\n\t\t\t", x[11]) 
cat(paste0(x, collapse="\n"))
```


There are other applications that output rich information text that also contains phylogenetic trees with associated data. For example `r pkg_r8s`\index{r8s} [@sanderson_r8s:_2003] output three trees in its log file, namely `TREE`, `RATE`, and `PHYLO` for branches scaled by time, substitution rate, and absolute substitutions, respectively.


Phylogenetic Analysis by Maximum Likelihood (`r pkg_paml`\index{PAML}) [@yang_paml_2007] is a package of programs for phylogenetic analyses of DNA or protein sequences. Two main programs, `r pkg_baseml` and `r pkg_codeml`, implement a variety of models. `r pkg_baseml` estimates tree topology, branch lengths, and substitution parameters using a number of nucleotide substitution models available, including JC69\index{JC69}, K80, F81, F84, HKY85\index{HKY85}, T92, TN93, and GTR\index{GTR}. `r pkg_codeml` estimates synonymous and non-synonymous substitution rates, likelihood ratio test of positive selection under codon substitution models [@goldman_codon-based_1994].

`r pkg_baseml`\index{BaseML} outputs *mlb* file that contains input sequence (taxa) alignment and a phylogenetic tree with branch length as well as substitution model and other parameters estimated. The supplementary result file, *rst*, contains sequence alignment (with ancestral sequence if it performs reconstruction of ancestral sequences) and Naive Empirical Bayes (NBE) probabilities that each site in the alignment evolved. `r pkg_codeml` outputs *mlc* file that contains tree structure and estimation of synonymous and non-synonymous substitution rates. `r pkg_codeml`\index{CodeML} also outputs a supplementary result file, *rst*, that is similar to `r pkg_baseml` except that site is defined as a codon instead of a nucleotide. Parsing these files can be tedious and would oftentimes need a number of post-processing steps and require expertise in programming (e.g., with Python^[<http://biopython.org/wiki/PAML>] or Perl^[<http://bioperl.org/howtos/PAML_HOWTO.html>]).


Introducing square brackets is quite common for storing extra information, including *RAxML* to store bootstrap value, NHX format for annotation, jplace for edge label, and `r pkg_beast` for evolutionary estimation, *etc*. But the positions to place square brackets are not consistent in different software and the contents employ different rules for storing associated data, these properties make it difficult to parse associated data. For most of the software, they will just ignore square brackets and only parse the tree structure if the file is compatible. Some of them contain invalid characters (e.g., curly braces in `tree` field of jplace format), and even the tree structure can't be parsed by standard parsers. 

It is difficult to extract useful phylogeny/taxon-related information from the different analysis outputs produced by various evolutionary inference software, for displaying on the same phylogenetic tree and for further analysis. `r pkg_figtree`\index{FigTree} supports `r pkg_beast` output, but not for most of the other software outputs that contain evolutionary inferences or associated data. For those output-rich text files (e.g., `r pkg_r8s`, `r pkg_paml`, *etc.*), the tree structure cannot be parsed by any tree viewing software and users need expertise to manually extract the phylogenetic tree and other useful result data from the output file. However, such manual operation is slow and error-prone.

<!-- can't be supported and hard to display on the tree. A very common solution is to modified node labels by putting associated data to node labels and dispaly them as node lables. This is quite restricted and error prone.--> 

It was not easy to retrieve phylogenetic trees with evolutionary data from different analysis outputs of commonly used software in the field. Some of them (*e.g.*, `r pkg_paml` output and jplace file) without software or programming library to support parsing the file, while others (*e.g.*, `r pkg_beast` and `r pkg_mrbayes` output) can be parsed without evolutionary inferences as they are stored in square brackets that will be omitted as a comment by most of the software. Although `r pkg_figtree` support visualizing evolutionary statistics inferred by `r pkg_beast` and `r pkg_mrbayes`, extracting these data for further analysis is not supported. Different software packages implement different algorithms for different analyses (*e.g.*, `r pkg_paml` for *d~N~/d~S~*, `r pkg_hyphy` for ancestral sequences, and `r pkg_beast` for skyline analysis). Therefore, in encountering the genomic sequence data, there is a desired need to efficiently and flexibly integrate different analysis inference results for comprehensive understanding, comparison, and further analysis. This motivated us to develop the programming library to parse the phylogenetic trees and data from various sources.

<!--
to complete each other or compare similar analyses from different software (*e.g.*, clade probability inferred from `BEAST` and `MrBayes`). Without programming library to parse these software outputs make it difficult to integrate evolutionary statistics for comparison and further analysis.
-->

## Getting Tree Data with `r Biocpkg("treeio")`

Phylogenetic trees are commonly used to present evolutionary relationships of
species. Information associated with taxon species/strains may be further
analyzed in the context of the evolutionary history depicted by the phylogenetic
tree. For example, the host information of the influenza virus strains in the tree
could be studied to understand the host range of a virus lineage. Moreover, such
meta-data (*e.g.*, isolation host, time, location, *etc.*) directly associated
with taxon strains are also often subjected to further evolutionary or
comparative phylogenetic models and analyses, to infer their dynamics associated
with the evolutionary or transmission processes of the virus. All these
meta-data or other phenotypic or experimental data are stored either as the
annotation data associated with the nodes or branches and are often produced in
inconsistent format by different analysis programs.

Getting trees into R is still limited. Newick and Nexus can be imported by
several packages, including `r CRANpkg("ape")`, `r CRANpkg("phylobase")`. NeXML
format can be parsed by `r CRANpkg("RNeXML")`. However, analysis results from
widely used software packages in this field are not well
supported. SIMMAP output can be parsed by `r CRANpkg("phyext2")` and `r CRANpkg("phytools")`.
Although `r pkg_phyloch` can
import `r pkg_beast` and `r pkg_mrbayes` output, only internal node attributes were parsed and
tip attributes were ignored^[<https://github.com/ropensci/software-review/issues/179#issuecomment-369164110>]. Many other software outputs are mainly required
programming expertise to import the tree with associated data. Linking external
data, including experimental and clinical data, to phylogeny is another obstacle
for evolution biologists.

To fill the gap that most of the tree formats or software outputs cannot be parsed within the same software/platform, an R package `r Biocpkg("treeio")` [@wang_treeio_2020] was developed for parsing various tree file formats and outputs from common evolutionary analysis software. The `r Biocpkg("treeio")` package is developed with the R programming language [@rstats]. Not only the tree structure can be parsed, but also the associated data and evolutionary inferences, including NHX
annotation, clock rate inferences (from `r pkg_beast` or `r pkg_r8s` [@sanderson_r8s:_2003] programs),
synonymous and non-synonymous substitutions (from `r pkg_codeml`), and ancestral
sequence construction (from
`r pkg_hyphy`, `r pkg_baseml` or `r pkg_codeml`), *etc.*.
Currently, `r Biocpkg("treeio")` is able to read
the following file formats: Newick, Nexus, New Hampshire eXtended format (NHX),
jplace and Phylip as well as the data outputs from the following analysis programs:
`r pkg_astral`, 
`r pkg_beast`,
`r pkg_epa`,
`r pkg_hyphy`,
`r pkg_mega`,
`r pkg_mrbayes`,
`r pkg_paml`,
`r pkg_phyldog`,
`r pkg_pplacer`,
`r pkg_r8s`,
`r pkg_raxml` and 
`r pkg_revbayes`, *etc*. 
This is made possible with the several parser functions developed in `r Biocpkg("treeio")` (Table \@ref(tab:treeio-function)) [@wang_treeio_2020].


```{r treeio-function, echo=F, message=FALSE}
ff <- tibble::tribble(
  ~`Parser function`, ~Description,
  "read.astral"			, "parsing output of ASTRAL",
  "read.beast"      	, "parsing output of BEAST",
  "read.codeml"     	, "parsing output of CodeML (rst and mlc files)",
  "read.codeml_mlc" 	, "parsing mlc file (output of CodeML)",
  "read.fasta"			, "parsing FASTA format sequence file",
  "read.hyphy"      	, "parsing output of HYPHY",
  "read.hyphy.seq"  	, "parsing ancestral sequences from HYPHY output",
  "read.iqtree"			, "parsing IQ-Tree Newick string, with the ability to parse SH-aLRT and UFBoot support values",
  "read.jplace"     	, "parsing jplace file including the output of EPA and pplacer",
  "read.jtree"			, "parsing [jtree](#write-jtree) format",
  "read.mega"			, "parsing MEGA Nexus output",
  "read.mega_tabular"	, "parsing MEGA tabular output",
  "read.mrbayes"    	, "parsing output of MrBayes",
  "read.newick"     	, "parsing Newick string, with the ability to parse node label as support values",
  "read.nexus"			, "parsing standard NEXUS file (re-exported from ape)",
  "read.nhx"        	, "parsing NHX file including the output of PHYLDOG and RevBayes",
  "read.paml_rst"   	, "parsing rst file (output of BaseML or CodeML)",
  "read.phylip"     	, "parsing phylip file (phylip alignment + Newick string)",
  "read.phylip.seq" 	, "parsing multiple sequence alignment from phylip file",
  "read.phylip.tree"	, "parsing newick string from phylip file",
  "read.phyloxml"		, "parsing phyloXML file",
  "read.r8s"        	, "parsing output of r8s",
  "read.raxml"      	, "parsing output of RAxML",
  "read.tree"			, "parsing newick string (re-exported from ape)"
  )

library(kableExtra)
knitr::kable(ff, caption = "Parser functions defined in treeio", booktabs = T) %>% 
  kable_styling(latex_options = c("striped", "scale_down"),
                bootstrap_options = c("striped", "hover")) #%>% landscape
```


The `r Biocpkg("treeio")` package defines base
classes and functions for phylogenetic tree input and output. It is an
infrastructure that enables evolutionary evidence inferred by commonly
used software packages to be used in `R`. For instance, *d~N~/d~S~* values or
ancestral sequences inferred
by `r pkg_codeml` [@yang_paml_2007],
clade support values (posterior) inferred
by `r pkg_beast` [@bouckaert_beast_2014] and short read placement
by `r pkg_epa` [@berger_performance_2011]
and `r pkg_pplacer` [@matsen_pplacer_2010]. These pieces of 
evolutionary evidence can be further analyzed in `R` and used to annotate
a phylogenetic tree using `r Biocpkg("ggtree")`
[@yu_ggtree:_2017]. The growth of analysis tools and models introduces
a challenge to integrate different varieties of data and analysis results from
different sources for an integral analysis on the same phylogenetic tree
background. The `r Biocpkg("treeio")` package [@wang_treeio_2020]
provides a `merge_tree` function to allow [combining tree data](#merge-tree) obtained from
different sources. In addition, `r Biocpkg("treeio")` also enables [external data](#link-external-data) to be linked to a phylogenetic tree structure.


After parsing, storage of the tree structure with associated data is made
through an S4 class, `treedata`, defined in the `r CRANpkg('tidytree')` package. These parsed data
are mapped to the tree branches and nodes inside `treedata` object, so that they
can be efficiently used to visually annotate the tree
using `r Biocpkg("ggtree")` [@yu_ggtree:_2017] and `r Biocpkg("ggtreeExtra")` [@ggtreeExtra_2021]. 
A programmable platform for phylogenetic data parsing, integration, and annotations as such makes us more easily to identify the evolutionary dynamics and correlation patterns (Figure \@ref(fig:treeioDiagram)) [@wang_treeio_2020].


(ref:treeioDiagramscap) Overview of the *treeio* package and its relations with *tidytree* and *ggtree*.

(ref:treeioDiagramcap) __Overview of the *treeio* package and its relations with *tidytree* and *ggtree*__. *Treeio* supports parsing a tree with data from a number of file formats and software outputs. A *treedata* object stores a phylogenetic tree with node/branch-associated data. *Treeio* provides several functions to manipulate a tree with data. Users can convert the *treedata* object into a tidy data frame (each row represents a node in the tree and each column represents a variable) and process the tree with data using the tidy interface implemented in *tidytree*. The tree can be extracted from the *treedata* object and exported to a Newick and NEXUS file or can be exported with associated data into a single file (either in the BEAST NEXUS or jtree format). Associated data stored in the *treedata* object can be used to annotate the tree using *ggtree*. In addition, *ggtree* supports a number of tree objects, including *phyloseq* for microbiome data and *obkData* for outbreak data. The *phylo*, multiPhylo (*ape* package), *phylo4*, *phylo4d* (*phylobase* package), *phylog* (*ade4* package), *phyloseq* (*phyloseq* package), and *obkData* (*OutbreakTools* package) are tree objects defined by the R community to store tree with or without domain-specific data. All these tree objects as well as hierarchical clustering results (*e.g.*, *hclust* and *dendrogram* objects) are supported by *ggtree*.

```{r treeioDiagram, out.width="95%", echo=FALSE, fig.cap="(ref:treeioDiagramcap)", fig.scap="(ref:treeioDiagramscap)"}
knitr::include_graphics("img/treeio-diagram.png", auto_pdf = TRUE)
```


### Overview of `r Biocpkg("treeio")`

The `r Biocpkg("treeio")`\index{treeio} package [@wang_treeio_2020] defined `S4` classes for storing phylogenetic trees with diverse types of associated data or covariates from different sources including analysis outputs from different software packages. It also defined corresponding parser functions for parsing phylogenetic trees with annotation data and stored them as data objects in R for further manipulation or analysis (see Table \@ref(tab:treeio-function)). Several accessor functions were defined to facilitate accessing the tree annotation data, including `get.fields` for obtaining annotation features available in the tree object, `get.placements` for obtaining the phylogenetic placement results (*i.e.*, the output of `r pkg_pplacer`, `r pkg_epa`, *etc.*), `get.subs` for obtaining the genetic substitutions from parent node to child node, and `get.tipseq` for getting the tip sequences. 


The `S3` class, `phylo`, which was defined in `r CRANpkg("ape")`\index{ape} [@paradis_ape_2004] package, is widely used in `R` community and many packages. As `r Biocpkg("treeio")` uses `S4` class, to enable those available R packages to analyze the tree imported by `r Biocpkg("treeio")`, `r Biocpkg("treeio")` provides `as.phylo` function to convert `r Biocpkg("treeio")`-generated tree object to `phylo` object that only contains tree structure without annotation data. In the other way, `r Biocpkg("treeio")` also provides `as.treedata` function to convert `phylo` object with evolutionary analysis result (*e.g.*, bootstrap values calculated by `r CRANpkg("ape")` or ancestral states inferred by `r CRANpkg("phangorn")`\index{phangorn} [@schliep_phangorn_2011] *etc.*) to be stored as a `treedata` `S4` object, making it easy to map the data to the tree structure and to be visualized using `r Biocpkg("ggtree")` [@yu_ggtree:_2017].


<!--
(*e.g.*, `phangorn`  for ancestral state reconstruction and `ape` for divergence time estimation) that implemented evolutionary inferences are supporting `phylo` object. 
-->

To allow integration of different kinds of data in a phylogenetic tree, `r Biocpkg("treeio")` [@wang_treeio_2020] provides `merge_tree` function (details in [section 2.2.1](#merge-tree)) for combining evolutionary statistics/evidence imported from different sources including those common tree files and outputs from analysis programs (Table \@ref(tab:treeio-function)). There is other information, such as sampling location, taxonomy information, experimental result, and evolutionary traits, *etc.* that are stored in separate files with the user-defined format. In `r Biocpkg("treeio")`, we could read in these data from the users' files using standard R *IO* functions, and attach them to the tree object by the [full_join](#link-external-data) methods defined in `r CRANpkg("tidytree")` and `r Biocpkg("treeio")` packages (see also the [`%<+%` operator](#attach-operator) defined in `r Biocpkg("ggtree")`). After attaching, the data will become the attributes associated with nodes or branches, which can be compared with other data incorporated or can be visually displayed on the tree [@yu_two_2018].

To facilitate storing complex data associated with the phylogenetic tree, `r Biocpkg("treeio")`\index{treeio} implemented `write.beast` and `write.jtree` functions to export a `treedata` object into a single file (see [Chapter 3](#chapter3)). 

<!-- A full list of functions that defined in *treeio* can be found in Table \@ref(tab:treeio). -->

### Function demonstration

#### Parsing BEAST output


```{r echo=FALSE}
options(show_data_for_treedata = FALSE)
```

```{r}
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
```

Since _`%`_ is not a valid character in _`names`_, all the feature names that contain _`x%`_ will convert to _`0.x`_. For example, _`length_95%_HPD`_ will be changed to _`length_0.95_HPD`_.

Not only tree structure but also all the features inferred by `r pkg_beast`\index{BEAST} will be stored in the `S4` object. These features can be used for tree annotation (Figure \@ref(fig:beast)). 

#### Parsing MEGA output {#mega}

Molecular Evolutionary Genetics Analysis (`r pkg_mega`)\index{MEGA} software [@kumar_mega7_2016] supports exporting trees in three distinct formats: Newick, tabular, and Nexus. The Newick file can be parsed using the `read.tree` or `read.newick` functions. MEGA Nexus file is similar to BEAST Nexus and `r Biocpkg("treeio")` [@wang_treeio_2020] provides `read.mega` function to parse the tree.

```{r}
file <- system.file("extdata/MEGA7", "mtCDNA_timetree.nex", 
                    package = "treeio")
read.mega(file)
```

The tabular output contains tree and associated information (divergence time in this example) in a tabular flat text file. The `read.mega_tabular` function can parse the tree with data simultaneously.


```{r}
file <- system.file("extdata/MEGA7", "mtCDNA_timetree_tabular.txt", 
                    package = "treeio")
read.mega_tabular(file) 
```


#### Parsing MrBayes output


Although the Nexus file generated by `r pkg_mrbayes`\index{MrBayes} is different from the output of `r pkg_beast`, they are similar. The `r Biocpkg("treeio")` package provides the `read.mrbayes()` which internally calls `read.beast()` to parse `r pkg_mrbayes` outputs.

```{r}
file <- system.file("extdata/MrBayes", "Gq_nxs.tre", package="treeio")
read.mrbayes(file)
```

#### Parsing PAML output

Phylogenetic Analysis by Maximum Likelihood (`r pkg_paml`)\index{PAML} is a package of tools for phylogenetic analyses of DNA and protein sequences using maximum likelihood. Tree search algorithms are implemented in `r pkg_baseml` and `r pkg_codeml`. The `read.paml_rst()` function provided in `r Biocpkg("treeio")` can parse *rst* file from `r pkg_baseml` and `r pkg_codeml`. The only difference is the space in the sequences.
For `r pkg_baseml`, every ten bases are separated by one space, while for `r pkg_codeml`, every three bases (triplet) are separated by one space.

```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"}
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
```

Similarly, we can parse the *rst* file from `r pkg_codeml`.

```{r}
crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio")
## type can be one of "Marginal" or "Joint"
crst <- read.paml_rst(crstfile, type = "Joint")
crst
```


Ancestral sequences inferred by `r pkg_baseml` or `r pkg_codeml` via marginal or
joint ML reconstruction methods will be stored in the S4 object and mapped to
tree nodes. `r Biocpkg("treeio")` [@wang_treeio_2020] will automatically determine the substitutions between the
sequences at both ends of each branch. The amino acid substitution will also be
determined by translating nucleotide sequences to amino acid sequences. These
computed substitutions will also be stored in the S4 object for efficient tree annotation later (Figure \@ref(fig:codeml)).

`r pkg_codeml` infers selection
pressure and estimated *d~N~/d~S~*, *d~N~* and *d~S~*. These pieces of information are
stored in output file *mlc*, which can be parsed by the `read.codeml_mlc()` function.


```{r}
mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc <- read.codeml_mlc(mlcfile)
mlc
```

The *rst* and *mlc* files can be parsed separately as demonstrated previously, they
can also be parsed together using the `read.codeml()` function.


```{r}
## tree can be one of "rst" or "mlc" to specify
## using tree from which file as base tree in the object
ml <- read.codeml(crstfile, mlcfile, tree = "mlc")
ml
```

All the features in both *rst* and *mlc* files are imported into a single S4
object and hence are available for further annotation and visualization. For
example, we can annotate and display both *d~N~/d~S~* (from *mlc* file) and
amino acid substitutions (derived from *rst* file) on the same phylogenetic tree [@yu_ggtree:_2017].


#### Parsing HyPhy output

Hypothesis testing using Phylogenies (`r pkg_hyphy`)\index{HYPHY} is a software package for analyzing genetic sequences. Ancestral sequences inferred by `r pkg_hyphy` are
stored in the Nexus output file, which contains the tree topology and ancestral
sequences. To parse this data file, users can use the `read.hyphy.seq()` function.


```{r warning=FALSE}
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio")
read.hyphy.seq(ancseq)
```

To map the sequences on the tree, users should also provide an
internal-node-labeled tree. If users want to determine substitution, they need
to also provide tip sequences. In this case, substitutions will be determined automatically, just as we parse the output of `r pkg_codeml`.

```{r warning=FALSE}
nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
hy
```


#### Parsing r8s output

The `r pkg_r8s` package uses parametric, semi-parametric, and
non-parametric methods to relax the molecular clock to allow better estimations of
divergence times and evolution rates [@sanderson_r8s:_2003]. It outputs three
trees in a log file, namely, *TREE*, *RATO*, and *PHYLO* for time tree, rate tree,
and absolute substitution tree, respectively.

The time tree is scaled by divergence time, the rate tree is scaled by substitution rate
and the absolute substitution tree is scaled by the absolute number of substitutions.
After parsing the file, all these three trees are stored in a *multiPhylo* object (Figure \@ref(fig:multiPhylo)).


```{r fig.width=4, fig.height=6, width=60, warning=FALSE, fig.align="center"}
r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio"))
r8s
```


#### Parsing output of RAxML bootstrap analysis

`r pkg_raxml` bootstrapping analysis
outputs a Newick tree text that is not standard, as it stores bootstrap values
inside square brackets after branch lengths. This file usually cannot be parsed
by a traditional Newick parser, such as `ape::read.tree()`. The function
`read.raxml()` can read such files and store the bootstrap as an additional
feature, which can be used to display on the tree or used to color tree
branches, *etc.*


```{r fig.width=12, fig.height=10, width=60, warning=FALSE, fig.align="center"}
raxml_file <- system.file("extdata/RAxML", 
                          "RAxML_bipartitionsBranchLabels.H3", 
                          package="treeio")
raxml <- read.raxml(raxml_file)
raxml
```


#### Parsing NHX tree

NHX (New Hampshire eXtended) format is an extension of Newick by introducing NHX\index{NHX}
tags. NHX is commonly used in phylogenetics software,
including
`r pkg_phyldog` [@boussau_genome-scale_2013],
`r pkg_revbayes` [@hohna_probabilistic_2014],
for storing statistical inferences. The following codes imported an NHX tree with
associated data inferred by `r pkg_phyldog` (Figure \@ref(fig:beastFigtree)A).

```{r}
nhxfile <- system.file("extdata/NHX", "phyldog.nhx", package="treeio")
nhx <- read.nhx(nhxfile)
nhx
```

#### Parsing Phylip tree

Phylip format contains multiple sequence alignment of taxa in Phylip sequence
format with corresponding Newick tree text that was built from taxon sequences.
Multiple sequence alignment can be sorted based on the tree structure and displayed at
the right-hand side of the tree using `r Biocpkg("ggtree")` through the [`msaplot()` function](#msaplot) or in combining with the `r Biocpkg("ggmsa")` package (see also Basic Protocol 5 of [@yu_cp_2020]).

```{r}
phyfile <- system.file("extdata", "sample.phy", package="treeio")
phylip <- read.phylip(phyfile)
phylip
```

#### Parsing EPA and pplacer output

`r pkg_epa` [@berger_performance_2011] and `r pkg_pplacer` [@matsen_pplacer_2010] have a common output file format, `jplace`, which can be
parsed by the `read.jplace()` function.

```{r}
jpf <- system.file("extdata/EPA.jplace",  package="treeio")
jp <- read.jplace(jpf)
print(jp)
```

The number of evolutionary placement on each branch will be calculated and
stored as the *nplace* feature, which can be mapped to line size and/or color
using `r Biocpkg("ggtree")` [@yu_ggtree:_2017].


#### Parsing jtree format{#jtree}

The *jtree* is a JSON-based format that was defined in
this `r Biocpkg("treeio")` package [@wang_treeio_2020] to support tree
data interchange (see [session 3.3](#write-jtree)).
Phylogenetic tree with associated data can be exported to a single *jtree*
file using the `write.jtree()` function. The *jtree* can be easily parsed using any
JSON parser. The *jtree* format contains three keys: tree, data, and meta-data.
The tree value contains tree text extended from Newick tree format by putting
the edge number in curly braces after branch length. The data value contains
node/branch-specific data, while the meta-data value contains additional meta information.


```{r}
jtree_file <- tempfile(fileext = '.jtree')
write.jtree(beast, file = jtree_file)
read.jtree(file = jtree_file)
```


### Converting other tree-like objects to `phylo` or `treedata` objects {#as-treedata}


To extend the application scopes of `r Biocpkg("treeio")`, `r CRANpkg("tidytree")` and `r Biocpkg("ggtree")`, `r Biocpkg("treeio")` [@wang_treeio_2020] provides several `as.phylo` and `as.treedata` methods to convert other tree-like objects, such as `phylo4d` and `pml`, to `phylo` or `treedata` object. So that users can easily [map associated data to the tree structure](#data-integration), [export a tree with/without data to a single file](#chapter3), [manipulate](#chapter2) and [visualize a tree](#chapter4) with/without data. These convert functions (Table \@ref(tab:treeio-treedata-object)) create the possibility of using `r CRANpkg("tidytree")` to process tree using tidy interface and `r Biocpkg("ggtree")` to visualize tree using the grammar of graphic syntax.


```{r treeio-treedata-object, echo=F, message=FALSE}
ff <- tibble::tribble(
  ~`Convert function`, ~`Supported object`, ~Description,
  "as.phylo", 		"ggtree", 	"convert ggtree object to phylo object",
  "as.phylo", 		"igraph",	"convert igraph object (only tree graph supported) to phylo object",
  "as.phylo", 		"phylo4",	"convert phylo4 object to phylo object",
  "as.phylo", 		"pvclust",	"convert pvclust object to phylo object",
  "as.phylo",		"treedata",	"convert treedata object to phylo object",
  "as.treedata",	"ggtree",	"convert ggtree object to treedata object",
  "as.treedata",	"phylo4",	"convert phylo4 object to treedata object",
  "as.treedata",	"phylo4d",	"convert phylo4d object to treedata object",
  "as.treedata",	"pml",		"convert pml object to treedata object",
  "as.treedata",	"pvclust",	"convert pvclust object to treedata object",
  )


require(kableExtra)
caption = "Conversion of tree-like object to phylo or treedata object"

knitr::kable(ff, caption=caption, booktabs = T) %>% 
  collapse_rows(columns = 1, latex_hline = "major", valign ="top") %>%
  kable_styling(latex_options = c("striped", "scale_down"),
                bootstrap_options = c("striped", "hover")) #%>% landscape
```

Here, we used `pml` object which was defined in the `r CRANpkg("phangorn")` package\index{phangorn}, as an example. The `pml()` function computes the likelihood of a phylogenetic tree given a sequence alignment and a model and the `optim.pml()` function optimizes different model parameters. The output is a `pml` object, and it can be converted to a `treedata` object using `as.treedata` provided by `r Biocpkg("treeio")` [@wang_treeio_2020]. The amino acid substitution (ancestral sequence estimated by `pml`) that stored in the `treedata` object can be visualized using `r Biocpkg("ggtree")` as demonstrated in Figure \@ref(fig:pmlTree).


(ref:pmlTreescap) Converting `pml` object to `treedata` object. 

(ref:pmlTreecap) **Converting `pml` object to `treedata` object.** This allows using `r CRANpkg("tidytree")` to process the tree data as well as using `r Biocpkg("ggtree")` and `r Biocpkg("ggtreeExtra")` to visualize the tree with associated data.

```{r pmlTree, fig.width=14, fig.height=14, fig.cap="(ref:pmlTreecap)", fig.scap="(ref:pmlTreescap)", message=FALSE, out.width="100%"}
library(phangorn)
treefile <- system.file("extdata", "pa.nwk", package="treeio")
tre <- read.tree(treefile)
tipseqfile <- system.file("extdata", "pa.fas", package="treeio")
tipseq <- read.phyDat(tipseqfile,format="fasta")
fit <- pml(tre, tipseq, k=4)
fit <- optim.pml(fit, optNni=FALSE, optBf=T, optQ=T,
                 optInv=T, optGamma=T, optEdge=TRUE,
                 optRooted=FALSE, model = "GTR",
				 control = pml.control(trace =0))

pmltree <- as.treedata(fit)
ggtree(pmltree) + geom_text(aes(x=branch, label=AA_subs, vjust=-.5))
```

### Getting information from `treedata` object {#get-treedata-data}

After the tree was imported, users may want to extract information stored
in the `treedata` object. `r Biocpkg("treeio")` provides several accessor
methods to extract tree structure, features/attributes that stored in the object,
and their corresponding values.

The `get.tree()` or `as.phylo()` methods can convert the `treedata` object to a
`phylo` object which is the fundamental tree object in the R community and
many packages work with `phylo` object.

```{r eval=FALSE}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
# or get.tree
as.phylo(beast_tree)
```

```{r eval=TRUE}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
# or get.tree
print(as.phylo(beast_tree), printlen=3)
```


The `get.fields` method returns a vector of features/attributes stored in
the object and associated with the phylogeny.

```{r}
get.fields(beast_tree)
```

The `get.data` method returns a tibble of all the associated data.


```{r}
get.data(beast_tree)
```

If users are only interested in a subset of the features/attributes returned by
`get.fields`, they can extract the information from the output of `get.data` or
directly subset the data by `[` or `[[`.

```{r}
beast_tree[, c("node", "height")]
head(beast_tree[["height_median"]])
```

## Summary {#summary1}


Software tools for inferring molecular evolution (*e.g.*, ancestral states, molecular dating\index{molecular dating} and selection pressure, *etc.*) are proliferating, but there is no single data format that is used by all different programs and capable to store different types of phylogenetic data. Most of the software packages have their unique output formats and these formats are not compatible with each other. Parsing software outputs is challenging, which restricts the joint analysis using different tools. The `r Biocpkg("treeio")` package [@wang_treeio_2020] provides a set of functions (Table \@ref(tab:treeio-function)) for parsing various types of phylogenetic data files and a set of converters (Table \@ref(tab:treeio-treedata-object)) to convert tree-like objects to phylo or treedata objects. These phylogenetic data can be integrated that allow further exploration and comparison. To date, most software tools in the field of molecular evolution are isolated and often not fully compatible with each other's input and output files. These software tools are designed to do their analysis and the outputs are often not readable in other software. No tools have been designed to unify the inference data from different analysis programs. Efficient incorporation of data from different inference methods can enhance the comparison and understanding of the study target, which may help discover new systematic patterns and generate new hypotheses.


As phylogenetic trees are growing in their application to identify patterns in an evolutionary context, more different disciplines are employing phylogenetic trees in their research. For example, spatial ecologists may map the geographical positions of the organisms to their phylogenetic trees to understand the biogeography of the species [@schon_age_2015]; disease epidemiologists may incorporate the pathogen sampling time and locations into the phylogenetic analysis to infer the disease transmission dynamics in spatiotemporal space [@he_emergence_2013]; microbiologists may determine the pathogenicity of different pathogen\index{pathogen} strains and map them into their phylogenetic trees to identify the genetic determinants of the pathogenicity [@bosi_comparative_2016]; genomic scientists may use the phylogenetic trees to help taxonomically classify their metagenomic sequence data [@gupta_using_2015]. A robust tool such as `r Biocpkg("treeio")` to import and map different types of data into the phylogenetic tree is important to facilitate these phylogenetics-related research, or `r squote("phylodynamics")`. Such tools could also help integrate different meta-data (time, geography, genotype\index{genotype}, epidemiological information) and analysis results (selective pressure, evolutionary rates) at the highest level and provide a comprehensive understanding of the study organisms. In the field of influenza research, there have been such attempts of studying the phylodynamics\index{phylodynamics} of the influenza virus by mapping different meta-data and analysis results on the same phylogenetic tree and evolutionary timescale [@lam_dissemination_2015].