Tree handling module
ETE's main module allows to read and render trees using the two most common formats: New Hampshire (NH) and New Hampshire eXtended (NHX). Moreover, it allows to generate random trees or create custom tree structures from scratch. In order to increase compatibility with other tools, several newick format standards are currently supported by ETE, both for reading and writing trees (see ETE's extended documentation). ETE's trees are internally encoded as a series of tree node instances connected following a parent-child relationship. Each node is encoded as an independent Python object, which provides many methods to manipulate its connections (i.e. add, remove, delete or detach nodes) and to easily browse its topology (i.e. tree traversal and get terminal, children, sister or descendant nodes). As a consequence, each internal node is treated as a fully featured subtree, thus allowing to analyze different parts of trees separately. ETE's tree object implementation supports multifurcations and can be used to deal with very large structures. As a reference, the NCBI taxonomy newick tree file, with more than 450.000 nodes, could be loaded as an ETE tree object in 40 seconds (Intel Xeon CPU 2.33 GHz).
One of the main advantages of ETE as compared to other toolkits is that nodes can contain additional information other than topology and branch distances. Users are free to associate any external data to the different tree nodes and then use such data in the subsequent analyses, integrate them as part of their tree images or save them using the NHX format. This functionality provides the possibility of creating fully annotated trees.
Finally, a complete set of operations are available to browse, analyze or modify trees: i.e.) pre- and post-order traversing strategies, search nodes or partitions matching specific criteria, midpoint and species-guided automatic tree rooting, calculate distances between branches, detection of common ancestors, random tree building, topology manipulation, tree pruning, tree concatenation, cut and paste partitions, etc.
Tree rendering and visualization
The visualization module provides a programmable drawing system to render hierarchical tree structures as dendrograms. The core of this extension consists of an image drawing algorithm that can be controlled by a set of user-defined rules. Such rules can be written as small python functions and are used to determine, dynamically, the aspect of each tree node. This allows, for instance, to vary the aspect of each node according to its internal properties. Moreover, the rendering engine allows not only to draw tree topologies but also to incorporate external graphical information to each node. Thus, external images, graphs or custom text labels can be added to nodes as part of the general tree image. The programatic use of the tree drawing module allows users to control how trees are rendered and what information is shown for the different nodes. Images can be rendered as PNG or PDF files.
In addition to the drawing engine, trees can be interactively visualized using a built-in Graphical User Interface (GUI), which allows on the fly manipulation of trees. Thus, each tree node has its own displaying method that can be used to start the visualization of its specific topology. The GUI is fully integrated with other ETE's features and allows to interact with tree nodes and their properties in a graphical way. Both the GUI and the rendering engine use the last Qt4 programming libraries to increase the performance of large images visualization. Qt4 is available for GNU/Linux, MacOS and Windows, and it is distributed as free software. A complete description of the usage of this module, as well as examples, can be found within the ETE tutorial at
http://ete.cgenomics.org
.
Phylogenetic extension
Phylogenetic trees are the result of most evolutionary analyses. ETE's phylogenetic extension implements a special type of tree instances which are focused on the analysis of evolutionary trees. Thus, tree leafs are considered OTUs, while internal nodes are considered their ancestors. By applying different strategies, ETE can automatically assign internal nodes to speciation and duplication events, thereby establishing orthology and paralogy relationships between the OTUs [
21
]. Two built-in methods are available to predict speciation and duplication events: a species-overlap algorithm that is independent of the species tree [
15
] and the classical gene-tree/species-tree reconciliation algorithm [
22
]. Both methods return a list of orthology and paralogy predictions and annotate the original tree nodes according to the detected evolutionary events. Additionally, nodes predicted as gene duplication events can be dated using the topology scanning method described in [
15
]. Furthermore, molecular phylogenies can be associated with their corresponding multiple sequence alignments, thus establishing a link between each tree leaf and its sequence. This allows the visualization of aligned sequences together with the tree topology, which helps to detect, for instance, linage-specific changes, specific domain composition or conserved regions. As sequences are considered an additional property of tree leafs, they can be combined with the tree browsing capabilities to program analyses such as detecting clades with certain level of sequence conservation or composition. This extension is fully integrated with the drawing module, and includes a pre-defined visualization layout to explore trees together with its predicted evolutionary events and the sequences associated to nodes (Figure
1
). Other features such as checking the monophyly of nodes, the automatic detection of species names, or the guided selection of outgroups are also available.
PhylomeDB API
PhylomeDB [
16
]
http://phylomedb.org
is a database for complete collections of gene phylogenies (phylomes). It currently stores more than 420,000 trees and 120,000 multiple sequence alignments reconstructed using a high-quality phylogenetic pipeline, which includes Maximum Likelihood or Bayesian tree inference, alignment trimming and evolutionary model testing.
ETE provides a complete Application User Interface (API) for accessing phylomeDB. Throughout this API, users can connect to the phylomeDB database and search for pre-computed gene phylogenies, download complete phylomes or obtain the orthology and paralogy predictions provided by the database. The phylomeDB API is fully integrated with the ETE software and trees can be automatically downloaded as phylogenetic tree instances.
Microarray clustering trees
Microarray expression data are usually encoded as matrices in which rows represent the expression profile of genes across different conditions (columns). A variety of clustering analyses are used to group genes that respond coordinately to a given set of conditions or, conversely, to group conditions according to their gene expression profiles. In such cases, genes are generally represented by the terminal tree nodes whereas internal nodes represent different levels of similarity among the expression profiles of grouped genes.
ETE's clustering extension can be used to import microarray clustering results and link them to their corresponding gene expression patterns. Thus, the expression profile of any internal tree node (cluster) can be calculated as the mean expression pattern of the grouped leaves (usually representing genes). This allows, for instance, to automate the detection of co-expressed genes in large datasets or to find nodes matching certain expression patterns. Moreover, several clustering validation techniques are implemented that allow to evaluate the goodness of fit of the different tree partitions. Currently, inter- and intra-cluster distances, cluster standard deviation, the Dunn index [
23
] and the cluster Silhouette index [
24
] can be calculated. Similarly to the phylogenetic extension, several visualization layouts are also provided for displaying clusters together with their associated profiles (Figure
2
).
This extension can be further used to handle any other type of trees associated to numerical matrices, such as phylogenetic profiles or protein classification results.