1. Introduction
Since the advent of molecular phylogenetics, it has been recognized that birds can be reliably separated into three clades: Paleognathae (ratites and tinamous), Galloanserae (landfowl and waterfowl), and Neoaves (all other birds, representing about 95% of all extant species). The base of Neoaves is one of the most difficult problems in phylogenetics (reviewed by [
1
]). It has long been clear that Neoaves underwent an extremely rapid radiation [
2
,
3
,
4
], probably close in time to the K-Pg (Cretaceous-Paleogene) mass extinction (reviewed by Field et al. [
5
]). Many studies using large sequence datasets [
6
,
7
,
8
,
9
,
10
,
11
,
12
,
13
,
14
,
15
,
16
] have corroborated many clades within Neoaves, but some relationships among these clades deep in the bird tree remain surprisingly recalcitrant to resolution. Reddy et al. [
13
] suggested that Neoaves should be viewed as a radiation of ten major clades, seven clades that comprise multiple orders (“the magnificent seven”) and three “orphan orders.” Independently, Suh [
17
] highlighted a virtually identical set of major clades. Yet even in the phylogenomic era, the inferred relationships among these ten clades differ among studies, confounding our ability to understand the early evolution of birds.
One explanation for the differences among studies is taxon sampling. Prum et al. [
12
] suggest that their results, using 200 species, differed from those of Jarvis et al. [
11
], which only included 48 species, due to denser taxon sampling. However, Reddy et al. [
13
] analyzed a slightly larger number of species than Prum et al. [
12
] and they recovered a tree with similarities to the primary Jarvis et al. [
11
] tree (which they called the “TENT”). This suggests that the differences among studies reflect data type effects rather than taxon sampling alone. The use of large-scale (“phylogenomic”) datasets to examine relationships among organisms has revealed cases where analyses of different data types (e.g., coding versus non-coding data) yield different tree topologies [
11
,
13
,
18
,
19
,
20
,
21
,
22
,
23
]. Some data type effects are strong enough that the tree topology based on one data type can be rejected in analyses using the other data type. Examples of data type effects involving distinct sources of genomic information include the different topological signals that emerge in analyses of coding vs. non-coding data [
13
,
19
], sites in different protein structural environments [
24
], or proteins with distinct functions [
22
]. While some conflicts may be due to analyses of small data matrices (limited sampling of either taxa or loci) that lack the power to confidently resolve relationships, alternative topologies due to different data types can remain even when large data sets are analyzed (systematic data type effects). These systematic data type effects represent a fundamental challenge for phylogenomic studies.
While recovery of the magnificent seven is independent of data type [
11
,
13
], the relationships among the magnificent seven and the orphan orders exhibit substantial variation. This does not mean that all non-coding datasets, or all coding datasets, have yielded absolutely identical topologies; instead, the non-coding and coding topologies represent parts of tree space that share certain features (
Figure 1
). The most prominent feature of non-coding trees is that clade VI (doves, mesites, and sandgrouse) and clade VII (flamingos and grebes) are sister to all other Neoaves (called Passerea by Jarvis et al. [
11
]) with clades VI and VII either united or as successive sister groups of Passerea (
Figure 1
A). In contrast, analyses of large coding datasets (
Figure 1
B) have tended to yield trees with “clade P1” [
13
], which comprises all Neoaves except clade V (nightjars, hummingbirds, swifts, and allies). Coding exon trees may also include an “extended waterbird clade” (Aequorlitornithes sensu Prum et al. [
12
]). Beyond these features, which are present in many (but not all) trees based on each data type, clustering trees using topological distances separates those trees into coding and non-coding groups (cf. Figure 8 in Reddy et al. [
13
]).
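The idea of comparing trees by topological distance can be illustrated with a minimal sketch. It assumes a rooted Robinson-Foulds distance (the count of clades found in one tree but not the other), which is only one of several possible topological distances and is not necessarily the measure used in the cited clustering analysis. The nested-tuple tree encoding, the function names, and the three toy topologies are hypothetical and chosen for illustration; they are not published trees.

```python
def clades(tree):
    """Return every clade (as a frozenset of leaf names) in a rooted tree.

    The tree is encoded as nested tuples with string leaves,
    e.g. (("VI", "VII"), ("V", ("I", "II"))). This encoding is an
    assumption made for this sketch.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


# Toy topologies only (hypothetical, heavily simplified to five clades).
noncoding_a = (("VI", "VII"), ("V", ("I", "II")))        # VI + VII united
noncoding_b = ("VI", ("VII", ("V", ("I", "II"))))        # VI, VII successive
coding_like = ("V", (("VI", "I"), ("VII", "II")))        # V sister to the rest

print(rf_distance(noncoding_a, noncoding_b))  # small distance
print(rf_distance(noncoding_a, coding_like))  # larger distance
```

A matrix of such pairwise distances could then be passed to any standard clustering method; trees of the same data type would be expected to fall close together if the data type effect is real.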
It is perhaps telling that the two most important unresolved questions regarding the phylogeny of extant birds identified by Pittman et al. [
25
] appear to reflect issues of data type. Specifically, Pittman et al. [
25
] asked: (1) “Which clade is the sister taxon to the rest of Neoaves?”; and (2) “Are most aquatic avian lineages part of a monophyletic aquatic radiation?” Although we believe that these are the most important issues for the data type effects hypothesis, they do not represent all potential cases where data type might have an influence on the phylogeny of Neoaves. For example, coding data tend to place at least some raptorial landbird lineages sister to the other core landbirds (clade I) [
1
]. The Prum et al. [
12
] tree, largely based on coding data, united clades IV and VI in a clade they called Columbaves (
Figure 1
B), but this grouping was not perfectly congruent with the relevant Jarvis topology, and Kuhl et al. [
16
] recovered Columbaves in their non-coding tree, prompting us to exclude it from the coding indicator clades. Reddy et al. [
13
] defined another potential non-coding indicator clade, which they called clade J3
N
; this clade comprises clades I and III (
Figure 1
A). We excluded clade J3
N
from our set of indicator clades because it was not present in the Kuhl et al. [
16
] non-coding tree. Despite the challenges associated with defining data type indicator clades for Neoaves, it seems clear that Passerea vs. clade P1 and the extended waterbirds are likely to be robust indicators.
Figure 1.
Consensus topologies for Neoaves emphasizing the “magnificent seven” and the “indicator clades” that differ in trees resulting from analyses of (
A
) non-coding vs. (
B
) coding data. The primary non-coding indicator clade is Passerea. The coding exon indicator clades are clade P1 (all Neoaves except clade V), the extended waterbird clade, and a paraphyletic assemblage of raptors sister to the other landbirds. The thin dashed lines highlight potential indicator clades that we view as uncertain at this time (see text). Magnificent seven clade names: I = “core landbirds” or Telluraves [
26
]; II = “core waterbirds” or Aequornithes [
27
]; III = Phaethontimorphae [
11
]; IV = Otidimorphae [
11
]; V = Strisores [
28
]; VI = Columbimorphae [
11
]; VII = Mirandornithes [
27
].
Although Reddy et al. [
13
] defined and examined the most important data type indicator clades, that study did have some limitations. The taxon sample that Reddy et al. [
13
] used was similar in size to the Prum et al. [
12
] taxon sample, but the distribution of taxa across the avian tree of life differed between those two datasets. At least some of the benefits of increased taxon sampling are thought to reflect the subdivision of long branches when taxa are added [
29
,
30
] and the Prum et al. [
12
] and Reddy et al. [
13
] studies probably broke up different long branches due to the inclusion of different taxa in each study. Additionally, the two studies used different loci throughout the genome. Given that different parts of the genome have different evolutionary histories, there could be localized biases [
8
,
31
] that may affect one or both of these datasets, but likely in different ways due to the different sampling across the genome. Finally, Reddy et al. [
13
] considered the Prum et al. [
12
] tree to represent a coding tree, yet that data matrix included almost 20% non-coding data (introns, untranslated regions (UTRs), and intergenic regions). Thus, the “Prum tree” is not a coding exon tree in the strict sense. A better approach to testing the data type effects hypothesis is to use the same species and loci in both the coding and non-coding analyses, but ideally with improved taxon sampling over the 48 species included in Jarvis et al. [
11
]. Given that the Prum data matrix includes both data types for most loci, subdividing this dataset allows the Reddy et al. [
13
] data type effects hypothesis to be tested in a direct manner. If that hypothesis is correct, we predict that:
- (1)
Analyses of the non-coding subset of the Prum et al. [
12
] data matrix will yield trees with a “non-coding-type” topology (
Figure 1
A).
- (2)
Analyses of the coding subset of the Prum et al. [
12
] data matrix will yield trees with a “coding-type” topology (
Figure 1
B).
The basis for the first prediction is straightforward, but the second may seem trivial. After all, it seems likely that analysis of the Prum coding subset will yield a tree similar to the published Prum tree because the complete matrix is roughly 80% coding data. However, it is possible that excluding the non-coding data could alter the topology in various ways. Thus, it is important to examine both predictions empirically.
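Subdividing a concatenated matrix by data type can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to prepare the actual data: the alignment is assumed to be a dictionary of taxon name to aligned sequence, and each partition is assumed to be a tuple of (name, data type, start, end) with zero-based, end-exclusive coordinates. The function name, partition labels, and toy sequences are hypothetical.

```python
def subset_by_type(alignment, partitions, data_type):
    """Concatenate only the alignment columns belonging to one data type.

    alignment  : dict mapping taxon name -> aligned sequence (equal lengths)
    partitions : list of (name, kind, start, end), zero-based, end-exclusive
                 (a hypothetical layout assumed for this sketch)
    data_type  : e.g. "coding" or "noncoding"
    """
    keep = [(start, end) for _, kind, start, end in partitions if kind == data_type]
    return {
        taxon: "".join(seq[start:end] for start, end in keep)
        for taxon, seq in alignment.items()
    }


# Toy data: two taxa, two loci; the first locus has an exon and an intron.
alignment = {
    "sp1": "ATGCCCGTAAGTATG",
    "sp2": "ATGCCAGTGAGTATG",
}
partitions = [
    ("locus1_exon", "coding", 0, 6),
    ("locus1_intron", "noncoding", 6, 12),
    ("locus2_exon", "coding", 12, 15),
]

coding = subset_by_type(alignment, partitions, "coding")
noncoding = subset_by_type(alignment, partitions, "noncoding")
print(coding["sp1"], noncoding["sp1"])
```

Because both subsets are drawn from the same taxa and the same loci, any topological difference between the resulting trees cannot be attributed to taxon or locus sampling.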
There is a third prediction that could also be made. Reddy et al. [
13
] hypothesized that the non-coding trees were closer to the true evolutionary history of Neoaves based on two observations: (1) the non-coding cluster includes trees based on rare genomic changes (e.g., the transposable element insertion tree from Suh et al. [
32
]), which are a distinct source of phylogenetic information; and (2) coding data exhibit greater variation in GC-content (guanine-cytosine content) among taxa than non-coding data, violating the assumptions of the time-reversible models used in most maximum-likelihood (ML) and Bayesian analyses of phylogeny. The second point allows us to predict that analytical methods that limit the impact of variation in base composition on phylogenetic estimation will yield coding exon trees that are more congruent with non-coding trees.
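The second observation rests on a quantity that is simple to compute. The sketch below, a minimal illustration with hypothetical function names and toy sequences, measures GC-content per taxon and the spread among taxa; it assumes that gaps and ambiguity codes are ignored, which is one reasonable convention but not the only one.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T).

    Gaps and ambiguity codes are ignored (an assumption of this sketch).
    Returns NaN if the sequence has no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else float("nan")


def gc_range(alignment):
    """Difference between the highest and lowest GC-content among taxa."""
    values = [gc_content(seq) for seq in alignment.values()]
    return max(values) - min(values)


# Toy alignment (hypothetical sequences).
toy = {"sp1": "GGCCGCAT", "sp2": "ATATGCAT", "sp3": "ATGCATGC"}
for taxon, seq in toy.items():
    print(taxon, gc_content(seq))
print("range:", gc_range(toy))
```

Applying the same calculation separately to the coding and non-coding subsets of a matrix gives a direct comparison of how strongly each violates the assumption of stationary base composition.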
Here we perform a direct test of the data type effects hypothesis for the base of Neoaves by conducting phylogenetic analyses of the coding and non-coding subsets of the Prum data matrix. More specifically, we examined the first two predictions by conducting analyses of concatenated nucleotide data. To test the third prediction, we recoded nucleotide sequences for the coding subset as purines (R) and pyrimidines (Y). This approach, called RY coding, is a simple method that limits the impact of base compositional variation [
24
,
33
]. To extend these results into the multispecies coalescent framework, we used ASTRAL [
34
] to estimate the species tree by combining gene trees. We did this using gene trees estimated from both the original nucleotide alignments and the alignments subjected to RY coding. Finally, we discuss the implications of our results for the theory and practice of phylogenomics and for the tree topology at the base of Neoaves.
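RY coding itself is a simple character recoding, as the following minimal sketch shows. The function name is hypothetical, and the handling of characters other than A, C, G, T, and U (left unchanged here, so gaps and ambiguity codes pass through) is an assumption of this sketch rather than a description of the pipeline used in this study.

```python
# Purines (A, G) become R; pyrimidines (C, T, and U for RNA) become Y.
RY_TABLE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"})


def ry_code(seq):
    """Recode a nucleotide sequence as purines (R) and pyrimidines (Y).

    Gaps and ambiguity codes are left unchanged (an assumption of this
    sketch); input is upper-cased first.
    """
    return seq.upper().translate(RY_TABLE)


# Two toy sequences with very different GC-content become identical,
# because only transversions (R <-> Y changes) remain visible.
print(ry_code("GGCC"))  # RRYY
print(ry_code("AATT"))  # RRYY
```

The toy example shows why the recoding limits the impact of base composition: the sequences differ only by transitions (A versus G, C versus T), which are exactly the differences that drive GC-content variation and which RY coding discards.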