
The complexity and depth of the relationships between the three domains of life challenge the reliability of phylogenetic methods, encouraging the use of alternative analytical tools. We reconstructed a gene similarity network comprising the proteomes of 14 eukaryotes, 104 prokaryotes, 2,389 viruses, and 1,044 plasmids. This network contains multiple signatures of the chimerical origin of Eukaryotes as a fusion of an archaebacterium and a eubacterium that could not have been observed using phylogenetic trees. A number of connected components (gene sets with stronger similarities than expected by chance) contain pairs of eukaryotic sequences exhibiting no direct detectable similarity. Instead, many eukaryotic sequences were indirectly connected through a “eukaryote–archaebacterium–eubacterium–eukaryote” similarity path. Furthermore, eukaryotic genes highly connected to prokaryotic genes from one domain tend not to be connected to genes from the other prokaryotic domain. Genes of archaebacterial and eubacterial ancestry tend to perform different functions and to act in different subcellular compartments, but in such an intertwined way as to suggest an early rather than late integration of both gene repertoires. The archaebacterial repertoire has a similar size in all eukaryotic genomes, whereas the number of eubacterium-derived genes is much more variable, suggesting a higher plasticity of this gene repertoire. Consequently, highly reduced eukaryotic genomes contain more genes of archaebacterial than eubacterial affinity. Connected components with prokaryotic and eukaryotic genes tend to include viral and plasmid genes, compatible with a role of gene mobility in the origin of Eukaryotes. Our analyses highlight the power of network approaches to study deep evolutionary events.

(A) Significant similarities between eukaryotic (green), archaebacterial (blue), and eubacterial (red) sequences belonging to the same gene family. Nodes in the network represent sequences, whereas links represent significant similarities. The connected component exhibits a remarkable Eukaryote-Archaebacteria-Eubacteria-Eukaryote structure, with some eukaryotic genes exhibiting similarity to eubacterial homologs and other eukaryotic genes exhibiting similarity to archaebacterial homologs. This topology is in agreement with a chimerical origin of Eukaryotes. Note that a tree encompassing all the sequences in the gene family cannot be easily constructed, because not all sequences are significantly similar to each other at the specified thresholds. (B) Relationship between genome size and the archaebacterial-to-eubacterial gene ratio for 14 eukaryotic genomes: Bigelowiella natans, Hemiselmis andersenii, Encephalitozoon intestinalis, Plasmodium knowlesi, Saccharomyces cerevisiae, Giardia lamblia, Entamoeba histolytica, Chlorella variabilis, Naegleria gruberi, Phytophthora infestans, Trypanosoma cruzi, Homo sapiens, Tetrahymena thermophila, and Arabidopsis thaliana.

Addressing questions about ancient evolution, such as the emergence and early evolution of Eukaryotes, which likely arose ∼2 billion years ago, is a challenging task. Molecular sequences probably contain precious evolutionary signatures of early Eukaryote history, and the task for evolutionary biologists is to uncover this information. We designed a protocol that increases the amount of ancient evolutionary information amenable to evolutionary analyses. We constructed a network that displayed the similarities among all proteins encoded in the genomes of 52 Archaebacteria, 52 Eubacteria, 14 representatives of all the main eukaryotic lineages, and their mobile genetic elements. The resulting network, which encompasses more than 445,000 sequences connected by ∼8 million edges, provides a previously untapped source of information, namely, tens of thousands of subgraphs showing both close and distant homology relationships between these sequences. The topology of a number of these subgraphs is consistent with a chimerical origin of Eukaryotes resulting from the fusion of an archaebacterium and a eubacterium. These gene families contain two groups of eukaryotic genes: one group that is connected to archaebacterial genes and another group that is connected to eubacterial genes (see figure). Despite being homologous, these two groups do not exhibit significant sequence similarity to each other at the specified thresholds. These networks allow for the simultaneous comparison of distant homologs in Eukaryotes, Archaebacteria, and Eubacteria, a task that could not have been achieved using phylogenetic trees, because distant eukaryotic homologs could not have been aligned.
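The kind of analysis described here can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that the significantly similar sequence pairs have already been obtained from some all-against-all comparison, and that each sequence carries a domain label ("E" for eukaryote, "A" for archaebacterium, "B" for eubacterium). The function names, the toy identifiers, and the rule used to report a path (the two eukaryotic sequences share no direct edge) are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


def build_network(hits):
    """Build an undirected similarity network.

    hits: iterable of (seq_a, seq_b) pairs judged significantly similar
    (the similarity search and thresholds are assumed to happen upstream).
    """
    adj = defaultdict(set)
    for a, b in hits:
        if a != b:  # ignore self-hits
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def connected_components(adj):
    """Return the connected components (candidate gene families) as sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:  # breadth-first traversal from `start`
            node = queue.popleft()
            component.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def eabe_paths(adj, domain):
    """List eukaryote-archaebacterium-eubacterium-eukaryote paths.

    A path (e1, a, b, e2) is reported only when e1 and e2 have no direct
    edge, i.e. the two eukaryotic sequences are linked only indirectly.
    """
    paths = []
    for a in sorted(adj):
        if domain[a] != "A":
            continue
        euks_near_a = sorted(n for n in adj[a] if domain[n] == "E")
        for b in sorted(n for n in adj[a] if domain[n] == "B"):
            for e2 in sorted(n for n in adj[b] if domain[n] == "E"):
                for e1 in euks_near_a:
                    if e1 != e2 and e2 not in adj[e1]:
                        paths.append((e1, a, b, e2))
    return paths


# Toy data (hypothetical identifiers, not taken from the study).
domain = {"euk1": "E", "euk2": "E", "euk3": "E", "euk4": "E",
          "arc1": "A", "bac1": "B"}
hits = [("euk1", "arc1"), ("arc1", "bac1"), ("bac1", "euk2"),
        ("euk3", "euk4")]

network = build_network(hits)
families = connected_components(network)
indirect = eabe_paths(network, domain)
print(len(families), "components;", len(indirect), "E-A-B-E path(s)")
```

In this toy network, `euk1` and `euk2` fall in the same connected component although no edge joins them directly, which is the situation in which a single alignment or tree covering the whole family would be hard to build. For a network with hundreds of thousands of sequences, a dedicated graph library would probably be a better choice than these plain dictionaries.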
