Significant progress has been made in recent years in a variety of seemingly unrelated fields such as sequencing, protein structure prediction, and high-throughput transcriptomics and metabolomics. At the same time, new microscopic models have been developed that make it possible to analyze the evolution of genes and genomes from first principles. The results of these efforts enable, for the first time, comprehensive insight into the evolution of complex systems and organisms on all scales, from sequences to organisms and populations. Every newly sequenced genome uncovers new genes, families, and folds. Where do these new genes come from? How do gene duplication and subsequent divergence of sequence and structure affect the fitness of the organism? What role does regulation play in the evolution of proteins and folds? The emerging synergy between data and modeling provides the first robust answers to these questions.
Crowded intracellular environments present a challenge for proteins: they must form specific functional complexes while limiting non-functional interactions with promiscuous partners. Here we show how the need to minimize the waste of resources on non-functional interactions limits the proteome diversity and the average concentration of co-expressed and co-localized proteins. Using the results of high-throughput Yeast 2-Hybrid experiments, we estimate the characteristic strength of non-functional protein–protein interactions. By combining these data with the strengths of specific interactions, we assess the fraction of time proteins spend tied up in non-functional interactions as a function of their overall concentration. This allows us to sketch the phase diagram for baker's yeast cells using the experimentally measured concentrations and subcellular localization of their proteins. The positions of yeast compartments on the phase diagram are consistent with our hypothesis that the yeast proteome has evolved to operate close to the upper limit of its size, while keeping individual protein concentrations sufficiently low to reduce non-functional interactions. These findings have implications for the conceptual understanding of intracellular compartmentalization, multicellularity and differentiation.
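The dependence of non-functional binding on overall concentration can be illustrated with a minimal mass-action sketch. This is not the model of the study above: it assumes a single protein with one specific partner and a pool of promiscuous partners, approximates free concentrations by total concentrations, and uses hypothetical names (`binding_fractions`, `kd_specific`, `kd_nonspecific`, `crowd_conc`) and made-up numbers.

```python
def binding_fractions(partner_conc, kd_specific, crowd_conc, kd_nonspecific):
    """Fractions of time one protein is free, specifically bound, or
    non-functionally bound, under simple competitive mass-action binding.

    Assumptions (illustrative only): free concentrations are approximated
    by totals, and all promiscuous partners share one dissociation constant.
    All concentrations and constants must use the same units.
    """
    if kd_specific <= 0 or kd_nonspecific <= 0:
        raise ValueError("dissociation constants must be positive")
    if partner_conc < 0 or crowd_conc < 0:
        raise ValueError("concentrations must be non-negative")
    w_specific = partner_conc / kd_specific        # statistical weight of the specific complex
    w_nonfunctional = crowd_conc / kd_nonspecific  # weight of all promiscuous complexes
    z = 1.0 + w_specific + w_nonfunctional         # 1.0 is the weight of the free state
    return {
        "free": 1.0 / z,
        "specific": w_specific / z,
        "nonfunctional": w_nonfunctional / z,
    }


if __name__ == "__main__":
    # Made-up values: as the total concentration of co-localized proteins
    # grows, the non-functional fraction grows at the expense of the others.
    for crowd in (0.1, 1.0, 10.0, 100.0):
        print(crowd, binding_fractions(1.0, 0.01, crowd, 10.0))
```

In this sketch the non-functional fraction rises monotonically with `crowd_conc`, which is the qualitative reason a cell would keep total concentrations of co-localized proteins low.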
We predict that patterns with correlated surface density of atoms have statistically higher promiscuity (the ability to bind more strongly to an arbitrary pattern) than noncorrelated patterns with the same average surface density. We suggest that this constitutes a generic design principle for highly connected proteins (hubs) in protein interaction networks. We develop an analytical theory for this effect. We show that our key predictions are generic and qualitatively independent of the specific form of the interatomic interaction potential, provided it has a finite range.
Efforts in whole-genome sequencing and structural proteomics start to provide a global view of the protein universe, the set of existing protein structures and sequences. However, approaches based on the selection of individual sequences have not been entirely successful at the quantitative description of the distribution of structures and sequences in the protein universe because evolutionary pressure acts on the entire organism, rather than on a particular molecule. In parallel to this line of study, studies in population genetics and phenomenological molecular evolution established a mathematical framework to describe the changes in genome sequences in populations of organisms over time. Here, we review both microscopic (physics-based) and macroscopic (organism-level) models of protein-sequence evolution and demonstrate that bridging the two scales provides the most complete description of the protein universe starting from clearly defined, testable, and physiologically relevant assumptions.
We apply a simulational proxy of the ϕ-value analysis and perform extensive mutagenesis experiments to identify the nucleating residues in the folding "reactions" of two small lattice Gō polymers with different native geometries. Our findings show that for the more complex native fold (i.e., the one that is rich in nonlocal, long-range bonds), mutation of the residues that form the folding nucleus leads to a considerably larger increase in the folding time than the corresponding mutations in the geometry that is predominantly local. These results are compared to data obtained from an accurate analysis based on the reaction coordinate folding probability Pfold and on structural clustering methods. Our study reveals a complex picture of the transition state ensemble. For both protein models, the transition state ensemble is rather heterogeneous and splits up into structurally different populations. For the more complex geometry the identified subpopulations are actually structurally disjoint. For the less complex native geometry we found a broad transition state with microscopic heterogeneity. These findings suggest that the existence of multiple transition state structures may be linked to the geometric complexity of the native fold. For both geometries, the identification of the folding nucleus via the Pfold analysis agrees with the identification of the folding nucleus carried out with the ϕ-value analysis. For the more complex geometry, however, the applied methodologies give more consistent results than for the more local geometry. The study of the transition state structure reveals that the nucleus residues are not necessarily fully native in the transition state. Indeed, it is only for the more complex geometry that two of the five critical residues show a considerably high probability of having all their native bonds formed in the transition state. Therefore, one concludes that, in general, the ϕ-value correlates with the acceleration/deceleration of folding induced by mutation, rather than with the degree of nativeness of the transition state, and that the "traditional" interpretation of ϕ-values may provide a more realistic picture of the structure of the transition state only for more complex native geometries.
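The folding probability Pfold used above is a commitment probability: the fraction of independent trajectories started from a given conformation that reach the folded state before the unfolded state. A minimal sketch of how such an estimate works is given below. It replaces the lattice polymer with a toy unbiased random walk on a one-dimensional coordinate, which is an assumption made purely for illustration; the function name `estimate_pfold` and all parameters are hypothetical.

```python
import random


def estimate_pfold(start, n_sites, trials=2000, seed=0):
    """Monte Carlo estimate of the commitment probability on a toy 1D chain.

    Position 0 plays the role of the unfolded state and position n_sites the
    folded state; both are absorbing. Each trial is an unbiased random walk
    started at `start`, and the estimate is the fraction of walks absorbed at
    the folded end. This stands in for launching many short folding
    simulations from one conformation.
    """
    if not 0 <= start <= n_sites:
        raise ValueError("start must lie between 0 and n_sites")
    rng = random.Random(seed)
    folded = 0
    for _ in range(trials):
        x = start
        while 0 < x < n_sites:
            x += rng.choice((-1, 1))  # one unbiased step along the coordinate
        if x == n_sites:
            folded += 1
    return folded / trials


if __name__ == "__main__":
    # Conformations with an estimate near 0.5 would be assigned to the
    # transition state ensemble.
    for start in (2, 5, 8):
        print(start, estimate_pfold(start, 10))
```

For this toy walk the exact answer is `start / n_sites`, which makes the estimator easy to check; a real lattice or all-atom model has no such closed form, which is why the estimate is obtained from repeated simulations.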
In this study we evaluate, at full atomic detail, the folding processes of two small helical proteins, the B domain of protein A and the Villin headpiece. Folding kinetics are studied by performing a large number of ab initio Monte Carlo folding simulations using a single transferable all-atom potential. Using these trajectories, we examine the relaxation behavior, secondary structure formation, and transition-state ensembles (TSEs) of the two proteins and compare our results with experimental data and previous computational studies. To obtain detailed structural information on the folding dynamics viewed as an ensemble process, we perform a clustering analysis procedure based on graph theory. Moreover, rigorous Pfold analysis is used to obtain representative samples of the TSEs, and good quantitative agreement between experimental and simulated Φ values is obtained for protein A. Φ values for Villin are also obtained and left as predictions to be tested by future experiments. Our analysis shows that the two-helix hairpin is a common partially stable structural motif that forms before entering the TSE in the studied proteins. These results, together with our earlier study of Engrailed Homeodomain and recent experimental studies, provide a comprehensive, atomic-level picture of folding mechanics of three-helix bundle proteins.
Protein–DNA interactions are vital for many processes in living cells, especially transcriptional regulation and DNA modification. To further our understanding of these important processes on the microscopic level, it is necessary that theoretical models describe the macromolecular interaction energetics accurately. While several methods have been proposed, there has not been a careful comparison of how well the different methods are able to predict biologically important quantities such as the correct DNA binding sequence, total binding free energy and free energy changes caused by DNA mutation. In addition to carrying out the comparison, we present two important theoretical models developed initially in protein folding that have not yet been tried on protein–DNA interactions. In the process, we find that the results of these knowledge-based potentials show a strong dependence on the interaction distance and the derivation method. Finally, we present a knowledge-based potential that gives comparable or superior results to the best of the other methods, including the molecular mechanics force field AMBER99.
Studies of the role of sex in evolution typically involve a longitudinal comparison of a single ancestor to several intermediate descendants and to one terminally evolved descendant after many generations of adaptation under a given selective regime. Here we take a complementary, statistical approach to sex in evolution, by describing the distribution of phenotypic similarity in a population of yeast F1 meiotic recombinants. By applying graph theory to fitness measurements of thousands of Saccharomyces cerevisiae recombinants treated with 10 mechanistically distinct, growth-inhibitory small-molecule perturbagens (SMPs), we show that the network of phenotypic similarity among F1 recombinants exhibits a scale-free degree distribution. F1 recombinants are often phenotypically unique and sometimes exceptional, and their fitness strengths are unevenly distributed across the 10 compound treatments. By contrast, highly phenotypically similar F1 recombinants constitute failing hubs that display below-average fitness across all compound treatments and are candidate substrates for purifying selection. Comparison of the F1 generation with the parental strains reveals that (i) there is a specialist more fit in any given single condition than any of the parents but (ii) only rarely are there generalists that exhibit greater fitness than both parental strains across a majority of conditions. This analysis allows us to evaluate and to gain better theoretical understanding of the costs and benefits of sex in the F1 generation.
Here, we address the question of how Darwinian evolution of organisms determines molecular evolution of their proteins and genomes. We developed a microscopic ab initio model of early biological evolution where the fitness (essentially lifetime) of an organism is explicitly related to the evolving sequences of its proteins. The main assumption of the model is that the death rate of an organism is determined by the stability of the least stable of its proteins. A lattice model is used to calculate the stability of all proteins in a genome from their amino acid sequences. The simulation of the model starts from 100 identical organisms, each carrying the same random gene, and proceeds via random mutations, gene duplication, organism births via replication, and organism deaths. We find that exponential population growth is possible only after the discovery of a very small number of specific advantageous protein structures. The number of genes in the evolving organisms depends on the mutation rate, demonstrating the intricate relationship between genome sizes and protein stability requirements. Further, the model explains the observed power-law distributions of protein family and superfamily sizes, as well as the scale-free character of protein structural similarity graphs. Together, these results and their analysis suggest a plausible comprehensive scenario of the emergence of the protein universe in early biological evolution.
We study statistical properties of interacting protein-like surfaces and predict two strong, related effects: (i) statistically enhanced self-attraction of proteins; (ii) statistically enhanced attraction of proteins with similar structures. The effects originate in the fact that the probability of finding a pattern self-match between two identical, even randomly organized, interacting protein surfaces is always higher than the probability of a pattern match between two different, promiscuous protein surfaces. This theoretical finding explains the statistical prevalence of homodimers in protein–protein interaction networks reported earlier. Further, our findings are confirmed by the analysis of a curated database of protein complexes, which showed a highly statistically significant overrepresentation of dimers formed by structurally similar proteins with highly divergent sequences ("superfamily heterodimers"). We suggest that promiscuous homodimeric interactions pose strong competition for heterodimers evolved from homodimers. Such an evolutionary bottleneck is overcome by negative design, an evolutionary pressure applied against promiscuous homodimer formation. This is achieved through the formation of highly specific contacts by charged residues, as demonstrated both in model and in real superfamily heterodimers.
Classical population genetics a priori assigns fitness to alleles without considering molecular or functional properties of proteins that these alleles encode. Here we study population dynamics in a model where fitness can be inferred from physical properties of proteins under a physiological assumption that loss of stability of any protein encoded by an essential gene confers a lethal phenotype. Accumulation of mutations in organisms containing Γ genes can then be represented as diffusion within the Γ-dimensional hypercube with adsorbing boundaries determined, in each dimension, by loss of a protein's stability and, at higher stability, by lack of protein sequences. Solving the diffusion equation whose parameters are derived from the data on point mutations in proteins, we determine a universal distribution of protein stabilities, in agreement with existing data. The theory provides a fundamental relation between mutation rate, maximal genome size, and thermodynamic response of proteins to point mutations. It establishes a universal speed limit on rate of molecular evolution by predicting that populations go extinct (via lethal mutagenesis) when mutation rate exceeds approximately six mutations per essential part of genome per replication for mesophilic organisms and one to two mutations per genome per replication for thermophilic ones. Several RNA viruses function close to the evolutionary speed limit, whereas error correction mechanisms used by DNA viruses and nonmutant strains of bacteria featuring various genome lengths and mutation rates have brought these organisms universally ≈1,000-fold below the natural speed limit.
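The stability-threshold assumption above can be illustrated with a toy Monte Carlo sketch. This is not the diffusion-equation solution of the study: it only shows how a lethal boundary at zero stability turns a mutation load into a death probability. The step distribution (`mean_ddg`, `sd_ddg`), the starting stability `start_dg`, the lower bound `dg_min`, and its treatment as a rejected move are all illustrative assumptions, and every name is hypothetical.

```python
import random


def mutate_genome(stabilities, n_mutations, rng,
                  mean_ddg=1.0, sd_ddg=1.7, dg_min=-25.0):
    """Apply random point mutations to a genome of folding free energies.

    Each mutation shifts the stability (dG, negative means stable) of one
    randomly chosen gene by a Gaussian step; the parameters are placeholders,
    not values taken from the text. Returns the new stabilities, or None if
    any protein loses stability (dG >= 0), which is lethal under the
    essential-gene assumption.
    """
    new = list(stabilities)
    for _ in range(n_mutations):
        gene = rng.randrange(len(new))
        dg = new[gene] + rng.gauss(mean_ddg, sd_ddg)
        if dg >= 0.0:
            return None      # upper boundary: loss of stability is lethal
        if dg < dg_min:
            continue         # assumed lower bound, treated here as a rejected move
        new[gene] = dg
    return new


def lethal_fraction(n_genes, n_mutations, start_dg=-5.0, trials=2000, seed=0):
    """Fraction of replications that are lethal for a given mutation load."""
    rng = random.Random(seed)
    deaths = 0
    for _ in range(trials):
        genome = [start_dg] * n_genes
        if mutate_genome(genome, n_mutations, rng) is None:
            deaths += 1
    return deaths / trials


if __name__ == "__main__":
    for m in (0, 1, 3, 6):
        print(m, lethal_fraction(n_genes=10, n_mutations=m))
```

A fuller treatment would follow a replicating population over many generations and compare the death rate with the birth rate; this sketch covers a single replication only.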
What mechanisms does Nature use in her quest for thermophilic proteins? It is known that the stability of a protein is mainly determined by the energy gap, or the difference in energy, between the native state and a set of incorrectly folded (misfolded) conformations. Here we show that Nature makes thermophilic proteins by widening this gap from both ends. The energy of the native state of a protein is decreased by selecting strongly attractive amino acids at positions that are in contact in the native state (positive design). Simultaneously, energies of the misfolded conformations are increased by selecting strongly repulsive amino acids at positions that are distant in the native structure but that interact repulsively in the misfolded conformations (negative design). These fundamental principles of protein design are manifested in the "from both ends of the hydrophobicity scale" trend observed in thermophilic adaptation, whereby thermophilic proteomes are enriched in extreme amino acids (hydrophobic and charged) at the expense of polar ones. Hydrophobic amino acids contribute mostly to the positive design, while charged amino acids that repel each other in non-native conformations of proteins contribute to negative design. Our results provide guidance in the rational design of proteins with selected thermal properties.
The capacity of proteins to interact specifically with one another underlies our conceptual understanding of how living systems function. Systems-level study of specificity in protein–protein interactions is complicated by the fact that the cellular environment is crowded and heterogeneous; interaction pairs may exist at low relative concentrations and thus be presented with many more opportunities for promiscuous interactions compared with specific interaction possibilities. Here we address these questions by using a simple computational model that includes specifically designed interacting model proteins immersed in a mixture containing hundreds of different unrelated ones; all of them undergo simulated diffusion and interaction. We find that specific complexes are quite robust to interference from promiscuous interaction partners only in the range of temperatures T_design > T > T_rand. At T > T_design, specific complexes become unstable, whereas at T < T_rand, formation of specific complexes is suppressed by promiscuous interactions. Specific interactions can form only if T_design > T_rand. This condition requires an energy gap between binding energy in a specific complex and set of binding energies between randomly associating proteins, providing a general physical constraint on evolutionary selection or design of specific interacting protein interfaces. This work has implications for our understanding of how the protein repertoire functions and evolves within the context of cellular systems.
Natural proteins fold to a unique, thermodynamically dominant state. Modeling of the folding process and prediction of the native fold of proteins are two major unsolved problems in biophysics. Here, we show successful all-atom ab initio folding of a representative diverse set of proteins by using a minimalist transferable-energy model that consists of two-body atom-atom interactions, hydrogen bonding, and a local sequence-energy term that models sequence-specific chain stiffness. Starting from a random coil, the native-like structure was observed during replica exchange Monte Carlo (REMC) simulation for most proteins regardless of their structural classes; the lowest energy structure was close to native—in the range of 2–6 Å root-mean-square deviation (rmsd). Our results demonstrate that the successful folding of a protein chain to its native state is governed by only a few crucial energetic terms.
An increasing number of proteins are being discovered with a remarkable and somewhat surprising feature: a knot in their native structures. How the polypeptide chain is able to "knot" itself during the folding process to form these highly intricate protein topologies is not known. Here we perform a computational study on the 160-amino-acid homodimeric protein YibK, which, like other proteins in the SpoU family of MTases, contains a deep trefoil knot in its C-terminal region. In this study, we use a coarse-grained Cα-chain representation and Langevin dynamics to study folding kinetics. We find that specific, attractive nonnative interactions are critical for knot formation. In the absence of these interactions, i.e., with energetics driven entirely by native interactions, knot formation is exceedingly unlikely. Further, we find, in concert with recent experimental data on YibK, two parallel folding pathways that we attribute to an early and a late formation of the trefoil knot, respectively. For both pathways, knot formation occurs before dimerization. A bioinformatics analysis of the SpoU family of proteins reveals further that the critical nonnative interactions may originate from evolutionarily conserved hydrophobic segments around the knotted region.
There have been considerable attempts in the past to relate a phenotypic trait, the habitat temperature of organisms, to their genotypes, most importantly the compositions of their genomes and proteomes. However, despite the accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from −10 °C to 110 °C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = −0.10). On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found a strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of protein structure and stability and evolutionary mechanisms of thermophilic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.
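The quantity at the center of this analysis, the total fraction of IVYWREL residues in a proteome, is simple to compute, and its relation to OGT is measured with an ordinary Pearson correlation coefficient. The sketch below shows both steps in plain Python. The function names are hypothetical, the sequences and temperatures in the usage example are made up for illustration and are not data from the study, and the exhaustive enumeration over all amino acid sets is not reproduced here.

```python
from math import sqrt

# One-letter codes for Ile, Val, Tyr, Trp, Arg, Glu, Leu.
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteome):
    """Fraction of all residues in a proteome that belong to the IVYWREL set.

    `proteome` is an iterable of protein sequences in one-letter code.
    """
    total = 0
    hits = 0
    for seq in proteome:
        for aa in seq.upper():
            total += 1
            if aa in IVYWREL:
                hits += 1
    if total == 0:
        raise ValueError("proteome contains no residues")
    return hits / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up toy "proteomes" and growth temperatures, for illustration only.
    proteomes = [["MKTAYIAKQR", "GSSNDQTAPK"], ["MKVLIEELRR", "WYVIELRKAG"]]
    ogts = [30.0, 85.0]
    fractions = [ivywrel_fraction(p) for p in proteomes]
    print(fractions, pearson_r(fractions, ogts))
```

With real data, each proteome would be read from a sequence file and the correlation computed across all organisms with a known OGT.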
We will discuss recent developments in bioinformatics analysis and theoretical studies of protein-protein interactions in living cells – an important aspect of systems biology. First, we consider the network of protein-protein interactions and demonstrate that two published independent measurements of these interactions produce graphs that are only weakly correlated with one another despite their strikingly similar topology. We then propose a physical model based on the fundamental principle that (de)solvation is a major physical factor in protein-protein interactions. This model reproduces not only the scale-free nature of such graphs but also a number of higher-order correlations in these networks. Key support for the model is provided by the discovery of a significant correlation between the number of interactions made by a protein and the fraction of hydrophobic residues on its surface. Next, we discuss a number of fundamental models for specific protein-protein interactions that provide deep insight into how specific protein multimers could have evolved. In particular, we show that homodimers are most likely to have been precursors of modern protein complexes (homodimers are prevalent in modern cells – the phenomenon of "molecular narcissism"). Subsequent evolution created heterodimers in the process of "negative design" against non-specific and homodimeric associations.