The patterns of polymorphisms in genomes are imprints of the evolutionary forces at play in nature. In particular, polymorphisms have been extensively used to infer the fitness effects of mutations and their dynamics of fixation. However, the role and contribution of molecular biophysics to these observations remain unclear. Here, we couple robust findings from protein biophysics, enzymatic flux theory, the selection against the cytotoxic effects of protein misfolding, and explicit population dynamics simulations in the polyclonal regime. First, we recapitulate results on the dynamics of clonal interference and on the shape of the DFE, thus providing them with a molecular and mechanistic foundation. Second, we predict that if evolution is indeed under the dynamic equilibrium of mutation-selection balance, the fraction of stabilizing and destabilizing mutations is almost equal among single-nucleotide polymorphisms segregating at high allele frequencies. This prediction is proven true for polymorphisms in the human coding region. Overall, our results show how selection for protein folding stability predominantly shapes the patterns of polymorphisms in coding regions.
The variation among sequences and structures in nature is both determined by physical laws and by evolutionary history. However, these two factors are traditionally investigated by disciplines with different emphasis and philosophy-molecular biophysics on one hand and evolutionary population genetics in another. Here, we review recent theoretical and computational approaches that address the crucial need to integrate these two disciplines. We first articulate the elements of these approaches. Then, we survey their contribution to our mechanistic understanding of molecular evolution, the polymorphisms in coding region, the distribution of fitness effects (DFE) of mutations, the observed folding stability of proteins in nature, and the distribution of protein folds in genomes.
The emergence of a novel A(H1N1) strain in 2009 was the first influenza pandemic of the genomic age, and unprecedented surveillance of the virus provides the opportunity to better understand the evolution of influenza. We examined changes in the nucleotide coding regions and the amino acid sequences of the hemagglutinin (HA), neuraminidase (NA), and nucleoprotein (NP) segments of the A(H1N1)pdm09 strain using publicly available data. We calculated the nucleotide and amino acid hamming distance from the vaccine strain A/California/07/2009 for each sequence. We also estimated Pepitope–a measure of antigenic diversity based on changes in the epitope regions–for each isolate. Finally, we compared our results to A(H3N2) strains collected over the same period. Our analysis found that the mean hamming distance for the HA protein of the A(H1N1)pdm09 strain increased from 3.6 (standard deviation [SD]: 1.3) in 2009 to 11.7 (SD: 1.0) in 2013, while the mean hamming distance in the coding region increased from 7.4 (SD: 2.2) in 2009 to 28.3 (SD: 2.1) in 2013. These trends are broadly similar to the rate of mutation in H3N2 over the same time period. However, in contrast to H3N2 strains, the rate of mutation accumulation has slowed in recent years. Our results are notable because, over the course of the study, mutation rates in H3N2 similar to that seen with A(H1N1)pdm09 led to the emergence of two antigenic drift variants. However, while there has been an H1N1 epidemic in North America this season, evidence to date indicates the vaccine is still effective, suggesting the epidemic is not due to the emergence of an antigenic drift variant. Our results suggest that more research is needed to understand how viral mutations are related to vaccine effectiveness so that future vaccine choices and development can be more predictive.
Although molecular chaperones are essential components of protein homeostatic machinery, their mechanism of action and impact on adaptation and evolutionary dynamics remain controversial. Here we developed a physics-based ab initio multi-scale model of a living cell for population dynamics simulations to elucidate the effect of chaperones on adaptive evolution. The 6-loci genomes of model cells encode model proteins, whose folding and interactions in cellular milieu can be evaluated exactly from their genome sequences. A genotype-phenotype relationship that is based on a simple yet non-trivially postulated protein-protein interaction (PPI) network determines the cell division rate. Model proteins can exist in native and molten globule states and participate in functional and all possible promiscuous non-functional PPIs. We find that an active chaperone mechanism, whereby chaperones directly catalyze protein folding, has a significant impact on the cellular fitness and the rate of evolutionary dynamics, while passive chaperones, which just maintain misfolded proteins in soluble complexes have a negligible effect on the fitness. We find that by partially releasing the constraint on protein stability, active chaperones promote a deeper exploration of sequence space to strengthen functional PPIs, and diminish the non-functional PPIs. A key experimentally testable prediction emerging from our analysis is that down-regulation of chaperones that catalyze protein folding significantly slows down the adaptation dynamics.
Since divergence ~50 Ma ago from their terrestrial ancestors, cetaceans underwent a series of adaptations such as a ~10–20 fold increase in myoglobin (Mb) concentration in skeletal muscle, critical for increasing oxygen storage capacity and prolonging dive time. Whereas the O2-binding affinity of Mbs is not significantly different among mammals (with typical oxygenation constants of ~0.8–1.2 µM−1), folding stabilities of cetacean Mbs are ~2–4 kcal/mol higher than for terrestrial Mbs. Using ancestral sequence reconstruction, maximum likelihood and Bayesian tests to describe the evolution of cetacean Mbs, and experimentally calibrated computation of stability effects of mutations, we observe accelerated evolution in cetaceans and identify seven positively selected sites in Mb. Overall, these sites contribute to Mb stabilization with a conditional probability of 0.8. We observe a correlation between Mb folding stability and protein abundance, suggesting that a selection pressure for stability acts proportionally to higher expression. We also identify a major divergence event leading to the common ancestor of whales, during which major stabilization occurred. Most of the positively selected sites that occur later act against other destabilizing mutations to maintain stability across the clade, except for the shallow divers, where late stability relaxation occurs, probably due to the shorter aerobic dive limits of these species. The three main positively selected sites 66, 5, and 35 undergo changes that favor hydrophobic folding, structural integrity, and intra-helical hydrogen bonds.
What are the molecular properties of proteins that fall on the radar of protein quality control (PQC)? Here we mutate the E. coli’s gene encoding dihydrofolate reductase (DHFR) and replace it with bacterial orthologous genes to determine how components of PQC modulate fitness effects of these genetic changes. We find that chaperonins GroEL/ES and protease Lon compete for binding to molten globule intermediate of DHFR, resulting in a peculiar symmetry in their action: overexpression of GroEL/ES and deletion of Lon both restore growth of deleterious DHFR mutants and most of the slow-growing orthologous DHFR strains. Kinetic steady-state modeling predicts and experimentation verifies that mutations affect fitness by shifting the flux balance in cellular milieu between protein production, folding, and degradation orchestrated by PQC through the interaction with folding intermediates.
To understand the variation of protein sequences in nature, we need to reckon with evolutionary constraints that are biophysical, cellular, and ecological. Here, we show that under the global selection against protein misfolding, there exists a scaling among protein folding stability, protein cellular abundance, and effective population size. The specific scaling implies that the several-orders-of-magnitude range of protein abundances in the cell should leave imprints on extant protein structures, a prediction that is supported by our structural analysis of the yeast proteome.
Abstract The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state-of-the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.
Reproduction is inherently risky, in part because genomic replication can introduce new mutations that are usually deleterious toward fitness. This risk is especially severe for organisms whose genomes replicate “semi-conservatively,” e.g. viruses and bacteria, where no master copy of the genome is preserved. Lethal mutagenesis refers to extinction of populations due to an unbearably high mutation rate (U), and is important both theoretically and clinically, where drugs can extinguish pathogens by increasing their mutation rate. Previous theoretical models of lethal mutagenesis assume infinite population size (N). However, in addition to high U, small N can accelerate extinction by strengthening genetic drift and relaxing selection. Here, we examine how the time until extinction depends jointly on N and U. We first analytically compute the mean time until extinction (τ) in a simplistic model where all mutations are either lethal or neutral. The solution motivates the definition of two distinct regimes: a survival phase and an extinction phase, which differ dramatically in both how τ scales with N and in the coefficient of variation in time until extinction. Next, we perform stochastic population-genetics simulations on a realistic fitness landscape that both (i) features an epistatic distribution of fitness effects that agrees with experimental data on viruses and (ii) is based on the biophysics of protein folding. More specifically, we assume that mutations inflict fitness penalties proportional to the extent that they unfold proteins. We find that decreasing N can cause phase transition-like behavior from survival to extinction, which motivates the concept of “lethal isolation.” Furthermore, we find that lethal mutagenesis and lethal isolation interact synergistically, which may have clinical implications for treating infections. Broadly, we conclude that stably folded proteins are only possible in ecological settings that support sufficiently large populations.
Cold shock proteins (Csps) play an important role in cold shock response of a diverse number of organisms ranging from bacteria to humans. Numerous studies of the Csp from various species showed that a two-state folding mechanism is conserved and the transition state (TS) appears to be very compact. However, the atomic details of the folding mechanism of Csp remain unclear. This study presents the folding mechanism of Csp in atomic detail using an all-atom Go model-based simulations. Our simulations predict that there may exist an en route intermediate, in which β strands 1-2-3 are well ordered and the contacts between β1 and β4 are almost developed. Such an intermediate might be too unstable to be detected in the previous fluorescence energy transfer experiments. The transition state ensemble has been determined from the P(fold) analysis and the TS appears even more compact than the intermediate state.
Gō models are exceedingly popular tools in computer simulations of protein folding. These models are native-centric, i.e., they are directly constructed from the protein's native structure. Therefore, it is important to understand up to which extent the atomistic details of the native structure dictate the folding behavior exhibited by Gō models. Here we address this challenge by performing exhaustive discrete molecular dynamics simulations of a Gō potential combined with a full atomistic protein representation. In particular, we investigate the robustness of this particular type of Gō models in predicting the existence of intermediate states in protein folding. We focus on the N47G mutational form of the Spc-SH3 folding domain (x-ray structure) and compare its folding pathway with that of alternative native structures produced in silico. Our methodological strategy comprises equilibrium folding simulations, structural clustering, and principal component analysis.
We compared the folding pathways of selected mutational variants of the α-spectrin SH3 domain (Spc-SH3) by using a continuum model that combines a full atomistic protein representation with the Gō potential. Experimental data show that the N47G mutant shows very little tendency to aggregate while the N47A and triple mutant D48G(2Y) are both amyloidogenic, with the latter being clearly more aggregation prone. We identified a strikingly similar native-like folding intermediate across the three mutants, in which strand β1 is totally unstructured and more than half of the major hydrophobic core residues are highly solvent exposed. Results from extensive docking simulations show that the ability of the intermediates to dimerize is largely driven by strand β1 and is consistent with the in vitro aggregation behavior reported for the corresponding mutants. They further suggest that residues 44 and 53, which are key players in the nucleation–condensation mechanism of folding, are also important triggers of the aggregation process.
Both urea and guanidinium chloride (GdmCl) are frequently used as protein denaturants. Given that proteins generally adopt extended or unfolded conformations in either aqueous urea or GdmCl, one might expect that the unfolded protein chains will remain or become further extended due to the addition of another denaturant. However, a collapse of denatured proteins is revealed using atomistic molecular dynamics simulations when a mixture of denaturants is used. Both hen egg-white lysozyme and protein L are found to undergo collapse in the denaturant mixture. The collapse of the protein conformational ensembles is accompanied by a decreased solubility and increased non-native self-interactions of hydrophobic residues in the urea/GdmCl mixture. The increase of non-native interactions rather than the native contacts indicates that the proteins experience a simple collapse transition from the fully denatured states. During the protein collapse, the relatively stronger denaturant GdmCl displays a higher tendency to be absorbed onto the protein surface due to their stronger electrostatic interactions with proteins. At the same time, urea molecules also accumulate near the protein surface, resulting in an enhanced ?local crowding? for the protein near its first solvation shell. This rearrangement of denaturants near the protein surface and crowded local environment induce the protein collapse, mainly by burying their hydrophobic residues. These findings from molecular simulations are then further explained by a simple analytical model based on statistical mechanics.
The consistent observation across all kingdoms of life that highly abundant proteins evolve slowly demonstrates that cellular abundance is a key determinant of protein evolutionary rate. However, other empirical findings, such as the broad distribution of evolutionary rates, suggest that additional variables determine the rate of protein evolution. Here, we report that under the global selection against the cytotoxic effects of misfolded proteins, folding stability (ΔG), simultaneous with abundance, is a causal variable of evolutionary rate. Using both theoretical analysis and multiscale simulations, we demonstrate that the anticorrelation between the premutation ΔG and the arising mutational effect (ΔΔG), purely biophysical in origin, is a necessary requirement for abundance–evolutionary rate covariation. Additionally, we predict and demonstrate in bacteria that the strength of abundance–evolutionary rate correlation depends on the divergence time separating reference genomes. Altogether, these results highlight the intrinsic role of protein biophysics in the emerging universal patterns of molecular evolution.
Mutations create the genetic diversity on which selective pressures can act, yet also create structural instability in proteins. How, then, is it possible for organisms to ameliorate mutation-induced perturbations of protein stability while maintaining biological fitness and gaining a selective advantage? Here we used site-specific chromosomal mutagenesis to introduce a selected set of mostly destabilizing mutations into folA—an essential chromosomal gene of Escherichia coli encoding dihydrofolate reductase (DHFR)—to determine how changes in protein stability, activity, and abundance affect fitness. In total, 27 E. coli strains carrying mutant DHFR were created. We found no significant correlation between protein stability and its catalytic activity nor between catalytic activity and fitness in a limited range of variation of catalytic activity observed in mutants. The stability of these mutants is strongly correlated with their intracellular abundance, suggesting that protein homeostatic machinery plays an active role in maintaining intracellular concentrations of proteins. Fitness also shows a significant correlation with intracellular abundance of soluble DHFR in cells growing at 30 °C. At 42 °C, the picture was mixed, yet remarkable: A few strains carrying mutant DHFR proteins aggregated, rendering them nonviable, but, intriguingly, the majority exhibited fitness higher than wild type. We found that mutational destabilization of DHFR proteins in E. coli is counterbalanced at 42 °C by their soluble oligomerization, thereby restoring structural stability and protecting against aggregation.
Adaptive immunity is an amazing mechanism, whereby new protein functions—affinity of antibodies (Immunoglobulins) to new antigens—evolve through mutation and selection in a matter of a few days. Despite numerous experimental studies, the fundamental physical principles underlying immune response are still poorly understood. In considerable departure from past approaches, here, we propose a microscopic multiscale model of adaptive immune response, which consists of three essential players: The host cells, viruses, and B-cells in Germinal Centers (GC). Each moiety carries a genome, which encodes proteins whose stability and interactions are determined from their sequences using laws of Statistical Mechanics, providing an exact relationship between genomic sequences and strength of interactions between pathogens and antibodies and antibodies and host proteins (autoimmunity). We find that evolution of potent antibodies (the process known as Affinity Maturation (AM)) is a delicate balancing act, which has to reconcile the conflicting requirements of protein stability, lack of autoimmunity, and high affinity of antibodies to incoming antigens. This becomes possible only when antibody producing B cells elevate their mutation rates (process known as Somatic Hypermutation (SHM)) to fall into a certain range—not too low to find potency increasing mutations but not too high to destroy stable Immunoglobulins and/or already achieved affinity. Potent antibodies develop through clonal expansion of initial B cells expressing marginally potent antibodies followed by their subsequent affinity maturation through mutation and selection. As a result, in each GC the population of mature potent Immunoglobulins is monoclonal being ancestors of a single cell from initial (germline) pool. We developed a simple analytical theory, which provides further rationale to our findings. The model and theory reveal the molecular factors that determine the efficiency of affinity maturation, thereby providing insight into the variability of the immune response to cytopathic viruses (the direct response by germline antibodies) and poorly cytopathic viruses (a crucial role of SHM in the response).
In this work, we apply a detailed all-atom model with a transferable knowledge-based potential to study the folding kinetics of Formin-Binding protein, FBP28, which is a canonical three-stranded β-sheet WW domain. Replica exchange Monte Carlo simulations starting from random coils find native-like (Cα RMSD of 2.68 Å) lowest energy structure. We also study the folding kinetics of FBP28 WW domain by performing a large number of ab initio Monte Carlo folding simulations. Using these trajectories, we examine the order of formation of two β-hairpins, the folding mechanism of each individual β-hairpin, and transition state ensemble (TSE) of FBP28 WW domain and compare our results with experimental data and previous computational studies. To obtain detailed structural information on the folding dynamics viewed as an ensemble process, we perform a clustering analysis procedure based on graph theory. Further, a rigorous P(fold) analysis is used to obtain representative samples of the TSEs showing good quantitative agreement between experimental and simulated Φ values. Our analysis shows that the turn structure between first and second β strands is a partially stable structural motif that gets formed before entering the TSE in FBP28 WW domain and there exist two major pathways for the folding of FBP28 WW domain, which differ in the order and mechanism of hairpin formation.
Numerous experiments demonstrate a high level of promiscuity and structural disorder in organismal proteomes. Here, we ask the question what makes a protein promiscuous, that is, prone to nonspecific interactions, and structurally disordered. We predict that multi-scale correlations of amino acid positions within protein sequences statistically enhance the propensity for promiscuous intra- and inter-protein binding. We show that sequence correlations between amino acids of the same type are statistically enhanced in structurally disordered proteins and in hubs of organismal proteomes. We also show that structurally disordered proteins possess a significantly higher degree of sequence order than structurally ordered proteins. We develop an analytical theory for this effect and predict the robustness of our conclusions with respect to the amino acid composition and the form of the microscopic potential between the interacting sequences. Our findings have implications for understanding molecular mechanisms of protein aggregation diseases induced by the extension of sequence repeats.
We predict analytically that diagonal correlations of amino acid positions within protein sequences statistically enhance protein propensity for nonspecific binding. We use the term “promiscuity” to describe such nonspecific binding. Diagonal correlations represent statistically significant repeats of sequence patterns where amino acids of the same type are clustered together. The predicted effect is qualitatively robust with respect to the form of the microscopic interaction potentials and the average amino acid composition. Our analytical results provide an explanation for the enhanced diagonal correlations observed in hubs of eukaryotic organismal proteomes [J. Mol. Biol. 409, 439 (2011), 10.1016/j.jmb.2011.03.056]. We suggest experiments that will allow direct testing of the predicted effect.
We report a set of atomistic folding/unfolding simulations for the hairpin ribozyme using a Monte Carlo algorithm. The hairpin ribozyme folds in solution and catalyzes self-cleavage or ligation via a specific two-domain structure. The minimal active ribozyme has been studied extensively, showing stabilization of the active structure by cations and dynamic motion of the active structure. Here, we introduce a simple model of tertiary-structure formation that leads to a phase diagram for the RNA as a function of temperature and tertiary-structure strength. We then employ this model to capture many folding/unfolding events and to examine the transition-state ensemble (TSE) of the RNA during folding to its active "docked" conformation. The TSE is compact but with few tertiary interactions formed, in agreement with single-molecule dynamics experiments. To compare with experimental kinetic parameters, we introduce a novel method to benchmark Monte Carlo kinetic parameters to docking/undocking rates collected over many single molecular trajectories. We find that topology alone, as encoded in a biased potential that discriminates between secondary and tertiary interactions, is sufficient to predict the thermodynamic behavior and kinetic folding pathway of the hairpin ribozyme. This method should be useful in predicting folding transition states for many natural or man-made RNA tertiary structures.