Motivation: Given a large family of homologous protein sequences, many methods can divide the family into smaller groups that correspond to the different functions carried out by proteins within the family. One important problem, however, has been the absence of a general method for selecting an appropriate level of granularity, or size of the groups. Results: We propose a consistent way of choosing the granularity that is independent of the sequence similarity and sequence clustering method used. We study three large, well-investigated protein families: basic leucine zippers, nuclear receptors and proteins with three consecutive C2H2 zinc fingers. Our method is tested against known functional information, the experimentally determined binding specificities, using a simple scoring method. The significance of the groups is also measured by randomizing the data. Finally, we compare our algorithm against a popular method of grouping proteins, the TRIBE-MCL method. In the end, we determine that dividing the families at the proposed level of granularity creates very significant and useful groups of proteins that correspond to the different DNA-binding motifs. We expect that such groupings will be useful in studying not only DNA binding but also other protein interactions.
The determination of the phylogenetic relationships among microorganisms has long relied primarily on gene sequence information. Given that prokaryotic organisms often lack morphological characteristics amenable to phylogenetic analysis, prokaryotic phylogenies, in particular, are often based on sequence data. In this work, we explore a new source of phylogenetic information, the distribution of protein structural domains within fully sequenced prokaryotic genomes. The evolution of the structural domains we use has been studied extensively, allowing us to base our phylogenetic methods on testable theoretical models of structural evolution. We find that the methods that produce reasonable phylogenetic relationships are indeed the methods that are most consistent with theoretical evolutionary models. This work represents, to our knowledge, the first such theoretically motivated phylogeny, as well as the first application of structural information to phylogeny on this scale. Our results have strong implications for the phylogenetic relationships among prokaryotic organisms and for the understanding of protein evolution as a whole.
Evolutionary traces of thermophilic adaptation are manifest, on the whole-genome level, in compositional biases toward certain types of amino acids. However, it is sometimes difficult to discern their causes without a clear understanding of underlying physical mechanisms of thermal stabilization of proteins. For example, it is well-known that hyperthermophiles feature a greater proportion of charged residues, but, surprisingly, the excess of positively charged residues is almost entirely due to lysines but not arginines in the majority of hyperthermophilic genomes. All-atom simulations show that lysines have a much greater number of accessible rotamers than arginines of similar degree of burial in folded states of proteins. This finding suggests that lysines would preferentially entropically stabilize the native state. Indeed, we show in computational experiments that arginine-to-lysine amino acid substitutions result in noticeable stabilization of proteins. We then hypothesize that if evolution uses this physical mechanism as a complement to electrostatic stabilization in its strategies of thermophilic adaptation, then hyperthermostable organisms would have much greater content of lysines in their proteomes than comparably sized and similarly charged arginines. Consistent with that, high-throughput comparative analysis of complete proteomes shows extremely strong bias toward arginine-to-lysine replacement in hyperthermophilic organisms and overall much greater content of lysines than arginines in hyperthermophiles. This finding cannot be explained by genomic GC compositional biases or by the universal trend of amino acid gain and loss in protein evolution. We discovered here a novel entropic mechanism of protein thermostability due to residual dynamics of rotamer isomerization in native state and demonstrated its immediate proteomic implications. Our study provides an example of how analysis of a fundamental physical mechanism of thermostability helps to resolve a puzzle in comparative genomics as to why amino acid compositions of hyperthermophilic proteomes are significantly biased toward lysines but not similarly charged arginines.
We investigate all-atom potentials of mean force for estimating free energies in protein folding and fold recognition. We search through the space potentials and design novel atomic potentials with a random mixing approximation and a contact-correlated Gaussian approximation of decoy states. We show that the two derived potentials are highly correlated, supporting the use of the random energy model as an accurate statistical description of protein conformational states. The novel atomic potentials perform well in a Z-score and fold decoy recognition test. Furthermore, the designed atomic potential performs slightly and significantly better than atomic potentials derived under a quasi-chemical assumption. While accounting for connectivity correlations between atom types does not improve the performance of the designed potential, we show these correlations lead to ambiguities in the distribution of energetic contributions for atoms on the same residue. Within the confines of the model then, many potentials may exist which stabilize all native folds in subtly different ways. Comparison of different protein conformations under the various atomic potentials reveals both a remarkable degree of correspondence in the estimated free energies and a remarkable degree of correspondence in the identity of the contacts types that make the dominant contributions to the estimated free energies. This consistency may be interpreted as a sign that the design procedure is extracting physically meaningful quantities.
Analysis of structures and sequences of several hyperthermostable proteins from various sources reveals two major physical mechanisms of their thermostabilization. The first mechanism is "structure-based,” whereby some hyperthermostable proteins are significantly more compact than their mesophilic homologues, while no particular interaction type appears to cause stabilization; rather, a sheer number of interactions is responsible for thermostability. Other hyperthermostable proteins employ an alternative, "sequence-based” mechanism of their thermal stabilization. They do not show pronounced structural differences from mesophilic homologues. Rather, a small number of apparently strong interactions is responsible for high thermal stability of these proteins. High-throughput comparative analysis of structures and complete genomes of several hyperthermophilic archaea and bacteria revealed that organisms develop diverse strategies of thermophilic adaptation by using, to a varying degree, two fundamental physical mechanisms of thermostability. The choice of a particular strategy depends on the evolutionary history of an organism. Proteins from organisms that originated in an extreme environment, such as hyperthermophilic archaea (Pyrococcus furiosus), are significantly more compact and more hydrophobic than their mesophilic counterparts. Alternatively, organisms that evolved as mesophiles but later recolonized a hot environment (Thermotoga maritima) relied in their evolutionary strategy of thermophilic adaptation on "sequence-based” mechanism of thermostability. We propose an evolutionary explanation of these differences based on physical concepts of protein designability.
Certain amino acid residues in a protein, when mutated, change the protein's function. We present an improved method of finding these specificity-determining positions that uses all the protein sequence data available for a family of homologous proteins. We study in detail two families of eukaryotic transcription factors, basic leucine zippers and nuclear receptors, because of the large amount of sequences and experimental data available. These protein families also have a clear definition of functional specificity: DNA-binding specificity. We compare our results to three other methods, including the evolutionary trace algorithm and a method that depends on orthology relationships. All of the predictions are compared to the available mutational and crystallographic data. We find that our method provides superior predictions of the known specificity-determining residues and also predicts residue positions within these families that deserve further study for their roles in functional specificity.
We present a verified computational model of the SH3 domain transition state (TS) ensemble. This model was built for three separate SH3 domains using experimental ϕ-values as structural constraints in all-atom protein folding simulations. While averaging over all conformations incorrectly considers non-TS conformations as transition states, quantifying structures as pre-TS, TS, and post-TS by measurement of their transmission coefficient ("probability to fold”, or pfold) allows for rigorous conclusions regarding the structure of the folding nucleus and a full mechanistic analysis of the folding process. Through analysis of the TS, we observe a highly polarized nucleus in which many residues are solvent-exposed. Mechanistic analysis suggests the hydrophobic core forms largely after an early nucleation step. SH3 presents an ideal system for studying the nucleation-condensation mechanism and highlights the synergistic relationship between experiment and simulation in the study of protein folding.
In this study, we explore nucleation and the transition state ensemble of the ribosomal protein S6 using a Monte Carlo (MC) Go model in conjunction with restraints from experiment. The results are analyzed in the context of extensive experimental and evolutionary data. The roles of individual residues in the folding nucleus are identified, and the order of events in the S6 folding mechanism is explored in detail. Interpretation of our results agrees with, and extends the utility of, experiments that shift φ-values by modulating denaturant concentration and presents strong evidence for the realism of the mechanistic details in our MC Go model and the structural interpretation of experimental φ-values. We also observe plasticity in the contacts of the hydrophobic core that support the specific nucleus. For S6, which binds to RNA and protein after folding, this plasticity may result from the conformational flexibility required to achieve biological function. These results present a theoretical and conceptual picture that is relevant in understanding the mechanism of nucleation in protein folding.
In this paper, we study the equilibrium behavior of Eigen’s quasispecies equations for an arbitrary gene network. We consider a genome consisting of N genes, so that the full genome sequence σ may be written as σ=σ1σ2⋯σN, where σi are sequences of individual genes. We assume a single fitness peak model for each gene, so that gene i has some "master” sequence σi,0 for which it is functioning. The fitness landscape is then determined by which genes in the genome are functioning and which are not. The equilibrium behavior of this model may be solved in the limit of infinite sequence length. The central result is that, instead of a single error catastrophe, the model exhibits a series of localization to delocalization transitions, which we term an "error cascade.” As the mutation rate is increased, the selective advantage for maintaining functional copies of certain genes in the network disappears, and the population distribution delocalizes over the corresponding sequence spaces. The network goes through a series of such transitions, as more and more genes become inactivated, until eventually delocalization occurs over the entire genome space, resulting in a final error catastrophe. This model provides a criterion for determining the conditions under which certain genes in a genome will lose functionality due to genetic drift. It also provides insight into the response of gene networks to mutagens. In particular, it suggests an approach for determining the relative importance of various genes to the fitness of an organism, in a more accurate manner than the standard "deletion set” method. The results in this paper also have implications for mutational robustness and what C.O. Wilke termed "survival of the flattest.”
This paper develops a Hamming class formalism for the semiconservative quasispecies equations with imperfect lesion repair, first presented and analytically solved in Y. Brumer and E.I. Shakhnovich (q-bio.GN/0403018, 2004). Starting from the quasispecies dynamics over the space of genomes, we derive an equivalent dynamics over the space of ordered sequence pairs. From this set of equations, we are able to derive the infinite sequence length form of the dynamics for a class of fitness landscapes defined by a master genome. We use these equations to solve for a generalized single-fitness-peak landscape, where the master genome can sustain a maximum number of lesions and remain viable. We determine the mean equilibrium fitness and error threshold for this class of landscapes, and show that when lesion repair is imperfect, semiconservative replication displays characteristics from both conservative replication and semiconservative replication with perfect lesion repair. The work presented here provides a formulation of the model which greatly facilitates the analysis of a relatively broad class of fitness landscapes, and thus serves as a convenient springboard into biological applications of imperfect lesion repair.
We address the response of a random heteropolymer to preferential solvation of certain monomer types at the globule-solvent interface. For each set of monomers that can comprise the molecule’s surface, we represent the ensemble of allowed configurations by a Gaussian distribution of energy levels, whose mean and variance depend on the set’s composition. Within such a random energy model, mean surface composition is proportional to solvation strength under most conditions. The breadth of this linear response regime arises from the approximate statistical independence of surface and volume energies. Fluctuations play a crucial role in determining the excess of solvophilic monomers at the surface, and for a diverse set of monomer types can be overcome only by very strong solvent preference.
In this paper, we extend a model of host-parasite coevolution to incorporate the semiconservative nature of DNA replication for both the host and the parasite. We find that the optimal mutation rate for the semiconservative and conservative hosts converge for realistic genome lengths, thus maintaining the admirable agreement between theory and experiment found previously for the conservative model and justifying the conservative approximation in some cases. We demonstrate that, while the optimal mutation rate for a conservative and semiconservative parasite interacting with a given immune system is similar to that of a conservative parasite, the properties away from this optimum differ significantly. We suspect that this difference, coupled with the requirement that a parasite optimize survival in a range of viable hosts, may help explain why semiconservative viruses are known to have significantly lower mutation rates than their conservative counterparts.
This paper develops a two-gene, single fitness peak model for determining the equilibrium distribution of genotypes in a unicellular population which is capable of genetic damage repair. The first gene, denoted by σvia, yields a viable organism with first-order growth rate constant k>1 if it is equal to some target "master” sequence σvia,0. The second gene, denoted by σrep, yields an organism capable of genetic repair if it is equal to some target "master” sequence σrep,0. This model is analytically solvable in the limit of infinite sequence length, and gives an equilibrium distribution which depends on μ≡Lε, the product of sequence length and per base pair replication error probability, and εr, the probability of repair failure per base pair. The equilibrium distribution is shown to exist in one of the three possible "phases.” In the first phase, the population is localized about the viability and repairing master sequences. As εr exceeds the fraction of deleterious mutations, the population undergoes a "repair” catastrophe, in which the equilibrium distribution is still localized about the viability master sequence, but is spread ergodically over the sequence subspace defined by the repair gene. Below the repair catastrophe, the distribution undergoes the error catastrophe when μ exceeds lnk/εr, while above the repair catastrophe, the distribution undergoes the error catastrophe when μ exceeds lnk/fdel, where fdel denotes the fraction of deleterious mutations.
Knowledge-based potentials have been found useful in a variety of biophysical studies of macromolecules. Recently, it has also been shown in self-consistent studies that it is possible to extract quantities consistent with pair potentials from model structural databases. In this study, we attempt to extend the results obtained from these self-consistent studies toward the extraction of realistic pair potentials from the Protein Data Bank (PDB). The new method utilizes a clustering approach to define atom types within the PDB consistent with the optimal effective pairwise potential. The method has been integrated into the SMoG drug design package, resulting in an improved approach for the rapid and accurate estimation of binding affinities from structural information. Using this approach, it is possible to generate simple knowledge-based potentials that correlate (R = 0.61) with experimental binding affinities in a database of 118 diverse complexes. Furthermore, predictions performed on a random 1/3 of the database consistently show an average unsigned error of 1.5 log Ki units. It is also possible to generate specialized knowledge-based potentials, targeted to specific protein families. This approach is capable of generating potentials that correlate strongly with experimental binding affinities within these families (R = 0.8?0.9). Predictions on 1/3 of these family databases yield average unsigned errors ranging from 1.1 to 1.3 log Ki units. In summary, we describe a physically motivated approach to optimizing knowledge-based potentials for binding energy prediction that can be integrated into a variety of stages within a lead discovery protocol.
This paper extends Eigen’s quasispecies equations to account for the semiconservative nature of DNA replication. We solve the equations in the limit of infinite sequence length for the simplest case of a static, sharply peaked fitness landscape. We show that the error catastrophe occurs when μ, the product of sequence length and per base pair mismatch probability, exceeds 2 ln[2∕(1+1∕k)], where k>1 is the first-order growth rate constant of the viable "master” sequence (with all other sequences having a first-order growth rate constant of 1). This is in contrast to the result of ln k for conservative replication. In particular, as k→∞, the error catastrophe is never reached for conservative replication, while for semiconservative replication the critical μ approaches 2 ln 2. Semiconservative replication is therefore considerably less robust than conservative replication to the effect of replication errors. We also show that the mean equilibrium fitness of a semiconservatively replicating system is given by k(2e−μ∕2−1) below the error catastrophe, in contrast to the standard result of ke−μ for conservative replication (derived by Kimura and Maruyama in 1966). From this result it is readily shown that semiconservative replication is necessary to account for the observation that, at sufficiently high mutagen concentrations, faster replicating cells will die more quickly than more slowly replicating cells. Thus, in contrast to Eigen’s original model, the semiconservative quasispecies equations are able to provide a mathematical basis for explaining the efficacy of mutagens as chemotherapeutic agents.
An accurate characterization of the transition state ensemble (TSE) is central to furthering our understanding of the protein folding reaction. We have extensively tested a recently reported method for studying a protein's TSE, utilizing φ-value data from protein engineering experiments and computational studies as restraints in all-atom Monte Carlo (MC) simulations. The validity of interpreting experimental φ-values as the fraction of native contacts made by a residue in the TSE was explored, revealing that this definition is unable to uniquely specify a TSE. The identification of protein G's second hairpin, in both pre and post-transition conformations demonstrates that high experimental φ-values do not guarantee a residue's importance in the TSE. An analysis of simulations based on structures restrained by experimental φ-values is necessary to yield this result, which is not obvious from a simplistic interpretation of individual φ-values. The TSE that we obtain corresponds to a single, specific nucleation event, characterized by six residues common to all three observed, convergent folding pathways. The same specific nucleus was independently identified from computational and experimental data, and "Conservation of Conservation” analysis in the protein G fold. When associated strictly with complete nucleus formation and concomitant chain collapse, folding is a well-defined two state event. Once the nucleus has formed, the folding reaction enters a slow relaxation process associated with side-chain packing and small, local backbone rearrangements. A detailed analysis of φ-values and their relationship to the transition state ensemble allows us to construct a unified theoretical model of protein G folding.
Recent work has shown that the network of structural similarity between protein domains exhibits a power-law distribution of edges per node. The scale-free nature of this graph, termed the protein domain universe graph or PDUG, may be reproduced via a divergent model of structural evolution. The performance of this model, however, does not preclude the existence of a successful convergent model. To further resolve the issue of protein structural evolution, we explore the predictions of both convergent and divergent models directly. We show that when nodes from the PDUG are partitioned into subgraphs on the basis of their occurrence in the proteomes of particular organisms, these subgraphs exhibit a scale-free nature as well. We explore a simple convergent model of structural evolution and find that the implications of this model are inconsistent with features of these organismal subgraphs. Importantly, we find that biased convergent models are inconsistent with our data. We find that when speciation mechanisms are added to a simple divergent model, subgraphs similar to the organismal subgraphs are produced, demonstrating that dynamic models can easily explain the distributions of structural similarity that exist within proteomes. We show that speciation events must be included in a divergent model of structural evolution to account for the non-random overlap of structural proteomes. These findings have implications for the long-standing debate over convergent and divergent models of protein structural evolution, and for the study of the evolution of organisms as a whole.
The transition from a normal to cancerous cell requires a number of highly specific mutations that affect cell cycle regulation, apoptosis, differentiation, and many other cell functions. One hallmark of cancerous genomes is genomic instability, with mutation rates far greater than those of normal cells. In microsatellite instability (MIN tumors), these are often caused by damage to mismatch repair genes, allowing further mutation of the genome and tumor progression. These mutation rates may lie near the error catastrophe found in the quasispecies model of adaptive RNA genomes, suggesting that further increasing mutation rates will destroy cancerous genomes. However, recent results have demonstrated that DNA genomes exhibit an error threshold at mutation rates far lower than their conservative counterparts. Furthermore, while the maximum viable mutation rate in conservative systems increases indefinitely with increasing master sequence fitness, the semiconservative threshold plateaus at a relatively low value. This implies a paradox, wherein inaccessible mutation rates are found in viable tumor cells. In this paper, we address this paradox, demonstrating an isomorphism between the conservatively replicating (RNA) quasispecies model and the semiconservative (DNA) model with post-methylation DNA repair mechanisms impaired. Thus, as DNA repair becomes inactivated, the maximum viable mutation rate increases smoothly to that of a conservatively replicating system on a transformed landscape, with an upper bound that is dependent on replication rates. On a specific single fitness peak landscape, the repair-free semiconservative system is shown to mimic a conservative system exactly. We postulate that inactivation of post-methylation repair mechanisms is fundamental to the progression of a tumor cell and hence these mechanisms act as a method for the prevention and destruction of cancerous genomes.
We attempt to understand the evolutionary origin of protein folds by simulating their divergent evolution with a three-dimensional lattice model. Starting from an initial seed lattice structure, evolution of model proteins progresses by sequence duplication and subsequent point mutations. A new gene's ability to fold into a stable and unique structure is tested each time through direct kinetic folding simulations. Where possible, the algorithm accepts the new sequence and structure and thus a "new protein structure” is born. During the course of each run, this model evolutionary algorithm provides several thousand new proteins with diverse structures. Analysis of evolved structures shows that later evolved structures are more designable than seed structures as judged by recently developed structural determinant of protein designability, as well as direct estimate of designability for selected structures by thermodynamic sampling of their sequence space. We test the significance of this trend predicted on lattice models on real proteins and show that protein domains that are found in eukaryotic organisms only feature statistically significant higher designability than their prokaryotic counterparts. These results present a fundamental view on protein evolution highlighting the relative roles of structural selection and evolutionary dynamics on genesis of modern proteins.