The number of protein structures from structural genomics centers in the Protein Data Bank (PDB) is increasing dramatically. Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to successfully infer function using only structural similarity.
This paper develops the semiconservative quasispecies equations for genomes consisting of an arbitrary number of chromosomes. We assume that the chromosomes are distinguishable, so that we are effectively considering haploid genomes. We derive the quasispecies equations under the assumption of arbitrary lesion repair efficiency, and consider the cases of both random and immortal strand chromosome segregation. We solve the model in the limit of infinite sequence length for the case of the static single fitness peak landscape, where the master genome has a first-order growth rate constant of $k > 1$, and all other genomes have a first-order growth rate constant of 1. If we assume that each chromosome can tolerate an arbitrary number of lesions, so that only one master copy of the strands is necessary for a functional chromosome, then for random chromosome segregation we obtain an equilibrium mean fitness of $\bar{\kappa}(t = \infty) = k \left[ 2 \left( \frac{e^{-(1/N) \mu \lambda / 2} + e^{-(1/N) \mu (1 - \lambda / 2)}}{2} \right)^{N} - 1 \right]$ below the error catastrophe, while for immortal strand co-segregation we obtain $\bar{\kappa}(t = \infty) = k \left[ e^{-\mu (1 - \lambda / 2)} + e^{-\mu \lambda / 2} - 1 \right]$ ($N$ denotes the number of chromosomes, $\lambda$ denotes the lesion repair efficiency, and $\mu \equiv \varepsilon L$, where $\varepsilon$ is the per base-pair mismatch probability, and $L$ is the total genome length). It follows that immortal strand co-segregation leads to significantly better preservation of the master genome than random segregation when lesion repair is imperfect. Based on this result, we conjecture that certain classes of tumor cells exhibit immortal strand co-segregation.
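The two equilibrium mean fitness expressions quoted in this abstract can be compared numerically. The following is a minimal sketch, assuming the random-segregation expression reads as k[2((e^{-(1/N)μλ/2} + e^{-(1/N)μ(1-λ/2)})/2)^N - 1] and that both expressions are evaluated below the error catastrophe; the function and parameter names are illustrative and do not come from the paper.

```python
import math


def mean_fitness_random(k, mu, lam, n_chrom):
    """Equilibrium mean fitness for random chromosome segregation.

    Assumed reading of the abstract's formula:
        k * [2 * ((a + b) / 2) ** N - 1]
    with a = exp(-(1/N) * mu * lam / 2) and b = exp(-(1/N) * mu * (1 - lam / 2)).
    Only meaningful below the error catastrophe.
    """
    a = math.exp(-(mu / n_chrom) * lam / 2)
    b = math.exp(-(mu / n_chrom) * (1 - lam / 2))
    return k * (2 * ((a + b) / 2) ** n_chrom - 1)


def mean_fitness_immortal(k, mu, lam):
    """Equilibrium mean fitness for immortal strand co-segregation:
        k * [exp(-mu * (1 - lam / 2)) + exp(-mu * lam / 2) - 1]
    Only meaningful below the error catastrophe.
    """
    return k * (math.exp(-mu * (1 - lam / 2)) + math.exp(-mu * lam / 2) - 1)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    k, mu, n_chrom = 10.0, 1.0, 2
    for lam in (0.0, 0.5, 1.0):
        print(lam, mean_fitness_random(k, mu, lam, n_chrom), mean_fitness_immortal(k, mu, lam))
```

Under this reading the two expressions coincide for a single chromosome (N = 1) and for perfect lesion repair (λ = 1), and the random-segregation value is otherwise the smaller of the two. This agrees with the statement that immortal strand co-segregation better preserves the master genome when lesion repair is imperfect.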
In this work, we discovered a fundamental connection between selection for protein stability and emergence of preferred structures of proteins. Using a standard exact three-dimensional lattice model we evolve sequences starting from random ones and determine the exact native structure after each mutation. Acceptance of mutations is biased to select for stable proteins. We found that certain structures, “wonderfolds”, are independently discovered numerous times as native states of stable proteins in many unrelated runs of selection. The strong dependence of lattice fold usage on the structural determinant of designability quantitatively reproduces uneven fold usage in natural proteins. Diversity of sequences that fold into wonderfold structures gives rise to superfamilies, i.e. sets of dissimilar sequences that fold into the same or very similar structures. The present work establishes a model of pre-biotic structure selection, which identifies dominant structural patterns emerging upon optimization of proteins for survival in a hot environment. Convergently discovered pre-biotic initial superfamilies with wonderfold structures could have served as a seed for subsequent biological evolution involving gene duplications and divergence.
Through extensive experiment, simulation, and analysis of protein S6 (1RIS), we find that variations in nucleation and folding pathway between circular permutations are determined principally by the restraints of topology and specific nucleation, and affected by changes in chain entropy. Simulations also relate topological features to experimentally measured stabilities. Despite many sizable changes in ϕ values and the structure of the transition state ensemble that result from permutation, we observe a common theme: the critical nucleus in each of the mutants shares a subset of residues that can be mapped to the critical nucleus residues of the wild-type. Circular permutations create new N and C termini, which are the location of the largest disruption of the folding nucleus, leading to a decrease in both ϕ values and the role in nucleation. Mutant nuclei are built around the wild-type nucleus but are biased towards different parts of the S6 structure depending on the topological and entropic changes induced by the location of the new N and C termini.
The size and origin of the protein fold universe are of fundamental and practical importance. An analysis of randomly generated, compact sticky homopolypeptide conformations constructed in generic simplified and all-atom protein models shows that all have similar folds in the library of solved structures, the Protein Data Bank, and conversely, that all compact, single-domain protein structures in the Protein Data Bank have structural analogues in the compact model set. Thus, both sets are very likely complete, with the protein fold universe arising from compact conformations of hydrogen-bonded, secondary structures. Because side chains are represented by their Cβ atoms, these results also suggest that the observed protein folds are insensitive to the details of side-chain packing. Sequence specificity enters both in fine-tuning the structure and thermodynamically stabilizing a given fold with respect to the set of alternatives. When the models are scanned against a three-dimensional active-site library, close geometric matches are frequently found. Thus, the presence of active-site-like geometries also seems to be a consequence of the packing of compact, secondary structural elements. These results have significant implications for the evolution of protein structure and function.
To explore the plasticity and structural constraints of the protein-folding nucleus we have constructed through circular permutation four topological variants of the ribosomal protein S6. In effect, these topological variants represent entropy mutants with maintained spatial contacts. The proteins were characterized at two complementary levels of detail: by φ-value analysis estimating the extent of contact formation in the transition-state ensemble and by Hammond analysis measuring the site-specific growth of the folding nucleus. The results show that, although the loop-entropy alterations markedly influence the appearance and structural location of the folding nucleus, it retains a common motif of one helix docking against two strands. This nucleation motif is built around a shared subset of side chains in the center of the hydrophobic core but extends in different directions of the S6 structure following the permutant-specific differences in local loop entropies. The adjustment of the critical folding nucleus to alterations in loop entropies is reflected by a direct correlation between the φ-value change and the accompanying change in local sequence separation.
It has recently been demonstrated that many biological networks exhibit a “scale-free” topology, for which the probability of observing a node with a certain number of edges ($k$) follows a power law: i.e., $p(k) \sim k^{-\gamma}$. This observation has been reproduced by evolutionary models. Here we consider the network of protein-protein interactions (PPIs) and demonstrate that two published independent measurements of these interactions produce graphs that are only weakly correlated with one another despite their strikingly similar topology. We then propose a physical model based on the fundamental principle that (de)solvation is a major physical factor in PPIs. This model reproduces not only the scale-free nature of such graphs but also a number of higher-order correlations in these networks. A key support of the model is provided by the discovery of a significant correlation between the number of interactions made by a protein and the fraction of hydrophobic residues on its surface. The model presented in this paper represents a physical model for experimentally determined PPIs that comprehensively reproduces the topological features of interaction networks. These results have profound implications for understanding not only PPIs but also other types of scale-free networks.
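To make the power-law statement concrete, the sketch below computes an empirical degree distribution p(k) from an edge list and estimates γ from the slope of log p(k) against log k. The least-squares fit on log-log axes is a crude illustrative choice, not the analysis used in the paper, and the function names and toy edge list are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {k: p(k)} for an undirected graph given as (u, v) pairs.

    A self-interaction (u == v) is counted once for that node.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        if v != u:
            degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    return {k: counts[k] / n_nodes for k in sorted(counts)}


def fit_gamma(pk):
    """Estimate gamma in p(k) ~ k**(-gamma) by least squares on log-log axes.

    Assumption: an ordinary linear fit of log p(k) against log k is adequate
    for illustration; it is known to be a rough estimator on real data.
    """
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Toy interaction list (hypothetical protein names).
    toy_edges = [("A", "B"), ("A", "C"), ("A", "D")]
    pk = degree_distribution(toy_edges)
    print(pk, fit_gamma(pk))
```

Applied to two independently measured interaction lists, the same two functions would give one way to compare their degree distributions, though a comparison of topology alone says nothing about whether the individual edges agree.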
It has long been known that a protein's amino acid sequence dictates its native structure. However, despite significant recent advances, an ensemble description of how a protein achieves its native conformation from random coil under physiologically relevant conditions remains incomplete. Here we present a detailed all-atom model with a transferable potential that is capable of ab initio folding of entire protein domains using only sequence information. The computational efficiency of this model allows us to perform thousands of microsecond-timescale folding simulations of the engrailed homeodomain and to observe thousands of complete independent folding events. We apply a graph-theoretic analysis to this massive data set to elucidate which intermediates and intermediary states are common to many trajectories and thus important for the folding process. This method provides an atomically detailed and complete picture of a folding pathway at the ensemble level. The approach that we describe is quite general and could be used to study the folding of proteins on time scales orders of magnitude longer than currently possible.
In this work we develop a theory of interaction of randomly patterned surfaces as a generic prototype model of protein-protein interactions. The theory predicts that, for pairs of randomly superimposed identical (homodimeric) random patterns, the magnitude of the energy fluctuations with respect to their mutual orientation is always twice as large as for pairs of different (heterodimeric) random patterns. The amplitude of the energy fluctuations is proportional to the square of the average pattern density, to the square of the amplitude of the potential and its characteristic length, and scales linearly with the area of surfaces. The greater dispersion of interaction energies in the ensemble of homodimers implies that strongly attractive complexes of random surfaces are much more likely to be homodimers, rather than heterodimers. Our findings suggest a plausible physical reason for the anomalously high fraction of homodimers observed in real protein interaction networks.
Protein structure is generally conceptualized as the global arrangement of smaller, local motifs of helices, sheets, and loops. These regular, recurring secondary structural elements have well understood and standardized definitions in terms of amino acid backbone geometry and the manner in which hydrogen bonding requirements are satisfied. Recently, “tube” models have been proposed to explain protein secondary structure in terms of the geometrically optimal packing of a featureless cylinder. However, atomically detailed simulations demonstrate that such packing considerations alone are insufficient for defining secondary structure; both excluded volume and hydrogen bonding must be explicitly modeled for helix formation. These results have fundamental implications for the construction and interpretation of realistic and meaningful biomacromolecular models.
A generalized computational method for folding proteins with a fully transferable potential and geometrically realistic all-atom model is presented and tested on seven helix bundle proteins. The protocol, which includes graph-theoretical analysis of the ensemble of resulting folded conformations, was systematically applied and consistently produced structure predictions of ≈3 Å without any knowledge of the native state. To measure and understand the significance of the results, extensive control simulations were conducted. Graph theoretic analysis provides a means for systematically identifying the native fold and provides physical insight, conceptually linking the results to modern theoretical views of protein folding. In addition to presenting a method for prediction of structure and folding mechanism, our model suggests that an accurate all-atom amino acid representation coupled with a physically reasonable atomic interaction potential and hydrogen bonding are essential features for a realistic protein model.
This paper develops a point-mutation model describing the evolutionary dynamics of a population of adult stem cells. Such a model may prove useful for quantitative studies of tissue aging and the emergence of cancer. We consider two modes of chromosome segregation: (1) random segregation, where the daughter chromosomes of a given parent chromosome segregate randomly into the stem cell and its differentiating sister cell and (2) “immortal DNA strand” co-segregation, for which the stem cell retains the daughter chromosomes with the oldest parent strands. Immortal strand co-segregation is a mechanism, originally proposed by Cairns [Nature (London) 255, 197 (1975)], by which stem cells preserve the integrity of their genomes. For random segregation, we develop an ordered strand pair formulation of the dynamics, analogous to the ordered strand pair formalism developed for quasispecies dynamics involving semiconservative replication with imperfect lesion repair (in this context, lesion repair is taken to mean repair of postreplication base-pair mismatches). Interestingly, a similar formulation is possible with immortal strand co-segregation, despite the fact that this segregation mechanism is age dependent. From our model we are able to show mathematically that, when lesion repair is imperfect, immortal strand co-segregation leads to better preservation of the stem cell lineage than random chromosome segregation. Furthermore, our model allows us to estimate the optimal lesion repair efficiency for preserving an adult stem cell population for a given period of time. For human stem cells, we find that mispaired bases still present after replication and cell division should be left untouched, to avoid potentially fixing a mutation in both DNA strands.
Understanding the observed variability in the number of homologs of a gene is a very important unsolved problem that has broad implications for research into coevolution of structure and function, gene duplication, pseudogene formation, and possibly for emerging diseases. Here, we attempt to define and elucidate some possible causes behind the observed irregularity in sequence space. We present evidence that sequence variability and functional diversity of a gene or fold family is influenced by quantifiable characteristics of the protein structure. These characteristics reflect the structural potential for sequence plasticity, i.e., the ability to accept mutation without losing thermodynamic stability. We identify a structural feature of a protein domain, contact density, that serves as a determinant of entropy in sequence space, i.e., the ability of a protein to accept mutations without destroying the fold (also known as fold designability). We show that the logarithm of the average gene family size exhibits statistical correlation ($R^2 > 0.9$) with the contact density of its three-dimensional structure. We present evidence that the size of individual gene families is influenced not only by the designability of the structure, but also by evolutionary history, e.g., the amount of time the gene family was in existence. We further show that our observed statistical correlation between gene family size and contact density of the structure is valid on many levels of evolutionary divergence, i.e., not only for closely related sequences, but also for less-related fold and superfamily levels of homology.
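As an illustration of the structural feature discussed in this abstract, the sketch below computes a contact density from a list of per-residue coordinates. The definition used here (number of residue pairs within a distance cutoff, divided by chain length) and the default values of `cutoff` and `min_separation` are assumptions made for the example; the abstract does not specify them.

```python
import math


def contact_density(coords, cutoff=7.5, min_separation=2):
    """Contacts per residue for one domain.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    cutoff: distance threshold in the same units as coords (assumed value).
    min_separation: minimum sequence separation |i - j| for a pair to count,
        so that chain neighbours are excluded (assumed value).
    """
    n = len(coords)
    if n == 0:
        raise ValueError("coords must contain at least one residue")
    contacts = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts += 1
    return contacts / n


if __name__ == "__main__":
    # Four hypothetical residues on the corners of a 4 x 4 square.
    toy = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0), (0.0, 4.0, 0.0)]
    print(contact_density(toy))
```

With per-family values of this quantity in hand, the reported relationship could be examined by regressing the logarithm of average family size on contact density.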
In this article we study the full semiconservative treatment of a model for the coevolution of a virus and an adaptive immune system. Regions of viability are calculated for both conservatively and semiconservatively replicating viruses interacting with a realistic semiconservatively replicating immune system. The conservative virus is found to have a selective advantage in the form of an ability to survive in regions with a wider range of mutation rates than its semiconservative counterpart, as well as an increased replication rate where both species can survive. This may help explain the existence of a rich range of viruses with conservatively replicating genomes, a trait that is found nowhere else in nature.
We use an integrated computational approach to reconstruct accurately the transition state ensemble (TSE) for folding of the src-SH3 protein domain. We first identify putative TSE conformations from free energy surfaces generated by importance sampling molecular dynamics for a fully atomic, solvated model of the src-SH3 protein domain. These putative TSE conformations are then subjected to a folding analysis using a coarse-grained representation of the protein and rapid discrete molecular dynamics simulations. Those conformations that fold to the native conformation with a probability (Pfold) of approximately 0.5 constitute the true transition state. Approximately 20% of the putative TSE structures were found to have a Pfold near 0.5, indicating that, although correct TSE conformations are populated at the free energy barrier, there is a critical need to refine this ensemble. Our simulations indicate that the true TSE conformations are compact, with a well-defined central β sheet, in good agreement with previous experimental and theoretical studies. A structured central β sheet was found to be present in a number of pre-TSE conformations, however, indicating that this element, although required in the transition state, does not define it uniquely. An additional tight cluster of contacts between highly conserved residues belonging to the diverging turn and second β-sheet of the protein emerged as a critical element of the folding nucleus. A number of commonly used order parameters to identify the transition state for folding were investigated, with the number of native Cβ contacts displaying the most satisfactory correlation with Pfold values.
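The Pfold criterion described above can be sketched as a simple filter: for each putative TSE conformation, estimate Pfold as the fraction of independent trial simulations that reach the native state, and keep the conformations whose estimate lies near 0.5. The data layout, the function names, and the `tolerance` value are assumptions for this example and are not taken from the paper.

```python
def estimate_pfold(outcomes):
    """Fraction of trial simulations that folded.

    outcomes: list of booleans, True if the run started from this
    conformation reached the native state (hypothetical input format).
    """
    if not outcomes:
        raise ValueError("at least one trial outcome is required")
    return sum(1 for folded in outcomes if folded) / len(outcomes)


def select_transition_states(candidates, tolerance=0.1):
    """Return names of conformations whose estimated Pfold is within
    `tolerance` of 0.5 (the tolerance is an assumed, illustrative value).

    candidates: dict mapping a conformation name to its list of outcomes.
    """
    return [
        name
        for name, outcomes in candidates.items()
        if abs(estimate_pfold(outcomes) - 0.5) <= tolerance
    ]


if __name__ == "__main__":
    # Hypothetical outcomes for three putative TSE conformations.
    trials = {
        "conf_a": [True, False, True, False],
        "conf_b": [True, True, True, True],
        "conf_c": [False, False, False, True],
    }
    print(select_transition_states(trials))
```

In practice the statistical error of each estimate shrinks only as the number of trial runs grows, so the tolerance would need to be chosen with the number of runs per conformation in mind.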
It has recently been discovered that many biological systems, when represented as graphs, exhibit a scale-free topology. One such system is the set of structural relationships among protein domains. The scale-free nature of this and other systems has previously been explained using network growth models that, although motivated by biological processes, do not explicitly consider the underlying physics or biology. In this work we explore a sequence-based model for the evolution of protein structures and demonstrate that this model is able to recapitulate the scale-free nature observed in graphs of real protein structures. We find that this model also reproduces other statistical features of the protein domain graph. This represents, to our knowledge, the first such microscopic, physics-based evolutionary model for a scale-free network of biological importance and as such has strong implications for our understanding of the evolution of protein structures and of other biological networks.
Quasispecies theory has emerged as an important tool for modeling the evolutionary dynamics of biological systems. We review recent advances in the field, with an emphasis on the quasispecies dynamics of semiconservatively replicating genomes. Applications to cancer and adult stem cell growth are discussed. Additional topics, such as genetic repair and many-gene genomes, are covered as well.
Motivation: Given a large family of homologous protein sequences, many methods can divide the family into smaller groups that correspond to the different functions carried out by proteins within the family. One important problem, however, has been the absence of a general method for selecting an appropriate level of granularity, or size of the groups. Results: We propose a consistent way of choosing the granularity that is independent of the sequence similarity and sequence clustering method used. We study three large, well-investigated protein families: basic leucine zippers, nuclear receptors and proteins with three consecutive C2H2 zinc fingers. Our method is tested against known functional information, the experimentally determined binding specificities, using a simple scoring method. The significance of the groups is also measured by randomizing the data. Finally, we compare our algorithm against a popular method of grouping proteins, the TRIBE-MCL method. In the end, we determine that dividing the families at the proposed level of granularity creates very significant and useful groups of proteins that correspond to the different DNA-binding motifs. We expect that such groupings will be useful in studying not only DNA binding but also other protein interactions.