Theory

On the Role of Stop Codons in the Genetic Code

John Cole

Proposition of a possible mode of action:

We can consider the set of possible SNSs as a simple Markov process. Given any genetic code (a table of all 64 codons and the amino acids they specify), we can construct a Markov matrix whose elements M_ij specify the probability that, under an SNS, an amino acid “i” will become an amino acid “j”. The construction of this matrix is straightforward: for each codon that specifies the i^th amino acid, we sum all the ways (weighted by their probabilities) that the codon can mutate into a codon specifying the j^th amino acid; we then sum over all codons that specify the i^th amino acid. The element M_ij is this sum, normalized such that the elements in each column of the matrix sum to 1:

M_ij = [ Σ_k Σ_l P(codon_k → codon_l) ] / A_j

Here, k ranges over the codons that code for amino acid (or stop) i, l ranges over the codons that code for amino acid (or stop) j, and A_j is a normalization constant chosen such that the elements in column j of the matrix sum to 1.
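The construction above can be sketched numerically. This is a minimal illustration, not a definitive implementation: it assumes the standard genetic code, and it uses a uniform placeholder weighting in which each of the nine single-base neighbours of a codon is reached with probability 1/9. The names sns_prob and build_matrix are hypothetical, and any other weighting of substitution types or codon positions could be swapped into sns_prob.

```python
from itertools import product

import numpy as np

BASES = "UCAG"
# Standard genetic code listed in UCAG order (third base varies fastest); '*' marks stop.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}


def sns_prob(src, dst):
    """P(src -> dst) under one SNS.

    Placeholder assumption: all nine single-base changes are equally likely.
    """
    differing = sum(a != b for a, b in zip(src, dst))
    return 1.0 / 9.0 if differing == 1 else 0.0


def build_matrix(code):
    """Return (labels, M), where M is the column-normalized 21 by 21 matrix."""
    labels = sorted(set(code.values()))          # 20 amino acids plus '*'
    idx = {a: n for n, a in enumerate(labels)}
    N = np.zeros((len(labels), len(labels)))
    for k, l in product(code, repeat=2):         # all ordered codon pairs
        N[idx[code[k]], idx[code[l]]] += sns_prob(k, l)
    A = N.sum(axis=0)                            # A_j: sum of column j
    return labels, N / A                         # divide column j by A_j


labels, M = build_matrix(CODE)
print(M.shape)
print(np.allclose(M.sum(axis=0), 1.0))
```

Because the double sum is taken over codon pairs directly, the same loop works unchanged for any alternative code table passed in as a codon-to-amino-acid dictionary.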

The probabilities that a given codon can become another codon are governed by the type of SNS occurring. In general, the mutations U ↔ A and C ↔ G are more likely than mutations between U or A and C or G, and mutations in the second base are less likely than those in the first or third base, with the third base in particular showing little preference for the type of mutation. Including the stop codons, we have a 21 by 21 matrix whose steady state eigenvector (the eigenvector associated with eigenvalue 1) is simply the relative frequencies of the amino acids within the genetic code (for example, leucine would have probability 6/64, stop would have probability 3/64, etc.).

The solution, v_{steady state}, to:

(M – I) v_{steady state} = 0

is the long term steady state vector composed of the relative frequencies of the different amino acids.
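A short check of why the codon counts give the steady state may be useful here. It rests on two assumptions that are not stated explicitly above: that the codon-level probabilities are symmetric, P(codon_k → codon_l) = P(codon_l → codon_k), and that the probabilities leaving any one codon sum to 1. The symbols C_i (the set of codons for amino acid or stop i), n_i (the number of codons in C_i), and N_ij (the unnormalized double sum) are introduced only for this sketch.

```latex
\begin{align*}
N_{ij} &= \sum_{k \in C_i} \sum_{l \in C_j} P(\mathrm{codon}_k \to \mathrm{codon}_l),
  \qquad N_{ij} = N_{ji}, \\
A_j &= \sum_i N_{ij}
  = \sum_{l \in C_j} \sum_{k} P(\mathrm{codon}_l \to \mathrm{codon}_k)
  = n_j, \\
(M n)_i &= \sum_j \frac{N_{ij}}{A_j}\, n_j
  = \sum_j N_{ij}
  = \sum_{k \in C_i} \sum_{l} P(\mathrm{codon}_k \to \mathrm{codon}_l)
  = n_i .
\end{align*}
```

So M n = n, and normalizing gives v_{steady state} = n / 64, for example 6/64 for leucine and 3/64 for stop. Under the same symmetry assumption, M_ij = N_ji / n_j, so the element can equally be read as the probability that a codon drawn uniformly from those of amino acid j mutates into one specifying amino acid i, which is the reading that matches the column normalization.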

What is the significance of this eigenvector? It is the long term steady state of the system: if we allowed coding sequences to evolve via SNSs for an eternity, the relative frequency with which the i^th amino acid appears would be given by the i^th element of this vector. But this assumes no evolutionary pressure toward functional proteins. In particular, this model allows amino acids to mutate to and from stop codons, which, when we consider the effect on the produced protein, is likely to be devastating.

Imagine a protein made up of 1000 amino acids undergoing a mutation in its 300^th codon that changes a tryptophan to a stop codon; protein production is truncated early, and the peptides produced are likely to have lost all of their capacity for biological function. Likewise, consider a stop codon that mutates to a tryptophan: this is likely to produce a long random peptide chain appended to the end of an otherwise well-formed protein, which is likely to interfere with both folding and function. In either case, these mutations are not likely to be passed on to future generations.

So how does this impact our Markov model? We want to model SNSs that occur only between amino acids, and never between amino acids and stop codons. This can be accomplished easily by removing the row and column that correspond to the stop codons from our previous Markov matrix, and re-normalizing the columns to produce a 20 by 20 Markov matrix relating only the amino acids. The steady state eigenvector of this system is not as trivial as for the 21 by 21 system, and depends on the locations and inter-connections that the stop codons have with the amino acids in the genetic code. In this way, the stop codons have a unique impact on the long term steady state frequencies of the other amino acids within protein coding sequences. This can be thought of as an evolutionary pressure on protein coding sequences toward a sequence with these steady state amino acid frequencies.
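The reduction described above can be sketched as follows. As before, this is only an illustration under stated assumptions: the standard genetic code and a uniform placeholder weighting of 1/9 per single-base change, rather than the biased weights discussed earlier. The helper names column_stochastic and steady_state are hypothetical.

```python
from itertools import product

import numpy as np

BASES = "UCAG"
# Standard genetic code listed in UCAG order (third base varies fastest); '*' marks stop.
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}


def column_stochastic(code, labels):
    """Column-normalized matrix over `labels`, uniform 1/9 weights (assumption)."""
    idx = {a: n for n, a in enumerate(labels)}
    N = np.zeros((len(labels), len(labels)))
    for k, l in product(code, repeat=2):
        if sum(a != b for a, b in zip(k, l)) == 1:   # exactly one base differs
            N[idx[code[k]], idx[code[l]]] += 1.0 / 9.0
    return N / N.sum(axis=0)


def steady_state(M):
    """Eigenvector of M for the eigenvalue closest to 1, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()


labels = sorted(set(AA))                       # '*' plus the 20 amino acids
M21 = column_stochastic(CODE, labels)
v21 = steady_state(M21)

# Drop the stop row and column, then re-normalize each remaining column.
keep = [n for n, a in enumerate(labels) if a != "*"]
labels20 = [labels[n] for n in keep]
M20 = M21[np.ix_(keep, keep)]
M20 = M20 / M20.sum(axis=0)
v20 = steady_state(M20)

for a, f in zip(labels20, v20):
    print(f"{a}: {f:.4f}")
```

With the uniform placeholder weights, the reduced steady state is no longer proportional to codon counts. Tryptophan, whose single codon UGG has two stop codons among its nine neighbours, falls from 1/61 ≈ 0.0164 (its share of the 61 sense codons) to 7/526 ≈ 0.0133. Different substitution weights would shift these numbers, but the dependence on each amino acid's proximity to the stop codons remains.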

In essence, what I am proposing is that the genetic code may have evolved under pressure to provide, in turn, evolutionary pressure on protein coding sequences.

While the importance of relative amino acid frequency has been discussed in the literature (Goodarzi, 2004), to my knowledge no one has considered it as an evolutionary pressure in itself. The literature that I am aware of generally considers the genetic code to have evolved to give rise to the relative amino acid frequencies observed in nature, rather than to have evolved to influence those frequencies. Moreover, so far as I am aware, no one has considered the capacity of stop codons to alter these amino acid frequencies.

The question then becomes: what amino acid frequencies are likely to be beneficial for an organism to be pressured toward? It seems reasonable that a simple organism, such as a bacterium (which has to manufacture all of its own amino acids prior to incorporation into proteins), would prefer to use generally less “costly” amino acids. But the “cost” of an amino acid is a difficult thing to quantify. There is the energetic cost (energy that could be used to produce ATP or do other intracellular work) as well as the materials cost (atoms or chemical groups that could be used elsewhere within the cell). With these as inspiration, one might expect that a simple organism would benefit from a genetic code that pressured its evolving proteins toward preferentially incorporating amino acids with greater “chemical stability” or less material (e.g., number of heavy atoms or total molecular mass). But even “chemical stability” is a tricky thing when comparing differing molecules. For the purposes of this investigation, I propose using two related definitions: the ratio of H_f to either the molecular mass or the number of heavy atoms.

Did the genetic code evolve partially to provide evolutionary pressure on protein coding sequences?