Epigenetic patterns - how genes are regulated

Although all human cells have almost the same DNA sequence, they have completely different functions. This is because not all genes are active in each cell. As the totipotent fertilised ovum develops into heart, muscle or nerve cells, some regions of the genome have to be switched on and others switched off. This regulation of the genes is called epigenetics. The rewriting of DNA into RNA, i.e. transcription, is regulated by enzymes. We investigate how the epigenetic information is transferred to these enzymes. Overactivity of these enzymes plays an important role in many diseases, such as cancer. A better understanding of the epigenetic code could provide us with new approaches for controlling this overactivity and for developing new drugs.

The genome is the “book of life” written in an alphabet with four letters: A as in adenine, G as in guanine, C as in cytosine and T as in thymine – the four bases of the DNA code. Knowing the genes means knowing the person, and with this knowledge illnesses such as cancer, diabetes and Alzheimer’s disease can be conquered. These were the hopes people had when the human genome was deciphered in 2000. There was great astonishment when it was found that humans have only around 23,500 protein-encoding genes: only marginally more than the fruit fly or the earthworm, and significantly fewer than some plants! This means that it is not the number of genes which is decisive for the development of complex organisms. Neither is it solely the size of the genome, which is around three billion bases in humans and thus very large. Humans are more than the sum of their genes. The crucial factor is that gene activity can be specifically regulated by a code.

But how exactly are genes controlled? What does the epigenetic code look like? Two major epigenetic effects are known. The first is a modification of the DNA itself: methyl groups (-CH3) attach to the DNA strand and thus prevent the subsequent gene sequences from being read. The gene is practically switched off. Conversely, genes can be activated. In order to understand this second epigenetic effect, one first has to bear in mind how the DNA is arranged in the cell nucleus. This is not trivial, because the human DNA strand is around two metres long! How on earth does it fit into a nucleus with a diameter of a few micrometres? This is done with the aid of a clever packing technique: the DNA strand is wound around millions of proteins, so-called histones, like thread around a spool. The DNA wound around the histones then resembles a chain of pearls. The DNA is thus compressed by a factor of around 10,000! The resulting DNA-protein complex is called chromatin. In order to activate genes, the genetic material must first be “unpacked” again at the appropriate position. If the histone tails are acetylated (acetyl group: -C(O)CH3), the chromatin structure loosens and it becomes easier to read the DNA.
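The two-metre figure can be checked with a rough calculation. The sketch below assumes a rise of about 0.34 nm per base pair (the textbook value for B-form DNA, not stated in the text) and two copies of the roughly three-billion-base genome per cell; the function and constant names are purely illustrative.

```python
# Rough estimate of the stretched-out DNA length in one human cell.
# Assumptions (not from the article): 0.34 nm rise per base pair and
# two genome copies per (diploid) cell.
BASE_PAIRS_PER_GENOME = 3_000_000_000
RISE_PER_BASE_PAIR_NM = 0.34
GENOME_COPIES_PER_CELL = 2


def dna_length_metres(base_pairs: int, copies: int = GENOME_COPIES_PER_CELL) -> float:
    """Return the total DNA length in metres (1 nm = 1e-9 m)."""
    return base_pairs * copies * RISE_PER_BASE_PAIR_NM * 1e-9


length_m = dna_length_metres(BASE_PAIRS_PER_GENOME)
print(f"Estimated DNA length per cell: {length_m:.2f} m")
```

Under these assumptions the estimate comes out at roughly two metres, in line with the figure quoted above.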

The first step of protein biosynthesis is transcription, i.e. the rewriting of DNA into RNA. The DNA is located in the nucleus, where no proteins can be produced. The genetic code must therefore be transported to the location of protein biosynthesis – the ribosomes. This is done by so-called mRNA (messenger RNA), a complementary copy of a DNA section. It is important that the epigenetic information is not lost during transcription because, after all, the subsequent translation of the RNA into proteins should produce only those proteins that are actually required. Transcription is divided into three steps: initiation, elongation and termination. The RNA polymerase II enzyme is involved in all three steps. During initiation, the RNA polymerase II binds to a promoter region of the DNA strand and “unwinds” the double helix. During elongation, the DNA is transcribed into mRNA: by attaching free ribonucleotides, the RNA polymerase II synthesises an mRNA strand which is complementary to the DNA. When the RNA polymerase II meets a terminator sequence, it stops and the mRNA strand detaches. How does the RNA polymerase II succeed in recording the epigenetic information – methylation of the DNA and acetylation of the histones – and passing it on to the mRNA? It cannot be too complicated, because if antibodies have to be formed in the event of an infection, it should happen quickly. In order to record the epigenetic information of the genome, the RNA polymerase has developed its own platform – the so-called C-terminal domain.
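The complementary copying during elongation can be illustrated with a toy example. This is a minimal sketch, not a model of the enzyme: it assumes the template strand is written 3'→5' so that the result reads 5'→3', ignores initiation and termination, and uses a hypothetical function name and an arbitrary example sequence.

```python
# Toy model of elongation: each DNA base of the template strand is
# paired with its complementary ribonucleotide (uracil replaces thymine).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


print(transcribe("TACGGT"))  # prints AUGCCA
```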

The C-terminal domain (CTD) of RNA polymerase II

Each protein has one amino end (N-terminus) and one carboxy end (C-terminus). At the C-terminus of human RNA polymerase II there is an unusual region which comprises 52 repeats of a heptapeptide (i.e. a peptide with seven amino acids). This is easy to remember, because it corresponds to the number of weeks in a year and the number of days in a week. The sequence which is continually repeated – the so-called consensus sequence – is (Figure 1):

tyrosine – serine – proline – threonine – serine – proline – serine.

Figure 1: The C-terminal domain (CTD) of human RNA polymerase II. Multicellular organisms have a continually repeating amino acid sequence (consensus sequence), which is composed of the Tyr-Ser-Pro-Thr-Ser-Pro-Ser motif (YSPTSPS in the single letter code of the amino acids), at the C-terminus of the largest subunit of the RNA polymerase II. Humans have 52 repeats of this motif. Deviations from the consensus sequence occur mainly in the distal part of the CTD.

The RNA polymerase II of all multicellular organisms and a few unicellular organisms has this structure [1]; its length differs between species, however. In humans, the number of repeats is comparatively high. In the distal region of this domain, there are small deviations from the consensus sequence; the rhythm of seven is maintained, however. The simplest model organism for a more highly developed life form, baker’s yeast, has precisely 26 heptad repeat sequences, i.e. half as many as humans. It is not yet clear whether there is a system behind these numbers.
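The repeat structure can be written out explicitly. The sketch below builds an idealised CTD from the consensus heptad alone; it deliberately ignores the deviations from the consensus sequence mentioned above, so it is not the real protein sequence, and the names used are illustrative.

```python
# Idealised CTD: the consensus heptad repeated n times.
# Real CTDs deviate from the consensus in some repeats; this sketch does not.
CONSENSUS_HEPTAD = "YSPTSPS"  # Tyr-Ser-Pro-Thr-Ser-Pro-Ser
REPEATS = {"human": 52, "baker's yeast": 26}  # repeat numbers given in the text


def idealised_ctd(repeats: int) -> str:
    """Return a consensus-only CTD with the given number of heptad repeats."""
    return CONSENSUS_HEPTAD * repeats


for organism, n in REPEATS.items():
    ctd = idealised_ctd(n)
    print(f"{organism}: {n} repeats, {len(ctd)} amino acids")
```

With seven residues per repeat, the idealised human CTD is 364 amino acids long and the yeast CTD 182.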

The heptad repeat sequence is the blackboard on which the epigenetic code is written. But in which language? The amino acids are modified. Each serine, threonine or tyrosine in the CTD can be phosphorylated. For humans alone this adds up to 326 possible phosphorylation sites, which can be occupied in all possible combinations. In addition to phosphorylation, there are further possible modifications: glycosylation, acetylation, and an isomerisation of the prolines. The number of possible modification patterns – and thus the quantity of information which can be written on the CTD – is large beyond description! The modifications are all reversible, i.e. the blackboard can be wiped clean again at any time. The programme for gene expression – i.e. the question as to which proteins are really formed in the cell – is uniquely determined by the modification pattern of the CTD. But who or what actually writes on the blackboard?
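To get a feeling for how large this number is, one can count the patterns for phosphorylation alone. The sketch below makes a simplifying assumption that is not in the text: each site is either phosphorylated or not, independently of all others, so n sites give 2^n patterns. It takes the number of sites quoted above as its input and ignores the other modifications.

```python
# Count on/off phosphorylation patterns, assuming independent binary sites.
def pattern_count(sites: int) -> int:
    """Number of distinct phosphorylation patterns for `sites` binary sites."""
    return 2 ** sites


HUMAN_CTD_SITES = 326  # figure quoted in the text
patterns = pattern_count(HUMAN_CTD_SITES)
print(f"The pattern count has {len(str(patterns))} decimal digits")
```

Under this assumption the count is a number with 99 decimal digits, i.e. of the order of 10^98, before glycosylation, acetylation or proline isomerisation are even considered.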

Transcription kinases write a code on the CTD

We are concerned mainly with the “writing enzymes”, which put phosphorylation patterns onto the CTD – the so-called kinases. Four of the seven amino acids are serines and threonines, which represent the typical target of the kinases. Kinases have a binding pocket for adenosine triphosphate (ATP), which is split up into adenosine diphosphate (ADP) and a free phosphate group in the phosphorylation reaction. The phosphate group is transferred to the recipient protein – in our case the CTD.

Kinases regulate many processes in the human metabolism. They are thus some of the most important target molecules for the development of new drugs. Five kinases which regulate transcription have so far been found. They each form a complex with a second protein, a cyclin. These kinases are therefore also called cyclin-dependent kinases (Cdk). We are interested in how kinases guide the transcription from the initiation to the elongation, and how precisely they phosphorylate the CTD of the RNA polymerase II in this process.

The first step in the initiation – the start of the transcription – is the phosphorylation of the serine residues at position 7 of the consensus sequence. The RNA polymerase II then takes a break – which is surprising at first glance! It is possible that a further check is then made as to whether the gene really is to be read. The serine residues at position 5 of the consensus sequence are then phosphorylated. This gives the signal for the elongation of the transcription; the RNA copies of the DNA are produced. Before the mRNA can get to the ribosomes, the immature mRNA needs to be spliced: some sections of the mRNA are removed and the remaining ones are reconnected with each other. How exactly this takes place is again controlled by a kinase, which phosphorylates the serine residues at position 2 of the consensus sequence. This epigenetic code can be very quickly transferred to the CTD by the “writing enzymes” – the kinases. The epigenetic code of the DNA – methylation of the strand and acetylation of the histones – is transiently “translated” into an easily readable epigenetic code: a phosphorylation motif on the CTD of the RNA polymerase II. The corresponding “reading enzymes” are then recruited and the information is used to splice the mRNA [2]. The code can be quickly deleted again, however, so that the RNA polymerase II is available for new transcription processes.
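The sequence of marks described above can be summarised in a small data model. This is only an illustrative sketch: the class and its methods are hypothetical, a single heptad stands in for the whole CTD, and it assumes that marks accumulate until they are erased, which the text does not state explicitly.

```python
from dataclasses import dataclass, field

# Serine position in the consensus heptad -> process it is linked to in the text.
STAGE_BY_SERINE = {7: "initiation", 5: "elongation", 2: "splicing"}


@dataclass
class Heptad:
    """One consensus repeat with a set of phosphorylated serine positions."""
    phosphorylated: set = field(default_factory=set)

    def phosphorylate(self, position: int) -> str:
        """Write a mark and return the process associated with it."""
        if position not in STAGE_BY_SERINE:
            raise ValueError(f"position {position} is not Ser2, Ser5 or Ser7")
        self.phosphorylated.add(position)
        return STAGE_BY_SERINE[position]

    def erase(self) -> None:
        """Remove all marks so the polymerase can start a new round."""
        self.phosphorylated.clear()


heptad = Heptad()
for serine in (7, 5, 2):  # order given in the text
    print(f"Ser{serine} phosphorylated: {heptad.phosphorylate(serine)}")
heptad.erase()
```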

The crystal structure of Cdk12/cyclin K

We are mainly interested in the two cyclin-dependent kinases Cdk9 and Cdk12 and the complexes which both kinases form with their respective cyclin. We have crystallised the complex of Cdk12 and cyclin K and were able to determine the three-dimensional structure of the two proteins with the aid of X-ray diffraction (Figure 2) [3].

Figure 2: X-ray crystal structure of the Cdk12/cyclin K complex. The Cdk12 kinase forms a protein complex with cyclin K which allows it to phosphorylate the CTD of the RNA polymerase II. The active centre of the kinase contains ATP, which is split up into adenosine diphosphate (ADP) and a free phosphate group in the phosphorylation reaction.

In this structure we identified a tail at the C-terminal end of the kinase which contains a motif of two amino acids (Figure 3). A detailed analysis showed that other kinases which are involved in the regulation of the transcription also contain this motif. In order to investigate its function, we successively shortened the protein sequence of the kinase and then measured its activity. It turned out that the motif is essential for the activity of the kinase: a fragment of Cdk12 shortened by 20 amino acids exhibits an activity which is reduced by a factor of five. We have meanwhile also been able to confirm that the corresponding motif is crucial for the function of a second kinase, Cdk9.

Figure 3: Structure of the C-terminal tail in Cdk12. The C-terminal tail in Cdk12 is in contact with the ADP nucleotide via several water molecules. The connecting hydrogen bonds are shown in the diagram. The magnesium ions (Mg2+) coordinate the nucleotide.

Cdk12 produces a specific phosphorylation motif

It is still unclear which cyclin-dependent kinases are involved in which phosphorylation step. We investigated which amino acids within the CTD are phosphorylated by Cdk12. It turned out that the Cdk12 kinase is particularly active when serine is already phosphorylated at position 7 of the CTD consensus sequence. As has been described above, the phosphorylation of the serine at position 7 is a special characteristic of the initiation. Cdk12 is thus predestined to trigger the transition from the initiation into the elongation. A detailed analysis showed that Cdk12 really does phosphorylate the serine at position 5 and thus gives the signal for elongation (Figure 4). Cdk12 is incapable of phosphorylating a CTD peptide if this contains a lysine instead of a serine at position 7, however. The phosphorylation motif at positions 5 and 7 of the consensus sequence is specific to Cdk12; it has not yet been possible to observe it in any other cyclin-dependent kinase.

Figure 4: Cdk12 produces a specific phosphorylation motif on the CTD. The Cdk12 kinase primarily phosphorylates a CTD substrate in which the serine at position 7 of the consensus sequence is already phosphorylated. In the process, Cdk12 transfers a phosphate group to the serine at position 5 (left). Conversely, Cdk12 is not able to phosphorylate the CTD if serine 7 is replaced by lysine (K7 in the single letter code) (right). This produces a specific phosphorylation motif, which is also called the CTD code.
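The substrate preference described here can be written down as a simple qualitative rule. The sketch below is an illustration only: the function name and the labels "high", "lower" and "none" are hypothetical, and the "lower" case for an unphosphorylated serine 7 is inferred from the wording "particularly active" rather than taken from a measured value.

```python
# Qualitative rule for Cdk12 acting on serine 5 of a CTD heptad,
# depending on what sits at position 7 (single letter code).
def cdk12_activity_on_ser5(residue_7: str, residue_7_phosphorylated: bool = False) -> str:
    """Return a qualitative activity label: 'high', 'lower' or 'none'."""
    if residue_7 == "K":
        return "none"  # lysine at position 7: no phosphorylation observed
    if residue_7 == "S":
        # Pre-phosphorylated serine 7 is the preferred substrate;
        # 'lower' for the unphosphorylated case is an assumption.
        return "high" if residue_7_phosphorylated else "lower"
    raise ValueError(f"residue {residue_7!r} at position 7 is not covered by the text")


print(cdk12_activity_on_ser5("S", residue_7_phosphorylated=True))
print(cdk12_activity_on_ser5("K"))
```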

Mutations in Cdk12 are associated with cancers

The misregulation of transcription is being increasingly recognised as the cause of many diseases. Changes in the Cdk12 gene could mean a predisposition towards certain diseases. Mutations in Cdk12 have been identified in lung, breast and ovarian carcinoma and in melanoma [4, 5]. If the structure and mechanism of action of the enzyme are known, it is possible to look for specific inhibitors. With our description of the three-dimensional structure of Cdk12 we lay the foundations for the targeted development of new drugs.


Bösken, C.A., Farnung, L., Hintermair, C., Merzel Schachter, M., Vogel-Bachmayr, K., Blazek, D., Anand, K., Fisher, R.P., Eick, D. & Geyer, M. (2014) “The structure and substrate specificity of human Cdk12/Cyclin K” Nature Communications 5:3505