KEY CONCEPTS:
- Only 1% of the human genome consists of coding regions.
- The exons comprise ~5% of each gene, so genes (exons plus introns) comprise ~25% of the genome.
- The human genome has 30,000-40,000 genes.
- ~60% of human genes are alternatively spliced.
- Up to 80% of the alternative splices change protein sequence, so the proteome has ~50,000-60,000 members.
The human genome was the first vertebrate genome to be sequenced (Venter et al., 2001; International Human Genome Sequencing Consortium, 2001). This massive task has revealed a wealth of information about the genetic makeup of our species, and about the evolution of the genome in general. (Methods used for genome sequencing are reviewed in 32.12 Genome mapping.) Our understanding is deepened further by the ability to compare the human genome sequence with the more recently sequenced mouse genome (Waterston et al., 2002).
Mammal and rodent genomes generally fall into a narrow size range, ~3 × 10^9 bp (see 3.5 Why are genomes so large?). The mouse genome is ~14% smaller than the human genome, probably because it has had a higher rate of deletion. The genomes contain similar gene families and genes, with most genes having an ortholog in the other genome, but with differences in the number of members of a family, especially in those cases where the functions are specific to the species (see 3.10 The conservation of genome organization helps to identify genes). The estimate of 30,000 genes for the mouse genome is at the lower end of the range of estimates for the human genome. Figure 3.20 plots the distribution of the mouse genes. The 30,000 protein-coding genes are accompanied by ~4000 pseudogenes. There are ~800 genes representing RNAs that do not code for proteins; these are generally small (aside from the rRNAs). Almost half of these genes code for tRNAs, for which a large number of pseudogenes also have been identified.
The human (haploid) genome contains 22 autosomes plus the X or Y. The chromosomes range in size from 45-279 Mb of DNA, making a total genome content of 3,286 Mb (~3.3 × 10^9 bp). On the basis of chromosome structure, the overall genome can be divided into regions of euchromatin (potentially containing active genes) and heterochromatin (see 19.7 Chromatin is divided into euchromatin and heterochromatin). The euchromatin comprises the majority of the genome, ~2.9 × 10^9 bp. The identified genome sequence represents ~90% of the euchromatin. In addition to providing information on the genetic content of the genome, the sequence also identifies features that may be of structural importance (see 19.8 Chromosomes have banding patterns).
Figure 3.21 shows that a tiny proportion (~1%) of the human genome is accounted for by the exons that actually code for proteins. The introns that constitute the remaining sequences in the genes bring the total of DNA concerned with producing proteins to ~25%. As shown in Figure 3.22, the average human gene is 27 kb long, with 9 exons that include a total coding sequence of 1,340 bp. The average coding sequence is therefore only 5% of the length of the gene.
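The arithmetic behind the ~5% figure can be checked directly from the average values quoted above. The following is a minimal sketch, not part of the original analysis; the function and constant names are hypothetical and chosen only for illustration.

```python
# Average human gene figures quoted in the text (Figure 3.22).
AVG_GENE_LENGTH_BP = 27_000  # 27 kb
AVG_EXON_COUNT = 9
AVG_CODING_BP = 1_340


def coding_fraction(gene_length_bp: int, coding_bp: int) -> float:
    """Return the fraction of a gene's length occupied by coding sequence."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be positive")
    return coding_bp / gene_length_bp


def mean_coding_bp_per_exon(coding_bp: int, exon_count: int) -> float:
    """Return the average number of coding base pairs per exon."""
    if exon_count <= 0:
        raise ValueError("exon count must be positive")
    return coding_bp / exon_count


if __name__ == "__main__":
    fraction = coding_fraction(AVG_GENE_LENGTH_BP, AVG_CODING_BP)
    per_exon = mean_coding_bp_per_exon(AVG_CODING_BP, AVG_EXON_COUNT)
    print(f"Coding fraction of the average gene: {fraction:.1%}")
    print(f"Mean coding sequence per exon: {per_exon:.0f} bp")
```

Dividing 1,340 bp by 27,000 bp gives a value just under 0.05, which matches the ~5% stated in the text. The per-exon average is a derived quantity, not one reported in the source.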
Based on comparisons with other species and with known protein-coding genes, there are ~24,000 clearly identifiable genes. Sequence analysis identifies ~12,000 more potential genes. Two independent analyses have produced estimates of ~30,000 and ~40,000 genes, respectively (Venter et al., 2001; International Human Genome Sequencing Consortium, 2001). One measure of the accuracy of the analyses is whether they identify the same genes. The surprising answer is that the overlap between the two sets of genes is only ~50%, as summarized in Figure 3.23 (Hogenesch et al., 2001). An earlier analysis of the human gene set based on RNA transcripts had identified ~11,000 genes, almost all of which are present in both the large human gene sets, and which account for the major part of the overlap between them. So there is no question about the authenticity of half of each human gene set, but we have yet to establish the relationship between the other half of each set. The discrepancies illustrate the pitfalls of large-scale sequence analysis! As the sequence is analyzed further (and as other genomes are sequenced with which it can be compared), the number of valid genes seems to decline, and is now generally thought to be ~30,000.
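To make the idea of "overlap between two gene sets" concrete, the sketch below measures what fraction of each predicted set is shared with the other. It is a toy illustration only: the gene identifiers are hypothetical placeholders, and the source does not say exactly how the ~50% figure in Figure 3.23 was computed.

```python
def overlap_fractions(set_a: set[str], set_b: set[str]) -> tuple[float, float]:
    """Return the fraction of set_a and of set_b that is shared by both sets."""
    if not set_a or not set_b:
        raise ValueError("both gene sets must be non-empty")
    shared = set_a & set_b
    return len(shared) / len(set_a), len(shared) / len(set_b)


if __name__ == "__main__":
    # Hypothetical gene identifiers, not real annotations.
    prediction_a = {"gene1", "gene2", "gene3", "gene4"}
    prediction_b = {"gene3", "gene4", "gene5", "gene6"}

    frac_a, frac_b = overlap_fractions(prediction_a, prediction_b)
    print(f"Shared fraction of set A: {frac_a:.0%}")
    print(f"Shared fraction of set B: {frac_b:.0%}")
```

Reporting the shared fraction separately for each set matters when the sets differ in size, as the ~30,000 and ~40,000 estimates do: the same shared genes are a larger fraction of the smaller set.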
By any measure, the total human gene number is much less than we had expected: most previous estimates had been ~100,000. It shows a relatively small increase over flies and worms (13,600 and 18,500, respectively), not to mention the plant Arabidopsis (25,000) (see Figure 3.9). However, we should not be particularly surprised by the notion that it does not take a great number of additional genes to make a more complex organism. The difference in DNA sequences between man and chimpanzee is extremely small (there is >99% similarity), so it is clear that the functions and interactions between a similar set of genes can produce very different results. The functions of specific groups of genes may be especially important, because detailed comparisons of orthologous genes in man and chimpanzee suggest that there has been accelerated evolution of certain classes of genes, including some involved in early development, olfaction, and hearing, all functions that are relatively specific for the species (Clark et al., 2003).
The number of genes is less than the number of potential proteins because of alternative splicing. The extent of alternative splicing is greater in Man than in flies or worms; it may affect as many as 60% of the genes, so the increase in size of the human proteome relative to the other eukaryotes may be larger than the increase in the number of genes. A sample of genes from two chromosomes suggests that the proportion of the alternative splices that actually result in changes in the protein sequence may be as high as 80%. This could increase the size of the proteome to 50,000-60,000 members.
In terms of the number of distinct gene families, however, the discrepancy between Man and the other eukaryotes may not be so great. Many of the human genes belong to families. An analysis of ~25,000 genes identified 3,500 unique genes and 10,300 gene pairs. As can be seen from Figure 3.15, this extrapolates to a number of gene families only slightly larger than in worm or fly.