An Exhaustive Search for Extensive Chromosomal Regions Duplicated within the Human Genome.

Tadashi Imanishi, Toshinori Endo and Takashi Gojobori.

Center for Information Biology, National Institute of Genetics, Mishima, Japan.

Click here to contact the authors (TI).

Abstract

To find duplicated chromosomal regions within the human genome, we examined the distribution of homologous genes on human chromosomes. By conducting similarity searches among all the human protein-coding genes in a public database whose chromosomal locations have been assigned, we found numerous homologous gene pairs across all 22 autosomes and the two sex chromosomes. Furthermore, we found a total of 128 pairs of chromosomal bands on which three or more pairs of homologous genes are located. These findings suggest either that extensive chromosomal regions have been duplicated frequently in the past or that the entire genome of an ancestral vertebrate was duplicated several hundred million years ago. Further analyses using the complete genomic sequences of human chromosomes and comparative analyses of vertebrate genomes should reveal the evolutionary processes of chromosomal or genomic duplication in the course of human evolution.


Introduction

To elucidate how the human genome has evolved and how it is organized to function, we are systematically analyzing all available sequence data for human genes.
Traces of ancient duplications of extensive chromosomal regions are being discovered within the human genome. For example, the human MHC (major histocompatibility complex) gene region on chromosome 6p21.3, which extends over at least a few megabases, encodes a series of genes that are homologous to those on chromosome 9q33-q34 (Fig. 1) (Endo et al., 1997; Kasahara et al., 1997). By constructing phylogenetic trees of the homologous genes in these regions, we found that regional duplication took place at least twice: once around the time when vertebrates emerged and once around the time when eukaryotes emerged (Endo et al., 1997).
Are duplications of long chromosomal regions found throughout the human genome? To answer this question, we computationally examined the distribution of homologous genes on human chromosomes using DNA databases.

Method (summarized in Table 1)

DNA sequences of human genes were extracted from a public nucleotide sequence database, DDBJ/EMBL/GenBank (DDBJ release 27; Oct. 1996). In the human division of DDBJ, there were 5,906 human DNA sequences that had been mapped to chromosomal bands. These DNA sequences encoded a total of 6,315 proteins. To remove "redundant" data such as polymorphic alleles, tandemly duplicated genes, and duplicate submissions, highly similar sequences on the same chromosomal band were grouped together. In the end, all the human protein sequences were classified into 3,326 "non-redundant" protein groups.
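The grouping step above can be sketched as follows. This is only an illustrative sketch, not the procedure actually used: the text does not state the similarity threshold or the clustering method, so single-linkage merging with a union-find structure is an assumption, and `group_redundant`, `is_highly_similar`, and the toy protein names are hypothetical.

```python
from collections import defaultdict

def group_redundant(proteins, is_highly_similar):
    """Group proteins that lie on the same band and are highly similar.

    proteins: list of (protein_id, band) tuples.
    is_highly_similar: callable(id_a, id_b) -> bool (hypothetical predicate;
    the threshold for "highly similar" is not given in the text).
    Returns a list of sets of protein ids (single-linkage groups).
    """
    parent = {pid: pid for pid, _ in proteins}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_band = defaultdict(list)
    for pid, band in proteins:
        by_band[band].append(pid)

    # Only proteins on the same chromosomal band are candidates for merging.
    for ids in by_band.values():
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if is_highly_similar(a, b):
                    parent[find(a)] = find(b)

    groups = defaultdict(set)
    for pid in parent:
        groups[find(pid)].add(pid)
    return list(groups.values())

# Toy data: A1 and A2 are similar and share a band; C is similar to A1 but
# lies on another band, so it stays in its own group.
toy_proteins = [("A1", "6p21.3"), ("A2", "6p21.3"), ("B", "6p21.3"), ("C", "1q23")]
toy_similar = {frozenset(("A1", "A2")), frozenset(("A1", "C"))}
toy_groups = group_redundant(
    toy_proteins, lambda a, b: frozenset((a, b)) in toy_similar
)
print(sorted(sorted(g) for g in toy_groups))
```

Restricting the comparison to proteins on the same band keeps genuinely dispersed homologs (the signal of interest) from being collapsed as redundant.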
We then conducted pair-wise similarity searches using the FASTA programs (version 2.0u6) (Pearson, 1996; Pearson and Lipman, 1988). The protein sequences were aligned using Smith and Waterman's algorithm (Smith and Waterman, 1981), and the similarity scores of the alignments were calculated using the BLOSUM50 scoring matrix. A pair of protein sequences was assumed to be homologous only when the similarity score was larger than 200 and the alignment was longer than 100 amino acids. These criteria correspond to about 25% sequence identity for proteins that are 200 amino acids long.
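The homology criterion above amounts to a simple filter on the search results. The sketch below assumes each hit has already been parsed into a record with a score and an alignment length; the `Hit` record, its field names, and the toy values are hypothetical and do not reflect the FASTA output format.

```python
from typing import NamedTuple

MIN_SCORE = 200             # similarity score must be larger than this
MIN_ALIGNMENT_LENGTH = 100  # alignment must be longer than this (amino acids)

class Hit(NamedTuple):
    """One pair-wise search result (hypothetical record layout)."""
    query: str
    subject: str
    score: int
    alignment_length: int

def is_homologous(hit):
    # Both thresholds are strict, following "larger than" and "longer than".
    return hit.score > MIN_SCORE and hit.alignment_length > MIN_ALIGNMENT_LENGTH

# Toy hits illustrating the boundary behaviour.
toy_hits = [
    Hit("p1", "p2", 350, 180),  # passes both thresholds
    Hit("p1", "p3", 200, 180),  # score is not larger than 200
    Hit("p2", "p4", 350, 100),  # alignment is not longer than 100
]
homologous_hits = [h for h in toy_hits if is_homologous(h)]
print([(h.query, h.subject) for h in homologous_hits])
```

Requiring both a minimum score and a minimum alignment length guards against short, high-scoring matches that cover only a small part of either protein.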

Results

A considerable number of homologous protein pairs exist across all 22 autosomes and the two sex chromosomes. Two such examples are shown in Figure 2 and Figure 3. Interestingly, many homologous genes are concentrated on certain chromosomal bands. Such bands may mark the sites where extensive chromosomal regions were duplicated.
To detect duplicated chromosomal regions between chromosomes, we looked for pairs of chromosomal bands on which three or more pairs of homologous genes exist. We discovered a total of 128 instances in which three or more pairs of homologous genes are located on a given pair of chromosomal bands (Table 2). Among them, 10 pairs are between autosomes and the X chromosome. Some regions have more than one duplicated counterpart. For example, the MHC region (6p21.3) has 11 homologous regions (1q23, 3p21, 5q21, 7q22, 9q33-q34, 11q21-q23, 17q21, 19q13, 20q13, 21q22 and 22q11). Furthermore, four chromosomal regions, 10q24, 17q21, 19q13, and 22q13, seem to be homologous to each other.
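The band-pair search described above can be sketched as a counting step. This is a simplified illustration under stated assumptions: the function name, the input layout, and the toy gene names are hypothetical; pairs on the same band are skipped; and each homologous pair is counted once, without checking whether one gene takes part in several pairs.

```python
from collections import Counter

def duplicated_band_pairs(homologous_pairs, band_of, min_pairs=3):
    """Return band pairs that carry at least `min_pairs` homologous gene pairs.

    homologous_pairs: iterable of (gene_a, gene_b) tuples.
    band_of: dict mapping gene id -> chromosomal band string.
    """
    counts = Counter()
    for gene_a, gene_b in homologous_pairs:
        band_a, band_b = band_of[gene_a], band_of[gene_b]
        if band_a == band_b:
            continue  # assumption: ignore pairs within a single band
        # Sort so that (A, B) and (B, A) are counted as the same band pair.
        counts[tuple(sorted((band_a, band_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_pairs}

# Toy data: three homologous pairs link 6p21.3 and 9q33-q34; one links 6p21.3 and 1q23.
toy_band_of = {
    "g1": "6p21.3", "g2": "6p21.3", "g3": "6p21.3",
    "h1": "9q33-q34", "h2": "9q33-q34", "h3": "9q33-q34",
    "k1": "1q23",
}
toy_pairs = [("g1", "h1"), ("g2", "h2"), ("h3", "g3"), ("g1", "k1")]
toy_result = duplicated_band_pairs(toy_pairs, toy_band_of)
print(toy_result)
```

In the toy data only the 6p21.3 and 9q33-q34 pair reaches the threshold of three, so it is the only band pair reported.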
Figure 4 shows, for all genes, the distribution of the number of homologous genes located on different chromosomes. Of the 3,326 non-redundant genes, 2,286 (about two thirds) have no homologous genes on other chromosomes. The remainder (about one third) have one or more homologous genes. Genes with an extremely large number of homologous genes seem to have multiple domains.
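A distribution of this kind could be tabulated as in the sketch below. It is an illustration only: the helper names and the toy data are hypothetical, and the chromosome is taken to be the part of the band string before the first arm letter (p or q), which is an assumption about how bands are written.

```python
from collections import Counter

def chromosome_of(band):
    """Extract the chromosome from a band string, e.g. "6p21.3" -> "6", "Xq28" -> "X"."""
    for i, ch in enumerate(band):
        if ch in "pq":
            return band[:i]
    return band  # no arm letter found; treat the whole string as the chromosome

def homolog_count_histogram(genes, homologous_pairs, band_of):
    """Map "number of homologs on other chromosomes" -> number of genes."""
    n_homologs = {gene: 0 for gene in genes}
    for gene_a, gene_b in homologous_pairs:
        if chromosome_of(band_of[gene_a]) != chromosome_of(band_of[gene_b]):
            n_homologs[gene_a] += 1
            n_homologs[gene_b] += 1
    return Counter(n_homologs.values())

# Toy data: g1, h1 and k1 are mutually homologous across three chromosomes;
# g2 is homologous to g1 but lies on the same chromosome; x1 has no homologs.
toy_bands = {"g1": "6p21.3", "g2": "6p21.3", "h1": "9q33-q34", "k1": "1q23", "x1": "Xq28"}
toy_homologs = [("g1", "h1"), ("g1", "k1"), ("h1", "k1"), ("g1", "g2")]
toy_histogram = homolog_count_histogram(list(toy_bands), toy_homologs, toy_bands)
print(sorted(toy_histogram.items()))
```

In the toy data two genes have no homologs on other chromosomes and three genes have two each. The reported figures can be checked the same way: 2,286 of 3,326 is about 0.69, consistent with "about two thirds".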
We are constructing a database of duplicated human chromosomal regions, which is now partially open to the public via the World Wide Web. The URL is "http://www.cib.nig.ac.jp/dda/timanish/dup.html" (Fig. 5).

Discussion

Most of the duplicated chromosomal regions discovered in this study might be the result of regional, chromosomal, or genomic duplications in the past. These regions might have been duplicated directly or indirectly. It is now clear that the duplication of such extensive regions is an important mechanism of genomic evolution. We are now analyzing the time, length, and order of each duplication.
Ohno (1970) hypothesized that vertebrates arose after successive rounds of genome tetraploidization (the doubling of the entire genome). Our finding that so many duplicated chromosomal regions exist within the human genome might reflect tetraploidization around that period. At the moment, however, there is no clear-cut evidence supporting the tetraploidization hypothesis (the histogram in Fig. 4 should have a peak at 3 if tetraploidization took place twice, but no peak was obvious).
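The expected peak at 3 follows from simple counting: each genome doubling doubles the copy number of every gene, so two doublings give four copies, that is, three homologs per gene. The sketch below makes this explicit under the idealized assumptions that no copies are lost afterwards and that all copies end up on different chromosomes; the function name is hypothetical.

```python
def expected_homologs(rounds):
    """Homologs per gene after `rounds` whole-genome doublings, assuming no gene loss."""
    copies = 2 ** rounds  # each doubling doubles the copy number
    return copies - 1     # every copy other than the gene itself is a homolog

for rounds in (0, 1, 2):
    print(rounds, expected_homologs(rounds))
```

Gene loss after duplication would shift genes toward lower counts, which is one reason a real histogram need not show a sharp peak even if tetraploidization occurred.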
Knowledge of duplicated chromosomal regions is useful for predicting the existence of new genes, because duplicated regions should contain a series of homologous genes. After chromosomal duplication, the duplicated genes might have differentiated functionally. Thus, information on duplicated chromosomal regions is expected to be an important resource for elucidating the evolution of various biological functions. (Table 3)

References

Endo T, Imanishi T, Gojobori T and Inoko H (1997) Evolutionary significance of intra-genome duplications on human chromosomes. Gene 205:19-27.
Goffeau A, Barrell BG, Bussey H, et al. (1996) Life with 6000 genes. Science 274:546-567.
Kasahara M, Hayashi M, Tanaka K, et al. (1997) Chromosomal localization of the proteasome Z subunit gene reveals an ancient chromosomal duplication involving the major histocompatibility complex. Proc Natl Acad Sci U S A 93:9096-9101.
Lundin LG (1993) Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics 16:1-19.
Ohno S (1970) Evolution by gene duplication. Springer-Verlag, New York.
Pearson WR (1996) Effective protein sequence comparison. Methods Enzymol 235:227-258.
Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85:2444-2448.
Smith TF and Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195-197.
Wolfe K and Shields D (1996) URL: "http://acer.gen.tcd.ie/~khwolfe/yeast/topmenu.html".