Predicting regulatory elements in repetitive sequences using transcription factor binding sites
Supported by the National Science Council of the Republic of China under Contract No. NSC 89-2213-E-008-061.
Recently, the International Human Genome Project announced a working draft of the initial sequence of the human genome, and the complete genetic blueprint for human beings is forthcoming. This provides an enormous amount of information for studying how the human genome as a whole is organized and how it functions. However, extracting knowledge from this information may be an even more challenging task than sequencing the genome. Data mining and machine learning techniques will probably play an essential role in knowledge extraction by finding interesting, statistically unexpected patterns and thus giving biologists leads for further investigation.
Repetitive sequences are the most abundant sequences in the extragenic regions of genomes. Biologists have already found a large number of regulatory elements in these regions. These elements may profoundly affect chromatin structure formation in the nucleus and may also hold important clues for the study of genetic evolution and phylogeny.
Each gene in a eukaryotic genome has a particular combination of transcription factor binding sites that activate or repress its transcription. These sites are usually specific DNA sequences about five to twenty-five nucleotides long, and they are arrayed within several hundred base pairs, predominantly upstream of the transcription initiation site in the promoter region (Brazma et al. 1997). Many transcription factor binding sites have been collected in databases. TRANSFAC (Heinemeyer et al. 1998; Heinemeyer et al. 1999) is the most complete and well-maintained database of transcription factor binding sites. Notably, transcription factor binding sites can be described by consensus patterns or by nucleotide distribution matrices.
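The two descriptions of a binding site mentioned above can be illustrated with a minimal sketch. It assumes consensus patterns are written with IUPAC nucleotide codes and that a nucleotide distribution matrix is turned into log-odds scores against a uniform background. The function names, the toy aligned sites, and the example sequence are hypothetical and are not taken from TRANSFAC or from this study.

```python
import math
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_consensus_sites(sequence, consensus):
    """Return 0-based start positions where the IUPAC consensus matches."""
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead is used so that overlapping sites are reported as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


def count_matrix(sites):
    """Count nucleotides per column of equally long, aligned sites."""
    width = len(sites[0])
    return [{b: sum(s[j] == b for s in sites) for b in "ACGT"} for j in range(width)]


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-column counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return matrix


def best_matrix_hit(sequence, matrix):
    """Return (score, position) of the best-scoring window, or None."""
    seq = sequence.upper()
    width = len(matrix)
    best = None
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[j][b] for j, b in enumerate(window))
        if best is None or score > best[0]:
            best = (score, i)
    return best


# Hypothetical toy data, for illustration only.
sequence = "GGTATAAAGGCTATATAG"
consensus_hits = find_consensus_sites(sequence, "TATAWA")
matrix = log_odds_matrix(count_matrix(["TATAAA", "TATATA", "TATAAA", "TTTAAA"]))
matrix_hit = best_matrix_hit(sequence, matrix)
```

A consensus pattern gives a yes/no answer per position, whereas a matrix yields a graded score, so a threshold on the score would be needed to turn matrix hits into predicted sites.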
This study first identifies the combinations of transcription factor binding sites in repetitive sequences. Data mining techniques are then applied to mine associations from the combinations of transcription factor binding sites that occur in repeat sequences. Such techniques can produce an enormous number of associations, which makes it extremely difficult for a human user to identify the useful or interesting ones. The mined associations are therefore pruned to remove insignificant ones, leaving a set of useful associations. In addition, the discovered associations are used to partially classify the repeat sequences in our repeat database.
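The mining and pruning steps can be sketched as follows, assuming each repeat sequence is reduced to the set of binding sites found in it and that associations are scored by the usual support and confidence measures. The pruning rule shown (drop a rule when a more general rule with the same consequent is at least as confident) is one simple heuristic chosen for illustration, not necessarily the criterion used in this study. The site labels and the thresholds are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets of up to max_size items."""
    n = len(transactions)
    counts = {}
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] = counts.get(combo, 0) + 1
    return {frozenset(k): c / n for k, c in counts.items() if c / n >= min_support}


def association_rules(itemsets, min_confidence):
    """Derive rules (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent in sorted(itemset):
            antecedent = itemset - {consequent}
            # Every subset of a frequent itemset is frequent, so this lookup succeeds.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules


def prune_redundant(rules):
    """Drop a rule if a more general rule predicts the same site equally well."""
    kept = []
    for antecedent, consequent, support, confidence in rules:
        redundant = any(
            other_cons == consequent
            and other_ant < antecedent  # proper subset: a more general rule
            and other_conf >= confidence
            for other_ant, other_cons, _, other_conf in rules
        )
        if not redundant:
            kept.append((antecedent, consequent, support, confidence))
    return kept


# Hypothetical toy data: binding sites detected in five repeat sequences.
repeat_sites = [
    {"SP1", "AP1", "GATA1"},
    {"SP1", "AP1"},
    {"SP1", "AP1", "GATA1"},
    {"GATA1", "CREB"},
    {"SP1", "AP1", "CREB"},
]
itemsets = frequent_itemsets(repeat_sites, min_support=0.4)
rules = association_rules(itemsets, min_confidence=0.8)
pruned = prune_redundant(rules)
```

Enumerating every combination is only practical for small site sets; a level-wise search such as Apriori would be the natural replacement on real data.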
[Figure: Steps of the proposed approach.]
Our experimental genome sequences include those of C. elegans, human chromosome 22, yeast, and several bacteria. The mined rules can be used to find genes in complete genomes as well as to partially cluster the repetitive sequences in a repeat database.
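How mined rules might partially cluster repeats can be sketched as below, under the assumption that a repeat belongs to a rule's group when it contains every binding site named in the rule. Repeats that satisfy no rule stay unassigned, which is what makes the clustering partial. The rule, the repeat identifiers, and the site labels are hypothetical.

```python
from collections import defaultdict


def group_by_rules(repeat_sites, rules):
    """Group repeat identifiers by the rules whose sites they all contain."""
    groups = defaultdict(list)
    unassigned = []
    for repeat_id, sites in repeat_sites.items():
        matched = False
        for antecedent, consequent in rules:
            # A repeat satisfies a rule if it holds the antecedent and the consequent.
            if antecedent | {consequent} <= sites:
                groups[(antecedent, consequent)].append(repeat_id)
                matched = True
        if not matched:
            unassigned.append(repeat_id)
    return dict(groups), unassigned


# Hypothetical toy data, for illustration only.
rules = [(frozenset({"SP1"}), "AP1")]
repeat_sites = {
    "r1": {"SP1", "AP1", "GATA1"},
    "r2": {"SP1", "AP1"},
    "r3": {"GATA1", "CREB"},
}
groups, unassigned = group_by_rules(repeat_sites, rules)
```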
Brazma, A., Vilo, J., Ukkonen, E. and Valtonen, K. (1997). Data mining for regulatory elements in yeast genome. In: 5th International Conference on Intelligent Systems for Molecular Biology. Halkidiki, Greece, June. pp. 65-74.
Heinemeyer, T., Chen, X., Karas, H., Kel, A. E., Kel, O. V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F. and Wingender, E. (1999). Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucleic Acids Research 27:318-322.
Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A. E., Kel, O. V., Ignatieva, E. V., Ananko, E. A., Podkolodnaya, O. A., Kolpakov, F. A., Podkolodny, N. L. and Kolchanov, N. A. (1998). Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Research 26:362-367.
Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. and Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules. In: Conference on Information and Knowledge Management. Gaithersburg, Maryland, November. pp. 401-407.
Toivonen, H., Klemettinen, M., Ronkainen, P., Hätönen, K. and Mannila, H. (1995). Pruning and grouping discovered association rules. In: MLnet Workshop on Statistics, Machine Learning, and Discovery in Databases. Heraklion, Crete, Greece, September. pp. 47-52.