Molecular Biology and Genetics
EJB Electronic Journal of Biotechnology ISSN: 0717-3458
© 2000 by Universidad Católica de Valparaíso -- Chile
BIP RESEARCH ARTICLE

Predicting regulatory elements in repetitive sequences using transcription factor binding sites

Jorng-Tzong Horng*
Department of Computer Science and Information Engineering
National Central University
Taiwan
Tel: +886-3-4227151 Ext. 4519
Fax: +886-3-4222681

E-mail: horng@db.csie.ncu.edu.tw

Wen-Fu Cho
Applied Research Lab., Telecommunications Labs.
Chunghwa Telecom Co., Ltd.
Yang-Mei, Taoyuan, Taiwan
Tel: +886-3-4244197
Fax: +886-3-4244167

*Corresponding author

Financial Support: National Science Council of the Republic of China under Contract No. NSC 89-2213-E-008-061.
Keywords: binding sites, data mining, genomes, regulatory elements, transcription factors.


BIP Article

The International Human Genome Project recently announced a working draft of the initial sequence of the human genome, and the complete genetic blueprint for human beings is forthcoming. This provides an enormous amount of information for studying how the human genome as a whole is organized and how it functions. However, extracting knowledge from this information may be an even more challenging task than sequencing the genome. Data mining and machine learning techniques will probably play an essential role in this knowledge extraction by finding interesting, statistically unexpected patterns and thus giving biologists targets for further investigation.

Repetitive sequences are the most abundant sequences in the extragenic regions of genomes. Biologists have already found a large number of regulatory elements in these regions. These elements may profoundly affect chromatin structure formation in the nucleus and may also hold important clues for the study of genetic evolution and phylogeny.

Each gene in a eukaryotic genome has a particular combination of transcription factor binding sites that activate or repress its transcription. These sites are usually specific DNA sequences about five to twenty-five nucleotides long, and they are arrayed within several hundred base pairs, predominantly upstream of the transcription initiation site in the promoter region (Brazma et al. 1997). Many transcription factor binding sites have been collected in databases. TRANSFAC (Heinemeyer et al. 1998; Heinemeyer et al. 1999) is the most complete and well-maintained database of transcription factor binding sites. Notably, transcription factor binding sites can be described by consensus patterns or by nucleotide distribution matrices.
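To make the idea of a consensus pattern concrete, the following is a minimal sketch of how a consensus written in standard IUPAC nucleotide codes could be located in a DNA sequence. The function name `find_sites`, the toy consensus `TATAWAW`, and the example sequence are illustrative assumptions; they are not taken from TRANSFAC or from the method described here.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, consensus):
    """Return 0-based start positions of all matches, including overlaps."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Hypothetical toy input: a TATA-like consensus and a short sequence.
print(find_sites("GGTATAAATGCTATATAT", "TATAWAW"))  # prints [2, 11]
```

A nucleotide distribution matrix would replace the exact match with a per-position score and a threshold; the consensus form is shown only because it is the simpler of the two descriptions.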

This study first identifies the combinations of transcription factor binding sites in repetitive sequences. Data mining techniques are then applied to mine associations from the combinations of transcription factor binding sites that occur in repeat sequences. Because such techniques can produce an enormous number of associations, it is extremely difficult for a human user to identify the useful or interesting ones. Insignificant associations are therefore removed to obtain a set of useful associations. In addition, the discovered associations are used to partially classify the repeat sequences in our repeat database.
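As a rough illustration of what mining associations from combinations of binding sites can look like, the sketch below treats each repeat sequence as a set of transcription factor names and derives rules by support and confidence. It is a brute-force toy under stated assumptions, not the algorithm used in this study: the function `mine_rules`, the thresholds, and the example factor names are all hypothetical.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8, max_size=3):
    """Return rules as (lhs, rhs, support, confidence) tuples."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))

    # Support of every frequent itemset up to max_size (brute force).
    support = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            s = sum(1 for t in sets if itemset <= t) / n
            if s >= min_support:
                support[itemset] = s

    # Split each frequent itemset into antecedent (lhs) and consequent (rhs).
    rules = []
    for itemset, s in support.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so
                # support[lhs] is guaranteed to exist.
                confidence = s / support[lhs]
                if confidence >= min_confidence:
                    rhs = itemset - lhs
                    rules.append((tuple(sorted(lhs)), tuple(sorted(rhs)),
                                  s, confidence))
    return rules


# Hypothetical toy data: factors whose sites occur in four repeats.
repeats = [
    {"SP1", "AP-1", "GATA-1"},
    {"SP1", "AP-1"},
    {"SP1", "GATA-1"},
    {"AP-1", "SP1"},
]
for lhs, rhs, s, c in mine_rules(repeats):
    print(lhs, "->", rhs, "support", s, "confidence", c)
```

Even on this tiny input several candidate itemsets are examined, which hints at why the number of associations grows so quickly on real data and why a later filtering stage is needed.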

Steps of the proposed approach:

  1. Determine the number of item sets of the transcription factor binding sites in TRANSFAC.
  2. For categorical binding sites, map the identification of each binding site to a set of transcription factor names.
  3. Find the combinations of transcription factors in repeat sequences.
  4. Apply the data mining approach to generate association rules.
  5. Determine the interesting rules using the chi-square significance measure.
  6. Prune redundant rules (Klemettinen et al. 1994; Toivonen et al. 1995).
  7. Classify rules into cover and non-cover sets.
  8. Partially classify repeat sequences using the mined association rules.
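The chi-square step can be illustrated with a small worked example. The sketch below computes the Pearson chi-square statistic of a 2x2 contingency table for a single rule X -> Y and compares it with 3.841, the standard critical value for one degree of freedom at the 0.05 level. The function name `chi_square`, the choice of significance level, and the toy data are assumptions made for illustration; the source does not state which threshold was used.

```python
def chi_square(transactions, lhs, rhs):
    """Pearson chi-square statistic for the rule lhs -> rhs (2x2 table)."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(transactions)

    # observed[i][j]: i = 0 if lhs is present, j = 0 if rhs is present.
    observed = [[0, 0], [0, 0]]
    for t in transactions:
        t = frozenset(t)
        i = 0 if lhs <= t else 1
        j = 0 if rhs <= t else 1
        observed[i][j] += 1

    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected == 0:
                # Degenerate table: one side is always or never present.
                return 0.0
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


CRITICAL_0_05 = 3.841  # chi-square critical value, 1 degree of freedom

# Hypothetical toy data: 10 repeats, two factors.
data = ([{"SP1", "AP-1"}] * 4 + [{"SP1"}] + [{"AP-1"}] + [set()] * 4)
stat = chi_square(data, {"SP1"}, {"AP-1"})
print(round(stat, 3), "significant" if stat > CRITICAL_0_05 else "not significant")
```

In this toy table the statistic is about 3.6, just below the threshold, so the rule would not be kept despite a confidence of 0.8. This is the sense in which a significance measure can filter out rules that support and confidence alone would accept.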

Our experimental genome sequences include those of C. elegans, human chromosome 22, yeast, and several bacteria. The mined rules can be used to find genes in complete genomes as well as to partially cluster the repetitive sequences in a repeat database.
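One possible reading of partial clustering is sketched below: a repeat is grouped under every rule whose transcription factors it contains, and repeats that satisfy no rule are left unclassified, which is what makes the result partial. This interpretation, the function `classify_repeats`, and the repeat and factor names are assumptions for illustration only; the source does not specify the procedure.

```python
def classify_repeats(repeats, rules):
    """Group repeat names by the rules they satisfy.

    repeats: dict mapping a repeat name to its set of factor names.
    rules:   list of (lhs, rhs) pairs, each side a tuple of factor names.
    """
    clusters = {}
    unclassified = []
    for name, factors in repeats.items():
        factors = set(factors)
        # A repeat satisfies a rule if it contains both sides of the rule.
        matched = [(lhs, rhs) for lhs, rhs in rules
                   if set(lhs) | set(rhs) <= factors]
        if not matched:
            unclassified.append(name)
        for rule in matched:
            clusters.setdefault(rule, []).append(name)
    return clusters, unclassified


# Hypothetical rules and repeats.
rules = [(("AP-1",), ("SP1",)), (("GATA-1",), ("SP1",))]
repeats = {
    "rep1": {"AP-1", "SP1"},
    "rep2": {"GATA-1", "SP1", "AP-1"},
    "rep3": {"CREB"},
}
clusters, unclassified = classify_repeats(repeats, rules)
print(clusters)
print(unclassified)
```

Under this reading a repeat may belong to more than one group, as `rep2` does here, so the output is an overlapping grouping rather than a strict partition.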

References

Brazma, A., Vilo, J., Ukkonen, E. and Valtonen, K. (1997). Data mining for regulatory elements in yeast genome. In: International Conference on Intelligent Systems for Molecular Biology, 5th. Halkidiki, Greece, June. pp. 65-74.

Heinemeyer, T., Chen, X., Karas, H., Kel, A. E., Kel, O. V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F. and Wingender, E. (1999). Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucleic Acids Research 27:318-322.

Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A. E., Kel, O. V., Ignatieva, E. V., Ananko, E. A., Podkolodnaya, O. A., Kolpakov, F. A., Podkolodny, N. L. and Kolchanov, N. A. (1998). Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Research 26:362-367.

Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. and Verkamo, A. I. (1994). Finding interesting rules from large sets of discovered association rules. In: Conference on Information and Knowledge Management. Gaithersburg, Maryland, November. pp. 401-407.

Toivonen, H., Klemettinen, M., Ronkainen, P., Hatonen, K. and Mannila, H. (1995). Pruning and grouping discovered association rules. In: MLnet Workshop on Statistics, Machine Learning, and Discovery in Databases. Heraklion, Crete, Greece, September. pp. 47-52.

Supported by UNESCO / MIRCEN network