Mining New Motifs from cDNA Sequence Data

Authors

  • Christian Schönbach
  • Hideo Matsuda
  • H. Matsuda
Abstract

General biological databases that store basic information on genome, transcriptome, and proteome are indispensable sequence discovery resources. However, they are not necessarily useful for inferring the functions of proteins. To see this, observe that SWISS-PROT, a protein knowledgebase containing curated protein sequences and functional information on domains and diseases, has grown a mere 32-fold in 17 years, from 3,939 entries in 1986 to 126,147 entries in 2003. Similarly, despite the human draft genome and the mouse draft genome and transcriptome, the number of human and mouse protein sequences with some functional information has remained low: 7,471 (7.4%) for human and 4,816 (4.7%) for mouse, compared to an estimated proteome of 0.5–1.0 sequences. The majority of sequences in the TrEMBL database of SWISS-PROT/TrEMBL, in FANTOM, and in other similar databases are hypothetical proteins, or are uninformative sequences described as “similar to DKFZ ...” or “weakly similar to KIAA ....” These sequences either have no informative homolog that diverged from a common ancestor, or match only a non-informative homolog. Algorithms for identifying motifs are commonly used to classify these sequences and to provide functional clues on binding sites, catalytic sites, and active sites, or on structure-function relations. For example, 5,873 of 21,050 predicted FANTOM1 protein sequences contain InterPro motifs or domains. In fact, the InterPro name is the only functional description of 900 sequences. Extrapolations from current mouse cDNA data indicate that the proteome is significantly larger than the genome. This underlines the importance of exploring protein sequences, motifs, and modules to derive potential functions and interactions for these sequences. Strictly defined, new protein sequence motifs are either conserved sequences of common ancestry, or arise from convergence (functional motifs)
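To make the idea of motif-based classification concrete, the following is a minimal sketch of scanning protein sequences for a single known pattern. It is not the authors' method. It assumes the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P} as the example motif, and the function name `find_motif` and the toy sequences are hypothetical illustrations, not data from FANTOM or SWISS-PROT.

```python
import re

# Assumption: PROSITE-style pattern N-{P}-[ST]-{P} (N-glycosylation site),
# rewritten as a regular expression. The lookahead wrapper allows
# overlapping occurrences to be reported.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence):
    """Return (0-based start, matched residues) for each motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence.upper())]


# Hypothetical toy sequences used only to exercise the scan.
toy_sequences = {
    "seq_with_site": "MKNGSAVLNPTQ",
    "seq_without_site": "MKAGAAVLQQ",
}

# Map each sequence name to its list of motif hits.
hits = {name: find_motif(seq) for name, seq in toy_sequences.items()}

for name, found in hits.items():
    print(name, found)
```

A sequence with no annotated homolog but at least one hit could then be tagged with the motif name as a tentative functional clue, which is the kind of annotation the abstract describes for InterPro matches.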

Similar resources

WebTraceMiner: a web service for processing and mining EST sequence trace files

Expressed sequence tags (ESTs) remain a dominant approach for characterizing the protein-encoding portions of various genomes. Due to inherent deficiencies, they also present serious challenges for data quality control. Before GenBank submission, EST sequences are typically screened and trimmed of vector and adapter/linker sequences, as well as polyA/T tails. Removal of these sequences presents...

Mining Protein Sequences for Motifs

We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein ...

A DNA based Approach to find Closed Repetitive Gapped Subsequences from a Sequence Database

In bioinformatics, the discovery of transcription factor binding affinities is important. This is done by sequence analysis of microarray data. Accurately determining continuous and gapped motifs from a long sequence of data, such as genetic data, is challenging and requires detailed study. In this paper, we propose an algorithm that can be used for finding short continuous, shor...

New Seed Selection Technique for Protein Sequence Motif Identification

Bioinformatics is a field devoted to the interpretation and analysis of biological data using computational techniques. In recent years the study of bioinformatics has grown tremendously due to the huge amount of biological information generated by the scientific community. Protein sequence motifs are short fragments of conserved amino acids often associated with a specific function. Identifying such...

Sequential Data Mining for Information Extraction from Texts

This paper shows the benefit of using data mining methods for Biological Natural Language Processing. A method for discovering linguistic patterns based on recursive sequential pattern mining is proposed. It requires neither sentence parsing nor any resource other than a training data set. It produces understandable results and we show its interest in the extraction of relations between named...

Publication date: 2008