Methods for Pattern Discovery in Unaligned Biological Sequences
Abstract
Pattern discovery in biological sequences is the problem of finding patterns that are overrepresented in a set of unaligned DNA or protein sequences of related biological function. Such patterns could correspond to regions of the sequences responsible for the function itself, and could be used later for the functional annotation of newly determined sequences. Despite many studies, this problem can be considered far from being solved. The main difficulty lies in the fact that significant patterns can appear within each sequence with mutations, insertions or deletions of nucleotides or amino acids, without losing their biological function. This paper provides a survey of a number of existing pattern discovery algorithms, focusing both on the methods underlying them and their availability for the scientific community.
INTRODUCTION
In the last few years, the amount of biological data generated by the scientific community has grown exponentially, and a huge number of sequences are now available in several databases. Thus, the focus of many research projects has shifted from the generation of the data to their analysis, that is, the extraction of any kind of biological meaning from the sequences. Given a set of functionally related sequences, the main aim of pattern discovery algorithms is to find new and a priori unknown patterns that appear in every sequence, or at least in a significant number of sequences, of the set. Such patterns could correspond to the regions of the sequences responsible for their function, and could be also used later for the functional annotation of newly determined sequences. From a computational point of view, the main difficulty lies in the fact that the same pattern can appear within each sequence in a different and approximate form (i.e. with mutations, insertions or deletions of nucleotides or amino acids), keeping intact its biological function.
The longer the patterns and the more degraded their occurrences, the harder (and slower) it is for pattern discovery algorithms to find them. A large number of different methods have been introduced so far, but we do not yet have good models or reliable algorithms that are guaranteed to find all (or most of) the biologically meaningful patterns. This paper, without claiming to be exhaustive, provides a survey of a number of different approaches to the problem, presenting the main ideas underlying the methods and mentioning, when available, the software tools based on the different algorithms.
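The problem statement above can be made concrete with a small sketch. The code below is not one of the surveyed algorithms; it is a naive exhaustive baseline written only for illustration. It assumes a DNA alphabet, handles substitutions only (no insertions or deletions), and all names (`hamming`, `occurs_approximately`, `discover_patterns`, `min_support`) are hypothetical. It enumerates every candidate pattern of length `k` and keeps those that occur, with at most `max_mismatches` substitutions, in at least `min_support` sequences.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approximately(pattern, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches
    substitutions of the pattern (insertions/deletions are NOT handled)."""
    k = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def discover_patterns(sequences, k, max_mismatches, min_support, alphabet="ACGT"):
    """Exhaustively test all len(alphabet)**k candidate patterns and return
    a dict mapping each sufficiently supported pattern to the number of
    sequences that contain an approximate occurrence of it."""
    supported = {}
    for letters in product(alphabet, repeat=k):
        pattern = "".join(letters)
        support = sum(
            occurs_approximately(pattern, s, max_mismatches) for s in sequences
        )
        if support >= min_support:
            supported[pattern] = support
    return supported


# Toy input (made up for this example): "ACGT" appears exactly in the first
# sequence and with one substitution in the other two.
example_sequences = ["TTACGTTT", "GGACGAGG", "CCAAGTCC"]
approximate_hits = discover_patterns(example_sequences, k=4, max_mismatches=1, min_support=3)
exact_hits = discover_patterns(example_sequences, k=4, max_mismatches=0, min_support=3)
```

In this toy input, `"ACGT"` is reported only when one mismatch is allowed, which is the point made above: exact matching misses degraded occurrences. The number of candidates grows as `len(alphabet)**k`, so this enumeration is practical only for short patterns, consistent with the remark that longer and more degraded patterns are harder and slower to find.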
Similar resources
Solving Longest Common Subsequence Problem with Memetic Algorithms
Pattern discovery in unaligned DNA sequences is a challenging problem. A pattern is a specific nucleotide combination that can be used to measure the degree of similarity among biological sequences. The longest common subsequence (LCS) problem can be viewed as a pattern discovery problem and is also a well-known NP-hard problem. In this paper, we present a memetic algorithm-based approach to solve ...
Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are the most important portion of biological sequences, and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes depending on the contexts’ structures. Many studies have been performed to critically determine the structure and function of repetitiv...
Genetic Algorithm Based Probabilistic Motif Discovery in Multiple Unaligned Biological Sequences
Many computational approaches have been introduced for the problem of motif identification in a set of biological sequences, which are classified according to the type of motifs discovered. In this study, we propose a model to discover motifs in a large set of unaligned sequences in minimal time using a genetic algorithm based probabilistic motif discovery model. The proposed algorith...
An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences
MOTIVATION Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/), are used to directly identi...
Recognition of multiple patterns in unaligned sets of sequences. Comparison of kernel clustering method with other methods
MOTIVATION Transcription factor binding sites often differ significantly in their primary sequence and can hardly be aligned. Often one set of sites can contain several subsets of sequences that follow not just one but several different patterns. There is a need for sensitive methods to reveal multiple patterns in unaligned sets of sequences. RESULTS We developed a novel method for analysis o...
Journal:
- Briefings in Bioinformatics
Volume 2
Publication date: 2001