Discovering Transcription Factor Binding Motif Sequences
Abstract
Introduction
In biology, sequence motifs are short sequence patterns, usually of fixed length, that represent many features of DNA, RNA, and protein molecules. Sequence motifs can represent transcription factor binding sites in DNA, splice junctions in RNA, and binding domains in proteins. Thus, discovering sequence motifs can lead to a better understanding of transcriptional regulation, mRNA splicing, and the formation of protein complexes. Furthermore, protein motifs can represent the active sites of enzymes or regions involved in protein structure and stability. Motif discovery is an important computational problem because it reveals patterns in biological sequences that help explain the structure and function of the molecules those sequences represent. In particular, identifying regulatory elements, especially the DNA binding sites for transcription factors, is important for understanding the mechanisms that regulate gene expression. These DNA motif patterns are usually fairly short (5 to 20 base pairs long) and are known to recur in different genes or several times within a gene [1]. A DNA sequence can have zero, one, or multiple copies of a motif. In addition to these more common forms of DNA motifs, there are also palindromic motifs (a subsequence that is exactly the same as its own reverse complement) and gapped motifs (two smaller conserved sites separated by a gap) [2]. The high diversity and variability of motifs make them very difficult to identify. A large number of algorithms for finding DNA motifs have been developed. These algorithms mostly detect overrepresented motifs and conserved motifs that might be good candidates for transcription factor binding sites. Algorithms that detect overrepresented motifs deduce motifs by considering the regulatory regions (promoters) of several co-regulated or co-expressed genes.
Co-regulated genes are known to share some similarities in their regulatory mechanisms, possibly at the transcriptional level, so their promoter regions might contain common motifs that are binding sites for transcription factors. Thus, one way to detect these regulatory elements is to search for statistically overrepresented motifs in the promoter regions of such a set of co-expressed genes. However, algorithms that detect overrepresented motifs do not perform as well in higher organisms. To overcome this, some algorithms consider conserved motifs from orthologous species. Since selective pressure causes functional sequences to evolve more slowly than non-functional sequences, well-conserved sites are possible candidates for DNA motifs. Recent algorithms have also combined the two approaches to improve motif finding. In this report, we will review a …
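To make two of the ideas above concrete, the following Python sketch checks whether a sequence is palindromic (equal to its own reverse complement) and ranks k-mers by how much more often they occur in a set of promoter sequences than in a background set. It is only an illustration under stated assumptions, not any of the algorithms the report reviews: the function names, the toy sequences, and the pseudocount-smoothed presence ratio are all hypothetical choices, and the ratio is a crude enrichment score rather than a proper statistical test.

```python
from collections import Counter

# Translation table for complementing DNA bases (uppercase A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """A motif is palindromic if it equals its own reverse complement."""
    return seq == reverse_complement(seq)


def kmer_presence(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def overrepresented_kmers(promoters, background, k, pseudocount=1.0):
    """Rank k-mers by a smoothed promoter/background presence ratio.

    Hypothetical scoring: (hits + pseudocount) / (set size + pseudocount)
    is computed for each set, and the two rates are divided.
    """
    fg = kmer_presence(promoters, k)
    bg = kmer_presence(background, k)
    scores = {}
    for kmer, hits in fg.items():
        fg_rate = (hits + pseudocount) / (len(promoters) + pseudocount)
        bg_rate = (bg.get(kmer, 0) + pseudocount) / (len(background) + pseudocount)
        scores[kmer] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (made up for illustration): every promoter contains GAATTC.
promoters = ["ACGTGAATTCAA", "TTGAATTCGGCA", "CCAGGAATTCTT"]
background = ["ACGTACGTACGT", "TTTTCCCCGGGG", "AAGGCCTTAAGG"]

ranked = overrepresented_kmers(promoters, background, k=6)
top_kmer, top_score = ranked[0]
print(top_kmer, top_score, is_palindromic(top_kmer))
```

Counting presence per sequence rather than total occurrences keeps one repeat-rich promoter from dominating the ranking. Practical motif finders generally go further, for example by allowing mismatches and by scoring against a background model, which this sketch does not attempt.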
Similar resources
Finding Motifs with Insufficient Number of Strong Binding Sites
A molecule called transcription factor usually binds to a set of promoter sequences of coexpressed genes. As a result, these promoter sequences contain some short substrings, or binding sites, with similar patterns. The motif discovering problem is to find these similar patterns and motifs in a set of sequences. Most existing algorithms find the motifs based on strong-signal sequences only (i.e...
WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar
Transcription factor (TF) binding sites or motifs (TFBMs) are functional cis-regulatory DNA sequences that play an essential role in gene transcriptional regulation. Although many experimental and computational methods have been developed, finding TFBMs remains a challenging problem. We propose and develop a novel dictionary based motif finding algorithm, which we call WordSpy. One significant ...
Discovering Motifs in Ranked Lists of DNA Sequences
Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP-chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still requi...
MEME: discovering and analyzing DNA and protein sequence motifs
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel 'signals' in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME s...
W-AlignACE: an improved Gibbs sampling algorithm based on more accurate position weight matrices learned from sequence and gene expression/ChIP-chip data
MOTIVATION Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and also in discovering the regulatory targ...