Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction

Authors

  • Rafik A. Salama
  • Dov J. Stekel
Abstract

Prediction of transcription factor binding sites is an important challenge in genome analysis. The advent of next-generation genome sequencing technologies makes the development of effective computational approaches particularly imperative. We have developed a novel training-based methodology for prokaryotic transcription factor binding site prediction. Our methodology extends existing models by using conditional probabilities to account for base interdependencies between neighbouring positions, and it includes genomic background weighting. We tested it against other existing and novel methodologies, including position-specific weight matrices, first-order hidden Markov models and joint probability models. We also tested the use of gapped and ungapped alignments and the inclusion or exclusion of background weighting. We show that our best method enhances binding site prediction for all of the 22 Escherichia coli transcription factors with at least 20 known binding sites, with many showing substantial improvements. We highlight the advantage of using block alignments of binding sites over gapped alignments to capture neighbouring position interdependencies. We also show that combining these methods with ChIP-on-chip data has the potential to further improve binding site prediction. Finally, we have developed the ungapped likelihood under positional background platform: a user-friendly website that gives access to the prediction method devised in this work.
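
The contrast the abstract draws between position-specific weight matrices and a model that conditions each base on its left neighbour can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the pseudocount of 1, the uniform background and the toy training sites are all assumptions made for illustration, and the paper's own background weighting may differ.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_probs(sites, pseudo=1.0):
    """Per-position base probabilities from ungapped, equal-length sites."""
    total = len(sites) + pseudo * len(BASES)
    probs = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        probs.append({b: (counts[b] + pseudo) / total for b in BASES})
    return probs

def conditional_probs(sites, pseudo=1.0):
    """P(base b at position i | base a at position i-1), for i >= 1."""
    cond = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        prev_counts = Counter(s[i - 1] for s in sites)
        table = {}
        for a in BASES:
            denom = prev_counts[a] + pseudo * len(BASES)
            for b in BASES:
                table[(a, b)] = (pair_counts[(a, b)] + pseudo) / denom
        cond.append(table)
    return cond

def pswm_score(seq, probs, background):
    """Log-odds score that treats every position as independent."""
    return sum(math.log(probs[i][b] / background[b]) for i, b in enumerate(seq))

def neighbour_score(seq, probs, cond, background):
    """Log-odds score that conditions each base on its left neighbour."""
    score = math.log(probs[0][seq[0]] / background[seq[0]])
    for i in range(1, len(seq)):
        a, b = seq[i - 1], seq[i]
        score += math.log(cond[i - 1][(a, b)] / background[b])
    return score

# Hypothetical toy data: positions 2 and 3 co-vary (CG or TA, never CA).
sites = ["ACGT"] * 5 + ["ATAT"] * 5
background = {b: 0.25 for b in BASES}  # assumed uniform background
probs = position_probs(sites)
cond = conditional_probs(sites)

for candidate in ("ACGT", "ACAT"):
    print(candidate,
          round(pswm_score(candidate, probs, background), 3),
          round(neighbour_score(candidate, probs, cond, background), 3))
```

In this toy set the independent model cannot tell `ACGT` from the never-observed combination `ACAT`, because each base is equally frequent at its own position, whereas the neighbour-conditioned score penalises `ACAT`. The sketch requires ungapped sites of equal length, which matches the block alignments the abstract favours.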

Similar resources

Sequence Analysis A non-independent energy based multiple sequence alignment improves prediction of transcription factor binding sites

Motivation: Multiple Sequence Alignments (MSAs) are usually scored under the assumption that the sequences being aligned have evolved by common descent. Consequently, the differences between sequences reflect the impact of insertions, deletions and mutations. However, non-coding DNA binding sequences, such as transcription factor binding sites (TFBS), are frequently not related by common descen...

PFP: A Computational Framework for Phylogenetic Footprinting in Prokaryotic Genomes

Phylogenetic footprinting is a widely used approach for the prediction of transcription factor binding sites (TFBSs) through identification of conserved motifs in the upstream sequences of orthologous genes in eukaryotic genomes. However, this popular strategy may not be directly applicable to prokaryotic genomes, where typically about half of the genes in a genome form multiple-gene transcript...

DNA-MATRIX a tool for DNA motif discovery and weight matrix construction

In computational molecular biology, prediction of gene regulatory binding sites across a whole genome remains a challenge for researchers. Nowadays, genome-wide regulatory binding site prediction tools require either a direct pattern sequence or a weight matrix. Although databases of known transcription factor binding sites are available for genome-wide prediction, no tool is available whic...

A flexible integrative approach based on random forest improves prediction of transcription factor binding sites

Transcription factor binding sites (TFBSs) are DNA sequences of 6-15 base pairs. Interaction of these TFBSs with transcription factors (TFs) is largely responsible for most spatiotemporal gene expression patterns. Here, we evaluate to what extent sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPDs) and the nucleotide sequenc...

Genome-Wide De Novo Prediction of Cis-Regulatory Binding Sites in Mycobacterium tuberculosis H37Rv.

The transcription regulatory system of Mycobacterium tuberculosis (M. tb) remains incompletely understood. In this study, we have applied the eGLECLUBS algorithm to a group of related prokaryotic genomes for de novo genome-wide prediction of cis-regulatory binding sites (CRBSs) in M. tb H37Rv. The top 250 clusters from our prediction recovered 83.3% (50/60) of all known CRBSs in extracted inter...

Journal title:

Volume 38, Issue

Pages -

Publication date: 2010