Supervised detection of conserved motifs in DNA sequences with cosmo.

Authors

  • Oliver Bembom
  • Sunduz Keles
  • Mark J van der Laan
Abstract

A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. Here we introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, but relies more heavily on the likelihood principle than on more ad hoc criteria such as the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced by 2- to 3-fold greater sensitivity in some simulation studies. Additional improvements in performance can be achieved by selecting the model type (OOPS, ZOOPS, or TCM) data-adaptively or by supplying correctly specified constraints, especially if the motif appears only as a weak signal in the data. The algorithm can data-adaptively choose between working in a given constrained model and working in the completely unconstrained model, guarding against the risk of supplying mis-specified constraints. Simulation studies suggest that this approach can offer 3 to 3.5 times greater sensitivity than MEME. The algorithm has been implemented in the form of a stand-alone C program as well as a web application that can be accessed at http://cosmoweb.berkeley.edu. An R package is available through Bioconductor (http://bioconductor.org).
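To make the likelihood-based ingredients of the abstract concrete, the following is a minimal Python sketch of how a position weight matrix could be scored under an OOPS-type two-component mixture and how BIC could then compare candidate motif widths. It is not cosmo's code and does not use its API. The function names `oops_log_likelihood` and `bic`, the zero-order background, the uniform prior over motif start positions, and the use of the number of sequences as the BIC sample size are all assumptions made for illustration.

```python
import math


def oops_log_likelihood(seqs, pwm, background):
    """Log-likelihood of sequences under a simple OOPS-type model (illustrative).

    Assumptions (not taken from the cosmo paper): each sequence contains
    exactly one motif occurrence, its start is uniform over all valid
    positions, and non-motif positions follow a zero-order background.

    pwm        -- list of dicts, one per motif position, base -> probability (> 0)
    background -- dict, base -> probability (> 0)
    """
    width = len(pwm)
    total = 0.0
    for seq in seqs:
        n_starts = len(seq) - width + 1
        if n_starts < 1:
            raise ValueError("sequence is shorter than the motif width")
        # Log-probability of the whole sequence under the background alone.
        log_bg = sum(math.log(background[base]) for base in seq)
        # Log motif-vs-background ratio for every candidate start position.
        log_ratios = [
            sum(
                math.log(pwm[k][seq[start + k]] / background[seq[start + k]])
                for k in range(width)
            )
            for start in range(n_starts)
        ]
        # Average the ratios over start positions with a stable log-sum-exp.
        peak = max(log_ratios)
        log_mix = (
            peak
            + math.log(sum(math.exp(r - peak) for r in log_ratios))
            - math.log(n_starts)
        )
        total += log_bg + log_mix
    return total


def bic(log_lik, n_free_params, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return -2.0 * log_lik + n_free_params * math.log(n_obs)


# Hypothetical toy data: a width-3 motif "ACG" planted in two short sequences.
uniform = {base: 0.25 for base in "ACGT"}
acg_pwm = [
    {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}
    for consensus in "ACG"
]
toy_seqs = ["TTACGTT", "GACGTAA"]
toy_log_lik = oops_log_likelihood(toy_seqs, acg_pwm, uniform)
# An unconstrained PWM of width W has 3 * W free parameters.
toy_bic = bic(toy_log_lik, n_free_params=3 * len(acg_pwm), n_obs=len(toy_seqs))
```

In a width search, one would fit a matrix for each candidate width (for example by EM, which is omitted here), compute `bic` for each fit, and keep the width with the smallest value. What counts as the sample size in the penalty term is a modeling choice, and the choice made above is only one possibility.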

Similar resources

In silico investigation of lactoferrin protein characterizations for the prediction of anti-microbial properties

Lactoferrin (Lf) is an iron-binding multi-functional glycoprotein which has numerous physiological functions such as iron transportation, anti-microbial activity and immune response. In this study, different in silico approaches were exploited to investigate Lf protein properties in a number of mammalian species. Results showed that the iron-binding site, DNA and RNA-binding sites, signal pepti...

Functional motifs in Escherichia coli NC101

According to recent reports, Escherichia coli (E. coli) bacteria can damage the DNA of gut lining cells and may encourage the development of colon cancer. Genetic switches are specific sequence motifs, and many of them are drug targets. It is therefore of interest to identify motifs and their locations in sequences. In the present study, the Gibbs sampler algorithm was used to predict and find functional m...

Enhancing motif finding models using multiple sources of genome-wide data

The SUCcESS package implements the CTCM model (particularly logistic regression model) proposed by Shim and Keleş (2008) for integrating quantitative information into motif finding as well as its extension to use multiple data sources at a time (e.g., ChIP-seq, nucleosome occupancy, or conservation score). We implemented them as an extended module for cosmo, developed by Bembom et al. (2007), i...

Molecular and Bioinformatics Analysis of Allelic Diversity in IGFBP2 Gene Promoter in Indigenous Makuee and Lori-Bakhtiari Sheep Breeds

The aim of this study was to perform molecular and bioinformatics analysis of IGFBP2 gene promoter in association with some economic traits in indigenous Makuee (MS) and Lori-Bakhtiari (LB) breeds. DNA was extracted from blood samples of 120 MS and 200 LB and a 297 bp fragment from the upstream sequences of studied gene was amplified and genotyped by single-strand conformational polymo...

A Protein Family Classification Method for Analysis of Large DNA Sequences

A method is described for identification and classification of proteins encoded in large DNA sequences. Previously, an automated system was introduced for the general detection of amino acid sequence motifs within diverse protein families. The system generated a database consisting of aligned sequence segments (blocks) that correspond to the most highly conserved regions of proteins. This datab...

Journal:
  • Statistical Applications in Genetics and Molecular Biology

Volume 6, Issue

Pages -

Publication date: 2007