Systematic and Fully Automated Identification of Protein Sequence Patterns

Authors

  • REECE K. HART
  • AJAY K. ROYYURU
  • GUSTAVO STOLOVITZKY
  • ANDREA CALIFANO
Abstract

We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Our results also show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
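The comparison of patterns by sensitivity and specificity described in the abstract can be illustrated with a small sketch. This is not the Splash algorithm and not the authors' evaluation code; it only shows, under stated assumptions, how a PROSITE-style pattern might be scored against a family. The function names (`prosite_to_regex`, `pattern_stats`), the toy pattern, and the toy sequences are all hypothetical, and the textbook definitions of sensitivity and specificity used here are an assumption, since the abstract does not spell out its own.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Handles 'x' (any residue), [ABC] (allowed residues), {ABC} (forbidden
    residues) and repeats such as x(2) or x(2,4). The '<' and '>' anchors
    of the full PROSITE syntax are deliberately left out of this sketch.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def pattern_stats(pattern, family, background):
    """Return (sensitivity, specificity) of a pattern on labeled sequences.

    Assumed definitions: sensitivity = TP / (TP + FN) over family members,
    specificity = TN / (TN + FP) over non-members.
    """
    regex = re.compile(prosite_to_regex(pattern))
    true_pos = sum(1 for seq in family if regex.search(seq))
    false_pos = sum(1 for seq in background if regex.search(seq))
    sensitivity = true_pos / len(family)
    specificity = (len(background) - false_pos) / len(background)
    return sensitivity, specificity


# Made-up sequences and pattern, for illustration only.
TOY_PATTERN = "G-D-S-G-x(2)-[LIV]"
TOY_FAMILY = ["AAGDSGGPLC", "MKGDSGAAVT", "TTGDSGKRIQ", "GGGESGGPLL"]
TOY_BACKGROUND = ["MKTAYIAKQR", "GDSGAAWPPP", "LLLVVVIIIA", "GDSGTTLKKK"]

if __name__ == "__main__":
    print(prosite_to_regex(TOY_PATTERN))
    print(pattern_stats(TOY_PATTERN, TOY_FAMILY, TOY_BACKGROUND))
```

On the toy data the pattern misses one family member and hits one background sequence, so both statistics come out at 0.75. A pattern that improved one number without lowering the other would fall into the "better" category described above.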

Similar resources

SECURING INTERPRETABILITY OF FUZZY MODELS FOR MODELING NONLINEAR MIMO SYSTEMS USING A HYBRID OF EVOLUTIONARY ALGORITHMS

In this study, a Multi-Objective Genetic Algorithm (MOGA) is utilized to extract interpretable and compact fuzzy rule bases for modeling nonlinear Multi-input Multi-output (MIMO) systems. In the process of nonlinear system identification, structure selection, parameter estimation, model performance and model validation are important objectives. Furthermore, securing low-level and high-level ...

Automated Identification of Three-Dimensional Motif in Proteins

This paper describes an approach to automated identification of three-dimensional (3-D) motif in proteins. Here, the structure of a protein was reduced into abstract representation which consists of the α-helix and β-strand secondary structure elements, these being described by vectors in 3-D space rather than the point-like atoms that are used in the simple Cα approximation. The algorithms and the...

Automated Identification of Three-Dimensional Common Structural Features of Proteins

This paper describes an approach to automated identification of three-dimensional (3D) common structural features of proteins. The structure of a protein was represented by a set of secondary structure elements (SSEs) in the same manner used in our previous work, where only α-helices and β-strands were considered. The maximal common subgraph matching algorithm, based on a graph theoretical clique ...

Intraspecific ITS Variability in the Kingdom Fungi as Expressed in the International Sequence Databases and Its Implications for Molecular Species Identification

The internal transcribed spacer (ITS) region of the nuclear ribosomal repeat unit is the most popular locus for species identification and subgeneric phylogenetic inference in sequence-based mycological research. The region is known to show certain variability even within species, although its intraspecific variability is often held to be limited and clearly separated from interspecific variabi...

Towards a fully automated protein structure classification: How to get CATH classification from FSSP Z-scores

Currently, each week about 50 new protein structures are made available in public databases. The attention is focused on developing automatic methods of classification. The work of organization is being done by several groups, to a large extent independently. To our knowledge, the consistency of different classifications has never been examined on a protein by protein basis. Moreover, the potentia...

Publication date: 2000