Systematic and Fully Automated Identification of Protein Sequence Patterns
Abstract
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Our results also show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method’s ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.
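To make the sensitivity and specificity comparison above concrete, the following is a minimal sketch of how a PROSITE-style pattern could be scored against a labelled set of sequences. It is not the Splash algorithm and not the authors' evaluation code. The function names `prosite_to_regex` and `evaluate`, the toy sequences, and the definitions used (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) are all assumptions for illustration; the paper may define these statistics differently.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common syntax only: '-' separators, 'x' wildcards,
    [..] allowed sets, {..} forbidden sets, (n) or (n,m) repeats,
    and the '<' / '>' terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2,4) or A(3).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        # [..] sets and single residue letters are already valid regex.
        quantifier = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + quantifier + suffix)
    return "".join(parts)


def evaluate(pattern, members, non_members):
    """Return (sensitivity, specificity) of a pattern on labelled sequences.

    Assumed definitions: sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP).
    """
    rx = re.compile(prosite_to_regex(pattern))
    tp = sum(1 for seq in members if rx.search(seq))
    fp = sum(1 for seq in non_members if rx.search(seq))
    tn = len(non_members) - fp
    sensitivity = tp / len(members) if members else 0.0
    specificity = tn / len(non_members) if non_members else 0.0
    return sensitivity, specificity


# Toy data invented for illustration; these are not real protein sequences.
family = ["AAGDSGGPLL", "MKGDSGGAAV", "TTGDAGGKK"]
background = ["MKVLAAGL", "PPGDSGGTT"]
sens, spec = evaluate("G-D-S-G-G", family, background)
```

Comparing two candidate patterns for the same family then amounts to calling `evaluate` on each and checking whether one improves a statistic without lowering the other.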