Profiling Protein Families from Partially Aligned Sequences
Abstract
Profile Hidden Markov Models (PHMMs) are recognized as powerful computational tools for homology search over protein sequences. Existing PHMM training approaches use either completely unaligned or fully aligned sequences. The PHMMs resulting from these two approaches present contrasting tradeoffs between the alignment information they require and the accuracy of the search outcome. This paper describes a PHMM-based technique for modeling protein families from partially aligned sequences. By exploiting the observation that partially aligned sequences give rise to independent subsequences, PHMMs corresponding to these subsequences are composed to build PHMMs for the entire sequences. An interesting aspect of the technique is that it gives rise to a family of PHMMs parameterized with respect to the alignment information. We present an experimental comparison of the performance of our technique against several state-of-the-art homology detection methods.
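The observation that partial alignment yields independent subsequences can be illustrated with a minimal sketch. It is not the paper's implementation: the anchor representation (half-open `(start, end)` spans marking the aligned blocks in each sequence) and the function names `split_at_anchors` and `segment_groups` are assumptions made here for illustration. The sketch only shows the splitting step. Under this reading, each returned group would be the training set for one sub-model, and the sub-models would be composed in order.

```python
from typing import List, Tuple

# Hypothetical representation: an anchor is a half-open (start, end) span
# marking a block of one sequence that is already aligned across the family.
Anchor = Tuple[int, int]


def split_at_anchors(seq: str, anchors: List[Anchor]) -> List[str]:
    """Return the unaligned segments of `seq` lying before, between, and after the anchors."""
    segments = []
    cursor = 0
    for start, end in sorted(anchors):
        if start < cursor or end < start or end > len(seq):
            raise ValueError("anchors must be ordered, non-overlapping, and inside the sequence")
        segments.append(seq[cursor:start])  # unaligned stretch before this anchor
        cursor = end                        # skip over the aligned block
    segments.append(seq[cursor:])           # unaligned tail after the last anchor
    return segments


def segment_groups(seqs: List[str], anchors_per_seq: List[List[Anchor]]) -> List[List[str]]:
    """Group the k-th unaligned segment of every sequence together.

    Each group can be treated independently of the others, because the
    anchors fix where it begins and ends in every sequence.
    """
    if len(seqs) != len(anchors_per_seq):
        raise ValueError("need one anchor list per sequence")
    split = [split_at_anchors(s, a) for s, a in zip(seqs, anchors_per_seq)]
    if len({len(parts) for parts in split}) > 1:
        raise ValueError("every sequence must have the same number of anchors")
    return [list(group) for group in zip(*split)]
```

With more anchors, the groups become shorter and more constrained; with none, the single group is the full set of unaligned sequences. This is one way to picture a family of models parameterized by the amount of alignment information.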
Similar resources
Exploring the plant transcriptome through phylogenetic profiling.
Publicly available protein sequences represent only a small fraction of the full catalog of genes encoded by the genomes of different plants, such as green algae, mosses, gymnosperms, and angiosperms. By contrast, an enormous number of expressed sequence tags (ESTs) exist for a wide variety of plant species, representing a substantial part of all transcribed plant genes. Integrating protein an...
Optimal classification of protein sequences and selection of representative sets from multiple alignments: application to homologous families and lessons for structural genomics.
Hierarchical classification is probably the most popular approach to group related proteins. However, there are a number of problems associated with its use for this purpose. One is that the resulting tree showing a nested sequence of groups may not be the most suitable representation of the data. Another is that visual inspection is the most common method to decide the most appropriate number ...
Superior ab initio identification, annotation and characterisation of TEs and segmental duplications from genome assemblies
Transposable Elements (TEs) are mobile DNA sequences that make up significant fractions of amniote genomes. However, they are difficult to detect and annotate ab initio because of their variable features, lengths and clade-specific variants. We have addressed this problem by refining and developing a Comprehensive ab initio Repeat Pipeline (CARP) to identify and cluster TEs and other repetitive...
Identifying Coevolving Partners from Paralogous Gene Families
Many methods have been developed to detect coevolution from aligned sequences. However, all the existing methods require a one-to-one mapping of candidate coevolving partners (nucleotides, amino acids) a priori. When two families of sequences have distinct duplication and loss histories, finding the one-to-one mapping of coevolving partners can be computationally involved. We propose an algorit...
Analysis and Professional Designing of COBRA (Computationally Optimized Broadly Reactive Antigen) Vaccine for Bm86 midgut Protein of R. microplus and R. annulatus Ticks
Introduction: The cattle tick Rhipicephalus spp. causes significant economic losses through diseases in animals and humans. Bm86 is a midgut protein and vaccine candidate whose sequences vary among geographically separated isolates of Rhipicephalus spp.; this variability is the main reason for the reduced effectiveness, and subsequent failure, of recombinant vaccines. Method: In this bi...