Clustering of highly homologous sequences to reduce the size of large protein databases

نویسندگان

  • Weizhong Li
  • Lukasz Jaroszewski
  • Adam Godzik
چکیده

We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Designing Of Degenerate Primers-Based Polymerase Chain Reaction (PCR) For Amplification Of WD40 Repeat-Containing Proteins Using Local Allignment Search Method

Degenerate primers-based polymerase chain reaction (PCR) are commonly used for isolation of unidentified gene sequences in related organisms. For designing the degenerate primers, we propose the use of local alignment search method for searching the conserved regions long enough to design an acceptable primer pair. To test this method, a WD40 repeat-containing domain protein from Beauveria bass...

متن کامل

A graph-based clustering method applied to protein sequences

The number of amino acid sequences is increasing very rapidly in the protein databases like Swiss-Prot, Uniprot, PIR and others, but the structure of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques...

متن کامل

Signal processing approaches as novel tools for the clustering of N-acetyl-β-D-glucosaminidases

Nowadays, the clustering of proteins and enzymes in particular, are one of the most popular topics in bioinformatics. Increasing number of chitinase genes from different organisms and their sequences have beenidentified. So far, various mathematical algorithms for the clustering of chitinase genes have been used butmost of them seem to be confusing and sometimes insufficient. In the...

متن کامل

Fuzzy clustering of time series data: A particle swarm optimization approach

With rapid development in information gathering technologies and access to large amounts of data, we always require methods for data analyzing and extracting useful information from large raw dataset and data mining is an important method for solving this problem. Clustering analysis as the most commonly used function of data mining, has attracted many researchers in computer science. Because o...

متن کامل

Order-Sensitive Clustering for Remote Homologous Protein Detection

Traditional sequence alignment methods are effective in identifying homologous proteins that are highly similar. However, these approaches do not perform well for remote homologous proteins, that is, proteins whose 3D structures are similar but their sequences are not. Recent biological research reveals that protein sequences contain residues that determine the 3D structure of proteins. In this...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Bioinformatics

دوره 17 3  شماره 

صفحات  -

تاریخ انتشار 2001