Searching Protein 3-D Structures in Linear Time

Author

  • Tetsuo Shibuya
Abstract

One of the most important issues in post-genomic molecular biology is the analysis of protein three-dimensional (3-D) structures, and searching 3-D structure databases is becoming increasingly important. The root mean square deviation (RMSD) is the most popular similarity measure for comparing two molecular structures. In this article, we propose new theoretically and practically fast algorithms for the basic problem of finding all substructures of the structures in a database of chain molecules (such as proteins) whose RMSDs to the query are within a given constant threshold. The best-known worst-case time complexity for the problem is O(N log m), where N is the database size and m is the query size. The previous best-known expected time complexity for the problem is also O(N log m). We propose a new breakthrough linear-expected-time algorithm. It is not only a theoretically significant improvement over previous algorithms but also faster in practice, according to computational experiments. Our experiments over the whole Protein Data Bank (PDB) show that our algorithm is 3.6 to 28 times faster than previously known algorithms when searching for similar substructures whose RMSDs are within 1 Å of queries of ordinary lengths. We also propose a series of preprocessing algorithms that enable faster queries, although no indexing algorithm has been known whose query time complexity is better than the above O(N log m) bound. One is an O(N log^2 N)-time and O(N log N)-space preprocessing algorithm with expected query time complexity of O(m + N/√m). Another is an O(N log N)-time and O(N)-space preprocessing algorithm with expected query time complexity of O(N/√m + m log(N/m)).
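To make the search problem in the abstract concrete, the following is a minimal Python sketch of the naive baseline: slide a window of query length m along a chain and report every offset whose RMSD to the query is within the threshold. It is not the algorithm proposed in the article; it is the straightforward O(Nm) approach that faster methods improve on. It assumes RMSD is taken after optimal rigid superposition (translation plus rotation) and evaluates that minimum with the standard Kabsch singular-value identity. The names `rmsd` and `naive_search` are hypothetical and chosen only for illustration.

```python
import math

def _det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def _sym_eigvals(a):
    """Eigenvalues of a symmetric 3x3 matrix, in descending order
    (closed-form trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    r = max(-1.0, min(1.0, _det3(b) / 2.0))  # clamp against round-off
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]

def rmsd(P, Q):
    """RMSD between two equal-length 3-D point lists after optimal
    translation and rotation (assumption: this is the RMSD variant meant)."""
    m = len(P)
    cp = [sum(p[k] for p in P) / m for k in range(3)]
    cq = [sum(q[k] for q in Q) / m for k in range(3)]
    Pc = [[p[k] - cp[k] for k in range(3)] for p in P]
    Qc = [[q[k] - cq[k] for k in range(3)] for q in Q]
    # Cross-covariance H and the symmetric matrix H^T H.
    H = [[sum(p[i] * q[j] for p, q in zip(Pc, Qc)) for j in range(3)]
         for i in range(3)]
    A = [[sum(H[k][i] * H[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    # Singular values of H are square roots of the eigenvalues of H^T H.
    s = [math.sqrt(max(0.0, lam)) for lam in _sym_eigvals(A)]
    sign = -1.0 if _det3(H) < 0.0 else 1.0  # exclude reflections
    g = sum(x * x for v in Pc + Qc for x in v)
    return math.sqrt(max(0.0, (g - 2.0 * (s[0] + s[1] + sign * s[2])) / m))

def naive_search(chain, query, threshold):
    """Return (offset, rmsd) for every window of len(query) in chain
    whose RMSD to the query is at most threshold. O(N*m) time."""
    m = len(query)
    hits = []
    for i in range(len(chain) - m + 1):
        d = rmsd(chain[i:i + m], query)
        if d <= threshold:
            hits.append((i, d))
    return hits
```

Calling `naive_search(chain, query, 1.0)` on lists of (x, y, z) tuples in ångströms would correspond to the 1 Å threshold used in the experiments. Every window costs O(m) work, which is where the extra factor of m over a linear-time search comes from.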


Similar resources

Searching Protein 3-D Structures in Faster Than Linear Time

Searching for similar structures from a 3-D structure database of proteins is one of the most important problems in post-genomic computational biology. To compare two structures, we ordinarily use a measure called the RMSD (root mean square deviation) as the similarity measure. We consider a very fundamental problem of finding all the substructures whose RMSDs to the query are within some given...


Efficient Dynamic Range Searching Using Data Replication

Given the lower bound of Ω(n^((d-1)/d)) for range query time complexity on n d-dimensional point data, we investigate whether a little replication can improve the query and update times significantly. We propose linear-space index structures that minimize the query and update times; the query time we achieve is O(n^ε) for any ε > 0, and the update time is O(log n).


Colored Range Searching on Internal Memory

Recent advances in various application fields, like GIS, finance and others, have led to a large increase in both the volume and the characteristics of the data being collected. Hence, general range queries on these datasets are not sufficient to obtain good insights and useful information from the data. This leads to the need for more sophisticated queries and hence novel data structure...


The Analysis of a Probabilistic Approach to Nearest Neighbor Searching

Given a set S of n data points in some metric space and a query point q in this space, a nearest neighbor query asks for the nearest point of S to q. Throughout we will assume that the space is real d-dimensional space R^d, and the metric is Euclidean distance. The goal is to preprocess S into a data structure so that such queries can be answered efficiently. Nearest neighbor searching has ap...


Optimal Dynamic Range Searching in Non-replicating Index Structures

We consider the problem of dynamic range searching in tree structures that do not replicate data. We propose a new dynamic structure, called the O-tree, that achieves a query time complexity of O(n^((d-1)/d)) on n d-dimensional points and an amortized insertion/deletion time complexity of O(log n). We show that this structure is optimal when data is not replicated. In addition to optimal query and ...



Journal title:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 17, Issue 3

Pages: -

Publication date: 2009