Indexing DNA Sequences Using q-Grams
Abstract
Recent years have seen growing interest in similarity search over large collections of biological sequences. This paper presents a method for efficiently indexing DNA sequences based on q-grams, which facilitates similarity search in a DNA database and avoids a linear scan of the entire database. A two-level index, consisting of a hash table and c-trees, is proposed based on the q-grams of DNA sequences. The proposed data structures allow quick detection of sequences within a certain distance of the query sequence. Experimental results show that the method detects similar regions in a DNA sequence database efficiently and with high sensitivity.
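As a rough illustration of the general idea of q-gram indexing, the sketch below builds a hash table from each q-gram to the positions where it occurs and uses it to shortlist sequences that share many q-grams with a query. It is a minimal sketch under stated assumptions, not the method of the paper: the c-tree level is not described in the abstract and is omitted, and the function names (`qgrams`, `build_index`, `candidates`), the toy sequences, and the `min_shared` threshold are all hypothetical.

```python
from collections import defaultdict


def qgrams(seq, q):
    """Yield (offset, q-gram) for every overlapping window of length q."""
    for i in range(len(seq) - q + 1):
        yield i, seq[i:i + q]


def build_index(sequences, q):
    """Hash table: q-gram -> list of (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos, gram in qgrams(seq, q):
            index[gram].append((seq_id, pos))
    return index


def candidates(index, query, q, min_shared):
    """Count q-gram hits per sequence; keep those with at least min_shared.

    This is only a coarse filter (an assumption of this sketch); a real
    system would verify each candidate with an alignment or distance step.
    """
    hits = defaultdict(int)
    for _, gram in qgrams(query, q):
        for seq_id, _ in index.get(gram, ()):
            hits[seq_id] += 1
    return {seq_id: n for seq_id, n in hits.items() if n >= min_shared}


# Toy database (hypothetical data, for illustration only).
database = {"s1": "ACGTACGTGA", "s2": "TTTTGGGGCC", "s3": "ACGTTCGTGA"}
index = build_index(database, q=3)
result = candidates(index, "ACGTACGA", q=3, min_shared=5)
print(result)
```

Only sequences that share q-grams with the query are ever touched, which is what lets an index of this kind avoid scanning the whole database.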
Similar resources
A comparison of sub-word indexing methods for information retrieval
This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams...
mSQL: SQL Extensions and Database Mechanisms for Managing Biosequences
mSQL is an extended SQL query language targeting the expanding area of biological sequence databases and sequence analysis methods. The core aspects include first-class data types for biological sequences, operators based on an extended-relational algebra, an ability to define logical views of sequences as overlapping q-grams and the materialization of those views as metric-space indices. We fi...
Improving KNN Arabic Text Classification with N-Grams Based Document Indexing
Text classification is the task of assigning a document to one or more predefined categories based on its contents. This paper presents the results of classifying Arabic-language documents with the KNN classifier, once using N-grams (namely unigrams and bigrams) for document indexing, and once using the traditional single-term indexing method (bag of words), which supposes...
A Online Appendix to: Analysis and Optimization for Boolean Expression Indexing
String tokenization using q-grams maps the string into a high-dimensional vector space model, in which the domain of each dimension is binary. The size of this space is exponential in the length of the q-grams. For instance, q-grams of size three result in a space with 26^3 dimensions. The vector space model representation of a tokenized string (e.g., {‘str’, ‘tri’, ‘rin’, ‘ing’}) can be expressed b...
Efficient In-memory Data Structures for n-grams Indexing
Indexing n-gram phrases from text has many practical applications, such as plagiarism detection, comparison of DNA sequences, or spam detection. In this paper we describe several data structures, such as the hash table and the B+ tree, that can store n-grams for searching. We perform tests that show their advantages and disadvantages. One neglected data structure for this purpose, the ternary search tree, is deep...