Semi-Supervised SimHash for Efficient Document Similarity Search

Authors

Qixia Jiang

Maosong Sun

Abstract

Searching for documents that are similar to a query document is an important component of modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method against several state-of-the-art methods on two large datasets. All the results show that our method achieves the best performance.
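The role of feature weights in a SimHash-style code can be illustrated with a minimal sketch. It assumes sign-of-random-projection hashing and a hand-set weight vector; it does not implement the weight-learning step the abstract refers to, and the names `simhash_codes`, `hamming`, `weights`, and `n_bits` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def simhash_codes(X, weights, n_bits=32, seed=0):
    """Return one boolean hash code per row of X.

    Each feature is scaled by its weight (assumed given here, not learned),
    then the row is projected onto n_bits random Gaussian directions and
    only the sign of each projection is kept.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return ((X * weights) @ planes) >= 0


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return int(np.count_nonzero(a != b))


# Two toy documents that agree on every feature except the last one.
docs = np.array([[1.0, 0.0, 3.0, 5.0],
                 [1.0, 0.0, 3.0, -7.0]])

uniform = np.ones(4)                            # every feature counts equally
downweighted = np.array([1.0, 1.0, 1.0, 0.0])   # last feature is ignored

d_uniform = hamming(*simhash_codes(docs, uniform))
d_downweighted = hamming(*simhash_codes(docs, downweighted))
```

With the last feature given zero weight, the two documents become identical after reweighting, so their codes coincide and `d_downweighted` is 0. A learned weight vector would play the same role, moving documents known to be similar closer together before hashing.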

Similar resources

In Defense of Minhash over Simhash

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHa...

Feature-based approach to semi-supervised similarity learning

For the management of digital document collections, automatic database analysis still has difficulties to deal with semantic queries and abstract concepts that users are looking for. Whenever interactive learning strategies may improve the results of the search, system performances still depend on the representation of the document collection. We introduce in this paper a weakly supervised opti...

Improved Nearest Neighbor Methods For Text Classification

We present new nearest neighbor methods for text classification and an evaluation of these methods against the existing nearest neighbor methods as well as other well-known text classification algorithms. Inspired by the language modeling approach to information retrieval, we show improvements in k-nearest neighbor (kNN) classification by replacing the classical cosine similarity with a KL dive...

Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions

Automatic Annotation Techniques for Supervised and Semi-supervised Query-focused Summarization

In this paper, we study one semi-supervised and several supervised methods for extractive query-focused multi-document summarization. Traditional approaches to multidocument summarization are either unsupervised or supervised. The unsupervised approaches use heuristic rules to select the most important sentences, which are hard to generalize. On the other hand, huge amount of annotated data is ...

Publication date: 2011
