Centering Similarity Measures to Reduce Hubs

نویسندگان

  • Ikumi Suzuki
  • Kazuo Hara
  • Masashi Shimbo
  • Marco Saerens
  • Kenji Fukumizu
چکیده

The performance of nearest neighbor methods is degraded by the presence of hubs, i.e., objects in the dataset that are similar to many other objects. In this paper, we show that the classical method of centering, the transformation that shifts the origin of the space to the data centroid, provides an effective way to reduce hubs. We show analytically why hubs emerge and why they are suppressed by centering, under a simple probabilistic model of data. To further reduce hubs, we also move the origin more aggressively towards hubs, through weighted centering. Our experimental results show that (weighted) centering is effective for natural language data; it improves the performance of the k-nearest neighbor classifiers considerably in word sense disambiguation and document classification tasks.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Combining Features Reduces Hubness in Audio Similarity

In audio based music similarity, a well known effect is the existence of hubs, i.e. songs which appear similar to many other songs without showing any meaningful perceptual similarity. We verify that this effect also exists in very large databases (> 250000 songs) and that it even gets worse with growing size of databases. By combining different aspects of audio similarity we are able to reduce...

متن کامل

A scale-free distribution of false positives for a large class of audio similarity measures

The “bag of frames” approach (BOF) to audio pattern recognition models signals as the long-term statistical distribution of their local spectral features, a prototypical implementation of which being Gaussian Mixture Models of Mel-Frequency Cepstrum Coefficients. This approach is the most predominent paradigm to extract high-level descriptions from music signals, such as their instrument, genre...

متن کامل

SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS

In this work, we shall present some novel process to measure the similarity between picture fuzzy sets. Firstly, we adopt the concept of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as, cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...

متن کامل

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of MTS analyzing process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

متن کامل

HESITANT FUZZY INFORMATION MEASURES DERIVED FROM T-NORMS AND S-NORMS

In this contribution, we first introduce the concept of metrical T-norm-based similarity measure for hesitant fuzzy sets (HFSs) {by using the concept of T-norm-based distance measure}. Then,the relationship of the proposed {metrical T-norm-based} similarity {measures} with the {other kind of information measure, called the metrical T-norm-based} entropy measure {is} discussed. The main feature ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013