Centering Similarity Measures to Reduce Hubs
Abstract
The performance of nearest neighbor methods is degraded by the presence of hubs, i.e., objects in the dataset that are similar to many other objects. In this paper, we show that the classical method of centering, the transformation that shifts the origin of the space to the data centroid, provides an effective way to reduce hubs. We show analytically why hubs emerge and why they are suppressed by centering, under a simple probabilistic model of data. To further reduce hubs, we also move the origin more aggressively towards hubs, through weighted centering. Our experimental results show that (weighted) centering is effective for natural language data; it improves the performance of the k-nearest neighbor classifiers considerably in word sense disambiguation and document classification tasks.
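The centering step described in the abstract can be illustrated with a minimal sketch. It assumes inner-product similarity and uses random non-negative vectors as stand-in data. The helper names `center`, `k_occurrence`, and `skewness`, the `weights` parameter, and the choice of k = 10 are illustrative assumptions, not taken from the paper. In particular, the paper's actual weighting scheme for weighted centering is not given in the abstract, so the weights here are left to the caller.

```python
import numpy as np

def center(X, weights=None):
    """Shift the origin to the (optionally weighted) centroid of the rows of X."""
    X = np.asarray(X, dtype=float)
    if weights is None:
        centroid = X.mean(axis=0)
    else:
        # Hypothetical weighting: any non-negative weights supplied by the caller.
        w = np.asarray(weights, dtype=float)
        centroid = (w / w.sum()) @ X
    return X - centroid

def k_occurrence(S, k):
    """Count how often each object appears in the other objects' top-k lists."""
    S = np.array(S, dtype=float)      # copy, so the caller's matrix is untouched
    np.fill_diagonal(S, -np.inf)      # an object is not its own neighbor
    top_k = np.argsort(-S, axis=1)[:, :k]
    return np.bincount(top_k.ravel(), minlength=S.shape[0])

def skewness(counts):
    """Sample skewness of the k-occurrence counts; large values suggest hubs."""
    c = counts - counts.mean()
    return (c ** 3).mean() / (c ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
X = rng.random((200, 50))             # stand-in data, not from the paper

S_raw = X @ X.T                       # inner-product similarity, origin unchanged
Xc = center(X)
S_cen = Xc @ Xc.T                     # same similarity after centering

n_raw = k_occurrence(S_raw, 10)
n_cen = k_occurrence(S_cen, 10)
print("skewness before centering:", skewness(n_raw))
print("skewness after centering: ", skewness(n_cen))
```

Comparing the two skewness values on real data gives a rough indication of whether centering reduces hubs for a given similarity measure. Passing per-object weights to `center` moves the origin toward whichever objects receive more weight.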
Similar resources
Combining Features Reduces Hubness in Audio Similarity
In audio based music similarity, a well known effect is the existence of hubs, i.e. songs which appear similar to many other songs without showing any meaningful perceptual similarity. We verify that this effect also exists in very large databases (> 250000 songs) and that it even gets worse with growing size of databases. By combining different aspects of audio similarity we are able to reduce...
A scale-free distribution of false positives for a large class of audio similarity measures
The “bag of frames” approach (BOF) to audio pattern recognition models signals as the long-term statistical distribution of their local spectral features; a prototypical implementation is Gaussian Mixture Models of Mel-Frequency Cepstrum Coefficients. This approach is the predominant paradigm for extracting high-level descriptions from music signals, such as their instrument, genre...
SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS
In this work, we present some novel methods for measuring the similarity between picture fuzzy sets. First, we adopt the concepts of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Second, we develop some similarity measures between picture fuzzy sets, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of MTS analysis. Much of the research in this context has focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
HESITANT FUZZY INFORMATION MEASURES DERIVED FROM T-NORMS AND S-NORMS
In this contribution, we first introduce the concept of a metrical T-norm-based similarity measure for hesitant fuzzy sets (HFSs) by using the concept of a T-norm-based distance measure. Then, the relationship of the proposed metrical T-norm-based similarity measures with the other kind of information measure, called the metrical T-norm-based entropy measure, is discussed. The main feature ...