Information Theoretic Text Classification Using the Ziv-Merhav Method
Authors
Abstract
Most approaches to text classification rely on some measure of (dis)similarity between sequences of symbols. Information theoretic measures have the advantage of making very few assumptions about the models considered to have generated the sequences, and have been the focus of recent interest. This paper addresses the use of the Ziv-Merhav method (ZMM) for estimating relative entropy (Kullback-Leibler divergence) from sequences of symbols as a tool for text classification. We describe an implementation of the ZMM based on a modified version of the Lempel-Ziv algorithm (LZ77). An assessment on synthetic Markov sequences shows that the ZMM yields good estimates of the Kullback-Leibler divergence. Finally, we apply the method to a text classification problem (more specifically, authorship attribution), outperforming a previously proposed, also information theoretic, method.
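To make the idea concrete, the sketch below illustrates one common formulation of the Ziv-Merhav estimate, D(z||x) ~ (c(z|x) log n - c(z) log c(z)) / n, where c(z|x) is the number of phrases in a sequential cross-parsing of z with respect to x and c(z) counts the phrases in a self-parsing of z. This Python code is an illustrative assumption, not the implementation described in the paper: it uses a naive O(n^2) substring search and an LZ78-style self-parsing rather than the modified LZ77 algorithm mentioned in the abstract, and all function names are hypothetical.

import math

def cross_parse_count(z, x):
    # Count phrases in a sequential parsing of z with respect to x.
    # Under one common reading of the rule, each phrase is the longest
    # prefix of the unparsed part of z that occurs as a substring of x
    # (a single unmatched symbol also counts as a phrase).
    c, i = 0, 0
    while i < len(z):
        j = i + 1
        while j <= len(z) and z[i:j] in x:
            j += 1
        # z[i:j-1] was the longest match; consume at least one symbol
        i = max(j - 1, i + 1)
        c += 1
    return c

def lz78_count(z):
    # Number of phrases in the incremental (LZ78) parsing of z:
    # each phrase is the shortest prefix not seen before.
    phrases, start = set(), 0
    for end in range(1, len(z) + 1):
        if z[start:end] not in phrases:
            phrases.add(z[start:end])
            start = end
    return len(phrases) + (1 if start < len(z) else 0)

def zm_divergence(z, x):
    # Ziv-Merhav style estimate of D(z || x) in bits per symbol.
    n = len(z)
    czx = cross_parse_count(z, x)
    cz = lz78_count(z)
    return (czx * math.log2(n) - cz * math.log2(cz)) / n

For authorship attribution along the lines described in the abstract, one would compute zm_divergence(unknown_text, reference_text) against a reference text from each candidate author and attribute the unknown text to the author yielding the smallest estimate.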
Related papers
A measure of relative entropy between individual sequences with application to universal classification
A new notion of empirical informational divergence (relative entropy) between two individual sequences is introduced. If the two sequences are independent realizations of two finite-order, finite-alphabet, stationary Markov processes, the empirical relative entropy converges to the relative entropy almost surely. This new empirical divergence is based on a version of the Lempel-Ziv data compress...
Universal Prediction of Individual Sequences
Unlike standard statistical approaches to forecasting, prediction of individual sequences does not impose any probabilistic assumption on the data-generating mechanism. Yet, prediction algorithms can be constructed that work well for all possible sequences, in the sense that their performance is always nearly as good as the best forecasting strategy in a given reference class. In this report, t...
String kernels and similarity measures for information retrieval
Measuring a similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Lately, kernel-based methods have been proposed for this task, both for text and biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretical approaches have also bee...
Estimating the number of states of a finite-state source
The problem of estimating the number of states of a finite-alphabet, finite-state source is investigated. An estimator is developed that asymptotically attains the minimum probability of underestimating the number of states, among all estimators with a prescribed exponential decay rate of overestimation probability. The proposed estimator relies on the Lempel-Ziv data compression algorithm in a...