A Cryptographic Approach to Language Identification: PPM
نویسنده
چکیده
The problem of language discrimination may arise in situations when many texts belonging to different source languages are at hand but we are not sure to which language each belongs to. This might usually be the case during information retrieval via Internet. We propose a cryptographic solution to the language identification problem: Employing the Prediction by Partial Matching (PPM) model, we generate a language model and then use this model to discriminate languages. PPM is a cryptographic tool based on an adaptive statistical model. It yields compression rates (measured in bits per character –bpc) to far better levels than that of many other conventional lossless compression tools. Language identification experiment results obtained on sample texts from five different languages as English, French, Turkish, German and Spanish Corpora are given. The rate of success yielded that the performance of the system is highly dependent on the diversity, as well as the target text and training text file sizes. The results also indicate that the PPM model is highly sensitive to input language. In cryptographic aspect, if the training text itself is kept secret, our language identification system would provide security to promising degrees.
منابع مشابه
مقایسه روش های طیفی برای شناسایی زبان گفتاری
Identifying spoken language automatically is to identify a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
متن کاملSecure Bio-Cryptographic Authentication System for Cardless Automated Teller Machines
Security is a vital issue in the usage of Automated Teller Machine (ATM) for cash, cashless and many off the counter banking transactions. Weaknesses in the use of ATM machine could not only lead to loss of customer’s data confidentiality and integrity but also breach in the verification of user’s authentication. Several challenges are associated with the use of ATM smart card such as: card clo...
متن کاملNative Language Identification with PPM
This paper reports on our work in the NLI shared task 2013 on Native Language Identification. The task is to automatically detect the native language of the TOEFL essays authors in a set of given test documents in English. The task was solved by a system that used the PPM compression algorithm based on an n-gram statistical model. We submitted four runs; word-based PPMC algorithm with normaliza...
متن کاملDifferential Power Analysis: A Serious Threat to FPGA Security
Differential Power Analysis (DPA) implies measuring the supply current of a cipher-circuit in an attempt to uncover part of a cipher key. Cryptographic security gets compromised if the current waveforms obtained correlate with those from a hypothetical power model of the circuit. As FPGAs are becoming integral parts of embedded systems and increasingly popular for cryptographic applications and...
متن کاملToken Identification Using HMM and PPM Models
Hidden markov models (HMMs) and prediction by partial matching models (PPM) have been successfully used in language processing tasks including learning-based token identification. Most of the existing systems are domainand language-dependent. The power of retargetability and applicability of these systems is limited. This paper investigates the effect of the combination of HMMs and PPM on token...
متن کامل