Session 8: Statistical Language Modeling
Abstract
Over the past several years, statistical techniques have been applied successfully to more and more of written language technology, proceeding over time from the periphery of written language processing into progressively deeper aspects of language processing. At the periphery of natural language understanding, Hidden Markov Models (HMMs) were first applied over ten years ago to the problem of determining part of speech (POS). HMM POS taggers have yielded quite good results for many tasks (96%+ correct on a per-word basis) and have been widely used in written language systems for the last several years. A little closer in from the periphery, extensions to probabilistic context-free grammar (PCFG) parsing methods have greatly increased the accuracy of probabilistic parsing within the last several years; these methods condition the probabilities of standard CFG rules on aspects of extended linguistic context. Just within the last year or two, we have begun to see the first applications of statistical methods to the problem of word sense determination and lexical semantics. It is worth noting that a majority of these techniques were first presented within this series of Workshops sponsored by ARPA.
Related resources
Session 4: Statistical Language Modeling
Corpus-based Natural Language Processing (NLP) is now a well-established paradigm in NLP. The availability of large corpora, often annotated in various ways, has led to the development of a variety of approaches to statistical language modeling. The papers in this session represent many of these important approaches. I will try to classify these papers along different dimensions, thus providing t...
Dynamic Web log session identification with statistical language models
on statistical language modeling. Unlike standard timeout methods, which use fixed time thresholds for session identification, we use an information theoretic approach that yields more robust results for identifying session boundaries. We evaluate our new approach by learning interesting association rules from the segmented session files. We then compare the performance of our approach to three...
Session 2: Language Modeling
This session presented four interesting papers on statistical language modeling aimed at improving large-vocabulary speech recognition. The basic problem in language modeling is to derive accurate underlying representations from a large amount of training data, a problem it shares with acoustic modeling. As demonstrated in this session, many techniques used for acoustic mode...
Session 11 - Natural Language III
The five papers in this session, as well as the ten papers in the other two natural language sessions, can be classified into three broad categories: (1) statistical approaches to natural language processing and the automatic acquisition of linguistic structure (2 out of 5 papers in this session; 8 out of 15 overall); (2) robust processing of texts by combining multiple partial analyses (2 out ...
A Tool for the Automatic Insertion of Diacritics in French (Zodiac : Insertion automatique des signes diacritiques du français) [in French]
In this demo session, we propose to show how the software module Zodiac works. It allows the automatic insertion of diacritical marks (accents, cedillas, etc.) in text written in French. Zodiac is implemented as a Microsoft Word add-in under Windows, allowing automatic corrections as the user is typing. Under Linux and Mac OS X, it is implemented as a command-line utility, lending itself natura...