A Pylonic Decision-Tree Language Model with Optimal Question Selection

Author

  • Adrian Corduneanu
Abstract

This paper discusses a decision-tree approach to the problem of assigning probabilities to words following a given text. In contrast with previous decision-tree language model attempts, an algorithm for selecting nearly optimal questions is considered. The model is to be tested on a standard task, The Wall Street Journal, allowing a fair comparison with the well-known trigram model.

1 Introduction

In many applications such as automatic speech recognition, machine translation, spelling correction, etc., a statistical language model (LM) is needed to assign probabilities to sentences. This probability assignment may be used, e.g., to choose one of many transcriptions hypothesized by the recognizer or to make decisions about capitalization. Without any loss of generality, we consider models that operate left-to-right on the sentences, assigning a probability to the next word given its word history. Specifically, we consider statistical LM's which compute probabilities of the type P{wn | w1, w2, ..., wn−1}, where wi denotes the i-th word in the text.

Even for a small vocabulary, the space of word histories is so large that any attempt to estimate the conditional probabilities for each distinct history from raw frequencies is infeasible. To make the problem manageable, one partitions the word histories into some classes C(w1, w2, ..., wn−1), and identifies the word probabilities with P{wn | C(w1, w2, ..., wn−1)}. Such probabilities are easier to estimate, as each class gets significantly more counts from a training corpus. With this setup, building a language model becomes a classification problem: group the word histories into a small number of classes while preserving their predictive power. Currently popular N-gram models classify the word histories by their last N−1 words. N varies from 2 to 4, and the trigram model P{wn | wn−2, wn−1} is commonly used.
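The trigram classification of histories described above can be sketched as follows. This is a minimal illustration with made-up training sentences, not the Wall Street Journal setup of the paper; the function and variable names are mine.

```python
from collections import defaultdict

def train_trigram(corpus):
    """Estimate P{wn | wn-2, wn-1} from raw trigram counts.

    Every history w1..wn-1 is classified by its last two words,
    so all histories sharing that bigram pool their counts.
    """
    bigram_counts = defaultdict(int)
    trigram_counts = defaultdict(int)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            bigram_counts[(words[i - 2], words[i - 1])] += 1
            trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1

    def prob(w, w2, w1):
        denom = bigram_counts[(w2, w1)]
        return trigram_counts[(w2, w1, w)] / denom if denom else 0.0

    return prob

prob = train_trigram([["the", "cat", "sat"], ["the", "cat", "ran"]])
print(prob("sat", "the", "cat"))  # 0.5: both continuations follow "the cat" equally often
```

Raw relative frequencies like these are exactly the estimates that become unreliable when the classes are too fine, which motivates the decision-tree classification developed next.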
Although these simple models perform surprisingly well, there is much room for improvement. The approach used in this paper is to classify the histories by means of a decision tree: to cluster word histories w1, w2, ..., wn−1 for which the distributions of the following word wn in a training corpus are similar. The decision tree is pylonic in the sense that histories at different nodes in the tree may be recombined in a new node to increase the complexity of questions and avoid data fragmentation. The method has been tried before (Bahl et al., 1989) and had promising results. In the work presented here we made two major changes to the previous attempts: we have used an optimal tree-growing algorithm (Chou, 1991) not known at the time of publication of (Bahl et al., 1989), and we have replaced the ad-hoc clustering of vocabulary items used by Bahl with a data-driven clustering scheme proposed in (Lucassen and Mercer, 1984).

2 Description of the Model

2.1 The Decision-Tree Classifier

The purpose of the decision-tree classifier is to cluster the word history w1, w2, ..., wn−1 into a manageable number of classes Ci, and to estimate for each class the next-word conditional distribution P{wn | Ci}. The classifier, together with the collection of conditional probabilities, is the resultant LM. The general methodology of decision-tree construction is well known (e.g., see (Jelinek, 1998)). The following issues need to be addressed for our specific application:

• A tree-growing criterion, often called the measure of purity;
• A set of permitted questions (partitions) to be considered at each node;
• A stopping rule, which decides the number of distinct classes.

These are discussed below. Once the tree has been grown, we address one other issue: the estimation of the language model at each leaf of the resulting tree classifier.
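A decision-tree history classifier of this kind can be sketched as below. This is an assumed minimal structure of my own, not the paper's implementation: internal nodes ask set-membership questions about a position in the history, and leaves carry the class index used to look up P{wn | Ci}.

```python
class Question:
    """Ask whether the word at a given history position lies in a word set."""
    def __init__(self, position, word_set):
        self.position = position            # e.g. -1 = last word of the history
        self.word_set = frozenset(word_set)

    def ask(self, history):
        return history[self.position] in self.word_set

class Node:
    """A tree node; a leaf has class_id set, an internal node has a question."""
    def __init__(self, question=None, yes=None, no=None, class_id=None):
        self.question, self.yes, self.no, self.class_id = question, yes, no, class_id

def classify(node, history):
    """Walk from the root to a leaf and return its history class."""
    while node.class_id is None:
        node = node.yes if node.question.ask(history) else node.no
    return node.class_id

# Toy tree: split histories on whether the last word is a determiner.
root = Node(question=Question(-1, {"the", "a", "an"}),
            yes=Node(class_id=0), no=Node(class_id=1))
print(classify(root, ["he", "saw", "the"]))  # 0
```

In the pylonic variant, children of different nodes may point to a shared node, recombining histories so that complex questions can be asked without fragmenting the training counts; the plain-tree sketch above omits that sharing.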
2.1.1 The Tree-Growing Criterion

We view the training corpus as a set of ordered pairs of the following word wn and its word history (w1, w2, ..., wn−1). We seek a classification of the space of all histories (not just those seen in the corpus) such that a good conditional probability P{wn | C(w1, w2, ..., wn−1)} can be estimated for each class of histories. Since several vocabulary items may potentially follow any history, perfect "classification" or prediction of the word that follows a history is out of the question, and the classifier must partition the space of all word histories, maximizing the probability P{wn | C(w1, w2, ..., wn−1)} assigned to the pairs in the corpus. We seek a history classification such that C(w1, w2, ..., wn−1) is as informative as possible about the distribution of the next word. Thus, from an information-theoretical point of view, a natural cost function for choosing questions is the empirical conditional entropy of the training data with respect to the tree:

H = − Σ_i Σ_w f(w, C_i) log f(w | C_i)

where f(w, C_i) is the relative frequency of word w following a history in class C_i, and f(w | C_i) is the corresponding conditional relative frequency.
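The entropy criterion above can be computed directly from joint counts of (class, next word) events. A minimal sketch, with function names of my own and entropy measured in bits (the paper does not fix the base of the logarithm):

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """Empirical H = -sum_i sum_w f(w, Ci) log f(w | Ci),
    where pairs is a list of (class_id, next_word) events from the corpus."""
    n = len(pairs)
    joint = Counter(pairs)                   # counts of (Ci, w)
    marginal = Counter(c for c, _ in pairs)  # counts of Ci
    h = 0.0
    for (c, w), k in joint.items():
        f_joint = k / n           # f(w, Ci)
        f_cond = k / marginal[c]  # f(w | Ci)
        h -= f_joint * math.log2(f_cond)
    return h

# A perfectly predictive split has zero entropy; an uninformative one does not.
print(conditional_entropy([(0, "cat"), (0, "cat"), (1, "dog")]))  # 0.0
print(conditional_entropy([(0, "cat"), (0, "dog")]))              # 1.0
```

A tree-growing step would evaluate this quantity for each candidate question and keep the question that lowers H the most.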




Publication date: 1999