A bootstrap technique for building domain-dependent language models

Authors

  • Ganesh N. Ramaswamy
  • Harry Printz
  • Ponani S. Gopalakrishnan
Abstract

In this paper, we propose a new bootstrap technique for building domain-dependent language models. We assume that a seed corpus, consisting of a small amount of data relevant to the new domain, is available and is used to build a reference language model. We also assume the availability of an external corpus consisting of a large amount of data from various sources, which need not be directly relevant to the domain of interest. We use the reference language model and a suitable metric, such as perplexity, to select sentences from the external corpus that are relevant to the domain. Once a sufficient number of new sentences has been selected, we rebuild the reference language model. We then select additional sentences from the external corpus, and the process iterates until a satisfactory termination point is reached. We also describe several methods to further enhance the bootstrap technique, such as combining it with mixture modeling and class-based modeling. The performance of the proposed approach was evaluated through a set of experiments, and the results are discussed. The convergence properties of the approach and the conditions that the external corpus and the seed corpus must satisfy are highlighted, but detailed work on these issues is deferred to future work.
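The selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one-smoothed unigram model, the fixed perplexity threshold, the stopping rule, the function names, and the toy sentences are all assumptions chosen to keep the example short. Only the overall shape (build a reference model from the seed, select external sentences by perplexity, rebuild, repeat) comes from the abstract.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build unigram counts; a simple stand-in for the reference language model."""
    counts = Counter(word for s in sentences for word in s.split())
    return counts, sum(counts.values())


def perplexity(model, sentence, vocab_size):
    """Per-word perplexity of a sentence under an add-one-smoothed unigram model."""
    counts, total = model
    words = sentence.split()
    if not words:
        return float("inf")
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / len(words))


def bootstrap_select(seed, external, threshold, max_iters=10):
    """Iteratively select external sentences whose perplexity is at most threshold.

    The threshold and the stopping rule (no new sentences, or max_iters reached)
    are assumptions; the abstract only asks for "some satisfactory termination point".
    """
    vocab_size = len({w for s in list(seed) + list(external) for w in s.split()})
    selected = []
    remaining = list(external)
    for _ in range(max_iters):
        # Rebuild the reference model from the seed plus everything selected so far.
        model = train_unigram(list(seed) + selected)
        newly = [s for s in remaining
                 if perplexity(model, s, vocab_size) <= threshold]
        if not newly:
            break
        selected.extend(newly)
        chosen = set(newly)
        remaining = [s for s in remaining if s not in chosen]
    return selected


# Hypothetical toy data: a travel-domain seed and a mixed external corpus.
seed = ["book a flight to boston", "show flights to denver"]
external = [
    "book a flight to denver",
    "show cheap flights to boston",
    "the stock market fell sharply",
    "cheap flights today",
]
selected = bootstrap_select(seed, external, threshold=19.0)
print(selected)
```

With this toy data, one sentence narrowly misses the threshold on the first pass and is picked up only after the model has been rebuilt, which is the effect the iteration is meant to produce. A real system would replace the unigram model with an n-gram or other language model and would tune the threshold on held-out in-domain data.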

Similar resources

Three-dimensional analytical models for time-dependent coefficients through uniform and varying plane input source in semi-infinite adsorbing porous media.

In the present study, analytical solutions are developed for the three-dimensional advection-dispersion equation (ADE) in a semi-infinite, adsorbing, saturated, homogeneous porous medium with a time-dependent dispersion coefficient. That is, the pores of the medium are filled with a single fluid (water). The dispersion coefficient is considered proportional to the seepage velocity, while the adsorption coefficient inversel...

The Hybrid Bootstrap: A Drop-in Replacement for Dropout

Regularization is an important component of predictive model building. The hybrid bootstrap is a regularization technique that functions similarly to dropout except that features are resampled from other training points rather than replaced with zeros. We show that the hybrid bootstrap offers superior performance to dropout. We also present a sampling based technique to simplify hyperparameter ...

Rapid Language Model Development for New Task Domains

Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations...

Towards Unsupervised Spoken Language Understanding: Exploiting Query Click Logs for Slot Filling

In this paper, we present a novel approach to exploit user queries mined from search engine query click logs to bootstrap or improve slot filling models for spoken language understanding. We propose extending the earlier gazetteer population techniques to mine unannotated training data for semantic parsing. The automatically annotated mined data can then be used to train slot specific parsing m...

Publication date: 1998