Bayesian Stratified Sampling to Assess Corpus Utility

Authors

  • Judith Hochberg
  • Clint Scovel
  • Timothy Thomas
  • Sam Hall
Abstract

This paper describes a method for asking statistical questions about a large text corpus. We exemplify the method by addressing the question, "What percentage of Federal Register documents are real documents, of possible interest to a text researcher or analyst?" We estimate an answer to this question by evaluating 200 documents selected from a corpus of 45,820 Federal Register documents. Stratified sampling is used to reduce the sampling uncertainty of the estimate from over 3100 documents to fewer than 1000. The stratification is based on observed characteristics of real documents, while the sampling procedure incorporates a Bayesian version of Neyman allocation. A possible application of the method is to establish baseline statistics used to estimate recall rates for information retrieval systems.
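As a rough illustration of the kind of allocation the abstract mentions, the sketch below applies Neyman allocation (sample size per stratum proportional to stratum size times within-stratum standard deviation) using a Beta-posterior estimate of each stratum's proportion of real documents. This is a minimal sketch under stated assumptions, not the authors' procedure: the uniform Beta(1, 1) prior, the three strata, their sizes, and the pilot counts are all hypothetical. Only the corpus size (45,820) and the sample size (200) come from the abstract.

```python
import math

def posterior_mean(successes, trials, a=1.0, b=1.0):
    """Posterior mean of a proportion under a Beta(a, b) prior (assumed prior)."""
    return (a + successes) / (a + b + trials)

def neyman_allocation(sizes, proportions, total_sample):
    """Split total_sample across strata in proportion to N_h * sqrt(p_h * (1 - p_h))."""
    weights = [n * math.sqrt(p * (1.0 - p)) for n, p in zip(sizes, proportions)]
    total_weight = sum(weights)
    raw = [total_sample * w / total_weight for w in weights]
    # Largest-remainder rounding so the integer allocations sum to total_sample.
    alloc = [math.floor(r) for r in raw]
    leftover = total_sample - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Hypothetical strata whose sizes sum to the 45,820 documents in the corpus.
stratum_sizes = [30000, 10000, 5820]
# Hypothetical pilot results per stratum: (real documents found, documents examined).
pilot = [(9, 10), (5, 10), (0, 10)]

proportions = [posterior_mean(real, seen) for real, seen in pilot]
allocation = neyman_allocation(stratum_sizes, proportions, total_sample=200)
print(allocation)
```

Because the posterior mean is never exactly 0 or 1, a stratum whose pilot sample contained no real documents still receives a nonzero share of the sample. A plain plug-in estimate would assign it zero variance and therefore zero further sampling.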

Similar resources

Multiple utility constrained multi-objective programs using Bayesian theory

A utility function is an important tool for representing a decision maker's (DM's) preferences. We adjoin utility functions to multi-objective optimization problems. In current studies, usually one utility function is used for each objective function. Situations may arise in which a goal has multiple utility functions. Here, we consider a constrained multi-objective problem with each objective having multiple utili...

Frequentist and Bayesian Coverage Estimations for Stratified Fault-Injection

This paper addresses the problem of estimating the coverage of fault tolerance through statistical processing of observations collected in fault-injection experiments. In an earlier paper, we have studied various frequentist estimation methods based on simple sampling in the whole fault/activity input space and stratified sampling in a partitioned space. In this paper, Bayesian estima...

Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities

In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate because it describ...

Bayesian sample size Determination Using a Scaled Exponential Utility Function According to Numerical Method

In this paper we propose a utility function and obtain the Bayes estimate and the optimum sample size under this utility function. This utility function is designed especially to obtain the Bayes estimate when the posterior follows a gamma distribution. We consider a Normal with known mean, a Pareto, an Exponential and a Poisson distribution for an optimum sample size under the propose...

Expected Utility Estimation via Cross-Validation

We discuss practical methods for the assessment, comparison and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate in order to describe the associat...

Journal title:
  • CoRR

Volume cmp-lg/9806012

Publication date 1998