Crowdsourced Nonparametric Density Estimation Using Relative Distances

Authors

  • Antti Ukkonen
  • Behrouz Derakhshan
  • Hannes Heikinheimo
Abstract

In this paper we address the following density estimation problem: given a number of relative similarity judgements over a set of items D, assign a density value p(x) to each item x ∈ D. Our work is motivated by human computing applications where density can be interpreted, for example, as a measure of the rarity of an item. While humans are excellent at solving a range of different visual tasks, assessing the absolute similarity (or distance) of two items (e.g. photographs) is difficult. Relative judgements of similarity, such as "A is more similar to B than to C", on the other hand, are substantially easier to elicit from people. We provide two novel methods for density estimation that only use relative expressions of similarity. We give both theoretical justification and empirical evidence that the proposed methods produce good estimates.

Introduction

A common application of crowdsourcing is to collect training labels for machine learning algorithms. However, human computation can also be employed to solve computational problems directly (Amsterdamer et al. 2013; Chilton et al. 2013; Trushkowsky et al. 2013; Parameswaran et al. 2011; Bragg, Mausam, and Weld 2013). Some of this work is concerned with solving machine learning problems with the crowd (Gomes et al. 2011; Tamuz et al. 2011; van der Maaten and Weinberger 2012; Heikinheimo and Ukkonen 2013).

In this paper we focus on the problem of nonparametric density estimation given a finite sample from an underlying distribution. This is a fundamental problem in statistics and machine learning that has many applications, such as classification, regression, clustering, and outlier detection. Density can be understood simply as "the number of data points that are close to a given data point". Any item in a high density region should thus be very similar to a fairly large number of other items.
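The informal reading of density as "the number of data points that are close to a given data point" can be sketched in a few lines. This is only an illustration, not one of the methods proposed in the paper: it assumes that items have known coordinates and that absolute distances are available, and the function name `neighbour_count_density`, the `radius` parameter, and the toy points are hypothetical.

```python
import math


def neighbour_count_density(points, radius):
    """For each point, count how many OTHER points lie within `radius` of it.

    Assumption: absolute Euclidean distances are available, which is exactly
    what is hard to obtain from people in a crowdsourcing setting.
    """
    counts = []
    for i, p in enumerate(points):
        close = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        counts.append(close)
    return counts


# Hypothetical toy data: three items close together and one isolated item.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
counts = neighbour_count_density(points, radius=0.5)
```

Here the three clustered items each have two close neighbours, while the isolated item has none, so it would be the natural candidate for a "rare" item.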
We argue that in the context of crowdsourcing, density can be viewed for instance as a measure of "commonality" of the items being studied. That is, all items in a high density region can be thought of as having a large number of common aspects or features. Likewise, items from low density regions are outliers or in some sense unusual.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To give an example, consider a collection of photographs of galaxies (see e.g. Lintott et al. 2008). It is reasonable to assume that some types of galaxies are fairly common, while others are relatively rare. Suppose our task is to find the rare galaxies. In the absence of prior knowledge this can be tricky. One approach is to let workers label each galaxy as either common or rare according to the workers' expertise. We argue that this has two drawbacks. First, evaluating commonality in absolute terms in a consistent manner may be difficult. Second, the background knowledge of the workers may be inconsistent with the data distribution: a galaxy that would be considered rare under general circumstances may be extremely common in the given data. We argue that in such circumstances density estimation may be a more reliable method for identifying the rare galaxies. A galaxy should be considered rare if there are very few (or no) other galaxies in the studied data that are similar to it.

The textbook approach to nonparametric density estimation is the kernel density estimator (Hastie, Tibshirani, and Friedman 2009, p. 208ff). These methods usually consider the absolute distances between data points. That is, the distance between items A and B is given on some (possibly arbitrary) scale. Absolute distances between data points are also used in other elementary machine learning techniques, such as hierarchical clustering or dimensionality reduction.
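To make the dependence on absolute distances concrete, here is a minimal sketch of a Gaussian kernel density estimator. It is a generic textbook construction, not code from the paper; the function name `gaussian_kde`, the `bandwidth` parameter, and the toy data are assumptions made for illustration.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel.

    Every evaluation needs the absolute distance from x to each data point,
    measured on the same scale as `bandwidth`.
    """
    n = len(points)
    dim = len(points[0])
    # Normalising constant of an isotropic Gaussian in `dim` dimensions.
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for p in points:
            scaled = math.dist(x, p) / bandwidth
            total += math.exp(-0.5 * scaled * scaled)
        return total / norm

    return density


# Hypothetical toy data: a small cluster and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
estimate = gaussian_kde(points, bandwidth=0.5)
values = [estimate(p) for p in points]
```

In this toy example the estimate at each clustered point is larger than at the isolated point. Note that rescaling all distances without rescaling `bandwidth` changes the result, which is why the scale of the distances matters for this kind of estimator.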
However, in the context of human computation, it is considerably easier to obtain information about relative distances. For example, statements of the form "the distance between items A and B is shorter than the distance between items C and D" are substantially easier to elicit from people than absolute distances. Moreover, such statements can be collected efficiently via crowdsourcing using appropriately formulated HITs (Wilber, Kwak, and Belongie 2014). It is thus interesting to study the expressive power of relative distances, and whether absolute distances are even needed to solve some problems.

A common application of relative distances is in algorithms for computing low-dimensional embeddings (representations of the data in a real vector space) either directly (van der Maaten and Weinberger 2012) or via semi-supervised

Proceedings, The Third AAAI Conference on Human Computation and Crowdsourcing (HCOMP-15)
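A relative distance statement carries no scale, which the following sketch tries to illustrate. It simulates the answers a worker might give by comparing distances between hidden coordinates; in a real task the answers would come from people, not from coordinates. The function names `closer_pair`, `triplet`, and `all_triplets` and the toy points are hypothetical, and this is not the estimation method of the paper.

```python
import math
from itertools import permutations


def closer_pair(points, a, b, c, d):
    """Answer 'is the distance between a and b shorter than between c and d?'."""
    return math.dist(points[a], points[b]) < math.dist(points[c], points[d])


def triplet(points, a, b, c):
    """Answer 'is a more similar to b than to c?' (special case of closer_pair)."""
    return closer_pair(points, a, b, a, c)


def all_triplets(points):
    """Collect the answer to every triplet question over distinct items."""
    idx = range(len(points))
    return {
        (a, b, c): triplet(points, a, b, c)
        for a, b, c in permutations(idx, 3)
    }


# Hypothetical hidden coordinates standing in for human perception.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
# The same configuration measured on a different, arbitrary scale.
rescaled = [(1024.0 * x, 1024.0 * y) for x, y in points]

judgements = all_triplets(points)
rescaled_judgements = all_triplets(rescaled)
```

Both sets of answers are identical, since multiplying every distance by the same positive factor does not change which of two distances is shorter. Relative judgements therefore never require the workers to agree on a common scale.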


Similar resources

Statistical Topology Using the Nonparametric Density Estimation and Bootstrap Algorithm

This paper presents approximate confidence intervals for each function of parameters in a Banach space based on a bootstrap algorithm. We apply kernel density approach to estimate the persistence landscape. In addition, we evaluate the quality distribution function estimator of random variables using integrated mean square error (IMSE). The results of simulation studies show a significant impro...


The Relative Improvement of Bias Reduction in Density Estimator Using Geometric Extrapolated Kernel

One of the nonparametric procedures used to estimate densities is the kernel method. In this paper, in order to reduce the bias of kernel density estimation, methods such as the usual kernel (UK), the geometric extrapolation usual kernel (GEUK), a bias reduction kernel (BRK) and a geometric extrapolation bias reduction kernel (GEBRK) are introduced. Theoretical properties, including the selection of smoothness para...


THE COMPARISON OF TWO METHOD NONPARAMETRIC APPROACH ON SMALL AREA ESTIMATION (CASE: APPROACH WITH KERNEL METHODS AND LOCAL POLYNOMIAL REGRESSION)

Small area estimation is a technique used to estimate parameters of subpopulations with small sample sizes. Small area estimation is needed in obtaining information on a small area, such as sub-district or village. Generally, in some cases, small area estimation uses parametric modeling. But in fact, a lot of models have no linear relationship between the small area average and the covariat...


Convergence Rates of Posterior Distributions for Noniid Observations

We consider the asymptotic behavior of posterior distributions and Bayes estimators based on observations which are required to be neither independent nor identically distributed. We give general results on the rate of convergence of the posterior measure relative to distances derived from a testing criterion. We then specialize our results to independent, nonidentically distributed observation...




Publication date: 2015