Analytic Methods for Optimizing Realtime Crowdsourcing

Authors

  • Michael S. Bernstein
  • David R. Karger
  • Rob Miller
  • Joel Brandt
Abstract

Realtime crowdsourcing research has demonstrated that it is possible to recruit paid crowds within seconds by managing a small, fast-reacting worker pool. Realtime crowds enable crowd-powered systems that respond at interactive speeds: for example, cameras, robots and instant opinion polls. So far, these techniques have mainly been proof-of-concept prototypes: research has not yet attempted to understand how they might work at large scale or to optimize their cost/performance trade-offs. In this paper, we use queueing theory to analyze the retainer model for realtime crowdsourcing, in particular its expected wait time and cost to requesters. We provide an algorithm that allows requesters to minimize their cost subject to performance requirements. We then propose and analyze three techniques to improve performance: push notifications, shared retainer pools, and precruitment, which involves recalling retainer workers before a task actually arrives. An experimental validation finds that precruited workers begin a task 500 milliseconds after it is posted, delivering results below the one-second cognitive threshold for an end-user to stay in flow.

INTRODUCTION

Crowdsourcing is no longer constrained to slow, offline tasks. Just as traditional programming evolved from offline batch processes to realtime results and interaction, crowdsourcing is now transitioning from wait times of hours (Ipeirotis 2010) to seconds (Bernstein, Brandt, Miller & Karger 2011). Techniques that place workers on active retainer can now recruit crowds in two seconds (Bernstein et al. 2011), complete traditional crowdsourced votes in five seconds, and maintain continuous control of remote interfaces (Lasecki, Murray, White, Miller & Bigham 2011). These realtime crowds open the door to deployable applications that react intelligently, including smart cameras, robot navigators, spreadsheets, and on-demand graphic design. However, existing realtime techniques have largely been prototypes aimed at demonstrating feasibility; they did not attempt to understand how these approaches would work at large scale or to optimize cost/performance trade-offs.

We focus on the retainer model, which pays workers a small extra wage to be on call while they pursue other tasks, then respond quickly when a realtime request arrives (Bernstein et al. 2011). Currently, the retainer model is not optimized for cost or performance, nor do requesters have any analytic framework to understand the relationship between retainer pool size, cost, and response time.

This paper analyzes the retainer model using queueing theory (Gross & Harris 1998) to understand its performance at scale, in particular the trade-off between expected wait time and cost. We introduce a simple algorithm for choosing the optimal size of the retainer pool to minimize total cost to the requester subject to the requester's performance requirements: maximum expected wait time or maximum probability of missing a request. We then propose several improvements to the retainer model that reduce expected wait time. First, retainer subscriptions allow workers to sign up for push notifications for recruitment, which reduces the length of time it takes to recruit new workers onto retainer. Second, combining retainer pools across requesters leads to both cost and wait time improvements. Large retainer pools can then be made more effective by using task routing to connect appropriate workers to the tasks that need them.
Third, a precruitment strategy recalls workers from retainer a few moments before a task is expected to arrive, dramatically lowering response time. We perform an early empirical evaluation demonstrating that precruitment results in median response times of just 500 milliseconds.

Our analysis carries several benefits. First, realtime tasks can now directly minimize their cost for a given performance requirement. Second, retainer subscriptions allow workers to register for the tasks they like best and have them delivered, rather than constantly seeking out new work. Third, we demonstrate empirically that these techniques can overcome previous limits of "crowds in two seconds" to deliver feedback to the user within 500 milliseconds, finally under the one-second cognitive threshold for an end-user to remain in flow (Nielsen 1993).

We begin by surveying related work on realtime crowdsourcing and wait times in crowdsourcing systems. We then describe the retainer model and use queueing theory to analyze and optimize wait time and cost. We introduce our improvements to the model (retainer subscriptions, global retainer pools, and precruitment) and integrate them into our analysis. Finally, we discuss limitations of our approach and point to future work realizing the vision of realtime crowds.

RELATED WORK

Crowdsourcing researchers have a strong interest in fast task completion times. Paying more will lead to more work completed (Mason & Watts 2009), but not at realtime speed. QuikTurKit introduced two techniques to improve response time: repeatedly posting tasks so as to stay visible in the recent task list, and keeping workers primed with old tasks until a new task is ready (Bigham, Jayant, Ji, Little, Miller, Miller, Miller, Tatarowicz, White, White & Yeh 2010). The retainer model builds on QuikTurKit by paying workers a small fee and notifying them when work is ready, recruiting crowds in two seconds (Bernstein et al. 2011). Workers can also maintain continuous realtime control of an interface by electing temporary leaders (Lasecki et al. 2011). We contribute a more thorough analysis of the techniques in these systems and an algorithmic approach to helping these systems achieve target wait times at minimum cost.

Accurate models of crowdsourcing platforms help us understand the underlying processes and predict behavior when parameters change. Queueing theory has been used to estimate throughput and wages on Mechanical Turk (Ipeirotis 2010). Survival analysis is another popular model for predicting task completion time, especially in non-realtime scenarios (Faridani, Hartmann & Ipeirotis 2011). We model crowdsourcing task arrival processes as Poisson; empirical data suggests that the Poisson approximation is accurate when parametrized by time of day (Faridani et al. 2011).

RETAINER MODEL

The retainer model is a recruitment approach for realtime crowdsourcing. It was introduced for realtime interfaces like instant feedback votes and a crowdsourced camera shutter (Bernstein et al. 2011). It pays workers a small wage to be on call and return quickly when a task is ready. These workers accept the task in advance and are paid extra to keep their browser window open. While they wait for the task to arrive, workers are free to work on other tasks. There are many methods for recalling the worker; Bernstein et al. (2011) used a modal dialog and an audio alert.
Evaluations demonstrated that workers messaged in this way begin work in two to three seconds.

QUEUEING THEORY ANALYSIS

In this section, we investigate a mathematical model of retainers. This model allows us to predict how long realtime tasks will need to wait. To begin, suppose each task type has its own set of retainer workers. When a task comes in, a worker leaves the retainer pool to work on the task and the retainer system recruits another worker to refill the pool. The goal is to maintain a large enough pool of retainer workers to handle incoming tasks. In other words, we want to minimize the probability that the retainer pool will be empty (no retainer workers left), subject to cost constraints. The risk is that a burst of task arrivals may exhaust the retainer pool before we can recruit replacement workers.

We model this problem using queueing theory. In queueing theory, a set of servers is available to handle jobs as they arrive. If all servers are busy when a new job arrives, that job enters a queue of waiting tasks and is serviced as soon as it reaches the front of the queue. In our scenario, tasks are jobs, and retainer workers are servers.

In this paper, we consider a class of algorithms that set an optimal retainer pool size. Suppose the retainer pool holds c workers. As jobs come in and remove workers from the retainer pool, assume that the system always puts out enough requests for new workers to bring the pool back to c. That is, if there are c₀ workers in the pool, the system has issued c − c₀ outstanding requests. If the pool is empty when a job arrives, the system sets that job aside for special processing: it directly recruits a worker, not for the pool, but for that job. In effect, a user with a diverted job is immediately alerted that the system is over capacity and the job will be handled out-of-band after a short delay. This final assumption may not accurately reflect how a running system would work, but it provides an upper bound on expected wait time and makes it easier to analyze the probability that a task will be serviced in realtime.

Suppose that tasks arrive as a Poisson process at rate λ, and retainer workers arrive after they are requested as a Poisson process at rate μ. Then the empty spots in the retainer pool, each of which becomes filled when a worker arrives, can be thought of as busy machines occupied with a job whose completion time is exponentially distributed with rate μ. In our setup, we also divert jobs that arrive when all machines are busy. In other words, this is an M/M/c/c queue where jobs arrive at rate λ and are processed at rate μ. A basic M/M/1 queue assumes Poisson arrival and completion processes, a single server, and a potentially infinite queue. An M/M/c/c queue has c parallel machines instead of one, and rejects or redirects requests when there is no server to immediately handle the incoming request (Gross & Harris 1998). Imagine a telephone system, for example, that gives a busy signal if all c lines are busy. The meaning of μ has changed slightly to indicate worker recruitment time instead of job completion time, but the mathematical analysis is the same.

To optimize performance, we need to understand the probability that all workers are busy, since that is the case where a job has to wait (for expected time 1/μ). We also need to understand the cost of having a retainer pool of size c.
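Before deriving these quantities analytically, a small Monte Carlo sketch can make the pool dynamics concrete. The following Python snippet is not from the paper: the function name simulate_retainer_pool and the example rates are hypothetical. It simulates Poisson task arrivals and exponential recruitment delays under the refill-and-divert policy described above, and estimates how often an arriving task finds the pool empty.

```python
import random

def simulate_retainer_pool(c, lam, mu, horizon=50_000.0, seed=0):
    """Estimate the fraction of tasks that find the retainer pool empty.

    Tasks arrive as a Poisson process with rate `lam`. Whenever a worker
    leaves the pool, a replacement is requested and arrives after an
    Exponential(mu) delay, so the empty slots behave like the busy servers
    of an M/M/c/c queue. Tasks arriving while the pool is empty are
    diverted (handled out-of-band), per the model above.
    """
    rng = random.Random(seed)
    t = 0.0
    pool = c                 # workers currently idle on retainer
    refills = []             # arrival times of outstanding replacement workers
    arrived = diverted = 0
    while t < horizon:
        t += rng.expovariate(lam)                 # next task arrival
        done = [r for r in refills if r <= t]     # replacements that arrived
        refills = [r for r in refills if r > t]
        pool = min(c, pool + len(done))
        arrived += 1
        if pool == 0:
            diverted += 1                         # pool empty: divert this task
        else:
            pool -= 1                             # a retainer worker takes it
            refills.append(t + rng.expovariate(mu))  # request a replacement
    return diverted / arrived

# Example (hypothetical rates): tasks at 2/min, recruitment at 1/min (rho = 2).
if __name__ == "__main__":
    for c in (1, 2, 3, 4, 5, 6):
        print(c, round(simulate_retainer_pool(c, lam=2.0, mu=1.0), 3))
```

For a long enough horizon, the diverted fraction should track the analytic probability of an empty pool derived next, for the same ρ and c.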
Since the system pays workers in proportion to how long they are on retainer without a job, the total cost is proportional to the average number of idle machines; these are the ones representing workers waiting on retainer. Finally, we will eventually need to integrate worker abandonment into our model, since not all workers respond to the retainer alert.

These assumptions are perhaps overly ideal. Job arrivals on Mechanical Turk are heavy-tailed (Ipeirotis 2010). However, much of our analysis is independent of the arrival distribution, and systems can always substitute empirically observed distributions and solve numerically.

Probability of an Empty Pool

The probability that a job must wait can be derived using Erlang's loss formula (Gross & Harris 1998). We set ρ, the traffic intensity, to be the amount of system resources required to service newly incoming tasks: ρ = λ/μ. In M/M/c/c queueing systems, as we will demonstrate, ρ < c is necessary for the system to keep up with incoming requests. The probability of an empty retainer pool (all c "servers" busy) is Erlang's loss formula:

    π(c) = (ρ^c / c!) / ( ∑_{i=0}^{c} ρ^i / i! )    (1)

A remarkable property of Erlang's loss formula is that it requires no assumption about the distribution of worker recruitment time beyond its mean, in particular whether it is exponential: given Poisson task arrivals, the formula depends only on the means λ and μ.

Expected Waiting Time

For some applications, the probability of a task needing to wait is less important than the expected wait time for the task. The two quantities are directly related: the expected wait time is the probability of an empty retainer pool multiplied by the expected wait when the pool is empty (1/μ), giving (1/μ)π(c):

    (1/μ) π(c) = (1/μ) · (ρ^c / c!) / ( ∑_{i=0}^{c} ρ^i / i! )    (2)

This expression gives us a direct relationship between the size of the retainer pool, the task and worker arrival rates, and the expected wait time. As a sanity check: when λ → 0 (few arrivals), we have ρ → 0, in which case π(c) → 0. In other words, we are very unlikely to have an empty pool, so the expected wait time also goes to zero. This relationship is visualized in Figure 1(c).

Expected Cost

Once we understand expected waiting time, we can analyze the retainer model's cost characteristics. Bernstein et al.'s (2011) experiments suggested that workers could be maintained on retainer for $0.30 per hour at a rate of ½¢ per minute, but this analysis is fairly simplistic. To understand cost more completely, we need to know the expected number of workers on retainer. The probability of having i busy servers in an M/M/c/c queue is a more general version of Erlang's loss formula:

    π(i) = (ρ^i / i!) / ( ∑_{j=0}^{c} ρ^j / j! )    (3)

From this we can derive a closed-form expression for the expected number of busy servers:

    E[i] = ( ∑_{i=0}^{c} i · ρ^i / i! ) / ( ∑_{i=0}^{c} ρ^i / i! )
         = ρ ( ∑_{i=0}^{c−1} ρ^i / i! ) / ( ∑_{i=0}^{c} ρ^i / i! )
         = ρ (1 − π(c))    (4)

In steady state, we need to pay all retainer workers who are not busy. That is, we expect to have c − ρ(1 − π(c)) workers waiting on retainer. If our retainer salary rate is s (e.g., s = ½¢ per minute), we pay s(c − ρ(1 − π(c))) per unit time on average. Note that as ρ/c → 0 we have π(c) → ρ^c/c!, so the number of free workers goes to c − ρ(1 − ρ^c/c!), or effectively c; as ρ → ∞, the number of free workers goes to c − ρc/(ρ + c) = c(1 − ρ/(ρ + c)), which goes to 0.

Visualizing the Relationships

While these equations give us precise relationships, they may not convey intuitions about the performance of the platform. Figure 1 plots these relationships for several possible values of ρ. These figures show a knee in the curve at c ≈ ρ for getting a good probability of response.
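The following is a minimal sketch of equations (1), (2), and (4) in Python, not code from the paper; the function names and the example rates are illustrative only, and rates are assumed to be given per minute.

```python
from math import factorial

def erlang_loss(c: int, rho: float) -> float:
    """Equation (1): probability the retainer pool is empty (all c slots refilling)."""
    denom = sum(rho**i / factorial(i) for i in range(c + 1))
    return (rho**c / factorial(c)) / denom

def expected_wait(c: int, lam: float, mu: float) -> float:
    """Equation (2): expected wait = P(empty pool) * mean recruitment time 1/mu."""
    return erlang_loss(c, lam / mu) / mu

def retainer_cost_rate(c: int, lam: float, mu: float, s: float) -> float:
    """Equation (4): salary s per idle worker per unit time, with an expected
    c - rho*(1 - pi(c)) idle workers on retainer in steady state."""
    rho = lam / mu
    return s * (c - rho * (1.0 - erlang_loss(c, rho)))

# Hypothetical rates: tasks at 2/min, recruitment at 1/min, salary 0.5 cents/min.
if __name__ == "__main__":
    for c in range(1, 8):
        print(c, round(expected_wait(c, 2.0, 1.0), 4),
              round(retainer_cost_rate(c, 2.0, 1.0, 0.5), 3))
```

Sweeping c this way reproduces the qualitative picture described below: wait probability drops sharply once c exceeds ρ, while cost grows with the pool size.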
A pool size c > ρ means that an empty pool's overall rate of recruitment of workers, cμ, exceeds the arrival rate of tasks. In other words, we begin to catch up and rebuild a set of available workers. On the other hand, if c < ρ, then even an empty pool will not recruit workers fast enough to cover all arriving tasks, so it will stay empty.

Figure 2 visualizes the relationship between the requester's cost and the probability of waiting. We derive this parametric curve by choosing values of c, then finding the cost and probability of waiting for each value. Paying more (i.e., for a larger pool) always improves the probability that the system can immediately handle a request. However, for small values of ρ, e.g. ρ ≤ 1, paying 1–1.5¢ per minute brings the probability of waiting near zero. When tasks arrive quite quickly, 2.5¢ or more per minute is necessary to achieve similar performance.

OPTIMAL RETAINER POOL SIZE

A queueing theory model allows us to determine the number of workers to keep on active retainer. The size of the retainer pool is typically the only value that requesters can manipulate, and it impacts both cost and expected wait time. Requesters want to minimize their costs by keeping the retainer pool as small as possible while also maintaining a low probability that a task cannot be served in realtime. In this section, we present techniques for choosing the size of the retainer pool. Our goal is to find an optimal value of c, given 1) the arrival rates λ and μ, and 2) desired performance, in terms of the probability of a miss π(c) or total cost. We assume that the requester knows λ and μ either through empirical observation or estimation. We also assume that λ …
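One way to realize the pool-sizing goal stated above is sketched below under the same assumptions: scan upward from c = 1 and return the smallest pool size whose expected wait (Equation 2) meets the requester's target, which also keeps cost low, since larger pools cost more to maintain. The function names, the numerically stable Erlang-B recursion used in place of the factorial form, and the example target are illustrative, not the paper's implementation.

```python
def erlang_loss(c: int, rho: float) -> float:
    """Equation (1) computed via the standard, numerically stable Erlang-B recursion."""
    b = 1.0
    for i in range(1, c + 1):
        b = rho * b / (i + rho * b)
    return b

def smallest_pool_for_wait(lam: float, mu: float, max_wait: float,
                           c_max: int = 10_000) -> int:
    """Smallest c such that the expected wait (1/mu) * pi(c) <= max_wait."""
    rho = lam / mu
    for c in range(1, c_max + 1):
        if erlang_loss(c, rho) / mu <= max_wait:
            return c
    raise ValueError("no pool size up to c_max meets the wait target")

# Hypothetical requirement: tasks at 2/min, recruitment at 1/min,
# target expected wait of one second (1/60 minute).
if __name__ == "__main__":
    print(smallest_pool_for_wait(lam=2.0, mu=1.0, max_wait=1 / 60))  # -> 6
```

The same scan works for a miss-probability target by comparing erlang_loss(c, rho) directly against the allowed π(c), or for a budget by stopping when the cost rate of Equation (4) would be exceeded.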


Journal:
  • CoRR

Volume abs/1204.2995  Issue –

Pages –

Publication date 2012