Straggler Mitigation in Distributed Optimization Through Data Encoding
Abstract
Slow-running, or straggler, tasks can significantly reduce the speed of distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of stragglers by embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternative approach in which we embed the redundancy directly in the data itself, and allow the computation to proceed completely obliviously to the encoding. We propose several encoding schemes and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample-path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, the approximation error can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data-replication strategies.
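As a concrete illustration, the sketch below simulates this data-encoding idea on a toy least-squares problem: the data is multiplied once by a redundant encoding matrix, and plain gradient descent then runs on whichever subset of worker partitions responds at each iteration. The Gaussian encoding matrix, redundancy factor, and straggler model here are stand-in assumptions for exposition, not the paper's specific constructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: min_w ||X w - y||^2.
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Embed redundancy in the data itself: encode with a tall (m x n)
# matrix S, m = beta * n. The paper studies specific constructions;
# a random Gaussian S with S^T S ~ I is used here as a stand-in.
beta = 2.0                                   # redundancy factor (assumed)
m = int(beta * n)
S = rng.standard_normal((m, n)) / np.sqrt(m)
X_enc, y_enc = S @ X, S @ y

# Partition the encoded rows across workers. Each iteration, an
# arbitrary subset of k workers responds; stragglers are ignored.
num_workers, k = 10, 7
parts = np.array_split(np.arange(m), num_workers)

w = np.zeros(d)
lr = 0.5
for _ in range(500):
    live = rng.choice(num_workers, size=k, replace=False)
    grad = np.zeros(d)
    for j in live:                           # sum partial gradients
        idx = parts[j]
        grad += X_enc[idx].T @ (X_enc[idx] @ w - y_enc[idx])
    # Rescale for the missing partitions; the gradient-descent update
    # itself is completely oblivious to the encoding.
    w -= lr * (num_workers / k) * grad / m

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print("distance to uncoded solution:", np.linalg.norm(w - w_star))
```

Consistent with the abstract, the iterate settles in a neighborhood of the uncoded solution whose size shrinks as the redundancy factor beta or the per-iteration worker count k grows.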
Similar Papers
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at...
Speculation-aware Resource Allocation for Cluster Schedulers
Resource allocation and straggler mitigation (via “speculative” copies) are two key building blocks of analytics frameworks. Today, the two solutions are largely decoupled from each other, losing opportunities for joint optimization. Resource allocation across jobs assumes that each job runs a fixed set of tasks, ignoring their need to dynamically run speculative copies for stragglers. Cons...
Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...
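A minimal sketch of the master/worker step this abstract describes, using simple r-fold replication of data partitions and an exponential delay model (both illustrative assumptions; the paper's own coding scheme is more refined) to show how redundancy shortens the wait for the slowest node:

```python
import numpy as np

rng = np.random.default_rng(1)

# One synchronous gradient step in the master/worker setting. The
# replication scheme and delay model below are assumptions chosen
# for illustration only.
n, d, groups, r = 120, 5, 4, 2          # r copies of each partition
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
parts = np.array_split(np.arange(n), groups)
w = np.zeros(d)

def partial_grad(idx):
    """Partial gradient of 0.5 * ||X w - y||^2 over rows idx."""
    return X[idx].T @ (X[idx] @ w - y[idx])

# The master needs one finished copy of every partition's gradient.
grad = sum(partial_grad(parts[g]) for g in range(groups))

# Straggling: each (partition, copy) task finishes at a random time.
finish = rng.exponential(1.0, size=(groups, r))
t_uncoded = finish[:, 0].max()          # one copy: wait for the slowest
t_coded = finish.min(axis=1).max()      # fastest of r copies per part
print(f"step time, 1 copy: {t_uncoded:.2f}; {r} copies: {t_coded:.2f}")
```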
Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding
We consider the problem of massive matrix multiplication, which underlies many data-analytic applications, in a large-scale distributed system comprising a group of worker nodes. We target the stragglers’ delay performance bottleneck, which is due to the unpredictable latency of waiting for the slowest nodes (or stragglers) to finish their tasks. We propose a novel coding strategy, named entangled ...
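For intuition about how coding removes the stragglers' bottleneck in matrix multiplication, the sketch below uses a classical MDS (polynomial/Vandermonde) code rather than the entangled codes this paper proposes: any k of the n worker products suffice to recover A @ B.

```python
import numpy as np

rng = np.random.default_rng(2)

# Split A into k row blocks and hand n > k coded blocks to workers;
# the products returned by ANY k workers decode the full result.
k, n_workers = 3, 5
A = rng.standard_normal((6, 4))          # 6 rows -> k blocks of 2
B = rng.standard_normal((4, 3))
blocks = np.split(A, k)

xs = np.arange(1.0, n_workers + 1)       # distinct evaluation points
# Worker i holds the "polynomial evaluation" sum_j xs[i]**j * A_j.
coded = [sum(x**j * blocks[j] for j in range(k)) for x in xs]
products = [c @ B for c in coded]        # each worker's computation

# Say workers 0 and 3 straggle; decode from the k = 3 survivors by
# inverting the Vandermonde system linking them to the A_j @ B blocks.
alive = [1, 2, 4]
V = np.vander(xs[alive], k, increasing=True)
stacked = np.stack([products[i] for i in alive])
decoded = np.einsum('ij,jrc->irc', np.linalg.inv(V), stacked)
assert np.allclose(np.vstack(decoded), A @ B)
print("A @ B recovered despite 2 stragglers")
```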
Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling
Parallel processing frameworks (Dean and Ghemawat, 2004) accelerate jobs by breaking them into tasks that execute in parallel. However, slow-running, or straggler, tasks can run up to 8 times slower than the median task on a production cluster (Ananthanarayanan et al., 2013), leading to delayed job completion and inefficient use of resources. Existing straggler-mitigation techniques wait to detec...