Scalable System Scheduling for HPC and Big Data
Authors
Abstract
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers and were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations, taking seconds or minutes, that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of scheduler performance are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with the data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler t_s and a nonlinear exponent α_s. For all four schedulers, the utilization of the computing system decreases to <10% for computations lasting only a few seconds. Multi-level schedulers (such as LLMapReduce) that transparently aggregate short computations can improve utilization for these short computations to >90% for all four of the schedulers tested.
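The abstract names two model parameters, the marginal latency t_s and a nonlinear exponent α_s, and reports utilization collapsing for short jobs. As a rough illustration only, the Python sketch below assumes the total launch latency for n jobs grows as t_s * n**alpha_s and compares it against the aggregate useful compute time; the paper's actual model and measured parameter values may differ, and all numbers here are hypothetical.

```python
# Illustrative sketch (not the paper's code): how scheduler launch overhead
# can limit utilization for short jobs. Assumes total launch latency for
# n jobs grows as t_s * n**alpha_s, using the two parameters named in the
# abstract; the exact functional form in the paper may differ.

def utilization(n_jobs, job_seconds, t_s, alpha_s):
    """Fraction of total time spent on useful work rather than launch overhead."""
    useful = n_jobs * job_seconds          # aggregate useful compute time
    overhead = t_s * n_jobs ** alpha_s     # assumed scheduler launch latency
    return useful / (useful + overhead)

if __name__ == "__main__":
    # Hypothetical scheduler parameters: 1 s marginal latency,
    # mildly superlinear scaling with the number of jobs.
    t_s, alpha_s = 1.0, 1.2
    for job_seconds in (1, 10, 60, 600):
        u = utilization(n_jobs=1000, job_seconds=job_seconds,
                        t_s=t_s, alpha_s=alpha_s)
        print(f"{job_seconds:4d} s jobs -> utilization {u:.1%}")
```

Under these toy parameters, one-second jobs yield utilization around 20% while ten-minute jobs exceed 99%, which mirrors the qualitative trend the abstract describes: aggregating many short computations into fewer, longer launches (as multi-level schedulers like LLMapReduce do) recovers most of the lost utilization.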
Similar Articles
MPJ Express Meets YARN: Towards Java HPC on Hadoop Systems
Many organizations, including academic, research, and commercial institutions, have invested heavily in setting up High Performance Computing (HPC) facilities for running computational science applications. On the other hand, the Apache Hadoop software, which emerged in 2005, has become a popular, reliable, and scalable open-source framework for processing large-scale data (Big Data). Realizing the i...
Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture
Due to diversified and frequently changing customer demands, technological advances, and global competition, manufacturers rely on collaboration with their business partners to share costs, risks, and expertise. How to take advantage of advancing technologies to effectively support operations and create competitive advantage is critical for manufacturers' survival. To respond to these...
A note on new trends in data-aware scheduling and resource provisioning in modern HPC systems
The Big Data era [1,2] poses new challenges, as well as significant opportunities, for High-Performance Computing (HPC) systems, such as how to efficiently turn massive volumes of data into valuable information and meaningful knowledge. It is clear that computationally optimized, data-driven HPC techniques are required for processing Big Data in a rapidly increasing number of applications, such as L...
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems such as data grids, cloud computing provides these factors on a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...
VSFS: A Versatile Searchable File System for HPC Analytics
Emerging HPC analytics applications urgently demand file-search services that drastically reduce the scale of input data in real time, so that computation and data analytics can be greatly accelerated. Unfortunately, existing file-search solutions are either poorly scalable for large-scale systems or lack a well-integrated interface that allows applications to easily use them fo...
Dataflow-Based Scheduling for Scientific Workflows in HPC with Storage Constraints
In high-performance computing (HPC), workflow-based workloads are usually data intensive, supporting exploratory analysis of a scientific computation problem that may involve a large parameter space. To achieve the best performance, storage resource constraints are always a pragmatic concern as the potential problem space scales, especially in big data science, as well as its required dataset a...
Journal: J. Parallel Distrib. Comput.
Volume: 111
Pages: -
Publication date: 2018