Scalable System Scheduling for HPC and Big Data
Authors
Abstract
In the rapidly expanding field of parallel processing, job schedulers are the "operating systems" of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers and were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of computations consisting of many short computations, taking seconds or minutes, that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of scheduler performance are critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with the data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler t_s and a nonlinear exponent α_s. For all four schedulers, the utilization of the computing system decreases to <10% for computations lasting only a few seconds. Multi-level schedulers (such as LLMapReduce) that transparently aggregate short computations can improve utilization for these short computations to >90% for all four of the schedulers tested.
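The abstract names two model parameters, the marginal latency t_s and a nonlinear exponent α_s, and reports utilization collapsing for short jobs. As a rough illustration only, the Python sketch below assumes the total launch latency for n jobs grows as t_s * n**alpha_s and compares it against the aggregate useful compute time; the paper's actual model and measured parameter values may differ, and all numbers here are hypothetical.

```python
# Illustrative sketch (not the paper's code): how scheduler launch overhead
# can limit utilization for short jobs. Assumes total launch latency for
# n jobs grows as t_s * n**alpha_s, using the two parameters named in the
# abstract; the exact functional form in the paper may differ.

def utilization(n_jobs, job_seconds, t_s, alpha_s):
    """Fraction of total time spent on useful work rather than launch overhead."""
    useful = n_jobs * job_seconds          # aggregate useful compute time
    overhead = t_s * n_jobs ** alpha_s     # assumed scheduler launch latency
    return useful / (useful + overhead)

if __name__ == "__main__":
    # Hypothetical scheduler parameters: 1 s marginal latency,
    # mildly superlinear scaling with the number of jobs.
    t_s, alpha_s = 1.0, 1.2
    for job_seconds in (1, 10, 60, 600):
        u = utilization(n_jobs=1000, job_seconds=job_seconds,
                        t_s=t_s, alpha_s=alpha_s)
        print(f"{job_seconds:4d} s jobs -> utilization {u:.1%}")
```

Under these toy parameters, one-second jobs yield utilization around 20% while ten-minute jobs exceed 99%, which mirrors the qualitative trend the abstract describes: aggregating many short computations into fewer, longer launches (as multi-level schedulers like LLMapReduce do) recovers most of the lost utilization.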
Similar Articles
MPJ Express Meets YARN: Towards Java HPC on Hadoop Systems
Many organizations, including academic, research, and commercial institutions, have invested heavily in setting up High Performance Computing (HPC) facilities for running computational science applications. On the other hand, the Apache Hadoop software, which emerged in 2005, has become a popular, reliable, and scalable open-source framework for processing large-scale data (Big Data). Realizing the i...
Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture
Due to diversified and frequently changing customer demands, technological advances, and global competition, manufacturers rely on collaboration with their business partners to share costs, risks, and expertise. How to take advantage of advancing technologies to effectively support operations and create competitive advantage is critical for manufacturers' survival. To respond to these...
A note on new trends in data-aware scheduling and resource provisioning in modern HPC systems
The Big Data era [1,2] poses new challenges, as well as significant opportunities, for High-Performance Computing (HPC) systems, such as how to efficiently turn massive volumes of data into valuable information and meaningful knowledge. It is clear that computationally optimized, data-driven HPC techniques are required for processing Big Data in a rapidly increasing number of applications, such as L...
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems such as data grids, cloud computing provides these factors on a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...
VSFS: A Versatile Searchable File System for HPC Analytics
Emerging HPC analytics applications urgently demand file-search services that drastically reduce the scale of input data in real time, so that computation and data analytics can be greatly accelerated. Unfortunately, existing file-search solutions are either poorly scalable for large-scale systems or lack a well-integrated interface that allows applications to easily use them fo...
Dataflow-Based Scheduling for Scientific Workflows in HPC with Storage Constraints
In high-performance computing (HPC), workflow-based workloads are usually data intensive, supporting exploratory analysis of a scientific computation problem that may involve a large parameter space. To achieve the best performance, storage resource constraints are always a pragmatic concern as the potential problem space scales, especially in big data science, as well as its required dataset a...
Journal: J. Parallel Distrib. Comput.
Volume: 111
Pages: -
Publication date: 2018