Production Experiences with the Cray-Enabled TORQUE Resource Manager
Abstract
High performance computing resources use batch systems to manage the user workload. Cray systems differ from typical clusters because of Cray’s Application Level Placement Scheduler (ALPS). ALPS manages binary transfer, job launch and monitoring, and error handling. Batch systems require special support to integrate with ALPS through an XML protocol called BASIL. Previous versions of Adaptive Computing’s TORQUE and Moab batch suite integrated with ALPS from within Moab, using Perl scripts to interface with BASIL. This occasionally caused problems when the components became unsynchronized. Version 4.1 of the TORQUE Resource Manager introduced new features that allow it to integrate directly with ALPS using BASIL. This paper describes production experiences at Oak Ridge National Laboratory with the new TORQUE software versions, as well as ongoing and future work to improve TORQUE.
Keywords: TORQUE; Resource Manager; Adaptive Computing; Cray; ALPS; Moab; HPC; Titan; Gaea
Similar resources
Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms
The Berkeley Data Analytics Stack (BDAS) is an emerging framework for big data analytics. It consists of the Spark analytics framework, the Tachyon in-memory filesystem, and the Mesos cluster manager. Spark was designed as an in-memory replacement for Hadoop that can in some cases improve performance by up to 100X. In this paper, we describe our experiences running BDAS on the new Cray Urika-XA...
Production I/O Characterization on the Cray XE6
I/O performance is an increasingly important factor in the productivity of large-scale HPC systems such as Hopper, a 153,216 core Cray XE6 system operated by the National Energy Research Scientific Computing Center. The scientific workload diversity of such systems presents a challenge for I/O performance tuning, however. Applications vary in terms of data volume, I/O strategy, and access metho...
Management of Virtual Large-scale High-performance Computing Systems
Linux is widely used on high-performance computing (HPC) systems, from commodity clusters to Cray supercomputers (which run the Cray Linux Environment). These platforms primarily differ in their system configuration: some only use SSH to access compute nodes, whereas others employ full resource management systems (e.g., Torque and ALPS on Cray XT systems). Furthermore, the latest improvements i...
The CRAY T3E System
Pioneers began using scalable computing systems over 10 years ago. Slowly the potential audience for scalable systems has grown as programming methods and systems have matured. The CRAY T3D system enabled production users to use scalable systems. Now, the CRAY T3E system is pushing beyond the CRAY T3D base by delivering the most advanced scalable technology at lower price points, and to users o...
Opportunity Scheduling: An Unfair CPU Scheduler for UNICOS
Fair Share is the standard scheduling algorithm used for political resource control on large, multi-user UNIX systems. Promising equity, Fair Share has instead delivered frustration to its Los Alamos UNICOS users, who perceive misallocations of interactive response within a system of unreasonable complexity. This paper reviews the design of the Kay/Lauder Fair Share system, as well as its Cray ...