Analyzing Inter-Job Contention in Dragonfly Networks
Abstract
Interconnection networks are increasing in importance as node counts increase in high-end machines. To achieve better application performance, newer supercomputers frequently have interconnects with more connections, higher bandwidth, and lower diameter. One example of such an interconnect is a dragonfly topology, which has appeared in multiple recent supercomputers. Adaptive routing and high bandwidth on dragonfly networks lead to the belief that sharing of the network between jobs will not lead to performance degradation. In this paper, we analyze the performance of a production HPC application, MILC, on a dragonfly-based Cray supercomputer. We find that, in fact, the performance of MILC varies by a factor of more than three, and that the performance variation is due to communication delays from network interference. First, we analyze a communication trace of MILC to relate per-rank delays to network activity. Then we use machine learning to develop a predictive model for runtime based on network counters. Our model performs well, with a mean squared prediction error of 0.22.
Similar resources
Evaluating System Parameters on a Dragonfly using Simulation and Visualization
The dragonfly topology is becoming a popular choice for building high-radix, low-diameter networks with high-bandwidth links. Even with a powerful network, preliminary experiments on Edison at NERSC have shown that for communication heavy applications, job interference and thus presumably job placement remains an important factor. In this paper, we explore the effects of job placement, job size...
Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System
Dragonfly networks are being widely adopted in high-performance computing systems. On these networks, however, interference caused by resource sharing can lead to significant network congestion and performance variability. We present a comparative analysis exploring the trade-off between localizing communication and balancing network traffic. We conduct trace-based simulations for applications ...
Optimal Placement and Sizing of Multiple Renewable Distributed Generation Units Considering Load Variations Via Dragonfly Optimization Algorithm
The progression towards smart grids, integrating renewable energy resources, has increased the integration of distributed generators (DGs) into power distribution networks. However, several economic and technical challenges can result from the unsuitable incorporation of DGs in existing distribution networks. Therefore, optimal placement and sizing of DGs are of paramount importance to improve ...
Contention of Communications in Switched Networks with Applications to Parallel Sorting REU Site: Interdisciplinary Program in High Performance Computing
Contention of communications across a switched network that connects multiple compute nodes in a distributed-memory cluster may seriously degrade performance of parallel code. The InfiniBand network is the most popular interconnect for compute clusters. While one may correctly assume that increased resource contention leads to decreased application performance, alternate methods such as virtual...
Alleviating MAC Layer Self-Contention in Ad-hoc Networks
The distributed coordination function (DCF) mode of IEEE 802.11 has become the de facto standard media access control mechanism for wireless ad-hoc network research. By design, the IEEE 802.11 MAC protocol is unaware of the transport layer connection a packet belongs to. As a result, packets belonging to the same connection contend for local spectra during transmission at neighboring nodes. This p...