A Study of Multithreaded Benchmarks on the Hewlett-Packard X- and V-Class Architectures

Author

  • Sharon Brunett
Abstract

The Hewlett-Packard X- and V-Class ccNUMA systems appear well suited to exploiting coarse- and fine-grained parallelism using multithreading techniques. This paper briefly summarizes the multilevel memory subsystem of the X- and V-Class platforms. Typical MPP distributed-memory programming concerns for the codes under investigation, such as explicit memory localization and load balancing, are compared to the relevant issues when porting and tuning for the X- and V-Class. This paper uses two small benchmarks as the basis for investigating differences in running multithreaded codes in the SPP-UX and HP-UX environments. One code is from the Command, Control, Communication and Intelligence (C3I) Parallel Benchmark suite and has been shown to have the potential for large-scale parallelization with straightforward multithreading techniques. The second benchmark exhibits the computationally dynamic behavior of a thermally driven explosion model. Both codes are shown to stress the HP systems' ability to keep memory close to processors and appropriate threads of execution.

1 System Architecture High-Level Overview

The Hewlett-Packard X-Class and V-Class servers are symmetric multiprocessor (SMP), cache-coherent nonuniform memory access (ccNUMA) systems, providing the illusion of simple, integrated memory. The fundamental building block of the X- and V-Class systems is the hypernode. Each hypernode is an SMP containing multiple processors, a crossbar, caches and synchronous DRAM (SDRAM) distributed across multiple memory boards. Hypernodes are connected through a ring interface, referred to as the Coherent Toroidal Interconnect (CTI).

1.1 Globally Shared Memory

All processors in X- and V-Class servers share memory within a hypernode (local memory) and across the entire collection of hypernodes (remote memory). The global shared memory (GSM) subsystem has two levels, each of which is tuned for a particular class of data sharing.
Level one includes a crossbar connecting memory, processors and I/O within a hypernode. The crossbar provides high-bandwidth, low-latency, nonblocking access from processors and I/O channels to hypernode-local memory. Level two encompasses the interconnection between hypernodes through the use of CTI rings. The CTI is a collection of rings used to access remote memory across hypernodes. CTI is specially designed to enable extremely high-bandwidth (15.3 Gigabytes/second) data movement between processors, I/O devices and memory on a node. Memory references that the level-one memory subsystem cannot satisfy use both the crossbar and the CTI (level two) interconnect to reach data that is not in hypernode-local memory.

1.2 Globally Shared Memory and Cache Coherence

Every processor reference to memory causes a copy of the accessed data to be encached in either the instruction cache or the data cache of that processor. The processor-local cache holds the most frequently accessed data. When a cache miss is encountered, an attempt is made to retrieve the data from node-local SDRAM. When the requested data is resident in neither the local cache nor local memory, it resides in the memory of another node. Such remote data is obtained via the CTI interconnect. Since remote memory accesses come with a high latency cost, each node dedicates part of its local memory to a CTI cache. The CTI cache is responsible for caching data accessed by other nodes, thus reducing the time to retrieve remote memory.

When a processor modifies data within its data cache and another processor references the same data, a stale-data condition arises. The X- and V-Class hardware supports cache coherence, so each processor is assured that its caches always contain the latest data values. Hardware-supported cache coherence relieves the programmer from explicitly flushing caches and tending to expensive synchronization details.
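The lookup order described above (processor cache, then node-local SDRAM, then remote memory reached over the CTI, with a CTI cache to shorten repeated remote accesses) can be illustrated with a small sketch. This is a toy model under stated assumptions, not a description of the actual hardware: the names `Node` and `load` are hypothetical, the CTI cache is assumed to sit on the requesting node and hold copies of remote lines, and coherence traffic is ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy hypernode: local SDRAM plus a CTI cache (hypothetical model)."""
    local_memory: dict = field(default_factory=dict)
    cti_cache: dict = field(default_factory=dict)


def load(addr, cpu_cache, home, nodes):
    """Return (value, level), where level names where the data was found.

    cpu_cache is the requesting processor's cache, home is the index of
    the node that processor belongs to, and nodes is the list of all nodes.
    """
    # 1. Processor-local cache holds the most frequently accessed data.
    if addr in cpu_cache:
        return cpu_cache[addr], "processor cache"

    node = nodes[home]
    if addr in node.local_memory:
        # 2. Cache miss: try node-local SDRAM.
        value, level = node.local_memory[addr], "local memory"
    elif addr in node.cti_cache:
        # 3. Assumed behaviour: a copy fetched earlier from a remote node.
        value, level = node.cti_cache[addr], "CTI cache"
    else:
        # 4. The data lives in another node's memory; fetch it "over the CTI".
        owner = next((n for n in nodes if addr in n.local_memory), None)
        if owner is None:
            raise KeyError(addr)
        value, level = owner.local_memory[addr], "remote memory"
        node.cti_cache[addr] = value  # keep a copy to avoid the next remote trip

    cpu_cache[addr] = value  # encache the line in the requesting processor
    return value, level
```

Calling `load` twice for the same remote address from the same processor reports "remote memory" first and "processor cache" second; a different processor on the same node would instead hit the "CTI cache" level in this model.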
Maintaining coherent copies of processor cache is achieved by adherence to the following rules:

  • Any number of read encachements of a cache line can occur concurrently.
  • A cache line can be read-shared in multiple caches.
  • A processor must exclusively "own" a cache line in order to write data into it.
  • Modified cache lines must be written back to memory from the cache before they are overwritten.

Particular hardware characteristics of the platforms used in our benchmarking are listed in Table 1.

System  CPU      CPUs/Node  Nodes  Clock Speed  Data/Instr Cache  Memory/Node  OS
V2250   PA-8200  16         2      240 MHz      2MB/2MB           8GB          HP-UX 11.01
X2000   PA-8000  16         16     180 MHz      1MB/1MB           4GB          SPP-UX 5.3

Table 1: Platform Specifics

2 The Benchmark Problems

The U.S. Air Force Rome Laboratory C3I Parallel Benchmark Suite [1] consists of eight problems chosen to compactly represent the essential elements of real C3I applications. The C3I representative benchmark discussed in this paper is Threat Analysis, a time-stepped simulation involving the trajectories of incoming ballistic threats, with computation of options for intercepting the threats. The benchmark includes input data sets, a sequential C program, and output sets to validate correctness. Threat Analysis is computationally intensive, compact, and involves non-trivial data and control structures. It should make a good test of compiler parallelization effectiveness and of multithreaded performance analysis on the X- and V-Class architectures.
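To make the idea of "straightforward multithreading" over a time-stepped threat simulation concrete, the sketch below distributes independent threats across a pool of threads. It is a toy stand-in and not the benchmark's code: the physics (a one-dimensional fall under constant gravity), the altitude window, and the names `intercept_steps` and `analyze` are all assumptions made for illustration. The only property it borrows from the description above is that each threat is stepped through time and checked for interception opportunities.

```python
from concurrent.futures import ThreadPoolExecutor

G = 9.81  # m/s^2, gravitational acceleration (toy model only)


def intercept_steps(threat, dt=1.0, min_alt=1000.0, max_alt=5000.0):
    """Step one threat forward in time.

    threat is an (altitude_m, vertical_velocity_m_per_s) pair. Returns the
    indices of the time steps at which the threat lies inside the
    hypothetical interception window [min_alt, max_alt].
    """
    alt, vz = threat
    steps = []
    step = 0
    while alt > 0.0:
        if min_alt <= alt <= max_alt:
            steps.append(step)
        alt += vz * dt   # advance position with the current velocity
        vz -= G * dt     # then update velocity for the next step
        step += 1
    return steps


def analyze(threats, workers=4):
    """Process every threat independently, one task per threat."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with threats.
        return list(pool.map(intercept_steps, threats))
```

Because the threats do not share state, no locking is needed and the work can be divided coarsely, one threat per task. In CPython the global interpreter lock prevents these threads from executing Python bytecode in parallel, so the sketch shows the decomposition only; it says nothing about the speedups a compiled, multithreaded C version would obtain on the X- and V-Class.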

Publication date: 2004