Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors

Authors

Fengguang Song

Shirley Moore

Jack Dongarra

Abstract

It is critical to provide high performance for scientific programs running on a Chip Multi-Processor (CMP). A CMP architecture often has a shared L2 cache and a shared lower storage hierarchy. The shared L2 cache can reduce the number of cache misses if the data are commonly shared by several threads, but it can also degrade performance through resource contention. In some cases, running threads on all cores causes severe contention and greatly increases the number of cache misses. To investigate how a thread's performance varies when it runs together with other threads on different cores, we develop an analytical model to predict the number of misses on the shared L2 cache, especially for thread-parallel numerical codes. We assume that the parallel threads work on homogeneous tasks and share a fully associative L2 cache. The stack processing technique and circular sequences are used to analyze the L2 trace and predict the number of compulsory misses, capacity misses on shared data, and capacity misses on private data. It is the first work to predict the number of L2 misses for threads that share memory. The model has been validated with three typical scientific programs: matrix multiplication, blocked matrix multiplication, and sparse matrix-vector product, on a variety of matrix sizes. The average relative error lies between 2% and 12%.

This material is based upon work supported by the National Science Foundation under grant No. 0444363.
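The stack processing idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes an LRU replacement policy (the abstract does not state one), handles a single merged trace, and does not separate shared from private data or use circular sequences. The function name `classify_misses` and its parameters are hypothetical, chosen only for illustration.

```python
def classify_misses(trace, cache_lines):
    """Classify accesses for a fully associative cache of `cache_lines` lines.

    Assumption (not from the abstract): LRU replacement.
    `trace` is an iterable of cache-line addresses.
    """
    stack = []  # LRU stack; the most recently used line is at the end
    counts = {"hit": 0, "compulsory": 0, "capacity": 0}
    for line in trace:
        if line in stack:
            # Stack distance: number of distinct lines touched since the
            # previous access to `line`, counting `line` itself.
            distance = len(stack) - stack.index(line)
            if distance <= cache_lines:
                counts["hit"] += 1
            else:
                # The line was seen before but has been pushed out.
                counts["capacity"] += 1
            stack.remove(line)
        else:
            # First reference to this line ever: a compulsory miss.
            counts["compulsory"] += 1
        stack.append(line)
    return counts


if __name__ == "__main__":
    print(classify_misses([0, 1, 2, 0, 3, 1], cache_lines=2))
```

Because the stack distance of each access does not depend on the cache size, one pass over the trace is enough to evaluate several cache sizes; this sketch recomputes the counts per size only to stay short.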


Similar resources

Impact of the Interconnect on Performance and Area/Power for High Core Count (> 8) CMPs

Introduction: Faithful to Moore’s law, silicon processing improvements have continually increased the number of transistors available for implementing CPUs within a fixed die area. The designer is left with the choice of how to put those transistors to use. Superscalar processors are organized into parallel pipelines which aggressively seek to execute instructions within a single thread in paral...


Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) – a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these ...


Power-aware Speed-up for Multithreaded Numerical Linear Algebraic Solvers on Chip Multicore Processors

With the advent of multicore chips new parallel computing metrics and models have become essential for redesigning traditional scientific application libraries tuned to a single chip. In this paper we evolve metrics specific to generalized chip multicore processors (CMP) and use them for parallel performance modeling of numerical linear algebra routines that are commonly available as shared obj...


Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Simultaneous Multi-threading (SMT) has been developed to increase instruction level parallelism by allowing instructions from a different thread to run during a stall. Inter-thread cache interference, however, might limit the benefit of running multiple independent threads. SMT processors can be utilized in a different model, where a helper thread is used to prefetch cache blocks for the main e...


Building a Domain-Knowledge Guided System Software Environment to Achieve High-Performance of Multi-core Processors

Although multi-core processors have become dominant computing units in basic system platforms from laptops to supercomputers, software development for effectively running various multi-threaded applications on multi-cores has not made much progress, and effective solutions are still limited to high performance applications relying on existing parallel computing technology. In practice, majority ...



Publication date: 2006
