NUMA-Aware Multicore Matrix Multiplication

Authors

  • Wail Y. Alkowaileet
  • David Carrillo-Cisneros
  • Robert V. Lim
  • Isaac D. Scherson
Abstract

A novel user-level scheduling scheme, together with a specific data alignment method, is presented for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Addressing the data locality problem that occurs in such systems alleviates memory bottlenecks in problems with large input data sets. It is shown experimentally that a large number of cache misses occur when a NUMA-agnostic thread scheduler (such as OpenMP 3.1) is used with its own data placement on a ccNUMA machine. The problem is alleviated by applying the proposed technique to tune an existing matrix multiplication implementation found in the BLAS library. The data alignment with its associated scheduling reduces the number of cache misses by 67% and, consequently, the computation time by up to 22%. The evaluation metric is the relationship between the number of cache misses and the speedup gained.
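The abstract does not spell out the scheduling or alignment scheme, so the following is only a minimal sketch of the general idea of pairing data placement with thread scheduling, not the authors' method. It assumes a row-band decomposition in which each (hypothetical) NUMA node owns a contiguous band of rows of A and C and one worker computes only its own band. The function names `partition_rows`, `multiply_block`, and `node_partitioned_matmul` are illustrative, and no actual thread pinning or node-local allocation is performed.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(n_rows, n_nodes):
    """Split rows 0..n_rows-1 into n_nodes contiguous, near-equal ranges."""
    base, extra = divmod(n_rows, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def multiply_block(a, b, row_range):
    """Compute the rows of C = A * B that fall inside row_range."""
    start, stop = row_range
    n_inner = len(b)
    n_cols = len(b[0])
    block = []
    for i in range(start, stop):
        row = [0] * n_cols
        for k in range(n_inner):
            a_ik = a[i][k]
            for j in range(n_cols):
                row[j] += a_ik * b[k][j]
        block.append(row)
    return block


def node_partitioned_matmul(a, b, n_nodes=2):
    """Multiply A and B with one worker per (hypothetical) NUMA node.

    Assumption: on a real ccNUMA machine each worker would be pinned to
    its node and its band of A and C allocated in that node's memory.
    Neither step is done here; only the work partitioning is shown.
    """
    ranges = partition_rows(len(a), n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        # pool.map preserves input order, so bands come back top to bottom.
        blocks = list(pool.map(lambda r: multiply_block(a, b, r), ranges))
    return [row for block in blocks for row in block]
```

For example, `partition_rows(5, 2)` assigns rows 0 to 2 to the first node and rows 3 to 4 to the second, so each worker only ever reads and writes the band it owns. In this sketch B is shared by all workers; how B is placed or replicated across nodes is left open.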

Similar resources

Lock cohorting: A general technique for designing NUMA locks

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines’ non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any...

A Comparative Review of Contention-Aware Scheduling Algorithms to Avoid Contention in Multicore Systems

Contention for shared resources on multicore processors is an emerging issue of great concern, as it directly affects the performance of multicore CPU systems. In this regard, contention-aware scheduling algorithms provide a convenient and promising solution, aiming to reduce contention by applying different thread migration policies to the CPU cores. The main problem faced by latest research when...

Modeling Memory System Performance of NUMA Multicore-Multiprocessors

The performance of many applications depends closely on the way they interact with the computer’s memory system: Many applications obtain good performance only if they utilize the memory system efficiently. Unfortunately, obtaining good memory system performance is often difficult, as developing memory system-aware (system) software requires a thorough and detailed understanding of both the cha...

Efficient Multicore Sparse Matrix-Vector Multiplication for Finite Element Electromagnetics on the Cell-BE processor

Multicore systems are rapidly becoming a dominant industry trend for accelerating electromagnetics computations, driving researchers to address parallel programming paradigms early in application development. We present a new sparse representation and a two level partitioning scheme for efficient sparse matrix-vector multiplication on multicore systems, and show results for a set of finite elem...

Performance of a Multicore Matrix Multiplication Library

Multicore processors promise dramatic improvements in performance, but their diverse and often unique architectures are a major inhibitor to software adoption. Algorithm libraries that operate at the chip level and are optimized across multiple cores provide the quickest route by which programmers can port or develop high-performance software for multicores. This paper reports on a flexible matr...

Journal:
  • Parallel Processing Letters

Volume 24  Issue 

Pages  -

Publication date: 2014