NUMA-Aware Multicore Matrix Multiplication
Abstract
A novel user-level scheduling scheme, together with a specific data alignment method, is presented for matrix multiplication on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Addressing the data locality problem that arises in such systems alleviates memory bottlenecks for problems with large input data sets. Experiments show that a large number of cache misses occur when a NUMA-agnostic thread scheduler (such as OpenMP 3.1) is used with its own data placement on a ccNUMA machine. The proposed technique alleviates this problem by tuning an existing matrix multiplication implementation from the BLAS library. The data alignment and its associated scheduling reduce the number of cache misses by 67% and, consequently, the computation time by up to 22%. The evaluation metric is a relationship between the number of cache misses and the speedup gained.
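The abstract does not spell out the scheduling or alignment scheme, so the following is only a minimal sketch of one common way to pair data placement with scheduling: split the rows of the left operand into contiguous blocks, one per NUMA node, and let each node compute only the output rows it owns. The names `partition_rows`, `multiply_block`, and `numa_style_matmul` are hypothetical, and the code runs sequentially in pure Python; it models the ownership pattern only, not the paper's method, thread pinning, or memory allocation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def partition_rows(n_rows: int, n_nodes: int) -> List[Tuple[int, int]]:
    """Split rows [0, n_rows) into n_nodes contiguous half-open ranges."""
    base, extra = divmod(n_rows, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def multiply_block(a: Matrix, b: Matrix, row_range: Tuple[int, int]) -> Matrix:
    """Compute the rows of A*B that fall inside row_range."""
    lo, hi = row_range
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(lo, hi)
    ]


def numa_style_matmul(a: Matrix, b: Matrix, n_nodes: int) -> Matrix:
    """Assemble A*B from per-node row blocks (sequential model)."""
    c: Matrix = []
    for row_range in partition_rows(len(a), n_nodes):
        # Assumption: on a real ccNUMA machine this block would run on
        # threads pinned to one node, with its rows of A and C resident
        # in that node's local memory.
        c.extend(multiply_block(a, b, row_range))
    return c
```

For example, `partition_rows(10, 4)` assigns rows 0-2, 3-5, 6-7, and 8-9 to four nodes. Keeping each block contiguous means a node's threads touch a single region of the operand and of the result, which is the kind of locality the abstract credits for the reduction in cache misses.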
Similar resources
Lock cohorting: A general technique for designing NUMA locks
Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines’ non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any...
A Comparative Review of Contention-Aware Scheduling Algorithms to Avoid Contention in Multicore Systems
Contention for shared resources on multicore processors is an emerging issue of great concern, as it directly affects the performance of multicore CPU systems. In this regard, contention-aware scheduling algorithms provide a convenient and promising solution, aiming to reduce contention by applying different thread migration policies to the CPU cores. The main problem faced by latest research when...
Modeling Memory System Performance of NUMA Multicore-Multiprocessors
The performance of many applications depends closely on the way they interact with the computer’s memory system: Many applications obtain good performance only if they utilize the memory system efficiently. Unfortunately, obtaining good memory system performance is often difficult, as developing memory system-aware (system) software requires a thorough and detailed understanding of both the cha...
Efficient Multicore Sparse Matrix-Vector Multiplication for Finite Element Electromagnetics on the Cell-BE processor
Multicore systems are rapidly becoming a dominant industry trend for accelerating electromagnetics computations, driving researchers to address parallel programming paradigms early in application development. We present a new sparse representation and a two level partitioning scheme for efficient sparse matrix-vector multiplication on multicore systems, and show results for a set of finite elem...
Performance of a Multicore Matrix Multiplication Library
Multicore processors promise dramatic improvements in performance, but their diverse and often unique architectures are a major inhibitor to software adoption. Algorithm libraries that operate at the chip level and are optimized across multiple cores provide the quickest route by which programmers can port or develop high-performance software for multicores. This paper reports on a flexible matr...
Journal title:
- Parallel Processing Letters
Volume 24
Publication date: 2014