The Combinatorics of Cache Misses during Matrix Multiplication
Abstract
In this paper we construct an analytic model of cache misses during matrix multiplication. The analysis applies to square matrices of size 2^m where the array layout function is given in terms of a function that interleaves the bits in the binary expansions of the row and column indices. We first analyze the number of cache misses for direct-mapped caches and then indicate how to extend this analysis to A-way associative caches. The work accomplishes two things. First, we construct fast algorithms to estimate the number of cache misses. Second, we develop a theoretical understanding of cache misses that will allow us, in subsequent work, to approach the problem of minimizing cache misses by appropriately choosing the bit interleaving function that goes into the array layout function.
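The setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's analytic model: it is a brute-force simulation, and several details are assumptions made only for illustration. The names `interleave_bits`, `row_major`, and `count_misses` are hypothetical, the interleaving shown is the Morton (Z-order) pattern, which is just one possible bit interleaving function, addresses are counted in matrix elements rather than bytes, the three matrices are assumed to sit back to back in memory, and the multiplication uses the plain i, j, k loop order.

```python
def interleave_bits(row, col, m):
    """Morton (Z-order) layout: column bits go to even positions, row bits to odd ones.

    Assumption: this is only one example of a bit interleaving function.
    """
    index = 0
    for bit in range(m):
        index |= ((col >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
    return index


def row_major(row, col, m):
    """Conventional row-major layout, included for comparison."""
    return (row << m) | col


def count_misses(m, num_lines, line_size, layout):
    """Count misses of a direct-mapped cache during C = A * B on 2^m x 2^m matrices.

    num_lines: number of cache lines; line_size: elements per line (both assumed).
    """
    n = 1 << m
    size = n * n
    cache = [None] * num_lines  # block number currently held by each cache line
    misses = 0

    def access(address):
        nonlocal misses
        block = address // line_size
        slot = block % num_lines  # direct-mapped: one possible slot per block
        if cache[slot] != block:
            cache[slot] = block
            misses += 1

    for i in range(n):
        for j in range(n):
            for k in range(n):
                access(0 * size + layout(i, k, m))  # read A[i][k]
                access(1 * size + layout(k, j, m))  # read B[k][j]
                access(2 * size + layout(i, j, m))  # update C[i][j]
    return misses


if __name__ == "__main__":
    for name, layout in (("row-major", row_major), ("bit-interleaved", interleave_bits)):
        print(name, count_misses(m=4, num_lines=16, line_size=4, layout=layout))
```

A simulation like this costs time proportional to the number of memory accesses, which is what motivates the fast estimation algorithms the abstract mentions; swapping in a different `layout` function is how one would compare interleaving choices empirically.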
Similar resources
Comparative Study of Cache Utilization for Matrix Multiplication Algorithms
In this work, the performance of the basic and Strassen's matrix multiplication algorithms is compared in terms of memory hierarchy utilization. The problem taken here is MATRIX MULTIPLICATION (Basic and Strassen's). Strassen's Matrix Multiplication Algorithm has a time complexity of O(n^2.81), compared with the Basic multiplication algorithm's time complexity of O(n^3). This slight reduction in time m...
Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors
It is critical to provide high performance for scientific programs running on a Chip MultiProcessor (CMP). A CMP architecture often has a shared L2 cache and lower storage hierarchy. The shared L2 cache can reduce the number of cache misses if the data are commonly shared by several threads, but it can also lead to performance degradation due to resource contention. Sometimes running threads on...
Efficient Matrix Multiplication Using Cache Conscious Data Layouts
This paper demonstrates performance improvements for matrix multiplication and mesh generation for Finite Element Method (FEM) by optimizing the memory hierarchy of traditional processors. The theory developed earlier is used to perform such optimizations. Our work provides a uniform methodology across multiple HPC platforms for optimizing the performance of the kernel codes (such as matrix tran...
Performance Optimization and Evaluation for Linear Codes
In this paper, we develop a probabilistic model for estimation of the numbers of cache misses during the sparse matrix-vector multiplication (for both general and symmetric matrices) and the Conjugate Gradient algorithm for 3 types of data caches: direct mapped, s-way set associative with random or with LRU replacement strategies. Using HW cache monitoring tools, we compare the predicted number...
NUMA-Aware Multicore Matrix Multiplication
A novel user-level scheduling, along with a specific data alignment method is presented for matrix multiplication in cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. Addressing the data locality problem that occurs in such systems alleviates memory bottlenecks in problems with large input data sets. It is shown experimentally that a large number of cache misses occur when using ...
Journal:
- J. Comput. Syst. Sci.
Volume 63
Publication date: 2001