Modeling Algorithm Performance on Highly-threaded Many-core Architectures

Authors

  • Lin Ma
  • Roger Chamberlain
  • James Buckley
  • Jeremy Buhler
  • Tao Ju
Abstract

Chapter 1: Introduction
  1.1 Examples of Highly-threaded Many-core Architectures
  1.2 Research Questions
  1.3 Methodology for Performance Modeling
    1.3.1 Find Key Factors of Performance
    1.3.2 Correlate 3 Spaces of Parameters
    1.3.3 Define Performance Metric
  1.4 Contribution and Dissertation Structure

Chapter 2: Background and Related Work
  2.1 GPU Architectures and Programming Model
  2.2 Abstract Machine Models
    2.2.1 Sequential Machine Models
    2.2.2 Parallel Machine Models
    2.2.3 GPU Machine Models
  2.3 Calibrated Performance Models
  2.4 Algorithms for Memory Constrained Applications

Chapter 3: Threaded Many-core Memory (TMM) Model
  3.1 Abstraction of Highly-threaded Many-core Machines
    3.1.1 Architectures
    3.1.2 Parameters
    3.1.3 Applicability
  3.2 TMM Analysis Structure

Chapter 4: Application of the TMM Model
  4.1 All-pairs Shortest Path (APSP)
    4.1.1 Dynamic Programming via Matrix Multiplication
    4.1.2 Johnson’s Algorithm: Dijkstra’s Algorithm (Binary Heaps)
    4.1.3 Johnson’s Algorithm: Dijkstra’s Algorithm (Arrays)
    4.1.4 n Iterations of Bellman-Ford Algorithm
    4.1.5 Comparison of Various Algorithms
    4.1.6 Effect of Problem Size
    4.1.7 Empirical Validation
  4.2 String Matching
    4.2.1 Suffix Tree
    4.2.2 Suffix Array
    4.2.3 Comparison and Empirical Validation
  4.3 Fast Fourier Transform (FFT)
  4.4 Merge Sort
    4.4.1 Blocked Merge
    4.4.2 Merge Sort
  4.5 List Ranking
  4.6 Analysis of Additional Algorithms

Chapter 5: Calibrated Performance Model
  5.1 Performance Modeling
    5.1.1 Base Model
    5.1.2 Model Extension
  5.2 Model Application
    5.2.1 Synthetic Micro-benchmark for Hashing
    5.2.2 Parallel Bloom Filters Algorithm Design and Implement
    5.2.3 Bloom Filters in BLAST
    5.2.4 Model Use to Evaluate Performance Tradeoffs

Similar resources

High-Order Finite-differences on multi-threaded architectures using OCCA

High-order finite-difference methods are commonly used in wave propagators for industrial subsurface imaging algorithms. Computational aspects of the reduced linear elastic vertical transversely isotropic propagator are considered. Thread parallel algorithms suitable for implementing this propagator on multi-core and many-core processing devices are introduced. Portability is addressed through ...

Addressing Processor Over-provisioning on Large-scale Multi-core Platforms

Modern micro-architectures have embraced multi-core processors and thread-level parallelism for performance growth, because it is difficult to increase single-core performance without significantly increasing processor power consumption. To meet the ever-growing need for speed, current large-scale computing platforms are Non-Uniform Memory Access (NUMA) architectures equipped with dozens o...

Efficient implementation of sorting on multi-core SIMD CPU architecture

Sorting a list of input numbers is one of the most fundamental problems in computer science in general and in high-throughput database applications in particular. Although the literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detaile...

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Sparse matrix-matrix multiplication is a key kernel with applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing archite...

Efficient mapping and acceleration of AES on custom multi-core architectures

Multi-core processors can deliver significant performance benefits for multi-threaded software by adding processing power with minimal latency, given the proximity of the processors. Cryptographic applications are inherently complex and involve large computations. Most cryptographic operations can be translated into logical operations, shift operations, and table look-ups. In this paper we desi...

Publication date: 2015