A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors

Author

  • Amir Roth
Abstract

A store queue (SQ) is a critical component of the load execution machinery. High-ILP processors require high load execution bandwidth, but providing high-bandwidth SQ access is difficult. Address banking, which works well for caches, conflicts with the age ordering that the SQ requires, and multi-porting exacerbates the latency of the associative searches that load execution requires. In this paper, we present a new high-bandwidth load-store unit design that exploits the predictability of forwarding behavior. To start with, a simple predictor filters loads that are not likely to require forwarding from accessing the SQ, enabling a reduction in the number of associative ports. A subset of the loads that do not access the SQ are re-executed prior to retirement to detect over-aggressive filtering and train the predictor. A novel adaptation of a Bloom filter keeps the re-execution subset minimal. Next, the same predictor filters stores that don’t forward values to nearby loads from the SQ, enabling a substantial capacity reduction. To enable this optimization and maintain in-order store retirement, we add a second SQ that contains all stores but is used only for retirement and Bloom filter management; this queue is large but isn’t associatively searched. Finally, to boost both load and store filtering and to handle programs with heavy forwarding bandwidth requirements, we add a second, address-banked forwarding structure that handles “easy” forwarding instances, leaving the globally-ordered SQ to handle only “tricky” cases. Our design does not directly address load queue scalability, but it does dovetail with a recent proposal that also uses re-execution to tackle this issue. Performance simulations on SPEC2000 and MediaBench benchmarks show that our design comes within 2% (7% in the worst case) of the performance of an ideal multi-ported SQ, using only a 16-entry queue with a single associative lookup port.
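The load-filtering idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class names `ForwardingPredictor` and `StoreSequenceFilter`, the table sizes, the 2-bit counters, the index hashes, and the sequence-number comparison are all hypothetical choices made for illustration. The first class decides whether a load searches the SQ; the second is a Bloom-filter-like table that decides whether a load that skipped the SQ should be re-executed before retirement.

```python
class ForwardingPredictor:
    """PC-indexed table of 2-bit saturating counters (sizes are assumptions)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [0] * entries  # 0 = predicted not to need forwarding

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def should_search_sq(self, pc):
        # Only loads predicted to need forwarding use an associative SQ port.
        return self.counters[self._index(pc)] >= 2

    def train(self, pc, forwarded):
        i = self._index(pc)
        if forwarded:
            self.counters[i] = 3  # a missed forward is costly, so saturate at once
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class StoreSequenceFilter:
    """Bloom-filter-like table: hashed address -> sequence number of last store."""

    def __init__(self, entries=256):
        self.entries = entries
        self.last_store_seq = [-1] * entries  # -1 = no store recorded yet

    def _index(self, addr):
        return (addr >> 3) % self.entries

    def record_store(self, addr, seq):
        # Assumed to be called in program order as stores retire.
        self.last_store_seq[self._index(addr)] = seq

    def needs_reexecution(self, addr, oldest_unseen_store_seq):
        # The load skipped the SQ, so it saw only stores older than
        # oldest_unseen_store_seq. Aliasing can cause extra re-executions
        # (false positives) but never hides a store the load missed.
        return self.last_store_seq[self._index(addr)] >= oldest_unseen_store_seq


predictor = ForwardingPredictor()
store_filter = StoreSequenceFilter()
load_pc, addr = 0x400100, 0x1000

skipped_sq = not predictor.should_search_sq(load_pc)  # cold predictor: skip the SQ
store_filter.record_store(addr, seq=7)                # store 7 retires before the load
if skipped_sq and store_filter.needs_reexecution(addr, oldest_unseen_store_seq=5):
    predictor.train(load_pc, forwarded=True)          # filtering was over-aggressive
```

In this model a load whose hashed address has no recent store retires without re-execution, which is the sense in which the filter keeps the re-execution subset small.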


Related resources

Volume Editors

The window-based stream join is an important operator in all data streaming systems. It has often high resource requirements so that many efficient sequential as well as parallel versions of it were proposed in the literature. The parallel stream join operators recently gain increasing interest because hardware is getting more and more parallel. Most of these operators, however, are only optimi...


Stream Join Processing on Heterogeneous Processors

The window-based stream join is an important operator in all data streaming systems. It has often high resource requirements so that many efficient sequential as well as parallel versions of it were proposed in the literature. The parallel stream join operators recently gain increasing interest because hardware is getting more and more parallel. Most of these operators, however, are only optimi...


Addressing Processor Over-provisioning on Large-scale Multi-core Platforms

Modern micro-architectures have embraced multi-core processors and thread-level parallelism for performance growth, because of the difficulty of increasing single core performance without significantly increasing processor power consumption. To meet the ever growing need for speed, current large-scale computing platforms are Nonuniform Memory Accesses (NUMA) architectures equipped with dozens o...


Improving Memory Access Performance Using a Code Coalescing Unit

High clock frequencies combined with deep pipelining employed by many of the state-of-the-art processors have forced cache hit accesses to be multi-cycle operations. For many programs, untolerated load latencies account for a significant portion of total execution time. In this paper, we present a mechanism called the Code Coalescing Unit (CCU) that can identify and eliminate at run-time several...


Generating Multi-Threaded code from Polychronous Specifications

SIGNAL, Lustre, Esterel, and a few other synchronous programming language compilers accomplish automated sequential code generation from synchronous specifications. In generating sequential code, the concurrency expressed in the synchronous programs is sequentialized mostly because such embedded software was designed to run on single-core processors. With the widespread advent of multi-core pro...




Publication date: 2004