Microarchitecture for Billion-Transistor VLSI Superscalar Processors

Authors

Gabriel Hsiuwei Loh

Dana S. Henry

Abstract

The vast computational resources in billion-transistor VLSI microchips can continue to be used to build aggressively clocked uniprocessors for extracting large amounts of instruction-level parallelism. This dissertation addresses the problems of implementing wide-issue, out-of-order, superscalar processors capable of handling hundreds of in-flight instructions. The specific issues covered are the critical circuits that comprise the superscalar core, the increasing level-one data cache latency, the need for more accurate branch prediction to keep such a large processor busy, and the difficulty of quickly evaluating such complex processor designs.

Using scalable circuit designs, large instruction windows may be implemented with fast clock speeds. We design and optimize the critical circuits in a superscalar execution core. At comparable clock speeds, an instruction window implemented with our circuits can simultaneously wake up and schedule 128 instructions, compared to only twenty instructions in the Alpha 21264.

Augmenting our processor with clustered, speculative Level Zero (L0) data caches provides fast accesses to the data cache despite the increasing distance across the core to the Level One cache. Large superscalar execution cores of future processors may take up so much area that a load from memory requires multiple cycles to propagate across the core, access the cache, and propagate the result back. Multiple L0 caches provide fast, one-cycle cache accesses at the cost that the value read from an L0 cache may occasionally be incorrect. An eight-cluster superscalar processor augmented with our L0 caches achieves an overall performance that is within 2% of an idealized, unimplementable processor that ignores the additional wire delay of propagating signals across the large execution core. We show how the L0 caches can boost the performance of large superscalar processors as well as a range of other possible design points.

Highly accurate prediction of conditional branches is necessary to maintain a steady flow of instructions to the execution core. We explore how to take advantage of the large transistor budget of future processors to build more accurate hardware branch prediction algorithms. In particular, we make use of results from the machine learning field in combining results from multiple predictions. At a 32KB hardware budget, our predictor outperforms the best previously published branch predictor with a 200KB budget. We also take an information-theoretic approach to the analysis of existing branch prediction structures. Our results show that the average information content conveyed by the hysteresis bit of a saturating two-bit counter in an 8192-entry gshare predictor is only 1.11 bits. This motivates our shared split counter, which shares some state between multiple counters, achieving an effective cost of less than 1.5 bits per counter. Using shared split counters instead of saturating two-bit counters enables the implementation of smaller, and therefore faster, branch prediction structures.

As the size and complexity of processors increase, so does the difficulty of the computational task of evaluating potential processor designs. The final contribution of this dissertation is a critical-path-based approach to estimating the performance of superscalar processors. Our technique uses a fast in-order functional processor simulator to provide a program trace. By applying a set of efficient time-stamping rules to the trace, we obtain an accurate estimate of the critical path of the program in less than half of the simulation time of a cycle-accurate simulator.
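The idea of time-stamping a trace can be illustrated with a minimal sketch. The abstract does not state the dissertation's actual rules, so everything below is an assumption made for illustration: the `TraceEntry` record, its field names, the fixed per-instruction latencies, and the single rule used (an instruction starts once all of its source registers are ready). A real set of rules would presumably also model constraints such as window size, issue width, and branch mispredictions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One dynamic instruction from an in-order functional trace (hypothetical format)."""
    dest: Optional[str]      # destination register, or None if nothing is written
    srcs: Tuple[str, ...]    # source registers read by the instruction
    latency: int             # assumed execution latency in cycles


def critical_path_length(trace: Iterable[TraceEntry]) -> int:
    """Estimate the data-dependence critical path by time-stamping each instruction.

    Illustrative rule only: an instruction starts at the latest ready time of
    its sources and finishes `latency` cycles later.
    """
    ready: Dict[str, int] = {}   # register -> cycle at which its value is available
    longest = 0
    for entry in trace:
        # Registers never written in the trace are treated as ready at cycle 0.
        start = max((ready.get(reg, 0) for reg in entry.srcs), default=0)
        finish = start + entry.latency
        if entry.dest is not None:
            ready[entry.dest] = finish
        longest = max(longest, finish)
    return longest


example_trace = [
    TraceEntry(dest="r1", srcs=(), latency=1),
    TraceEntry(dest="r2", srcs=("r1",), latency=3),
    TraceEntry(dest="r3", srcs=(), latency=1),
    TraceEntry(dest="r4", srcs=("r2", "r3"), latency=1),
]
estimated_cycles = critical_path_length(example_trace)
```

Because each instruction is visited once and only a table of register ready times is kept, a pass of this kind is linear in the trace length, which is consistent with the abstract's point that such an estimate can be cheaper than cycle-accurate simulation.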

Similar resources

Application of instruction analysis/scheduling techniques to resource allocation of superscalar processors

This paper presents the development of instruction analysis/scheduling CAD techniques to measure the distribution of functional unit usage and the micro operation level parallelism (MLP), which together determine the proper functional unit allocation for superscalar microprocessors, such as the x86 microprocessors. The proposed techniques fit in the early design exploration phase in which the t...

Superspeculative Microarchitecture for Beyond AD 2000

In its brief lifetime of 26 years, the microprocessor has achieved a total performance growth of 10,000 times thanks to technology improvements and microarchitecture innovations. Transistor count and clock frequency have increased by an order of magnitude in each of the first two decades of microprocessors; transistor count increased from 10,000 to 100,000 in the 1970s and up to 1 million in t...

ExtraTime: A Framework for Exploration of Clock and Power Gating for BTI and HCI Aging Mitigation

Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) are two major causes of transistor aging at the nanoscale, leading to slower devices, more failures during runtime, and ultimately reduced lifetime. Typically these issues are handled by adding extra guardbands to the design, i.e., overdesign, which results in lower clock frequencies and hence performance losses. Alternatively, e...

Hierarchical Multi-Threading For Exploiting Parallelism at Multiple Granularities

As we approach billion-transistor processor chips, the need for a new architecture to make efficient use of the increased transistor budget arises. Many studies have shown that significant amounts of parallelism exist at different granularities that is yet to be exploited. Architectures such as superscalar and VLIW use centralized resources, which prohibit scalability and hence the ability to make ...

The Microarchitecture of Superscalar Processors

Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with a discussion of the general problem solved by superscalar p...

Publication date: 2002
