Evaluating Memory Architectures for Media Applications on Coarse-Grained Reconfigurable Architectures
Abstract
Reconfigurable ALU Array (RAA) architectures, a popular class of Coarse-grained Reconfigurable Architectures, are gaining in popularity, especially for media applications, due to their flexibility, regularity, and efficiency. In such architectures, memory is critical not only for configuration data but also for the heavy data traffic required by the application. Hence, system designers would like to evaluate the effects of different memory architectures and memory traffic early in the design process. In this paper, we offer a scheme for system designers to quickly estimate the performance of media applications on RAA architectures. The proposed scheme is based on the performance-oriented model of RAA architectures, which we develop to model different memory architectures in a uniform way so as to allow for easy mapping of application loops and early performance estimation. Our experimental results estimating the performance of multimedia applications on three memory architectures demonstrate the flexibility of our memory architecture evaluation scheme as well as the varying effects of the memory architectures on the application performance, which also signifies the need for memory architecture evaluation early in the design process.
A trend in the architectural platforms for media applications is the adoption of reconfigurable computing elements for cost, performance, and flexibility issues [1]. Coarse-grained Reconfigurable Architectures (CRAs), mostly stressed by reconfigurable computing, are known to be flexible as well as efficient as demonstrated by recent architectures [2]–[4]. Reconfigurable ALU Array (RAA) architectures [4]–[8] form the most popular class of CRAs, which are built on a 2D array of Processing Elements (PEs) communicating via programmable interconnects.
Though each PE can perform only a limited set of operations such as addition and multiplication, through dynamic reconfiguration during runtime the 2D array datapath can be programmed to perform different algorithms (typically, critical loops of the application) very efficiently. This regular datapath structure makes the RAA architectures very well suited to media applications, which are also characterized by structured and regular computations on large data sets [9].
This research was conducted while the first two authors were visiting UC Irvine, and supported in part by grants from NSF (CCR-0203813 and CCR-0205712) and Hitachi Ltd. We also thank members of the UCI EXPRESS compiler team for their assistance.
The acceleration offered by RAA architectures comes mainly from two sources. First, the arithmetic operations can be parallelized on the PE array, with the maximum speedup equal to the number of PEs executing in parallel. Second, the memory operations can be implemented in a much more efficient way on RAA architectures by utilizing hardware addressing support provided by the architecture. For example, in MorphoSys [4] the local frame buffer can generate data streams that are sequentially used by the PE array, so that there is no explicit addressing needed for data transfers. Thus, RAA architectures can accelerate not only the memory access operations (through parallel memory access) but also the array index manipulation operations if the array is accessed with a scan pattern supported by the memory architecture.
On the other hand, there is some overhead with RAA execution as well. First, there is reconfiguration overhead whenever a new loop is loaded on the RAA. Also, if a loop uses more than one configuration (because, for example, the loop contains too many operations to be mapped onto the PE array using only one configuration), it may be necessary to switch between multiple configurations throughout the loop execution, significantly adding to the reconfiguration overhead.
Second, the input data for the PE array may need to be transferred from the main memory to the RAA local data buffer before or during the loop execution. Likewise, the output results may need to be transferred, too. Note, however, that some overhead may effectively be removed by optimizations. For example, the initial reconfiguration overhead of a loop may be hidden by starting the reconfiguration long before the loop is reached.
We target our research at the rapid and quantitative evaluation of RAA architectures for architecture design and exploration. This early evaluation can be very valuable, as today’s application mapping is typically done by hand, making it virtually impossible to compare and explore various architectural options. To derive first-order performance estimation with reasonable accuracy, we need to identify critical parameters of the architecture that have the biggest impact on the performance of applications. From this point of view, the memory architecture and memory operations are very important, not only because memory operations account for a large portion of the execution cycles for media applications but also because there can be potentially greater diversity in the memory architecture [10] than in the PE array (assuming a fixed granularity for the RAA). Hence, in this paper we focus on the memory subsystem of RAA architectures, although our performance estimation scheme covers the other parts as well.
Our performance estimation is based on the performance-oriented view, which we present as an abstract model of RAA architectures for early performance estimation. The performance-oriented view has a set of array-level operations defined with it, allowing for a more natural representation of RAA architectures as well as an easier mapping of application loops for early performance estimation.
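
The sources of speedup and overhead discussed above (parallel arithmetic on the PE array, reconfiguration, and data transfer between main memory and the local buffer) can be combined into a rough first-order cycle estimate. The following is a minimal sketch only, not the performance-oriented model of the paper: the class names, fields, and the assumptions of perfect PE utilization and non-overlapping transfers are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class LoopProfile:
    """Hypothetical summary of one application loop (all fields are assumptions)."""
    iterations: int          # loop trip count
    ops_per_iteration: int   # arithmetic operations per iteration
    words_in: int            # words moved from main memory to the local buffer
    words_out: int           # words moved back to main memory
    config_loads: int = 1    # configuration loads over the whole loop


@dataclass
class RaaParams:
    """Hypothetical architecture parameters, not the DRAA parameter set."""
    num_pes: int             # PEs that can execute in parallel
    reconfig_cycles: int     # cycles needed to load one configuration
    words_per_cycle: int     # main memory <-> local buffer bandwidth
    hide_initial_reconfig: bool = False  # first load started before the loop


def estimate_cycles(loop: LoopProfile, raa: RaaParams) -> dict:
    # Best case for compute: operations spread perfectly over all PEs,
    # so the speedup is bounded by the number of PEs.
    compute = ceil(loop.iterations * loop.ops_per_iteration / raa.num_pes)
    # Every configuration load costs a fixed number of cycles; the first
    # one may be hidden by starting it before the loop is reached.
    loads = loop.config_loads - (1 if raa.hide_initial_reconfig else 0)
    reconfig = max(loads, 0) * raa.reconfig_cycles
    # Assumption: transfers do not overlap with computation.
    transfer = ceil((loop.words_in + loop.words_out) / raa.words_per_cycle)
    return {"compute": compute, "reconfig": reconfig, "transfer": transfer,
            "total": compute + reconfig + transfer}


if __name__ == "__main__":
    loop = LoopProfile(iterations=64, ops_per_iteration=16,
                       words_in=128, words_out=64)
    raa = RaaParams(num_pes=64, reconfig_cycles=100, words_per_cycle=4)
    print(estimate_cycles(loop, raa))
```

Returning the per-component breakdown rather than only the total makes it easy to see whether a given loop is bound by computation, reconfiguration, or memory traffic when the hypothetical parameters are varied.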
We demonstrate the efficacy of our technique through a set of experiments estimating the performance of multimedia applications on three memory architectures. Our experimental results not only demonstrate the flexibility of our memory architecture evaluation scheme but also show that the memory architecture can have quite different effects on the application performance depending on the characteristics of the application, signifying the need for memory architecture evaluation early in the design process.
The rest of the paper is organized as follows. In Section 2 we review some of the related work. In Section 3 we briefly describe the target architecture model and present the performance-oriented view of RAA architectures. In Section 4 we describe our performance estimation flow for media applications, which is based on the performance-oriented view. In Section 5 we present our experimental results using multimedia application benchmarks, and conclude the paper in Section 6.
Coarse-grained reconfigurable architecture has become an area of active research recently, with the increasing interest in reconfigurable computing in image and video applications [1], [11]. Lee et al. [12] have proposed a generic reconfigurable architecture template called DRAA representing a wide class of RAA architectures, for which a core mapping algorithm (placement and routing) has also been developed for loops based on loop pipelining [13]. Weinhardt et al. [14] have proposed a loop pipelining technique to exploit the high degree of parallelism in reconfigurable hardware. In loop pipelining, loops with loop-carried dependencies are difficult to pipeline for high throughput. To address this problem, Bondalapati [15] has proposed a technique called data context switching, which can improve the throughput by exploiting local memory elements to store the contexts. Maestre et al. [16] have provided another level of optimization for RAA architectures, which considers task scheduling and configuration memory management to reduce the configuration switch overhead.
Media applications have been extensively studied as they form a dominant workload in the computer industry. Ienne et al. [17] have presented a limit study on the performance improvement of media applications using the MediaBench application benchmark suite [18]. By incrementally relaxing the microarchitectural constraints of possible coprocessors (called ad-hoc functional units), they show that a significant speedup of up to 6 times is possible. In their study they find that the ability of ad-hoc functional units to access the data memory is a particularly critical architectural feature. Another study [9] on media benchmarks also confirms the importance of memory accesses in media applications. In an attempt to characterize the memory activity of media applications, they have found that the overall execution time of media applications correlates well with the amount of temporary memory accesses, although the memory access instructions are less than a third of the overall instruction mix. These studies point to the need for, as well as the feasibility of, evaluating memory architectures for media applications.
The importance of the memory architecture for media applications has been noted in the context of CRAs as well. As CRAs increase the computation throughput by deep pipelining, it becomes more important for the memory interface to provide an increased data rate to keep up with the computation rate. For this, Herz [10] has presented efficient memory interface architectures that support address generation and data sequencing for CRAs. Although they are very effective compared to software solutions, there is no quantitative study or technique that allows for the rapid evaluation of different options in organizing the memory subsystem of CRAs.
In this paper, we address this problem by providing a performance-oriented model of memory architectures and an evaluation scheme for media applications on RAA architectures.
In this section we first describe the target architecture, the DRAA (Dynamically Reconfigurable ALU Array). The DRAA represents a broad range of RAA architectures, facilitating compilation as well as the design space exploration of RAA architectures. Next we present the performance-oriented view of the DRAA, developed for early performance estimation.
3.1 DRAA: Generic Architecture Template
Figure 1 (a) illustrates the DRAA, a generic architecture template for RAA architectures. The DRAA [12] serves as an architectural platform which defines, with a set of architectural parameters,
[Figure 1: block labels include Registers, Configuration Cache, Reconfigurable Plane, Memory Interface, Main Memory, and Main Processor; flow labels include Application, Partitioning, C Compilation, DRAA Mapping, Architectural Parameters, Binary & Configuration, Cycle Level Simulation, and Performance.]
Similar resources
Exploiting the Distributed Foreground Memory in Coarse Grain Reconfigurable Arrays for Reducing the Memory Bottleneck in DSP Applications
This paper presents a methodology for memory-aware mapping on 2-Dimensional coarse-grained reconfigurable architectures that aims at minimizing the data memory accesses for DSP and multimedia applications. Additionally, the realistic 2-Dimensional coarse-grained reconfigurable architecture template that the mapping methodology targets models a large number of existing coarse-grain...
Space-efficient Mapping of 2D-DCT onto Dynamically Configurable Coarse-Grained Architectures
This paper shows an efficient design for 2D-DCT on dynamically configurable coarse-grained architectures. Such coarse-grained architectures can provide improved performance for computationally demanding applications as compared to fine-grained FPGAs. We have developed a novel technique for deriving computation structures for two dimensional homogeneous computations. In this technique, the speed of ...
Mapping Homogeneous Computations onto Dynamically Configurable Coarse-Grained Architectures
The Problem: Conventional FPGAs are fine-grained architectures, mainly designed for implementing bit-level tasks and random logic functions. Their performance is limited for computationally demanding applications over large word length data. A highly promising avenue that is being explored by many research groups is coarse-grained configurable architectures. These architectures are datapath-oriented...
The parXXL Environment: Scalable Fine Grained Development for Large Coarse Grained Platforms
We present a new integrated environment for cellular computing and other fine grained applications. It is based upon previous developments concerning cellular computing environments (the ParCeL family) and coarse grained algorithms (the SSCRAP toolbox). It is aimed to be portable and efficient, and at the same time to offer a comfortable abstraction for the developer of fine grained programs. A...
Power-Efficient Reconfiguration Control in Coarse-Grained Dynamically Reconfigurable Architectures
Coarse-grained reconfigurable architectures deliver high performance and energy efficiency for computationally intensive applications like mobile multimedia and wireless communication. This paper deals with the aspect of power-efficient dynamic reconfiguration control techniques in such architectures. Proper clock domain partitioning with custom clock gating combined with automatic clock gating...
Journal title: IJES
Volume 3
Publication date: 2003