ion on a Fixed-Size Linear Systolic Array
Authors
Abstract
Keywords: Matrix-vector multiplication, Linear systolic arrays, Hardware synthesis.

0898-1221/00/$ see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0898-1221(00)00231-5

E.I. Milovanović et al.

1. INTRODUCTION

Complex embedded systems, such as those found in a number of control, avionics, industrial, medical, automotive, and communication equipment, typically consist of a heterogeneous mix of hardware blocks: processor cores, general-purpose macro blocks, and application-specific architectures [1-3]. Since embedded systems are usually implemented by processors and application-specific hardware, the most common architecture in these systems can be characterized as one of coprocessing, i.e., a processor working in conjunction with dedicated hardware to deliver a specific application. These accelerator blocks are required to execute algorithms of high complexity such as convolution [4], correlation [5], FFT [6], DCT/IDCT [7], 1D and 2D filtering [8], matrix-vector and matrix-matrix multiplication [9-11], etc.

In this paper, we concentrate on:
(i) the generation of an application-specific coprocessor architecture of type SA/PA dedicated to computing several different signal processing and scientific algorithms which utilise matrix-vector operations;
(ii) the host/coprocessor interface, which defines another architectural variable that strongly affects the data transfer rate between the host and the accelerator; and
(iii) a study of the performance of the final design.
The proposed approach is limited to a specific class of coprocessor architectures: the linear systolic/processor array, one of the most popular and important regular arrays, which offers several attractive features such as bounded I/O, fault tolerance, a minimal communication pattern, and modular extendability. The research reported in this paper focuses on designs where the hardware-software partition is assumed to be fixed. Beginning with an initial hardware-software partition, we seek a design solution that maximizes the speedup for a given application. Accordingly, in the rest of the paper we first propose an efficient algorithm for matrix-vector multiplication which improves the hardware utilization of the SA/PA; we then present a specific hardware interface, located between the host and the SA/PA, intended for address transformation, which optimizes memory access by eliminating extraneous main-memory operations.

2. AN OVERVIEW OF THE RELATED WORK

SAs and PAs consisting of large numbers of identical processing elements (PEs) have been popular accelerator candidates in VLSI/WSI technology. The parallel and pipelined processing capabilities of SAs/PAs can provide very high computational throughput in real-time applications. Different interconnection topologies have different properties, and different algorithms run best on different accelerator architectures. For instance, hexagonally connected meshes are used for LU decomposition, binary trees for sorting, tori for transitive closure, double trees for searching, linear arrays for convolution, correlation, FFT, and matrix-vector multiplication, rectangular or hexagonal arrays for matrix multiplication, and so on [5]. This paper examines the problem of determining an efficient linear SA/PA as an accelerator architecture for matrix-vector multiplication.
Historically, a group of researchers headed by Kung introduced the systolic concept for parallel architectures [12,13] in the period from 1978 to 1982. The architectural concept that, owing to its simple design, inspired the most researchers to seek better design solutions was the bidirectional linear systolic array (BLSA), since it is amenable to solving a variety of problems in engineering, scientific, and signal processing applications [14-16]. The array introduced in [13] for matrix-vector multiplication was neither time nor space optimal [17]. The problem of how to optimize the space and/or time parameters of this array has therefore stimulated considerable research interest. Such optimization can be accomplished either through higher complexity of the processing elements (see, for example, [16,18-20]), or by modifying the design procedure while keeping the PE complexity intact (see, for example, [21-24]).

When the size of the problem is larger than the number of PEs available in the array, algorithm partitioning and time multiplexing of hardware resources must take place [25]. For several reasons, the partitioning of algorithms for SAs is not a simple problem. First, a poor allocation of computations to processors may lower the speedup factor. Also related to this problem are the amount of external storage and the communication links introduced by the partitioning. Among the papers considering this problem are [25,26]. The approach to algorithm partitioning used in [25] is to divide the index space into bands and to map those bands into the processor space. The authors proved that partitioning and mapping of iterative algorithms into array processors can be done by the same transformation function as in the nonpartitioned case, provided that an extra constraint is satisfied. However, the approach used in [25] does not minimize the processing time. Navarro et al.
[26] proposed a method to transform dense matrices of any dimension into band matrices of the desired bandwidth, so that an easy and adequate match to the dimension of the BLSA is achieved. The proposed transformation allows good utilization of the PEs in the BLSA.

Most of the previous approaches to matrix-vector multiplication on SAs treat the design process as independent of the hardware implementation [13,16,18-20,23-26]. Also, it is usually assumed that communication operations are conducted by the host using memory-mapped I/O. However, memory-mapped I/O is often an inefficient mechanism for data transfer. More efficient methods, including dedicated device drivers and interconnection networks consisting of pipeline registers, take care of moving data from main memory to the accelerator architecture, and vice versa, without creating a communication bottleneck, but their relation to partitioning has not been articulated yet [5].

One of the main goals in hardware synthesis of the accelerator for matrix-vector multiplication is the optimization of storage requirements. The intensive memory referencing of these behaviors necessitates the use of secondary storage (e.g., a memory system), since the primary store (e.g., PE register storage), if made sufficiently large, would be impractical. This memory is explicitly addressed in a synthesized system by memory operations containing indexing functions. However, due to the bottleneck that a memory system represents, memory accessing operations must be effectively scheduled so as to optimize memory access. In this paper, we present specific hardware for address generation which optimizes memory access by eliminating extraneous main-memory operations.

We consider the multiplication of matrix $A = (a_{ik})_{n \times n}$ by vector $b = (b_k)_{n \times 1}$ on the BLSA comprised of $p \le [n/2]$ processing elements. Since $p \le n$, matrix $A$ is partitioned into quasi-diagonal blocks. Each block matrix contains $p$ quasidiagonals.
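The partitioning into blocks of p quasidiagonals can be illustrated with a small sequential sketch. It is a software reference only, not the systolic schedule, and it rests on an assumption that this excerpt does not confirm: quasidiagonal d is taken to be the wrapped diagonal holding the entries a[i][(i + d) mod n]. The function names `quasi_diagonal_blocks` and `block_matvec` are hypothetical.

```python
def quasi_diagonal_blocks(n, p):
    """Group diagonal offsets 0..n-1 into consecutive blocks of at most p.

    Assumption: quasidiagonal d is the wrapped diagonal a[i][(i + d) % n].
    """
    return [list(range(start, min(start + p, n))) for start in range(0, n, p)]


def block_matvec(A, b, p):
    """Compute c = A b by accumulating one block of quasidiagonals at a time."""
    n = len(A)
    c = [0] * n  # result accumulator, starts at zero
    for block in quasi_diagonal_blocks(n, p):
        # One "iteration": add the contribution of this block's diagonals.
        for d in block:
            for i in range(n):
                k = (i + d) % n  # column index on wrapped diagonal d
                c[i] += A[i][k] * b[k]
    return c
```

Because every entry of A lies on exactly one wrapped diagonal, the accumulated result equals the ordinary product for any p between 1 and n; only the grouping of work into iterations changes.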
The computation begins with multiplying the first block matrix by vector $b$, which gives the first iteration of $\vec{c}$, i.e., $\vec{c}^{(1)}$. In order to enable the second iteration to begin immediately, an index transformation of the next block matrix and $\vec{c}^{(1)}$ is performed. The transformation is a function of $n$ and $p$. It can be described as a perfect shuffle followed by shifting. This transformation ensures that no null elements are inserted between the iterations, which decreases the processing time approximately two times compared to that in [26]. The memory system of the accelerator architecture is realized as dual-port RAMs.

3. GLOBAL ARCHITECTURAL MODEL OF THE ACCELERATOR

Before we proceed with the mathematical model and a discussion of the design methodology, we summarize the assumed system architecture. Figure 1 shows the overall structure of the system. It is comprised of a host, which includes the CPU, main memory, and I/O subsystem, and a hardware accelerator intended for matrix-vector multiplication. Two parts can be distinguished within the accelerator: the BLSA with $p \le [n/2]$ PEs, which performs the computation, and a memory-interface subsystem (MIS), an efficient interconnection consisting of dual-port RAMs that takes care of moving data to/from the BLSA without creating a communication bottleneck. The MIS is comprised of $p$ dual-port RAMs, denoted as DPR_A_i ($i = 0, 1, \ldots, p-1$), one denoted as DPR_B, and one denoted as DPR_C. The dual-port RAMs are used for storing data elements prior to their being fed into the BLSA. Namely, matrix $A$ is stored in DPR_A_i, $i = 0, \ldots, p-1$, vector $b$ in DPR_B, and all locations of DPR_C are set to zero.

4. MODEL FOR MATRIX-VECTOR MULTIPLICATION ON FIXED-SIZE ARRAY

We consider the computation of $\vec{c} = Ab$, where $A$ is an $n \times n$ matrix, and $b$ and $\vec{c}$ are two vectors of size $n$, implemented on a fixed-size BLSA.
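Written out componentwise, the product and its block-wise accumulation can be stated as below. This is a sketch under stated assumptions: indices are taken to start at zero (the excerpt does not fix the index origin), the block matrices $A_1, \ldots, A_m$ and the block count $m$ are symbols introduced here for illustration, and the index transformation applied between iterations is ignored.

```latex
% c = A b, componentwise (zero-based indices assumed)
\[
  c_i = \sum_{k=0}^{n-1} a_{ik}\, b_k, \qquad i = 0, 1, \ldots, n-1 .
\]
% Accumulation over block matrices A_1, ..., A_m (illustrative symbols)
\[
  \vec{c}^{(0)} = \vec{0}, \qquad
  \vec{c}^{(j)} = \vec{c}^{(j-1)} + A_j b, \quad j = 1, \ldots, m, \qquad
  \vec{c} = \vec{c}^{(m)} \quad \text{whenever } A = \sum_{j=1}^{m} A_j .
\]
```

The zero initial value $\vec{c}^{(0)}$ matches the statement that all locations of DPR_C are set to zero before the computation starts.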
A crucial design goal is to attain a minimal execution time with a given number of PEs in the BLSA. In order to avoid data conflicts when minimizing the execution time, it is desirable that $n$ be odd (see, for example, [24]). If it is not, zero entries should be added to the matrix $A$ and
Similar resources
L L O O N N G
In ancient times, wax seals impressed with signet rings were affixed to documents as evidence of their authenticity. A digital counterpart is a message authentication code fixed firmly to each important document. We sketch an architecture that accomplishes this by using encapsulation of content with provenance metadata, cryptographic sealing, and webs of trust rooted in respe...
Full text
Systolic Arithmetic Architectures
In this paper, parallelism on the algorithmic, architectural, and arithmetic levels is exploited in the design of a Residue Number System (RNS) based architecture. The architecture is based on modulo processors. Each modulo processor is implemented by a two-dimensional systolic array composed of very simple cells. The decoding stage is implemented using a 2-D array, too. The decodin...
Full text
Metal Cation/Anion Speciation via Paired-Ion, Reversed Phase HPLC with Refractive Index and/or Inductively Coupled Plasma Emission Spectroscopic Detection Methods
Conventional high performance liquid chromatography instrumentation and packing materials can be inexpensively and rapidly utilized for the qualitative and quantitative analysis of various metal cations or anions. The final approaches utilize reversed phase HPLC in the form of paired-ion separations. The detection of individually eluted, fully resolved metal cations or anions is possible co...
Full text
Bearing Estimation with Acoustic Vector-Sensor Arrays
1 INTRODUCTION The passive direction-of-arrival (DOA) estimation problem, in which the bearings of a number of far-field acoustic sources are determined, is of great importance in many underwater applications. Traditionally, the solution is to use a spatially distributed array of omnidirectional pres...
Full text
FingerCode: A Filterbank for Fingerprint Representation and Matching
With identity fraud in our society reaching unprecedented proportions and with an increasing emphasis on the emerging automatic positive personal identification applications, biometrics-based identification, especially fingerprint-based identification, is receiving a lot of attention. There are two major shortcomings of the traditional approaches to fingerpr...
Full text
Telomeric site position heterogeneity in macronuclear DNA of Paramecium primaurelia
In Paramecium primaurelia, the macronuclear gene encoding the G surface protein is located near a telomere. In this study, multiple copies of this telomere have been isolated and the subtelomeric and telomeric regions of some of them have been sequenced. The telomeric sequences consist of tandem repeats of the hexanucleotides C4A2 or C3A3. We show that th...
Full text