Search results for: suitable locality of processing unit

Number of results: 21213500

2002
Krishnan Kailas Manoj Franklin Kemal Ebcioglu

In Clustered Instruction-level Parallel (ILP) processors, the function units are partitioned and resources such as register file and cache are either partitioned or replicated and then grouped together into on-chip clusters. We present a novel partitioned register file architecture for clustered ILP processors which exploits the temporal locality of references to remote registers in a cluster an...

2011
Peng Di Jingling Xue

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper pr...

1993
Kumar N. Ganapathy Benjamin W. Wah

In this paper, we present the design of an application-specific coprocessor for algorithms that can be modeled as uniform recurrences or "uniformized" affine recurrences. The coprocessor has a regular array of processors connected to an access-unit for intermediate storage of data. The distinguishing feature of our approach is that we assume the coprocessor to be interfaced to a standard, slow (si...

1996
Michelangelo Grigni Fredrik Manne

We consider the problem of mapping an array onto a mesh of processors in such a way that locality is preserved. When the computational work associated with the array is distributed in an unstructured way the generalized block distribution has been recognized as an efficient way of achieving an even load balance while at the same time imposing a simple communication pattern. In this paper we consi...

2016
Xian-He Sun Yu-Hang Liu

In addition to locality, data access concurrency has emerged as a pillar factor of memory performance. In this research, we introduce a concurrency-aware solution, the memory Sluice Gate Theory, for solving the outstanding memory wall problem. Sluice gates are designed to control data transfer at each memory layer dynamically, and a global control algorithm, named layered performance matching, ...

2002
Carlo Fantozzi Andrea Pietracaprina Geppino Pucci

We prove an analogue of Brent’s lemma for BSP-like parallel machines featuring a hierarchical structure for both the interconnection and the memory. Specifically, for these machines we present a uniform scheme to simulate any computation designed for v processors on a v'-processor configuration with v' ≤ v and the same overall memory size. For a wide class of computations the simulation exhibits ...

1996
Ben H. H. Juurlink Harry A. G. Wijshoff

The BSP model was proposed as a step towards general purpose parallel computing. This paper introduces the E-BSP model that extends the BSP model in two ways. First, it provides a way to deal with unbalanced communication patterns, i.e., communication patterns in which the amount of data sent or received by each processor is different. Second, it adds a notion of general locality to the BSP mod...

1996
Bruce Hendrickson Robert W. Leland Rafael Van Driessche

Terminal propagation is a method developed in the circuit placement community for adding constraints to graph partitioning problems. This paper adapts and expands this idea, and applies it to the problem of partitioning data structures among the processors of a parallel computer. We show how the constraints in terminal propagation can be used to encourage partitions in which messages are communi...

1997
Charles A. Salisbury Rami G. Melhem

Improvements in optical technology will enable the construction of high bandwidth, low latency switching networks. These networks have many applications in massively parallel processing. However, current circuit switching and packet switching techniques are not quite suitable for controlling such networks. Time division multiplexing (TDM) schemes can improve the performance of circuit switched o...

1996
Michelangelo Grigni Fredrik Manne

We consider the problem of mapping an array onto a mesh of processors in such a way that locality is preserved. When the computational work associated with the array is distributed in an unstructured way the generalized block distribution has been recognized as an efficient way of achieving an even load balance while at the same time imposing a simple communication pattern. In this paper we consi...
