Graph reduction on shared-memory multiprocessors
Abstract
reduction machine. The next major efficiency improvement was to compile the program into application-specific combinators, called supercombinators [Hughes82]. This idea has been successfully implemented in the G-machine [Johnsson84, Augustsson84]: a program is first transformed through lambda lifting into a set of supercombinators, which are then compiled into native assembly code for efficiency.

The update of the root application node of the redex in step (3) is needed to maintain the sharing of delayed computations. As a side effect, the references to the remaining graph nodes of the original redex are discarded, but these nodes cannot be reclaimed straight away since they might be referenced from other parts of the global computation graph. A garbage collector is needed to properly handle shared nodes when reclaiming garbage nodes in the heap. The presence of cycles in the computation graph complicates the garbage reclamation process [Cohen81].

To increase the performance of the basic graph reduction mechanism, functional language compilers use numerous optimisations that avoid the construction and interpretation of graphs. For example, if the result of a single rewrite is an application spine, then the graph reducer will immediately unwind that spine. Hence, the construction of the spine in the graph can be avoided altogether by pushing the arguments on the stack and calling the function at the bottom of the spine directly.

For large applications, lazy functional language implementations use much more (heap) memory than their imperative counterparts, despite strictness analysis and other high-level compiler optimisations. At the low implementation level, space requirements can be cut down: tags can be encoded in a few bits of the pointer to an object instead of in the object itself (Chapter 5), and chains of application nodes can often be encoded in one vector apply node.
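The root-update step that preserves sharing can be sketched with explicit heap cells. The following Python fragment is a toy illustration only, not the compiled representation a real G-machine uses; the node names (Num, Add, Ref) and the reduce function are made up for this sketch:

```python
# A toy graph reducer with explicit heap cells. A real G-machine compiles
# supercombinators to code rather than interpreting nodes like this.

class Num:
    """A value node (weak head normal form)."""
    def __init__(self, n):
        self.n = n

class Add:
    """A built-in two-argument redex: a + b."""
    def __init__(self, a, b):
        self.a, self.b = a, b

class Ref:
    """A heap cell; overwriting its contents implements the root update."""
    def __init__(self, node):
        self.node = node

def reduce(ref):
    """Reduce the graph under `ref` to a number, updating redexes in place."""
    node = ref.node
    if isinstance(node, Num):
        return node.n
    n = reduce(node.a) + reduce(node.b)
    ref.node = Num(n)   # overwrite the redex root: all sharers see the result
    return n

# x is the shared subgraph 1+2; y = x + x references the same cell twice.
x = Ref(Add(Ref(Num(1)), Ref(Num(2))))
y = Ref(Add(x, x))
assert reduce(y) == 6
assert isinstance(x.node, Num) and x.node.n == 3   # x was updated in place
```

Because the cell `x` is overwritten with its value after the first reduction, the second reference to it costs only a lookup; without the update, the subgraph 1+2 would be rewritten once per reference.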
These variable-length vectors, however, complicate the allocation and reclamation of nodes in the heap. Reference counting and mark&scan garbage collectors have difficulty accommodating variable-length vectors, so compacting garbage collectors, which move the live data into one contiguous block, are generally used. To efficiently support garbage collection, several abstract graph reduction machines contain multiple stacks that separate heap pointers from other stack items such as return addresses and basic data values (integers, floating-point numbers, etc.). Multiple stacks are more difficult to manage; the alternative is either to tag all values or to record the pointer positions in each stack frame. A comprehensive description of the basic graph reduction principles and optimised implementation techniques can be found in [Peyton Jones87b].

2.5.1 Strictness analysis

Strictness analysis is an important optimisation technique that determines, for each function, which parameter values are needed to compute the result. As a consequence, the strict arguments of a function may be evaluated safely before calling the function, without violating the lazy evaluation semantics. Thus strictness analysis allows the compiler to use efficient call-by-value semantics for certain parameters instead of the call-by-need semantics that forces the construction of graphs. This dramatically increases the performance of lazy functional languages; for example, [Hartel91b] reports up to a 92% reduction in claimed heap nodes when switching strictness analysis on.

A function is strict if its result cannot be computed when its argument value is undefined. Formally:

    a function f is strict iff f ⊥ = ⊥

The special symbol ⊥ (called "bottom") denotes a non-terminating computation like the function inf defined as inf = inf+1. The job of the compiler is to determine for each function whether the above condition holds or not.
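The f ⊥ = ⊥ test can be illustrated with explicit thunks. In this sketch, ⊥ is modelled as a thunk that raises an exception, a finite stand-in for a genuinely non-terminating computation; the function names are invented for the example:

```python
# Illustrating "f is strict iff f ⊥ = ⊥" with call-by-name thunks.

class Bottom(Exception):
    """Raised when the ⊥-thunk is forced (stands in for non-termination)."""
    pass

def bottom():
    raise Bottom()          # forcing this thunk never yields a value

def const_one(x):
    return 1                # non-strict: never forces its argument

def succ(x):
    return x() + 1          # strict: must force its argument

# const_one ⊥ = 1 ≠ ⊥, so const_one is not strict in its argument.
assert const_one(bottom) == 1

# succ ⊥ = ⊥, so succ is strict: forcing ⊥ propagates.
try:
    succ(bottom)
    is_strict = False
except Bottom:
    is_strict = True
assert is_strict
```

A compiler may therefore evaluate the argument of succ eagerly before the call, while the argument of const_one must remain a delayed graph.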
Numerous (formal) strictness analysis methods have been devised [Abramsky87], but in essence these program analysis techniques may be thought of as propagating information through a syntax tree. For example, consider the strictness analysis of the following function:

    divide x y = NaN, if y = 0    || return exception
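Such a propagation pass can be sketched as a two-point analysis over a small expression tree. The AST encoding and rules below are a deliberate simplification of the formal methods cited above, and the otherwise branch (x/y) is an assumption about how the definition of divide continues:

```python
# A simplified two-point strictness analysis propagated over a syntax tree.

def strict_in(expr, var):
    """Does evaluating `expr` always demand the value of `var`?"""
    kind = expr[0]
    if kind == 'const':
        return False
    if kind == 'var':
        return expr[1] == var
    if kind == 'prim':                       # strict primitive (=, /, +, ...)
        return any(strict_in(a, var) for a in expr[2])
    if kind == 'if':                         # condition is always demanded; a
        c, t, e = expr[1], expr[2], expr[3]  # branch only if BOTH demand var
        return strict_in(c, var) or (strict_in(t, var) and strict_in(e, var))
    raise ValueError(kind)

# divide x y = NaN, if y = 0
#            = x/y, otherwise      (assumed continuation of the definition)
divide = ('if',
          ('prim', '=', [('var', 'y'), ('const', 0)]),
          ('const', float('nan')),
          ('prim', '/', [('var', 'x'), ('var', 'y')]))

assert strict_in(divide, 'y') is True    # y is always tested
assert strict_in(divide, 'x') is False   # x is not used when y = 0
```

The analysis concludes that divide is strict in y but not in x: since divide ⊥ 0 = NaN ≠ ⊥, evaluating x before the call would violate the lazy semantics, while y may safely be passed by value.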