Chunking of Large Multidimensional Arrays
Abstract
Very large multidimensional arrays are commonly used in data-intensive scientific computations and in on-line analytical processing applications of the kind referred to as MOLAP. Such arrays are stored on disk by partitioning the large global array into fixed-size sub-arrays, called chunks or tiles, which form the units of data transfer between disk and memory. A typical query retrieves a sub-array and therefore accesses every chunk that overlaps the query result. An important metric of storage efficiency is the expected number of chunks retrieved over all such queries. The question that immediately arises is: "What shapes of array chunks give the minimum expected number of chunks over a query workload?" The problem of optimal chunking was first introduced by Sarawagi and Stonebraker [14], who gave an approximate solution. In this paper we develop exact mathematical models of the problem and provide exact solutions using steepest descent and geometric programming methods. Experimental results, using synthetic and real-life workloads, show that our solutions are consistently within 2.0% of the true number of chunks retrieved for any number of dimensions. In contrast, the approximate solution of [14] can deviate considerably from the true result as the number of dimensions increases.
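To make the metric concrete, the following is a minimal sketch of how the expected number of chunks retrieved could be computed for a small example. It is not the paper's model or solution method. It assumes integer cell coordinates, a query start that is equally likely to fall at any offset within a chunk, and a workload given as query shapes with probabilities. It picks the best chunk shape by brute-force enumeration over a hand-supplied candidate list rather than by steepest descent or geometric programming. All function names are hypothetical.

```python
from itertools import product
from math import prod


def chunks_touched(start, query_shape, chunk_shape):
    """Count the chunks overlapped by a query box whose lowest corner is `start`."""
    # Per dimension: index of last chunk hit minus index of first chunk hit, plus one.
    return prod(
        (s + q - 1) // c - s // c + 1
        for s, q, c in zip(start, query_shape, chunk_shape)
    )


def expected_chunks(query_shape, chunk_shape):
    """Average chunks touched over every alignment of the query start within a chunk.

    Assumption (not from the source): all alignments are equally likely.
    """
    offsets = product(*(range(c) for c in chunk_shape))
    total = sum(chunks_touched(o, query_shape, chunk_shape) for o in offsets)
    return total / prod(chunk_shape)


def best_chunk_shape(workload, candidates):
    """Return the candidate chunk shape with the lowest workload-weighted cost.

    `workload` is a list of (query_shape, probability) pairs.
    """
    def cost(chunk_shape):
        return sum(p * expected_chunks(q, chunk_shape) for q, p in workload)

    return min(candidates, key=cost)


if __name__ == "__main__":
    # Hypothetical example: 2-D chunks that all hold 16 cells.
    candidates = [(16, 1), (8, 2), (4, 4), (2, 8), (1, 16)]
    row_heavy = [((16, 1), 1.0)]
    mixed = [((16, 1), 0.5), ((1, 16), 0.5)]
    print(best_chunk_shape(row_heavy, candidates))
    print(best_chunk_shape(mixed, candidates))
```

Under these assumptions, a workload made up only of long row queries favors long, thin chunks, while an even mix of row and column queries favors square chunks. This is the kind of trade-off that the chunk-shape optimization has to resolve.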
Similar resources
Design and Implementation of a Scalable Parallel System for Multidimensional Analysis and OLAP
Multidimensional Analysis and On-Line Analytical Processing (OLAP) use summary information that requires aggregate operations along one or more dimensions of numerical data values. Query processing for these applications requires different views of data for decision support. The Data Cube operator provides multi-dimensional aggregates, used to calculate and store summary information on a number...
Iteration Aware Prefetching for Large Multidimensional Scientific Datasets
Most caching and prefetching research does not take advantage of prior knowledge of access patterns, or does not adequately address the storage issues inherent in multidimensional scientific data. Armed with an access pattern specified as an iteration over a multidimensional array stored in a disk file, we use prefetching to greatly reduce the number of disk accesses and partially hide the co...
Spatial prefetching for out-of-core visualization of multidimensional data
In this paper we propose a technique called storage-aware spatial prefetching that can provide significant performance improvements for out-of-core visualization. This approach is motivated by file chunking in which a multidimensional data file is reorganized into multidimensional sub-blocks that are stored linearly in the file. This increases the likelihood that data close in the n-dimensional...
An Efficient Encoding Scheme to Handle the Address Space Overflow for Large Multidimensional Arrays
We present a new implementation scheme for multidimensional arrays that handles large-scale, high-dimensional datasets that grow incrementally. The scheme implements a dynamic multidimensional extendible array using a set of two-dimensional extendible arrays. Multidimensional arrays provide many advantages, but they have some problems as well. The traditional multidimensional array is not dyn...
MARCINKIEWICZ-TYPE STRONG LAW OF LARGE NUMBERS FOR DOUBLE ARRAYS OF NEGATIVELY DEPENDENT RANDOM VARIABLES
In the following work we present a proof for the strong law of large numbers for pairwise negatively dependent random variables which relaxes the usual assumption of pairwise independence. Let be a double sequence of pairwise negatively dependent random variables. If for all non-negative real numbers t and , for 1 < p < 2, then we prove that (1). In addition, it also converges to 0 in ....