The Pipelined Set Cover Problem
Authors

1. Computer Science Department, Duke University. Part of this work was done while the author was at Stanford University, supported by NIH 1HFZ465. [email protected]
2. Computer Science Department, Stanford University. Supported by the NSF under grants IIS-0118173 and IIS-9817799. [email protected]
3. Computer Science Department, Stanford University. Supported in part by NSF grants IIS-0118173 and EIA-0137761, NSF ITR Award 0331640, and grants from Microsoft and Veritas. [email protected]
4. Computer Science Department, Stanford University. Supported by the NSF under grants IIS-0118173 and IIS-9817799. [email protected]

Abstract

A classical problem in query optimization is to find the optimal ordering of a set of possibly correlated selections. We provide an abstraction of this problem as a generalization of set cover called pipelined set cover, where the sets are applied sequentially to the elements to be covered and the elements covered at each stage are discarded. We show that several natural heuristics for this NP-hard problem, such as the greedy set-cover heuristic and a local-search heuristic, can be analyzed using a linear-programming framework. These heuristics lead to efficient algorithms for pipelined set cover that can be applied to order possibly correlated selections in conventional database systems as well as data-stream processing systems. We use our linear-programming framework to show that the greedy and local-search algorithms are 4-approximations for pipelined set cover. We extend our analysis to the problem of minimizing the lp-norm of the costs paid by the sets, where p ≥ 2 is an integer, to examine the improvement in performance as the total cost is increasingly dominated by the initial sets in the pipeline. Finally, we consider the online version of pipelined set cover and present a competitive algorithm with a logarithmic performance guarantee. Our analysis framework may be applicable to other problems in query optimization where it is important to account for correlations.

1 Motivation

A common operation in database query processing is to find the subset of records in a relation that satisfy a given set of selection conditions. To execute this operation efficiently, a query processor tries to determine the optimal order in which to evaluate the individual selection conditions, so we call this operation pipelined filters [2, 4, 12, 18]. Optimality in pipelined filters is usually with respect to minimizing the total processing time [4, 12]. For example, consider a relation packets, where each record contains the header and an initial part of the payload of network packets logged by a network router. Suppose a query needs to compute the subset of packets where each record r in the result satisfies the following three conditions:

1. p1: destPort = 80, where destPort is the destination port field of r.
2. p2: domain(destAddr) = "yahoo.com", where destAddr is the destination address field of r, and domain is a function that returns the Internet domain name of an address passed as input.
3. p3: The payload of r contains the regular expression "^[^\n]*HTTP/1.*" [9].

A query processor might use three selection operators on packets, denoted Op1, Op2, and Op3, to evaluate these three conditions respectively. In this case the query processor might choose to apply Op1 first on each record in packets, so that Op2 and Op3 need only process records that are selected by Op1. Since both Op2 and Op3 involve complex functions, applying either of them before Op1 could increase the total processing time by orders of magnitude. Further, the query processor may choose to process Op2 before Op3: packets selected by Op1 are also likely to be selected by Op3, so applying Op3 earlier would eliminate few records.
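To make the pipeline concrete, here is a minimal Python sketch of such an operator pipeline; it is our illustration, not code from the paper, and the record layout, the ADDR_TO_DOMAIN table, and the predicate bodies are hypothetical stand-ins for the conditions above.

```python
import re

# Hypothetical record layout: each packet record is a dict with header
# fields and a payload string. ADDR_TO_DOMAIN stands in for the paper's
# domain() function; a real system would do a reverse-DNS lookup.
ADDR_TO_DOMAIN = {}
HTTP_RE = re.compile(r"^[^\n]*HTTP/1.*")

def p1(r):  # Op1: cheap integer comparison on a header field
    return r["destPort"] == 80

def p2(r):  # Op2: expensive domain lookup on the destination address
    return ADDR_TO_DOMAIN.get(r["destAddr"]) == "yahoo.com"

def p3(r):  # Op3: expensive regular-expression match on the payload
    return HTTP_RE.search(r["payload"]) is not None

def pipelined_filter(records, operators):
    """Apply the selection operators in the given order. all() short-
    circuits, so a record is dropped at the first operator it fails and
    later operators only process records selected by all earlier ones."""
    for r in records:
        if all(op(r) for op in operators):
            yield r

# Applying the cheap operator first, as discussed above:
# result = list(pipelined_filter(packets, [p1, p2, p3]))
```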
As this example shows, it is important to choose a good, if not optimal, order for applying the selection operators to the records of the input relation. Note that both the expected fraction of records selected by each operator (its selectivity) and its record-processing time must be taken into account.

Suppose the selection conditions are independent; that is, the selectivity s of any operator O among the records that O processes does not depend on which operators appear before O in the order. Under this assumption, computing the order that minimizes total processing time is easy: we simply order the operators by nonincreasing ratio (1 − s)/c, where c is the operator's record-processing time. Most previous work on the selection-ordering problem and on related problems makes the independence assumption and uses this ordering technique [4, 12, 18, 23]. The independence assumption reduces the complexity of finding the optimal order, but it is often violated in practice [6, 25]. It can be shown that when the independence assumption does not hold, ordering n operators by nonincreasing (1 − s)/c can make the total processing time a factor of n worse than optimal. Without the independence assumption, the problem is NP-hard. Previous work [17, 22, 23] on ordering dependent (correlated) operators either uses exhaustive search, which requires selectivity estimates for an exponentially large number of operator subsequences, or proposes simple heuristics with no provable performance guarantees. As databases are extended to manage complex data types such as multimedia and XML, expensive selection conditions are becoming more common, making the problem of ordering dependent selections even more important [4, 12]. The pipelined filters problem also captures restricted types of relational joins and combinations of joins and selections; see [2].

Pipelined filters can be formulated as a generalization of the classical set cover problem [13, 15]: the relation represents the elements to be covered, and each selection operator is a set that drops (or covers) a certain number of records (or elements). The sets are applied sequentially to the elements to be covered, with each set removing the elements that it covers from further processing; the cost of applying a set depends linearly on the number of elements that are still not covered when the set is applied. The desired solution is an ordering of the sets that minimizes the total cost of applying the sets sequentially. We call this problem pipelined set cover; the key difference from classical set cover is the cost function. The mapping from pipelined filters to pipelined set cover is straightforward: the operators map to the sets, and the operator ordering, or pipeline, maps to the ordering of the sets.
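For concreteness, here is a small Python sketch (our illustration, not pseudocode from the paper) of the two quantities just discussed: the pipelined cost of a given set ordering, and the (1 − s)/c rank ordering used under the independence assumption. The input encodings are hypothetical.

```python
def pipelined_cost(universe, ordering, cost):
    """Pipelined set cover objective: each set pays its per-element
    cost for every element still uncovered when it is applied, and
    the elements it covers are then discarded."""
    remaining = set(universe)
    total = 0
    for name, members in ordering:   # ordering: list of (name, element set)
        total += cost[name] * len(remaining)
        remaining -= members
    return total

def rank_order(operators):
    """Independence-assumption rule: order operators by nonincreasing
    (1 - s) / c, where s is the selectivity and c the record-processing
    time. Optimal only when selectivities are truly independent."""
    return sorted(operators,
                  key=lambda op: (1 - op["s"]) / op["c"], reverse=True)
```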
2 Our Contribution

Pipelined set cover has been considered previously in a non-database context by Feige et al. [11] and by Cohen and Kaplan [7]. They show that the uniform-cost version of this problem is MAX-SNP-hard and develop a greedy 4-approximation algorithm for it. In addition to showing the application of pipelined set cover to classical optimization problems in database and data-stream processing, we extend this previous work significantly, as follows.

2.1 Approximation Algorithms for Pipelined Set Cover

We provide two approximation algorithms for pipelined set cover, one based on the greedy heuristic for classical set cover and another based on an intuitive local-search heuristic. (In separate work we have implemented both algorithms efficiently in a data-stream processing system [2].) Using an analysis technique that is different from, and more general than, previous work, we show that both algorithms are 4-approximations, even when the linear cost function depends on the set. This relatively new technique is based on formulating the worst-case performance of the algorithms as linear programs; it was first used by Jain, Mahdian, and Saberi to analyze a dual-fitting algorithm for facility location [14]. The technique has several advantages. In addition to bounding the approximation ratio, the linear program can be used to analyze running time, e.g., the rate of convergence of the local-search heuristic. It also gives new insights about the approximation algorithms, with strong implications for query optimization: the approximation bound depends on the number of sets (operators) n, and for n ≤ 20 the bound is at most 2.35. Furthermore, the technique can be used to analyze other algorithms for pipelined set cover, including a simple move-to-front algorithm that can be implemented very efficiently.

We can view our problem as minimizing the l1-norm of the vector of the number of elements processed (equivalently, the cost paid) by each set. The classical set cover problem can be viewed as minimizing the l0-norm: it charges any set a cost that is independent of the number of elements the set processes, as long as the set processes at least one element. (Of course, technically speaking, there is no such norm; we adopt the view that the set cover objective minimizes a Hamming measure, which is sometimes treated as a substitute for the l0-norm [8].) For set cover, the performance of the greedy algorithm is logarithmic [13, 15, 24], and this approximation factor is optimal [10] assuming P ≠ NP. The approximation ratio improves to 4 for our l1-norm formulation, where the cost of each set is weighted by the number of elements it processes.

A natural question is what happens to the approximation ratio when the goal is to minimize the lp-norm of the costs paid by the sets, for integers p ≥ 2. As p increases, this formulation gives increasing weight to sets at the start of the pipeline, which process more elements. The intuition is that the performance of the greedy algorithm should improve with increasing p, reaching the optimal solution when we minimize the l∞-norm. Since the objective function is nonlinear, linear-programming techniques no longer apply. We develop a Lagrangian-relaxation analysis technique for p ≥ 2 to show that the approximation ratio of the greedy algorithm is 9^(1/p) when the processing costs are uniform (independent of the set), and that local search is a 4^(1/p)-approximation when the processing costs are nonuniform. The improvement in the performance of greedy confirms the intuition that as we skew the total cost in favor of the initial sets chosen, greedy's performance should improve for uniform processing costs.
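Written out in our own notation (which may differ from the paper's), with U_i(π) denoting the elements still uncovered when the i-th set in ordering π is applied, the lp objective is:

```latex
\min_{\pi}\;\Bigl(\sum_{i=1}^{n}\bigl(c_{\pi(i)}\,\lvert U_i(\pi)\rvert\bigr)^{p}\Bigr)^{1/p},
\qquad U_i(\pi) \;=\; U \setminus \bigcup_{j<i} S_{\pi(j)} .
```

Here p = 1 recovers the total-cost objective and p → ∞ weights the first sets most heavily. For reference, a minimal Python sketch of the greedy heuristic in one natural form, picking the set that covers the most still-uncovered elements per unit cost; this is our reading of the heuristic, and the data encodings are hypothetical:

```python
def greedy_pipeline(universe, sets, cost):
    """Repeatedly pick the set covering the most still-uncovered
    elements per unit cost; sets that cover nothing new are appended
    at the end in arbitrary order."""
    remaining = set(universe)
    unused = dict(sets)            # name -> set of elements
    ordering = []
    while unused and remaining:
        best = max(unused,
                   key=lambda s: len(unused[s] & remaining) / cost[s])
        ordering.append(best)
        remaining -= unused.pop(best)
    ordering.extend(unused)        # leftover sets
    return ordering
```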
2.2 Online Pipelined Set Cover

Our original motivation for defining and analyzing pipelined set cover came from our work on processing pipelined filters in a data-stream query processor [2]. A stream, as opposed to a relation, is a continuous, unbounded flow of records arriving at a stream-processing system [1]. Example streams include network packets, stock tickers, and sensor observations. Pipelined filters are common in stream processing; e.g., packets may be a stream in our example query from Section 1. Another common example of pipelined filters in stream processing is a join of a stream S with a set of relations R1, R2, . . . , Rk: for each record s arriving on S, we need to find R′i ⊆ Ri, 1 ≤ i ≤ k, such that each record ri ∈ R′i satisfies ri.A = s.A, where A is a field common to S, R1, R2, . . . , Rk. (We have defined a restricted version of the problem for succinctness [2].) The join output for s is the set of concatenated records s · r1 · r2 · · · rk for each combination of r1 ∈ R′1, r2 ∈ R′2, . . . , rk ∈ R′k. If any of the R′i is empty, then s produces no join output and we say that s is dropped. To process the join efficiently, we must order R1, R2, . . . , Rk when computing R′1, R′2, . . . , R′k so that records in S that are eventually dropped consume minimal processing time. Note that the processing required for records that are not dropped is independent of the ordering.

Pipelined filters over data streams motivate the online version of pipelined set cover, in which some number of elements arrive at each time step. The online algorithm must choose an ordering of the sets in advance at every time step and process the incoming elements according to this ordering. Its performance is compared against that of the best offline algorithm that keeps a single fixed ordering for the entire request sequence. For online pipelined set cover, we present an O(log n)-competitive algorithm for the uniform-cost case, where n is the number of sets. This algorithm can be extended to an O(log n + log(cmax/cmin))-competitive algorithm for the nonuniform-cost case, where cmax is the largest per-element processing cost among all sets and cmin is the smallest such cost.
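As an illustration of the join example above, here is a minimal Python sketch; it is ours, and the dict-based records, linear scans, and field name A stand in for whatever indexes and schemas a real system would use.

```python
from itertools import product

def probe_join(s, relations, key="A"):
    """Probe the relations in the given order for stream record s.
    If any probe finds no matching records, s is dropped and the
    remaining (possibly expensive) probes are skipped entirely."""
    matches = []                       # R'_1, ..., R'_k from the text
    for R in relations:
        Ri = [r for r in R if r[key] == s[key]]
        if not Ri:
            return []                  # s is dropped: no join output
        matches.append(Ri)
    # Concatenate s with every combination r1 in R'_1, ..., rk in R'_k.
    return [(s, *combo) for combo in product(*matches)]
```

A good ordering of relations is one where records that will eventually be dropped hit an empty probe result as early and as cheaply as possible, which is exactly the ordering question the online algorithm addresses.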