
Data & Knowledge Engineering 83 (2013) 111–125

Contents lists available at SciVerse ScienceDirect

Data & Knowledge Engineering


journal homepage: www.elsevier.com/locate/datak

An integer programming approach for the view and index selection problem
Zohreh Asgharzadeh Talebi a,*, Rada Chirkova b, Yahya Fathi c

a SAS Institute, United States
b Computer Science Department, NC State University, United States
c Department of Industrial and Systems Engineering, NC State University, United States

Abstract
The view- and index-selection problem is a combinatorial optimization problem that arises in the context of on-line analytical processing (OLAP) in database-management systems. We propose an integer programming (IP) model for this problem and study the properties of the views and indexes that appear in the optimal solution for this model. We then use these properties to remove a number of variables and constraints from the corresponding IP model and obtain a model that is significantly smaller, yet its optimal solution is guaranteed to be optimal for the original problem. This allows us to solve realistic-size instances of the problem in reasonable time using commercial IP solvers. Subsequently, we propose heuristic strategies to further reduce the size of this IP model and dramatically reduce its execution time, although we no longer guarantee that the reduced IP model offers a globally optimal solution for the original problem. Finally, we carry out an extensive computational study to evaluate the effectiveness of these IP models for solving the OLAP view- and index-selection problem.

© 2012 Elsevier B.V. All rights reserved.

Article history:
Received 1 August 2011
Received in revised form 5 November 2012
Accepted 5 November 2012
Available online 16 November 2012

Keywords: Business intelligence; Data warehouse and repository; OLAP; Materialized views; View and index selection; Integer programming; Heuristics

1. Introduction

On-line analytical processing (OLAP) and data warehousing are used by executives, managers, and analysts to make better and faster decisions [1,15]. OLAP applications include (but are not limited to) marketing, business and management reporting, budgeting, forecasting, health care, and systems analysis. Users are often interested in summary information of a measure as a function of some business aspects or dimensions. For instance, if we consider a warehouse that keeps information related to a company's sales, its corresponding dimensions could be product sold, date and time of sale, customer, and sales representative. In practice, the number of dimensions in a data warehouse can be relatively large, and each dimension can have a number of distinct attributes that are stored in a separate dimension table (e.g., attributes of product could be its color, size, and weight). User queries typically specify these attributes, and preparing a response to a query could involve an extensive search through a number of dimension tables for proper attribute values.

An additional significant challenge in answering aggregate queries is the often very large size of the fact table, which stores the raw data, for instance the data about individual sales transactions. The challenge in query evaluation is the time-consuming process of traversing through either the entire fact table or a large part of it in order to compute a relatively small number of aggregated values in the query answer. As a result, it may be quite time consuming to answer aggregate queries directly from the stored data in the database. In order to accelerate query evaluation, a common practice is to pre-compute and store (materialize) auxiliary data, such as views or indexes [14]. For a given database, the total number of applicable views and indexes can be extremely large.
Thus, it is not always practical to materialize all potentially beneficial views and indexes, due to (among other reasons) the limited amount of storage space that could be available to store the auxiliary data. This is where the problem of selecting a subset of views and indexes arises. In this context, we are typically interested in making a selection that would maximize the associated benefit (e.g., minimize the response time for a given collection of queries) while observing a given storage-space limit.
* Corresponding author. E-mail addresses: zohreh.asgharzadeh@sas.com (Z. Asgharzadeh Talebi), chirkova@csc.ncsu.edu (R. Chirkova), fathi@ncsu.edu (Y. Fathi).
0169-023X/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.datak.2012.11.001


We consider the following optimization problem that we refer to as the OLAP view- and index-selection problem. Given a data-warehouse schema, a set of data-analysis queries of interest, and an upper bound b on the available storage space, find a collection of views and indexes that would fit within the storage limit b and would minimize the cost measure (evaluation time) for the given queries.

Since the total number of possible views (subsets of the set of attributes) and their associated indexes (permutations of the attributes in the given view) is finite, in theory this problem can always be solved using a complete enumeration of all solutions. But even for a database with a relatively small number of attributes, this approach is not practical, since the total number of such solutions can be extremely large. Note that for a collection of k attributes, we have 2^k views in the data cube and 2^(2^k) distinct subsets of the views. In addition, for each view v we have (|v|)! possible indexes, where |v| represents the number of attributes in view v.

Alternatively, a naive approach would be to separate the problem into two phases where in the first phase we select a collection of views and in the second phase we select a collection of indexes for these views. Each of these operations can be carried out using existing methods. The problem of selection of views is addressed in [4,8,25,32,38,44], and the problem of selection of indexes is addressed in [2,11–13,16,17,28,30], among others. Clearly, this approach would be computationally easier than the complete enumeration method. However, as pointed out in [21], it could also result in suboptimal solutions. In this article we seek to develop a computationally efficient methodology to obtain an optimal solution for the combined view- and index-selection problem as defined above.
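To give a feel for how quickly complete enumeration becomes impractical, the following Python sketch evaluates the counts quoted above for a small number of attributes. It is only an illustration; the function name `search_space_counts` is hypothetical, and the index count skips the view with no attributes.

```python
from math import comb, factorial


def search_space_counts(k: int) -> dict:
    """Count views, subsets of views, and permutation indexes for k attributes."""
    num_views = 2 ** k                 # one view per subset of the attributes
    num_view_subsets = 2 ** num_views  # every possible collection of views
    # There are C(k, r) views with r attributes, and each has r! permutations.
    num_indexes = sum(comb(k, r) * factorial(r) for r in range(1, k + 1))
    return {
        "views": num_views,
        "view_subsets": num_view_subsets,
        "indexes": num_indexes,
    }


counts_k3 = search_space_counts(3)
counts_k5 = search_space_counts(5)
```

Even for k = 5 the number of view subsets is 2^32, before any index choices are taken into account.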
In addition to the obvious merits of obtaining an optimal solution to the problem, as opposed to obtaining a suboptimal solution, such a methodology can also provide a means for evaluating the quality of solutions obtained via various heuristic methods by providing the optimal solution for a given collection of instances. To this end we introduce an integer programming (IP) model for solving the problem and propose a strategy to reduce the size of this IP model to manageable levels so that it can be solved using a commercial IP solver (such as CPLEX 10).¹

1.1. Related work

The prominent role of materialized views and indexes in improving query-processing performance has long been recognized, see, for instance, [10] and [33]. Enterprise-class database-management systems that provide modules for generic view and index selection include Microsoft SQL Server (see [3] and [34]) and DB2 (see [7,42] and [45]). At the same time, while it can be relatively easy to improve query-evaluation costs to some degree by using, for instance, greedy strategies for choosing indexes or views, it is highly nontrivial to arrive at a globally optimal solution, i.e., one that reduces the processing costs of the given OLAP queries as much as is theoretically possible.

Gupta et al. in [21] show that a variant of the view- and index-selection problem is NP-hard. Furthermore, as shown in [26], an approximation algorithm with nontrivial performance guarantees cannot exist for the general view- and index-selection problem unless P = NP. Hence it is natural to look for heuristic approaches for solving the problem. Well-known past efforts in this direction include the work presented in [3] and [21]. Gupta et al. in [21] proposed two families of greedy algorithms for solving the problem of view and index selection in a generalization of the OLAP setting; the algorithms select views and indexes together in iterations, instead of selecting views first and indexes second.
In fact their approach builds on the results of their previous work in [22] where they proposed a greedy heuristic algorithm for solving the view selection problem, and claimed a performance guarantee for their heuristic. Karloff and Mihail in [26] disproved the strong performance bounds of their algorithm, by showing that the underlying approach of [22] cannot provide the stated worst-case performance ratios unless P = NP.

A well-known tool for automated selection of materialized views and indexes for a wide variety of query, view, and index classes in relational database systems is presented in [3]. The approach of [3], implemented in Microsoft SQL Server, is based partly on the authors' previous work on index selection [16]. The contributions stated in [3] are (i) an end-to-end framework for view and index selection in practical systems, and (ii) a module for heuristically building (pruning) the search space of potential views and indexes for a given query workload. The methodology that we present here also prunes the search space of views and indexes, but, unlike the approach in [3] and [21], all but one of our proposed strategies keep at least one globally optimal solution in the search space. As a result the corresponding solution that we obtain is guaranteed to be globally optimal for the problem.

A uniform approach for selecting views and indexes for OLAP queries is proposed in [19]. This approach considers view- and index-maintenance costs alongside query-response costs. The paper proposes to use a bond energy algorithm for initial clustering of indexes, and then to apply a partitioning method to select a set of views or indexes. Once the best partition is found, views or indexes are eliminated in a greedy manner, until the storage-space constraint is satisfied. Another heuristic approach for selecting views and indexes is proposed by Bellatreche et al. in [9].
In their approach they allocate an initial space to views and use the algorithm proposed in [44] to generate a set of candidate views. Next, they use the remaining space for indexes, where the initial set of indexes is generated through a greedy algorithm. Finally, views and indexes are selected iteratively from the initial sets through two greedy algorithms called index spy and view spy.

The problem of selecting views and indexes for databases where attributes have strong correlations is considered in [29]. In that paper, the authors designed an algorithm to generate a set of candidate views based on query groups, where queries are grouped based on the similarity of their attributes. Then the algorithm generates a set of clustered indexes for the set of candidate views.
1 An extended abstract of this work, with only one preliminary experiment reported and with full formulations and proofs of the technical results suppressed, appeared in [5].


Finally they introduce an integer programming model to select among the candidate views and indexes. As in this work only databases with strong attribute correlations are considered, the search space of dominant views and indexes is significantly smaller than the case without strong attribute correlations.

Other past work considers either selection of views only, e.g., [4,8,25,32,38,44] and references therein, or selection of indexes only, e.g., [2,11–13,16,17,28,30] and references therein for OLAP. Also, the problem of selecting a set of fragmented views is considered in [20]. In this work, the authors propose a 0–1 integer programming model to select the best fragments, which minimizes the total cost of answering the queries in the workload. The 0–1 integer programming models are also used in [18] and [39] for related problems. As stated earlier, this article differs from the above-mentioned related work in the sense that we address the problem of selecting views and indexes simultaneously, and we seek an optimal solution for the combined problem.

2. Preliminaries

Throughout this work we assume that the data warehouse under consideration has a star-schema structure (see [27]) with a single fact table and several dimension tables. We consider relational select–project–join queries with grouping and aggregation (SPJGA) (see [41]) and assume that users frequently ask a limited number of SPJGA queries for a variety of parameters (attributes), such as the itemized daily sales reports for products, locations, and so on. In addition, we assume that the time (cost) of evaluating a query is proportional to the number of stored-data rows scanned by the query-processing system when evaluating the query.²

Our original search space of views is the set of all views in the view lattice as defined in [22]. This consists of the raw-data view, which is the table resulting from the join of all stored (both fact and dimension) tables, plus all the star-join views with grouping and aggregation (JGA views) defined on the raw-data view. Thus each view in our original view set is associated with a distinct subset of the collection of attributes in the data warehouse, and vice versa. We have a total of 2^k views in this view set, where k is the number of attributes in the data warehouse.

Our (original) search space of indexes includes B+-tree indexes (see [41]) on all views in the view set. To make the problem more manageable, in our study we consider only fat indexes over the view set; an index for a given view v is said to be a fat index if it is associated with a permutation of all of the grouping attributes of view v. Throughout this article we use the character π to represent both a fat index and the permutation vector associated with that fat index. Considering other types of indexes would make the problem considerably harder, as the corresponding search space would become significantly larger and more complex. Studying the types of indexes to which our current solutions can be directly extended is part of our ongoing work.
We say a SPJGA query q can be answered using a JGA view v if and only if the set of grouping attributes of v is a superset of the set of attributes in the GROUP BY clause of q and those attributes in the WHERE clause of q that are compared with constants. Furthermore, if view v is chosen for answering query q, then at most one index of view v can be used to answer query q. By definition, each query q can be answered using the raw-data view in the view set. For ease of presentation, throughout this article we use the letter v to represent both a view and the collection of grouping attributes for that view, and we use the letter q to represent both a query and the collection of attributes in the GROUP BY clause of that query plus those attributes in the WHERE clause of the query that are compared with constants.

Example 1. Consider view v = {a, b, c, d} and queries q1 = {a, b} and q2 = {d, e}, where the letters a, b, c, d, and e represent distinct attributes in the database. Attributes a, b, c, and d are the grouping attributes of view v. Attributes a and b form the collection of attributes that are either in the GROUP BY clause of q1, or among those attributes in the WHERE clause of q1 that are compared with constants. Similarly, attributes d and e form the collection of attributes that are either in the GROUP BY clause of q2 or among those attributes in the WHERE clause of q2 that are compared with constants. Since q1 ⊆ v, view v can answer query q1. However, since q2 ⊄ v, view v cannot answer query q2.
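Because views and queries are both identified with attribute sets, the answerability condition reduces to a subset test. The following minimal Python sketch replays Example 1 under that representation; the function name `can_answer` is hypothetical.

```python
def can_answer(view: frozenset, query: frozenset) -> bool:
    """A view can answer a query iff the query's attributes are all in the view."""
    return query <= view


v = frozenset({"a", "b", "c", "d"})
q1 = frozenset({"a", "b"})
q2 = frozenset({"d", "e"})

q1_answerable = can_answer(v, q1)  # q1 is a subset of v
q2_answerable = can_answer(v, q2)  # attribute e is not in v
```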

2.1. Cost model

The cost model that we use is similar to the one proposed by Gupta et al. in [21], i.e., the cost of answering query q using view v is the size of that portion of v that must be processed (scanned) in order to construct the result of query q. We measure the size of a view or a portion of a view as the number of rows in that view or in the portion thereof. When we answer query q using only view v with no indexes, then we have to scan all rows of v. Hence the corresponding cost is equal to the size of view v itself. However, when we answer query q using view v and an index π of v, we only need to read the part of v referenced by π with respect to q, hence the corresponding cost is potentially smaller. Naturally, the cost of answering a query in this situation depends on the actual contents of the data set under consideration, and it can be determined exactly only after we have scanned the corresponding data. But in order to compare various courses of action and devise an appropriate action plan we need to evaluate this cost prior to scanning the data. Gupta et al. in [21] propose an approach to obtain a reasonable

2 When indexes are used in the evaluation, we count the scans of only the rows retrieved using the applicable indexes.


estimate for this cost using the available information about the size of various views in the view set. We adopt this approach to estimate the cost coefficients in our models, and in the remainder of this section we introduce and explain this approach.

Suppose A1 is the set of attributes in the GROUP BY clause of query q, and A2 is the set of attributes that are compared with constants in the WHERE clause of query q. Also suppose B is the set of grouping attributes of view v. Let π represent an index over view v, i.e., a permutation of attributes in the set B. (Recall that we use the character π to represent both the index and the corresponding permutation vector of its attributes.) As mentioned earlier, view v can be used to answer query q if and only if (A1 ∪ A2) ⊆ B. Let us use the notation c_q(v, π) to denote the estimated cost of answering query q using view v and its index π; c_q(v, π) is defined only if view v can be used to answer query q, i.e., if (A1 ∪ A2) ⊆ B. The value of c_q(v, π) depends on the size of view v and on the relationship between its index π and the collection of attributes A1 and A2. In particular, let v_π(q) denote the view whose set of grouping attributes is identical to the largest subset of A2 that forms a prefix (not necessarily proper) of π, i.e., the largest subset of attributes that are compared with constants in the WHERE clause of q and form a prefix of π. Then the approach of [21] suggests the following formula to estimate the cost of answering query q using view v and index π:

$$
c_q(v, \pi) = \frac{\mathrm{size}(v)}{\mathrm{size}(v_\pi(q))} \qquad (1)
$$

Here, size(v) represents the size of view v as defined above.³ For notational convenience, we use v = ∅ to represent the view which is aggregated on all of the attributes of the database, and define size(∅) = 1. Furthermore, similar to the frameworks of [22,31] and [21], we assume that the total cost of answering a given collection of queries is the sum of the costs of evaluating the individual queries.

In order to determine the size of a view in the view set, we can use either the sampling method or the analytical method proposed in [21]. For a given view, if we know that its grouping attributes are statistically independent, we can estimate its size analytically from the size of the raw-data view. In this case the size of the view is the number of distinct values of the grouping attributes of the view. Otherwise, we estimate the size of the view by sampling from the raw data.

For ease of presentation, throughout this paper we assume that for every query q in the workload, all of its attributes are in its WHERE clause and are compared with constants, whereas the GROUP BY clause of q is empty, i.e., we assume q = A2. Hence from here onward the notation v_π(q) in Eq. (1) denotes the view whose grouping attributes are identical to the largest subset of q that forms a prefix of π. With minor modifications, all results that we obtain here are valid without this assumption.

Example 2. Consider a view v = {a, b, c, d}, an index π = (a, c, b, d) over view v, and a query q = {a, b}. According to the assumption stated above, a and b are in the WHERE clause of query q and are compared with constants, whereas the GROUP BY clause of q is empty. In this example we have v_π(q) = {a}, and the estimated cost of answering query q using view v and index π is c_q(v, π) = size({a, b, c, d}) / size({a}).

2.2. Problem statement

In practical settings, the amount of available storage space is a natural constraint in the (OLAP) view- and index-selection problem, as storing all possibly beneficial views and indexes is infeasible in today's database systems (see [3,21]). We consider the following OLAP view- and index-selection (OLAP-VI) problem: given a star-schema data warehouse and a set (i.e., workload) of parameterized SPJGA queries, our goal is to minimize the estimated evaluation cost of the queries in the workload, by selecting and pre-computing (1) a set of views that can be used in answering the queries, and (2) some fat indexes over those views. We consider this minimization problem under a given storage-space limit, which is an upper bound on the amount of disk space that can be allocated for the materialized views and indexes. Thus, our problem input is of the form (D, Q, b), where D is a database, Q is the workload (which is a set of parameterized queries), and b is the storage limit.

Definition 1. (Feasibility) For a problem input (D, Q, b), a set of views and indexes VI is feasible if (1) each query in Q can be answered using the views in VI, and (2) the set VI satisfies the storage limit b.

Definition 2. (Optimality) For a problem input (D, Q, b), an optimal set of views and indexes is a set of views and indexes VI such that (1) VI is feasible for the problem input, and (2) VI minimizes the cost of evaluating Q on the database D_v, among all feasible sets of views and indexes for the problem input. Here, D_v is the database that results from adding to D the stored data for all of the views and indexes in VI.

Definition 3. (OLAP-VI problem) For a given problem input (D, Q, b), the OLAP view- and index-selection (OLAP-VI) problem is the problem of finding an optimal set of views and indexes, as defined above.
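Returning briefly to the cost estimate of Section 2.1, the following Python sketch evaluates the ratio size(v) / size(v_π(q)) for the data of Example 2, under the simplifying assumption that every query attribute is compared with a constant. The view sizes are made-up numbers chosen only for illustration, and the function names are hypothetical.

```python
def longest_prefix_view(index: tuple, query: frozenset) -> frozenset:
    """Attributes of the longest prefix of `index` that all belong to `query`."""
    prefix = []
    for attr in index:
        if attr not in query:
            break  # the prefix ends at the first attribute outside the query
        prefix.append(attr)
    return frozenset(prefix)


def estimated_cost(size: dict, view: frozenset, index: tuple,
                   query: frozenset) -> float:
    """Estimated rows scanned when answering `query` with `view` and `index`."""
    if not query <= view:
        raise ValueError("the view cannot answer this query")
    return size[view] / size[longest_prefix_view(index, query)]


# Hypothetical row counts; the empty view has size 1 by convention.
size = {
    frozenset(): 1,
    frozenset("a"): 10,
    frozenset("abcd"): 1000,
}

v = frozenset("abcd")
pi = ("a", "c", "b", "d")
q = frozenset("ab")

prefix = longest_prefix_view(pi, q)   # only a: c follows a in the index but is not in q
cost = estimated_cost(size, v, pi, q)
```

With these assumed sizes the index reduces the estimated scan from 1000 rows to 100. An index whose first attribute lies outside the query yields the empty prefix, so the estimate falls back to the full size of the view.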

3 Note that if the set of the grouping attributes of view v1 is a subset of the set of grouping attributes of view v2, then we have size(v1) ≤ size(v2).


A solution for a given instance of problem OLAP-VI consists of a set of materialized views V* (which includes the raw-data view on D and all additional views that we choose to materialize), a set Π* of indexes over the views in V*, and an association between each element of Q and its corresponding elements of V* and Π*, i.e., which view in V* and which index in Π* (if any) should be used to answer each query in Q. Following the work in [21] and [22], we assume that the raw-data view is always in the solution, although in the context of our proposed models this assumption can be easily removed.
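To make Definitions 1–3 concrete, the following Python sketch solves a toy instance by complete enumeration. It is meant only as an illustration of the problem statement and is not the method proposed in this article. The sketch assumes that each fat index occupies as many rows as its view and that the raw-data view counts toward the storage limit; all row counts and function names are hypothetical.

```python
from itertools import combinations, permutations


def prefix_view(index, query):
    """Attributes in the longest prefix of `index` that all belong to `query`."""
    prefix = []
    for attr in index:
        if attr not in query:
            break
        prefix.append(attr)
    return frozenset(prefix)


def query_cost(query, views, indexes, size):
    """Cheapest estimated cost of `query` over the materialized structures."""
    costs = [size[v] for v in views if query <= v]
    costs += [size[v] / size[prefix_view(pi, query)]
              for v, pi in indexes if query <= v]
    return min(costs)


def brute_force_olap_vi(all_views, raw, size, queries, b):
    """Enumerate every feasible set of views and fat indexes; keep the cheapest."""
    optional = [v for v in all_views if v != raw]
    all_indexes = [(v, pi) for v in all_views if v
                   for pi in permutations(sorted(v))]
    best = None
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            views = (raw,) + extra  # the raw-data view is always kept
            # An index can be stored only if its view is materialized.
            usable = [ix for ix in all_indexes if ix[0] in views]
            for s in range(len(usable) + 1):
                for indexes in combinations(usable, s):
                    space = (sum(size[v] for v in views)
                             + sum(size[v] for v, _ in indexes))
                    if space > b:
                        continue  # violates the storage limit
                    cost = sum(query_cost(q, views, indexes, size)
                               for q in queries)
                    if best is None or cost < best[0]:
                        best = (cost, set(views), set(indexes))
    return best


A, B = frozenset("a"), frozenset("b")
RAW = frozenset("ab")
SIZE = {frozenset(): 1, A: 10, B: 20, RAW: 100}  # made-up row counts
best_cost, best_views, best_indexes = brute_force_olap_vi(
    [frozenset(), A, B, RAW], RAW, SIZE, queries=[A, B], b=150)
```

In this toy instance the storage limit leaves room for both single-attribute views but for only one of their indexes, so the choice of views and the choice of indexes interact.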

3. An integer programming model and its properties

We begin this section by introducing an integer programming model for the OLAP view- and index-selection problem. Integer programming models belong to a larger collection of models known as mathematical programming models. A mathematical programming model consists of a group of variables that represent various decisions in the context of the problem, a function of these variables that is referred to as the objective function, and a collection of constraints (restrictions) on the values of the decision variables. In this context we seek to find a set of values for the decision variables (i.e., a solution) that would minimize (or maximize) the objective function while satisfying all constraints. Such a solution is commonly referred to as an optimal solution. The specific characteristic that separates an integer programming (IP) model from other mathematical programming models is the fact that in this model all variables are restricted to integer values.

In the past few decades IP models have been extensively studied by various researchers and great strides have been made in developing effective and efficient techniques for solving such models. See [36] and [43] for a comprehensive discussion of this subject. Also, presently there are commercially available software systems (e.g., [24,37]) that employ these techniques, and such systems are routinely used to solve relatively large instances of the problem. But the computational requirements of these techniques (and the corresponding execution time of any software system that employs these techniques) typically depend on the mathematical structure of the model, and these requirements tend to grow exponentially with the size of the model (i.e., with the number of variables and constraints). Hence, in order to keep the computational requirements and the corresponding execution time relatively small, it is essential to keep the size of the model as small as possible.
In this section, after introducing an IP model for the OLAP index- and view-selection problem, we study the structural properties of this IP model and use these properties to remove a relatively large number of variables and constraints from the model. This results in an IP model that is significantly smaller, yet its optimal solution is guaranteed to be optimal for the original problem. This in turn, allows us to solve larger instances of the problem via this approach. We summarize these findings in Theorem 1 that we state in Section 3.3, and demonstrate their computational effectiveness through a numeric study that we present in Section 3.4.

3.1. Integer programming model IP1

For a given problem input (D, Q, b), we define the following notation.

V       the set of all views in the original search space of views
Π(v)    the set of all fat indexes of view v, v ∈ V
Q(v)    the set of all queries in the set Q that can be answered by view v, v ∈ V

The cardinality of the set V is 2^k, where k is the total number of distinct attributes in the database. We use the notation v_j to represent the jth view in the set V, for j = 1 to 2^k, and use the letter J to represent the corresponding collection of subscripts (i.e., J = {1, 2, 3, …, 2^k}). For ease of reference, in Table 1 we present a list of the symbols that we frequently use in the remainder of this section and throughout the paper.

In order to introduce the decision variables for the integer programming model, we need additional notation, as follows. Clearly, for each view v, the cardinality of the corresponding set of (fat) indexes Π(v) is equal to the total number of permutations of the elements of v, i.e., |Π(v)| = (|v|)!. For a given view v_j we denote its lth index by π_jl, for l = 1, …, (|v_j|)!, and, for brevity, we denote the collection of all indexes associated with v_j by Π_j. In other words, we use Π_j to denote the set Π(v_j), for j = 1, …, 2^k. We use the notation q_i (for i = 1, …, m) to denote the ith element of the given set of queries Q, i.e., Q = {q_1, q_2, …, q_m}. For each i ∈ {1, …, m}, let V_i = {v_j ∈ V : v_j ⊇ q_i} represent the collection of views each of which can be used to answer query q_i, and let J_i represent the corresponding collection of subscripts, i.e., J_i = {j ∈ J : v_j ∈ V_i}.

We are now prepared to define the decision variables for the integer programming model. The following variables are defined for subscript values i = 1, 2, …, m, j ∈ J_i, and l = 1, 2, …, (|v_j|)!.
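$$
s_{ij} =
\begin{cases}
1 & \text{if view } v_j \text{ is used to answer query } q_i \text{ with no index} \\
0 & \text{otherwise}
\end{cases}
$$

$$
y_{ijl} =
\begin{cases}
1 & \text{if view } v_j \text{ and its index } \pi_{jl} \text{ are used to answer query } q_i \\
0 & \text{otherwise}
\end{cases}
$$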


Table 1
The notations used in the paper.

Symbol       Meaning
D            The database
Q            The set of input queries
b            The available storage space
v            A view on D
π            A permutation of attributes of v
q            A query in Q
c_q(v, π)    The cost of answering query q using view v and index π
V*           An optimal set of views for (D, Q, b)
Π*           An optimal set of indexes for (D, Q, b)
V            The set of all views
v_j          The jth view in the set V
Π(v)         The set of all indexes for a view v
π_jl         The lth index of view v_j
Q(v)         The set of all the queries in the set Q that can be answered by view v

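The following variables are defined for subscript values j = 1, 2, …, 2^k and l = 1, 2, …, (|v_j|)!.

$$
t_j =
\begin{cases}
1 & \text{if view } v_j \text{ is materialized} \\
0 & \text{otherwise}
\end{cases}
$$

$$
x_{jl} =
\begin{cases}
1 & \text{if index } \pi_{jl} \text{ of view } v_j \text{ is materialized} \\
0 & \text{otherwise}
\end{cases}
$$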

Our problem OLAP-VI can now be stated as the following integer programming model that we refer to as model IP1. In this model we use the notation c_ijl to represent the value c_{q_i}(v_j, π_jl), which is the estimated cost of answering query q_i using view v_j and its index π_jl, as defined earlier. Correspondingly, we use the notation d_ij to represent the estimated cost of answering query q_i using view v_j with no indexes. As stated earlier, we have d_ij = size(v_j).
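$$
\min \; \sum_{i=1}^{m} \sum_{j \in J_i} \Big[ d_{ij}\, s_{ij} + \sum_{l=1}^{|v_j|!} c_{ijl}\, y_{ijl} \Big] \qquad \text{(IP1)} \qquad (2)
$$

subject to

$$
\sum_{j \in J_i} \Big[ s_{ij} + \sum_{l=1}^{|v_j|!} y_{ijl} \Big] = 1 \quad \text{for all } i = 1 \text{ to } m \qquad (3)
$$

$$
\sum_{j=1}^{2^k} \mathrm{size}(v_j) \Big[ t_j + \sum_{l=1}^{|v_j|!} x_{jl} \Big] \le b \qquad (4)
$$

$$
x_{jl} \le t_j \quad \text{for all } j = 1 \text{ to } 2^k, \text{ and } l = 1 \text{ to } |v_j|! \qquad (5)
$$

$$
s_{ij} \le t_j \quad \text{for all } i = 1 \text{ to } m, \text{ and } j \in J_i \qquad (6)
$$

$$
y_{ijl} \le x_{jl} \quad \text{for all } i = 1 \text{ to } m,\; j \in J_i, \text{ and } l = 1 \text{ to } |v_j|! \qquad (7)
$$

$$
t_1 = 1 \qquad (8)
$$

$$
t_j,\; s_{ij},\; x_{jl},\; y_{ijl} \in \{0, 1\} \quad \forall\, i, j, l.
$$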

Eq. (3) states that each query must be answered by exactly one view and either without indexes or with exactly one of its indexes. Constraint (4) states that the total storage requirement for the selected views and indexes should not exceed the pre-specified limit b. Recall that we measure the size of each view by the number of rows in that view. Also note that the size of each index for a view is the same as the size of the view itself. Correspondingly, we state the storage limit b in terms of the number of rows that we can store. Constraint (5) states that index π_jl for view v_j can be materialized only if the view itself is materialized. Similarly, constraint (6) states that query q_i can be answered by view v_j (without indexes) only if this view is materialized, and


constraint (7) states that query q_i can be answered by view v_j and its index π_jl only if this index is materialized. Finally, constraint (8) simply states that the raw-data view is always selected.⁴
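The role of each constraint in model IP1 can be illustrated with a small checker that tests a candidate 0–1 assignment against constraints (3)–(8) and evaluates the objective (2). The Python sketch below is an assumption-laden illustration rather than a solver: each variable family is represented by the set of subscript tuples whose value is 1, the function names are hypothetical, and the three-view instance uses made-up sizes with view 1 as the raw-data view.

```python
def is_feasible_ip1(t, x, s, y, J, size, b, raw=1):
    """Check constraints (3)-(8); t, x, s, y hold the subscripts set to 1."""
    for i, allowed in J.items():
        picks = ([j for (i2, j) in s if i2 == i]
                 + [j for (i2, j, _) in y if i2 == i])
        # (3): exactly one answering choice per query, taken from J_i
        if len(picks) != 1 or picks[0] not in allowed:
            return False
    # (4): views and indexes together must fit in the storage limit
    space = sum(size[j] for j in t) + sum(size[j] for (j, _) in x)
    if space > b:
        return False
    if any(j not in t for (j, _) in x):          # (5): x_jl <= t_j
        return False
    if any(j not in t for (_, j) in s):          # (6): s_ij <= t_j
        return False
    if any((j, l) not in x for (_, j, l) in y):  # (7): y_ijl <= x_jl
        return False
    return raw in t                              # (8): t_1 = 1


def objective_ip1(s, y, d, c):
    """Objective (2): total estimated cost of the chosen answering plans."""
    return (sum(d[i, j] for (i, j) in s)
            + sum(c[i, j, l] for (i, j, l) in y))


# Toy data: view 1 = {a, b} (raw), view 2 = {a}, view 3 = {b};
# query 1 = {a}, query 2 = {b}. All numbers are illustrative.
size = {1: 100, 2: 10, 3: 20}
J = {1: {1, 2}, 2: {1, 3}}
d = {(1, 1): 100, (1, 2): 10, (2, 1): 100, (2, 3): 20}
c = {(1, 2, 1): 1.0, (2, 3, 1): 1.0}

t = {1, 2, 3}     # all three views materialized
x = {(3, 1)}      # the single fat index of view 3
s = {(1, 2)}      # query 1 answered by view 2 without an index
y = {(2, 3, 1)}   # query 2 answered by view 3 with its index

feasible = is_feasible_ip1(t, x, s, y, J, size, b=150)
total_cost = objective_ip1(s, y, d, c)
```

Tightening the limit to b = 140 makes the same assignment violate constraint (4), since the selected structures occupy 150 rows.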

3.2. Reducing the size of the IP model

So far in the problem OLAP-VI and in the resulting integer programming model IP1 we have considered all views in the view set and all their corresponding indexes to be in the search space of the problem. For a realistic-size instance of the problem the total number of views and indexes in this search space, and hence the size of the corresponding integer programming model IP1, can be quite large. Note that for a database with k dimensions (attributes) the total number of views in the view set is 2^k and each view v_j has (|v_j|)! indexes. For an instance of the problem with k attributes and m queries this results in

$$
\sum_{i=1}^{m} \Big( |J_i| + \sum_{j \in J_i} |v_j|! \Big) + \sum_{j=1}^{2^k} |v_j|! + 2^k
$$

variables and

$$
\sum_{i=1}^{m} \Big( |J_i| + \sum_{j \in J_i} |v_j|! \Big) + \sum_{j=1}^{2^k} |v_j|! + m + 2
$$

constraints in the integer programming model IP1. Hence even for relatively small values of k and m, the resulting integer programming model can be quite large, and the corresponding execution time for solving this model can be excessively long even if we use relatively fast IP solvers, such as CPLEX 10. This, in turn, limits the applicability of this approach (i.e., using the integer programming model) to only very small instances of the problem.

In this subsection we characterize various properties of the views and indexes that appear in an optimal solution for this problem. This allows us to identify a relatively small subset of views and indexes that contains at least one set of optimal views and indexes for the problem. This, in turn, allows us to reduce the size of the corresponding integer programming model for a given instance of problem OLAP-VI, hence enabling us to solve larger instances of the problem using this approach within reasonable execution times.
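The size of model IP1 can be tallied directly from its definition: one s variable per pair (i, j) with j in J_i, one y variable per triple (i, j, l), one t variable per view, and one x variable per index, with constraints (3)–(8) counted family by family. The Python sketch below does this for a small attribute set; the function name `ip1_size` is hypothetical, and the view with no attributes is counted as having 0! = 1 index.

```python
from itertools import combinations
from math import factorial


def ip1_size(attributes, queries):
    """Return (number of variables, number of constraints) of model IP1."""
    k = len(attributes)
    views = [frozenset(c) for r in range(k + 1)
             for c in combinations(attributes, r)]
    x_vars = sum(factorial(len(v)) for v in views)   # one x per fat index
    s_vars = 0
    y_vars = 0
    for q in queries:
        answering = [v for v in views if q <= v]     # the views indexed by J_i
        s_vars += len(answering)
        y_vars += sum(factorial(len(v)) for v in answering)
    m = len(queries)
    variables = s_vars + y_vars + x_vars + len(views)
    # (3): m rows, (4): 1, (5): x_vars, (6): s_vars, (7): y_vars, (8): 1
    constraints = m + 1 + x_vars + s_vars + y_vars + 1
    return variables, constraints


small = ip1_size("ab", [frozenset("a"), frozenset("b")])
larger = ip1_size("abcdef", [frozenset("ab"), frozenset("cd"), frozenset("ef")])
```

Comparing `small` with `larger` shows how quickly the factorial terms come to dominate as attributes are added.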

3.2.1. Reducing the set of views

We start by making a few observations regarding the properties of the views that appear in an optimal solution for a given OLAP-VI problem. Proofs of these observations and of the lemma are straightforward and, for brevity, we do not include them in this article. Please see [6] for the details.

Observation 1. Given an instance of the OLAP-VI problem with input (D, Q, b), if a view v ∈ V is not a superset of at least one query in the set Q, then the problem has an optimal solution that does not include view v.

Observation 2. Given an instance of the OLAP-VI problem with input data (D, Q, b), if view v ∈ V has at least one attribute that is not in any of the queries in Q(v), then the problem has an optimal solution that does not include view v.

It follows that we can easily reduce the search space of views in the OLAP-VI problem by removing from the set V every view v that satisfies the condition of either observation,⁵ and the resulting problem (with the smaller collection of views in its search space) is guaranteed to have at least one optimal solution that is also optimal for the original OLAP-VI problem. We refer to this (reduced) set of views as V′.

The task of determining the set V′ itself requires some effort. In order to evaluate the corresponding computational requirements, we note that there are two methods to construct this set. In the first method we take the union of each subset of Q with r queries (for all 1 ≤ r ≤ |Q|) and add the resulting view to the set V′. In the second method, we consider each of the 2^k views in the view set and check whether its set of attributes is equal to the union of the sets of attributes of the queries that it can answer. Note that we always add the raw-data view to the set V′. The computational requirement of the first method is of order O(2^|Q|), while the computational requirement of the second method is of order O(|Q| 2^k). Depending on the specific values of k and |Q|, we choose either the first or the second method, whichever results in smaller computational effort.
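The first construction method can be sketched in a few lines of Python. This is only a minimal illustration under the assumption that queries and views are represented as sets of attribute names; the function name reduced_views is hypothetical.

```python
from itertools import combinations


def reduced_views(queries, raw_view):
    """Return the unions of all non-empty subsets of queries, plus the raw-data view.

    queries  : iterable of attribute sets (one per query)
    raw_view : attribute set of the raw-data view (always kept)
    """
    queries = [frozenset(q) for q in queries]
    views = {frozenset(raw_view)}
    for r in range(1, len(queries) + 1):
        for subset in combinations(queries, r):
            views.add(frozenset().union(*subset))
    return views


if __name__ == "__main__":
    example = reduced_views([{"a", "b"}, {"b", "c"}], raw_view={"a", "b", "c", "d"})
    for view in sorted(example, key=lambda s: (len(s), sorted(s))):
        print(sorted(view))
```

The loop visits every non-empty subset of Q once, which matches the O(2^|Q|) order stated for this method; duplicate unions are absorbed by the set.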

3.2.2. Reducing the set of indexes

We now focus on the properties of indexes that appear in an optimal solution for the problem, and use these properties to identify a relatively small subset of these indexes for inclusion in our model. In particular, for each view v ∈ V′ we identify a subset Π′(v) of Π(v) that contains at least one optimal index for this view in the context of any optimal solution for the problem.

In order to characterize this restricted collection of indexes Π′(v) for each view v, we define and construct a directed graph (digraph) G_v associated with this view. Each node of the digraph G_v corresponds to a set of attributes that either is equal to the set of attributes of one of the queries in Q(v), or is equal to the intersection of the sets of attributes of two or more queries in Q(v). It follows that associated with each combination of r queries, for r = 1, 2, …, |Q(v)|, there is a node in digraph G_v. Two additional nodes are also included in G_v: one node is associated with the view v itself, and the other node represents the empty set ∅. For each pair of nodes w1 and w2 in G_v, there is an arc from w1 to w2 if and only if w1 ⊂ w2 and there is no node w ∈ G_v where w1 ⊂ w ⊂ w2. Note that G_v has a single source ∅ and a single sink v. For a given view v, the total number of nodes in G_v is at most min{2^|v|, 2^|Q(v)|}. In practice, however, the actual number of nodes in G_v may be smaller than this limit.
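The construction of G_v can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: attribute sets are modeled as frozensets, the function names are hypothetical, and index_for_path uses one assumed rule (append the attributes newly introduced at each node of a path, in sorted order) to produce an attribute ordering in which every node on the path forms a prefix.

```python
from itertools import combinations


def build_digraph(view, queries):
    """Nodes: query sets, their intersections, the view, and the empty set.
    Arcs: u -> w when u is a proper subset of w with no node strictly between."""
    view = frozenset(view)
    qs = [frozenset(q) for q in queries if frozenset(q) <= view]
    nodes = {view, frozenset()}
    for r in range(1, len(qs) + 1):
        for combo in combinations(qs, r):
            nodes.add(frozenset.intersection(*combo))
    arcs = {u: [] for u in nodes}
    for u in nodes:
        for w in nodes:
            if u < w and not any(u < x < w for x in nodes):
                arcs[u].append(w)
    return nodes, arcs


def source_sink_paths(view, arcs):
    """Enumerate all paths from the empty set (source) to the view (sink)."""
    view = frozenset(view)

    def extend(path):
        if path[-1] == view:
            yield path
            return
        for nxt in arcs[path[-1]]:
            yield from extend(path + [nxt])

    yield from extend([frozenset()])


def index_for_path(path):
    """One attribute ordering whose prefixes include every node on the path."""
    index = []
    for prev, cur in zip(path, path[1:]):
        index.extend(sorted(cur - prev))
    return tuple(index)


if __name__ == "__main__":
    v = "abcdefg"
    qs = ["abc", "cde", "efg"]
    nodes, arcs = build_digraph(v, qs)
    for p in source_sink_paths(v, arcs):
        print(index_for_path(p))
```

For a view with seven attributes and three overlapping queries, as in the usage example, the digraph has seven nodes, so only a handful of orderings need to be considered instead of all 7! permutations.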
4 In our IP model we denote this raw-data view by v_1.
5 Note that in model IP1 we have already excluded the views covered by Observation 1, via subsets V_i and J_i.


Definition 4. Given a view v and its corresponding digraph G_v, let P represent a path that begins at the source node ∅ and ends at any node w of G_v. For a given index π ∈ Π(v) we say that π is associated with path P if the set of attributes of each and every node in P is a prefix of π.

Definition 5. Given a view v and its corresponding digraph G_v, let P represent a path that begins at the source node ∅ and ends at a node w_s. Suppose the order of the nodes on path P is ∅, w_1, w_2, …, w_{i−1}, w_i, w_{i+1}, …, w_s. Given a query q ∈ Q(v), we say that query q agrees with path P up to node w_i if the set of attributes of each of the nodes w_1, w_2, …, w_{i−1} is a subset of q, but the set of attributes of the node w_i is not a subset of q.

Lemma 1. Given a view v and its corresponding digraph G_v, if two indexes π_1 and π_2 of view v are associated with the same source-sink path in G_v, then c_q(v, π_1) = c_q(v, π_2) for every query q ∈ Q(v).

Proof. Let the source-sink path be denoted by P. Consider the relationship between an arbitrary query q ∈ Q(v) and the path P. Suppose query q agrees with P up to node z on P. Also, suppose w is the node immediately before z on P. From Definition 5 we have q ∩ w = w and q ∩ z = w. This is because if q has some of the attributes in z \ w then there must exist a node w′ = q ∩ z on P between nodes w and z. We know that this is not the case, since there is an arc from node w to node z in G_v. On the other hand, since π_1 and π_2 are associated with path P, both of them have all of the attributes in w and all of the attributes in z as their prefix; hence neither of them has any of the attributes in q \ w after their first |w| attributes. Thus, the largest subset of q that forms a prefix of π_1 is the same as the largest subset of q that forms a prefix of π_2, i.e., v_π1(q) = v_π2(q). As a result, c_q(v, π_1) = c_q(v, π_2).

Lemma 2. Given a view v and its corresponding digraph G_v, if an index π of view v is not associated with any source-sink path in digraph G_v, then there exists another index π′ of view v associated with a source-sink path in G_v such that c_q(v, π′) ≤ c_q(v, π) for every query q ∈ Q(v).

Proof. Let P(π) be the longest path of G_v that is associated with π. Let w be the last node in P(π) and let r = |w|. Since π is not associated with any source-sink path of G_v, P(π) is not a source-sink path; thus w ≠ v and 0 ≤ r < |v|. Suppose the order of attributes in π after the first r attributes is (a_{r+1}, a_{r+2}, …, a_{|v|}). We define the set Q_j for r + 1 ≤ j ≤ |v| as follows: Q_j = {q ∈ Q(v) | w ∪ {a_{r+1}, …, a_j} ⊆ q}. Let t = max_{r+1 ≤ j ≤ |v|} {j | Q_j ≠ ∅}. We identify node w_j in G_v, for r + 1 ≤ j ≤ t, as the node that corresponds to the intersection of all the queries in Q_j. We have w ⊆ w_{r+1} ⊆ w_{r+2} ⊆ … ⊆ w_t. As a result, there exists a source-sink path P′ that contains all of the nodes on P(π) and nodes w_{r+1}, w_{r+2}, …, w_t. Suppose π′ is an index associated with P′. Also, suppose the order of the first r attributes in π′ is the same as the order of the first r attributes in π. We now show that c_q(v, π′) ≤ c_q(v, π) for all q ∈ Q(v).

Consider any query q ∈ Q(v). One of the following cases is true: (1) (q ∩ w) ⊂ w; or (2) q = w; or (3) q ⊃ w. If case (1) is true, i.e., q has some (but not all) of the attributes of w, then we have v_π(q) = v_π′(q). This is because both π and π′ have all of the attributes of w as their prefix in the same order. If case (2) is true, then we clearly have v_π(q) = v_π′(q) = w. Now suppose case (3) is true. If q has all of the attributes in {a_{r+1}, a_{r+2}, …, a_{|v|}}, then q = v; thus, v_π(q) = v_π′(q) = v. On the other hand, if q does not contain a_{r+1}, then v_π(q) = w. Also, we know that the order of the first r attributes in π and π′ is the same. Thus v_π′(q) ⊇ w. As a result, if q does not contain a_{r+1} then we have v_π′(q) ⊇ v_π(q). Now let us consider the case where q contains a_{r+1}, but does not contain all attributes in {a_{r+1}, a_{r+2}, …, a_{|v|}}.
Furthermore, let us assume that q has all of the attributes in {a_{r+1}, a_{r+2}, …, a_h}, where r + 1 ≤ h < t, but does not include the attribute a_{h+1}. It follows that q ∈ Q_h and v_π(q) = w ∪ {a_{r+1}, a_{r+2}, …, a_h}. Based on the definition of w_h, we have w ∪ {a_{r+1}, a_{r+2}, …, a_h} ⊆ w_h. Thus v_π(q) ⊆ w_h. From the fact that q ∈ Q_h we have w_h ⊆ q. Since w_h is a node on P′ and π′ is associated with P′, all of the attributes of w_h form a prefix of π′. Thus v_π′(q) ⊇ w_h. It follows that for all queries in case (3) we have v_π′(q) ⊇ v_π(q). We conclude that v_π′(q) ⊇ v_π(q) for any query q in Q(v). Thus we have size(v_π′(q)) ≥ size(v_π(q)), and consequently c_q(v, π′) ≤ c_q(v, π) for every query q ∈ Q(v).

From Lemma 1 it follows that if two indexes of view v are associated with the same source-sink path of digraph G_v, then they have the same effect on reducing the cost of answering each query in Q(v). From Lemma 2 it follows that for each index π of view v that is not associated with any source-sink path of digraph G_v, we can find an index π′ of view v that is associated with a source-sink path of G_v and is at least as effective as π in reducing the cost of answering each query in Q(v). We are now ready to define the set Π′(v) for each view v ∈ V′.

Definition 6. For a given view v construct the corresponding digraph G_v and determine all distinct source-sink paths in this digraph. For each source-sink path P_i obtained in this manner, determine an associated index π_i. We define Π′(v) as the collection of all indexes for view v obtained in this manner.

Following is a small illustrative example.

Example 3. Consider view v = {a, b, c, d, e, f, g} and suppose Q(v) consists of the following queries: q_1 = {a, b, c}, q_2 = {c, d, e}, and q_3 = {e, f, g}. Fig. 1 represents digraph G_v for v. The source-sink paths in this digraph are as follows:

1. (∅) → (c) → (a, b, c) → (a, b, c, d, e, f, g)
2. (∅) → (c) → (c, d, e) → (a, b, c, d, e, f, g)
3. (∅) → (e) → (c, d, e) → (a, b, c, d, e, f, g)
4. (∅) → (e) → (e, f, g) → (a, b, c, d, e, f, g)

[Fig. 1. Digraph G_v for Example 3. Nodes: ∅; (c); (e); (a, b, c); (c, d, e); (e, f, g); (a, b, c, d, e, f, g).]

An index associated with the first path should have attribute c at its first position, then attributes a and b (in any order), and next attributes d, e, f, and g, in any order after a, b, and c. Thus, the permutation vector (c, a, b, d, e, f, g) is an index associated with the first path. Similarly, we observe that permutation vectors (c, d, e, a, b, f, g), (e, c, d, a, b, f, g), and (e, f, g, a, b, c, d) are indexes associated with the second, the third, and the fourth path, respectively. Thus, we have:

Π′(v) = {(c, a, b, d, e, f, g), (c, d, e, a, b, f, g), (e, c, d, a, b, f, g), (e, f, g, a, b, c, d)}.

We note that in this example, |Π′(v)| = 4, whereas |Π(v)| = (|v|)! = 5040.

From the above discussion it follows that for each view v and for each query q ∈ Q, the set Π′(v) contains at least one index that is at least as effective as any other index in Π(v) for answering query q with this view. It follows that we can easily reduce the search space of indexes in the OLAP-VI problem by limiting our search to the smaller collection Π′(v) rather than Π(v) for each view v, and the resulting (smaller) search is guaranteed to produce at least one optimal solution for the original problem. This observation, along with the observations that we made earlier regarding the reduction in the size of the search space of views, could lead to potentially significant reductions in the size of the entire search space of views and indexes for the OLAP-VI problem, as we shall see in a few randomly constructed instances in Section 3.4. Correspondingly, we can remove all associated variables and constraints from the integer programming model IP1, leading to a smaller model for the problem; we refer to this model as IP2 as described below. For each view v, the computational requirement for constructing the corresponding digraph G_v is of order O(min{8^|v|, 8^|Q(v)|}).

3.3. Modified integer programming model IP2

In this model we use the following notation to represent various restricted subsets of views and indexes and their corresponding collections of subscripts. For each i = 1, 2, …, m, let V′_i = {v_j ∈ V′ : v_j ⊇ q_i} represent the restricted collection of views that can be used to answer query q_i (where V′ is as defined in Section 3.2), and let J′_i represent the corresponding collection of subscripts, i.e., J′_i = {j ∈ J : v_j ∈ V′_i}. Also let J′ represent the collection of subscripts of all views in V′. For each view v_j ∈ V′, let Π′(v_j) represent the restricted collection of its indexes as defined above, and let L′_j represent the corresponding collection of subscripts, i.e., L′_j = {l : π_jl ∈ Π′(v_j)}. We can now write the integer programming model IP2 using this notation.

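$$
\begin{aligned}
\min\quad & \sum_{i=1}^{m} \sum_{j \in J'_i} \Bigl[ d_{ij}\, s_{ij} + \sum_{l \in L'_j} c_{ijl}\, y_{ijl} \Bigr] && \text{(IP2)} \\
\text{subject to}\quad & \sum_{j \in J'_i} \Bigl[ s_{ij} + \sum_{l \in L'_j} y_{ijl} \Bigr] = 1 && \text{for all } i = 1 \text{ to } m \\
& \sum_{j \in J'} \mathrm{size}(v_j) \Bigl[ t_j + \sum_{l \in L'_j} x_{jl} \Bigr] \le b \\
& x_{jl} \le t_j && \text{for all } j \in J' \text{ and } l \in L'_j \\
& s_{ij} \le t_j && \text{for all } i = 1 \text{ to } m \text{ and } j \in J'_i \\
& y_{ijl} \le x_{jl} && \text{for all } i = 1 \text{ to } m,\ j \in J'_i, \text{ and } l \in L'_j \\
& t_1 = 1 \\
& t_j,\ s_{ij},\ x_{jl},\ y_{ijl} \in \{0, 1\} && \forall\, i, j, l
\end{aligned}
$$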
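To make the roles of the variables concrete, the following Python sketch solves a tiny instance of this model by plain enumeration rather than with an IP solver. It is only an illustration under stated assumptions: the function name, the dictionary-based data layout, and the toy numbers are all hypothetical, and enumeration is feasible only for very small instances. Choosing a set of views fixes the t_j, choosing a set of (view, index) pairs fixes the x_jl, and each query is then assigned its cheapest available option, which fixes s_ij and y_ijl.

```python
from itertools import chain, combinations


def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def solve_by_enumeration(size, indexes, d, c, budget, raw=0):
    """Enumerate feasible selections of views and indexes and return the cheapest.

    size[j]       : size of view j
    indexes[j]    : labels l of the candidate indexes of view j
    d[i][j]       : cost of answering query i from view j without an index
                    (key j absent if view j cannot answer query i)
    c[i][(j, l)]  : cost of answering query i from view j using its index l
    raw           : subscript of the raw-data view, which must answer every query
    """
    pairs = [(j, l) for j in size for l in indexes[j]]
    best = (float("inf"), None, None)
    for views in powerset(size):
        if raw not in views:  # t_1 = 1
            continue
        # An index may be selected only if its view is selected (x_jl <= t_j).
        for idx in powerset(p for p in pairs if p[0] in views):
            used = sum(size[j] for j in views) + sum(size[j] for j, _ in idx)
            if used > budget:  # storage constraint
                continue
            total = 0
            for i in d:
                options = [d[i][j] for j in views if j in d[i]]
                options += [c.get(i, {})[p] for p in idx if p in c.get(i, {})]
                total += min(options)  # exactly one option per query
            if total < best[0]:
                best = (total, set(views), set(idx))
    return best


if __name__ == "__main__":
    size = {0: 100, 1: 20}                       # view 0 is the raw-data view
    indexes = {0: ["a"], 1: ["a"]}
    d = {0: {0: 100, 1: 20}, 1: {0: 100}}
    c = {0: {(0, "a"): 10, (1, "a"): 2}, 1: {(0, "a"): 50}}
    print(solve_by_enumeration(size, indexes, d, c, budget=150))
```

Note that, as in the storage constraint of the model, each selected index is charged the full size of its view.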


The following theorem follows directly from Observations 1 and 2, and Lemmas 1 and 2.

Theorem 1. Given an OLAP-VI problem with input (D, Q, b), if we define the set V′ as in Section 3.2, and the set Π′(v) for all v ∈ V′ as in Definition 6, then we have the following. (i) Every optimal solution of the integer programming model IP2 is also an optimal solution for the integer programming model IP1, and (ii) the integer programming model IP1 has at least one optimal solution that is also an optimal solution for model IP2.

For a given OLAP-VI problem, the number of views and indexes considered in model IP2 can be significantly smaller than the corresponding number in model IP1. This is partly due to the fact that the restricted set of views V′ can be substantially smaller than the original set V, and partly due to the smaller number of indexes Π′(v) for each view v that we consider in model IP2.

3.4. Numeric results

In order to observe the impact of the above reduction operations we compare the size of the corresponding models IP1 and IP2 for a collection of instances of the view- and index-selection problem. More specifically, we compare the number of views and indexes in models IP1 and IP2 for each instance. The databases that we used are 7-attribute and 13-attribute TPC-H databases, see [40]. TPC-H is a benchmark widely used in the database community for measuring the evaluation performance of OLAP queries. Our TPC-H datasets are formed by selecting star-schema subsets of the overall TPC-H schema (please see [40] for the details). We discuss these databases, and the procedure that we used to build the collection of instances, further in Section 5.

Within each database we constructed 10 instances of the problem with the number of randomly generated queries for these instances ranging from 10 to 50. Our main observation is that in all instances the number of views and indexes in model IP2 is significantly smaller than those in model IP1. Of course the magnitude of reduction we observed in each instance depends on the size of that instance and the specific collection of queries present. But on average for the instances in the 7-attribute database the numbers of views and indexes in the model IP2 were 39.4% and 12.4%, respectively, of those in the model IP1, and the corresponding numbers for the instances in the 13-attribute database are 2.4% and 13.2%, respectively. The relative magnitude of reduction as a fraction of total size of the instance is higher for the instances with smaller number of queries as compared to those with larger number of queries. Further details about the size of the IP models are presented in Section 5, and complete numeric results are presented in [6].

4.
An IP-based heuristic method

In this section we propose a strategy to further reduce the size of the integer programming model for our view- and index-selection problem. In this context, we limit the choice of indexes for each view v ∈ V′ (Section 3.2) to a relatively small collection of promising alternatives for this view, which we refer to as Π″(v). The cardinality (size) of the collection Π″(v) is typically much smaller than the corresponding collection Π′(v) that we defined in Section 3.2. This, in turn, would allow us to solve the IP model for larger number of queries and with larger number of attributes, hence allowing us to solve larger instances of the OLAP-VI problem. We refer to this smaller integer programming model as IPN. The downside of this approach, however, is that we can no longer guarantee that an optimal solution of the resulting integer programming model (IPN) is an optimal solution of the original OLAP-VI problem. In Section 4.1 we discuss an algorithm to obtain the collection Π″(v) for each view v ∈ V′, and in Section 4.2 we present numeric results comparing the size of the reduced IP model with that of the original model.

4.1. Reduced collection of indexes

For each view v ∈ V′ we limit the number of indexes associated with this view to a relatively small positive integer that we refer to as N_v. Smaller values of N_v for each view v result in a smaller IP model and thus a higher degree of scalability for this approach, but the resulting optimal solution of the IP model would be potentially further away from the corresponding optimal solution of the original OLAP-VI problem. Larger values of N_v would have the opposite effect. In our implementation we select the value of N_v equal to |Q(v)|. This choice of value for N_v is inspired by the fact that |Q(v)| is an upper bound on the number of indexes for view v at the optimal solution of model IP2 presented in Section 3.2 (note that each query q ∈ Q(v) can be answered optimally by view v with at most one index for this view).

We select the specific N_v indexes associated with each view v using a greedy procedure, adding one index at a time until we reach N_v indexes. At each step of this greedy procedure we select the index that, along with all previously selected indexes for this view, results in the least cost of processing the queries in Q(v). The following formulas express this idea in a concise manner.

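$$
\begin{aligned}
\pi_v^{1} &= \arg\min_{\pi \in \Pi'(v)} \sum_{q \in Q(v)} c_q(v, \pi) \\
\pi_v^{2} &= \arg\min_{\pi \in \Pi'(v)} \sum_{q \in Q(v)} \min\bigl\{ c_q(v, \pi),\ c_q(v, \pi_v^{1}) \bigr\} \\
\pi_v^{3} &= \arg\min_{\pi \in \Pi'(v)} \sum_{q \in Q(v)} \min\bigl\{ c_q(v, \pi),\ c_q(v, \pi_v^{1}),\ c_q(v, \pi_v^{2}) \bigr\} \\
&\ \ \vdots \\
\pi_v^{N_v} &= \arg\min_{\pi \in \Pi'(v)} \sum_{q \in Q(v)} \min\bigl\{ c_q(v, \pi),\ c_q(v, \pi_v^{1}),\ c_q(v, \pi_v^{2}),\ \ldots,\ c_q(v, \pi_v^{N_v - 1}) \bigr\}
\end{aligned}
$$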


We define the reduced collection of indexes Π″(v) as {π_v^1, π_v^2, …, π_v^{N_v}}. (If the total number of indexes for view v, i.e. |Π′(v)|, is less than |Q(v)|, we select all indexes in Π′(v) for this view.)
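The greedy selection expressed by these formulas can be sketched in Python as follows. This is a direct, unoptimized rendering of the formulas and not the digraph-based procedure developed later in the article. The cost function is passed in as a parameter; toy_cost is a purely hypothetical stand-in (the cost drops by a fixed factor for every leading index attribute that belongs to the query) used only to make the example runnable.

```python
def greedy_indexes(candidates, queries, cost, n):
    """Pick up to n indexes; each step adds the index that minimizes the
    total workload cost together with the indexes chosen so far."""
    chosen = []
    best = {q: float("inf") for q in queries}  # best cost per query so far
    remaining = list(candidates)
    while remaining and len(chosen) < n:
        pick = min(
            remaining,
            key=lambda pi: sum(min(best[q], cost(q, pi)) for q in queries),
        )
        chosen.append(pick)
        remaining.remove(pick)
        for q in queries:
            best[q] = min(best[q], cost(q, pick))
    return chosen


def toy_cost(q, pi, view_size=1000.0, fanout=10.0):
    """Hypothetical cost model: each leading attribute of pi that is in q
    divides the number of rows to scan by `fanout`."""
    matched = 0
    for attr in pi:
        if attr not in q:
            break
        matched += 1
    return view_size / fanout ** matched


if __name__ == "__main__":
    queries = [frozenset("abc"), frozenset("cde"), frozenset("efg")]
    candidates = [
        tuple("cabdefg"),
        tuple("cdeabfg"),
        tuple("ecdabfg"),
        tuple("efgabcd"),
    ]
    for pi in greedy_indexes(candidates, queries, toy_cost, n=len(queries)):
        print(pi)
```

Starting each query's best cost at infinity makes the first iteration coincide with the first formula, in which no previously selected index appears inside the minimum.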

4.1.1. Model IPN

We define the model IPN similar to model IP2, except that we use the set of indexes Π″(v) in place of Π′(v) for each view v ∈ V′. Note that while the number of variables and constraints in model IPN can be significantly smaller than the corresponding numbers in model IP2, the optimal solution of this model is no longer guaranteed to be optimal for the original problem. In the remainder of this section, we present an efficient algorithm to determine the set Π″(v) for each view v ∈ V′; we refer to this algorithm as Algorithm IPNIDX.

4.1.2. Algorithm IPNIDX

The pseudocode for Algorithm IPNIDX is displayed after this discussion. This algorithm is iterative and in each iteration we find one index, that is, in the first iteration we find π_v^1, in the second iteration we find π_v^2, and so on. In each iteration, we consider the nodes of digraph G_v in topological order. For each node w we find an order for the attributes of w, based on the order of the attributes of one of its parent nodes. We refer to this order as perm(w). The last node considered in this topological order is the sink node v, and we declare the corresponding order perm(v) as the index selected in this iteration.

At the end of each iteration, we update the value of MCS(q) for each q ∈ Q(v), where MCS(q) is the minimum cost of answering query q using the indexes selected so far. At the beginning of the first iteration, MCS(q) = size(v) for all queries in Q(v). To find the order of attributes of node w in each iteration, first we consider the set of queries that affect the order of the attributes of w, i.e., queries in the set Q′ = {q ∈ Q_temp | q ∩ w ≠ ∅ and q ∩ w ≠ w}. Note that query q ∈ Q(v) is in Q_temp if its set of attributes does not form a prefix of any index selected so far in the algorithm. Also, from the property of indexes in Π′(v) we have perm(w) = (perm(u), arb(w\u)), where u is one of the parent nodes of node w, and arb(w\u) is an arbitrary order of the attributes in w\u. Thus, we need to find the parent of w that yields the minimum total cost of answering the queries in Q′. Since we consider the nodes in the topological order, at the time of computing perm(w) we have cost(u, q) for each parent node u and for each query in Q′; thus we can compute cost(u) = Σ_{q ∈ Q′} min{cost(u, q), MCS(q)} for each parent node u of node w.

Given the digraph G_v for view v, the computational requirement of Algorithm IPNIDX is O(N_v |Q(v)| 4^min{|Q(v)|, |v|}). (To see this, note that the while loop of Algorithm IPNIDX is repeated N_v times, and that the for loop in the while loop is repeated once per node, i.e., at most 2^|Q(v)| times.)

4.2. Numeric results

In order to observe the impact of the above reduction operations we compare the size of the corresponding models IP2 and IPN. More specifically, we compare the number of indexes in models IP2 and IPN (note that the number of views in the two models is the same) for the same collection of 20 instances that we mentioned in Section 3.4. We observed that in every instance the number of indexes in model IPN is significantly smaller than the comparable number in model IP2, as expected, although once again the specific values vary among the instances. On average, for the collection of 20 instances that we constructed, the number of indexes in model IPN is less than 0.1% of those in model IP2. This fraction is relatively higher for the smaller instances and it is significantly lower for the larger instances. Further details about this subject are provided in the context of our computational results in Section 5 and in [6].

5. Experimental results

We now present the results of a computational experiment with the models presented in this article. Our objectives in this experiment are: (1) to evaluate the scalability of the exact model IP2, and (2) to evaluate the scalability of the inexact model IPN and the quality of the solutions obtained by this model. To this end, we construct a collection of instances of the OLAP-VI problem of varying sizes on a number of databases that we have created using the data in the TPC-H benchmark at scale factor one, that is at the overall size of 1 GB (please see [40] for the details). Each database that we created has a schema that (i) is a subset of the schema of the original TPC-H database, and (ii) is a star schema. Once we created each such star schema, we created the corresponding database using exactly the TPC-H tables mentioned in the schema. (For instance, one of the databases that we created consists of the TPC-H data tables lineitem, orders, and part, and includes no other TPC-H data table.) We then solve each instance using different models and procedures as applicable and report our findings.

We developed all programs in C++ and ran them on a PC with a 3 GHz Intel P4 processor, 1 GB RAM, cache size of 512 KB, and an 80 GB hard drive running Red Hat Linux Enterprise 4. We used CPLEX 10 (see [24]) to solve the integer programming models.

5.1.
Data sets

Each instance of the OLAP-VI problem is identified by a given database D, a given collection of queries Q, and a given storage space b. We used two different databases of the TPC-H benchmark (see [40]), a 7-attribute database and a 13-attribute database, to construct the collection of instances in our experiment. We measured the sizes of the views in each database using the analytical method. More precisely, each database was obtained by adding to the original stored TPC-H tables generated with scale factor one, a single relation, with either 7 or 13 grouping attributes, that results from the natural join of a subset of the set of these base relations. Observe that the 7-attribute table and the 13-attribute table that we create in this manner are the raw-data tables for the 7-attribute view set, and the 13-attribute view set, respectively. For each raw-data table, we materialized the entire view set on the respective TPC-H database generated with scale factor 0.1 (to comply with our storage-space restrictions), and measured exactly the size of each view in the set using a SQL count query. Then we multiplied each of these exact view sizes by 10, to obtain estimates of the sizes of the views in the view set for the scale-one databases used in this experiment.

Aside from the number of attributes in the database, the size of each instance is determined by the number and the make-up of its queries. Within each database we constructed instances of the OLAP-VI problem with the number of queries ranging from 3 to 50. The sizes of the instances that we solved are realistic and comparable to the sizes of the instances used in the related work (cf. [3,13,16,25]). For each instance we constructed the corresponding collection of queries randomly.
More specifically, to construct an instance of the OLAP-VI problem with g queries over a database with k attributes, we first determined the number of attributes in each query as a randomly generated integer (t) between 1 and k − 1. Then for each query with t attributes we constructed its actual collection of attributes by randomly generating t distinct integer values between 1 and k. These t integer values uniquely identify the collection of attributes for that query.⁶

The difficulty of solving a specific instance of the OLAP-VI problem depends on the relative value of the storage space b as compared with the size of the raw-data view plus the size of the queries in the set Q. Suppose the value of storage space b is expressed as:

$$ b = \mathrm{size}(v_1) + \alpha \sum_{q \in Q} \mathrm{size}(q) \tag{10} $$

where v_1 represents the raw-data view. If α < 0 then the problem is infeasible, since the available storage space b is not even sufficient for storing the raw-data view (which is a required selection). If α = 0 then the problem is not challenging, since there is only enough space to select the raw-data view. If α ≥ 2, again the problem is not challenging since the best solution is clearly to materialize the raw-data view plus all queries in the set Q and an optimal index for each query. Thus, in order for an instance of the view- and index-selection problem to be nontrivial, we need to have 0 < α < 2. In our experiments for each instance we set the value of α = 0.5, i.e., the storage space limit b is equal to the size of the raw-data view plus one-half of the sum of the sizes of its collection of queries. For some instances we also solved the problem by setting α = 0.1, 0.2, 0.3, …, 0.9, 1; the pattern of findings did not change very much, although the actual solution did change as expected.
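The instance-construction procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' C++ code: the function names are hypothetical, attributes are identified by the integers 1 to k, and the view sizes passed to storage_limit are made-up numbers standing in for sizes measured on the database.

```python
import random


def random_queries(g, k, rng):
    """Generate g queries over attributes 1..k; each query gets t distinct
    attributes, with t drawn uniformly from 1..k-1."""
    queries = []
    for _ in range(g):
        t = rng.randint(1, k - 1)
        queries.append(frozenset(rng.sample(range(1, k + 1), t)))
    return queries


def storage_limit(raw_size, query_sizes, alpha=0.5):
    """Storage space b = size(v1) + alpha * (sum of the query sizes)."""
    return raw_size + alpha * sum(query_sizes)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    for q in random_queries(g=5, k=7, rng=rng):
        print(sorted(q))
    print(storage_limit(raw_size=100, query_sizes=[10, 20, 30], alpha=0.5))
```

Passing an explicit random generator keeps the sketch repeatable; the article does not state how its random numbers were seeded.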
6 Consistent with the assumption that we made earlier, we continue to assume that for each query all associated attributes are in its WHERE clause and they are compared with constants.

Table 2
Comparing models IP2 and IPN, for the ten instances from the 7-attribute TPC-H database.

Instance   Number of   Optimal cost for model   Execution time (s) for model
           queries     IP2        IPN           IP2        IPN
1          20          20.25      20.25         10.14      0.73
2          20          20.41      22.17         2.17       0.62
3          20          20.29      20.54         3.66       0.75
4          20          20.42      20.66         1.93       0.76
5          20          20.41      20.52         3.26       0.74
6          20          20.17      21.00         2.99       0.66
7          20          21.05      21.16         4.01       1.59
8          30          30.29      30.30         13.82      1.69
9          40          40.35      40.35         50.57      1.16
10         50          50.16      50.30         500.50     3.56

5.2. Results

Our first experiment consists of solving a collection of instances of the OLAP-VI problem using the exact model IP2 and the inexact model IPN. We solved ten instances from the 7-attribute database and ten instances from the 13-attribute database with the number of queries ranging between 10 and 50.⁷ We report our findings in Tables 2 and 3, respectively. For each instance we report (1) the number of queries, (2) the optimal value of the corresponding integer programming models IP2 and IPN (all integer programming models are solved using CPLEX 10), and (3) the execution time, which includes both the pre-processing time to construct the restricted view sets and index sets in the IP models as well as the solving time using CPLEX 10. For these collections of instances, we make the following observations:

• The execution time for solving model IP2 is relatively small for small to moderate size instances, but for larger instances, the execution time increases rapidly. For the largest instance in Table 2 (instance 10), the execution time is over 8 min, and for the larger instances in Table 3 (instances 17 through 20) the execution time exceeds our one hour time limit. For all instances where the execution was completed, the reported cost for model IP2 is optimal for the corresponding OLAP-VI problem.

• The execution time for solving the inexact model IPN is smaller than that of IP2 across all instances, and the difference is more significant for larger instances. Also, the cost of the solution obtained via model IPN is close to the optimal cost for those instances where we know the optimal solution (obtained via the exact model IP2). For this collection of instances, the average cost obtained via model IPN is 1% more than the corresponding optimal cost, and the maximum deviation from the optimal cost is 9% (instance 2 in Table 2). For the remaining instances where we do not know the optimal cost (instances 17 through 20 in Table 3), we observe that the cost obtained via IPN is at most 4% larger than a corresponding lower bound (i.e., the number of queries).⁸

We also solved a larger instance on the 13-attribute database with 100 queries using model IPN. The total time required to solve this instance using model IPN was 2674 s (i.e., about 45 min), and the value of cost obtained from this model is 100.06 (which is slightly larger than the corresponding lower bound 100). We could not solve model IP2 for this instance, since its execution time would be excessive. In all instances we also observed that the pre-processing time used to build each of the models IP2 and IPN is significantly smaller than the corresponding CPLEX time used to solve that model. For brevity we do not report these time durations separately.

We also carried out a comparative study of our heuristic approach with other algorithms for solving the OLAP-VI problem. Many of the existing algorithms in the open literature address either the view selection problem (e.g., [4,8,25,26,38,44]) or the index selection problem (e.g., [12,13,16,30]). Some articles (e.g., [3,9,19,35]) propose a two-step approach, where the views are selected first, followed by selecting an appropriate collection of indexes for these views. Typically an ad hoc approach is used in dividing the available space between the views in the first step and the indexes in the second step. Such an approach could result in solutions that are quite far from optimal for the combined problem, although in some instances it could result in reasonably good solutions. See [21] for further discussion on this subject. The only article in the open literature that addresses the combined OLAP-VI problem is the paper by Gupta et al. [21].
In this paper the authors propose a greedy procedure where a collection of views and indexes are constructed one step at a time. The procedure starts with an empty collection, and at each step it adds either a view (and possibly one or more of its indexes) or an index for an existing view, so as to maximize the resulting immediate benefit, while observing the space limit. The immediate benefit associated with an entity v (either a view, or an index, or a view with one or more of its indexes) with respect to a given collection C is defined as the reduction in the total cost of answering the query workload when we add the entity v to the collection C.
7 These instances are exactly the same instances that we used in Sections 3.4 and 4.2.
8 Note that since all attributes of each query are in its WHERE clause and they are compared with constants (by assumption), it follows that the corresponding cost of answering this query using an appropriate view and a proper index could be as low as 1 (i.e., only one row of the corresponding view needs to be scanned). Hence, the total cost of answering a given collection of queries could be as low as the number of queries.

Table 3
Comparing models IP2 and IPN, for the ten instances from the 13-attribute TPC-H database.

Instance  Number of   Optimal cost for model   Execution time (s) for model
          queries     IP2        IPN           IP2        IPN
11        10          10.00      10.00         2.10       0.65
12        10          10.07      10.07         0.96       0.61
13        10          10.00      10.00         0.96       0.61
14        10          10.29      10.64         0.83       0.62
15        15          15.11      15.44         287.34     0.93
16        15          15.12      15.23         520.74     0.82
17        20          -          20.63         >1 h       1.49
18        30          -          30.11         >1 h       10.17
19        40          -          40.14         >1 h       21.71
20        50          -          50.01         >1 h       291.57
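The gap between the IPN cost and the lower bound (the number of queries) for instances 17 through 20 can be checked directly from the values in Table 3. The short computation below is only a worked illustration; the variable and function names are our own.

```python
# (number of queries, cost obtained via IPN) for instances 17-20, from Table 3
ipn_results = {17: (20, 20.63), 18: (30, 30.11), 19: (40, 40.14), 20: (50, 50.01)}

def relative_gap(num_queries, cost):
    """Relative excess of the cost over the lower bound (one row per query)."""
    return (cost - num_queries) / num_queries

gaps = {inst: relative_gap(n, c) for inst, (n, c) in ipn_results.items()}
for inst, gap in gaps.items():
    print(f"instance {inst}: {100 * gap:.2f}% above the lower bound")

worst = max(gaps.values())
```

The largest gap occurs for instance 17 (about 3.15%), which is consistent with the bound of at most 4% stated in the text.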

Even though the authors of [21] show, through a comprehensive computational experiment, that their proposed algorithm, when applicable, is superior to the above-mentioned two-step approach, two main concerns remain about this algorithm. The first concern is the myopic (greedy) criterion that it uses for selecting the views and indexes. It is well known that such myopic criteria, when employed in the context of a discrete optimization problem, can result in poor solutions for some instances of the problem, even though their performance can be satisfactory in other instances. The second and perhaps more significant concern is that at each step the procedure needs to consider all views in the view set, one by one, along with some of their indexes. Hence the corresponding computational requirements can be excessively large. For a variation of the algorithm that limits itself to at most (r - 1) indexes for each view (referred to as the r-greedy algorithm in [21]), the corresponding computational requirement is reported to be of order O(m^r), where m is the total number of views. In a database with p attributes the corresponding value of m would be 2^p, which can be quite large even for moderate values of p. The authors report that the largest number of attributes in a database for which they were able to carry out the corresponding computation is p = 6. Another variation of the algorithm proposed in [21], which they refer to as the inner-level greedy algorithm, suffers from the same weakness.

In an independent experiment we developed our own software for the r-greedy algorithm and solved several instances of the OLAP-VI problem. As suggested by the authors of [21], we set r = 4. The largest database for which we were able to solve the OLAP-VI problem had only 7 attributes, and the corresponding execution time for several instances (each instance had 10 queries) was between 10 and 15 min.
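To give a feel for how quickly the O(m^r) requirement grows with m = 2^p, the following sketch evaluates m^r for r = 4 and a few values of p. It ignores constant factors and is meant only as an order-of-magnitude illustration; the function name is our own.

```python
def r_greedy_order(p, r):
    """Order of the r-greedy work, m**r, with m = 2**p views for p attributes."""
    m = 2 ** p
    return m ** r

for p in (6, 7, 13):
    print(f"p = {p:2d}: m = {2 ** p:5d}, m^r = {r_greedy_order(p, r=4):,}")
```

Moving from p = 6 to p = 7 multiplies m^r by 2^4 = 16, and for p = 13 the value is 2^52, which suggests why the greedy approach does not scale to the 13-attribute database.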
The corresponding execution time for IPN on the same collection of instances was less than one second each. The total costs obtained via IPN were also significantly smaller than those obtained via the r-greedy procedure. We refer the reader to [6] for details of our computational experiment. As reported earlier in this section, the computational requirements of our proposed approach allowed us to easily solve instances of the OLAP-VI problem with 13 attributes, and the quality of the resulting solutions seems to be quite good. In a more recent experiment we also solved several instances of the problem on a 17-attribute database and obtained similarly good results within reasonable execution time.

6. Concluding remarks

In this article we undertook a systematic study of the OLAP view- and index-selection (OLAP-VI) problem. We constructed an integer programming (IP) model for this problem and studied the properties of the views and indexes that appear in the optimal solution for this model. We then used these properties to prune the space of potentially beneficial views and indexes while keeping at least one globally optimal solution in the search space. This results in a modified IP model that is significantly smaller than the original model, yet its optimal solution is guaranteed to be optimal for the original OLAP-VI problem, which, in turn, allows us to solve moderately large (realistic-size) instances of the OLAP-VI problem. We then extended this approach to further reduce the size of the IP model by removing all views and indexes that are less likely to be effective in answering the given collection of workload queries. Of course, we can no longer guarantee that the optimal solution of the resulting IP model is optimal for the original OLAP-VI problem, and in this sense our proposed approach is a heuristic procedure. But the smaller size of the resulting IP model allows us to solve even larger instances of the OLAP-VI problem.
Through a computational experiment we demonstrated that this smaller IP model produces relatively good solutions (optimal or near-optimal) for larger instances of the OLAP-VI problem, and compares favorably with other existing approaches for solving this problem, i.e., it obtains better solutions in much less computation time.

Our current and future research on this subject is focused on extending the scalability of this approach further, so that we can use it to solve even larger instances of the OLAP-VI problem, i.e., instances with a larger number of attributes in the database and/or a larger number of queries in the workload. When the number of queries in the workload is larger than those we considered here, one such approach would be to cluster similar queries into one master query, which would then be answered by one set of views and indexes. The difficulty with this approach is, of course, to devise an appropriate metric for evaluating the effectiveness of each clustering. Devising such a metric inevitably requires a careful study of the properties of the optimal views and indexes in this context. We have developed a similar metric for the view-selection problem and demonstrated its effectiveness by studying its analytic properties and through a computational study. We refer to this metric as the cost-benefit ratio and discuss it in detail in [23]. We hope to extend this approach to the view- and index-selection problem in the near future.

Acknowledgment

We are grateful to the three anonymous reviewers for their constructive comments and suggestions that helped us greatly improve the presentation of material in this article.

References
[1] A. Abello, I.Y. Song, Data warehousing and OLAP (DOLAP '08), Data and Knowledge Engineering 69 (1) (2010) 1-2.
[2] S. Agrawal, N. Bruno, S. Chaudhuri, V. Narasayya, AutoAdmin: self-tuning database systems technology, IEEE Data Engineering Bulletin 29 (3) (2006) 7-15.
[3] S. Agrawal, S. Chaudhuri, V.R. Narasayya, Automated selection of materialized views and indexes for SQL databases, in: Proceedings of the 26th International Conference on Very Large Data Bases, 2000, pp. 496-505.
[4] Z. Asgharzadeh Talebi, R. Chirkova, Y. Fathi, Exact and inexact methods for solving the problem of view selection, International Journal of Business Intelligence and Data Mining 4 (3/4) (2009) 391-415.
[5] Z. Asgharzadeh Talebi, R. Chirkova, Y. Fathi, M. Stallmann, Exact and inexact methods for selecting views and indexes for OLAP performance improvement, in: Proceedings of the 11th International Conference on Extending Database Technology, 2008, pp. 311-322.
[6] Z. Asgharzadeh Talebi (2009) Exact and inexact methods for selecting views and indexes for OLAP performance improvement, doctoral dissertation submitted to the Graduate Program in Operations Research, North Carolina State University, Raleigh. Available at http://repository.lib.ncsu.edu/ir/bitstream/1840.16/6202/1/etd.pdf.
[7] A. Balmin, F. Ozcan, K. Beyer, R. Cochrane, H. Pirahesh, A framework for using materialized XPath views in XML query processing, in: Proceedings of the 30th International Conference on Very Large Data Bases, 2004, pp. 60-71.
[8] E. Baralis, S. Paraboschi, E. Teniente, Materialized view selection in a multidimensional database, in: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 156-165.
[9] L. Bellatreche, K. Karlapalem, M. Schneider, On efficient storage space distribution among materialized views and indices in data warehousing environments, in: Proceedings of the 9th International Conference on Information and Knowledge Management, 2000, pp. 397-404.
[10] C.M. Broughton, IBM DB2 cube views and DB2 materialized query tables in a SAS environment, http://www.sas.com/partners/directory/ibm/cubeviews.pdf 2005.
[11] N. Bruno, S. Chaudhuri, Interactive physical design tuning, in: Proceedings of the IEEE 26th International Conference on Data Engineering, 2010, pp. 1161-1164.
[12] A. Caprara, M. Fischetti, D. Maio, Exact and approximate algorithms for the index selection problem in physical database design, IEEE Transactions on Knowledge and Data Engineering 7 (6) (1995) 955-967.
[13] S. Chaudhuri, M. Datar, V. Narasayya, Index selection for databases: a hardness study and a principled heuristic solution, IEEE Transactions on Knowledge and Data Engineering 16 (11) (2004) 1313-1323.
[14] S. Chaudhuri, U. Dayal, An overview of data warehousing and OLAP technology, SIGMOD Record 26 (1) (1997) 65-74.
[15] S. Chaudhuri, U. Dayal, V. Narasayya, An overview of business intelligence technology, Communications of the ACM 54 (8) (2011) 88-98.
[16] S. Chaudhuri, V.R. Narasayya, An efficient cost-driven index selection tool for Microsoft SQL server, in: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 146-155.
[17] S. Chaudhuri, V.R. Narasayya, G. Weikum, Database tuning using combinatorial search, Encyclopedia of Database Systems (2009) 738-741.
[18] R. Eshuis, A. Kumar, An integer programming based approach for verification and diagnosis of workflows, Data and Knowledge Engineering 69 (8) (2010) 816-835.
[19] C.I. Ezeife, A uniform approach for selecting views and indexes in a data warehouse, in: Proceedings of the 1997 International Symposium on Database Engineering and Applications, 1997, pp. 151-160.
[20] M. Golfarelli, V. Maniezzo, S. Rizzi, Materialization of fragmented views in multidimensional databases, Data and Knowledge Engineering 49 (3) (2004) 325-351.
[21] H. Gupta, V. Harinarayan, A. Rajaraman, J.D. Ullman, Index selection for OLAP, in: Proceedings of the 13th International Conference on Data Engineering, 1997, pp. 208-219.
[22] V. Harinarayan, A. Rajaraman, J.D. Ullman, Implementing data cubes efficiently, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996, pp. 205-216.
[23] R. Huang, R. Chirkova, Y. Fathi, Deterministic view selection for data analysis queries: properties and algorithms, the 16th East-European Conference on Advances in Database and Information Systems, ADBIS 7503 (2012) 195-208.
[24] ILOG, CPLEX 10.0 software package, http://www.ilog.com 2004.
[25] P. Kalnis, N. Mamoulis, D. Papadias, View selection using randomized search, Data and Knowledge Engineering 42 (1) (2002) 89-111.
[26] H.J. Karloff, M. Mihail, On the complexity of the view-selection problem, in: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 1999, pp. 167-173.
[27] R. Kimball, M. Ross, The Data Warehouse Toolkit, second edition, Wiley Computer Publishing, 2002.
[28] H. Kimura, C. Coffrin, A. Rasin, S.B. Zdonic, Optimizing index deployment order for evolving OLAP, in: Proceedings of the 15th International Conference on Extending Database Technology (EDBT '12), 2012, pp. 276-287.
[29] H. Kimura, G. Huo, A. Rasin, S. Madden, S. Zdonic, CORADD: correlation aware database designer for materialized views and indexes, Proceedings of the VLDB Endowment 3 (1) (2010) 1103-1113.
[30] J. Kratica, I. Ljubic, D. Tosic, A genetic algorithm for the index selection problem, Applications of Evolutionary Computing 2611 (2003) 281-291.
[31] J. Li, Z. Asgharzadeh Talebi, R. Chirkova, Y. Fathi, A formal model for the problem of view selection for aggregate queries, in: Advances in Databases and Information Systems, 9th East European Conference, Tallinn, Estonia, 2005, pp. 125-138.
[32] S. Lightstone, Physical database design for relational databases, in: Encyclopedia of Database Systems, 2009, pp. 2108-2114.
[33] Microsoft Reference (a), Web page of the AutoAdmin project: self-tuning and self-administering databases, http://research.microsoft.com/research/dmx/autoadmin.
[34] Microsoft Reference (b), Web page of the Data Management, Exploration and Mining Group, http://research.microsoft.com/research/dmx/.
[35] Microsoft Reference (c), White paper, available at http://www.strategy.com.
[36] G. Nemhauser, L. Wolsey, Integer and Combinatorial Optimization, Wiley-Interscience, 1988.
[37] SAS/OR, SAS/OR software package, http://support.sas.com/documentation/onlinedoc/or/index.html 2012.
[38] A. Shukla, P. Deshpande, J.F. Naughton, Materialized view selection for multidimensional datasets, in: Proceedings of the 24th International Conference on Very Large Data Bases, 1998, pp. 488-499.
[39] D. Theodoratos, A. Tsois, Processing OLAP queries in hierarchically clustered databases, Data and Knowledge Engineering 45 (2) (2003) 205-224.
[40] Transaction Processing Performance Council, TPC benchmark-H standard specification revision 2.1.0, http://www.tpc.org/tpch/spec/tpch2.1.0.pdf 2002.
[41] J. Ullman, H. Garcia-Molina, J. Widom, Database Systems: The Complete Book, Prentice Hall PTR, 2001.
[42] G. Valentin, M. Zuliani, D. Zilio, G. Lohman, A. Skelley, DB2 advisor: an optimizer smart enough to recommend its own indexes, in: Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 101-110.
[43] L. Wolsey, Integer Programming, Wiley-Interscience, 1998.
[44] J. Yang, K. Karlapalem, Q. Li, Algorithms for materialized view design in data warehousing environment, in: Proceedings of the 23rd International Conference on Very Large Data Bases, 1997, pp. 136-145.
[45] D. Zilio, C. Zuzarte, S. Lightstone, W. Ma, G. Lohman, R. Cochrane, H. Pirahesh, L. Colby, J. Gryz, E. Alton, D. Liang, G. Valentin, Recommending materialized views and indexes with IBM DB2 design advisor, in: Proceedings of the 1st International Conference on Autonomic Computing, 2004, pp. 180-188.
