
Globally Recorded binary encoded Domain Compression algorithm in Column Oriented Databases

A Dissertation Submitted in Partial Fulfillment for the Award of the Degree of Master of Technology in the Department of Information Technology (with specialization in Information Communication)

Supervisor: Mr. Santosh Kumar Singh, Associate Prof.

Submitted By: Mehul Mahrishi, Enrolment No.: SGVU091543463

Suresh Gyan Vihar University Mahal, Jagatpura, Jaipur July - 2011

Candidate's Declaration

I hereby declare that the work presented in this dissertation, entitled Globally Recorded binary encoded Domain Compression algorithm in Column Oriented Databases, in partial fulfillment for the award of the Degree of Master of Technology in the Department of Information Technology with specialization in Information Communication, and submitted to the Department of Information Technology, Suresh Gyan Vihar University, is a record of my own investigations carried out under the guidance of Mr. S.K. Singh, Department of Information Technology. I have not submitted the matter presented in this dissertation anywhere else for the award of any other degree.

Mehul Mahrishi
Information Communication
Enrolment No.: SGVU091543463
(Name and Signature of Candidate)

Counter Signed by:
Mr. Santosh Kumar Singh
Supervisor (M. Tech IC)

DETAILS OF CANDIDATE, SUPERVISOR (S) AND EXAMINER

Name of Candidate: Mehul Mahrishi
Roll No.: 104511
Department of Study: M. Tech. (Information Communication)
Enrolment No.: SGVU091543463
Thesis Title: Globally Recorded binary encoded Domain Compression algorithm in Column Oriented Databases

Supervisor(s) and Examiners Recommended (with office address including contact numbers, email ID):
Supervisor
Co-Supervisor

Internal Examiner
1.
2.
3.

Signature with Date

Programme Coordinator

Dean / Principal

Certificate
This certifies that the thesis entitled

Globally Recorded binary encoded Domain Compression algorithm in Column Oriented Databases

is submitted by
Mehul Mahrishi
SGVU091543463

IV Semester, M.Tech (IC), in the year 2011, in partial fulfillment of the Degree of Master of Technology in Information Communication

SURESH GYAN VIHAR UNIVERSITY, JAIPUR.

Signature of Supervisor Date: Place:

Acknowledgement

Foremost, I would like to express my sincere gratitude to my advisor and mentor Mr. S.K. Singh for the continuous support of my study and research, and for his patience, motivation, enthusiasm, and knowledge. His guidance helped me throughout the research and the writing of this thesis. Besides my advisor, I would like to thank the rest of my thesis committee, especially Mr. Vibhakar Pathak, for their encouragement, insightful comments, and hard questions. My sincere thanks also go to Dr. S.L. Surana (Principal, SKIT), Dr. C.M. Choudhary (HOD CS, SKIT) and Dr. Anil Chaudhary (HOD IT, SKIT) for supporting my advanced studies, providing opportunities in their groups, and letting me work on diverse, exciting projects. My special thanks to Mr. Mukesh Gupta (Reader, SKIT) for his invaluable advice, which helped me take this decision. I thank my fellow mates Anita Shrotriya, Devendra Kr. Sharma, Vipin Jain, the Singh brothers, and Kamal Hiran for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had in the last two years. Last but not the least, I would like to thank my family members: my parents (Mukesh & Madhulika Mahrishi), uncle and aunt (Pushpanshu & Seema Mahrishi), brothers (Mridul & Harshit) and my grandmothers for their faith and for supporting me throughout my life.

(Mehul Mahrishi)

Contents

List of Tables ... iv
List of Figures ... v
Notations ... vi
Abstract ... vii

CHAPTER 1  Introduction (1-4)
1.1 Introduction ... 1
1.2 Objective ... 1
1.3 Motivation ... 2
1.4 Research Contribution ... 3
1.5 Dissertation Outline ... 3

CHAPTER 2  Theories (5-23)
2.1 Introduction ... 5
    2.1.1 On-Line Transaction Processing ... 6
    2.1.2 Query Intensive Applications ... 7
2.2 The Rise of Columnar Database ... 8
2.3 Definitions ... 10
2.4 Row Oriented Execution ... 12
    2.4.1 Vertical Partitioning ... 12
    2.4.2 Index-Only Plans ... 12
    2.4.3 Materialized Views ... 13
2.5 Column Oriented Database ... 13
    2.5.1 Compression ... 13
    2.5.2 Late Materialization ... 14
    2.5.3 Block Iteration ... 14
    2.5.4 Invisible Joins ... 14
2.6 Query Execution in Row vs. Column Oriented Database ... 15
2.7 Compression ... 17
2.8 Conventional Compression ... 18
    2.8.1 Domain Compression ... 19
    2.8.2 Attribute Compression ... 20
2.9 Layout of Compressed Tuples ... 21

CHAPTER 3  Methodology (24-31)
3.1 Introduction ... 24
3.2 Reasons for Data Compression ... 25
3.3 Compression Scheme ... 28
3.4 Query Execution ... 30
3.5 Decompression ... 30
3.6 Prerequisites ... 30

CHAPTER 4  Results & Discussions (32-44)
4.1 Introduction ... 32
4.2 Anonymization ... 33
    4.2.1 Problem Definition & Contribution ... 34
    4.2.2 Quality Measure of Anonymization ... 36
    4.2.3 Conclusion ... 36
4.3 Domain Compression through Binary Conversion ... 36
    4.3.1 Encoding of Distinct Values ... 36
    4.3.2 Paired Encoding ... 38
4.4 Add-ons on Compression ... 40
    4.4.1 Functional Dependencies ... 40
    4.4.2 Primary Keys ... 42
    4.4.3 Few Distinct Values ... 42
4.5 Limitations ... 43
4.6 Conclusion ... 43

CHAPTER 5  Conclusion & Future Work (45-47)
5.1 Conclusion ... 45
5.2 Future Work ... 46

APPENDIX I  Infobright (48-62)

References & Bibliography (63-67)

List of Tables

Table 2.1   A typical Row-oriented Database ... 6
Table 2.2   Table representing column storing of data ... 10
Table 3.1   Employee table with type and cardinality ... 28
Table 3.2   Code Table Example ... 29
Table 3.3   Query execution ... 30
Table 4.1   Published Table ... 34
Table 4.2   View of published table by Global recording ... 35
Table 4.3   An instance of relation Student ... 37
Table 4.4   Representing Stage 1 of compression technique ... 38
Table 4.5   Representing Stage 1 with binary compression ... 38
Table 4.6   Representing Stage 2 compression ... 39
Table 4.7   Representing Stage 2 compression coupling ... 40
Table 4.8   Representing functional dependency based coupling ... 41
Table 4.9   Number of distinct values in each column ... 41
Table 4.10  Representing test case 1 ... 42
Table 4.11  Representing test case 2 ... 42

List of Figures & Graphs

Figure 2.1   OLTP Access ... 6
Figure 2.2   OLAP Access ... 7
Figure 2.3   Column based data storage ... 11
Figure 2.4   Layout of Compressed Tuple ... 23
Graph I.1    Representing Load time comparison ... 61
Graph I.2    Representing Table size comparison ... 61
Graph I.3    Representing query execution comparison ... 61

Notations

DBMS  : Database Management System
RDBMS : Relational Database Management System
OLTP  : Online Transactional Processing
SQL   : Structured Query Language
ICE   : Infobright Community Edition
IEE   : Infobright Enterprise Edition
TB    : Terabytes

Abstract

Data warehouses contain a lot of data, and hence any leak or illegal publication of information puts individuals' privacy at risk. This research work proposes the compression and abstraction of data using existing compression algorithms. Although the technique is general and simple, I strongly believe that it is particularly advantageous for data warehousing. Through this study, we propose two algorithms. The first algorithm describes the concept of compression of domains at the attribute level, which we call Attribute Domain Compression. This algorithm can be implemented on both row-oriented and columnar databases; the idea behind it is to reduce the size of large databases so as to store them optimally. The second algorithm is also applicable to both kinds of databases but works best for columnar databases. The idea behind it is to generalize the tuple domains with respect to a value, say n, such that each tuple becomes indistinguishable from at least n-1 other tuples (or from as many as possible).


Chapter 1

Introduction

1.1 Introduction
Large volumes of operational data and information are stored by different vendors and organizations in warehouses. Most of this data is useful only when it is shared and analyzed together with other related data. However, this kind of data often contains personal details which must be hidden from users with limited privileges. The data can only be released when individuals are unidentifiable.

Moreover, business intelligence and analytical application queries are generally based on the selection of particular attributes of a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation.

1.2 Objective
The main aim of this research is to propose a compression algorithm based on the concept of attribute domain compression. The data is recorded globally so that the concept of data abstraction is preserved.


We will use the concepts of two existing algorithms. The first algorithm describes the concept of compression of domains at the attribute level, which we call Attribute Domain Compression. This algorithm can be implemented on both row-oriented and columnar databases; the idea behind it is to reduce the size of large databases so as to store them optimally. The second algorithm is also applicable to both kinds of databases but works best for columnar databases. The idea behind it is to generalize the tuple domains with respect to a value, say n, such that each tuple becomes indistinguishable from at least n-1 other tuples (or from as many as possible).

1.3 Motivation
Data compression has been a very popular topic in the research literature, and there is a large amount of work on the subject. The most obvious reason to consider compression in a database context is to reduce the space required on disk. However, the motivation behind this research is whether the processing time of queries can be improved by reducing the amount of data that needs to be read from disk using a compression technique. Recently, there has been a revival of interest in employing compression techniques to improve performance in databases, which also led me to choose this as my topic of study. Data compression already exists in the major database engines, with each of them adopting a different approach.


1.4 Research Contribution


In order to evaluate the performance speedup obtained with compression, a subset of the queries was executed with the following configurations:

1. No compression
2. Proposed compression
3. Categories compression and descriptions compression

We then study the two major compression approaches applicable to row-oriented databases, i.e. n-anonymization and domain encoding by binary compression.

Finally, the report studies these two algorithms and combines them to form a final, optimized algorithm for domain compression. The report also presents examples that were performed practically on a column-oriented platform named Infobright.

1.5 Dissertation Outline


This research work focuses on the development of a compression algorithm for columnar databases over the tool Infobright. We start in Chapter 2 by documenting the theories that are relevant for understanding columnar databases and how compression is implemented in databases by various existing techniques. In Chapter 3, we study a compression technique and implement it through query execution over a MySQL database; this work concludes Dissertation Part I. Chapter 4 discusses the framework that facilitates the development of the algorithm for columnar databases and introduces two concepts: global recording anonymization and binary encoded domain compression. We conclude this chapter by developing a compression algorithm that combines these two concepts. After successful implementation of the compression algorithm, it is tested and the output is displayed graphically. Finally, Chapter 5 concludes the work and outlines future directions, while Appendix I illustrates familiarity with the tool Infobright: some basic queries and their execution are learned on an existing columnar database. Infobright is not just a database but contains an in-built platform for compression algorithms that can be implemented on a DB.


Chapter 2

Theories

2.1. Introduction
Most information systems available today are implemented using commercially available database management system (DBMS) products. A DBMS is software which manages the data stored in an information system, provides privacy and privileges to users, facilitates concurrent access by multiple users, and provides recovery from system failures without loss of system integrity. The relational database is the most commonly used kind of DBMS; it organizes the data into different relations. Each relational database is a collection of inter-related data organized in a matrix of rows and columns. Each column represents an attribute of the particular entity that is converted into the database table, while each row of the matrix, generally called a tuple, represents one set of values for those attributes. Each row in a table represents a set of related data, and every row in the table has the same structure. For example, in a table that represents employees, each row would represent a single employee. Columns might represent things like the employee name, employee street address, SSN, etc. In a table that represents the relationship of employees with departments, each row would relate one employee to one department.

Table 2.1  A Typical Row-oriented Database

         Column 1          Column 2          Column 3
Row 1    Row1 & Column 1   Row1 & Column 2   Row1 & Column 3
Row 2    Row2 & Column 1   Row2 & Column 2   Row2 & Column 3

2.1.1 On-Line Transactional Processing


The popularity of RDBMSs is mainly due to their support for on-line transactional processing (OLTP). Typical OLTP systems include a student management system, a bank database, etc. A typical query is: insert a new record for a new subject that is assigned to a student. These applications involve little or no analysis of data and serve the use of an information system for data preservation and querying. An OLTP query runs for a short duration and requires minimal database resources. [3]

Figure 2.1 represents an OLTP process in which two queries, insert and lookup, are executed on a student table.


Figure 2.1 OLTP Access

2.1.2 Query Intensive Applications


In the mid-1990s a new era of data management arose which was query-specific and involved large, complex data volumes. Examples of such query-specific DBMS applications are OLAP and data mining.

OLAP

An OLAP tool summarizes data from large data volumes and presents the query results using 2-D or 3-D graphics to visualize the answer. An OLAP query looks like: "Give the percentage comparison between the marks of all students in B.Tech and in M.Tech." The answer to such a query is generally in the form of a graph or chart. Such 2-D and 3-D visualizations of data are called data cubes.

Figure 2.2 represents the access pattern of OLAP, which requires only a few attributes to be processed but accesses a huge volume of data. It must be noted that the number of queries executed per second in OLAP is much lower than in OLTP.


Figure 2.2 OLAP Access

Data Mining

Data mining is now an even more demanding application of databases. It is also known as repeated OLAP. The objective of data mining is to locate sub-groups, which requires computing mean values or performing statistical analysis of the data to obtain results. A typical data mining query is: "Find the dangerous drivers from a car insurance customer database." It is left to the data mining tool to determine what the characteristics of that dangerous customer group are [3]. This is typically done by combining statistical analysis and automated search techniques similar to those of artificial intelligence.

2.2. The rise of Columnar Database


The roots of column-store DBMSs can be traced back to the 1970s, when transposed files were first studied, followed by investigations on vertical partitioning as a form of table attribute clustering. By the mid-1980s, the advantages of a fully decomposed storage model (DSM, a predecessor of column stores) over NSM (traditional row-based storage) were documented. [4]


The relational databases present today are designed predominantly to handle online transactional processing (OLTP) applications. A transaction (e.g. an online purchase of a laptop through an internet dealer) typically maps to one or more rows in a relational database, and all traditional RDBMS designs are based on a per-row paradigm. For transaction-based systems, this architecture is well suited to handle the input of incoming data.

Data warehouses are used in almost every large organization, and research states that their size doubles every third year. Moreover, the hourly workload of these warehouses is huge: approximately 20 lakh (2 million) SQL statements are encountered every hour. [7]

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. However, for applications that are very read-intensive and selective in the information being requested, the OLTP database design isn't a model that typically holds up well. [6] Business intelligence and analytical application queries often analyze selected attributes of a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation. A column-oriented database, generally known as a columnar database, reinvents how data is stored in databases. Storing data in such a fashion increases the probability of storing adjacent records on disk and hence the odds of compression. This architecture suggests a different model, in which inserting and deleting transactional data are done by a row-based system, but selective queries that are only interested in a few columns of a table are handled by the columnar approach.


Different methodologies such as indexing, materialized views, horizontal partitioning, etc. are provided by row-oriented databases and offer better ways of executing queries, but they also have disadvantages of their own. For example, in business intelligence/analytic environments, the ad-hoc nature of such scenarios makes it nearly impossible to predict which columns will need indexing, so tables end up either being over-indexed (which causes load and maintenance issues) or not properly indexed, and so many queries end up running much slower than desired.

2.3. Definitions
A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. Wiki [23]

It must always be remembered that a columnar database is only an approach to how data is stored; it does not define any new architectural implementation of the database, and rather follows the traditional database architecture.

Table 2.2  Table representing column storing of data

SNO   SNAME      SSN   CITY
S1    MEHUL      200   JAIPUR
S2    VIPIN      201   HINDON
S3    DEVENDRA   300   KEKRI
S4    ANITA      302   BHILWARA

The data would be stored on disk or in memory something like this:

S1 S2 S3 S4 | MEHUL VIPIN DEVENDRA ANITA | 200 201 300 302 | JAIPUR HINDON KEKRI BHILWARA

This is in contrast to a traditional row-based approach, in which the data looks more like this:

S1 MEHUL 200 JAIPUR | S2 VIPIN 201 HINDON | S3 DEVENDRA 300 KEKRI | S4 ANITA 302 BHILWARA

The above example also shows that a columnar database can be highly compressed; moreover, it is self-indexed, and hence aggregate functions such as MIN, MAX, AVG, and COUNT can be performed efficiently.

Figure 2.3  Column based data storage


As is clear, the goal of a columnar database is to perform the write and read operations to and from hard disk storage efficiently in order to speed up the time it takes to return a query. In the above example, all the column 1 values are physically together, followed by all the column 2 values, and so on. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record [1]. This allows individual data elements, such as customer name for instance, to be accessed in columns as a group, rather than individually row-by-row.
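To make the two layouts concrete, the short Python sketch below (illustrative only, not tied to any particular engine; the variable names are ours) serializes the sample records of Table 2.2 both row-wise and column-wise:

# Illustrative sketch (not tied to any specific DBMS): the sample records of
# Table 2.2 serialized row-wise and column-wise.
records = [
    ("S1", "MEHUL", 200, "JAIPUR"),
    ("S2", "VIPIN", 201, "HINDON"),
    ("S3", "DEVENDRA", 300, "KEKRI"),
    ("S4", "ANITA", 302, "BHILWARA"),
]
# Row-oriented storage: all fields of a record are kept adjacent.
row_layout = "".join(str(field) for record in records for field in record)
# Column-oriented storage: all values of one attribute are kept adjacent,
# in record order, so the i-th entry of every column belongs to record i.
columns = list(zip(*records))
column_layout = "".join(str(value) for column in columns for value in column)
print("row-oriented   :", row_layout)
print("column-oriented:", column_layout)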

2.4. Row Oriented Execution


In this section, we discuss several different techniques that can be used to implement a column-database design in a commercial row-oriented DBMS.

2.4.1 Vertical Partitioning


The most straightforward way to emulate a column-store approach in a row-store is to fully vertically partition each relation. This approach creates one physical table for each column in the logical schema, where the ith table has two columns, one with values from column i of the logical schema and one with the corresponding value in the position column. Queries are then rewritten to perform joins on the position attribute when fetching multiple columns from the same relation.
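A minimal sketch of this emulation follows, using Python's built-in sqlite3 module purely for illustration (the table and column names are hypothetical): each logical column becomes a two-column physical table keyed by a position attribute, and a multi-column fetch becomes a join on that position.

import sqlite3
# Sketch of vertical partitioning (hypothetical table and column names):
# one physical table per logical column, keyed by a position attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_name (pos INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE emp_city (pos INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO emp_name VALUES (?, ?)",
                 [(1, "MEHUL"), (2, "VIPIN"), (3, "DEVENDRA")])
conn.executemany("INSERT INTO emp_city VALUES (?, ?)",
                 [(1, "JAIPUR"), (2, "HINDON"), (3, "KEKRI")])
# A query that touches two logical columns is rewritten as a join on position.
rows = conn.execute(
    "SELECT n.val, c.val FROM emp_name n JOIN emp_city c ON n.pos = c.pos "
    "WHERE c.val = 'JAIPUR'").fetchall()
print(rows)   # [('MEHUL', 'JAIPUR')]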

2.4.2 Index-only plans


The vertical partitioning approach has two problems. First, it requires the position attribute to be stored in every column, which wastes space and disk bandwidth; second, most row-stores store a relatively large header on every tuple, which wastes further space. [7] To remove these problems we use another approach, called index-only plans. In this approach the base relations are stored using a standard, row-oriented design, but an additional unclustered B+Tree index is added on every column of every table.

2.4.3 Materialized Views


The third approach we consider uses materialized views. In this approach, we create an optimal set of materialized views for every query flight in the workload, where the optimal view for a given flight has only the columns needed to answer the queries in that flight. We do not pre-join columns from different tables in these views.

2.5 Column Oriented Execution


In this section, we review four common optimizations used to improve performance in column-oriented database systems.

2.5.1 Compression
Compressing data using column-oriented compression algorithms and keeping data in this compressed format as it is operated upon has been shown to improve query performance by up to an order of magnitude. Storing data in columns allows all of the names to be stored together, all of the phone numbers together, etc. Certainly phone numbers are more similar to each other than surrounding text fields like e-mail addresses or names. Further, if the data is sorted by one of the columns, that column will be super-compressible.


2.5.2 Late Materialization


In a column-store, information about a logical entity (e.g., a person) is stored in multiple locations on disk (e.g. name, e-mail address, phone number, etc. are all stored in separate columns), whereas in a row store such information is usually colocated in a single row of a table. [7] At some point in most query plans, data from multiple columns must be combined together into rows of information about an entity. Consequently, this join-like materialization of tuples (also called tuple construction) is an extremely common operation in a column store.

2.5.3 Block Iteration


In order to process a series of tuples, row-stores first iterate through each tuple, and then need to extract the needed attributes from these tuples through a tuple representation interface. In contrast to row-stores, in all column-stores, blocks of values from the same column are sent to an operator in a single function call. Further, no attribute extraction is needed, and if the column is fixed-width, these values can be iterated through directly as an array. Operating on data as an array not only minimizes per-tuple overhead, but it also exploits potential for parallelism on modern CPUs, as loop-pipelining techniques can be used. [2-5]

2.5.4 Invisible joins


Queries over data warehouses often have the following structure:


Restrict the set of tuples in the fact table using selection predicates on one (or many) dimension tables.

Then, perform some aggregation on the restricted fact table, often grouping by other dimension table attributes.

Thus, joins between the fact table and dimension tables need to be performed for each selection predicate and for each aggregate grouping.
As an alternative to these query plans, a technique called the invisible join can be used in column-oriented databases for foreign-key/primary-key joins. It works by rewriting joins into predicates on the foreign-key columns in the fact table. These predicates can be evaluated either by using a hash lookup (in which case a hash join is simulated), or by using more advanced methods which are beyond the scope of our study. [1]

2.6. Query execution in Row vs. Column oriented database


When talking about the performance of databases, query execution is the most important factor, one which by itself can determine the performance of the database, whether it is row-based or column-based. We illustrate the concept with a simple example:

Suppose there are 1000 rows in a database table and the following query is executed over it.

Until no more rows {
    Get a row out of the buffer manager
    Evaluate the row
    Pass it onward if it satisfies the predicate
}

Notice that the inner loop of the executor is called 1000 times for the query above, once per row. Since the overhead of the inner loop largely determines performance, a row-store executor will take CPU time proportional to the number of rows required to evaluate the query.

In contrast, in a column store executor the inner loop is:

Until no more columns {
    Pick up a column
    Evaluate the column
    Pass on a row range
}

Notice that the inner loop is called once per column, not once per row. Also, notice that the algorithm complexity of processing a row is about the same as processing a column. [17]

Hence, the column store will consume vastly less CPU resources, because its inner loop is executed once per column, and there are far fewer columns than rows involved in evaluating a typical query.
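The contrast between the two inner loops can be sketched in a few lines of Python; this is a deliberate simplification that ignores buffering, operator interfaces and compression, and the sample data and predicate are made up for illustration.

# Simplified sketch of the two executor inner loops (made-up data and predicate).
rows = [(i, "name%d" % i, i % 5) for i in range(1000)]     # 1000 stored tuples
columns = {"id": [r[0] for r in rows],                     # the same data, per column
           "name": [r[1] for r in rows],
           "grade": [r[2] for r in rows]}
# Row-store style: the inner loop runs once per row (1000 iterations here).
result_rows = []
for row in rows:
    if row[2] == 3:                       # evaluate the predicate on each tuple
        result_rows.append(row[0])
# Column-store style: the loop runs once per column actually touched, and the
# predicate is evaluated over the whole column as an array.
grade = columns["grade"]
positions = [pos for pos, g in enumerate(grade) if g == 3]
result_cols = [columns["id"][pos] for pos in positions]
assert result_rows == result_cols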


2.7. Compression
Data compression in databases has always been a popular and interesting topic for database researchers, and there is a lot of work in this context. The most obvious reason for compression in any context is to reduce the space required on disk, and the same holds for databases. However, another important goal is to improve the processing time of queries by reducing the amount of data that needs to be read from disk.

Long after the evolution of databases, there is now a revival in the field of compression to improve the quality and performance of databases. Data compression already exists in the major database engines, with each adopting a different approach. It is generally accepted that, due to the greater similarity and redundancy of data within columns, column stores provide superior compression, and therefore require less storage hardware and perform faster because, among other things, they read less data from the disk [17]. Moreover, the compression ratio is higher in a columnar database because the entries within a column are similar to each other. Both Huffman encoding and arithmetic encoding are based on the statistical distribution of the frequencies of symbols appearing in the data. Huffman coding assigns a shorter compression code to a frequent symbol and a longer compression code to an infrequent symbol. For example, if there are four symbols a, b, c, and d, with probabilities 13/16, 1/16, 1/16, and 1/16, then 2 bits are needed to represent each symbol without compression.


A possible Huffman coding is the following:

a = 0, b = 10, c = 110, d = 111.

As a result, the average length of a compressed symbol equals

1 x 13/16 + 2 x 1/16 + 3 x 1/16 + 3 x 1/16 = 1.3125, i.e. about 1.3 bits.

Arithmetic encoding is similar to Huffman encoding except that it assigns an interval to the whole input string based on the statistical distribution. [7]
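The average code length in the example can be reproduced directly; the sketch below simply takes the code table given above and computes the probability-weighted sum of code lengths.

# Sketch reproducing the Huffman example: four symbols with probabilities
# 13/16, 1/16, 1/16, 1/16 and the prefix code a=0, b=10, c=110, d=111.
probabilities = {"a": 13 / 16, "b": 1 / 16, "c": 1 / 16, "d": 1 / 16}
codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
average_length = sum(probabilities[s] * len(codes[s]) for s in codes)
print(average_length)   # 1.3125 bits per symbol, versus 2 bits uncompressed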

2.8. Conventional Compression


Database compression techniques are applied to gain performance by decreasing the size of a database and increasing its input/output and query performance. The basic concept behind compression is that it limits the storage and keeps the data adjacent, and therefore it reduces the size and number of transfers. This section demonstrates two different classes of compression in databases:

a. Domain Compression
b. Attribute Compression

The classes are equally implementable in a column- or row-based database approach. Queries that are executed on compressed data have been observed to be more efficient than queries that are executed over a decompressed database [8]. In the sections below, we discuss each of the above classes in detail.


2.8.1 Domain Compression


Under domain compression we discuss three techniques: numeric compression in the presence of NULL values, string compression, and dictionary-based compression. Since all three techniques are applicable to domain compression, they all operate on the domain of the attributes.

Numeric Compression in the Presence of NULL Values

This compression technique is used to compress attributes of numeric type, such as integers, that contain some NULL values in their domain. The basic idea is that consecutive zeros or blanks of a tuple in the table are removed, and a description of how many there were and where they existed is given at the end [13]. To eliminate the difference in attribute size caused by null values, it is sometimes recommended to encode the data bit-wise, i.e. an integer of 4 bytes is replaced by 4 bits. For example:

Bit value for 1 = 0001
Bit value for 2 = 0011
Bit value for 3 = 0111
Bit value for 4 = 1111
and all 0s (0000) for the value 0.

String Compression

Strings in a database are represented by the char data type, and their compression has already been proposed and implemented in SQL by providing the varchar data type. This technique provides an extension of conventional string compression: after converting the char type to varchar, the values are further compressed in a second stage by any given compression algorithm, such as Huffman coding, the LZW algorithm, etc. [24]

Dictionary Encoding

This encoding technique uses a special data structure called a dictionary. It is very effective in circumstances where the column takes a limited set of values that repeat many times [14]. The dictionary encoding algorithm first calculates the number of bits, X, needed to encode a single attribute value of the column (which can be calculated directly from the number of unique values of the attribute). It then calculates how many of these X-bit encoded values can fit in 1, 2, 3, or 4 bytes. For example, if an attribute has 32 values, it can be encoded in 5 bits, so 1 of these values can fit in 1 byte, 3 in 2 bytes, 4 in 3 bytes, or 6 in 4 bytes.
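A minimal dictionary-encoding sketch follows (illustrative only, not the exact scheme of any particular engine): it builds a dictionary from the distinct values of a column, derives the bit width X from the number of distinct values, and stores codes instead of values.

import math
# Minimal dictionary-encoding sketch for one column (illustrative only).
column = ["JAIPUR", "DELHI", "JAIPUR", "MUMBAI", "JAIPUR", "DELHI"]
distinct = sorted(set(column))
dictionary = {value: code for code, value in enumerate(distinct)}
bits_per_code = max(1, math.ceil(math.log2(len(distinct))))   # X bits per code
encoded = [dictionary[value] for value in column]
print(dictionary)      # {'DELHI': 0, 'JAIPUR': 1, 'MUMBAI': 2}
print(bits_per_code)   # 2 bits are enough for 3 distinct values
print(encoded)         # [1, 0, 1, 2, 1, 0]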

2.8.2 Attribute Compression


Most compression techniques are designed especially for data warehouses, where a huge amount of data is stored, usually composed of a large number of textual attributes with low cardinality. In this section, however, we demonstrate techniques which can also be used in conventional databases such as MySQL, SQL Server, etc. [5] The main objective of this class of techniques is to allow encoding that reduces the space occupied by dimension tables with a large number of rows, reducing the total space occupied and leading to consequent gains in performance. Under attribute compression we discuss two techniques: compression of categories and compression of comments.


Compression of Categories

Categories are textual attributes with low cardinality. Examples of category attributes are: city, country, type of product, etc. Category coding is done through the following steps:

1. The data in the attribute is analysed and a frequency histogram is built.
2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed.
3. The codes table and the necessary metadata are written to the database.
4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

2.9 Layout of Compressed Tuples


Figure 2.4 shows the overall layout of a compressed tuple [7]. The figure shows that a tuple can be composed of up to five parts:

1. The first part of a tuple keeps the (compressed) values of all fields that are compressed using dictionary-based compression or any other fixed-length compression technique. [5-7]
2. The second part keeps the encoded length information of all fields compressed using a variable-length compression technique, such as the numerical compression techniques described above.
3. The third part contains the values of (uncompressed) fields of fixed length, e.g. integers, doubles, CHARs, but not VARCHARs or CHARs that were turned into VARCHARs as a result of compression.
4. The fourth part contains the compressed values of fields that were compressed using a variable-length compression technique, for example compressed integers, doubles, or dates. The fourth part would also contain the compressed value of the size of a VARCHAR field if this value was chosen to be compressed. (If the size information of a VARCHAR field is not compressed, then it is stored in the third part of a tuple as a fixed-length, uncompressed integer value.)
5. The fifth part of a tuple, finally, contains the string values (compressed or not compressed) of VARCHAR fields.

While all this sounds quite complicated, the separation in five different parts is very natural. First of all, it makes sense to separate fixed-sized and variable-sized parts of tuples, and this separation is standard in most database systems today. The first three parts of a tuple are fixed-sized which means that they have the same size for every tuple of a table. As a result, compression information and/or the value of a field can directly be retrieved from these parts without further address calculations [24]. In particular, uncompressed integer, double, date . . . fields can directly be accessed regardless of whether other fields are compressed or not [5]. Furthermore, it makes sense to pack all the length codes of compressed fields together because we will exploit this bundling in our fast decoding algorithm, as we will see soon.


Figure 2.4  Layout of Compressed Tuple

Finally, we separate small variable-length (compressed) fields from potentially large variable-length string fields because the length information of small fields can be encoded into less than a byte whereas the length information of large fields is encoded in a two step process. Obviously, not every tuple of the database consists of these five parts [5]. For example, tuples that have no compressed fields consist only of the third and, maybe, the fifth part. Furthermore keep in mind that all tuples of the same table have the same layout and consist of the same number of parts because all the tuples of a table are compressed using the same techniques.
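One way to picture the five-part layout is as a simple record structure. The sketch below only mirrors the description above; the field names are ours, and it is not the storage format of any specific system.

from dataclasses import dataclass, field
from typing import List
# Hypothetical in-memory mirror of the five-part layout described above;
# field names are illustrative, not a real storage format.
@dataclass
class CompressedTuple:
    dict_codes: bytes = b""            # part 1: fixed-length dictionary codes
    length_codes: bytes = b""          # part 2: length codes of variable-length compressed fields
    fixed_uncompressed: bytes = b""    # part 3: uncompressed fixed-length fields (int, double, CHAR)
    var_compressed: bytes = b""        # part 4: variable-length compressed fields (ints, dates, ...)
    varchar_values: List[bytes] = field(default_factory=list)  # part 5: VARCHAR strings
# Parts 1-3 have the same size for every tuple of a table, so a field offset can
# be computed once; parts 4 and 5 vary from tuple to tuple.
t = CompressedTuple(dict_codes=b"\x03", fixed_uncompressed=(302).to_bytes(4, "little"))
print(t)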


Chapter 3

Methodology

3.1 Introduction
Having discussed the compression techniques in Chapter 2, we now apply these techniques: queries are executed on a platform in which query rewriting and data decompression are done only when necessary. In fact, the overhead at query execution time is very small, and execution produces much better results when compared with uncompressed queries on the same platform. This chapter demonstrates the different compression methods that are applied to the tables and then compares the results graphically as well as in tabular form. It must be noted that only queries with a WHERE clause need to be rewritten, because pure selection and projection operations do not require searching for a particular tuple of a particular attribute. Despite the fact that data storage capacity has increased, a similar increase in disk access speed has not happened. On the other hand, the speed of RAM and CPUs has improved. This technological trend led to the use of data compression, trading some execution overhead (to compress and decompress data) for the reduction of the space occupied by data. Compression techniques work both statically and dynamically, i.e. data is compressed when it is read from the disk or compressed when it is processed in the form of queries. In databases, and particularly in warehouses, the reduction in the size of the data obtained by compression normally gains speed, as the extra cost in execution time (to compress and decompress the data) is compensated by the reduction in the size of the data that has to be read from or stored on the disks. [1]

3.2 Reasons for Data Compression


Data compression in data warehouses is particularly interesting for two main reasons:

1) The quantity of data in a warehouse is huge, and hence compression is more suitable for warehouses than for normal databases.
2) Data warehouses are used for querying only (i.e. only read accesses, as data warehouse updates are done offline), which means that the compression overhead is not relevant.

Furthermore, if data is compressed using techniques that allow searching over the compressed data, then the gains in performance can be quite significant, as the decompression operation is only done when strictly necessary. In spite of the potential advantages of compression in databases, most commercial relational database management systems (DBMS) either do not have compression or only provide data compression at the physical layer (i.e. database blocks), which is not flexible enough to become a real advantage. Flexibility in database compression is essential, as the data that could be advantageously compressed is frequently mixed in the same table with data whose compression is not particularly helpful. Nonetheless, recent work on attribute-level compression methods has shown that compression can improve the performance of database systems in read-intensive environments such as data warehouses. [18] Data compression and data coding techniques transform a given set of data into a new set of data containing the same information but occupying less space than the original data (ideally, the minimum space possible). Data compression is heavily used in data transmission and data storage. In fact, reducing the amount of data to be transmitted (or stored) is equivalent to increasing the bandwidth of the transmission channel (or the size of the storage device). The first data compression proposals appeared in the 1940s and 1950s, most notably the coding scheme proposed by D. Huffman, but these earlier proposals have evolved dramatically since then [7]. The main emphasis of previous work has been on the compression of numerical attributes, where coding techniques have been employed to reduce the length of integers, floating point numbers, and dates. However, string attributes (i.e. attributes of type CHAR(n) or VARCHAR(n) in SQL) often comprise a large portion of database records and thus have a significant impact on query performance. The compression of data in databases offers two main advantages:

1. less space occupied by the data, and
2. potentially better query response time.

If the benefit in terms of storage is easily understandable, the gain in performance is not so obvious. This gain is due to the fact that less data has to be read from storage, which is clearly the most time-consuming operation during query processing. The most interesting use of data compression and codification techniques in databases is surely in data warehouses, given the huge amount of data normally involved and their clear orientation towards query processing. As in data warehouses all the insertions and updates are done during the update window, when the data warehouse is not available to users, off-line compression algorithms are more adequate, as the gain in query response time usually compensates for the extra cost of codifying the data before it is loaded into the data warehouse. In fact, off-line compression algorithms optimize the decompression time, which normally implies more cost in the compression process. The technique presented in this report follows these ideas, as it takes advantage of the specific features of data warehouses to optimize the use of traditional text compression techniques.

In addition to the observations regarding when to use each of the various compression schemes, our results also illustrate the following important points:

- Physical database design should be aware of the compression subsystem. Performance is improved by compression schemes that take advantage of data locality. Queries on columns in projections with secondary and tertiary sort orders perform well, and it is generally beneficial to have low-cardinality columns serve as the leftmost sort orders in the projection (to increase the average run-lengths of columns to the right). The more order and locality in a column, the better.
- It is a good idea to operate directly on compressed data. The optimizer needs to be aware of the performance implications of operating directly on compressed data in its cost models. Further, cost models that only take into account I/O costs will likely perform poorly in the context of column-oriented systems, since CPU cost is often the dominant factor.

3.3 Compression Scheme


Compression is done through the following steps:


1. The attributes are analyzed and a frequency histogram is built.
2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed. [5]
3. The codes table and the necessary metadata are written to the database.
4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

The example of an employee table below illustrates the compression technique:

Table 3.1  Employee table with type and cardinality

Attribute name   Attribute Type   Cardinality
SSN              TEXT             1000000
EMP_NAME         VARCHAR(20)      500
EMP_ADD          TEXT             200
EMP_SEX          CHAR             2
EMP_SAL          INTEGER          5000
EMP_DOB          DATE             50
EMP_CITY         TEXT             95000
EMP_REMARKS      TEXT             600

Table 3.1 presents an example of typical attributes of a client dimension in a data warehouse, which may be a large dimension in many businesses (e.g. e-business). We can find several attributes that are candidates for coding, such as: EMP_NAME, EMP_ADD, EMP_SEX, EMP_SAL, EMP_DOB, EMP_CITY, and EMP_REMARKS.


Table 3.2  Code Table Example

City name     City Postal Code   Code
DELHI         011                00000010
MUMBAI        022                00000100
KOLKATA       033                00000110
CHENNAI       044                00001000
BANGALORE     080                00001000 00001000
JAIPUR        0141               00000110 00000110
COIMBATORE    0422               00001000 00001000 00001000
COCHIN        0484               00010000 00010000 00010000

Assuming that we want to code the EMP_CITY attribute, an example of a possible resulting codes table is shown in Table 3.2. The codes are represented in binary to better convey the idea. As the attribute has more than 256 distinct values, we will have codes of one byte to represent the 256 most frequent values (e.g. Delhi and Mumbai) and codes of two (or more) bytes to represent the least frequent values (e.g. Jaipur and Bangalore). The values shown in Table 3.2 (represented in binary) would be the ones stored in the database, instead of the larger original values. For example, instead of storing JAIPUR, which corresponds to 6 ASCII characters, we just store the two bytes with the binary code 00000110 00000110.
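The coding steps can be sketched as follows. This is a simplified, illustrative implementation: the function name is ours, a real system would persist the codes table as metadata, and the one-byte/two-byte split would normally fall at 256 values rather than the tiny threshold used here for demonstration.

from collections import Counter
# Simplified sketch of the coding steps: build a frequency histogram, give
# one-byte codes to the most frequent values and two-byte codes to the rest,
# then replace the original values by their codes.
def build_code_table(values, one_byte_slots=256):
    histogram = Counter(values)                       # step 1: frequency histogram
    ranked = [v for v, _ in histogram.most_common()]
    table = {}
    for rank, value in enumerate(ranked):             # step 2: assign codes by frequency
        if rank < one_byte_slots:
            table[value] = bytes([rank])              # most frequent -> one byte
        else:
            extra = rank - one_byte_slots
            table[value] = bytes([extra // 256, extra % 256])   # least frequent -> two bytes
    return table                                      # step 3 would persist this as metadata
emp_city = ["DELHI", "MUMBAI", "DELHI", "JAIPUR", "DELHI", "MUMBAI"]
codes = build_code_table(emp_city, one_byte_slots=2)  # tiny threshold, just for illustration
compressed = [codes[c] for c in emp_city]             # step 4: update the attribute
print(codes, compressed)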

3.4 Query Execution


Query rewriting is necessary in queries where the coded attributes are used in the WHERE clause for filtering. In these queries, the values used to filter the result must be replaced by the corresponding coded values. The following is a simple example of the type of query rewriting needed: the value JAIPUR is replaced by the corresponding code, fetched from the codes table shown in Table 3.2.

Table 3.3  Query execution

Original Query:
SELECT EMP_NAME FROM EMPLOYEE WHERE EMP_CITY = 'JAIPUR'

Modified Query:
SELECT EMP_NAME FROM EMPLOYEE WHERE EMP_CITY = '00000110 00000110'
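The rewriting itself is mechanical: the literal is looked up in the codes table and substituted before the query reaches the engine. A minimal sketch is shown below (the codes dictionary mirrors Table 3.2; real parameter handling would of course be more careful).

# Minimal sketch of query rewriting for a coded attribute: the literal in the
# WHERE clause is replaced by its code from the codes table (Table 3.2).
codes_table = {"JAIPUR": "00000110 00000110", "DELHI": "00000010"}
def rewrite_filter(query, codes):
    for literal, code in codes.items():
        query = query.replace("'%s'" % literal, "'%s'" % code)
    return query
original = "SELECT EMP_NAME FROM EMPLOYEE WHERE EMP_CITY = 'JAIPUR'"
print(rewrite_filter(original, codes_table))
# SELECT EMP_NAME FROM EMPLOYEE WHERE EMP_CITY = '00000110 00000110'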

3.5 Decompression
The decompression of the attributes is only performed when the coded attributes appear in the query select list. In these cases the query is executed, and afterwards the result set is processed in order to decompress the attributes that contain compressed values. As typical data warehousing queries return small result sets, the decompression time represents a very small fraction of the total query execution time.

3.6 Prerequisites
The goal of the experiments performed is to measure experimentally the gains in storage and performance obtained using the proposed technique.


The experiments were divided into two phases. In the first phase only categories compression was used. In the second phase we used categories compression in conjunction with descriptions compression.


Chapter 4

Results & Discussions

4.1 Introduction
Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, inspiring new data compression techniques in database systems that trade reduced disk I/O against additional CPU overhead for compressing and decompressing data. Following the compression technique developed in Chapter 3, I propose a compression algorithm which integrates domain and attribute compression, based on dictionary-based anonymization and implementing global recording generalization. In this chapter, I demonstrate how to compress data so as to achieve better performance than conventional database systems. We address the following two issues. First, we implement a newly proposed N-anonymization technique embedded with global recording generalization. After evaluating it, the report presents the algorithm for data compression and finally demonstrates that our approach gives comparable results to the existing algorithms.


Second, we use a binary-encoded pairing of attributes for data compression, building on the string compression discussed in the previous chapter, and modify it so that it intelligently selects the most effective compression method for string-valued attributes. Moreover, we also use the concepts of data hiding and equivalent sets before compressing the data, so that the private information of users is not revealed publicly.

4.2 Anonymization
Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. N-anonymity is a major technique to de-identify a data set. The idea behind the technique is to choose a value, say n, such that every tuple becomes identical to at least n-1 other tuples (or to as many as possible) on the identifying attributes. The strength of protection increases as n increases. One way to produce n identical tuples within the identifiable attributes is to generalize values within the attributes, for example by removing city and street information in an address attribute. [6] There are many ways through which data de-identification can be done, and one of the most appropriate approaches is generalization. Generalization techniques include global recoding generalization, multidimensional recoding generalization, and local recoding generalization [15]. Global recoding generalization maps the current domain of an attribute to a more general domain; for example, ages are mapped from years to 10-year intervals.


Multidimensional recoding generalization maps a set of values to another set of values, some or all of which are more general than the corresponding premapping values. For example, {male, 32, divorce} is mapped to {male, [30, 40), unknown}. Local recoding generalization modifies some values in one or more attributes to values in more general domains [6].

4.2.1 Problem definition and Contribution


From the very beginning we have made clear that our objective is to make every tuple of a published table identical to at least n-1 other tuples. Identity-related attributes are those which potentially identify individuals in a table. For example, the record of an old male in a rural area with the postcode 302033 is unique in Table 4.1, and hence his problem of asthma may be revealed if the table is published. To preserve his privacy, we may generalize the Gender and Postcode attribute values such that each tuple in the attribute set {Gender, Age, Postcode} has at least two occurrences.

Table 4.1  Published Table

No.   Gender   Age     Postcode   Problem
01    Male     Young   302020     Heart
02    Male     Old     302033     Asthma
03    Female   Young   302015     Obesity
04    Female   Young   302015     Obesity

A view after this generalization is given in Table 4.2. Since various countries use different postcode schemes, we adopt a simplified postcode scheme, where its hierarchy {302033, 3020*, 30**, 3***, *} corresponds to {rural, city, region, state, unknown}, respectively.


Table 4.2  View of published table by Global recording

No.   Gender   Age     Postcode   Problem
01    *        Young   3020*      Heart
02    *        Old     3020*      Asthma
03    *        Young   3020*      Obesity
04    *        Young   3020*      Obesity

Identifier attribute set

A set of attributes that potentially identifies the individuals in a table is called an identifier attribute set. For example, the attribute set {Gender, Age, Postcode} in Table 4.1 is an identifier attribute set.

Equivalent Set

An equivalent set of a table with respect to an attribute set is the set of all tuples in the table containing identical values for that attribute set. For example, tuples 03 and 04 of Table 4.1 form an equivalent set with respect to the attributes {Gender, Age, Postcode, Problem}. Table 4.2 is the 2-anonymity view of Table 4.1, since two attributes were generalized to de-identify the published table.
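A small sketch of global recoding over the identifier attribute set of Table 4.1 is given below (illustrative only; the generalization used is the one of Table 4.2: Gender is suppressed and Postcode is generalized to the city level). Grouping tuples by identical identifier values yields the equivalent sets.

from collections import Counter
# Sketch of global recoding over Table 4.1: every occurrence of a value in an
# attribute domain is replaced by a more general value (Gender -> *, Postcode -> 3020*).
table = [
    {"Gender": "Male",   "Age": "Young", "Postcode": "302020", "Problem": "Heart"},
    {"Gender": "Male",   "Age": "Old",   "Postcode": "302033", "Problem": "Asthma"},
    {"Gender": "Female", "Age": "Young", "Postcode": "302015", "Problem": "Obesity"},
    {"Gender": "Female", "Age": "Young", "Postcode": "302015", "Problem": "Obesity"},
]
identifier_attrs = ("Gender", "Age", "Postcode")
def generalize(row):
    return {**row, "Gender": "*", "Postcode": row["Postcode"][:4] + "*"}
generalized = [generalize(r) for r in table]
# Tuples with identical identifier-attribute values form an equivalent set.
sets = Counter(tuple(r[a] for a in identifier_attrs) for r in generalized)
for key, size in sets.items():
    print(key, "-> equivalent set of size", size)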

4.2.2 Quality measure of Anonymization


From the study we can conclude that the larger the size of an equivalent set, the easier the compression, and obviously the cost of anonymization is a function of the equivalent sets. On the basis of this theory, the average equivalent-set cost can be expressed as:

C_AVG = (Total Records / Total Equivalent Sets) / n        (4.1)


4.2.3 Conclusion
Another name for global recoding is domain generalization, because generalization happens at the domain level: a specific domain is replaced by a more general domain. There are no mixed values from different domains in a table generalized by global recoding. When an attribute value is generalized, every occurrence of the value is replaced by the new generalized value. A global recoding method may over-generalize a table. An example of global recoding is given in Table 4.2, where the two attributes Gender and Postcode are generalized. All gender information has been lost, although it was not necessary to generalize the Gender and Postcode attributes as a whole. So we say that the global recoding method over-generalizes this table.

4.3 Domain compression through binary conversion


We integrate two key methods, namely binary encoding of distinct values and pairwise encoding of attributes, to build our compression technique.

4.3.1 Encoding of Distinct values


This compression technique is based on the assumption that the published table contains a small number of distinct values in each attribute domain and that these values repeat over the huge number of tuples present in the database. Therefore, binary encoding of the distinct values of each attribute, followed by representation of the tuple values in each column of the relation by the corresponding encoded values, transforms the entire relation into bits and thus compresses it [16]. We find the number of distinct values in each column and encode the data into bits accordingly. For example, consider the instance given below, which represents the two major attributes of a relation Patients.


Table 4.3  An instance of relation Student

Age   Problem
10    Cough & Cold
20    Cough & Cold
30    Obesity
50    Diabetes
70    Asthma

Now, if we adopt the concept of N-anonymization with global recording (refer to 4.2), we can map the current domain of attributes to a more general domain. For example, Age can be mapped into 10-year intervals as shown in Table 4.4. To examine the compression benefits achieved by this method, assume that Age is of integer type and has 5 distinct values as in Table 4.3. If there are 50 patients, then the total storage required by the Age attribute will be 50 * sizeof(int) = 50 * 4 = 200 bytes [9]. With our compression technique, suppose we find that there are 9 distinct values for age; then we need the upper bound of log2(9), i.e. 4 bits, to represent each data value in the Age field. It is easy to calculate that we would then need 50 * 4 bits = 200 bits = 25 bytes, which is considerably less [9]. We call this Stage 1 of our compression, which just transforms one column into bits. If we apply this compression to all columns of the table, the result will be significant.


Table 4.4  Representing Stage 1 of compression technique

Age      Problem
10-20    Cough & Cold
30-40    Obesity
50-60    Diabetes
70-100   Asthma

Table 4.5  Representing Stage 1 with binary compression

Age   Problem
00    Cough & Cold
01    Obesity
10    Diabetes
11    Asthma
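The Stage 1 encoding can be sketched directly: count the distinct (generalized) values of a column, take the upper bound of log2 of that count as the bit width, and replace each value by its code. The sketch below uses the Age intervals of Table 4.4; it is illustrative, not the exact implementation.

import math
# Sketch of Stage 1: encode the distinct (generalized) values of a column in
# the minimum number of bits; intervals as in Table 4.4.
age_column = ["10-20", "30-40", "50-60", "70-100", "10-20"]
distinct = sorted(set(age_column))
bits = max(1, math.ceil(math.log2(len(distinct))))        # upper bound of log2(#distinct)
encode = {value: format(code, "0%db" % bits) for code, value in enumerate(distinct)}
print(bits)                              # 2 bits for 4 distinct intervals
print([encode[v] for v in age_column])   # ['00', '01', '10', '11', '00']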

4.3.2 Paired Encoding


It can easily be seen from the above example that, besides optimizing the memory requirement of a relation, the encoding technique above is also helpful in reducing redundancy (repeated values) in the relation. That is, it is likely that there are few distinct values even of (column1, column2) taken together, in addition to few distinct values of column1 alone or column2 alone. We can then represent the two columns together as a single column whose pair values are transformed according to the encoding. This constitutes Stage 2 of our compression, in which we use the bit-encoded database from Stage 1 as input and further compress it by coupling columns in pairs of two, applying the distinct-pairs technique outlined above. To examine the further compression advantage achieved, suppose that we couple the Age and Problem columns. In Table 4.3 there are 5 distinct pairs (10, Cough & Cold), (20, Cough & Cold), (30, Obesity), (50, Diabetes), (70, Asthma); after the Stage 1 generalization of Table 4.4 these reduce to 4 distinct pairs, and hence our upper bound is log2(4) = 2 bits. Table 4.6 shows the result of the Stage 2 compression.

Table 4.6  Representing Stage 2 compression

Age   Problem
00    00
01    01
10    10
11    11

After compressing the attributes, pairing or coupling of attributes is done: all the columns are coupled in pairs of two in a similar manner. If the database contains an even number of columns this is straightforward; if the number of columns is odd, we can intelligently choose one of the columns to be left uncoupled.

Table 4.7  Representing Stage 2 compression coupling

Age-Problem
00
01
10
11


After this compression technique is applied, the space required can easily be calculated. Before compression: 5*(4) + 4*(4) = 36 bytes. After compression and coupling: 4*2 = 8 bits (1 byte).
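A similarly minimal sketch of Stage 2 (again only an illustration under the same assumptions, not the implemented system): the columns produced by Stage 1 are taken two at a time, their distinct value pairs are collected, and each pair is re-encoded with ceil(log2(number of distinct pairs)) bits.

import math

def stage2_pair(col_a, col_b):
    # Couple two columns and encode each distinct (a, b) pair with a fixed-width code.
    pairs = list(zip(col_a, col_b))
    distinct = sorted(set(pairs))
    width = max(1, math.ceil(math.log2(len(distinct))))
    code = {p: format(i, '0{}b'.format(width)) for i, p in enumerate(distinct)}
    return [code[p] for p in pairs], width

# Generalized Age and Problem columns of Table 4.4 (assumed sample data).
age = ['10-20', '10-20', '30-40', '50-60', '70-100']
problem = ['Cough & Cold', 'Cough & Cold', 'Obesity', 'Diabetes', 'Asthma']
encoded, width = stage2_pair(age, problem)
print(encoded, '->', width, 'bits per Age-Problem pair')   # 4 distinct pairs -> 2 bits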

4.4 Add-ons to compression


After performing the compression over the relation and its domains, some conclusions were derived by varying how attributes are coupled with each other. Some of these possibilities are discussed in the following points.

4.4.1 Functional Dependencies


A functional dependency exists between attributes and states that, given a relation R, a set of attributes Y in R is said to be functionally dependent on another set of attributes X if and only if each value of X is associated with at most one value of Y. This implies that the attributes in set X correspondingly determine the values of the attributes in set Y [15]. By rearranging the attributes we found that coupling columns that have relationships similar to functional dependencies gives better compression results. Table 4.8 shows an example of functional-dependency-based compression.

Table 4.8 Representing functional dependency based coupling

Name      Gender    Age    Problem
Harshit   M         10     Cough & Cold
Naman     M         20     Cough & Cold
Aman      M         30     Obesity
Rajiv     M         50     Diabetes
Rajni     F         70     Asthma

Two different test cases were used to check the level of compression. Test case 1 couples the attributes {(Name, Age), (Gender, Problem)}; the individual and coupled distinct-value counts are then checked, as shown in Table 4.9 and Table 4.10 (a sketch of this check follows Table 4.11). In test case 2, coupling is done with the attribute pairs {(Name, Gender), (Age, Problem)}, as shown in Table 4.11.

Table 4.9 Representing the number of distinct values in each column

Column name       Name    Gender    Age    Problem
Distinct values   19      2         19     19

Table 4.10 Representing test case 1

Column name       Name, Age    Gender, Problem
Distinct values   285          35

Table 4.11 Representing test case 2

Column name       Name, Gender    Age, Problem
Distinct values   22              312
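A check of this kind takes only a few lines of Python. The sketch below is illustrative: the full test relation behind Tables 4.9-4.11 is not reproduced here, so the five rows of Table 4.8 stand in for it, and the printed counts therefore differ from those tables.

def distinct_counts(rows, groupings):
    # Count distinct values for each single column or coupled group of columns.
    return {group: len({tuple(row[c] for c in group) for row in rows}) for group in groupings}

# The rows of Table 4.8 used as a small stand-in relation.
rows = [
    {'Name': 'Harshit', 'Gender': 'M', 'Age': 10, 'Problem': 'Cough & Cold'},
    {'Name': 'Naman',   'Gender': 'M', 'Age': 20, 'Problem': 'Cough & Cold'},
    {'Name': 'Aman',    'Gender': 'M', 'Age': 30, 'Problem': 'Obesity'},
    {'Name': 'Rajiv',   'Gender': 'M', 'Age': 50, 'Problem': 'Diabetes'},
    {'Name': 'Rajni',   'Gender': 'F', 'Age': 70, 'Problem': 'Asthma'},
]
print(distinct_counts(rows, [('Name',), ('Gender',), ('Age',), ('Problem',)]))   # per-column counts
print(distinct_counts(rows, [('Name', 'Age'), ('Gender', 'Problem')]))           # test case 1 coupling
print(distinct_counts(rows, [('Name', 'Gender'), ('Age', 'Problem')]))           # test case 2 coupling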


4.4.2 Primary Key


A primary key is an attribute that uniquely identifies a row in a table. The observation regarding the primary key is that coupling the primary-key column with a column having a large number of distinct values is advantageous: because every primary-key value is already unique, the number of distinct tuples of the combination always equals the number of primary-key values (i.e. the number of rows) in the table, no matter which column it is paired with. It therefore costs nothing extra to pair it with the column that already has the most distinct values.
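A two-line check makes the observation concrete (the data here is hypothetical; the only point is that a unique key forces the distinct-pair count to equal the row count, whatever the partner column is).

pk = list(range(1, 51))           # a primary key column: 50 unique values
gender = ['M', 'F'] * 25          # a partner column with only 2 distinct values
# Because every key value is unique, the pair count equals the number of rows:
print(len(set(zip(pk, gender))))  # prints 50

Since the result is 50 regardless of the partner column, pairing the key with the column that already has the most distinct values wastes the least encoding capacity.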

4.4.3 Few distinct values


A database sometimes contains columns with very few distinct values; for example, a Gender attribute will always contain either male or female as its domain. It is therefore recommended that such attributes be coupled with attributes that contain a large number of distinct values. For example, consider 4 attributes {name, gender, age, problem} whose distinct-value counts are name = 200, gender = 2, age = 200, problem = 20. With the coupling {gender, name} and {age, problem}, the result would be at most 200*2 + 200*20 = 4400 distinct tuples, whereas with the coupling {gender, problem} and {name, age} the result would be at most 2*20 + 200*200 = 40040 distinct tuples.
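The same arithmetic can be wrapped in a small helper that, for any candidate pairing scheme, bounds the total number of distinct pairs and the code width needed (a sketch under the worst-case assumption that every value combination may occur; names are our own).

import math

def estimate_coupling(distincts, pairing):
    # Worst-case distinct pairs and total code bits for a given pairing of columns.
    total_pairs, total_bits = 0, 0
    for a, b in pairing:
        pairs = distincts[a] * distincts[b]
        total_pairs += pairs
        total_bits += math.ceil(math.log2(pairs))
    return total_pairs, total_bits

distincts = {'name': 200, 'gender': 2, 'age': 200, 'problem': 20}
print(estimate_coupling(distincts, [('gender', 'name'), ('age', 'problem')]))   # (4400, 21)
print(estimate_coupling(distincts, [('gender', 'problem'), ('name', 'age')]))   # (40040, 22)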

4.5 Limitations
Two of the most often cited disadvantages of our approach are write operations and tuple construction. Write operations are generally considered problematic for two reasons: (1) inserted tuples have to be broken up into their component attributes and each attribute must be written separately, and (2) the densely packed data layout makes moving tuples within a page nearly impossible.

Tuple construction is also considered problematic since information about a logical entity is stored in multiple locations on disk, yet most queries access more than one attribute from an entity.

4.6 Conclusion
In this study we discussed two different compression techniques embedded within each other to form Globally Recorded binary encoded Domain Compression. The first part defines generalization and discusses its different types for anonymizing attributes. It examines a major problem of global recoding generalization, namely inconsistent domains in a field of a generalized table, and proposes a method to approach the problem. The examples illustrate the proposed global recoding method based on n-anonymity and consistency. The second technique focuses on extending the existing compression by encoding the domain in binary form and further encoding pairs of column values. It shows how coupling of columns can be effective if the attributes are properly rearranged. In particular, we found that in most cases it is beneficial to couple the primary key with the column having the maximum number of distinct values. Also, columns with very few distinct values should be paired with columns with a large number of distinct values. Functional dependencies should be identified to achieve better compression of related attributes. Overall, better knowledge of the data distribution leads to better compression. Based on the database and the application environment being targeted, the optimum stage up to which compression is feasible and worthwhile also needs to be determined, i.e. we need to decide the point at which the extra compression achieved is not worth the performance overhead involved.


Chapter 5

Conclusion & Future Work

5.1 Conclusion
In this thesis we studied how compression techniques can be used to improve database performance. After comparing existing approaches, we also proposed an algorithm for compressing columnar databases. We studied the following research issues:

Compressing different domains of databases: We studied how different domains of a database, such as varchar, int and NULL values, can be dealt with while compressing a database. Compared to existing compression methods, our approach considers the heterogeneous nature of string attributes and uses a comprehensive strategy to choose the most effective encoding level for each string attribute. Our experimental results show that using HDE methods achieves a better compression ratio than using any single existing method, and that HDE also achieves the best balance between I/O saving and decompression overhead.


Compression-aware query optimization: We observed that deciding when to decompress string attributes is a crucial issue for query performance. A traditional optimizer enhanced with a cost model that takes both the I/O benefits of compression and the CPU overhead of decompression into account does not necessarily find good plans. Our experiments show that the combination of effective compression methods and compression-aware query optimization is crucial for query performance; the use of our compression methods and optimization algorithms achieves up to an order of magnitude improvement in query performance over existing techniques. The significant gain in performance suggests that a compressed database system should have its query optimizer modified accordingly.

Compressing query results: We proposed how to use domain knowledge about the query to improve the effect of compression on query results. Our approach uses a combination of compression methods, and we represented such combinations using an algebraic framework.

5.2 Future Work


There are several interesting future directions for this research work.

Compression-aware query optimization: First, it would be interesting to study how caching of intermediate (decompressed) results can reduce the overhead of transient decompression. Second, we plan to study how our compression techniques can handle updates. Third, we will study the impact of hash join on our query optimization work.


Result compression: We plan to explore the joint optimization of query plans and compression plans. Currently, the compression optimization is based on the query plan returned by the query optimizer. However, the overall cost of a combination of a query plan and a compression plan differs from the cost of the query plan alone. For instance, a more expensive query plan may sort the result in an order such that the sorted-normalization method can be applied, so that the overall cost is lower.


APPENDIX I

Infobright

I.1 Introduction
The demand for business analytics and intelligence has grown dramatically across all industries. This demand is outpacing the availability of the technical expertise and budgets needed to implement it successfully. Infobright helps solve these problems by providing a solution that implements and manages a scalable analytic database.

Infobright offers two versions of their software: Infobright Community Edition (ICE) and Infobright Enterprise Edition (IEE). ICE is an open source product that can be freely downloaded. IEE is the commercial version of the software. It offers enhanced features that are often necessary for production and operational support.

The Infobright database is designed as an analytic database. It can handle business-driven, ad hoc queries in a fraction of the time the same queries would take on a transactional database. Infobright achieves its high analytic performance by organizing the data in columns instead of rows.


Infobright combines a columnar database with its Knowledge Grid architecture to deliver a self-managing, self-tuning database optimized for analytics. Infobright eliminates the need to create indexes, partition data, or do any manual tuning to achieve fast response for queries and reports.

The Infobright database resolves complex analytic queries without the need for traditional indexes, data partitioning, projections, manual tuning or specific schemas. Instead, the Knowledge Grid architecture automatically creates and stores the information needed to quickly resolve these queries. Infobright organizes the data into 2 layers: the compressed data itself that is stored in segments called Data Packs, and information about the data which comprises the components of the Knowledge Grid. For each query, the Infobright Granular Engine uses the information in the Knowledge Grid to determine which Data Packs are relevant to the query before decompressing any data.

Infobright technology is based on the following concepts: column orientation, Data Packs, the Knowledge Grid, and the Granular Computing Engine.

I.2 Infobright Architecture


Column Orientation: Infobright is, at its core, a highly compressed column-oriented database. This means that instead of the data being stored row by row, it is stored column by column. There are many advantages to column orientation, including more efficient data compression, because each column stores a single data type (as opposed to rows, which typically contain several data types), allowing the compression to be optimized for each particular data type. Infobright, which organizes each column into Data Packs (as described below), achieves greater compression than other column-oriented databases because it applies a compression algorithm based on the content of each Data Pack, not just of each column. Most queries involve only a subset of the columns of the tables, so a column-oriented database focuses on retrieving only the data that is required.

Data Packs and the Knowledge Grid: Data is stored in Data Packs of 65K values each. Data Pack Nodes contain a set of statistics about the data that is stored and compressed in each of the Data Packs. Knowledge Nodes provide a further set of metadata related to Data Packs or column relationships. Together, Data Pack Nodes and Knowledge Nodes form the Knowledge Grid. Unlike traditional database indexes, they are not created manually and require no ongoing "care and feeding"; instead, they are created and managed automatically by the system. In essence, they provide a high-level view of the entire content of the database. This is what makes Infobright so well suited for ad hoc analytics, unlike other databases that require pre-work such as indexes, projections, partitioning or aggregate tables in order to deliver fast query performance.

Granular Computing Engine: The Granular Engine uses the Knowledge Grid information to optimize query processing. The goal is to eliminate, or significantly reduce, the amount of data that needs to be decompressed and accessed to answer a query. IEE can often answer queries by referencing only the Knowledge Grid information (without having to read the data), which results in sub-second response for those queries.
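The effect of the Knowledge Grid can be pictured with a toy Python sketch (purely a conceptual analogy of Data Pack pruning, not Infobright's actual code or data layout): per-pack min/max statistics are consulted first, and only packs whose range overlaps the query predicate are decompressed.

packs = [
    {'min': 1,   'max': 90,  'data': '<compressed block>'},   # stand-ins for Data Pack Node stats
    {'min': 100, 'max': 250, 'data': '<compressed block>'},
    {'min': 300, 'max': 420, 'data': '<compressed block>'},
]

def packs_to_decompress(packs, lo, hi):
    # Keep only packs whose [min, max] range can contain values in [lo, hi].
    return [p for p in packs if p['max'] >= lo and p['min'] <= hi]

# A predicate such as WHERE value BETWEEN 200 AND 350 touches only two of the three packs.
relevant = packs_to_decompress(packs, 200, 350)
print(len(relevant), 'of', len(packs), 'Data Packs need decompression')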

Figure I.1 Infobright architecture: the Granular Computing Engine performs iterative optimization and execution, using the Knowledge Grid (Knowledge Nodes and Data Pack Nodes over the compressed columns) to pre-select which data must be decompressed.

I.3 Infobright Benefits

Just Load and Go

Infobright is simple to implement and manage and requires very little administration.

Infobright is self-managing. There is no need to create or manage indexes or partition data.


Infobright is compatible with major Business Intelligence tools such as Jaspersoft, Actuate/BIRT, Cognos, Business Objects, Microstrategy, Pentaho and others.

High performance and scalability


Infobright loads data extremely fast - up to 280GB/hour. Infobright's columnar approach results in fast response times for complex analytic queries.

As your database grows, query and load performance remain constant.

Infobright scales up to 50TB of data.

Low Cost

The cost of Infobright is very low compared to closed source, proprietary solutions.

Using Infobright eliminates the need for complex hardware infrastructure.

Infobright runs on low cost, industry standard servers. A single server can scale to support 50TB of data.

Infobright's industry-leading data compression (10:1 up to 40:1) significantly reduces the amount of storage required.

I.4 MySQL Integration


MySQL is the world's most popular open source database software, with over 11 million active installations. Infobright brings scalable analytics to MySQL users through its integration as a MySQL storage engine. If your MySQL database is growing and query performance is suffering, Infobright is the ideal choice.


Many users of MySQL turn to Infobright as their data volumes and analytic needs grow, since Infobright offers exceptional query performance for analytic applications against large amounts of data. Migrating from MySQL's MyISAM storage engine, or other MySQL storage engines, to the Infobright column-oriented analytic database is quite straightforward.

Infobright contains a bundled version of MySQL, and installing Infobright installs a new instance of MySQL along with Infobright's optimizer, Knowledge Grid, the Infobright Loader and the underlying columnar storage architecture. This installation also includes MySQL's MyISAM storage engine. Unlike other storage engines that work with MySQL, it is not necessary to have an existing MySQL installation, nor can Infobright be added to an existing MySQL Server installation. When installing Infobright, the assumption is that any previously existing MySQL or MyISAM database resides in a separate installation of MySQL, installed in a different directory with its own data path, configuration files, socket and port values.

In the data warehouse marketplace, the database must integrate with a variety of tools. By integrating with MySQL, Infobright leverages the extensive tool connectivity provided by MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.).

It also enables MySQL users to leverage the mature, tested BI tools with which they're already familiar. You'll also benefit from MySQL's legendary ease of use and low maintenance requirements.

Infobright-MySQL integration includes the following features:

Industry-standard interfaces including ODBC, JDBC, C API, PHP, Visual Basic, Ruby, Perl and Python;

Comprehensive management services and utilities;

Robust connectivity with BI tools such as Actuate/BIRT, Business Objects, Cognos, Microstrategy, Pentaho, Jaspersoft and SAS.

I.5 Practical Implementation


Infobright neither needs nor allows the manual creation of performance structures with duplicated data, such as indexes or table partitioning based on expected usage patterns of the data. When preparing the MySQL schema definition for execution in Infobright, the first thing to do is simplify the schema. This means removing all references to indexes and other constraints expressed as indexes, including PRIMARY and FOREIGN KEYs and UNIQUE and CHECK constraints. In addition, due to Infobright's extremely high query performance on large volumes of data, one should consider removing any aggregate, reporting and summary tables that may be in the data model, as they are unnecessary.

I have done some experimental work with an existing airline database whose tables have many columns. Basic SQL queries are executed to check the performance of the database; these are ad hoc queries, i.e. any column may be accessed by them. The airline database is tested with two existing database management systems, INFOBRIGHT and MYSQL. I created a table with a large number of columns (around 50) of different data types and then filled the columns with data using LOAD DATA INFILE.


Creating table airline_info

CREATE TABLE `airline_info` (
  `Year` year(4) DEFAULT NULL, `Quarter` tinyint(4) DEFAULT NULL, `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL, `DayOfWeek` tinyint(4) DEFAULT NULL, `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL, `AirlineID` int(11) DEFAULT NULL, `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL, `FlightNum` varchar(10) DEFAULT NULL, `Origin` char(5) DEFAULT NULL,
  `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `OriginStateFips` varchar(10) DEFAULT NULL,
  `OriginStateName` varchar(100) DEFAULT NULL, `OriginWac` int(11) DEFAULT NULL, `Dest` char(5) DEFAULT NULL,
  `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DestStateFips` varchar(10) DEFAULT NULL,
  `DestStateName` varchar(100) DEFAULT NULL, `DestWac` int(11) DEFAULT NULL, `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL, `DepDelay` int(11) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL, `DepartureDelayGroups` int(11) DEFAULT NULL, `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL, `WheelsOff` int(11) DEFAULT NULL, `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL, `CRSArrTime` int(11) DEFAULT NULL, `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL, `ArrTimeBlk` varchar(20) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, `CRSElapsedTime` int(11) DEFAULT NULL,
  `ActualElapsedTime` int(11) DEFAULT NULL, `AirTime` int(11) DEFAULT NULL, `Flights` int(11) DEFAULT NULL,
  `Distance` int(11) DEFAULT NULL, `DistanceGroup` tinyint(4) DEFAULT NULL
);

Loading data in Infobright


mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_info/${YEAR}_$i.txt.tr' INTO TABLE airline_info FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" airline_info
# The trailing airline_info is the database name; the table airline_info was created above.

Query Execution

Query execution and result display in MYSQL are familiar; the following queries show how execution is performed in INFOBRIGHT.


mysql> SELECT sum(c19), sum(c89), sum(c129), count(*) FROM t WHERE c11 < 5;
+----------+----------+-----------+----------+
| sum(c19) | sum(c89) | sum(c129) | count(*) |
+----------+----------+-----------+----------+
|  2417861 |  2341752 |   2357072 |      487 |
+----------+----------+-----------+----------+
1 row in set (0.16 sec)

mysql> SELECT sum(c19), sum(c89), sum(c129), count(*) FROM t WHERE c11 > 5;
+------------+------------+------------+----------+
| sum(c19)   | sum(c89)   | sum(c129)  | count(*) |
+------------+------------+------------+----------+
| 4995339851 | 4990774999 | 4998401490 |   999382 |
+------------+------------+------------+----------+
1 row in set (1.18 sec)

1. SELECT count(*) FROM airline_info;

Both INFOBRIGHT and MYSQL execute it immediately, returning a count of 1000 rows.

2. SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

The query counts flights per day of the week for the years 2000 to 2008, with the result:


[ 5, 7509643 ] [ 1, 7478969 ] [ 4, 7453687 ] [ 3, 7412939 ] [ 2, 7370368 ] [ 7, 7095198 ] [ 6, 6425690 ]

And it took 7.9s for INFOBRIGHT and 12.13s for MYSQL

3. SELECT t.carrier, c, c2, c*1000/c2 AS c3
   FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay > 10 AND Year = 2007 GROUP BY carrier) t
   JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year = 2007 GROUP BY carrier) t2
   ON (t.carrier = t2.carrier)
   ORDER BY c3 DESC;

The query calculates, for each carrier, the share of delayed flights in 2007 (c3 is expressed per 1000 flights).

So result is: [ "EV", 101796, 286234, 355 ] [ "US", 135987, 485447, 280 ] [ "AA", 176203, 633857, 277 ] [ "MQ", 145630, 540494, 269 ] [ "AS", 42830, 160185, 267 ] [ "B6", 50740, 191450, 265 ] [ "UA", 128174, 490002, 261 ]

With execution time: 0.5s for INFOBRIGHT and 2.92s for MYSQL


4. SELECT t.YearD, c1/c2
   FROM (SELECT YearD, count(*)*1000 AS c1 FROM ontime WHERE DepDelay > 10 GROUP BY YearD) t
   JOIN (SELECT YearD, count(*) AS c2 FROM ontime GROUP BY YearD) t2
   ON (t.YearD = t2.YearD);

The query finds, per year, the share of flights delayed by more than 10 minutes (again per 1000 flights):

with result: [ 1988, 166 ] [ 1989, 199 ] [ 1990, 166 ] [ 1991, 147 ] [ 1992, 146 ] [ 1993, 154 ] [ 1994, 165 ] [ 1995, 193 ] [ 1996, 221 ] [ 1997, 191 ] [ 1998, 193 ] [ 1999, 200 ] [ 2000, 231 ] [ 2002, 163 ] [ 2003, 153 ] [ 2004, 192 ]

And with execution time 27.9s INFOBRIGHT and 8.59s MYSQL

This shows that INFOBRIGHT does not cope well with scanning a wide range of rows, so MYSQL gives a better result when the query is row oriented.


5. select year,count(*) as c1 from ontime group by YEAR

The query shows how many records there are per year, with the result:

+------+---------+
| year | c1      |
+------+---------+
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009728 |
+------+---------+

And execution time: INFOBRIGHT 6.3s and MYSQL: 0.31s


The following graphs summarize the performance differences between MYSQL and INFOBRIGHT for the parameters measured above.

Graph I.1 Load time comparison of INFOBRIGHT and MYSQL

Graph I.2 Table size (kilobytes) comparison of INFOBRIGHT and MYSQL

Graph I.3 Query execution time comparison of INFOBRIGHT and MYSQL (the per-query timings reported above)


It must always be kept in mind that Infobright is not just a storage engine plugged into MySQL; it is a complete server with its own optimizer. Its day-to-day use is nevertheless quite similar to MySQL, apart from the LOAD DATA INFILE syntax used to fill the tables. With column-stores, you may not need to build snowflake schemas or do much transformation, so they require less effort to get started with in smaller companies with resource-starved IT departments. Infobright is not without its issues, however. Documentation is thin or non-existent. I spent hours and hours until I determined (and confirmed on the forums) that the Infobright loader does not support all of the MySQL syntax for bulk loads. This would not have been such a problem if the error message had given some warning about my syntax, which was perfectly legal in standard MySQL.


Bibliography
[1] Zhiyuan Chen. Building Compressed Databases. August 2002.

[2] Akanksha Baid and Swetha Krishnan. Binary Encoded Attribute-Pairing Technique for Database Compression. Computer Sciences Department, University of Wisconsin-Madison.

[3] Peter Alexander Boncz. MONET: A Next-Generation DBMS Kernel for Query-Intensive Applications.

[4] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? Yale University; MIT; AvantGarde Consulting, LLC.

[5] Jorge Vieira, Jorge Bernardino, and Henrique Madeira. Efficient Compression of Text Attributes of Data Warehouse Dimensions. Critical Software S.A.; CISUC-ISEC, Instituto Politécnico de Coimbra; CISUC-DEI, Universidade de Coimbra.

[6] Jiuyong Li, Raymond Chi-Wing Wong, Ada Wai-Chee Fu, and Jian Pei. Anonymization by Local Recoding in Data with Attribute Hierarchical Taxonomies. IEEE.

[7] Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. The Implementation and Performance of Compressed Databases. Universität Mannheim and Universität Passau.

[8] Daniel J. Abadi, Peter A. Boncz, and Stavros Harizopoulos. Column-oriented Database Systems. Yale University; CWI, Amsterdam; HP Labs.

[9] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. Materialization Strategies in a Column-Oriented DBMS. In Proc. ICDE, 2007.

[10] M. Stonebraker et al. C-Store: A Column-oriented DBMS. In Proc. VLDB, 2005.

[11] M. Zukowski, S. Heman, N. Nes, and P. A. Boncz. Super-Scalar RAM-CPU Cache Compression. In Proc. ICDE, 2006.

[12] TPC-H toolkit: http://www.tpc.org/tpch/

[13] A. Halverson, J. Beckmann, and J. Naughton. A Comparison of C-Store and Row-Store in a Common Framework. Technical Report TR1566, UW-Madison Department of Computer Sciences, 2006.

[14] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A Column-oriented DBMS. In VLDB, pages 553-564, 2005.

[15] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.

[16] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

[17] Z. Chen, J. Gehrke, and F. Korn. Query Optimization in Compressed Database Systems. In ACM SIGMOD (2001), 271-282.

[18] J. Bernardino, P. Furtado, and H. Madeira. Approximate Query Answering Using Data Warehouse Striping. Journal of Data and Knowledge Engineering, Volume 19, Issue 2, Elsevier (2002).

[19] N. Brisaboa, E. Iglesias, G. Navarro, and J. Paramá. An Efficient Compression Code for Text Databases. In ECIR (2003), 468-481.

[20] P. E. O'Neil, X. Chen, and E. J. O'Neil. Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance. In ICDE, 2008.

[21] http://www.sybase.com/products/informationmanagement/sybaseiq

[22] http://www.infobright.org/resources/__How_To_Migrate_from_MyISAM.pdf

[23] http://en.wikipedia.org/wiki/Column-oriented_DBMS

[24] Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. MIT.
