
In this unit, we'll give you an overview of SQL for big data.

The question is: why use SQL on big data? Well, Hadoop is widely used in big data environments. It's used to explore extremely large data sets and to figure out what value might be contained in them. Or, if you are already aware of the kind of value in those big data sets, you can use Hadoop to derive insights from them. Data warehouse augmentation is also a very common use case for Hadoop, where you might feed your data warehouse with additional information extracted from these big data sets.

To derive value from Hadoop data sets, you use the MapReduce algorithm, which is highly scalable but notoriously difficult to use. To run MapReduce jobs, you might have to write programs in languages like Java, which can be a tedious process and requires programming expertise. Or you might use higher-level languages like Pig, which not too many people are familiar with and which also require special expertise. In addition, there are many different file formats, storage mechanisms, configuration options, and so on that you need to take into consideration when doing MapReduce.

Now, SQL, or Structured Query Language, can really democratize access to big data by opening the data up to a much wider audience. SQL has been in use for several decades, so people are very familiar with it, it is widely used, and its syntax is widely known. Moreover, there are a lot of tools and applications out there that are written using SQL. And there is a clear separation of what you want versus how to get it, so you don't really need to worry about the underlying data structures to get value from your data using SQL.

So a combination of SQL and Hadoop can be very powerful for querying and analyzing big data, and there is more than one option out there for running SQL on Hadoop.

The first and most popular one is Hive, which was initially developed by Facebook but is now an open source project at the Apache Foundation. On Wikipedia it is described as a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
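To make this concrete, here is a minimal sketch of what working with Hive looks like. The table name, columns, and HDFS path are hypothetical, invented purely for illustration; the point is that you declare what you want, and Hive compiles the query into MapReduce jobs for you.

    -- Lay a table definition over raw files already sitting in HDFS
    -- (table name, columns, and path are made-up examples)
    CREATE EXTERNAL TABLE page_views (
      view_time  TIMESTAMP,
      user_id    BIGINT,
      page_url   STRING,
      referrer   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/page_views';

    -- Ask a question declaratively; Hive turns it into MapReduce jobs
    SELECT page_url, COUNT(*) AS views
    FROM page_views
    GROUP BY page_url
    ORDER BY views DESC
    LIMIT 10;

Without Hive, answering that one question would typically mean writing and chaining custom MapReduce programs in Java.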

The next is Dremel, which was conceived at Google. Its architecture has been published as a research paper, but it is yet to be made available for widespread commercial use. Dremel is described as a scalable, interactive ad hoc query system for the analysis of read-only data. By combining multi-level execution trees and a columnar data layout, it is said to be capable of running aggregation queries over trillion-row tables in seconds.

There is also Drill, which is based on Google's Dremel but with the additional flexibility needed to support a broader range of query languages, data formats, and data sources. Drill is an Apache Foundation project in the incubation stage and has been sponsored mainly by MapR.

In late 2012, Cloudera introduced Impala, which is also inspired by Dremel, with a vision to bring real-time ad hoc query capability to Apache Hadoop. An Impala binary is available in public beta form, and Cloudera has open sourced its code base.

And a fairly new entrant in this space is Big SQL from IBM, which is now available as a technology preview. Although Big SQL has been released fairly recently, it is actually the culmination of several years of research and development at IBM, with a vision to bring powerful and highly performant SQL capabilities to the big data environment.
