
Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization


A Technical Whitepaper

Rick F. van der Lans


Independent Business Intelligence Analyst
R20/Consultancy
November 2014

Sponsored by Cisco Systems, Inc.

Copyright 2014 R20/Consultancy. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Trademarks of companies referenced in this document are the sole property of their respective owners.

Table of Contents

Introduction

The Ever-Growing Data Warehouse

Overview of Hadoop

The Data Warehouse Environment and Hadoop

Examples of Offloadable Data

Steps for Offloading Data Warehouse Data to Hadoop

Step 1: Identifying Offloadable Data


Step 2: Installing Hadoop
Step 3: Importing Tables in CIS
Step 4: Migrating Reports to CIS
Step 5: Creating Tables in Hadoop
Step 6: Extending Views to Include the Hadoop Tables
Step 7: Collecting Statistical Data on Offloadable Data
Step 8: Initial Offloading of Data to Hadoop
Step 9: Refresh of the Offloaded Data
Step 10: Testing the Report Results
Step 11: Adapting the Backup Process


Getting Started


About the Author Rick F. van der Lans


About Cisco Systems, Inc.



1 Introduction

This whitepaper focuses on introducing the popular Hadoop data storage technology in an existing data warehouse environment, with the intention to use it as a platform for data offloading. The primary reasons to offload data from current SQL database servers to Hadoop are to reduce storage costs and to speed up reports. The result is a data warehouse environment in which data is distributed across multiple data storage technologies.

The Cisco Information Server (CIS) data virtualization server is used to hide this hybrid data storage system. It allows organizations to migrate transparently from their single data storage solution to a hybrid storage system. Users and reports won't notice this offloading of data.

The whitepaper describes, step by step, how to introduce Hadoop in an existing data warehouse environment. Guidelines, dos and don'ts, and best practices are included.

As a complementary offering, Cisco also provides a packaged solution called Cisco Big Data Warehouse Expansion, which includes the software, hardware, and services required to accelerate all the activities involved in offloading data from a data warehouse to Hadoop.

2 The Ever-Growing Data Warehouse

More, More, and More Most data warehouse environments use the once-in-never-out principle to store data. In most environments there is no intention to move data from the data warehouse to an archive or to remove it entirely. When new invoice, customer, or sales data is added, old data is not removed to make room. Some data warehouses contain data that is more than twenty years old and that's barely ever used.

But it's not only new data coming from existing data sources that enlarges a data warehouse. New data sources are continuously introduced as well. An organization develops a new Internet-based transaction system, acquires a new marketing system, or installs a new CRM system; all the data produced by these new systems must be copied to the data warehouse, enlarging it even further.

Besides new internal data sources, it has become very common to add data from external data sources, such as data from social media networks, open data sources, and public web services. Especially for specific forms of analytics, enriching internal data with external data can be very insightful. Again, all this external data is stored in the data warehouse environment.

All this new data and all these new data sources lead to data warehouse environments that keep growing.

The Drawbacks of an Ever-Growing Data Warehouse In principle, data growth is a good thing. When more data is available to analysts and business users, the reporting and analytical capabilities of an organization potentially increase.


However, there are some practical drawbacks:

• Expensive data storage: Storing data in databases costs money. Examples of data storage costs are storage hardware costs, management costs, and license fees for the database servers, especially when the fee depends on the database size.
• Poor query performance: The bigger the tables in the data warehouse are, the slower the reporting queries will be.
• Poor loading performance: Loading new data may slow down when tables become bigger. Especially the indexes on the tables, which must be updated when the data is loaded, can have a negative effect on the loading speed.
• Slow backup/recovery: The larger a database is, the longer the backup process and an eventual restore of all the data take.
• Expensive database administration: The larger a database is, the more time-consuming database administration will be. More and more time must be spent on tuning and optimizing the database server, the tables, the buffers, and so on.

The Solutions There are several solutions to shrink a data warehouse environment:

• Removing Data: Periodically, delete some of the least used or less recently used data to keep the data warehouse environment as small as possible. The drawback of this approach is that analysis of the deleted data is no longer possible, limiting, in particular, historical analysis.

• Moving Data to an Offline Archive: Periodically, move unused or little-used data to an offline data storage system in which the data is not available online. This solution keeps the data warehouse environment small. The challenge is to reanimate offline data quickly and easily for incidental analysis. A crucial question to answer is whether the offline data should be reloaded in the data warehouse, or temporarily copied to a separate database. In principle, this doesn't reduce the amount of data stored; it's just that a portion is stored outside the data warehouse.

• Offloading Data to an Inexpensive Online Data Storage System: Move a portion of the data to an online data storage system with another set of technical characteristics. Obviously, such a data storage system must support online queries, and there should be no need to reanimate the data before it can be used for analysis. Storing data should also be less expensive. The result is a hybrid data warehouse environment in which all the stored data is distributed over different data storage technologies.

For most organizations, the third solution is preferred.

Solving the Data Warehouse Growing Pains with Hadoop One of the newest data storage technologies is Hadoop. Hadoop has been designed to handle massive amounts of stored data, has been optimized to process complex queries efficiently, supports a wide range of application areas, and has a low price/performance ratio. Especially this last characteristic makes Hadoop an attractive data storage platform to operate side by side with familiar SQL database technology. Because the financial and technical characteristics of Hadoop are very different from those of SQL database technologies, designers can choose the best fit for each data set. For example, if query speed is crucial, SQL may be selected, and when storage costs must be reduced, Hadoop can be chosen. In other words, designers can go for the best of both worlds.


Hiding the Hybrid Data Storage System with Data Virtualization SQL database technology is used in almost all data warehouses to store data. When data is offloaded to Hadoop, the data warehouse environment deploys both data storage systems: Hadoop and SQL database technology. The consequence is that reports and users have to use different APIs and languages depending on where data is stored. Such APIs must be studied in detail. Especially the traditional Hadoop APIs, such as HDFS, HBase, and MapReduce, are very technical and complex. Also, in such a new situation, reports and users must know in which storage system the data they want to analyze resides. All this raises the costs of report development and maintenance. Finally, many reporting and analytical tools do not support access to Hadoop. This means that users would have to learn how to work with new reporting tools and that existing reports would have to be redeveloped.

A data virtualization server decouples the two data storage systems from the reports, presenting one integrated data storage environment. In fact, with data virtualization the reports and users won't notice that data has been offloaded to another system. Users don't have to learn new APIs or languages, and they don't have to know in which system the data resides. Data virtualization fully hides the hybrid data storage system. More on this in Section 4.

3 Overview of Hadoop

Introduction to Hadoop Apache Hadoop has been designed to store, process, and analyze large amounts of data, from terabytes to petabytes and beyond, and to process that data in parallel on a hardware platform consisting of inexpensive commodity computers. It consists of a set of software modules from which developers can pick and choose. Figure 1 illustrates the Hadoop modules on which this whitepaper focuses.

Figure 1 Hadoop consists of a number of modules, including HDFS, YARN, MapReduce, HBase, and Hive (the light blue boxes).


The core modules are briefly introduced here. For more extensive descriptions, we refer to Tom White's book¹ on Hadoop.

• HDFS: The Hadoop Distributed File System (HDFS) forms the foundation of Hadoop. This module is responsible for storing and retrieving data. It's designed and optimized to deal with large amounts of incoming data per second and to manage enormous amounts of data, up to petabytes.

• YARN: YARN (Yet Another Resource Negotiator) is a resource manager responsible for processing all requests to HDFS correctly and for distributing resource usage correctly. Like all other resource managers, its task is to assure that the overall performance is stable and predictable.

• MapReduce: MapReduce offers a programming interface for developers to write applications that query data stored in HDFS. MapReduce can efficiently distribute query processing over hundreds of nodes. It pushes any form of processing to the data itself, and thus parallelizes the execution and minimizes data transport within the system. MapReduce has a batch-oriented style of query processing.

• HBase: The HBase module is designed for applications that need random, real-time, read/write access to data. HBase has an API consisting of operations such as insert record, get record, and update record. HBase is usually categorized as a NoSQL system².

• Hive: The Hive module is a so-called SQL-on-Hadoop engine and offers a SQL interface on data stored in HDFS. It uses MapReduce or HBase to access the data. In case of the former, Hive translates each SQL statement into a MapReduce job that executes the request. Hive was the first SQL-on-Hadoop engine. Nowadays, alternative products are available, including Apache Drill, Cloudera Impala, and Spark SQL.
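As a brief illustration of what such a SQL interface offers, the following HiveQL query (the table and column names are hypothetical) is translated by Hive into one or more MapReduce jobs:

SELECT   customer_id, COUNT(*) AS number_of_calls, SUM(duration) AS total_duration
FROM     call_detail_records
WHERE    call_date >= '2013-01-01'
GROUP BY customer_id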

The Strengths of Hadoop

• High data storage scalability: HDFS has been designed and optimized to handle extremely large files. In real-life projects, Hadoop has repeatedly proven that it's able to store, process, analyze, and manage big data.

• High data processing scalability: Hadoop has been designed specifically to operate in highly distributed environments in which it can exploit large numbers of nodes and drives. For example, one hundred drives working at the same time can read one terabyte of data in two minutes. In addition, MapReduce processing is moved to the nodes where the data is located. There is almost no centralized component that could become a bottleneck and lead to performance degradation.

• High performance: Together with HDFS, MapReduce offers high-performance reporting and analytics. One of the features of HDFS is data replication, which makes concurrent access to the same data (on different nodes) possible.

1. White, T., Hadoop: The Definitive Guide, O'Reilly Media, 2012, third edition.
2. Redmond, E. and Wilson, J.R., Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, Pragmatic Bookshelf, 2012.


• Low price/performance ratio: Hadoop has been designed to exploit inexpensive commodity hardware. The license fees of commercial Hadoop vendors are not based on the amount of data stored.

• All data types: HDFS is a file system that can store any type of data, including weblogs, emails, or records with sensor data. In addition, functions can be developed in MapReduce that have the same complexity found in SQL statements and beyond. MapReduce is not limited to structured data; it allows applications to access any form of data. For example, complex functions can be developed to analyze text or complex weblog records. If programmed correctly, MapReduce is able to process these complex functions completely in parallel, thus distributing this complex, I/O- and resource-intensive processing over a large set of nodes.

• Fast loading: HDFS has been designed to load massive amounts of data in real time or in batch.

Summary Especially the low price/performance ratio and the high scalability make Hadoop an ideal platform for offloading many types of data currently stored in the data warehouse environment.

4 The Data Warehouse Environment and Hadoop

The Databases of a Data Warehouse Environment Traditionally, a data warehouse environment consists of several databases, such as a staging area, a central data warehouse, and many data marts; see Figure 2. In most environments, all these databases are implemented using SQL database server products, such as Oracle, Teradata, IBM PureData System for Analytics (formerly Netezza), or Microsoft SQL Server.

Figure 2 Traditionally, a data warehouse environment consists of several databases.

Note that most data is stored redundantly in a data warehouse environment. For example, data coming from the source systems is first stored in a staging area, then copied to a data warehouse, and from there copied to one or more data marts. In the latter case, the data is usually stored in a slightly aggregated form, but it's still redundant data.


From SQL to Hadoop As indicated in Section 3, the low price/performance ratio and the high scalability features make Hadoop an ideal platform for storing data warehouse data. Several application areas exist for deploying Hadoop in a data warehouse environment. For example, Hadoop can be used to develop a sandbox for data scientists, or it can be used to store massive amounts of textual data. And it can be deployed to lessen the growing pains of a data warehouse environment: by moving some of the data from the SQL databases to Hadoop, organizations can deal with data growth more easily.

In general, there are two ways to migrate data to Hadoop: entire tables or partial tables.

Entire tables: Offloading entire tables to Hadoop implies that some of the tables in a SQL database (for example, the data warehouse itself) are moved completely to Hadoop; see Figure 3. After migrating the tables, the SQL database contains fewer tables. Reasons to offload an entire table are, for example, that the table is not used frequently, or that the data is massive and thus too expensive to store in the SQL database.

Figure 3 Entire tables are moved from a SQL database to Hadoop.

Partial tables: Offloading partial tables means that only a part of a table is offloaded. For example, for a number of tables in a SQL database a subset of the records is moved to a table in Hadoop; see Figure 4. Afterwards, the SQL database still contains all the tables it owned before; it's just that the tables don't contain all the records anymore. The rest of the records are stored in Hadoop. Reasons for offloading a partial table are, for example, that little-used or unused records in a table can be identified, or that the table is just massive.

As Figure 4 shows, an alternative to offloading a subset of the rows is offloading a subset of the columns. This can be useful when some columns are barely ever used, or when they contain very large values, such as images and videos. In this case, moving those columns to Hadoop may be more cost-effective.


Figure 4 Offloading partial tables means that a subset of the records is moved from a SQL table to a Hadoop table.

A Hybrid Data Storage System Offloading data to Hadoop results in a hybrid data storage system in which some data is stored in SQL databases and some in Hadoop. Depending on the data and reporting characteristics, data resides in one of the two storage systems. This means, for example, that when data is offloaded from a SQL database (that holds the central data warehouse) to Hadoop, the data warehouse is no longer one physical database.

The consequence of such a hybrid data storage system is that applications must know in which data storage system the data they need resides, which records or columns of a table reside in which storage system, and when data moves from one data storage system to another, and they must understand the different APIs and languages supported by the storage systems. And as already indicated, Hadoop APIs, such as HDFS, HBase, and MapReduce, are very technical and complex. Skills to work with these APIs are not commonly found in BI departments.

Data Virtualization and the Hybrid Data Storage System All these technical aspects, such as handling different APIs and being location- and SQL-dialect-aware, slow down report development and complicate maintenance. In particular for self-service BI users, this may all be too complex, possibly leading to incorrect report results.

Also important to note is that popular reporting and analytical tools typically support SQL, but not Hadoop interfaces such as HDFS, HBase, and MapReduce. This makes it difficult, and maybe even impossible, to use familiar tools on the data stored in Hadoop, and it leads to redevelopment of existing reports with new tools.

These problems can be solved by placing CIS between, on the one hand, all the applications and reporting tools, and, on the other hand, all the SQL databases and Hadoop; see Figure 5. This way, reports and users still see one integrated database. CIS completely hides the hybrid data storage system. With data virtualization, the location of the data, the APIs, and the dialect differences are fully hidden. Users and reports don't have to know where or how tables are stored, they don't have to know where records are stored, users can continue using their favorite reporting tools, and reports don't need to be redeveloped. In addition, CIS makes a transparent migration to a hybrid storage environment possible. Users won't notice the offloading of data to Hadoop.

Figure 5 By hiding all the technical details, CIS turns the hybrid data storage system into one integrated database.

5 Examples of Offloadable Data

Infrequently used data is the first type of data that comes to mind for offloading to Hadoop. But many more types of data exist that are well suited for offloading. This section contains examples of types of data that organizations may consider for offloading to Hadoop.

Massive Fact Data Some tables can be just massive in size. For example, tables in a data warehouse containing call detail records or sensor data may contain billions and billions of records. Purely because of the sheer size of those tables, and possibly also the ingestion rate of new data, it's recommended to move these tables to Hadoop. These massive tables are almost always fact tables.

Semi-Structured Data Nowadays, a lot of new data comes in a semi-structured form, for example, in the form of XML or JSON documents. Technically, it's possible to store such documents in SQL tables, but access is not always fast, and transforming them to flat table structures may be time-consuming during loading. Hadoop allows data to be stored in its original form. In fact, it supports file formats specifically developed for these forms of data.

Textual Data As with semi-structured data, textual data, such as emails, agreements, and social media messages, can be stored in SQL tables. However, it may be more cost-effective to store this type of data in Hadoop.


Audio/Video Data In many BI environments, audio and video data is not analyzed at all. Most reporting and analytical tools can't exploit it at all. For example, current BI tools do not have the features to detect cancer in MRI scans, to count the germs in a swab, to do optical character recognition, or to identify microstructural flaws in steel. Plus, storing this type of data in a SQL database is quite expensive. Dedicated tools to analyze this data do exist, and most of them work quite well with Hadoop.

Cold Data Many tables contain records or columns that are, for whatever reason, rarely ever used. Cold data can be defined as data that was entered a long time ago and that has been used infrequently for a considerable amount of time. In a way, cold data is like the junk you have lying around in your garage that you haven't used in ages, but that is still there, filling up the garage so that your car and barbecue are outside in the rain, corroding away.

Obsolete Data Tables may also contain data that has become irrelevant for most users of the organization: obsolete data. This can be, for example, descriptive data on obsolete sales products, or sales data related to retail shops that have been closed down. Obsolete data cannot simply be thrown away; it may still be needed for compliance reasons. Because of its very low usage, it makes sense to offload obsolete data to Hadoop.

6 Steps for Offloading Data Warehouse Data to Hadoop

This section describes, step by step, how to offload data from one or more of the SQL databases in a data warehouse environment, such as the data warehouse or a data mart, and how to move that data to Hadoop. These are the steps:

1. Identifying offloadable data
2. Installing Hadoop
3. Importing tables in CIS
4. Migrating reports to CIS
5. Creating tables in Hadoop
6. Extending views to include the Hadoop tables
7. Collecting statistical data on offloadable data
8. Initial offloading of data to Hadoop
9. Refresh of the offloaded data
10. Testing the report results
11. Adapting the backup process

We assume that all the databases making up the data warehouse environment are developed with a SQL database server, such as Oracle, Teradata, IBM PureData System for Analytics (formerly Netezza), or Microsoft SQL Server.

Step 1: Identifying Offloadable Data

It all starts with identifying offloadable data. This sounds simple, but it isn't, because the identification process is not an exact science. For example, when is data cold or warm? There is always a set of records that is lukewarm. Or, how much textual data financially justifies its offloading to Hadoop? When is data really obsolete? To assist, this section gives guidelines for identifying offloadable data.

Massive Fact Data The very large tables in a data warehouse or data mart are almost always fact tables. Evaluate whether some of them can be offloaded entirely. For this, study their query and load workload. Here are some reasons to offload a massive fact table:

• The SQL database server is starting to have performance problems with loading new data into the fact table; Hadoop's load speed may be faster.
• Overall, the queries on the fact table are showing performance problems due primarily to the number of records.
• It's becoming too expensive to store the entire fact table in the SQL database server.

When the query workload can be characterized as interactive reporting and analysis, be careful with offloading to Hadoop, because Hadoop's support for this type of analysis leaves much to be desired. If most of the reporting is much more traditional, then consider offloading the data.

When it's not an option to offload the entire table, offload a subset of the cold records; see the next topic.

Cold Data Section 5 contains a definition of cold data: data that was entered a long time ago and that has been used infrequently for a considerable amount of time. Unfortunately, all the concepts used in this definition are vague. What is a long time ago? What is used infrequently? What is a considerable amount of time? To answer such questions, data usage must be analyzed in detail. A data usage analysis must show how often tables are queried, how often individual records and columns are queried, how often data is inserted, and how many users use certain records or columns. The more questions of this kind are answered, the more complete the data usage analysis will be, and the easier it is to determine the temperature of the data.

Data usage can be determined with custom-developed tools or with dedicated monitors. Most SQL database servers support monitors that show data usage. Unfortunately, most of them do not show data usage at the level of detail required for this analysis; most of these tools only show usage per table or per column. In an ideal situation, the data usage analysis shows query usage per day; see Figure 6 for an example. In this diagram, the curved black line indicates how infrequently older records are still being used and how frequently the newer ones are. The alternative purple curve indicates another form of data usage, in which even older records are still being used frequently. In this case, offloading records probably doesn't make sense.

Based on the data usage results, define the crossover point (the dotted line in Figure 6). The crossover point is the age that divides the cold data from the warm data. In most cases it's easy to identify the really cold and the really warm records of a table, but there is always a group of records that is neither warm nor cold. In case of doubt, define these records as warm and don't offload them.
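If the SQL database server does not offer a usage monitor with sufficient detail, a first impression of where the crossover point may lie can be obtained by combining the age distribution of the records with whatever usage figures are available. The following query is only a sketch; the table and column names (SALES_FACTS, SALES_DATE) are hypothetical and the date function syntax differs per SQL product:

SELECT   EXTRACT(YEAR FROM SALES_DATE) AS SALES_YEAR,
         COUNT(*)                      AS NUMBER_OF_RECORDS
FROM     SALES_FACTS
GROUP BY EXTRACT(YEAR FROM SALES_DATE)
ORDER BY SALES_YEAR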


Figure 6 A data usage analysis exercise can result in a clear presentation of the usage of individual records. It shows which records in a table may be considered for offloading.

When defining a crossover point, be careful with the following aspects:

• The crossover point is not a date, but an age.
• The crossover point may change over time. For example, initially an age of two years is identified as the crossover point, which must be changed to three years later on. The reason may be that users have started to run more forms of historical analysis, leading to a change in data usage. This may slide the crossover point to an older age (to the left in Figure 6).
• Some records that have come of age and have clearly passed the crossover point may be very hot records. These records are called outliers. Analyze whether tables contain outliers whose usage is significantly higher than that of records of approximately the same age. In this case, identify them and change the crossover point accordingly.

Sometimes an entire column contains cold data: data that's barely ever or never used at all. In this case, no crossover point is required. Especially offloading wide columns can be useful. Examples of wide columns are columns containing long text blocks, images, scanned documents, or videos. In the case of an offloadable column, indicating the column's name is sufficient.

Obsolete Data Because of its very low usage, offload obsolete data. The difference between obsolete data and cold data is that for the latter an age-related crossover point is defined, whereas obsolete data has no relationship to the concept of age.

For obsolete records, a criterion indicates which ones are obsolete. An example of such a criterion is an indication of whether a shop has been closed. Or, all the values of a column may be obsolete when it contains characteristics of a product that is no longer in use.

Audio/Video Data Audio and video data is barely ever used for reporting and analysis. As indicated, most reporting tools do not even have features to analyze it. To reduce the size of the data warehouse, it's strongly recommended to offload all this data to Hadoop. Especially if separate tables have been created in a data warehouse to store these large values, offloading them entirely to Hadoop is a relatively simple exercise. In addition, in many situations this type of data is only used by a small group of users, but it may still be in the way of other users.


An additional advantage of moving audio and video data to Hadoop is that specialized tools may be available to analyze this data. This would enrich the organization's analytical capabilities.

Textual Data For a large part, what applies to offloading audio and video data also applies to offloading textual data. Textual data may end up filling a large part of a database, slowing down the loading processes and all the queries.

Justification of Data Offloading After the offloadable data has been determined for each table, determine whether the amount of data justifies an offload. This depends on the following key aspects:

• Database size: When an entire SQL database is less than 1 terabyte in size, do not offload data at all.
• Total number of records in a table: The bigger a table is, the bigger the chance that performance problems exist with the query workload and the loading process, and the more relevant offloading can be. For example, do not offload a table with barely a hundred records; the overhead costs would be too high compared to the performance and storage advantages.
• Percentage of offloadable records: The higher the percentage of offloadable data in a table, the higher the storage savings and the bigger the query performance improvements (see the sketch after this list).
• Percentage of queries accessing non-offloaded data: The more queries exist that access only the remaining data, the bigger the query performance improvement will be after the offload.
• Query performance: The more queries with poor performance exist, the bigger the benefit of offloading will be. Do not offload tables with no query performance problems.
• Total amount of offloadable data: A final check is whether the total amount of offloadable data is significant. Offloading data from only one or two tables is hard to justify financially. If not enough offloadable tables are identified, the process stops.
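For the percentage of offloadable records, a simple count per table is usually sufficient. The following query is only a sketch, assuming a crossover point of three years and a hypothetical SALES_FACTS table with a SALES_DATE column; the date arithmetic syntax differs per SQL product:

SELECT   COUNT(*) AS TOTAL_RECORDS,
         SUM(CASE WHEN SALES_DATE < CURRENT_DATE - INTERVAL '3' YEAR
                  THEN 1 ELSE 0 END) AS OFFLOADABLE_RECORDS
FROM     SALES_FACTS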

Summary The result of this first step is a list of tables for which offloading can be considered, plus criteria indicating the offloadable data. For cold records the criterion is a crossover point, for obsolete data the criterion is a list of columns and records, and for the other types of data it's an indication of the entire table or a set of columns.

Step 2: Installing Hadoop

For installing Hadoop we refer to the manuals and documentation of the various vendors. However, to guarantee that it has been installed and optimized properly and that replication has been set up correctly, consult Hadoop experts. Do not leave this to beginners; your data is too valuable!

No specific Hadoop implementation is recommended. There is only one requirement: to be able to work with CIS, a SQL-on-Hadoop engine must be installed. Such an engine makes the Hadoop files look like SQL tables that can be accessed using SQL. Currently, CIS supports the SQL-on-Hadoop engines Hive (versions 1 and 2) and Impala. Important to understand is that Hive and Impala share the same metadata store, which is accessible via the HCatalog interface. For example, tables created with Hive can be queried with Impala, and vice versa. So, an initial investment in one does not require a major migration when switching to the other.


Note: The market for SQL-on-Hadoop engines is currently highly volatile. New products and new features are added almost every month. Therefore, it's hard to predict what the best product will be in the long run. CIS will continue to support more SQL-on-Hadoop engines in the future.

Step 3: Importing Tables in CIS

Identify all the SQL databases in the data warehouse environment that contain offloadable data. Connect CIS to all these SQL databases. Also connect CIS to the selected SQL-on-Hadoop engine to get access to Hadoop HDFS. For all these data sources, organize the correct data access privileges.

Next, import in CIS the definitions of all the tables that are being used by the reports; see Figure 7. Then, publish them all via ODBC/JDBC. The reason that all the tables (and not only the ones with offloadable data) must be imported and published is that we want CIS to handle all data access. If only the tables with offloadable data were imported, the reports themselves would have to access two different data sources (the SQL database and CIS), and this seriously complicates report development.

Figure 7 All the tables (including the ones without offloadable data) accessed by the reports must be imported in CIS, so that in the new situation all data access is handled by CIS. In this diagram, for each table T a view called V is defined.

If primary and foreign keys have been defined on the SQL tables, define the same set of keys on the views in CIS. The reason is that several reporting tools need to be aware of these keys to understand which relationships exist between the tables.
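Initially, each of these views is nothing more than a one-to-one wrapper of the imported table, in the style of the view definitions shown in Step 6; a minimal sketch for a hypothetical table called CUSTOMER:

SELECT   *
FROM     CUSTOMER

Only in Step 6 are these definitions extended with the Hadoop tables.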

Step 4: Migrating Reports to CIS

In the new environment, CIS hides the hybrid data storage system by decoupling the reports from the data stores. For this, all the reports have to be migrated. Instead of accessing the tables in the SQL databases directly, they must access those same tables via the views defined in CIS; see Figure 8. These are the views defined in the previous step.


Figure 8 All the data access of the reports must be redirected. Instead of accessing the real tables in the SQL databases, reports must access the views defined in CIS.

This step starts with redirecting the reports from the SQL databases to CIS. Instead of using the ODBC/JDBC driver of the SQL database, they must access data through one of the CIS ODBC/JDBC drivers. This mainly involves work on the side of the reporting tool. For example, changes may have to be made to a semantic layer or a metadata directory. For most reports this migration will proceed effortlessly. Many of them will show exactly the same results as they did before, because they're still accessing the same tables in the same SQL databases.

There may be some minor issues. For example, it could be that the reports execute SQL statements that use proprietary features not supported by CIS. In this case, try to rewrite these SQL statements and replace the proprietary features with more standard ones. If specific scalar functions are missing, develop them in CIS itself. These functions are then executed by CIS and not by the underlying SQL database server.
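As an illustration of such a rewrite, a query that uses a proprietary scalar function, such as Oracle's NVL, can usually be replaced by the standard COALESCE function; the table and column names below are hypothetical:

SELECT   CUSTOMER_ID, NVL(DISCOUNT, 0)      -- proprietary function
FROM     SALES_FACTS

SELECT   CUSTOMER_ID, COALESCE(DISCOUNT, 0) -- standard function
FROM     SALES_FACTS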

If there is no solution, use the pass-through feature of CIS. In this case, CIS receives the queries from the report and passes them unchanged to the underlying database server.

At the end of this step, the reports still access the same tables they've always accessed; it's just that now their SQL statements pass through CIS. This opens the way to transparent offloading of data. In fact, all the steps described so far can be executed on the data warehouse environment without anyone noticing it.

Note: Execute this step report by report. First apply this step to one report before moving on to the next. This way, lessons learned when migrating one report can be used when migrating the next one.


Step 5: Creating Tables in Hadoop

In this step, new tables are created in Hadoop that will hold the offloaded data. Use the selected SQL-on-Hadoop engine to create them. What these Hadoop tables will look like and how they will be handled depends completely on whether an entire table, a subset of records, or a subset of columns is offloaded.

• Offloading an entire table or a subset of records of a table: For each table that must be offloaded entirely, or for which a subset of records must be offloaded, define a table in Hadoop. Each Hadoop table must have a structure similar to that of the original SQL table. This means that they must all have the same set of columns, identical column names, and each column must have a similar data type.

• Offloading a subset of columns: For each table for which a subset of columns is offloaded, define a table in Hadoop that includes the primary key columns of the original table plus the columns that must be offloaded. Make sure that the same column names and similar data types are used when creating the Hadoop tables.

Currently, Hive and Impala, and most of the other SQL-on-Hadoop engines as well, don't support primary keys. The manuals state that it's the responsibility of the applications to guarantee uniqueness of the primary key columns. Over time, this will change. If the SQL-on-Hadoop engine in use does support primary keys, copy the primary key definition from the original SQL table to the Hadoop table.

If indexes can be defined using the SQL-on-Hadoop engine, consider copying the indexes from the original SQL table to the Hadoop table as well.

Because most SQL-on-Hadoop engines are relatively young, they don't support all the data types that are offered by SQL database servers. Evidently, classic data types, such as integer, character, and decimal, are supported. For each SQL-on-Hadoop engine a list of supported data types is available. For example, see the Apache documentation³ for a detailed description of the data types supported by Hive.

Physical parameters must be set for each Hadoop table. These parameters define how data is stored and accessed. Especially important is the file format. Examples of file formats are SEQUENCEFILE, TEXTFILE, RCFILE, ORC, and AVRO. The file format has an impact on the size of the files, the data load speed, the query performance, et cetera. For reporting and analytics on structured data, it's recommended to use the ORC (Optimized Row Columnar) file format⁴. This is a column-oriented file format that compresses the data, reduces I/O, and speeds up query performance. It has been designed specifically for reporting and analytical workloads. For tables that primarily contain audio and video data, select the SEQUENCEFILE format, and select TEXTFILE when textual data has to be stored.
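As an illustration, a Hive table for offloaded fact data could be created as follows; the table and column names are hypothetical, and the STORED AS ORC clause selects the recommended file format:

CREATE TABLE SALES_FACTS_OFFLOADED
( SALES_ID    BIGINT,
  CUSTOMER_ID BIGINT,
  SALES_DATE  DATE,
  AMOUNT      DECIMAL(12,2) )
STORED AS ORC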

3. Apache, Hive Data Types, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
4. Apache, ORC Files, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Step 6: Extending Views to Include the Hadoop Tables

The views defined in Step 3 only show data stored in the SQL databases. In this step, the views that point to SQL tables containing offloadable data are extended with the Hadoop tables that will contain the offloaded data. The solution depends on whether an entire table or a subset of records is offloaded, or whether a subset of columns is offloaded.

Offload an Entire Table or a Subset of Records Redefine each view in such a way that it includes the equivalent Hadoop table. The old definition, containing only the SQL table:

SELECT   *
FROM     SQL_TABLE

The new definition after extending it with the Hadoop table:

SELECT   *
FROM     SQL_TABLE
UNION ALL
SELECT   *
FROM     HADOOP_TABLE

The table HADOOP_TABLE refers to the table defined with the SQL-on-Hadoop engine, and SQL_TABLE to the equivalent table in the SQL database. The UNION operator is used to combine the two tables. If possible, use the UNION ALL operator when combining the two tables (as in the example above). UNION requires that duplicate rows are removed from the result; whether duplicate rows exist or not, this requires processing. If primary keys have been defined on the tables, UNION ALL is recommended instead, because this operator doesn't require the removal of duplicate rows, which speeds up query processing. UNION ALL can be used in this situation because no duplicate rows exist.

Note that the definition of this view is changed again in Step 8. When all the data has been offloaded and SQL_TABLE has become empty, it can be removed from the definition.

The result of extending these views is presented in Figure 9. Here, the view called V2 is defined on the SQL table T2 combined with the equivalent Hadoop table. For the reports that access V2 nothing has changed; the reports still return the same results.

Figure 9 The view definitions are changed so that they show data from both the SQL tables and the Hadoop tables.


Offload a Subset of Columns When a subset of columns must be offloaded, the altered view definition looks quite different:

SELECT   *
FROM     SQL_TABLE LEFT OUTER JOIN HADOOP_TABLE
         ON (SQL_TABLE.PRIMARY_KEY_COLUMNS = HADOOP_TABLE.PRIMARY_KEY_COLUMNS)

With this definition, the offloaded columns are (virtually) added to SQL_TABLE using a join.

Note that in this definition a left outer join is used. An inner join can be used (which may be faster) if for each row in the SQL table one row exists in the Hadoop table. If the columns to be offloaded contain many null values, it's recommended to insert a row in the Hadoop table only for those rows in the SQL table that contain values in those columns. In that case, a left outer join is required.

Remarks Make sure that in CIS the statistical information on the Hadoop tables has been updated. This is important, because CIS must know that the Hadoop tables are still empty and that they don't have to be accessed. If, for some reason, CIS accesses Hadoop unnecessarily, add a fake condition to the view definition that tells CIS not to access it; see the last line of code in this example:

SELECT   *
FROM     SQL_TABLE
UNION ALL
SELECT   *
FROM     HADOOP_TABLE
WHERE    1 = 2

Note: This step has no impact on the reporting results. When accessing the new versions of these views, the results should still be the same as before this step, because the Hadoop tables contain no data yet.

Step 7: Collecting Statistical Data on Offloadable Data

Before offloadable data is migrated, it's important that statistical data about that data is collected. This statistical data is required to test the report results afterwards; see Step 10.

The more statistical data is derived, the better, but at least collect the following statistical data on the tables containing offloadable data (a sketch of such a query follows the list):

• For each table, determine the number of offloadable records. This is necessary to determine whether all of the records have been migrated correctly.
• For each table, determine the lowest and highest key value of that data.
• For each numeric column, determine the sum and the average (with a high precision) of all the values. This is necessary to determine whether the numeric values have been copied correctly and that there have been no problems with numeric data type conversions.
• For each alphanumeric column, determine the average length (with a high precision) of the string values. Remove the trailing blanks before the length is calculated.
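A minimal sketch of such a statistics query, assuming a hypothetical SALES_FACTS table with key column SALES_ID, numeric column AMOUNT, alphanumeric column DESCRIPTION, and a three-year crossover point (the exact function names and date arithmetic differ per SQL product):

SELECT   COUNT(*)                       AS OFFLOADABLE_RECORDS,
         MIN(SALES_ID)                  AS LOWEST_KEY,
         MAX(SALES_ID)                  AS HIGHEST_KEY,
         SUM(AMOUNT)                    AS SUM_AMOUNT,
         AVG(AMOUNT)                    AS AVG_AMOUNT,
         AVG(LENGTH(TRIM(DESCRIPTION))) AS AVG_DESCRIPTION_LENGTH
FROM     SALES_FACTS
WHERE    SALES_DATE < CURRENT_DATE - INTERVAL '3' YEAR

The same query is run again in Step 10, on the extended view or on the Hadoop table, and the numbers are compared.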


Step 8: Initial Offloading of Data to Hadoop

The Initial Offloading of Data This step describes the initial offloading of data to Hadoop; the next step describes the refresh of the offloaded data. Some of the work done in this step is done only once, just to get the ball rolling. The continuous offloading process is dealt with in Step 9.

The initial offloading of data stored in the SQL tables to the Hadoop tables requires a solution that extracts data from the SQL tables, loads it into the Hadoop tables, and removes it from the SQL tables. Such a solution can be developed in several ways, and several technologies are available, such as CIS itself, dedicated ETL technology, Hadoop's own ETL tool called Sqoop, or an internally developed solution. The choice primarily depends on the amount of data to be offloaded.

Regardless of the solution, do not spend too much development time on this step, because some parts of the solution will be used only once. Therefore, even if the amount of data is considerable, go for the simplest solution, even if this means that the migration takes some time. It doesn't make sense to spend four days on development to improve the performance of the migration solution from two hours to two minutes.

Note that the views don't have to be redefined after this step, because they already access the offloaded columns from Hadoop.

Migrating an Entire Table


1) Develop the logic that moves all the data of the SQL table to the Hadoop table:

INSERT   INTO HADOOP_TABLE
SELECT   *
FROM     SQL_TABLE

2) Drop the entire SQL table:

DROP TABLE SQL_TABLE

3) Redefine the view, because there is no need to access the SQL table anymore. The new definition becomes:

SELECT   *
FROM     HADOOP_TABLE

4) Change the current ETL process that loads data into SQL_TABLE. It must be redirected to load data straight into HADOOP_TABLE instead.


Migrating a Subset of Records


1) Develop the logic that migrates the data from the SQL table to the Hadoop table:

INSERT   INTO HADOOP_TABLE
SELECT   *
FROM     SQL_TABLE
WHERE    CHARACTERISTIC = 'XYZ'

The WHERE clause contains a condition that identifies the subset of records to be offloaded.

2) Develop the logic to delete the offloadable data from the SQL table:

DELETE   FROM SQL_TABLE
WHERE    CHARACTERISTIC = 'XYZ'

Note that copying an integer or string value from one system to another is straightforward. This does not apply to video and audio data; usually, this is not just a matter of copying the value from one system to another. It may well be that a dedicated program has to be written to extract an audio value from the SQL table and another to store it in the Hadoop table, and for both, dedicated, platform-specific logic has to be written.

Migrating a Subset of Cold Data


1) Develop the logic that migrates the cold data from the SQL table to the Hadoop table:

INSERT   INTO HADOOP_TABLE
SELECT   *
FROM     SQL_TABLE
WHERE    DATE_RECORD < CURRENT_DATE - 3 YEARS

The table HADOOP_TABLE refers to the table defined with the SQL-on-Hadoop engine, and SQL_TABLE to the equivalent table in the SQL database. This example relates to cold data; therefore, the condition in the WHERE clause indicates a crossover point of three years: data older than three years is moved to Hadoop.

2) Develop the logic to delete the offloadable data from the SQL table:

DELETE   FROM SQL_TABLE
WHERE    DATE_RECORD < CURRENT_DATE - 3 YEARS

Migrating a Set of Columns


1) Develop the logic to offload a subset of columns:

INSERT   INTO HADOOP_TABLE
SELECT   PRIMARY_KEY, OFFLOADED_COLUMN1, OFFLOADED_COLUMN2, ...
FROM     SQL_TABLE


This statement must be expanded with the following WHERE clause if rows in which all the offloaded columns contain a null value should not be copied:

INSERT   INTO HADOOP_TABLE
SELECT   PRIMARY_KEY, OFFLOADED_COLUMN1, OFFLOADED_COLUMN2, ...
FROM     SQL_TABLE
WHERE    NOT (OFFLOADED_COLUMN1 IS NULL AND OFFLOADED_COLUMN2 IS NULL AND ...)

2) Remove the offloaded columns from the SQL table:

ALTER TABLE SQL_TABLE DROP OFFLOADED_COLUMN1, OFFLOADED_COLUMN2, ...

3) Change the current ETL process that loads data into SQL_TABLE. Data for the offloaded columns must go straight into HADOOP_TABLE instead.

Offloading Large Sets of Data The logic above suggests that offloading can be done with two or three SQL statements. In principle this can be done using CIS: CIS allows data to be copied from one data store to another using an INSERT/SELECT statement. If the amount of data to be copied is small, this will work, and it may even work efficiently. But it is not the recommended approach when large sets of data have to be offloaded. For example, one such statement may be responsible for copying one billion rows to Hadoop in one go; the chance that problems occur during the execution of such a simple INSERT/SELECT statement is substantial. The statements above are purely included to show the logic to be executed; they should not be regarded as the preferred implementation form when large sets of data have to be offloaded. When large amounts of data have to be moved to Hadoop, use dedicated tools, such as ETL tools. They have been optimized and tuned for this type of work.

Even when ETL tools are used, it may take a considerable amount of time to copy the data; in fact, it may even take days. Because offloading must be done offline (otherwise, reports will show incorrect results), it may be necessary to offload the data in batches. For example, every night one month of cold data is copied, or all the transactions of one region are copied. Because the views accessed by the users include both tables, report results will not be impacted. After each night of offloading, CIS extracts more data from Hadoop and less from the SQL database server. If such a strategy is selected and possible, copy the oldest data first, and work your way towards the more current data.

Differences in SQL Dialects The SQL dialect supported by a SQL database server can differ slightly from that of a SQL-on-Hadoop engine. In most cases these differences are minor, but they exist nevertheless. If differences exist, some reports won't work or won't work properly. It's important to identify these differences.

Potential differences:

• SQL language differences: Many vendors of SQL database products have added their own proprietary constructs and statements to the language. For example, DB2 supports the MERGE statement that most others don't.
• SQL processing differences: Some constructs are processed differently by different vendors.
• Function differences: Many SQL functions are supported by all SQL products, but each product has its own proprietary functions. For example, some have added functions for manipulating XML documents, and others for statistical analysis.


• Function processing differences: Some SQL functions are processed differently by various products. Especially datetime-related functions are notorious for their differences.
• Data type differences: Some SQL databases support special data types. For example, a few have data types for geographical analysis, and some allow the development of special data types.
• Data type processing differences: Data types may be named the same, but that doesn't mean their behavior is the same. For example, the maximum time DB2 can store in a time column is 24:00:00, while it's 23:59:59 for Oracle. This can lead to different results.
• Null value processing differences: The null value is not processed in exactly the same way by all SQL products.

CIS handles many of these differences. Nevertheless, these differences can lead to SQL queries that cannot be executed, or that return different results.

Step 9: Refresh of the Offloaded Data

For many forms of offloading, the work is done after the initial offload. For example, when an entire table has been offloaded, new data is from then on added directly to Hadoop. The same is true for obsolete data, which has to be offloaded just once. But this is not true for, for example, warm data, which eventually becomes cold and must then be moved to Hadoop. Therefore, changes must be made to the ETL logic that feeds the data warehouse environment with new data.

Periodically, data that has turned cold must be moved. Schedule this solution to run in sync with the crossover point of the cold data. For example, if the crossover point is 36 months, the migration must be executed every month, and if the crossover point is 8 quarters, it must be executed every quarter.

Most of the development work will deal with the ETL logic, and not with what has been defined in Hadoop or CIS.
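Because records that were offloaded earlier are no longer present in SQL_TABLE, the periodic refresh can simply re-execute, on the chosen schedule, the same pair of statements shown in Step 8 for cold data; a sketch, using the same notation for the date arithmetic:

INSERT   INTO HADOOP_TABLE
SELECT   *
FROM     SQL_TABLE
WHERE    DATE_RECORD < CURRENT_DATE - 3 YEARS

DELETE   FROM SQL_TABLE
WHERE    DATE_RECORD < CURRENT_DATE - 3 YEARS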

Step 10: Testing the Report Results

When data has been migrated to Hadoop, and before it's made available to the users, it's important that extensive tests are run to determine whether all the reports still return identical results. The report results in the new situation must be 100% identical to those in the previous situation. We recommend a two-step testing approach in which the data quality is tested first and then the reports.

For the first step, the statistical data calculated in Step 7 is needed. Run the same queries again that were used to calculate that statistical data and check whether all the numbers are still identical. Differences can arise for various reasons, because differences exist between data types, the way nulls are handled, how queries are processed, and how functions are executed. For example, some SQL products return a rounded integer when an integer is divided by another integer value, while others return a decimal. Evidently, this leads to different report results.
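As a small illustration of the integer division difference, the following expression (the column names are hypothetical) may return 0 on one SQL product and 0.5 on another when, for example, 1 is divided by 2; an explicit cast makes the intended result unambiguous on both:

SELECT   AMOUNT_SOLD / AMOUNT_ORDERED                        -- product-dependent result
FROM     SALES_FACTS

SELECT   CAST(AMOUNT_SOLD AS DECIMAL(12,2)) / AMOUNT_ORDERED -- explicit decimal division
FROM     SALES_FACTS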


When the data quality has been tested, the real reports must be run on the new hybrid system. This is primarily a visual test: compare the report results of the new situation with those of the old situation.

Step 11: Adapting the Backup Process

In most data warehouse environments, the backup process of the SQL databases has already been organized. Periodically, backups are created in case a restore is required. A restore may be needed due to a hardware failure (data corruption on disk, a disk or node crash, a rack failure), a user error (corrupted data writes, accidental or malicious data deletion), or a site failure due to, for example, fire.

Now that data is offloaded to Hadoop, a separate backup process must be developed for that portion of the data. Hadoop HDFS supports data replication. By replicating data, it's relatively easy to handle hardware failures. Configure Hadoop in such a way that at least three replicas are stored. Be sure that the first two replicas are on different hosts and the third replica is on a different rack. In addition, don't forget to back up the configuration files and the NameNode metadata.

In the case of cold data, new data is only added to the Hadoop database when data has come of age, so periodically an incremental backup is recommended.

The order of doing a backup is as follows:

1. Load new data into the SQL database
2. Offload cold data from the SQL database
3. Do the backup of the SQL database
4. Load the cold data into Hadoop
5. Do the (incremental) backup of Hadoop

If things do go wrong and a restore is required, it's important that the restore processes for the SQL database and Hadoop are synchronized. At no point in time may a record appear in both databases, because that would instantly lead to incorrect report results.

7 Getting Started

We recommend deploying a step-by-step, iterative process when implementing a hybrid data warehouse environment in which offloaded data is stored in Hadoop. Do not execute the steps described in the previous section one by one for all the tables that contain offloadable data. Instead, work table by table. First, identify the table with the largest amount of offloadable data. Execute all the steps to offload this data. Check whether everything works, and only if it does, move on to the second table.

When everything works, that is, when data has been offloaded and the reports return identical results, it's time to exploit the features of CIS more extensively. After Step 11, only its abstraction and data federation capabilities are used, but CIS can do more. For example, when many report definitions contain the same data-related definitions, extract them from the reports and define them in CIS views. Or, when data security rules are defined in the reports, define them in CIS. The more specifications are extracted from the reports and implemented in CIS, the easier they are to define and maintain, and the more consistent the reporting results will be.


About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.

Rick is chairman of the European Enterprise Data and Business Intelligence Conference (organized annually in London). He writes for SearchBusinessAnalytics.Techtarget.com, BeyeNetwork.com⁵ and other websites. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in a number of articles⁶, all published at BeyeNetwork.com. The Data Delivery Platform is an architecture based on data virtualization.

He has written several books on SQL. Published in 1987, his popular Introduction to SQL⁷ was the first English book on the market devoted entirely to SQL. After more than twenty-five years, this book is still being sold, and it has been translated into several languages, including Chinese, German, and Italian. His latest book⁸, Data Virtualization for Business Intelligence Systems, was published in 2012.

For more information please visit www.r20.nl, or email rick@r20.nl. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.

About Cisco Systems, Inc.

Cisco (NASDAQ: CSCO) is the worldwide leader in IT that helps companies seize the opportunities of tomorrow by proving that amazing things can happen when you connect the previously unconnected. Cisco Information Server is agile data virtualization software that makes it easy for companies to access business data across the network as if it were in a single place.

For more information, please visit www.cisco.com/go/datavirtualization.

5. See http://www.beyenetwork.com/channels/5087/articles/
6. See http://www.beyenetwork.com/channels/5087/view/12495
7. R.F. van der Lans, Introduction to SQL: Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.
8. R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.

