Copyright 2014 R20/Consultancy. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. or other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Trademarks of companies referenced in this document are the sole property of their respective owners.
Table of Contents

Introduction
Overview of Hadoop
Getting Started
1 Introduction
This whitepaper focuses on introducing the popular Hadoop data storage technology in an existing data warehouse environment with the intention to use it as a platform for data offloading. The primary reasons to offload data from current SQL database servers to Hadoop are to reduce storage costs and to speed up reports. The result is a data warehouse environment in which data is distributed across multiple data storage technologies.
The Cisco Information Server (CIS) data virtualization server is used to hide this hybrid data storage system. It allows organizations to migrate transparently from their single data storage solution to a hybrid storage system. Users and reports won't notice this offloading of data.
The whitepaper describes, step by step, how to introduce Hadoop in an existing data warehouse environment. Guidelines, dos and don'ts, and best practices are included.
As a complementary offering, Cisco also provides a packaged solution called Cisco Big Data Warehouse Expansion, which includes the software, hardware, and services required to accelerate all the activities involved in offloading data from a data warehouse to Hadoop.
More, More, and More

Most data warehouse environments use the once-in-never-out principle to store data. In most environments there is no intention to move data from the data warehouse to an archive or to remove it entirely. When new invoice, customer, or sales data is added, old data is not removed to make room. Some data warehouses contain data that is more than twenty years old and that's barely ever used.
But it's not only new data coming from existing data sources that enlarges a data warehouse. New data sources are continuously introduced as well. An organization develops a new Internet-based transaction system, acquires a new marketing system, or installs a new CRM system; all the data produced by these new systems must be copied to the data warehouse, enlarging it even further.
Besides new internal data sources, it has become very common to add data from external data sources, such as data from social media networks, open data sources, and public web services. Especially for specific forms of analytics, enriching internal data with external data can be very insightful. Again, all this external data is stored in the data warehouse environment.
All this new data and all these new data sources lead to data warehouse environments that keep growing. Several approaches exist to limit this growth, each with practical drawbacks:
- Removing Data: Periodically, delete some of the least used or the less recently used data to keep the data warehouse environment as small as possible. The drawback of this approach is that analysis of the deleted data is no longer possible, limiting, in particular, historical analysis.
- Moving Data to an Offline Archive: Periodically, move unused or little-used data to an offline data storage system in which the data is not available online. This solution keeps the data warehouse environment small. The challenge is to reanimate offline data quickly and easily for incidental analysis. A crucial question to answer is whether the offline data should be reloaded in the data warehouse, or temporarily copied to a separate database. In principle, this doesn't reduce the amount of data stored; it's just that a portion is stored outside the data warehouse.
- Offloading Data to an Inexpensive Online Data Storage System: Move a portion of the data to an online data storage system with another set of technical characteristics. Obviously, such a data storage system must support online queries, and there should be no need to reanimate the data before it can be used for analysis. Storing the data should also be less expensive. The result is a hybrid data warehouse environment in which all the stored data is distributed over different data storage technologies.

For most organizations, the third solution is preferred.
Solving the Data Warehouse Growing Pains with Hadoop

One of the newest data storage technologies is Hadoop. Hadoop has been designed to handle massive amounts of stored data, it has been optimized to process complex queries efficiently, it supports a wide range of application areas, and it has a low price/performance ratio. Especially the latter characteristic makes Hadoop an attractive data storage platform to operate side by side with the familiar SQL database technology. Because the financial and technical characteristics of Hadoop are very different from those of SQL database technologies, designers can choose the best fit for each data set. For example, if query speed is crucial, SQL may be selected, and when storage costs must be reduced, Hadoop can be chosen. In other words, designers can go for the best of both worlds.
Hiding the Hybrid Data Storage System with Data Virtualization

SQL database technology is used in almost all data warehouses to store data. When data is offloaded to Hadoop, the data warehouse environment deploys both data storage systems: Hadoop and SQL database technology. The consequence is that reports and users have to use different APIs and languages depending on where data is stored. Such APIs must be studied in detail. Especially the traditional Hadoop APIs, such as HDFS, HBase, and MapReduce, are very technical and complex. Also, in such a new situation, reports and users must know in which storage system the data they want to analyze resides. All this raises the costs of report development and maintenance. Finally, many reporting and analytical tools do not support access to Hadoop. This means that users have to learn how to work with new reporting tools and that existing reports must be redeveloped.

A data virtualization server decouples the two data storage systems from the reports, presenting one integrated data storage environment. In fact, with data virtualization the reports and users won't notice that data has been offloaded to another system. Users don't have to learn new APIs or languages and they don't have to know in which system the data resides. Data virtualization fully hides the hybrid data storage system. More on this in Section 4.
3 Overview of Hadoop
The core modules are briefly introduced here. For more extensive descriptions, we refer to Tom White's book on Hadoop [1].
- HDFS: The Hadoop Distributed File System (HDFS) forms the foundation of Hadoop. This module is responsible for storing and retrieving data. It's designed and optimized to deal with large amounts of incoming data per second and to manage enormous amounts of data, up to petabytes.
- YARN: YARN (Yet Another Resource Negotiator) is a resource manager responsible for processing all requests to HDFS correctly and for distributing resource usage correctly. Like all other resource managers, its task is to assure that the overall performance is stable and predictable.
- MapReduce: MapReduce offers a programming interface for developers to write applications that query data stored in HDFS. MapReduce can efficiently distribute query processing over hundreds of nodes. It pushes any form of processing to the data itself, and thus parallelizes the execution and minimizes data transport within the system. MapReduce has a batch-oriented style of query processing.
- HBase: The HBase module is designed for applications that need random, real-time, read/write access to data. HBase has an API consisting of operations such as insert record, get record, and update record. HBase is usually categorized as a NoSQL system [2].
- Hive: The Hive module is a so-called SQL-on-Hadoop engine and offers a SQL interface on data stored in HDFS. It uses MapReduce or HBase to access the data. In case of the former, Hive translates each SQL statement to a MapReduce job that executes the request. Hive was the first SQL-on-Hadoop engine. Nowadays, alternative products are available, including Apache Drill, Cloudera Impala, and Spark SQL.
Hadoop has the following key characteristics:

- High data storage scalability: HDFS has been designed and optimized to handle extremely large files. In real-life projects, Hadoop has repeatedly proven that it's able to store, process, analyze, and manage big data.
- High data processing scalability: Hadoop has been designed specifically to operate in highly distributed environments in which it can exploit large numbers of nodes and drives. For example, one hundred drives working at the same time can read one terabyte of data in two minutes. In addition, MapReduce processing is moved to the nodes where the data is located. There is almost no centralized component that could become a bottleneck and lead to performance degradation.
- High performance: Together with HDFS, MapReduce offers high-performance reporting and analytics. One of the features of HDFS is data replication, which makes concurrent access to the same data (on different nodes) possible.
1. White, T., Hadoop: The Definitive Guide, O'Reilly Media, 2012, third edition.
2. Redmond, E. and Wilson, J.R., Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, Pragmatic Bookshelf, 2012.
- Low price/performance ratio: Hadoop has been designed to exploit inexpensive commodity hardware. The license fees of commercial Hadoop vendors are not based on the amount of data stored.
- All data types: HDFS is a file system that can store any type of data, including weblogs, emails, or records with sensor data. In addition, functions can be developed in MapReduce that have the same complexity found in SQL statements and beyond. MapReduce is not limited to structured data; it allows applications to access any form of data. For example, complex functions can be developed to analyze text or complex weblog records. If programmed correctly, MapReduce is able to process these complex functions completely in parallel, thus distributing this complex and I/O- and resource-intensive processing over a large set of nodes.
- Fast loading: HDFS has been designed to load massive amounts of data in real time or in batch.
Summary

Especially the low price/performance ratio and the high scalability features make Hadoop an ideal platform for storing many types of offloaded data currently stored in the data warehouse environment.
The Databases of a Data Warehouse Environment

Traditionally, a data warehouse environment consists of several databases, such as a staging area, a central data warehouse, and many data marts; see Figure 2. In most environments, all these databases are implemented using SQL database server products, such as Oracle, Teradata, IBM PureData System for Analytics (formerly Netezza), or Microsoft SQL Server.
Note that most data is stored redundantly in a data warehouse environment. For example, data coming from the source systems is first stored in a staging area, then copied to a data warehouse, and from there copied to one or more data marts. In the latter case, the data is usually stored in a slightly aggregated form, but it's still redundant data.
From SQL to Hadoop

As indicated in Section 3, the low price/performance ratio and the high scalability features make Hadoop an ideal platform for storing data warehouse data. Several application areas exist to deploy Hadoop in a data warehouse environment. For example, Hadoop can be used to develop a sandbox for data scientists, or it can be used to store massive amounts of textual data. And, it can be deployed to lessen the growing pains of a data warehouse environment. By moving some of the data from the SQL databases to Hadoop, organizations can deal with the data growth more easily.
In general, there are two ways to migrate data to Hadoop: entire tables or partial tables.
Entire tables: Offloading entire tables to Hadoop implies that some of the tables in a SQL database (for example, the data warehouse itself) are moved completely to Hadoop; see Figure 3. After migrating the tables, the SQL database contains fewer tables. Reasons to offload an entire table are, for example, that the table is not used frequently, or that the data is massive and thus too expensive to store in the SQL database.
Partial tables: Offloading partial tables means that only a part of a table is offloaded. For example, for a number of tables in a SQL database a subset of the records is moved to a table in Hadoop; see Figure 4. Afterwards, the SQL database still contains all the tables it owned before; it's just that the tables don't contain all the records anymore. The rest of the records are stored in Hadoop. Reasons for offloading a partial table are, for example, that little-used or unused records in a table can be identified, or that the table is just massive.
In Figure 4, an alternative to offloading a subset of the rows is offloading a subset of the columns. This can be useful when some columns are barely ever used, or when they contain very large values, such as images and videos. In this case, moving those columns to Hadoop may be more cost-effective.
Figure 4 Offloading partial tables means that a subset of the records is moved from a SQL table to a Hadoop table.
The consequence of such a hybrid data storage system is that applications must know in which data storage system the data resides that they need, they must know which records or columns of a table reside in which storage system, they must know when data moves from one data storage system to another, and they must understand the different APIs and languages supported by the storage systems. And as already indicated, Hadoop APIs, such as HDFS, HBase, and MapReduce, are very technical and complex. Skills to work with these APIs are not commonly found in BI departments.
Also important to note is that popular reporting and analytical tools typically support SQL but not Hadoop interfaces, such as HDFS, HBase, and MapReduce. This makes it difficult and maybe even impossible to use familiar tools on the data stored in Hadoop. This leads to redevelopment of existing reports using new tools.
These problems can be solved by placing CIS between, on the one hand, all the applications and reporting tools, and, on the other hand, all the SQL databases and Hadoop; see Figure 5. This way, reports and users still see one integrated database. CIS completely hides the hybrid data storage system. With data virtualization, the location of the data, APIs, and dialect differences are fully hidden. Users and reports don't have to know where or how tables are stored, they don't have to know where records are stored,
and users can continue using their favorite reporting tools, and reports don't need to be redeveloped. In addition, CIS makes a transparent migration to a hybrid storage environment possible. The users won't notice the offloading of data to Hadoop.
Massive Fact Data

Some tables can be just massive in size. For example, tables in a data warehouse containing call detail records or sensor data may contain billions and billions of records. Purely because of the sheer size of those tables, and possibly also the ingestion rate of new data, it's recommended to move these tables to Hadoop. These massive tables are almost always fact tables.
This section describes step by step how to offload data from one or more of the SQL databases in a data warehouse environment, such as the data warehouse or a data mart, and how to move that data to Hadoop. These are the steps:
1. Identifying offloadable data
2. Installing Hadoop
3. Importing tables in CIS
4. Migrating reports to CIS
5. Creating tables in Hadoop
6. Extending views to include the Hadoop tables
7. Collecting statistical data on offloadable data
8. Initial offloading of data to Hadoop
9. Refresh of the offloaded data
10. Testing the report results
11. Adapting the backup process
We assume that all the databases making up the data warehouse environment are developed with a SQL database server, such as Oracle, Teradata, IBM PureData System for Analytics (formerly Netezza), or Microsoft SQL Server.
It all starts with identifying offloadable data. This sounds simple, but it isn't, because the identification process is not an exact science. For example, when is data cold or warm? There is always a set of records that is lukewarm. Or, how much textual data financially justifies its offloading to Hadoop? When is data really obsolete? To assist, guidelines are given in this section to identify offloadable data.
Consider offloading an entire fact table in the following situations:

- When the SQL database server is starting to have performance problems with loading new data in the fact table. Hadoop's load speed may be faster.
- When overall the queries on the fact table are showing performance problems due primarily to the number of records.
- When it's becoming too expensive to store the entire fact table in the SQL database server.
When the query workload can be characterized as interactive reporting and analysis, be careful with offloading to Hadoop, because Hadoop's support of this type of analysis leaves much to be desired. If most of the reporting is much more traditional, then consider offloading data.
When it's not an option to offload the entire table, offload a subset of the cold records; see the next topic.
Data usage can be determined with custom-developed tools or with dedicated monitors. Most SQL database servers support monitors that show data usage. Unfortunately, most of them do not show data usage on the level of detail required for analyzing data usage. Most of these tools only show usage per table or per column. In an ideal situation data usage analysis shows query usage per day; see Figure 6 as an example. In this diagram, the curved black line indicates how infrequently older records are still being used and how frequently the newer ones are. The alternative purple curve indicates another form of data usage where even older records are still being used frequently. In this case, offloading records probably doesn't make sense.
Based on the data usage results, define the crossover point (the dotted line in Figure 6). The crossover point is the age that divides the cold data from the warm data. In most cases it's easy to identify the really cold and the really warm records of a table, but there is always this group of records that is neither warm nor cold. In case of doubt, define these records as warm and don't offload them.
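As an illustration, deriving a crossover point from per-age usage counts can be sketched as follows. The usage numbers and the threshold are invented for this example; in practice the input would come from the database server's usage monitor.

```python
# Sketch of picking a crossover point from per-age usage counts.
# The usage numbers and the threshold are made up for illustration.

def crossover_point(usage_by_age, threshold):
    """Return the smallest age (in years) from which onwards every
    older age bucket is queried fewer than `threshold` times, i.e.
    the boundary between warm and cold data. Returns None if no
    bucket is cold (the purple curve in Figure 6)."""
    candidate = None
    for age in sorted(usage_by_age):     # youngest bucket first
        if usage_by_age[age] < threshold:
            if candidate is None:
                candidate = age          # possible start of the cold range
        else:
            candidate = None             # a hot outlier resets the boundary
    return candidate

# Recent data is hot, old data is cold: crossover point at three years.
usage = {1: 950, 2: 400, 3: 35, 4: 12, 5: 8}
print(crossover_point(usage, threshold=50))       # -> 3

# Even old data is used frequently: no crossover point, don't offload.
flat_usage = {1: 900, 2: 850, 3: 800, 4: 780, 5: 760}
print(crossover_point(flat_usage, threshold=50))  # -> None
```

Note how a hot bucket past the boundary (an outlier) pushes the candidate crossover point to an older age, matching the advice above to adjust the crossover point when outliers are found.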
When defining a crossover point, be careful with the following aspects:

- The crossover point is not a date, but an age.
- The crossover point may change over time. For example, initially an age of two years is identified as the crossover point, which must be changed to three years later on. The reason may be that users have started to run more forms of historical analysis, leading to a change of data usage. This may slide the crossover point to an older age (to the left in Figure 6).
- Some records that have come of age and have clearly passed the crossover point may be very hot records. These records are called outliers. Analyze whether tables contain outliers: records whose usage is significantly higher than that of records with approximately the same age. In this case, identify them and change the crossover point accordingly.
Sometimes an entire column contains cold data: data that's barely ever or never used at all. In this case, no crossover point is required. Especially offloading wide columns can be useful. Examples of wide columns are columns containing long text blocks, images, scanned documents, or videos. In case of an offloadable column, indicating the column's name is sufficient.
For obsolete records a criterion indicates the ones that are obsolete. An example of such a criterion is an indication of whether a shop has been closed. Or, all the values of a column may be obsolete when it contains characteristics of a product that are not in use anymore.
An additional advantage of moving audio and video data to Hadoop is that specialized tools may be available to analyze this data. This would enrich the organization's analytical capabilities.
To determine whether offloading is worthwhile, check the following aspects:

- Database size: When an entire SQL database is less than 1 terabyte, do not offload data at all.
- Total number of records in a table: The bigger a table is, the bigger the chance that performance problems exist with the query workload and the loading process, and the more relevant offloading can be. For example, do not offload a table with barely a hundred records. The overhead costs would be too high compared to the performance and storage advantages.
- Percentage of offloadable records: The higher the percentage of offloadable data in a table is, the higher the storage savings are and the bigger the query performance improvements are.
- Percentage of queries accessing non-offloaded data: The more queries exist that access only the remaining data, the bigger the query performance improvement will be after the offload.
- Query performance: The more queries with poor performance exist, the bigger the benefit of offloading will be. Do not offload tables with no query performance problems.
- Total amount of offloadable data: A final check is whether the total amount of offloadable data is significant. Offloading data from only one or two tables is hard to justify financially. If not enough offloadable tables are identified, the process stops.
Summary

The result of this first step is a list of tables for which offloading can be considered, plus criteria indicating the offloadable data. For cold records the criterion is identified with a crossover point, for obsolete data the criterion is a list of columns and records, and for other types of data it's an indication of the entire table or a set of columns.
For installing Hadoop we refer to the manuals and documentation of the various vendors. However, to guarantee that it has been installed and optimized properly and that replication has been set up correctly, consult Hadoop experts. Do not leave this to beginners; your data is too valuable!
No specific Hadoop implementation is recommended. There is only one requirement: to be able to work with CIS, a SQL-on-Hadoop engine must be installed. Such an engine makes the Hadoop files look like SQL tables that can be accessed using SQL. Currently, CIS supports the SQL-on-Hadoop engines Hive (versions 1 and 2) and Impala. Important to understand is that Hive and Impala share the same metadata store, which is accessible via the HCatalog interface. For example, tables created with Hive can be queried with Impala, and vice versa. So, an initial investment in one does not require a major migration when switching to the other.
Note: The market of SQL-on-Hadoop engines is currently highly volatile. New products and new features are added almost every month. Therefore, it's hard to predict what the best product will be in the long run. CIS will continue to support more SQL-on-Hadoop engines in the future.
Identify all the SQL databases in the data warehouse environment that contain offloadable data. Connect CIS to all these SQL databases. Also connect CIS to the selected SQL-on-Hadoop engine to get access to Hadoop HDFS. For all these data sources organize the correct data access privileges.
Next, import in CIS the definitions of all the tables that are being used by the reports; see Figure 7. Then, publish them all via ODBC/JDBC. The reason that all the tables (and not only the ones with offloadable data) must be imported and published is that we want CIS to handle all data access. If only tables with offloadable data were imported, the reports themselves would have to access two different data sources (the SQL database and CIS), and this seriously complicates report development.
Figure 7 All the tables (including the ones without offloadable data) accessed by the reports must be imported in CIS so that
in the new situation all the data access is handled by CIS. In this diagram, for each table T a view called V is defined.
If primary and foreign keys have been defined on the SQL tables, define the same set of keys on the views in CIS. The reason is that several reporting tools need to be aware of these keys to understand which relationships between the tables exist.
In the new environment, CIS hides the hybrid data storage system by decoupling the reports from the data stores. For this, all the reports have to be migrated. Instead of accessing the tables in the SQL databases directly, they must access those same tables via views defined in CIS; see Figure 8. These are the views defined in the previous step.
This step starts with redirecting the reports from the SQL databases to CIS. Instead of using the ODBC/JDBC driver of the SQL database, they must access data through one of CIS' ODBC/JDBC drivers. This mainly involves work on the side of the reporting tool. For example, changes may have to be made to a semantic layer or in a metadata directory. For most reports this migration will proceed effortlessly. Many of them will show exactly the same results as they did before, because they're still accessing the same tables in the same SQL databases.
There may be some minor issues. For example, it could be that the reports execute SQL statements that use proprietary features not supported by CIS. In this case, try to rewrite these SQL statements and replace the proprietary features by more standard ones. If specific scalar functions are missing, develop them using CIS itself. These functions are executed by CIS itself and not by the underlying SQL database server.
If there is no solution, use the passthrough feature of CIS. In this case, CIS receives the queries from the report and passes them unchanged to the underlying database server.
At the end of this step, the reports still access the same tables they've always accessed; it's just that now their SQL statements pass through CIS. This opens the way to transparent offloading of data. In fact, all the steps described so far can be executed on the data warehouse environment without anyone noticing it.
Note: Execute this step report by report. First, apply this step to one report before moving on to the next. This way, lessons learned when migrating one report can be used when migrating the next one.
In this step, new tables are created in Hadoop that will hold the offloaded data. Use the selected SQL-on-Hadoop engine to create them. What these Hadoop tables will look like and how they will be handled depends completely on whether an entire table, a subset of records, or a subset of columns is offloaded.
- Offloading an entire table or a subset of records of a table: For each table that must be offloaded entirely, or for which a subset of records must be offloaded, define a table in Hadoop. Each Hadoop table must have a similar table structure as the original SQL table. This means that they must all have the same set of columns, identical column names, and each column must have a similar data type.
- Offloading a subset of columns: For each table for which a subset of columns is offloaded, define a table in Hadoop that includes the primary key columns of the original table plus the columns that must be offloaded. Make sure that the same column names and similar data types are used when creating the Hadoop tables.
Currently, Hive and Impala, and most of the other SQL-on-Hadoop engines as well, don't support primary keys. The manuals state: it's the responsibility of the applications to guarantee uniqueness of the primary key columns. Over time, this will change. If the SQL-on-Hadoop engine in use does support primary keys, copy the primary key definition from the original SQL table to the Hadoop table.
If indexes can be defined using the SQL-on-Hadoop engine, consider copying the indexes from the original SQL table to the Hadoop table as well.
Because most SQL-on-Hadoop engines are relatively young, they don't support all the data types that are offered by SQL database servers. Evidently, classic data types, such as integer, character, and decimal, are supported. For each SQL-on-Hadoop engine a list of supported data types is available. For example, see the Apache documentation [3] for a detailed description of the data types supported by Hive.
Physical parameters must be set for each Hadoop table. These parameters define how data is stored and accessed. Especially important is the file format. Examples of file formats are SEQUENCEFILE, TEXTFILE, RCFILE, ORC, and AVRO. The file format has an impact on the size of the files, data load speed, query performance, etcetera. For reporting and analytics of structured data, it's recommended to use the ORC (Optimized Row Columnar) file format [4]. This is a column-oriented file format that compresses the data, reduces I/O, and speeds up query performance. It has been designed specifically for reporting and analytical workloads. For tables that primarily contain audio and video data select the SEQUENCEFILE format, and select TEXTFILE when textual data has to be stored.
The views defined in Step 3 only show data stored in the SQL databases. In this step, the views that point to SQL tables that contain offloadable data are extended with the Hadoop tables that will contain the offloaded data. The solution depends on whether a subset of records or the entire table is offloaded, or whether a subset of columns is offloaded.

3. Apache, Hive Data Types, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
4. Apache, ORC Files, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Offload an Entire Table or a Subset of Records

Redefine each view in such a way that it includes the equivalent Hadoop table. The old definition containing only the SQL table:

    SELECT   *
    FROM     SQL_TABLE
The new definition after extending it with the Hadoop table:

    SELECT   *
    FROM     SQL_TABLE
    UNION ALL
    SELECT   *
    FROM     HADOOP_TABLE
The table HADOOP_TABLE refers to the table defined with the SQL-on-Hadoop engine and SQL_TABLE to the equivalent table in the SQL database. The UNION operator is used to combine the two tables. If possible, use the UNION ALL operator when combining the two tables (as in the example above). UNION requires that duplicate rows are removed from the result. Whether there are duplicate rows or not, this requires processing. If primary keys have been defined on the tables, UNION ALL is recommended instead, because this operator doesn't require removal of duplicate rows, which speeds up query processing. UNION ALL can be used in this situation, because no duplicate rows exist.

Note that the definition of this view is changed again in Step 8. When all the data has been offloaded and SQL_TABLE has become empty, it can be removed from the definition.
The result of extending these views is presented in Figure 9. Here, the view called V2 is defined on the SQL table T2 combined with its Hadoop counterpart. For the reports that access V2 nothing has changed. The reports still return the same results.
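The behavior of such a UNION ALL view can be tried out locally. The sketch below uses Python's sqlite3 module, with two ordinary SQLite tables standing in for the SQL database table and the Hadoop table; CIS and the SQL-on-Hadoop engine are not involved, and all names are illustrative.

```python
import sqlite3

# Stand-ins for the two stores: in reality SQL_TABLE lives in the SQL
# database and HADOOP_TABLE behind the SQL-on-Hadoop engine; here both
# are plain SQLite tables so the UNION ALL view can be demonstrated.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SQL_TABLE    (ID INTEGER PRIMARY KEY, AMOUNT INTEGER);
    CREATE TABLE HADOOP_TABLE (ID INTEGER PRIMARY KEY, AMOUNT INTEGER);
    INSERT INTO SQL_TABLE    VALUES (3, 30), (4, 40);  -- warm records
    INSERT INTO HADOOP_TABLE VALUES (1, 10), (2, 20);  -- offloaded cold records
    -- The view the reports access: UNION ALL is safe because each
    -- record lives in exactly one of the two tables, so no duplicates.
    CREATE VIEW V AS
        SELECT * FROM SQL_TABLE
        UNION ALL
        SELECT * FROM HADOOP_TABLE;
""")
rows = con.execute("SELECT ID, AMOUNT FROM V ORDER BY ID").fetchall()
print(rows)   # -> [(1, 10), (2, 20), (3, 30), (4, 40)]
```

The reports see all four records through V, regardless of which store each record physically resides in.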
Offload a Subset of Columns

Redefine each view so that it joins the SQL table with the Hadoop table that holds the offloaded columns:

    SELECT   *
    FROM     SQL_TABLE LEFT OUTER JOIN HADOOP_TABLE
             ON (SQL_TABLE.PRIMARY_KEY_COLUMNS = HADOOP_TABLE.PRIMARY_KEY_COLUMNS)

With this definition, the offloaded columns are (virtually) added to the SQL_TABLE using a join.
Note that in this definition a left outer join is used. An inner join can be used (which may be faster) if for each row in the SQL table one row exists in the Hadoop table. If the columns to be offloaded contain many null values, it's recommended to insert a row in the Hadoop table only for each row in the SQL table that contains values in those columns. And in that case, a left outer join is required.
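The join-based view can be sketched the same way with SQLite stand-ins. In this hypothetical example the wide DOCUMENT column has been offloaded, and rows whose offloaded value was null were not copied to the Hadoop table, which is why the left outer join is needed.

```python
import sqlite3

# Stand-ins for the SQL table (hot columns) and the Hadoop table
# (the offloaded wide column plus the primary key); names illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SQL_TABLE    (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE HADOOP_TABLE (ID INTEGER PRIMARY KEY, DOCUMENT TEXT);
    INSERT INTO SQL_TABLE VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- Row 2 had a NULL in the offloaded column, so it was not copied.
    INSERT INTO HADOOP_TABLE VALUES (1, 'doc-1'), (3, 'doc-3');
    -- The left outer join keeps every SQL row and virtually re-attaches
    -- the offloaded column; missing rows come back as NULL again.
    CREATE VIEW V AS
        SELECT s.ID, s.NAME, h.DOCUMENT
        FROM   SQL_TABLE s LEFT OUTER JOIN HADOOP_TABLE h
               ON s.ID = h.ID;
""")
rows = con.execute("SELECT * FROM V ORDER BY ID").fetchall()
print(rows)   # -> [(1, 'a', 'doc-1'), (2, 'b', None), (3, 'c', 'doc-3')]
```

An inner join would silently drop row 2 here, which is exactly why the text above recommends the left outer join when offloaded columns contain nulls.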
Remarks

Make sure that in CIS the statistical information on the Hadoop tables has been updated. This is important, because CIS must know that the Hadoop tables are empty and that they don't have to be accessed. If for some reason CIS accesses Hadoop (unnecessarily), then add a fake condition to the view definition that tells CIS not to access it; see the last line of code in this example:
    SELECT   *
    FROM     SQL_TABLE
    UNION ALL
    SELECT   *
    FROM     HADOOP_TABLE
    WHERE    1 = 2
Note: This step has no impact on the reporting results. When accessing the new versions of these views, the results should still be the same as before this step, because the Hadoop tables contain no data yet.
Before offloadable data is migrated, it's important that statistical data on that data is collected. This statistical data is required to test the report results afterwards; see Step 10.
The more statistical data is derived, the better, but at least collect the following statistical data on the tables containing offloadable data:

- For each table determine the number of offloadable records. This is necessary to determine if all of the records have been migrated correctly.
- For each table determine the lowest and highest key value of that data.
- For each numeric column determine the sum and the average value (with a high precision) of all the values. This is necessary to determine whether the numeric values have been copied correctly and that there have been no problems with numeric data type conversions.
- For each alphanumeric column determine the average value (high precision) of the lengths of each string value. Remove the trailing blanks before the length is calculated.
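The checks above can be gathered with a single aggregate query. A sketch against an illustrative SQLite table (table and column names are invented; the same query would run against the real SQL table and, after the offload, against the Hadoop table):

```python
import sqlite3

# Illustrative table with one key, one numeric and one string column.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SALES (ID INTEGER PRIMARY KEY, AMOUNT REAL, REGION TEXT);
    INSERT INTO SALES VALUES (1, 10.5, 'north '), (2, 20.0, 'south'),
                             (3, 30.5, 'east  ');
""")
stats = con.execute("""
    SELECT COUNT(*),                   -- number of offloadable records
           MIN(ID), MAX(ID),           -- lowest and highest key value
           SUM(AMOUNT), AVG(AMOUNT),   -- checks for the numeric column
           AVG(LENGTH(RTRIM(REGION)))  -- avg string length, trailing blanks removed
    FROM   SALES
""").fetchone()
print(stats)
```

Running the identical query before and after the migration, and comparing the tuples, flags lost records as well as silent numeric or string conversion problems.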
The initial offloading of data stored in the SQL tables to the Hadoop tables requires a solution that extracts data from the SQL tables, loads it into the Hadoop tables, and removes it from the SQL tables. Such a solution can be developed in several ways and several technologies are available, such as CIS itself, dedicated ETL technology, Hadoop's own ETL tool called Sqoop, or an internally developed solution. The choice primarily depends on the amount of data to be offloaded.
Regardless of the solution, do not spend too much development time on this step, because some parts of the solution will be used only once. Therefore, even if the amount of data is considerable, go for the simplest solution, even if this means that the migration takes some time. It doesn't make sense to spend four days on development to improve the performance of the migration solution from two hours to two minutes.
Note that the views don't have to be redefined after this step, because they already access the offloaded columns from Hadoop.
1) Copy all the data from the SQL table to the Hadoop table:

    INSERT INTO HADOOP_TABLE
    SELECT   *
    FROM     SQL_TABLE
2) Drop the entire SQL table.
3) Redefine the view, because there is no need to access the SQL table anymore. The new definition becomes:

    SELECT   *
    FROM     HADOOP_TABLE
4) Change the current ETL process that loads data into the SQL_TABLE. It must be redirected to load data straight into the HADOOP_TABLE instead.
1) Copy the offloadable records from the SQL table to the Hadoop table. The WHERE clause contains a condition that identifies the subset of records to be offloaded:

    INSERT INTO HADOOP_TABLE
    SELECT   *
    FROM     SQL_TABLE
    WHERE    CHARACTERISTIC = 'XYZ'
2) Develop the logic to delete offloadable data from the SQL tables:

    DELETE
    FROM     SQL_TABLE
    WHERE    CHARACTERISTIC = 'XYZ'
Note that copying an integer or string value from one system to another is straightforward. This does not apply to video and audio data. Usually, this is not just a matter of copying the audio from one system to another. It may well be that a dedicated program has to be written to extract an audio value from the SQL table and to store it in a Hadoop table, and for both, dedicated logic has to be written: logic specific for that platform.
1) Copy the cold records from the SQL table to the Hadoop table:

    INSERT INTO HADOOP_TABLE
    SELECT   *
    FROM     SQL_TABLE
    WHERE    DATE_RECORD < CURRENT_DATE - 3 YEARS
The table HADOOP_TABLE refers to the table defined with the SQL-on-Hadoop engine and SQL_TABLE to the equivalent table in the SQL database. This example relates to cold data; therefore the condition in the WHERE clause indicates a crossover point of three years: data older than three years is moved to Hadoop.
2) Develop the logic to delete offloadable data from the SQL tables:

DELETE FROM SQL_TABLE
WHERE DATE_RECORD < CURRENT_DATE - 3 YEARS
1) Develop the logic to copy the primary key and the offloaded columns to the Hadoop tables:

INSERT INTO HADOOP_TABLE
SELECT PRIMARY_KEY, OFFLOADED_COLUMN1, OFFLOADED_COLUMN2, ...
FROM SQL_TABLE
This statement must be expanded with the following WHERE clause if rows in which all the offloaded columns contain a null value should not be copied:

INSERT INTO HADOOP_TABLE
SELECT PRIMARY_KEY, OFFLOADED_COLUMN1, OFFLOADED_COLUMN2, ...
FROM SQL_TABLE
WHERE NOT(OFFLOADED_COLUMN1 IS NULL AND OFFLOADED_COLUMN2 IS NULL AND ...)
2) Remove the offloaded columns from the SQL table.
3) Change the current ETL process that loads data into the SQL_TABLE. Data for the offloaded columns must go straight into HADOOP_TABLE instead.
Even when ETL tools are used, it may take a considerable amount of time to copy the data. In fact, it may even take days. Because offloading must be done offline (otherwise, reports will show incorrect results), it may be necessary to offload the data in batches. For example, every night one month of cold data is copied, or all the transactions of a region are copied. Because the views accessed by the users include both tables, report results will not be impacted. After each night of offloading, CIS extracts more data from Hadoop and less from the SQL database server. If such a strategy is selected and possible, copy the oldest data first, and work your way to the more current data.
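The nightly batch strategy can be sketched as follows; this is an illustration only, again with SQLite standing in for both stores and hypothetical table and column names:

```python
import sqlite3

def offload_one_batch(sql_db, hadoop_db):
    """Copy the oldest remaining month of cold data to Hadoop, then delete it."""
    # Oldest month still in the SQL table (dates stored as 'YYYY-MM-DD' strings).
    month = sql_db.execute(
        "SELECT MIN(substr(date_record, 1, 7)) FROM sql_table").fetchone()[0]
    if month is None:
        return None  # nothing left to offload
    rows = sql_db.execute(
        "SELECT pk, date_record FROM sql_table WHERE substr(date_record, 1, 7) = ?",
        (month,)).fetchall()
    hadoop_db.executemany("INSERT INTO hadoop_table VALUES (?, ?)", rows)
    hadoop_db.commit()
    # Delete only after the copy has succeeded.
    sql_db.execute("DELETE FROM sql_table WHERE substr(date_record, 1, 7) = ?",
                   (month,))
    sql_db.commit()
    return month
```

Each run moves the oldest month first, so repeated nightly runs work from old toward current data, and the views that combine both tables keep the report results intact.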
Potential differences:
SQL language differences: Many vendors of SQL database products have added their own proprietary SQL constructs and statements to the language. For example, DB2 supports the MERGE statement that most others don't.
SQL processing differences: Some constructs are processed differently by different vendors.
Function differences: Many SQL functions are supported by all SQL products, but each product has its own proprietary functions. For example, some have added functions for manipulating XML documents, and others for statistical analysis.
Function processing differences: Some SQL functions are processed differently by various products. Especially datetime-related functions are notorious for their differences.
Data type differences: Some SQL databases support special data types. For example, a few have data types for geographical analysis, and some allow development of special data types.
Data type processing differences: Data types may be named the same, but that doesn't mean their behavior is the same. For example, the maximum time DB2 can store in a time column is 24:00:00, while it's 23:59:59 for Oracle. This can lead to different results.
Null value processing differences: The null value is not processed in exactly the same way by all SQL products.
CIS handles many of these differences. Nevertheless, these differences can lead to SQL queries that cannot be executed, or that return different results.
For many forms of offloading, after the initial offload the work is done. For example, when an entire table has been offloaded, from then on new data is added to Hadoop. The same is true for obsolete data that has to be offloaded just once. But this is not true for, for example, warm data, which eventually becomes cold data and must be moved to Hadoop. Therefore, changes must be made to the ETL logic that feeds the data warehouse environment with new data.
Periodically, data that has turned cold must be moved. Schedule this solution to run in sync with the crossover point of the cold data. For example, if the crossover point is 36 months, the migration must be executed every month, and if the crossover point is 8 quarters, it must be executed every quarter.
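For example, the cutoff date that corresponds to a crossover point of a given number of months can be computed as follows; this is an illustrative helper, not part of CIS or Hadoop:

```python
from datetime import date

def crossover_date(today: date, months: int) -> date:
    """The date `months` months before `today`; rows older than this are cold.

    The day is clamped to 28 so the result is always a valid calendar date.
    """
    total = today.year * 12 + (today.month - 1) - months
    return date(total // 12, total % 12 + 1, min(today.day, 28))
```

With a crossover point of 36 months, crossover_date(date(2014, 6, 15), 36) yields 2011-06-15; the monthly migration then moves all rows whose DATE_RECORD lies before that date.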
Most of the development work will deal with the ETL logic, and not with what has been defined in Hadoop or CIS.
When data has been migrated to Hadoop, and before it's made available to the users, it's important that extensive tests are run to determine whether all the reports still return identical results. The report results of the new situation must be 100% identical to the previous situation. We recommend a two-step testing approach in which the data quality is tested first and then the reports.
For the first step the statistical data calculated in Step 7 is needed. Run the same queries again that were used to calculate that statistical data and check whether all the numbers are still identical. Differences can arise for various reasons: differences exist between data types, the way nulls are handled, how queries are processed, and how functions are executed. For example, some SQL products return a truncated integer when an integer is divided by another integer value, while others return a decimal. Evidently, this leads to different report results.
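This difference is easy to demonstrate. SQLite, for instance, applies integer division when both operands are integers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integer operands: SQLite truncates the quotient.
int_quotient = conn.execute("SELECT 7 / 2").fetchone()[0]
# A decimal operand: the same division returns a decimal result.
dec_quotient = conn.execute("SELECT 7.0 / 2").fetchone()[0]
print(int_quotient, dec_quotient)  # 3 3.5
```

Running the Step 7 queries on both platforms flushes out exactly these kinds of silent differences before users see them.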
When data quality has been tested, the real reports must be run on the new hybrid system. This is primarily a visual test. Compare the report results of the new situation with those developed for the old situation.
In most data warehouse environments, the backup process of the SQL databases has already been organized. Periodically, backups are created in case a restore is required. A restore may be needed due to a hardware failure (data corruption on disk, disk or node crash, rack failure), a user error (corrupted data writes, accidental or malicious data deletion), or a site failure due to, for example, fire.
Now that data is offloaded to Hadoop, a separate backup process must be developed for that portion of the data. Hadoop HDFS supports data replication. By replicating data, it's relatively easy to handle hardware failures. Configure Hadoop in such a way that at least three replicas are stored. Be sure that the first two replicas are on different hosts and the third replica is on a different rack. In addition, don't forget to back up the configuration files and the NameNode metadata.
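The replication factor, for example, is set in hdfs-site.xml; three replicas per block is also the HDFS default:

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```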
In the case of cold data, new data is only added to the Hadoop database when data has come of age, so a periodic incremental backup is recommended.
The order of doing a backup is as follows:
1. Load new data into the SQL database
2. Offload cold data from the SQL database
3. Do the backup of the SQL database
4. Load cold data into Hadoop
5. Do the (incremental) backup of Hadoop
If things do go wrong and a restore is required, it's important that the restore processes for the SQL database and Hadoop are synchronized. At no point in time may a record appear in both databases, because that would instantly lead to incorrect report results.
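After a restore, a simple consistency check is to verify that no primary key appears in both stores. A sketch, once more with SQLite standing in for both databases and hypothetical table names:

```python
import sqlite3

def overlapping_keys(sql_db, hadoop_db):
    """Primary keys present in both stores; must be empty after a restore."""
    sql_keys = {r[0] for r in sql_db.execute("SELECT pk FROM sql_table")}
    hadoop_keys = {r[0] for r in hadoop_db.execute("SELECT pk FROM hadoop_table")}
    return sorted(sql_keys & hadoop_keys)
```

Running this check after every synchronized restore confirms that the two restore processes did not re-introduce rows on both sides.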
7 Getting Started
We recommend deploying a step-by-step, iterative process when implementing a hybrid data warehouse environment in which offloaded data is stored in Hadoop. Do not execute the steps described in the previous section one by one for all tables that contain offloadable data. Instead, work table by table. First, identify the table with the largest amount of offloadable data. Execute all the steps to offload this data. Check if everything works, and only if it does, move on to the second table.
When everything works, that is, when data has been offloaded and the reports return identical results, it's time to exploit the features of CIS more extensively. After step 11, only its abstraction and data federation capabilities are used. But CIS can do more. For example, when many report definitions contain the same data-related definitions, extract them from the reports and define them in CIS views. Or, when data security rules are defined in the reports, define them in CIS. The more specifications are extracted from the reports and implemented in CIS, the easier they are to define and maintain, and the more consistent the reporting results.
Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.
Rick is chairman of the European Enterprise Data and Business Intelligence Conference, organized annually in London. He writes for SearchBusinessAnalytics.Techtarget.com, BeyeNetwork.com5 and other websites. He introduced the business intelligence architecture called the Data Delivery Platform in 2009 in a number of articles6 all published at BeyeNetwork.com. The Data Delivery Platform is an architecture based on data virtualization.
He has written several books on SQL. Published in 1987, his popular Introduction to SQL7 was the first English book on the market devoted entirely to SQL. After more than twenty-five years, this book is still being sold, and has been translated into several languages, including Chinese, German, and Italian. His latest book8, Data Virtualization for Business Intelligence Systems, was published in 2012.
For more information please visit www.r20.nl, or email rick@r20.nl. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.
Cisco (NASDAQ: CSCO) is the worldwide leader in IT that helps companies seize the opportunities of tomorrow by proving that amazing things can happen when you connect the previously unconnected. Cisco Information Server is agile data virtualization software that makes it easy for companies to access business data across the network as if it were in a single place.
For more information, please visit www.cisco.com/go/datavirtualization.
5 See http://www.beyenetwork.com/channels/5087/articles/
6 See http://www.beyenetwork.com/channels/5087/view/12495
7 R.F. van der Lans, Introduction to SQL; Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.
8 R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.