
The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics

A Kimball Group White Paper
By Ralph Kimball









Table of Contents

Executive Summary
About the Author
Introduction
Data is an asset on the balance sheet
Raising the curtain on big data analytics
Use cases for big data analytics
Making sense of big data analytic use cases
Big data analytics system requirements
Extended relational database management systems
MapReduce/Hadoop systems
How MapReduce works in Hadoop
Tools for the Hadoop environment
Feature convergence in the coming decade
Reusable analytics
Complex event processing (CEP)
Data warehouse cultural changes in the coming decade
Sandboxes
Low latency
Continuous thirst for more exquisite detail
Light touch data waits for its relevance to be exposed
Simple analysis of all the data trumps sophisticated analysis of some of the data
Data structures should be declared at query time, not at data load time
The EDW supporting big data analytics must be magnetic, agile, and deep
The conflict between abstraction and control
Data warehouse organization changes in the coming decade
Technical skill sets required
New organizations required
New development paradigms required
Lessons from the early data warehousing era
Analytics in the cloud
Whither EDW?
Acknowledgements
References













Executive Summary
In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of "big data." We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi- and un-structured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what design elements are needed to support these new requirements.

About the Author
Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the data warehouse/business intelligence (DW/BI) industry's thought leader on the dimensional approach and has trained more than 10,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox's Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University.

The Kimball Group is the source for dimensional DW/BI consulting and education, consistent with our best-selling Toolkit book series, Design Tips, and award-winning articles. Visit www.kimballgroup.com for more information.




Introduction
What is big data? Its bigness is actually not the most interesting characteristic. Big data is structured, semistructured, unstructured, and raw data in many different formats, in some cases looking totally different than the clean scalar numbers and text we have stored in our data warehouses for the last 30 years. Much big data cannot be analyzed with anything that looks like SQL. But most important, big data is a paradigm shift in how we think about data assets: where we collect them, how we analyze them, and how we monetize the insights from the analysis. The big data revolution is about finding new value within and outside conventional data sources. An additional approach is needed because the software and hardware environments of the past have not been able to capture, manage, or process the new forms of data within reasonable development times or processing times. We are challenged to reorganize our information management landscape to extend a remarkably stable and successful EDW architecture to this new era of big data analytics.

In reading this white paper please bear in mind that the consistent view of this author has always been that the "data warehouse" comprises the complete ecosystem for extracting, cleaning, integrating and delivering data to decision makers, and therefore includes the extract-transform-load (ETL) and business intelligence (BI) functions considered as outside of the data warehouse by more conservative writers. This author has always taken the view that data warehousing has a very comprehensive role in capturing all forms of enterprise data, and then preparing that data for the most effective use by decision-makers all across the enterprise. This white paper takes the aggressive view that the enterprise data warehouse is on the verge of a very exciting new set of responsibilities. The scope of the EDW will increase dramatically.

Also, in this white paper, although we consistently use the term ETL to describe the movement of data within the enterprise data warehouse, the conventional use of this term does not do justice to the much larger responsibility of moving data across networks and between systems and between profoundly different processes in the world of big data analytics. ETL is a portion of a much larger technology called data integration (DI). Since we have used ETL consistently in our books and classes for many years, we will keep that terminology in this paper, bearing in mind that ETL is meant in the larger sense of DI.

This white paper stands back from the marketplace as it exists in early 2011 to highlight the clearly emerging new trends brought by the big data revolution. And a revolution it is. As James Markarian, Informatica's Executive Vice President and Chief Technology Officer, remarked: "the database market has finally gotten interesting again." Because many of the new big data tools and approaches are version 1 or even version 0 developments, the landscape will continue to change rapidly. However, there is growing awareness in the marketplace that new kinds of analysis are possible and that key competitors, especially e-commerce enterprises, are already taking advantage of the new paradigm. This white paper is intended to be a guide to help business intelligence, data warehousing and information management professionals and management teams understand and prepare for big data as a complementary extension to their current EDW architecture.

Data is an asset on the balance sheet
Enterprises increasingly recognize that data itself is an asset that should appear on the balance sheet in the same way that traditional assets from the manufacturing age such as equipment and land have always appeared. There are several ways to determine the value of the data asset, including:
• cost to produce the data
• cost to replace the data if it is lost
• revenue or profit opportunity provided by the data
• revenue or profit loss if data falls into competitors' hands
• legal exposure from fines and lawsuits if data is exposed to the wrong parties

But more important than the data itself, enterprises have shown that insights from data can be monetized. When an e-commerce site detects an increase in favorable click throughs from an experimental ad treatment, that insight can be taken to the bottom line immediately. This direct cause-and-effect is easily understood by management, and an analytic research group that consistently demonstrates these insights is looked upon as a strategic resource for the enterprise by the highest levels of management. This growth in business awareness of the value of data-driven insights is rapidly spreading outward from the e-commerce world to virtually every business segment.

Data warehousing, of course, has been demonstrating the value of data-driven insights for at least 20 years. But until quite recently data warehousing has been focused on historical transaction data. During the past decade from 2000 to 2009, three major seismic shifts occurred in data warehousing. The first, early in the decade, was the decisive introduction of low latency operational data into the data warehouse together with the existing historical data. Of course, many of these new operational data use cases benefited from real-time data, in some cases demanding instantaneous delivery. The second seismic shift, growing steadily throughout the decade, was the gathering of customer behavior data, which not only included traditional transactions such as purchases and click throughs but added huge volumes of "subtransactions" that represented measurable events leading up to the transactions themselves. For example, all the web page events a customer engaged in prior to the final transaction event became a record of customer behavior. "Good paths" through these web page event histories gave lots of insight into productive (i.e., monetizable) customer behavior.

The third seismic shift, which is gathering enormous momentum as we transition into the current decade, is the extraction of product preferences and customer sentiments from social media, especially the massive quantities of machine-generated unstructured data produced by the new business paradigms of dot-com companies. It is this final seismic shift that has pushed many enterprises into looking seriously at unstructured data for the first time, and asking "how on earth do we analyze this stuff?" The point here is not that unstructured data is some new thing recently discovered, but rather that the analysis of unstructured data has gone mainstream just recently.

Raising the curtain on big data analytics
Use cases for big data analytics
Big data analytics use cases are spreading like wildfire. Here is a set of use cases reported recently, including a benchmark set of "Hadoop-able" use cases proposed by Jeff Hammerbacher, Chief Scientist for Cloudera. Following these brief descriptions is a table summarizing the salient structure and processing characteristics of each use case. Note that none of these use cases can be satisfied with scalar numeric data, nor can any be properly analyzed by simple SQL statements. All of them can be scaled into the petabyte range and beyond with appropriate business assumptions.

Search ranking. All search engines attempt to rank the relevance of a web page to a search request against all other possible web pages. Google's PageRank algorithm is, of course, the poster child for this use case.

Ad tracking. E-commerce sites typically record an enormous river of data including every page event in every user session. This allows for very short turnaround of experiments in ad placement, color, size, wording, and other features. When an experiment shows that such a feature change in an ad results in improved click through behavior, the change can be implemented virtually in real time.

Location and proximity tracking. Many use cases add precise GPS location tracking, together with frequent updates, in operational applications, security analysis, navigation, and social media. Precise location tracking opens the door for an enormous ocean of data about other locations near the GPS measurement. These other locations may represent opportunities for sales or services.

Causal factor discovery. Point-of-sale data has long been able to show us when the sales of a product go sharply up or down. But searching for the causal factors that explain these deviations has been, at best, a guessing game or an art form. The answers may be found in competitive pricing data, competitive promotional data including print and television media, weather, holidays, national events including disasters, and virally spread opinions found in social media. See the next use case as well.

Social CRM. This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has described a very useful set of key performance indicators for social CRM that include share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact. The calculation of these KPIs involves in-depth trolling of a huge array of data sources, especially unstructured social media.













Document similarity testing. Two documents can be compared to derive a metric of similarity. There is a large body of academic research and tested algorithms, for example latent semantic analysis, that is just now finding its way to driving monetized insights of interest to big data practitioners. For example, a single source document can be used as a kind of multifaceted template to compare against a large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion polls. For example: "find all the documents that agree with my source document on global warming."
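
As a rough illustration of comparing one source document against many targets, the sketch below ranks targets by cosine similarity of simple word counts. This is a deliberately simpler stand-in for latent semantic analysis, not an implementation of it, and it measures shared vocabulary rather than agreement of opinion. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word occurrences (a crude bag-of-words vector)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors, from 0.0 to 1.0."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(source, targets):
    """Return (score, target) pairs, most similar to the source first."""
    src = term_counts(source)
    scored = [(cosine_similarity(src, term_counts(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

At big data scale the same comparison would be run as a distributed scan over the target set; the scoring function is the part that stays the same.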

Genomics analysis: e.g., commercial seed gene sequencing. A few months ago the cotton research community was thrilled by a genome sequencing announcement that stated in part "The sequence will serve a critical role as the reference for future assembly of the larger cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will open the way for more rapid breeding for higher yield, better fiber quality and adaptation to environmental stresses and for insect and disease resistance. Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing the sequence, identifying genes and gene families and determining the future directions of research." (SeedQuest, Sept 22, 2010). This use case is just one example of a whole industry that is being formed to address genomics analysis broadly, beyond this example of seed gene sequencing.

Discovery of customer cohort groups. Customer cohort groups are used by many enterprises to identify common demographic trends and behavior histories. We are all familiar with Amazon's cohort groups when they say "other customers who bought the same book as you have also bought the following books." Of course, if you can sell your product or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups are represented logically and graphically as links, and much of the analysis of cohort groups involves specialized link analysis algorithms.
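
One very simple form of link analysis is to count how often two items are bought by the same customer and then follow the strongest links. The sketch below assumes purchases arrive as a mapping from a customer ID to a list of items; the data layout and function names are illustrative assumptions, not a description of any retailer's actual method.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_links(baskets):
    """Count, for every pair of items, how many customers bought both."""
    links = defaultdict(int)
    for items in baskets.values():
        # sorted(set(...)) gives each unordered pair a single canonical key
        for a, b in combinations(sorted(set(items)), 2):
            links[(a, b)] += 1
    return links


def also_bought(item, baskets, top_n=3):
    """Items most often bought by customers who also bought `item`."""
    scores = defaultdict(int)
    for (a, b), weight in co_purchase_links(baskets).items():
        if a == item:
            scores[b] += weight
        elif b == item:
            scores[a] += weight
    # Strongest link first; ties broken alphabetically for a stable result
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]
```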

In-flight aircraft status. This use case as well as the following two use cases are made possible by the introduction of sensor technology everywhere. In the case of aircraft systems, the in-flight status of hundreds of variables on engines, fuel systems, hydraulics, and electrical systems is measured and transmitted every few milliseconds. The value of this use case is not just the engineering telemetry data that could be analyzed at some future point in time, but also the ability to drive real-time adaptive control, fuel usage, part failure prediction, and pilot notification.

Smart utility meters. It didn't take long for utility companies to figure out that a smart meter can be used for more than just the monthly readout that produces the customer's utility bill. By drastically cranking up the frequency of the readouts to as much as one readout per second per meter across the entire customer landscape, many useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing, and longer-term strategies for incenting customers to utilize the utility more effectively (either from the customer's point of view or the utility's point of view!).













Building sensors. Modern industrial buildings and high-rises are being fitted with thousands of small sensors to detect temperature, humidity, vibration, and noise. Like the smart utility meters, collecting this data every few seconds 24 hours per day allows many forms of analysis including energy usage, unusual problems including security violations, component failure in air-conditioning and heating systems and plumbing systems, and the development of construction practices and pricing strategies.

Satellite image comparison. Images of regions of the earth are captured on every pass of certain satellites, at intervals typically separated by a small number of days. Overlaying these images and computing the differences allows the creation of hot spot maps showing what has changed. This analysis can identify construction, destruction, changes due to disasters like hurricanes and earthquakes and fires, and the spread of human encroachment.
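
The overlay-and-difference step can be sketched in a few lines. The example assumes two already aligned grayscale images of the same size, held as nested lists of brightness values; the threshold is an arbitrary illustrative parameter, and real pipelines would first need to register the images and correct for lighting.

```python
def change_mask(before, after, threshold):
    """Mark each pixel whose brightness changed by more than `threshold`."""
    same_shape = len(before) == len(after) and all(
        len(r0) == len(r1) for r0, r1 in zip(before, after)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        [abs(p1 - p0) > threshold for p0, p1 in zip(row0, row1)]
        for row0, row1 in zip(before, after)
    ]


def changed_fraction(mask):
    """Share of pixels flagged as changed, between 0.0 and 1.0."""
    total = sum(len(row) for row in mask)
    return sum(cell for row in mask for cell in row) / total if total else 0.0
```

The resulting mask is the raw material for a hot spot map: regions with a high density of flagged pixels are the candidates for closer inspection.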

CAT scan comparisons. CAT scans are stacks of images taken as "slices" of the human body. Large libraries of CAT scans can be analyzed to facilitate the automatic diagnosis of medical issues and their prevalence.

Financial account fraud detection and intervention. Account fraud, of course, has immediate and obvious financial impact. In many cases fraud can be detected by patterns of account behavior, in some cases crossing multiple financial systems. For example, "check kiting" requires the rapid transfer of money back and forth between two separate accounts. Certain forms of broker fraud involve two conspiring brokers selling a security back-and-forth at ever increasing prices, until an unsuspecting third party enters the action by buying the security, allowing the fraudulent brokers to quickly exit. Again, this behavior may take place across two separate exchanges in a short period of time.
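
To make the "rapid back and forth" pattern concrete, here is a minimal sketch that flags pairs of accounts whose transfers repeatedly reverse direction within a short time window. The tuple layout, the window, and the reversal threshold are all illustrative assumptions; production fraud detection uses far richer signals.

```python
from collections import defaultdict


def kiting_suspects(transfers, window, min_reversals):
    """Flag account pairs with repeated direction reversals inside `window`.

    `transfers` is an iterable of (timestamp, source, destination) tuples,
    with timestamps in any numeric unit shared with `window`.
    """
    by_pair = defaultdict(list)
    for ts, src, dst in sorted(transfers):
        by_pair[frozenset((src, dst))].append((ts, src))

    suspects = {}
    for pair, events in by_pair.items():
        if len(pair) < 2:  # ignore transfers from an account to itself
            continue
        reversals = 0
        for (t0, s0), (t1, s1) in zip(events, events[1:]):
            # A reversal: money flows one way, then quickly back the other way
            if s0 != s1 and t1 - t0 <= window:
                reversals += 1
        if reversals >= min_reversals:
            suspects[tuple(sorted(pair))] = reversals
    return suspects
```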

Computer system hacking detection and intervention. System hacking in many cases involves an unusual entry mode or some other kind of behavior that in retrospect is a smoking gun but may be hard to detect in real time.

Online game gesture tracking. Online game companies typically record every click and maneuver by every player at the most fine grained level. This avalanche of "telemetry data" allows fraud detection, intervention for a player who is getting consistently defeated (and therefore discouraged), offers of additional features or game goals for players who are about to finish a game and depart, ideas for new game features, and experiments for new features in the games. This can be generalized to television viewing. Your DVR box can capture remote control keystrokes, recording events, playback events, picture-in-picture viewing, and the context of the guide. All of this can be sent back to your provider.

Big science including atom smashers, weather analysis, space probe telemetry feeds. Major scientific projects have always collected a lot of data, but now the techniques of big data analytics are allowing broader access and much more timely access to the data. Big science data, of course, is a mixture of all forms of data: scalar, vector, complex structures, analog wave forms, and images.












"Data bag" exploration. There are many situations in commercial environments and in the research communities where large volumes of raw data are collected. One example might be data collected about structure fires. Beyond the predictable dimensions of time, place, primary cause of fire, and responding firefighters, there may be a wealth of unpredictable anecdotal data that at best can be modeled as a disorderly collection of name value pairs, such as "contributing weather = lightning." Another example would be the listing of all relevant financial assets for a defendant in a lawsuit. Again such a list is likely to be a disorderly collection of name value pairs, such as "shared real estate ownership = condominium." The list of examples like this is endless. What they have in common is the need to encapsulate the disorderly collection of name value pairs, which is generally known as a "data bag." Complex data bags may contain both name value pairs and embedded sub data bags. The challenge in this use case is to find a common way to approach the analysis of data bags when the content of the data may need to be discovered after the data is loaded.
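
A data bag with embedded sub data bags maps naturally onto a nested dictionary. The sketch below, under that assumption, flattens each bag into path/value pairs and then tallies which names actually occur, which is one simple way to discover content after loading. The dotted-path convention and the function names are hypothetical.

```python
def flatten_bag(bag, prefix=""):
    """Yield (path, value) pairs from a data bag with arbitrarily nested sub-bags."""
    for name, value in bag.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # An embedded sub data bag: recurse, extending the dotted path
            yield from flatten_bag(value, prefix=path + ".")
        else:
            yield path, value


def discover_names(bags):
    """Count how often each name appears across bags loaded without a schema."""
    counts = {}
    for bag in bags:
        for path, _ in flatten_bag(bag):
            counts[path] = counts.get(path, 0) + 1
    return counts
```

Names that turn up frequently are candidates for promotion to regular columns; rare ones can stay in the bag.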

The final two use cases are old and venerable examples that even predate data warehousing itself. But new life has been breathed into these use cases because of the exciting potential of ultra-atomic customer behavior data.

Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a prospective loan or a prospective insurance policy, many data sources can be brought into play, ranging from payment histories, detailed credit behavior, and employment data to financial asset disclosures. In some cases the collateral for a loan or the insured item may be accompanied by image data.

Customer churn analysis. Enterprises concerned with churn want to understand the predictive factors leading up to the loss of a customer, including that customer's detailed behavior as well as many external factors including the economy, life stage and other demographics of the customer, and finally real time competitive issues.

Making sense of big data analytic use cases
Certainly the purpose of developing this list of use cases is to convince the reader that the use cases come in all shapes and sizes and formats, and require many specialized approaches to analyze. Up until very recently all these use cases existed as separate endeavors, often involving special purpose built systems. But the industry awareness of the "big data analytics challenge" is motivating everyone to look for the architectural similarities and differences across all these use cases. Any given enterprise is increasingly likely to encounter one or more of these use cases. That realization is driving the interest in system architectures that address the big data analytics problem in a general way. Please study the following table.












Table columns, in order: Vector, matrix, or complex structure | Free text | Image or binary data | Data bags | Iterative logic or complex branching | Advanced analytic routines | Rapidly repeated measurements | Extreme low latency | Access to all data required

(Each X marks one applicable column for the use case; the column positions of the individual marks are not preserved here.)

Search ranking: X X X X X X
Ad tracking: X X X X X X X X
Location & proximity: X X X X X
Causal discovery: X X X X X X X
Social CRM: X X X X X X X X
Document similarity: X X X X X X X
Genomic analysis: X X X X X
Cohort groups: X X X X X X
In-flight engine status: X X X X X X
Smart utility meters: X X X X X X
Building sensors: X X X X X X X X
Satellite images: X X X X
CAT scans: X X X X X X
Financial fraud: X X X X X X X X X
Hacking detection: X X X X X X X X X
Game gestures: X X X X X X X X
Big science: X X X X X X X X X
Data bag exploration: X X X X X X
Risk analysis: X X X X X X X X
Churn analysis: X X X X X X X

The sheer density of this table makes it clear that systems to support big data analytics have to look very different than the classic relational database systems from the 1980s and 1990s. The original RDBMSs were not built to handle any of the requirements represented as columns in this table!


Big data analytics system requirements
Before discussing the exciting new technical and architectural developments of the 2010s, let's summarize the overall requirements for supporting big data analytics, keeping in mind that we are not requiring a single system or a single vendor's technology to provide a blanket solution for every use case. From the perspective of 2011, we have the luxury of standing back from all these use cases gathered in the last few years, and we are now in a position to surround the requirements with some confidence.

The development of big data analytics has reached a point where it needs an overall mission statement and identity independent of a list of use cases. Many of us have lived through earlier instantiations of advanced analytics that went by the names of advanced statistics, artificial intelligence and data mining. None of these earlier waves became a coherent theme that transcended the individual examples, as compelling as those examples were.

Here is an attempt to step back and define the characteristics of big data analytics at the highest levels. In the following, the term "UDF" is used in the broadest sense of any user defined function or program or algorithm that may appear anywhere in the end-to-end analysis architecture.

In the coming 2010s decade, the analysis of big data will require a technology or combination of technologies capable of:
• scaling to easily support petabytes (thousands of terabytes) of data
• being distributed across thousands of processors, potentially geographically unaware, and potentially heterogeneous
• delivering subsecond response time for highly constrained standard SQL queries
• embedding arbitrarily complex user-defined functions (UDFs) within processing requests
• implementing UDFs in a wide variety of industry-standard procedural languages
• assembling extensive libraries of reusable UDFs crossing most or all use cases
• executing UDFs as "relation scans" over petabyte sized data sets in a few minutes
• supporting a wide variety of data types growing to include images, waveforms, arbitrarily hierarchical data structures, and data bags
• loading data to be ready for analysis, at very high rates, at least gigabytes per second
• integrating data from multiple sources during the load process at very high rates (GB/sec)
• loading data before declaring or discovering its structure
• executing certain streaming analytic queries in real time on incoming load data
• updating data in place at full load speeds
• joining a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table
• scheduling and executing complex multi-hundred node workflows
• being configured without being subject to a single point of failure
• failing over and continuing processing when processing nodes fail
• supporting extreme mixed workloads including thousands of geographically dispersed on-line users and programs executing a variety of requests ranging from ad hoc queries to strategic analysis, while loading data in batch and streaming fashion
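
A quick back-of-the-envelope calculation shows what the relation scan requirement implies for each processor. The figures used here (1 petabyte, 5,000 nodes, 5 minutes) are illustrative assumptions chosen to match "petabyte sized," "thousands of processors," and "a few minutes"; they are not numbers given in this paper.

```python
PETABYTE = 10**15  # bytes, using decimal units


def per_node_scan_rate(dataset_bytes, nodes, seconds):
    """Bytes per second each node must scan, assuming perfectly even distribution."""
    return dataset_bytes / nodes / seconds


# Assumed scenario: scan 1 PB on 5,000 nodes in 5 minutes
rate = per_node_scan_rate(1 * PETABYTE, nodes=5_000, seconds=5 * 60)
print(f"{rate / 10**6:.0f} MB/s per node")
```

Under these assumptions every node must sustain roughly two thirds of a gigabyte per second, which is why data placement and parallelism dominate the architectural discussion that follows.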

Two architectures have emerged to address big data analytics: extended RDBMS, and MapReduce/Hadoop. These architectures are being implemented as completely separate systems and in various interesting hybrid combinations involving both architectures. We will start by discussing the architectures separately.

Extended relational database management systems
All of the major relational database management system vendors are adding features to address big data analytics from a solid relational perspective. The two most significant architectural developments have been the overtaking of the high end of the market with massively parallel processing (MPP), and the growing adoption of columnar storage. When MPP and columnar storage techniques are combined, a number of the system requirements in the above list can start to be addressed, including:
• scaling to support exabytes (thousands of petabytes) of data
• being distributed across tens of thousands of geographically dispersed processors
• delivering subsecond response time for highly constrained standard SQL queries
• updating data in place at full load speeds
• being configured without being subject to a single point of failure
• failing over and continuing processing when processing nodes fail

Additionally, RDBMS vendors are adding some complex user-defined functions (UDFs) to their syntax, but the kind of general purpose procedural language computing required by big data analytics is not being satisfied in relational environments at this time.

In a similar vein, RDBMS vendors are allowing complex data structures to be stored in individual fields. These kinds of embedded complex data structures have been known as "blobs" for many years. It's important to understand that relational databases have a hard time providing general support for interpreting blobs since blobs do not fit the relational paradigm. An RDBMS indeed provides some value by hosting the blobs in a structured framework, but much of the complex interpretation and computation on the blobs must be done with specially crafted UDFs, or BI application layer clients. Blobs are related to data bags discussed elsewhere in this paper. See the section entitled "Data structures should be declared at query time, not at data load time."

MPP implementations have never satisfactorily addressed the "big join" issue, where a billion row dimension table must be joined to a trillion row fact table without resorting to clustered storage. The big join crisis occurs when an ad hoc constraint is placed against the dimension table, resulting in a potentially very large set of dimension keys that must be physically downloaded into every one of the physical segments of the trillion row fact table stored separately in the MPP system. Since the dimension keys are scattered randomly across the separate segments of the trillion row fact table, it is very hard to avoid a lengthy download step of the very large dimension table to every one of the fact table storage partitions. To be fair, the MapReduce/Hadoop architecture has not been able to address the big join problem either.
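
The cost described above can be seen in a toy model. In the sketch below the fact rows are spread across segments without regard to the dimension key, so the set of qualifying dimension keys has to be shipped to every segment before each one can join locally. The table contents, the `customer_key` column name, and the round-robin placement are assumptions made for illustration only.

```python
def partition_fact_rows(fact_rows, num_segments):
    """Spread fact rows across segments round-robin, ignoring the dimension key."""
    segments = [[] for _ in range(num_segments)]
    for i, row in enumerate(fact_rows):
        segments[i % num_segments].append(row)
    return segments


def broadcast_join(dimension, predicate, segments):
    """Filter the dimension, ship the surviving keys to every segment, join locally."""
    keys = {key for key, attrs in dimension.items() if predicate(attrs)}
    # Every segment needs the full key set because matches may live anywhere
    keys_shipped = len(keys) * len(segments)
    matches = [row for seg in segments for row in seg if row["customer_key"] in keys]
    return matches, keys_shipped
```

The shipped volume grows as (qualifying keys) × (segments). With a billion row dimension and thousands of segments, that product is the lengthy download step; clustering the dimension with the fact table would avoid it, but that is exactly what the requirement rules out.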

Columnar data storage fits the relational paradigm, and especially dimensionally modeled databases, very well. Besides the significant advantage of high compression of sparse data, columnar databases allow a very large number of columns compared to row-oriented databases, and place little overhead on the system when columns are added to an existing schema. The most significant Achilles' heel, at least in 2011, is the slow loading speed of data into the columnar format. Although impressive load speed improvements are being announced by columnar database vendors, they have still not achieved the gigabytes-per-second requirement listed above.
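
To see why a column layout compresses sparse data well, consider pivoting rows into columns and then run-length encoding each column. Run-length encoding is used here only as one familiar example of column compression; the paper does not name a specific scheme, and the sample rows are invented.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into one list per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


rows = [
    {"state": "CA", "promo": None},
    {"state": "CA", "promo": None},
    {"state": "NY", "promo": None},
    {"state": "NY", "promo": "X"},
]
columns = to_columns(rows)
encoded_promo = run_length_encode(columns["promo"])
```

A mostly empty column such as `promo` collapses to a handful of runs, and adding a new column means adding one more independent list rather than rewriting every row. The pivot itself hints at the loading cost: each incoming row must be taken apart and appended to many separate column structures.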

The standard RDBMS architecture for implementing an enterprise data warehouse based on dimensional modeling principles is simple and well understood, as shown in Figure 1. Recall that throughout this white paper, the EDW is defined in the comprehensive sense to include all back room and front room processes including ETL, data presentation, and BI applications.

Figure 1. The standard RDBMS-based architecture for an enterprise data warehouse
Source: The Data Warehouse Lifecycle Toolkit, 2nd edition, Kimball et al. (2008)




In this standard EDW architecture the ETL system is a major component that sits between the source systems and the presentation servers that are responsible for exposing all data to business intelligence applications. In this view, the ETL system adds significant value by cleaning, conforming, and arranging the data into a series of dimensional schemas which are then stored physically in the presentation server. A crucial element of this architecture is the preparation of conformed dimensions in the ETL system that serve as the basis of integration for the BI applications. It is the strong conviction of this author that deferring the building of the dimensional structures and the issues of integration until query time is the wrong architecture. Such a "deferred computation" approach requires an unduly expensive query optimizer to correctly query complex non-dimensional models every time a query is presented. The calculation of integration at query processing time generally requires complex application logic in the BI tools which also might have to be executed for every query.
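
The division of labor argued for here can be sketched as follows. The ETL step maps each source system's local product codes onto one conformed key once, at load time; the BI step then only has to line up results on that shared key. The source names, code formats, and key maps are invented for illustration.

```python
def conform(source_rows, key_map, measure):
    """ETL step: swap a source's local code for the conformed key, then total the measure."""
    totals = {}
    for row in source_rows:
        key = key_map[row["code"]]
        totals[key] = totals.get(key, 0) + row[measure]
    return totals


def drill_across(*totals_by_key):
    """BI step: align results from separate fact tables on the conformed key."""
    keys = sorted(set().union(*totals_by_key))
    return {k: tuple(t.get(k, 0) for t in totals_by_key) for k in keys}


# Hypothetical sources that label the same products differently
sales = [{"code": "S-1", "units": 5}, {"code": "S-2", "units": 3}, {"code": "S-1", "units": 2}]
returns = [{"code": "r_001", "units": 1}]
sales_map = {"S-1": "WIDGET", "S-2": "GADGET"}
returns_map = {"r_001": "WIDGET"}

report = drill_across(
    conform(sales, sales_map, "units"),
    conform(returns, returns_map, "units"),
)
```

Because the mapping work is done once in `conform`, the query-time step stays trivial. In the deferred approach, the equivalent of `key_map` lookups and cleanup would have to run inside every query.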

The extended RDBMS architecture to support big data analytics preserves the standard architecture with a number of important additions, shown below in Figure 2 with large arrows:

Figure 2. The extended RDBMS-based architecture for an enterprise data warehouse

The fact that the high-level enterprise data warehouse architecture is not materially changed by the introduction of new data structures, or a growing library of specially crafted user-defined functions, or powerful procedural language-based programs acting as powerful BI clients, is the charm of the extended RDBMS approach to big data analytics. The major RDBMS players are able to marshal their enormous legacy of millions of lines of code, powerful governance capabilities, and system stability built over decades of serving the marketplace.













However, it is the opinion of this author that the extended RDBMS systems cannot be the only solution for big data analytics. At some point, tacking on non-relational data structures and non-relational processing algorithms to the basic, coherent RDBMS architecture will become unwieldy and inefficient. The Swiss Army knife analogy comes to mind. Another analogy closer to the topic is the programming language PL/1. Originally designed as an overarching, multipurpose, powerful programming language for all forms of data and all applications, it ultimately became a bloated and sprawling corpus that tried to do too many things in a single language. Since the heyday of PL/1 there has been a wonderful evolution of more narrowly focused programming languages with many new concepts and features that simply couldn't be tacked onto PL/1 after a certain point. Relational database management systems do so many things so well that there is no danger of suffering the same fate as PL/1. But the big data analytics space is growing so rapidly and in such exciting and unexpected new directions that a lighter weight, more flexible and more agile processing framework in addition to RDBMS systems may be a reasonable alternative.

MapReduce/Hadoop systems
MapReduce is a processing framework originally developed by Google in the early 2000s for performing web page searches across thousands of physically separated machines. The MapReduce approach is extremely general. Complete MapReduce systems can be implemented in a variety of languages although the most significant implementation is in Java. MapReduce is really a UDF (user defined function) execution framework, where the "F" can be extraordinarily complex. Originally targeted to building Google's web page search index, a MapReduce job can be defined for virtually any data structure and any application. The target processors that actually perform the requested computation can be identical (a "cluster"), or can be a heterogeneous mix of processor types (a "grid"). The data in each processor upon which the ultimate computation is performed can be stored in a database, or more commonly in a file system, and can be in any digital format.
The most significant implementation of MapReduce is Apache Hadoop, known simply as Hadoop. Hadoop is an open source, top-level Apache project, with thousands of contributors and a whole industry of diverse applications. Hadoop runs natively on its own distributed file system (HDFS) and can also read and write to Amazon S3 and others. Conventional database vendors are also implementing interfaces to allow Hadoop jobs to be run over massively distributed instances of their databases.
As we will see when we give a brief overview of how a Hadoop job works, bandwidth between the separate processors can be a huge issue. HDFS is a so-called "rack aware" file system because the central name node knows which nodes reside on the same rack and which are connected by more than one network hop. Hadoop exploits the relationship between the central job dispatcher and HDFS to significantly optimize a massively distributed processing task by having detailed knowledge of where data actually resides. This also implies that a critical aspect of performance control is co-locating segments of data on actual physical hardware racks so that the MapReduce communication can be accomplished at backplane speeds rather than slower network speeds. Note that remote cloud-based file systems such as Amazon S3 and CloudStore are, by their nature, unable to provide the rack aware benefit. Of course, cloud-based file systems have a number of compelling advantages which we'll discuss later.
How MapReduce works in Hadoop
A MapReduce job is submitted to a centralized JobTracker, which in turn schedules parts of the job to a number of TaskTracker nodes. Although in general a TaskTracker may fail and its task can be reassigned by the JobTracker, the JobTracker is a single point of failure. If the JobTracker halts, the MapReduce job must be restarted or be resumed from intermediate snapshots.
A MapReduce job is always divided into two distinct phases, map and reduce. The overall input to a MapReduce job is divided into many equal-sized splits, each of which is assigned a map task. The map function is then applied to each record in each split. For large jobs, the JobTracker schedules these map tasks in parallel. The overall performance of a MapReduce job depends significantly on achieving a balance of enough parallel splits to keep many machines busy, but not so many parallel splits that the interprocess communication of managing all the splits bogs down the overall job. When MapReduce is run over the HDFS file system, a typical default split size is 64 MB of input data.
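To make the split arithmetic concrete, the following minimal sketch counts how many splits (and therefore map tasks) a job would need at the 64 MB split size mentioned above. The function name and the 1 TB input size are illustrative assumptions, not part of Hadoop.

```python
SPLIT_SIZE_BYTES = 64 * 1024 * 1024  # the 64 MB typical default split size cited above


def number_of_splits(input_bytes, split_bytes=SPLIT_SIZE_BYTES):
    """Return how many splits are needed to cover the input (ceiling division)."""
    return (input_bytes + split_bytes - 1) // split_bytes


one_terabyte = 1024 ** 4  # hypothetical job input size
print(number_of_splits(one_terabyte))  # 16384
```

Under these assumptions a 1 TB input yields 16,384 map tasks, which illustrates why the balance between parallelism and coordination overhead matters.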
As the name suggests, the map task is the first half of the MapReduce job. Each map task produces a set of intermediate result records which are written to the local disk of the machine performing the map task. The second half of the MapReduce job, the reduce task, may run on any processing node. The outputs of the mappers (nodes running map tasks) are sorted and partitioned in such a way that these outputs can be transferred to the reducers (nodes running the reduce task). The final outputs of the reducers comprise the sorted and partitioned result set of the overall MapReduce job. In MapReduce running over HDFS, the result set is written to HDFS and is replicated for reliability.
In Figure 3, we show this task flow for a MapReduce job with three mapper nodes feeding two reducer nodes, by reproducing figure 2.3 from Tom White's book, Hadoop: The Definitive Guide, 2nd Edition (O'Reilly, 2010).
Figure 3. An example MapReduce job
In Tom White's book, a simple MapReduce job is described which we extend somewhat here. Suppose that the original data before the splits are applied consists of a very large number (perhaps billions) of unsorted temperature measurements, one per record. Such measurements could come from many thousands of automatic sensors located around the United States. The splits are assigned to the separate mapper nodes to equalize as much as possible the number of records going to each node. The mapper inputs take the form of key-value pairs, in this case a sequential record identifier and the full record containing the temperature measurements as well as other data. The job of each mapper is simply to parse the records presented to it and extract the year, the state, and the temperature, which become the second set of key-value pairs passed from the mapper to the reducer.
The job of each reducer is to find the maximum reported temperature for each state and each distinct year in the records passed to it. Each reducer is responsible for a state, so in order to accomplish the transfer, the output of each mapper must be sorted so that the key-value pairs can be dispatched to the appropriate reducers. In this case there would be 50 reducers, one for each state. These sorted blocks are then transferred to the reducers in a step which is a critical feature of the MapReduce architecture, where it is called the "shuffle."
Notice that the shuffle involves a true physical transfer of data between processing nodes. This makes the value of the rack aware feature more obvious, since a lot of data needs to be moved from the mappers to the reducers. The clever reader may wonder if this data transfer could be reduced by having the mapper outputs combined so that many readings from a single state and year are given to the reducer as a single key-value pair rather than many. The answer is yes, and Hadoop provides a combiner function to accomplish exactly this end.
Each reducer receives a large number of state/year-temperature key-value pairs, and finds the maximum temperature for a given year. These maximum temperatures for each year are the final output from each reducer.
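The map, combine, shuffle, and reduce steps of the temperature example can be simulated in a single process. The sketch below is plain Python, not Hadoop code: the comma-separated record layout, the sample readings, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical record layout: "sensor_id,year,state,temperature"
RECORDS = [
    "s1,2010,TX,101.5",
    "s2,2010,TX,98.0",
    "s3,2010,AK,71.2",
    "s4,2011,TX,104.1",
    "s5,2011,AK,68.9",
    "s6,2010,AK,75.0",
]


def map_record(record_id, record):
    """Mapper: parse one raw record and emit ((state, year), temperature)."""
    _sensor, year, state, temp = record.split(",")
    yield (state, int(year)), float(temp)


def combine(pairs):
    """Combiner: keep only the local maximum per key before the shuffle."""
    local_max = {}
    for key, temp in pairs:
        if key not in local_max or temp > local_max[key]:
            local_max[key] = temp
    return list(local_max.items())


def shuffle(mapper_outputs):
    """Shuffle: route each pair to the reducer for its state, sorted by key."""
    by_state = defaultdict(list)
    for output in mapper_outputs:
        for (state, year), temp in output:
            by_state[state].append(((state, year), temp))
    return {state: sorted(pairs) for state, pairs in by_state.items()}


def reduce_state(pairs):
    """Reducer: maximum temperature for each year of a single state."""
    result = {}
    for (_state, year), temp in pairs:
        result[year] = max(temp, result.get(year, temp))
    return result


# Two splits, one per simulated mapper node.
splits = [RECORDS[:3], RECORDS[3:]]
mapper_outputs = []
for split_index, split in enumerate(splits):
    pairs = [kv for i, rec in enumerate(split)
             for kv in map_record((split_index, i), rec)]
    mapper_outputs.append(combine(pairs))

max_by_state = {state: reduce_state(pairs)
                for state, pairs in shuffle(mapper_outputs).items()}
print(max_by_state)
```

In this toy run the combiner collapses the two Texas readings for 2010 in the first split into one pair, so less data crosses the simulated shuffle, which is the saving the combiner is meant to provide.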
This approach can be scaled more or less indefinitely. Really serious MapReduce jobs running on HDFS may have hundreds or thousands of mappers and reducers, processing petabytes of input data.
At this point the appeal of the MapReduce/Hadoop approach should be clear. There are virtually no restrictions on the form of the inputs to the overall job. There only needs to be some rational basis for creating splits and reading records, in this case the record identifier in Tom White's example. Actual logic in the mappers and the reducers can be programmed in virtually any programming language and can be as simple as the above example, or much more complicated UDFs. The reader should be able to visualize how some of the more complex use cases (e.g., comparison of satellite images) described earlier in the paper could fit into this framework.
Tools for the Hadoop environment
What we have described thus far is the core processing component when MapReduce is run in the Hadoop environment. This is roughly equivalent to describing the inner processing loop in a relational database management system. In both cases there's a lot more to these systems to implement a complete functioning environment. The following is a brief overview of typical tools used in a MapReduce/Hadoop environment. We group these tools by overall function. Tom White's book, mentioned above, is an excellent starting point for understanding how these tools are used.
Getting data in and getting data out
ETL platforms -- ETL platforms, with their long history of importing and exporting data to relational databases, provide specific interfaces for moving data into and out of HDFS. The platform-based approach, as contrasted with hand coding, provides extensive support for metadata, data quality, documentation, and a visual style of system building.
Sqoop -- Sqoop, developed by Cloudera, is an open source tool that allows importing data from a relational source to HDFS and exporting data from HDFS to a relational target. Data imported by Sqoop into HDFS can be used both by MapReduce applications and HBase applications. HBase is described below.
Scribe -- Scribe, developed at Facebook and released as open source, is used to aggregate log data from a large number of Web servers.
Flume -- Flume, developed by Cloudera, is a distributed reliable streaming data collection service. It uses a central configuration managed by ZooKeeper and supports tunable reliability and automatic failover and recovery.
Programming
Low-level MapReduce programming -- primary code for mappers and reducers can be written in a number of languages. Hadoop's native language is Java but Hadoop exposes APIs for writing code in other languages such as Ruby and Python. An interface to C++ is provided, which is named Hadoop Pipes. Programming MapReduce at the lowest level obviously provides the most potential power, but this level of programming is very much like assembly language programming. It can be very laborious, especially when attempting to do conceptually simple tasks like joining two data sets.
High-level MapReduce programming -- Apache Pig, or simply Pig, is a client-side open-source application providing a high-level programming language for processing large data sets in MapReduce. The programming language itself is called Pig Latin. Hive is an alternative application designed to look much more like SQL, and is used for data warehousing use cases. When employed for the appropriate use cases, Pig and Hive provide enormous programming productivity benefits over low-level MapReduce programming, often by a factor of 10 or more. Pig and Hive lift the application developer's perspective up from managing the detailed mapper and reducer processes to more of an applications focus.
Integrated development environment -- MapReduce/Hadoop development needs to move decisively away from bare hand coding to be adopted by mainstream IT shops. An integrated development environment for MapReduce/Hadoop needs to include editors for source code, compilers, tools for automating system builds, debuggers, and a version control system.
Integrated application environment -- an even higher layer above an integrated development environment could be called an integrated application environment, where complex reusable analytic routines are assembled into complete applications via a graphical user interface. This kind of environment might be able to use open source algorithms such as those provided by the Apache Mahout project, which distributes machine learning algorithms on the Hadoop platform.
Cascading -- Cascading is another tool that is an abstraction layer for writing complex MapReduce applications. It is best described as a thin Java library typically invoked from the command line to be used as a query API and process scheduler. It is not intended to be a comprehensive alternative to Pig or Hive.
HBase -- HBase is an open-source, nonrelational, column oriented database that runs directly on Hadoop. It is not a MapReduce implementation. A principal differentiator of HBase from Pig or Hive (MapReduce implementations) is the ability to provide real-time read and write random-access to very large data sets.
Oozie -- Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows.
ZooKeeper -- ZooKeeper is a centralized configuration manager for distributed applications. ZooKeeper can be used independently of Hadoop as well.
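The Programming list above notes that joining two data sets is laborious at the lowest level of MapReduce. A minimal single-process sketch of one common approach, a reduce-side join, suggests why: every row must be tagged with its source, grouped by the join key, and re-paired by hand. The table contents and function names are hypothetical, and this is plain Python rather than Hadoop API code.

```python
from collections import defaultdict

# Hypothetical inputs: customers (id, name) and orders (order_id, customer_id, amount).
customers = [(1, "Ada"), (2, "Grace")]
orders = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.5)]


def map_customer(row):
    """Emit the join key with a tag marking the row as a customer record."""
    customer_id, name = row
    return customer_id, ("C", name)


def map_order(row):
    """Emit the join key with a tag marking the row as an order record."""
    order_id, customer_id, amount = row
    return customer_id, ("O", (order_id, amount))


def reduce_join(customer_id, tagged_values):
    """Pair every order with the customer name that shares its key."""
    names = [value for tag, value in tagged_values if tag == "C"]
    order_rows = [value for tag, value in tagged_values if tag == "O"]
    return [(customer_id, name, order_id, amount)
            for name in names
            for order_id, amount in order_rows]


# Simulated shuffle: group all tagged values by join key.
groups = defaultdict(list)
tagged = [map_customer(r) for r in customers] + [map_order(r) for r in orders]
for key, value in tagged:
    groups[key].append(value)

joined = sorted(row for key in groups for row in reduce_join(key, groups[key]))
print(joined)
```

In SQL the same result is a one-line join, which is the productivity gap that tools such as Pig and Hive aim to close.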
Administering
Embedded Hadoop admin features -- Hadoop supports a comprehensive runtime environment including edit log, safe mode operation, audit logging, filesystem check, data node block verifier, data node block distribution balancer, performance monitor, comprehensive log files, metrics for administrators, counters for MapReduce users, metadata backup, data backup, filesystem balancer, commissioning and decommissioning nodes.
Java management extensions -- a standard Java API for monitoring and managing applications
GangliaContext -- an open source distributed monitoring system for very large clusters
Feature convergence in the coming decade
It is safe to say that relational database management systems and MapReduce/Hadoop systems will increasingly find ways to coexist gracefully in the coming decade. But the systems have distinct characteristics, as depicted in the following table:

Relational DBMSs                                    MapReduce/Hadoop
-------------------------------------------------   ----------------------------------------------------
Proprietary, mostly                                 Open source
Expensive                                           Less expensive
Data requires structuring                           Data does not require structuring
Great for speedy indexed lookups                    Great for massive full data scans
Deep support for relational semantics               Indirect support for relational semantics, e.g. Hive
Indirect support for complex data structures        Deep support for complex data structures
Indirect support for iteration, complex branching   Deep support for iteration, complex branching
Deep support for transaction processing             Little or no support for transaction processing

In the upcoming decade RDBMSs will extend their support for hosting complex data types as "blobs," and will extend APIs for arbitrary analytic routines to operate on the contents of records. MapReduce/Hadoop systems, especially Hive, will deepen their support for SQL interfaces and support the complete SQL language more fully. But neither will take over the market for big data analytics exclusively. As remarked earlier, RDBMSs cannot provide "relational" semantics for many of the complex use cases required by big data analytics. At best, RDBMSs will provide relational structure surrounding the complex payloads. Similarly, MapReduce/Hadoop systems will never take over ACID-compliant transaction processing, or become superior to RDBMSs for indexed queries on row and column oriented tables.
As this paper is being written, significant advances are being made in developing hybrid systems using both relational database technology and MapReduce/Hadoop technology. Figure 4 illustrates two primary alternatives. The first alternative delivers the data directly into a MapReduce/Hadoop configuration for primary non-relational analysis. As we have described, this analysis can run the full gamut from complex analytical routines to simple sorting that looks like a conventional ETL step. When the MapReduce/Hadoop step is complete, the results are loaded into an RDBMS for conventional structured querying with SQL.
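The last step of this first alternative, loading the output of a MapReduce/Hadoop job into an RDBMS and querying it with SQL, can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for the enterprise RDBMS, and the result rows and table name are hypothetical.

```python
import sqlite3

# Hypothetical output of an upstream MapReduce/Hadoop step: (state, year, max_temp).
hadoop_results = [("TX", 2010, 101.5), ("TX", 2011, 104.1), ("AK", 2010, 75.0)]

conn = sqlite3.connect(":memory:")  # SQLite stands in for the enterprise RDBMS
conn.execute("CREATE TABLE max_temperature (state TEXT, year INTEGER, max_temp REAL)")
conn.executemany("INSERT INTO max_temperature VALUES (?, ?, ?)", hadoop_results)

# Conventional structured querying once the results are relational.
rows = conn.execute(
    "SELECT state, MAX(max_temp) FROM max_temperature GROUP BY state ORDER BY state"
).fetchall()
print(rows)
```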
The second alternative configuration loads the data directly to an RDBMS, even when the primary data payloads are not conventional scalar measurements. At that point two analysis modes are possible. The data can be analyzed with specially crafted user-defined functions, effectively from the BI layer, or passed to a downstream MapReduce/Hadoop application.
In the future even more complex combinations will tie these architectures more closely together, including MapReduce systems whose mappers and reducers are actually relational databases, and relational database systems whose underlying storage consists of HDFS files.

Figure 4. Alternative hybrid architectures using both RDBMS and Hadoop.
It will probably be difficult for IT organizations to sort out the vendor claims, which will almost certainly assert that their systems do everything. In some cases these claims are "objection removers," which means that they are claims that have a grain of truth to them, and are made to make you feel good, but do not stand up to scrutiny in a competitive and practical environment. Buyer beware!
Reusable analytics
Up to this point we have sidestepped the question of where all the special analytic software comes from. Big data analytics will never prosper if every instance is a custom coded solution. Both the RDBMS and the open-source communities recognize this and two main development themes have emerged. High-end statistical analysis vendors, such as SAS, have developed extensive and proprietary reusable libraries for a wide range of analytic applications, including advanced statistics, data mining, predictive analytics, feature detection, linear models, discriminant analysis, and many others. The open source community has a number of initiatives, the most notable of which are Hadoop-ML and Apache Mahout. Quoting from Hadoop-ML's website:
    Hadoop-ML (is) an infrastructure to facilitate the implementation of parallel machine learning/data mining (ML/DM) algorithms on Hadoop. Hadoop-ML has been designed to allow for the specification of both task-parallel and data-parallel ML/DM algorithms. Furthermore, it supports the composition of parallel ML/DM algorithms using both serial as well as parallel building blocks -- this allows one to write reusable parallel code. The proposed abstraction eases the implementation process by requiring the user to only specify computations and their dependencies, without worrying about scheduling, data management, and communication. As a consequence, the codes are portable in that the user never needs to write Hadoop-specific code. This potentially allows one to leverage future parallelization platforms without rewriting one's code.
Apache Mahout provides free implementations of machine learning algorithms on the Hadoop platform.
Complex event processing (CEP)
Complex event processing (CEP) consists of processing events happening inside and outside an organization to identify meaningful patterns in order to take subsequent action in real time. For example, CEP is used in utility networks (electrical, gas and water) to identify possible issues before they become detrimental. These CEP deployments allow for real-time intervention for critical network or infrastructure situations. The combination of deep DW analytics and CEP can be applied in retail customer settings to analyze behavior and identify situations where a company may lose a customer or be able to sell them additional products or services at the time of their direct engagement. In banking, sophisticated analytics might help to identify the 10 most common patterns of fraud and CEP can then be used to watch for those patterns so they may be thwarted before a loss.
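As a rough illustration of watching an event stream for a known pattern, the sketch below raises an alert when an account accumulates several declined transactions within a short window. The rule itself, its parameters, and the event layout are hypothetical and are not drawn from any particular CEP product.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # hypothetical rule parameters, chosen only for illustration
MAX_DECLINES = 3


def detect_fraud(events):
    """Yield (account, timestamp) when an account reaches MAX_DECLINES within the window."""
    recent = defaultdict(deque)  # account -> timestamps of recent declined transactions
    for timestamp, account, outcome in events:
        if outcome != "declined":
            continue
        window = recent[account]
        window.append(timestamp)
        # Forget declines that fall outside the time window.
        while timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_DECLINES:
            yield account, timestamp


events = [
    (0, "acct-1", "declined"),
    (10, "acct-2", "approved"),
    (20, "acct-1", "declined"),
    (45, "acct-1", "declined"),  # third decline within 60 seconds
    (200, "acct-2", "declined"),
]
alerts = list(detect_fraud(events))
print(alerts)
```

Because the detector is a generator, each alert is produced as soon as the triggering event arrives, rather than after the whole stream has been read.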
At the time of this white paper, CEP is not generally thought of as part of the EDW, but this author believes that technical advances in continuous query processing will cause CEP and EDW to share data and work more closely together in the coming decade.

Data warehouse cultural changes in the coming decade
The enterprise data warehouse must absolutely stay relevant to the business. As the value and the visibility of big data analytics grow, the data warehouse must encompass the new culture, skills, techniques, and systems required for big data analytics.
Sandboxes
For example, big data analysis encourages exploratory sandboxes for experimentation. These sandboxes are copies or segments of the massive data sets being sourced by the organization. Individual analysts or very small groups are encouraged to analyze the data with a very wide variety of tools, ranging from serious statistical tools like SAS, Matlab or R, to predictive models, and many forms of ad hoc querying and visualization through advanced BI graphical interfaces. The analyst responsible for a given sandbox is allowed to do anything with the data, using any tool they want, even if the tools they use are not corporate standards. The sandbox phenomenon has enormous energy but it carries a significant risk to the IT organization and EDW architecture because it could create isolated and incompatible stovepipes of data. This point is amplified in the section on organizational changes, below.
Exploratory sandboxes usually have a limited time duration, lasting weeks or at most a few months. Their data can be a frozen snapshot, or a window on a certain segment of incoming data. The analyst may have permission to run an experiment changing a feature on the product or service in the marketplace, and then performing A/B testing to see how the change affects customer behavior. Typically, if such an experiment produces a successful result, the sandbox experiment is terminated, and the feature goes into production. At that point, tracking applications that may have been implemented in the sandbox using a quick and dirty prototyping language are usually reimplemented by other personnel in the EDW environment using corporate standard tools. In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously. As one interviewee commented, "newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment."
Architecturally, sandboxes should not be brute force copies of entire data sets, or even major segments of these data sets. In dimensional modeling parlance, the analyst needs much more than just a fact table to run the experiment. At a minimum the analyst also needs one or more very large dimension tables, and possibly additional fact tables for complete "drill across" analysis. If 100 analysts are creating brute force copy versions of the data for the sandboxes there will be enormous wasting of disk space and resources for all the redundant copies. Remember that the largest dimension tables, such as customer dimensions, can have 500 million rows! The recommended architecture for a serious sandbox environment is to build each sandbox using conformed (shared) dimensions which are incorporated into each sandbox as relational views, or their equivalent under Hadoop applications.
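A minimal sketch of this recommendation follows, using an in-memory SQLite database as a stand-in for the warehouse RDBMS. The table names, view name, and sample rows are hypothetical. The sandbox owns only its private fact table and sees the conformed customer dimension through a view, so no dimension rows are copied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the warehouse RDBMS
conn.executescript("""
    -- Conformed customer dimension, stored once by the EDW team.
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, segment TEXT);
    INSERT INTO customer_dim VALUES (1, 'gold'), (2, 'silver'), (3, 'gold');

    -- Private fact table belonging to one analyst's sandbox.
    CREATE TABLE sandbox_a_clicks (customer_key INTEGER, clicks INTEGER);
    INSERT INTO sandbox_a_clicks VALUES (1, 5), (2, 7), (3, 4);

    -- The sandbox reaches the shared dimension through a view; no rows are copied.
    CREATE VIEW sandbox_a_customer_dim AS
        SELECT customer_key, segment FROM customer_dim;
""")

clicks_by_segment = conn.execute("""
    SELECT d.segment, SUM(f.clicks)
    FROM sandbox_a_clicks AS f
    JOIN sandbox_a_customer_dim AS d USING (customer_key)
    GROUP BY d.segment
    ORDER BY d.segment
""").fetchall()
print(clicks_by_segment)
```

A hundred sandboxes built this way would share one physical copy of the dimension, and a correction made to the conformed dimension would be visible in every sandbox at once.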
Low latency
An elementary mistake when gathering business requirements during the design of a data warehouse is to ask the business user if they want "real time" data. Users are likely to say "of course!" Although perhaps this answer has been somewhat gratuitous in the past, a good business case can now be made in many situations that more frequent updates of data delivered to the business with lower and lower latencies are justified. Both RDBMSs and MapReduce/Hadoop systems struggle with loading gigantic amounts of data and making that data available within seconds of that data being created. But the marketplace wants this, and regardless of a technologist's doubt about the requirement, the requirement is real and over the next decade it must be addressed.
An interesting angle on low latency data is the desire to begin serious analysis on the data as it is streaming in, possibly long before the data collection process even terminates. There is significant interest in streaming analysis systems which allow SQL-like queries to process the data as it flows into the system. In some use cases when the results of a streaming query surpass a threshold, the analysis can be halted without running the job to the bitter end. An academic effort, known as continuous query language (CQL), has made impressive progress in defining the requirements for streaming data processing including clever semantics for dynamically moving time windows on the streaming data. Look for CQL language extensions and streaming data query capabilities in the load programs for both RDBMSs and HDFS deployed data sets. An ideal implementation would allow streaming data analysis to take place while the data is being loaded at gigabytes per second.
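The two ideas in this paragraph, a moving time window over streaming data and halting as soon as a threshold is surpassed, can be sketched in a few lines. This is not CQL; it is a plain Python illustration, and the function name, window length, threshold, and sample readings are all assumptions.

```python
from collections import deque


def stream_until_threshold(readings, window_seconds, threshold):
    """Consume (timestamp, value) pairs; stop once the moving-window average exceeds threshold."""
    window = deque()
    total = 0.0
    for timestamp, value in readings:
        window.append((timestamp, value))
        total += value
        # Slide the window forward: drop readings older than window_seconds.
        while timestamp - window[0][0] >= window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        average = total / len(window)
        if average > threshold:
            return timestamp, average  # halt without reading the rest of the stream
    return None


readings = iter([(0, 10.0), (1, 12.0), (2, 30.0), (3, 50.0), (4, 70.0), (5, 5.0)])
result = stream_until_threshold(readings, window_seconds=3, threshold=40.0)
remaining = list(readings)  # whatever the halted query never had to read
print(result, remaining)
```

In this run the query stops at timestamp 4, when the three-second window average reaches 50.0, and the final reading is never consumed.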
The availability of extremely frequent and extremely detailed event measurements can drive interactive intervention. The use cases where this intervention is important span many situations ranging from online gaming to product offer suggestions to financial account fraud responses to the stability of networks.
Continuous thirst for more exquisite detail
Analysts are forever thirsting for more detail in every marketplace observation, especially of customer behavior. For example every webpage event (a page being painted on a user's screen) spawns hundreds of records describing every object on the page. In online games, where every gesture enters the data stream, as many as 100 descriptors are attached to each of these gesture micro-events. For instance, in a hypothetical online baseball game, when the batter swings at a pitch, everything describing the position of the players, the score, runners on the bases, and even the characteristics of the pitch is stored with that individual record. In both of these examples, the complete context must be captured within the current record, because it is impractical to compute this detailed context after the fact from separate data sources. The lesson for the coming decade is that this thirst for exquisite detail will only grow. It is possible to imagine thousands of attributes being attached to some micro-events, and the categories and names of these attributes will grow in unpredictable ways. This makes the data bag approach discussed earlier in the paper much more important. It means that positionally dependent schemas, with the keys (names of the data) pre-declared as column names, are an unworkable design.
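A small sketch may help show why a bag of key-value pairs copes with unpredictable attributes where pre-declared columns do not. The event layout and attribute names below are hypothetical. Each micro-event carries its own set of named attributes, and the set of names is discovered from the data rather than declared in advance.

```python
# Hypothetical micro-events; attribute names are illustrative only.
events = [
    {"event_id": 1, "ts": "2011-04-01T12:00:00",
     "bag": {"pitch_type": "curve", "runners_on": 2}},
    {"event_id": 2, "ts": "2011-04-01T12:00:05",
     "bag": {"pitch_type": "fastball", "score": "3-1", "wind_mph": 12}},
]


def attribute(event, name, default=None):
    """Look up an attribute by name; an absent attribute simply returns the default."""
    return event["bag"].get(name, default)


def observed_attributes(all_events):
    """Discover the attribute names actually present instead of pre-declaring columns."""
    names = set()
    for event in all_events:
        names.update(event["bag"])
    return sorted(names)


print(observed_attributes(events))
print(attribute(events[0], "wind_mph"))
```

A new attribute appearing in tomorrow's events requires no schema change here; it simply shows up in the discovered list.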
Finally, a perfect historical reconstruction of interesting events such as webpage exposures needs to be more than just a list of attributes on the webpage when it was displayed, even if that list is enormously detailed. A perfect historical reconstruction of the webpage needs to be seen through a multimedia user interface, i.e., a browser.
Light touch data waits for its relevance to be exposed
Light touch data is an aspect of the exquisite detail data described in the previous section. For example, if a customer browses a website extensively before making a purchase, a great deal of micro-context is stored in all the webpage events prior to the purchase. When the purchase is made, some of that micro-context suddenly becomes much more important, and is elevated from "light touch data" to real data. At that point the sequence of exposures to the selected product or to competitive products in the same space can be sessionized. These micro-events are pretty much meaningless before the purchase event, because there are so many conceivable and irrelevant threads that would be dead ends for analysis. This requires oceans of light touch data to be stored, waiting for the relevance of selected threads of these micro-events to eventually be exposed. Conventional seasonality thinking suggests that at least five quarters (15 months) of this light touch data needs to be kept online. This is one instance of a remark made consistently during interviews for this white paper that analysts want "longer tails," which means that they want more significant histories than they currently get.
Simple analysis of all the data trumps sophisticated analysis of some of the data
Although data sampling has never been a popular technique in data warehousing, surprisingly the arrival of enormous petabyte sized data sets has not increased the interest in analyzing a subset of the data. On the contrary, a number of analysts point out that monetizable insights can be derived from very small populations that could be missed by only sampling some of the data. Of course this is a somewhat controversial point, since the same analysts admit that if you have 1 trillion behavior observation records, you may be able to find any behavior pattern if you look hard enough.
Another somewhat controversial point raised by some analysts is their concern that any form of data cleaning on the incoming data could erase interesting low-frequency "edge cases." Ultimately both the cases of misleading rare behavior patterns, and misleading corrupted data need to be gently filtered out of the data.
Assuming that the behavior insights from very small populations are valid, there is widespread recognition that micro-marketing to the small populations is possible, and doing enough of this can build a sustainable strategic advantage.
A final argument in favor of analyzing complete data sets is that these "relation scans" do not require indexes or aggregations to be computed in advance of the analysis. This approach fits well with the basic MapReduce distributed analysis architecture.
Data structures should be declared at query time, not at data load time
A number of analysts interviewed for this white paper said that the enormous data sets they were trying to analyze needed to be loaded in a queryable state before the structure and content of the data sets were completely understood. Again, think of the data bag kind of marketplace observation where, within a well-structured dimensional measurement process, the actual observation is a disorderly and potentially unpredictable set of key-value pairs. The structure of this data bag may need to be discovered, and alternate interpretations of the structure may need to be possible without reloading the database. One respondent remarked that yesterday's fringe data is tomorrow's well-structured data, implying that we need exceptional flexibility as we explore new kinds of data sources.
A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is the deferral of the data structure declaration until query time in the MapReduce/Hadoop systems. An objection from the RDBMS community is that forcing every MapReduce job to declare the target data structure promotes a kind of chaos because every analyst can do their own thing. But that objection seems to miss the point that a standard data structure declaration can easily be published as a library module that can be picked up by every analyst when they are implementing their application.
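One way to picture such a published declaration is a small shared module that every job imports and applies to the raw data at query time. The sketch below is an assumption about what that module might contain: the record type, the field names, and the comma-separated raw format are all hypothetical.

```python
from typing import NamedTuple


# Hypothetical shared module content: one published declaration of the record structure.
class TemperatureReading(NamedTuple):
    sensor_id: str
    year: int
    state: str
    temperature: float


def parse_reading(raw_line):
    """Apply the published structure to one raw line at query time."""
    sensor_id, year, state, temperature = raw_line.strip().split(",")
    return TemperatureReading(sensor_id, int(year), state, float(temperature))


# The raw lines were loaded as-is; structure is imposed only when a job reads them.
raw_lines = ["s1,2010,TX,101.5\n", "s2,2011,AK,68.9\n"]

# Two different analyses reuse the same declaration instead of inventing their own.
hottest = max((parse_reading(line) for line in raw_lines),
              key=lambda reading: reading.temperature)
states = sorted({parse_reading(line).state for line in raw_lines})
print(hottest, states)
```

If the interpretation of the raw data changes, only the shared module is revised; the stored data does not need to be reloaded.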
The EDW supporting big data analytics must be magnetic, agile, and deep
Cohen and Dolan in their seminal but somewhat controversial paper on big data analytics argue that EDWs must shed some old orthodoxies in order to be magnetic, agile, and deep. A magnetic environment places the fewest impediments on the incorporation of new, unexpected, and potentially dirty data sources. Specifically, this supports the need to defer declaration of data structures until after the data is loaded. According to Cohen and Dolan, an agile environment eschews long-range careful design and planning! And a deep environment allows running sophisticated analytic algorithms on massive data sets without sampling, or perhaps even cleaning. We have made these points elsewhere in this white paper but Cohen and Dolan's paper is a particularly potent, if unusual, argument. Read this paper to get some provocative perspectives! A link to Cohen and Dolan's paper is provided in the references section at the end of this white paper.
The conflict between abstraction and control
In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable abstractions that allow the programmer to focus on database semantics rather than programming directly in Java. But several analysts interviewed for this paper remarked that too much abstraction and too much distancing from where the data actually is stored can be disastrously inefficient. This seems like a reasonable concern when dealing with the very largest data sets, where a bad algorithm could result in run times measured in days. For the breaking wave of the biggest data sets, programming tools will need to allow considerable control over the storage strategy and the processing approaches, but without requiring programming using the lowest level code.

Data warehouse organization changes in the coming decade
The growing importance of big data analytics amounts to something between a midcourse correction and a revolution for enterprise data warehousing. New skill sets, new organizations, new development paradigms, and new technology will need to be absorbed by many enterprises, especially those facing the use cases described in this paper. Not every enterprise needs to jump into the petabyte ocean, but it is this author's prediction that the upcoming decade will see a steady growth in the percentage of large enterprises recognizing the value of big data analytics.
Most observers would agree that big data analytics falls within "information management," but the same observers may quibble about whether this affects the "data warehouse." Rather than worrying about whether the box on the organization chart labeled EDW has responsibility for big data analytics, we take the perspective that enterprise data warehousing without the capital letters absolutely encompasses big data analytics. Having said that, there will be many different organizational structures and management perspectives as industries expand their information management. This kind of tinkering and adjusting to the new paradigm is normal and expected. We went through a very similar phase in the mid 1980s when data warehousing itself was a new paradigm for IT and the business. Many of the most successful early data warehousing initiatives started in the business organizations and were eventually incorporated into those IT organizations that then made major commitments to being business relevant. It is likely the same evolution will take place with big data analytics.
The challenge before information managers in large enterprises is how to encourage three separate data warehouse endeavors: conventional RDBMS applications, MapReduce/Hadoop applications, and advanced analytics.
Technical skill sets required
It is worth repeating here the message of the very first sentence of this white paper. Petabyte scale data sets are of course a big challenge but big data analysis is often about difficulties other than data volume. You can have fast arriving data or complex data or complex analyses which are very challenging even if all you have are terabytes of data!
The care and feeding of RDBMS-oriented data warehouses involves a comprehensive set of skills that is pretty well understood: SQL programming, ETL platform expertise, database modeling, task scheduling, system building and maintenance skills, one or more scripting languages such as Python or Perl, UNIX or Windows operating system skills, and business intelligence tools skills. SQL, which is at the core of an RDBMS implementation, is a declarative language, which contrasts with the mindset of the procedural language skills needed for MapReduce/Hadoop programming, at least in Java. The data warehouse team also needs to have a good partnership with other areas of IT including storage management, security, networking, and support of mobile devices. Finally, good data warehousing also requires an extensive involvement with the business community, and with the cognitive psychology of end-users!
The care and feeding of MapReduce/Hadoop data warehouses, including any of the big data analytics use cases described in this paper, involves a set of skills that only partially overlap traditional RDBMS data warehouse skills. Therein lies a significant challenge. These new skills include lower-level programming languages such as Java, C++, Ruby, Python, and MapReduce interfaces most commonly available via Java. Although the requirement to program via procedural based lower-level programming languages will be reduced significantly during the upcoming decade in favor of Pig, Hive, and HBase, it may be easier to recruit MapReduce/Hadoop application developers from the programming community rather than the data warehouse community, if the data warehouse job applicants lack programming and UNIX skills. If MapReduce/Hadoop data warehouses are managed exclusively with open source tools, then ZooKeeper and Oozie skills will be needed too. Keep in mind that the open-source community innovates quickly. Hive, Pig and HBase are not the last word in high-level interfaces to Hadoop for analysis. It is likely that we will see much more innovation in this decade including entirely new interfaces.
ETL platform providers have a big opportunity to provide much of the glue that will tie
together the big data sources, MapReduce/Hadoop applications, and existing
relational databases. Developers with ETL platform skills will be able to leverage a
great deal of their experience and instincts in system building when they incorporate
MapReduce/Hadoop applications.
Finally, the analysts whom we have described as often working in sandbox
environments will arrive with an eclectic and unpredictable set of skills starting with
deep analytic expertise. For these people it is probably more important to be
conversant in SAS, Matlab, or R than to have specific programming language or
operating system skills. Such individuals typically will arrive with UNIX skills, and
some reasonable programming proficiency, and most of these people are extremely
tolerant of learning new complex technical environments. Perhaps the biggest
challenge with traditional analysts is getting them to rely on the other resources
available to them within IT, rather than building their own extract and data delivery
pipelines. This is a tricky balance because you want to give the analysts unusual
freedom, but you need to look over their shoulders to make sure that they are not
wasting their time.
New organizations required
At this early stage of the big data analytics revolution, there is no question that the
analysts must be part of the business organization, both to understand the
microscopic workings of the business and to be able to conduct the kind of rapid
turnaround experiments and investigations we have described in this paper. As we
have described, these analysts must be heavily supported in a technical sense, with
potentially massive compute power and data transfer bandwidth. So although the
analysts may reside in the business organizations, this is a great opportunity for IT to
gain credibility and presence with the business. It would be a significant mistake and
a lost opportunity for the analysts and their sandboxes to exist as rogue technical
outposts in the business world without recognizing and taking advantage of their deep
dependence on the traditional IT world.
In some organizations we interviewed for this white paper, we saw separate analytic
groups embedded within different business organizations, but without very much
cross-communication or common identity established among the analytic groups. In
some noteworthy cases, this lack of an "analytic community" led to lost opportunities
to leverage each other's work, and led to multiple groups reinventing the same
approaches, and duplicating programming efforts and infrastructure demands as they
made separate copies of the same data.
We recommend that a cross-divisional analytics community be established, mimicking
some of the successful data warehouse community building efforts we have seen in
the past decade. Such a community should have regular cross-divisional meetings, as
well as a kind of private LinkedIn application to promote awareness of all the contacts
and perspectives and resources that these individuals collect in their own
investigations, and a private web portal where information and news events are
shared. Periodic talks can be given, ideally with members of the business
community invited as well, and above all the analytics community needs T-shirts and mugs!
New development paradigms required
Even before the arrival of big data analytics, data warehousing has been transforming
itself to provide more rapid response to new opportunities and to be more in touch
with the business community. Some of the practices of the agile software
development movement have been successfully adopted by the data warehouse
community, although realistically this has not been a highly visible transformation.
But, in particular, the agile development approach supports the data warehouse by
being organized around small teams driven by the business, not typically by IT. An
agile development effort also produces frequent tangible deliveries, deemphasizes
documentation and formal development methodologies, and tolerates midcourse
correction and the incremental acceptance of new requirements. The most sensitive
ingredient for success of agile development projects is the personality and skills of the
business leader who ultimately is in charge. The agile business leader needs to be a
thoughtful and sophisticated observer of the development process and the realities of
the information world. Hopefully the agile business leader is a pretty good manager as
well.
Big data analytics certainly opens the door to business involvement since the central
analysis is probably done in the business environment directly. But it is
unlikely that the professional analyst is the right person to be the overall agile data
warehouse project leader. The agile project leader needs to be well skilled in
facilitating short effective meetings, resolving issues and development choices,
determining the truth of progress reports from individual developers, communicating
with the rest of the organization, and getting funding for initiatives.
Traditional data warehouse development has discovered the attractiveness of
building incrementally from a modest start, but with a good architectural foundation
that provides a blueprint for where future development will go. This author has
described in many papers the techniques for "graceful modification" of dimensional
data warehouse schemas. In a dimensionally modeled data warehouse, new
measurement facts, new dimensional attributes, and even new dimensions can be
added to existing data warehouse applications without changing, invalidating, or
rolling over existing information delivery pipelines to the end users. Many of the use
cases we have described in this paper for big data analytics suggest that new facts,
new attributes, and new dimensions will routinely become available.
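As a minimal sketch of graceful modification, the example below adds a new fact and a new dimension attribute to a toy star schema and confirms that a query written before the change still returns the same answer. The table names, column names, and sample values are all hypothetical, and SQLite stands in for whatever RDBMS is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (product_key INTEGER, sales_amount REAL);
    INSERT INTO product_dim VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales_fact VALUES (1, 10.0), (2, 4.5), (1, 2.5);
""")

# An existing report query, written before any schema change.
EXISTING_QUERY = """
    SELECT d.name, SUM(f.sales_amount)
    FROM sales_fact f JOIN product_dim d USING (product_key)
    GROUP BY d.name ORDER BY d.name
"""
before = conn.execute(EXISTING_QUERY).fetchall()

# Graceful modification: add a new fact and a new dimension attribute.
# Existing rows simply carry NULL in the new fact column.
conn.execute("ALTER TABLE sales_fact ADD COLUMN sentiment_score REAL")
conn.execute("ALTER TABLE product_dim ADD COLUMN category TEXT")
conn.execute("UPDATE product_dim SET category = 'hardware'")

# The old query is untouched and still valid after the change.
after = conn.execute(EXISTING_QUERY).fetchall()

# New queries can use the new attribute immediately.
by_category = conn.execute("""
    SELECT d.category, SUM(f.sales_amount)
    FROM sales_fact f JOIN product_dim d USING (product_key)
    GROUP BY d.category
""").fetchall()
```

Because the existing query names only the columns it needs, adding columns does not invalidate it. A new dimension could be added in the same spirit as a new foreign key column on the fact table.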
Integration of new data sources into a data warehouse has always been a
significant challenge, since often these new data sources arrive without any thought
to integration with existing data sources. This will certainly be the case with big data
analytics. Again for dimensionally modeled data warehouses, this author has
described techniques for incremental integration, where "enterprise dimensional
attributes" are defined and planted in the dimensions of the separate data sources.
We call these conformed dimensions. The development and deployment of
conformed dimensions fits the agile development approach beautifully, since this
kind of integration can be implemented one data source at a time, and one
dimensional attribute at a time, again in a way that is nondestructive to existing
applications. Please see the references section at the end of this white paper for
more information on conformed dimensions.
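The idea can be illustrated with a small sketch, assuming two hypothetical sources (web clicks and store sales) whose customer dimensions use different keys but have both had the same enterprise attribute, here called `segment`, planted in them. All names and values are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical sources with their own customer dimensions and keys.
# Both dimensions carry the same conformed attribute, "segment".
WEB_CUSTOMER_DIM = {101: {"segment": "consumer"}, 102: {"segment": "business"}}
WEB_CLICK_FACTS = [(101, 30), (102, 12), (101, 8)]  # (customer_key, clicks)

STORE_CUSTOMER_DIM = {"A7": {"segment": "business"}, "B2": {"segment": "consumer"}}
STORE_SALES_FACTS = [("A7", 250.0), ("B2", 40.0), ("A7", 50.0)]  # (customer_key, amount)


def summarize(facts, dimension, attribute):
    """Aggregate one source's facts by a dimension attribute."""
    totals = defaultdict(float)
    for key, measure in facts:
        totals[dimension[key][attribute]] += measure
    return dict(totals)


def drill_across(attribute):
    """Query each source separately, then merge on the conformed attribute."""
    clicks = summarize(WEB_CLICK_FACTS, WEB_CUSTOMER_DIM, attribute)
    sales = summarize(STORE_SALES_FACTS, STORE_CUSTOMER_DIM, attribute)
    return {
        value: {"clicks": clicks.get(value, 0.0), "sales": sales.get(value, 0.0)}
        for value in sorted(set(clicks) | set(sales))
    }
```

Calling `drill_across("segment")` lines up clicks and sales by segment even though the two sources share no keys. Each additional conformed attribute can be planted and used in the same way, one source at a time.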
Finally, at least one organization interviewed for this white paper has taken agility to
its logical extreme. Individual developers are given complete end-to-end
responsibility for a project, all the way from original sourcing of the data, through
experimental analysis, re-implementing the project for production use, and working
with the end users and their BI tools in supportive mode. Although this development
approach remains an experiment, early results are very interesting because these
developers feel a significant sense of responsibility and pride for their projects.
Lessons from the early data warehousing era
It took most of the 1990s for organizations to understand what a data warehouse
was and how to build and manage those kinds of systems. Interestingly, at the end
of the 1990s, data warehousing was effectively relabeled as business intelligence.
This was a very positive development because it reflected the need for the business
to own and take responsibility for the uses of data.
The earliest data warehouse pioneers had no choice but to do their own systems
integration, assembling best-of-breed components, and coping with the inevitable
incompatibilities and issues of dealing with multiple vendors. By the end of the 1990s,
the best-of-breed approach gave way to vendor stacks of integrated products, a
trend that continues today. At this point, there are only a few independent
vendors in the data warehouse space, and those vendors have succeeded by
interfacing with nearly every conceivable format and interface, thereby providing
bridges between the more limited proprietary vendor stacks.
With the benefit of hindsight gained from the traditional data warehouse experience,
the big data analytics version of data warehousing is likely to consolidate quite
quickly. Only the bravest organizations with very strong software development skills
should consider rolling their own big data analytics applications directly on raw
MapReduce/Hadoop. For information management organizations wishing to focus on
the business issues rather than on the breaking wave of software development, a
packaged Hadoop distribution (e.g., Cloudera) makes a lot of sense. The leading ETL
platform vendors likely will also introduce packaged environments for handling many
of the phases of MapReduce/Hadoop development.
Analytics in the cloud
This white paper has not discussed cloud implementations of big data analytics. Most
of the enterprises interviewed for this white paper were not using public cloud
implementations for their production analytics. Nevertheless, cloud implementations
may be very attractive in the startup phase for an analytics effort. A cloud service can
provide instant scalability during this startup phase, without committing to a massive
legacy investment in hardware. Data analysis projects can be turned on and turned
off on short notice. Recall that typical analytic environments may involve hundreds of
separate sandboxes and parallel experiments.
Many of the organizations interviewed for this paper stated that mature analytics
should be brought in-house, perhaps implemented technically as a cloud but within
the confines of the organization. Of course, such an in-house cloud may reduce fears
of security and privacy breaches (fairly or not).
A remote cloud implementation raises issues of network bandwidth, especially in a
broadly integrated application with multiple very large data sets in different locations.
Imagine solving the big join problem where your trillion row fact table is out on the
cloud, and your billion row dimension table is located in-house.
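A back-of-the-envelope calculation shows why the location of each table matters. The row width and link speed below are assumptions chosen purely for illustration (100 bytes per row, a 1 Gbit/s link running at full sustained speed); they do not come from the organizations interviewed, and real figures will differ.

```python
def transfer_hours(rows, bytes_per_row, link_gbits_per_sec):
    """Hours needed to move a table across a link at full, sustained speed."""
    total_bits = rows * bytes_per_row * 8
    seconds = total_bits / (link_gbits_per_sec * 1e9)
    return seconds / 3600


# Assumed figures for illustration only.
BYTES_PER_ROW = 100
LINK_GBITS_PER_SEC = 1.0

# Moving the trillion row fact table in-house versus
# moving the billion row dimension table out to the cloud.
fact_hours = transfer_hours(10**12, BYTES_PER_ROW, LINK_GBITS_PER_SEC)
dimension_hours = transfer_hours(10**9, BYTES_PER_ROW, LINK_GBITS_PER_SEC)
```

Under these assumptions, shipping the dimension table takes roughly 13 minutes, while shipping the fact table takes more than nine days. This is why the smaller table is normally the one moved to where the larger table lives.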
Although the best performing systems try to achieve a three-way balance among
CPU, disk speed, and bandwidth, most organizations interviewed for this paper
predicted that bandwidth would emerge as the number one limiting factor for big data
analytics system performance.
Whither EDW?
The enterprise data warehouse must expand to encompass big data analytics as part
of overall information management. The mission of the data warehouse has always
been to collect the data assets of the organization and structure them in a way that is
most useful to decision-makers. Although some organizations may persist with a box
on the org chart labeled EDW that is restricted to traditional reporting activities on
transactional data, the scope of the EDW should grow to reflect these new big data
developments. In some sense there are only two functions of IT: getting the data in
(transaction processing), and getting the data out. The EDW is getting the data out.
The big choice facing shops with growing big data analytics investments is whether to
choose an RDBMS-only solution, or a dual RDBMS and MapReduce/Hadoop
solution. This author predicts that the dual solution will dominate, and in many cases
the two architectures will not exist as separate islands but rather will have rich data
pipelines going in both directions. It is safe to say that both architectures will evolve
hugely over the next decade, but this author predicts that both architectures will share
the big data analytics marketplace at the end of the decade.
Sometimes when an exciting new technology arrives, there is a tendency to close the
door on older technologies as if they were going to go away. Data warehousing has
built an enormous legacy of experience, best practices, supporting structures,
technical expertise, and credibility with the business world. This will be the foundation
for information management in the upcoming decade as data warehousing expands
to include big data analytics.

Acknowledgments
This author is grateful for Informatica's sponsoring of this white paper and for
providing absolutely no "vendor bias." The opinions in this white paper are solely the
responsibility of the author.
A number of smart and knowledgeable big data practitioners made themselves
available during the research phase of the white paper for interviews. These
individuals provided many useful insights. In alphabetic order by organization, we
thank:
Amr Awadallah, Mike Olson, Cloudera
Brian Dolan, Discovix
Oliver Ratzesberger, eBay
Alex Ignatius, Electronic Arts
William Schmarzo, EMC
Ashish Thusoo, Facebook
Julianna DeLua, John Haddad, Sanjay Krishnamurthi, Ron Lunasin, Informatica
Nicholas Wakefield, LinkedIn
Dan Graham, Dilip Krishna, Ron Kunze, Teradata
Professor Michael Franklin, Computer Science Department, U.C. Berkeley
Raymie Stata, Yahoo!
Dan McCaffrey, Ken Rudin, Zynga

References
An Architecture for Data Quality, a Kimball Group White Paper:
http://vip.informatica.com/?elqPURLPage=8784
Essential Steps for the Integrated EDW, a Kimball Group White Paper:
http://vip.informatica.com/?elqPURLPage=8785
Hadoop: The Definitive Guide, 2nd Edition, Tom White, O'Reilly (2011)
Hadoop-ML website: http://videolectures.net/nipsworkshops09_pednault_hmli/
MAD Skills: New Analysis Practices for Big Data, Cohen, Dolan et al.,
http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf
The Data Warehouse Lifecycle Toolkit, 2nd Edition, Kimball et al., Wiley (2008)
