Big Data Integration and Analysis
User Manual

VIDEO: https://www.youtube.com/watch?v=qlHG55S2K7g
Table of Contents

1. Introduction
   1.1 Why Integrate & Analyze Data?
   1.2 Apache Hadoop
   1.3 Apache Spark
   1.4 Authorized use permission
2. System Summary
   2.1 System Configuration
   2.2 Architecture Diagram
3. Integration Job Specification Language
   3.1 IJSL Grammar
   3.2 Examples
4. Application
   4.1 Running it
   4.2 Settings
   4.3 Data Integration
       4.3.1 Specifying values manually
             4.3.1.1 Transformation Operations
             4.3.1.2 Restriction operation
             4.3.1.3 Output files
       4.3.2 Loading an existing IJSL script
       4.3.3 Writing an IJSL script
   4.4 Data Analysis
       4.4.1 Load files from Hadoop File System
       4.4.2 Selecting files from Hadoop File System
       4.4.3 Displaying field names from selected files
       4.4.4 Input query
       4.4.5 Sample results
5. Possible flows for Data Integration
   5.1 Specifying values manually
   5.2 Loading an existing IJSL script
   5.3 Writing an IJSL script
6. Possible flows for Data Analysis
7. References
1. Introduction

These days, data streams from each and every activity of daily life: from phones, credit cards, televisions and computers; from sensor-equipped buildings, GPS, trains, buses, planes, bridges, factories, and so on. The data flows so fast that the total accumulation of the past two years, a zettabyte, dwarfs the prior record of human civilization. This huge amount of data is very important, as it contains a lot of useful information; and considering the volume, velocity and variety of data, cleaning and analyzing big data is a big challenge.

A real example of such a challenge can be seen in e-commerce companies, such as Amazon, which have huge amounts of customer-related data. This data is crucial to any company, which is why they are ready and eager to spend a big portion of their budget on analyzing data so they can, among other things, build solid predictive models. For e-commerce companies, one concern is to list the products most sought by their customers, so they can predict which class of customers (per age, for example) tends to buy what in which period of time. [1][2]
1.1 Why Integrate & Analyze Data?

Extracting, cleaning, and loading data are three core steps in the Data Integration process. Data-reshaping programs are difficult to write because of their complexity, but they are required because each analytic tool expects data in a very specific form, and getting the data into that form typically requires a whole series of cleaning, normalization, reformatting, integration, and restructuring operations.

Analyzing Big Data is used to find meaning and discover hidden relationships in Big Data. The technological advances in the storage, processing, and analysis of Big Data include the rapidly decreasing cost of storage and CPU power in recent years; the flexibility and cost-effectiveness of data centers and cloud computing for elastic computation and storage; and the development of new frameworks such as Hadoop and Spark, which allow users to take advantage of distributed computing systems storing large quantities of data through flexible parallel processing. [1], [3], [4]
1.2 Apache Hadoop

Apache Hadoop [5] is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

The core of Apache Hadoop consists of:
- Storage part: Hadoop Distributed File System (HDFS).
- Processing part: Hadoop MapReduce.

Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code to the nodes, which process it in parallel based on the data each node holds. Hadoop MapReduce, alongside built-in solutions like Hive and Pig, has been used intensively for years in batch processing chains. One such chain is ETL-like data integration programs.
1.3 Apache Spark

Apache Spark [6] is a cluster computing platform designed to be fast and general-purpose. On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is, for instance, very important in interactive computations where responses are needed within a few seconds. This includes querying datasets and running iterative programs, such as the ones found in Machine Learning. Therefore, Spark comes with native in-memory data processing that greatly speeds up these types of computing.
1.4 Authorized use permission

This system is designed for lab project study in Enterprise Information Systems.

2. System Summary

2.1 System Configuration

- Ubuntu 14.04
- Hadoop 2.6
- Spark 1.4
- Scala 2.1
- Maven
- Eclipse IDE
- Java JDK 1.8
2.2 Architecture Diagram

Figure 1: Architecture of Integration and Analysis of Data [7].
3. Integration Job Specification Language

We propose a language with which users can describe an integration job. We name it Integration Job Specification Language, IJSL for short.
3.1 IJSL Grammar

An IJSL script consists of keywords and parameters.

KEYWORDS

There are 10 keywords (which can be written in uppercase or lowercase; for clarity, we write them in uppercase throughout this manual):

INPUTFILE, OUTPUTFILE, SEPARATOR, PROJECTEDCOLUMNS, PROJECTEDNAMES, MERGE, SPLIT, CASE, FORMATDATES, RESTRICTION

OBLIGATORY          OPTIONAL
INPUTFILE           MERGE
OUTPUTFILE          SPLIT
SEPARATOR           CASE
PROJECTEDCOLUMNS    FORMATDATES
PROJECTEDNAMES      RESTRICTION

Table 2: Some keywords are obligatory and others are optional.

PARAMETERS

Parameters are defined with the following structure:

Par1|Par2|...

If we don't want to define parameters on an optional keyword, we use the keyword EMPTY.

Example:

MERGE EMPTY

or we simply don't write that statement.
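To make the keyword/parameter structure concrete, here is a minimal sketch in Python of how an IJSL script could be read into a keyword-to-parameters map. This is purely illustrative and is not the application's actual parser; the `parse_ijsl` helper and its behaviour are assumptions for this example only.

```python
# Illustrative sketch only -- NOT the application's actual IJSL parser.
# One statement per line: a keyword (case-insensitive) followed by a
# "|"-separated parameter list; EMPTY means "no parameters".

KEYWORDS = {"INPUTFILE", "OUTPUTFILE", "SEPARATOR", "PROJECTEDCOLUMNS",
            "PROJECTEDNAMES", "MERGE", "SPLIT", "CASE", "FORMATDATES",
            "RESTRICTION"}

def parse_ijsl(script):
    """Return a dict mapping each keyword to its list of parameters."""
    job = {}
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()
        if keyword not in KEYWORDS:
            raise ValueError("unknown keyword: " + keyword)
        # EMPTY (on optional keywords) means the operation is skipped.
        params = [] if rest.strip().upper() == "EMPTY" else rest.split("|")
        job[keyword] = params
    return job

job = parse_ijsl("""INPUTFILE Customers.csv
SEPARATOR ,
PROJECTEDCOLUMNS 1|3
MERGE EMPTY""")
```

Under these assumptions, `job["PROJECTEDCOLUMNS"]` would be `["1", "3"]` and `job["MERGE"]` would be the empty list, reflecting the EMPTY convention described above.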
INPUTFILE
Defines the input CSV file name.
Example:
INPUTFILE Customers.csv

OUTPUTFILE
Defines the output CSV file name.
Example:
OUTPUTFILE CustomersOutput.csv

SEPARATOR
Defines the delimiter of the input file. It can only be one character. The most commonly used ones are Comma (,), Semicolon (;), Pipe (|) and Caret (^).
Example:
SEPARATOR ,

PROJECTEDCOLUMNS
Defines the indexes of the columns to be projected (with 0 being the first index).
Part 1: First projected column.
Part 2: Second projected column.
Part N: Nth projected column.
Example:
PROJECTEDCOLUMNS 1|3

PROJECTEDNAMES
Defines the names of the columns to be projected.
Part 1: First projected column name.
Part 2: Second projected column name.
Part N: Nth projected column name.
Example:
PROJECTEDNAMES Name|City
[MERGE]
Defines the indexes of the two columns to be merged and the merge character.
Part 1: First column index.
Part 2: Second column index.
Part 3: Merge character.
Example:
MERGE 0|1|

[SPLIT]
Defines the index of the column to be split and the split character.
Part 1: Column index.
Part 2: Split character.
Example:
SPLIT 2|_

[CASE]
Defines the index of the column whose case is to be changed.
Part 1: Column index.
Part 2: 0 for UPPERCASE, 1 for LOWERCASE.
Example:
CASE 1|0

[FORMATDATES]
Defines the index of the date column to be formatted.
Part 1: Column index.
Part 2: One of DD/MM/YYYY, MM/DD/YYYY, YYYY/MM/DD.
Example:
FORMATDATES 3|MM/DD/YYYY
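To illustrate what a date-formatting operation like this does to a single value, here is a small Python sketch. The `reformat_date` helper is hypothetical (the application itself performs this as a MapReduce transformation), and it assumes the source format of the value is known:

```python
# Hypothetical helper showing the effect of reformatting one date value
# between the three supported patterns (not the application's code).
from datetime import datetime

PATTERNS = {"DD/MM/YYYY": "%d/%m/%Y",
            "MM/DD/YYYY": "%m/%d/%Y",
            "YYYY/MM/DD": "%Y/%m/%d"}

def reformat_date(value, source, target):
    """Parse `value` using `source` pattern, render it in `target` pattern."""
    return datetime.strptime(value, PATTERNS[source]).strftime(PATTERNS[target])
```

For instance, `reformat_date("31/12/2015", "DD/MM/YYYY", "MM/DD/YYYY")` yields "12/31/2015".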
[RESTRICTION]
Defines the index of the column to be restricted (filtered), the operator used, and the value. As the example shows, several restrictions can be chained by repeating the three parts.
Part 1: Column index.
Part 2: For numeric values: =, <>, >, <, >=, <=. For textual values: EQUAL, NOTEQUAL, CONTAINS.
Part 3: Value.
Example:
RESTRICTION 0|>=|20|0|<|50
3.2 Examples

Our example CSV input file is called Customers.csv and has the following header of 6 columns:
Customer_ID|Name|Address|City|ZipCode|Phone

A. We want as output 3 columns, Customer_ID, Address, City, where Customer_ID values are greater than or equal to 20 and less than 50.

INPUTFILE Customers.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 0|2|3
PROJECTEDNAMES Customer_ID|Address|City
RESTRICTION 0|>=|20|0|<|50

B. We want as output all the columns, but rename some of them (Customer_ID -> ID, ZipCode -> Postal_Code) and change the Name column to capital letters.

INPUTFILE Customers.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 0|1|2|3|4|5
PROJECTEDNAMES ID|Name|Address|City|Postal_Code|Phone
MERGE EMPTY
SPLIT EMPTY
CASE 1|0
FORMATDATES EMPTY
RESTRICTION EMPTY

C. We want as output only the Name column, but split into two columns, First_Name and Last_Name.

INPUTFILE Customers1.csv
OUTPUTFILE Output.csv
SEPARATOR |
PROJECTEDCOLUMNS 1|1000
PROJECTEDNAMES First_Name|Last_Name
MERGE EMPTY
SPLIT 1|
CASE EMPTY
FORMATDATES EMPTY
RESTRICTION EMPTY
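To see what Example A computes, the row-level logic can be sketched in Python on a few in-memory rows. The sample rows below are invented for illustration, and the real job runs as a MapReduce program over HDFS; only the projection and restriction logic is shown here:

```python
# Illustrative simulation of Example A on invented sample rows
# (hypothetical data; the real job runs on HDFS via MapReduce).
rows = [
    "10|Ann Lee|12 Main St|Paris|75001|555-0101",
    "25|Bob Ray|3 Oak Ave|Lyon|69001|555-0102",
    "60|Cam Fox|9 Elm Rd|Nice|06000|555-0103",
]

projected = [0, 2, 3]              # PROJECTEDCOLUMNS 0|2|3
out = []
for row in rows:
    fields = row.split("|")        # SEPARATOR |
    cid = int(fields[0])
    if 20 <= cid < 50:             # RESTRICTION 0|>=|20|0|<|50
        out.append("|".join(fields[i] for i in projected))
```

Only the second sample row satisfies both restrictions, so `out` holds the single projected row "25|3 Oak Ave|Lyon".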
4. Application

To install the application, unzip the file llama.zip into the path /home/hduser/Desktop/. It will copy the following files:

- llama.jar
- llama.jpg
- Settings

4.1 Running it

On a Linux terminal, locate the path where Spark is installed, in this case /home/miguel/Downloads/spark-1.4.1-bin-hadoop2.6/, and execute the following command:

./bin/spark-submit --class eis.lab.groupb.MainDriver --master spark://master:7077 /home/hduser/Desktop/llama.jar

Options:
--class: Class that contains the main function.
--master: Location of the Spark cluster. It should be set to local if we want to run Spark locally.
Finally, we have to indicate the path where the jar is located.
4.2 Settings

The application has the list of settings shown in Figure 2. To edit them, go to the Settings tab and the following form will appear.

Figure 2

Integration Module
- URI: URI where the HDFS is.
- Input files path: Path for the input files.
- Temp files path: Path for the temp files.
- Output files path: Path for the output files.

Analysis Module
- URI: URI where the HDFS is.
- Input files path: Path for the input files.
- Output files path: Path for the output files.
4.3 Data Integration

In this section we describe the Data Integration phase. There are three possible ways to do an integration job:

- Specifying values manually.
- Loading an existing IJSL script.
- Writing an IJSL script.

4.3.1 Specifying values manually

In Figure 3:
1) Select the first tab, named Integration of Data.
2) Select a file from the list of files already uploaded on HDFS.
3) Click the Select button.
4) Define the separator character being used; by default the value is comma (,).

Figure 3

In Figure 4, a list of Desired columns will appear, from which:
1) Select the columns needed for the integration.
2) Click the Select button.
Figure 4

In Figure 5, three sections appear, corresponding respectively to Transformation Operations, Restriction Operations, and Output Files. Sections 3 and 4 are optional; section 5 is obligatory.
Figure 5

4.3.1.1 Transformation Operations

The software offers four transformation operations using MapReduce (merging columns, splitting columns, changing the case of columns, and formatting date columns), plus one for renaming headers.

In Figure 6, for merging:
1) Select the Merge tab in Transformation Operations.
2) Select the two columns which you want to merge.
3) Define the merging character; the one used by default is a blank space.
4) Write the name of the new column in the text box located after the equals sign (=).
5) Click the Proceed button.

Figure 6

In Figure 7, for splitting:
1) Select the column Split tab.
2) Select the column name to be split.
3) Write the names of the two new columns after the equals sign (=).
4) Define the splitting character; the one used by default is a blank space.
5) Click Proceed.

Figure 7

In Figure 8, for casing:
1) Select the Casing tab.
2) Select the desired column.
3) Select UpperCase or LowerCase.
4) Click Proceed.
Figure 8

In Figure 9, for formatting:
1) Select the Formatting tab.
2) Select the desired column to format; it must be a date column.
3) Select the desired date format for the related column. Three date formats are supported: DD/MM/YYYY, MM/DD/YYYY, YYYY/MM/DD.
4) Click Proceed.

Figure 9

In Figure 10, for renaming columns:
1) Select the Rename tab.
2) Select the desired column to be renamed.
3) Write the new desired name.
4) Click Proceed.

It is also possible to rename more columns by clicking on the + button.

Figure 10
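The row-level effect of the Merge, Split and Case operations above can be sketched as follows. These are hypothetical one-row helpers for illustration only; the application performs the operations as MapReduce jobs, and the exact column-placement behaviour (e.g. where the merged column ends up) is an assumption here:

```python
# Illustrative row-level sketches of Merge, Split and Case
# (assumed semantics; not the application's MapReduce code).
# A row is represented as a list of field strings.

def merge(fields, i, j, ch):
    """Join columns i and j with ch; assume the result replaces column i."""
    f = list(fields)
    f[i] = f[i] + ch + f[j]
    del f[j]
    return f

def split(fields, i, ch):
    """Split column i on the first occurrence of ch into two columns."""
    left, _, right = fields[i].partition(ch)
    return fields[:i] + [left, right] + fields[i + 1:]

def change_case(fields, i, mode):
    """mode 0 = UPPERCASE, 1 = LOWERCASE (as in the CASE keyword)."""
    f = list(fields)
    f[i] = f[i].upper() if mode == 0 else f[i].lower()
    return f
```

For example, `merge(["John", "Smith", "Paris"], 0, 1, " ")` gives `["John Smith", "Paris"]`, and `split(["10", "John_Smith"], 1, "_")` gives `["10", "John", "Smith"]`.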
4.3.1.2 Restriction operation

In the Restriction Operations section, we can restrict the output according to some criteria.
For numeric values: =, <>, >, <, >=, <=.
For textual values: EQUAL, NOTEQUAL, CONTAINS.

In Figure 11:
1) Select the desired column.
2) Select the criterion to compare with.
3) Write the value against which you want to compare.
4) Click the Proceed button.

It is also possible to add more restrictions by clicking on the + button.

Figure 11
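The operators listed above can be given a concrete reading with a short Python sketch. The `satisfies` helper is an assumption for illustration (it is not the application's code), and it assumes numeric operands are compared as numbers:

```python
# Assumed semantics of the restriction operators (illustrative only).
import operator

NUMERIC = {"=": operator.eq, "<>": operator.ne, ">": operator.gt,
           "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def satisfies(value, op, ref):
    """Return True if the column `value` passes the restriction (op, ref)."""
    if op in NUMERIC:
        return NUMERIC[op](float(value), float(ref))
    if op == "EQUAL":
        return value == ref
    if op == "NOTEQUAL":
        return value != ref
    if op == "CONTAINS":
        return ref in value
    raise ValueError("unknown operator: " + op)
```

For instance, `satisfies("25", ">=", "20")` is True, and `satisfies("Paris", "CONTAINS", "ar")` is True.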
4.3.1.3 Output files

In Figure 12, we define the output files.
1) Select the desired columns.
2) Write the desired name for the output file.
3) Press the Run button.

It is also possible to generate more files by clicking on the + button.

Figure 12

After the integration job is done, a form opens, as shown in Figure 13. If you want to save the job as an IJSL script, click Yes; otherwise click No.

Figure 13

In Figure 14, a dialog will appear:
1) Write the name of the file.
2) Click Save.

Figure 14
4.3.2 Loading an existing IJSL script

You can also load an existing IJSL script file and run it.

In Figures 15 and 16:
1) Click the Load button.
2) Select the file.
3) Click the Load button.

Figure 15

Figure 16

In Figure 17:
Verify the IJSL script and modify it if needed.
1) Click Run.

Figure 17
4.3.3 Writing an IJSL script

The structure of the grammar for IJSL scripts is explained in Section 3.

In Figures 18 and 19:
1) Click Write.
2) Write the IJSL script.
3) Click Run.

Figure 18

Figure 19

In Figure 20, we can also save our new IJSL script:
1) Click Save.
2) Write the name of the file.
3) Click Save.
Figure 20

4.4 Data Analysis

In this section we describe the Data Analysis phase.

4.4.1 Load files from Hadoop File System

The system will load the list of files currently in the input directory specified in the Settings file, as shown in Figure 21.

Figure 21
4.4.2 Selecting files from Hadoop File System

In Figure 22:
1) Define the separator (default is comma).
2) Select at least one file.
3) Click the Add button.

Figure 22

4.4.3 Displaying field names from selected files

Each table will be registered in Spark with the name of its file and will be displayed along with its field names to help the user formulate queries, as shown in Figure 23.

Figure 23

4.4.4 Input query

In Figure 24, a text area will be displayed for the user:
1) Input the desired query.
2) Input a name under which to save the results in the HDFS.
3) Click the Execute button.

Figure 24

4.4.5 Sample results

Only a small portion of the results is shown inside the Results box; the rest is saved to disk. As shown in Figure 25.

Figure 25
5. Possible flows for Data Integration

All possible flows for a single execution are listed here. If a desired task is not listed, it may be possible to achieve it with two executions. For example, for Merge and Split, first do a Merge job and later do a Split job.
5.1 Specifying values manually

1. Input file > Desired columns > Output files.
2. Input file > Desired columns > Restriction operations > Output files.
3. Input file > Desired columns > Transformation operations [MERGE] > Output file*.
4. Input file > Desired columns > Transformation operations [SPLIT] > Output file*.
5. Input file > Desired columns > Transformation operations [CASING] > Output files.
6. Input file > Desired columns > Transformation operations [CASING] > Transformation operations [FORMATTING] > Output files.
7. Input file > Desired columns > Transformation operations [CASING] > Restriction operations > Output file*.
8. Input file > Desired columns > Transformation operations [CASING] > Transformation operations [FORMATTING] > Restriction operations > Output file.
9. Input file > Desired columns > Transformation operations [FORMATTING] > Output file.
10. Input file > Desired columns > Transformation operations [FORMATTING] > Restriction operations > Output file.
11. Input file > Desired columns > Transformation operations [RENAME] > Output file.
12. Input file > Desired columns > Transformation operations [RENAME] > Restriction operations > Output file.

5.2 Loading an existing IJSL script

1. Load > Run.

5.3 Writing an IJSL script

1. Write > Run.
2. Write > Save > Run.

6. Possible flows for Data Analysis

1. List files > Select files > Display table names and fields > Input query > Save file > Display partial results.
7. References

[1] http://www.oracle.com/technetwork/database/options/advancedanalytics/bigdataanalyticswpoaa1930891.pdf
[2] http://harvardmagazine.com/2014/03/whybigdataisabigdeal
[3] http://uscisii2.github.io/papers/knoblock13sbd.pdf
[4] https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_Security_Intelligence.pdf
[5] https://en.wikipedia.org/wiki/Apache_Hadoop
[6] http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358624_sampler.pdf
[7] http://www.glennklockwood.com/data-intensive/hadoop/mapreduce-workflow.png