Professional Documents
Culture Documents
OSCAR ROMERO
DTIM Research Group (http://www.essi.upc.edu/dtim/)
Universitat Politècnica de Catalunya - BarcelonaTech
Alicante, 11th July, 2016
11 July 2016 2
Introduction to Big Data
11 July 2016 3
A New Business Model
Traditionally, databases have been seen as a passive asset
OLTP systems: Data gathered is structured to facilitate (automate) daily operations
The relational model as de facto standard
Soon, many realized data is a valuable asset for any organization. So, use it!
Decisional systems: Stored data is analysed to better understand our activity (I want to know)
Data warehousing as de facto standard
11 July 2016 4
Instagram's Fable
11 July 2016 5
Other Examples
The overkilling approach
Facebook: Facebook + Instagram + WhatsApp +
Google: Android + Google Search + Calendar + Gmail + Doodle +
Even if most of us are not Facebook or Google we can still benefit from data
Cross available data (companies fusion, buying data, open data, agreements, etc.)
Digitalise the organization processes (e.g., the national health system, tax collection, e-banking, etc.)
Monitor the user (phone apps, internet navigation, service usage, wearables, RFID, etc.)
Sometimes, in an indirect way (e.g., provide (free) services to learn habits; such as free wi-fi to geolocate the user)
Sensors (Smart Cities, Internet of Things, etc.)
Bottom line: most of the time, the most interesting data is not available (innovation comes into play!)
11 July 2016 6
Data as the New Cornerstone
We have witnessed the boom of a new business model based on data analytics: Data is not a passive
but an active asset
"Data is the new oil!" (Clive Humby, 2006)
"No! Data is the new soil" (David McCandless, 2010)
The effective use of data to make decisions gave rise to the data-driven society concept
The confluence of three major socio-economic and technological trends makes data-driven
innovation a new phenomenon today. These three trends include (OECD):
The exponential growth in data generated and collected,
the widespread use of data analytics including start-ups and small and medium enterprises (SMEs), and
the emergence of a paradigm shift in knowledge
Organizations must adapt their infrastructures to benefit from the data deluge
Digital data doubling every 18 months (IDC)
Innovation is mandatory!
11 July 2016 7
Same Purpose; Different Means
The economic and social role of data is not new
Economic and social activities have long evolved around the analysis and use of data
In business, concepts such as Business Intelligence and Data Warehousing already emerged in the
1960s and became popular in the late 1980s
11 July 2016 8
The Business Intelligence Lifecycle
Business strategy
11 July 2016 9
The DW Exploitation
Typically, the analysis of the data has been considered at three different levels of detail
Querying & Reporting: Static report generation
OLAP: Dynamic summarizations of data
Data Mining and Machine Learning: Inference of hidden patterns or trends
11 July 2016 10
The DW Exploitation
The same analysis levels, illustrated with a data-mining example: k-means clustering of the iris dataset in R
> (kc <- kmeans(newiris, 3))
K-means clustering with 3 clusters of sizes 38, 50, 62
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.006000    3.428000     1.462000    0.246000
3     5.901613    2.748387     4.393548    1.433871
Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3
 [59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
 [88] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1
[117] 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1
[146] 1 3 1 1 3
11 July 2016 11
Types of Analysis
Descriptive: Deterministic, not probabilistic
Computing summarizations, counts, min, max, etc.
Typical OLAP operations
Predictive: Probabilistic by nature. Tries to forecast what may happen according to what has happened
Linear and non-linear regression,
Classification,
Clustering,
Association rules,
etc.
Prescriptive: Given the prediction(s) of one or several models, understand why something is happening and
undertake automatic action(s). Examples:
Stock market indicators (to buy or sell shares)
Automatically increase / decrease prices
11 July 2016 12
Data Warehousing Vs. Big Data
Both are decision support systems (DSS)
Big Data can be seen as the evolution of Data Warehousing ecosystems to incorporate external data into
the decision-making processes of the organization as a first-class citizen
External Vs. Internal data
Semi-structured and unstructured data as first-class citizens
Dynamic Vs. static data sources
Not in control Vs. well-controlled sources
On-demand data quality threshold
Lightweight transformations Vs. heavy transformations
On-demand Vs. static goals
Load-first model-later (Data Lake) Vs. schema-fixed approach (DW)
And many other consequences
Semantic-aware solutions
Privacy
Etc.
11 July 2016 13
Big Data [Management]
Definition based on the limitations traditional DSS (such as Data Warehousing) cannot overcome
Velocity
Volume
Variety
11 July 2016 14
Big Data [Analytics]
Consider the previous slide and now recall we previously classified data analysis as:
Query & reporting, OLAP, data mining / machine learning
Descriptive, predictive, prescriptive
This classification still applies to Big Data but with a subtle change:
Query & Reporting + OLAP -> Small Analytics
Data Mining & Machine Learning -> Big Analytics
11 July 2016 15
An Example: BigBench
11 July 2016 16
NOSQL: A Paradigm Shift
FROM RELATIONAL TO NOSQL
11 July 2016 17
The End of an Architectural Era
Navigate
Search
11 July 2016 18
The End of an Architectural Era
Real time
Big Data
Concurrency
Unstructured and structured data (users)
11 July 2016 19
RDBMS: One Size Fits All
Mainly write-intensive systems (e.g., OLTP)
Data storage
Normalization
Queries
Indexes: B+, Hash
Joins: BNL, RNL, Hash-Join, Merge Join
11 July 2016 20
Distributed Data Management
(to the rescue)
Too many reads? Replication
FRAGMENTATION
REPLICATION
11 July 2016 21
Why Distribution?
How long does it take to read 1TB of data from disk in a centralized database compared to a
shared-nothing and shared-memory distributed system?
Centralized: 1,000,000 MB / 100 MB/s = 10,000 s ~ 166 minutes
Distributed (100 nodes): 1,000,000 MB / (100 nodes x 100 MB/s) = 100 s ~ 90 seconds
Assuming an ideal scenario with linear scalability
In a shared-memory environment, as the size of data increases, the CPU of the
computer is typically overwhelmed at some point by the data flow and is slowed down
In a shared-nothing environment this does not happen (as 100 CPUs are available),
but network communication is required
Extracted from Web Data Management (Abiteboul et al.): An Introduction to Distributed Systems
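The arithmetic above can be sanity-checked in a few lines. This is our own sketch; it hard-codes the slide's assumptions of 1 TB = 1,000,000 MB and a sustained transfer rate of 100 MB/s per disk.

```java
public class ScanTime {
    // seconds to scan totalMB from a single disk
    static double centralizedSeconds(double totalMB, double mbPerSec) {
        return totalMB / mbPerSec;
    }
    // seconds under ideal linear scalability across `nodes` disks
    static double distributedSeconds(double totalMB, double mbPerSec, int nodes) {
        return totalMB / (mbPerSec * nodes);
    }
    public static void main(String[] args) {
        double c = centralizedSeconds(1_000_000, 100);      // 10,000 s (~166 min)
        double d = distributedSeconds(1_000_000, 100, 100); // 100 s
        System.out.printf("centralized: %.0f s (%.1f min)%n", c, c / 60);
        System.out.printf("distributed: %.0f s%n", d);
    }
}
```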
11 July 2016 22
Why Distribution? Some Reflections
With current technologies
Disk transfer rate is a bottleneck for batch processing of large scale datasets; parallelization and distribution
can help to eliminate this bottleneck
It is faster to read the memory of a computer in your LAN (1GB/s) than your own disk (100MB/s)
Distributing data (i.e., by fragmenting and allocating it on a cluster) enables parallel processing
Several machines working together to solve a problem
Direct application of the divide-and-conquer principle
11 July 2016 23
Distributed RDBMS
Distributed RDBMS (DRDBMS) are not flexible enough:
Logging (persistent redo log, undo log) - ACID properties
CLI interfaces (JDBC, ODBC, etc.) - Data structure
Concurrency control (locking) - ACID properties
Latching associated with multi-threading - Architectural limitation
Two-phase commit (2PC) protocol in distributed transactions - ACID properties
Buffer management (caching disk pages) - Architectural limitation
Variable-length record management (locate records in a page, locate fields in a record) - Data structure
"The End of an Architectural Era (It's Time for a Complete Rewrite)". Michael Stonebraker et al.
VLDB Proceedings, 2007
11 July 2016 24
A New Type of Distributed Systems
NOSQL (Not Only SQL) systems have the following goals:
Schemaless: No explicit schema [new data structures] [distribution]
Reliability / availability: Keep delivering service even if its software or hardware components fail
Scalability: Continuously evolve to support a growing amount of tasks [distribution]
Efficiency: How well the system performs, usually measured in terms of response time (latency) and
throughput (bandwidth) [distribution]
11 July 2016 25
NOSQL Three Pillars
1. Distributed Data Management
Divide-and-conquer principle
Use (and abuse) of parallelism
11 July 2016 26
Distributed Data Management
Distributed DB design
Node distribution
Data fragments
Data replication
Distributed DB catalog
Fragmentation trade-off: Location
Global or local for each node
Centralized in a single node or distributed
Single-copy vs. Multi-copy
Distributed query processing
Data distribution / replication
Communication overhead
Security issues
Network security
Example: horizontal fragmentation of a People(Name, Age, Gender, Birthplace) table
F1: (Oscar, 34, Male, Lleida), (Anna, 35, Female, Barcelona)
F2: (Mercè, 45, Female, Gavà), (Carme, 22, Female, Terrassa)
F3: (Miquel, 27, Male, Mataró), (John, 55, Male, Glasgow)
11 July 2016 27
Distributed Database Design
Given a DB and its workload, how should the DB be split and allocated to sites so as to optimize
certain objective functions?
Minimize resource consumption for query processing
11 July 2016 28
Global Catalog Management
Centralized version (@master)
Accessing it is a bottleneck
Single-point failure
May add a mirror
Poorer performance
11 July 2016 29
Global Catalog Example: HDFS
11 July 2016 30
Global Catalog Example: HBase
LSM: Distributed B+ Tree
Key
11 July 2016 31
Decide Data Allocation
Given a set of fragments, a set of sites on which a number of applications are running, allocate
each fragment such that some optimization criterion is met (subject to certain constraints)
It is known to be an NP-hard problem
The optimal solution depends on many factors
Location in which the query originates
The query processing strategies (e.g., join methods)
Furthermore, in a dynamic environment the workload and access pattern may change
The problem is typically simplified with certain assumptions (e.g., only communication cost
considered)
Typical approaches build cost models and any optimization algorithm can be adapted to solve it
Heuristics are also available (e.g., best-fit for non-replicated fragments)
Sub-optimal solutions
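The best-fit heuristic mentioned above can be sketched in a few lines. Everything here is illustrative: the cost model is reduced to access frequencies only, and the matrix values are made up.

```java
import java.util.Arrays;

// Best-fit for non-replicated fragments: place each fragment at the site
// whose applications access it most often, ignoring all other cost factors.
public class BestFit {
    // accesses[site][fragment] = access frequency of that fragment from that site
    static int[] allocate(int[][] accesses) {
        int sites = accesses.length, fragments = accesses[0].length;
        int[] placement = new int[fragments];
        for (int f = 0; f < fragments; f++) {
            int best = 0;
            for (int s = 1; s < sites; s++)
                if (accesses[s][f] > accesses[best][f]) best = s;
            placement[f] = best;   // site with the highest local access count
        }
        return placement;
    }
    public static void main(String[] args) {
        int[][] acc = { {90, 5}, {10, 70} };   // 2 sites, 2 fragments (made up)
        System.out.println(Arrays.toString(allocate(acc)));  // [0, 1]
    }
}
```

A sub-optimal answer, as the slide notes: the heuristic ignores join co-location and communication costs entirely.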
11 July 2016 32
Managing Data Replication
Replicating fragments improves the system throughput but raises some other issues:
Consistency
Update performance
Spectrum: Strong Consistency <-> Eventually Consistent
11 July 2016 33
Trade-off: Performance Vs. Consistency
Consistency (ratio of correct answers) vs. Performance (system throughput)
11 July 2016 34
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P)
Example: Eager replication -> ACID
Update: C -> OK; A -> NO; P -> OK
35
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).
Example: Lazy replication -> BASE
Update: C -> NO; A -> OK; P -> OK
36
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).
Example: Eager replication, giving up partition tolerance -> ????
Update: C -> OK; A -> OK; P -> NO
37
CAP Theorem Revisited
The CAP theorem is not about choosing two out of the three forever and ever
Distributed systems are not always partitioned
Without partitions: CA
Otherwise
Detect a partition
Normally by means of latency (time-bound connection)
Enter an explicit partition mode limiting some operations choosing either:
CP (i.e., ACID by means of, e.g., 2PC or Paxos) or,
If a partition is detected, the operation is aborted
AP (i.e., BASE)
The operation goes on and we will tackle this next
If AP was chosen, enter a recovery process commonly known as partition recovery (e.g., compensate mistakes and get rid of
inconsistencies introduced)
Achieve consistency: Roll-back to consistent state and apply ops in a deterministic way (e.g., using time-stamps)
Reduce complexity by only allowing certain operations (e.g., Google Docs)
Commutative operations (concatenate logs, sort and execute them)
Repair mistakes: Restore invariants violated
Last writer wins
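The "last writer wins" repair rule above can be sketched as follows. The class and names are ours for illustration, not any particular system's API: after an AP partition heals, conflicting versions of a value are reconciled by keeping the one with the latest timestamp.

```java
import java.util.Comparator;
import java.util.List;

public class LastWriterWins {
    // A replica's version of some value, stamped at write time
    record Version(String value, long timestamp) {}

    // Keep the version with the highest timestamp; all others are discarded
    static Version reconcile(List<Version> conflicting) {
        return conflicting.stream()
                .max(Comparator.comparingLong(Version::timestamp))
                .orElseThrow();
    }
    public static void main(String[] args) {
        Version merged = reconcile(List.of(
                new Version("written-in-partition-A", 1000),
                new Version("written-in-partition-B", 1500)));
        System.out.println(merged.value());   // written-in-partition-B
    }
}
```

Note the trade-off: the rule is deterministic and cheap, but the losing write is silently lost, which is why the slide lists it under "repair mistakes" rather than as a general solution.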
11 July 2016 38
Flexible Data Models (Schemaless)
Relational databases take the relational model for granted
The NOSQL wave presents other data models to boost performance and flexibility
I. Key-value: Hadoop (HDFS, HBase), Cassandra, Voldemort, etc.
II. Document (kind of key-value): MongoDB, CouchDB, etc.
III. Graph-based: Neo4J, Giraph, GraphX, etc.
IV. Streams: Apache Flink, Spark Streaming, etc.
V. Etc.
11 July 2016 39
New Architectures
NOSQL introduces critical reasoning about the reference database architecture
ALL relational databases follow the System R architecture (late 70s!)
11 July 2016 40
Most Typical Architectural Solutions
Primary indexes to implement the global catalog
Distributed B+: WiredTiger, HBase, etc.
Consistent Hashing: Voldemort, MongoDB (until 2.X), etc.
Bloom filters to avoid distributed look ups
In-memory processing
Columnar block iteration: Vertical fragmentation + fixed-size values + compression (run-length encoding)
Heavily exploited by column-oriented databases
Only for read-only workloads
Sequential reads
Key design
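Consistent hashing, named above as one way to implement the global catalog, can be sketched with a sorted ring: keys and nodes are hashed onto the same circular space, and a key lives on the first node clockwise from its hash. The hash function and node names here are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHash {
    // ring position -> node; TreeMap gives us clockwise lookups via ceilingEntry
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // Owner of a key: first node at or after the key's position, wrapping around
    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
    // Toy non-negative hash; real systems use a stronger function (e.g., MD5)
    private int hash(String s) { return s.hashCode() & 0x7fffffff; }

    public static void main(String[] args) {
        ConsistentHash ch = new ConsistentHash();
        ch.addNode("node-1"); ch.addNode("node-2"); ch.addNode("node-3");
        String owner = ch.nodeFor("user:42");
        ch.removeNode(owner);   // only the keys owned by that node move
        System.out.println(owner + " -> " + ch.nodeFor("user:42"));
    }
}
```

The point of the structure: adding or removing a node relocates only the keys in one arc of the ring, so there is no central index to update and no global rehash.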
11 July 2016 41
Most Typical Architectural Solutions
Random Vs. Sequential Reads
Get the most out of databases by boosting sequential reads
Enables pre-fetching
Option to maximize the effective read ratio (through good DB design)
http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
11 July 2016 42
Example: NewSQL
IDEA: For OLTP systems, RDBMS can also be outperformed
11 July 2016 43
Example: the Data Lake
IDEA: Load-first, model-later
Modeling at load time restricts the potential analysis that can be done later (Big Analytics)
Store raw data and create on-demand views to address precise analysis needs
11 July 2016 44
Example: The Lambda-Architecture
Batch layer Serving layer
Batch view
11 July 2016 45
Example: Polyglot Systems
IDEA: Federate different kinds of NOSQL databases in a single system
46
Is It About New Things?
No, it is not. Current architectures do not introduce a single new concept
Old theoretical findings are now used or broadly used
Primary indexes
Sequential reads
Vertical partitioning
Compression (e.g., run length encoding)
Fixed-size values
In-memory processing
Bloom filters
Polyglot systems
Big Data brings critical thinking on traditional assumptions and tears them down. Nevertheless, Big
Data solutions are fully built on top of database theory
47
11 July 2016 48
Watch Out! The Problem is NOT SQL
SQL is a query language. SQL is cool! Not a problem
The problem is that relational systems are too generic
OLTP: stored procedures and simple queries
OLAP: ad-hoc complex queries
Documents: large objects
Streams: time windows with volatile data
Scientific: uncertainty and heterogeneity
But the overhead of RDBMS has nothing to do with SQL
Low-level, record-at-a-time interface is not the solution
SQL Databases vs. NoSQL Databases
Michael Stonebraker
Communications of the ACM, 53(4), 2010
11 July 2016 49
Data Management
DATA LIFECYCLE
11 July 2016 50
Data Management
Data management refers to the tasks a database management system (DBMS) must provide to cover the
data lifecycle within an IT system. In this context, we focus on:
Ingestion: the means provided to insert /upload / put data into the DBMS (in SQL, the INSERT command)
Modeling: the conceptual data structures used to arrange data within the DBMS (e.g., tables in the relational
model, graphs in graph modelling, etc.)
Storage: the physical data structures used to persist data in the DBMS (e.g., hash, b-tree, heap files, etc.)
Processing: the means provided (many times, in algebraic form) to manipulate data once stored in the DBMS
physical data structures (in SQL, the relational algebra)
Querying / fetching: the means provided to allow the DBMS user to specify the data processing she would like to
perform (in SQL, SELECT queries). It is typically triggered in terms of the conceptual data structures
In Big Data settings, it refers to the same concepts but assuming a NOSQL system is behind
Typically, a distributed system
Possibly with an alternative data model to the relational one
Implementing ad-hoc architectural solutions
11 July 2016 51
Data Management
Ingestion: e.g., "John, BCN, 33; Maria, BCN, 22" is loaded into the system
Querying: e.g., Get(name = John) triggers Data Processing over the stored data
11 July 2016 52
Apache Hadoop as Example
Ingestion
Modeling and Storage
Processing (Big Analytics)
Querying (Small Analytics)
11 July 2016 53
Data Ingestion
RELATIONAL DBMS
Data must be in the same machine
INSERT INTO (DML command)
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o ODBC (JDBC in the general case)
o Database command (typically from a file)
HADOOP ECOSYSTEM
Data must reach the cluster first (SCP, FTP, etc.) and from there be inserted into the Big Data system
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o API (less standardised than ODBC)
o Database command (typically from a file)
11 July 2016 54
Data Modeling
RELATIONAL DBMS HADOOP ECOSYSTEM
11 July 2016 55
Impedance Mismatch
11 July 2016 56
Impedance Mismatch
11 July 2016 57
Key-Value
Characteristics
Entries in form of key-values (can be seen as enormous hash tables)
One key maps only to one value
Query on key only
Schemaless
key -> value
11 July 2016 58
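The "enormous hash table" view of the key-value model above can be made concrete with a minimal in-memory sketch. This is ours for illustration; real key-value stores persist the table and distribute it across a cluster.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class KeyValueStore {
    // One key maps to exactly one opaque value
    private final Map<String, byte[]> table = new HashMap<>();

    void put(String key, byte[] value) { table.put(key, value); }   // upsert

    // Query on key only: no secondary access paths exist in this model
    Optional<byte[]> get(String key) {
        return Optional.ofNullable(table.get(key));
    }
    public static void main(String[] args) {
        KeyValueStore kv = new KeyValueStore();
        kv.put("user:1", "Oscar,34,Lleida".getBytes());   // value is schemaless bytes
        System.out.println(new String(kv.get("user:1").orElseThrow()));
    }
}
```

Note what is missing by design: no schema on the value and no way to ask "which keys hold users from Lleida?" without scanning everything, which is exactly the trade-off the slide describes.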
Document
Characteristics
Entries in form of key-values where the value is an XML or JSON document
One key maps only to one document
It is possible to index any of the document entries (in JSON, the JSON keys; in XML, the XML tags)
Query on key or indexed document entries
Schemaless
Suitable for
Data produced in XML / JSON format (web data)
Unstructured data whose schema is difficult to predict
The main way to access data is by key, and the whole document is processed together
Systems
MongoDB, CouchDB
11 July 2016 59
Graph
Characteristics
Data stored as nodes and edges
Relationships as first-class citizens
Schemaless
Suitable for
Unstructured data whose schema is difficult to predict
Topology queries (based on the shape of the graph)
Systems
Neo4J, Titan, Sparksee
http://grouplens.org/datasets/movielens/
11 July 2016 60
Streams
Characteristics
Strong temporal locality
The whole dataset is not available, only a portion of it (the window)
Suitable for
Real-time applications
Data per se is not important; the goal is to gain insight (e.g., sensor data)
Approximate answers are enough
Systems
Apache Flink, Spark Streaming, Storm
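The window concept above can be sketched with a count-based sliding window over a stream of readings; the window size and sensor values are made up, and real engines (Flink, Spark Streaming) also offer time-based windows.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindow {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();

    SlidingWindow(int size) { this.size = size; }

    // Ingest one reading and return the average over the current window only:
    // the rest of the (unbounded) stream is never materialized
    double push(double reading) {
        window.addLast(reading);
        if (window.size() > size) window.removeFirst();   // evict the oldest reading
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }
    public static void main(String[] args) {
        SlidingWindow w = new SlidingWindow(3);
        double avg = 0;
        for (double r : new double[]{10, 20, 30, 40}) avg = w.push(r);
        System.out.println(avg);   // average of {20, 30, 40} = 30.0
    }
}
```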
11 July 2016 61
Data Storage
RELATIONAL DBMS
Generic architecture that can be tuned according to the needs:
Mainly write-intensive systems (e.g., OLTP): Normalization; Indexes: B+, Hash; Joins: BNL, RNL, Hash-Join, Merge Join
Read-only systems (e.g., DW): Denormalized data; Indexes: Bitmaps; Joins: Star-join; Materialized views
HADOOP ECOSYSTEM
Concepts that gained weight: Primary indexes; Sequential reads; Vertical partitioning; Compression (e.g., run-length encoding); Fixed-size values; In-memory processing; Bloom filters; Polyglot systems
Such architectures are very specific and very good at solving a specific problem
11 July 2016 62
Data Processing
RELATIONAL DBMS HADOOP ECOSYSTEM
11 July 2016 63
The Ancestor of them All: MapReduce
MapReduce
HDFS
File System
RDBMS
HBase
11 July 2016 64
MapReduce Basics
11 July 2016 65
MapReduce: WordCount Example
<#line, text> -> Map -> (word, 1) pairs -> Merge-Sort -> [The,[1,1,1,1,]] -> Reduce -> (The, 57631)
Sample Map output: (The,1) (Project,1) (Gutemberg,1) (Ebook,1) (of,1) (The,1) (Outline,1) (Of,1) (Science,1) (Vol.,1) (1,1) ((of,1) (4),1) (by,1)
11 July 2016 66
MapReduce: WordCount Example
// Mapper: input key = byte offset of the line (LongWritable),
// input value = the line itself (Text); emits (word, 1) for every token
public void map(LongWritable key, Text value) {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    write(new Text(tokenizer.nextToken()), new IntWritable(1));
  }
}
// Reducer: input key = a word, input values = all the 1s emitted for it;
// emits (word, total count)
public void reduce(Text key, Iterable<IntWritable> values) {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  write(key, new IntWritable(sum));
}
11 July 2016 67
while (tokenizer.hasMoreTokens()) {
KEY
write(new Text(tokenizer.nextToken()), VALUE
new IntWritable(1));
}
}
KEY
public void reduce(Text key, VALUE
Iterable<IntWritable> values) {
int sum = 0;
for (IntWritable val : values) {
BLACKBOX
sum += val.get();
}
KEY
write(key, VALUE
new IntWritable(sum));
}
11 July 2016 67
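The map, merge-sort and reduce phases above can be simulated in a single process with plain Java, which makes the key/value flow easy to trace. This is a minimal sketch, not Hadoop code: the class and method names (`WordCountSketch`, `wordCount`) are invented for illustration.

```java
import java.util.*;

// Single-process sketch of the MapReduce WordCount data flow:
// map emits <word, 1>, the framework groups values by key (merge-sort),
// and reduce sums the grouped values. Illustrative only; no Hadoop involved.
public class WordCountSketch {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a <word, 1> pair per token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(Map.entry(tokenizer.nextToken(), 1));
            }
        }
        // Merge-sort (shuffle) phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: sum the values grouped under each key
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The Project Gutenberg Ebook of", "The Outline Of Science");
        // prints {Ebook=1, Gutenberg=1, Of=1, Outline=1, Project=1, Science=1, The=2, of=1}
        System.out.println(wordCount(lines));
    }
}
```

Note that, as in real MapReduce, "The", "Of" and "of" count as different keys: the grouping is purely by key equality, and any normalization (lowercasing, punctuation stripping) is the map function's job.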
Main Processing Approaches
Distributed frameworks: Thought to run on a cluster and exploit parallelism
MapReduce, Spark (bear in mind Spark-R and Spark MLlib are built on top of Spark), Presto, ...
Centralized processing on distributed storage: Precise results on small sets
Metis, Hadoop-R, ...
Graph processing: Thought to deal with graph-like data
Distributed: Pregel / Giraph, GraphLab, GraphLINQ, ...
Centralized: GraphChi, Neo4J, ...
Stream processing: Deal with data streams
Spark Streaming, Apache Flink, Storm, ...
Recall some issues due to Big Data characteristics: in distributed settings, approximations are fine and real-time answers are preferred; centralized approaches yield precise results in batch processing but need to deal with small inputs (<= 0.5 GB as a rule of thumb)
11 July 2016 68
Workflow Orchestrators
There is no single data processing framework that is always the best (we learnt that from NOSQL!)
Workflow orchestrators are another abstraction layer, above the data processing frameworks, that, given a query, decide which data processing framework is the most adequate
Workflow Orchestrators (Oozie, Musketeer)
Ingestion | Processing | Modeling and Storage | Querying
11 July 2016 69
Workflow Orchestrators
Current workflow orchestrators are rather poor: Oozie
But there are attempts at smarter approaches: the ideas behind Musketeer deserve special attention
[Diagram: Musketeer sits between the query frameworks and the data processing frameworks]
11 July 2016 70
Data Querying
RELATIONAL DBMS
SQL is the only way to query the database, no matter if through ODBC or built-in commands or syntactic sugar
SELECT ... FROM ... WHERE ...
Declarative languages are COOL because they lower the entry barrier: learn the language and use the database
Did you hear about data processing before? Thank SQL for that
HADOOP ECOSYSTEM
Querying is done in a programmatic way using the operators provided by the data processing framework (i.e., programming in MapReduce, Spark, etc.)
Programming is typically done in Java, Python or Scala
The translation from the query to a procedural access plan is not transparent as it was before (now you do the job)
Unfortunately, very few attempts to develop declarative languages like SQL for Big Data:
Hive / Spark SQL and Pig / Spork for Hadoop
Cypher for Neo4J (graph database)
SparkR / MLlib / MLBase
11 July 2016 71
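To see what "now you do the job" means, consider a declarative query such as SELECT city, COUNT(*) FROM users WHERE age >= 18 GROUP BY city. Without a declarative layer, the programmer writes the access plan (filter, group, aggregate) by hand. A minimal sketch in plain Java streams, with an invented table (`User`) and invented column names purely for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Hand-written procedural plan for:
//   SELECT city, COUNT(*) FROM users WHERE age >= 18 GROUP BY city
// In an RDBMS the optimizer derives this plan from the declarative query;
// in a programmatic framework the programmer spells out each operator.
public class QueryByHand {

    public record User(String city, int age) {}

    public static Map<String, Long> adultsPerCity(List<User> users) {
        return users.stream()
                .filter(u -> u.age() >= 18)                    // WHERE age >= 18
                .collect(Collectors.groupingBy(User::city,     // GROUP BY city
                        TreeMap::new,
                        Collectors.counting()));               // COUNT(*)
    }

    public static void main(String[] args) {
        List<User> users = List.of(
                new User("Barcelona", 25), new User("Barcelona", 17),
                new User("Alacant", 30));
        System.out.println(adultsPerCity(users)); // prints {Alacant=1, Barcelona=1}
    }
}
```

The same operator pipeline (filter, group by key, aggregate) is what one would code in MapReduce or Spark, only distributed across a cluster: this is exactly the entry barrier that Hive, Spark SQL and Pig try to remove by accepting a declarative query and generating the plan automatically.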
Cloud Services
PROVIDING ACCESS TO INFRASTRUCTURE
11 July 2016 72
Analogy: Electricity as a Utility
Pay-per-use
11 July 2016 73
Computation as a Utility
11 July 2016 74
Cloud Computing (Definition)
"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." (NIST definition)
11 July 2016 75
Management Improvement
11 July 2016 76
Undercapacity Risk
11 July 2016 77
Benefits of Cloud Computing
Benefits for deploying in a cloud environment
Resolve problems related to updating/upgrading 39%
11 July 2016 78
Levels of Service
The company outsources some responsibility to the service provider. These levels are
incremental and thus, SaaS implies PaaS and PaaS implies IaaS
Infrastructure as a Service (IaaS)
You get a server to which you can connect through remote connection protocols (VPN, SSH, FTP, etc.)
Typically it covers the hardware (computers, network, virtualization, etc.)
Platform as a Service (PaaS)
You get those software modules needed to run applications (databases, web servers, security, etc.)
Software as a Service (SaaS)
Besides IaaS and PaaS some software is there ready to be used (e.g., Google Docs, Dropbox, etc.)
11 July 2016 79
Share of Responsibility
[Diagram: four stacks (on-premises, IaaS, PaaS, SaaS) showing layers such as the application, security and the servers shifting from "you manage" to "provider manages" as the service level increases]
11 July 2016 80
Service Providers
International:
Amazon Web Services (e.g., Redshift)
Microsoft Azure
Google Cloud Platform
Etc.
11 July 2016 81
11 July 2016 81
Thanks! Any Question?
OROMERO@ESSI.UPC.EDU
HOMEPAGE: HTTP://WWW.ESSI.UPC.EDU/DTIM/PEOPLE/OROMERO
TWITTER: @ROMERO_M_OSCAR
11 July 2016 82