
Big Data Management

OSCAR ROMERO

DTIM RESEARCH GROUP (HTTP://WWW.ESSI.UPC.EDU/DTIM/)

UNIVERSITAT POLITÈCNICA DE CATALUNYA - BARCELONATECH

ALICANTE, 11TH JULY, 2016

Big Data Summer Courses


Table of contents
1. Introduction
What is Big Data?
The need for a paradigm shift: from relational databases to NOSQL
The main steps of the data lifecycle within IT systems
The Big Data Ecosystems and the technological stack
Cloud Services

11 July 2016 2
Introduction to Big Data

A New Business Model
Traditionally, databases have been seen as a passive asset
OLTP systems: Data gathered is structured to facilitate (automate) daily operations
The relational model as the de facto standard

Soon, many realized data is a valuable asset for any organization. So, use it!
Decisional systems: Stored data is analysed to better understand our activity ("I want to know")
Data warehousing as the de facto standard

Instagram's Fable

Other Examples
The overkill approach
Facebook: Facebook + Instagram + Whatsapp + LinkedIn + ...
Google: Android + Google Search + Calendar + Gmail + Doodle + ...

Even if most of us are not Facebook or Google, we can still benefit from data
Cross available data (company mergers, buying data, open data, agreements, etc.)
Digitalise the organization's processes (e.g., the national health system, tax collection, e-banking, etc.)
Monitor the user (phone apps, internet navigation, service usage, wearables, RFID, etc.)
Sometimes, in an indirect way (e.g., provide (free) services to learn habits, such as free wi-fi to geolocate the user)
Sensors (Smart Cities, Internet of Things, etc.)

Bottom line: most of the time, the most interesting data is not available (innovation comes into play!)

Data as the New Cornerstone
We have witnessed the bloom of a new business model based on data analytics: Data is not a passive
but an active asset
"Data is the new oil!" (Clive Humby, 2006)
"No! Data is the new soil" (David McCandless, 2010)
The effective use of data to make decisions gave rise to the "data-driven society" concept
The confluence of three major socio-economic and technological trends makes data-driven
innovation a new phenomenon today. These three trends are (OECD):
The exponential growth in data generated and collected,
the widespread use of data analytics, including by start-ups and small and medium enterprises (SMEs), and
the emergence of a paradigm shift in knowledge
Organizations must adapt their infrastructures to benefit from the data deluge
Digital data is doubling every 18 months (IDC)
Innovation is mandatory!

Same Purpose; Different Means
The economic and social role of data is not new
Economic and social activities have long evolved around the analysis and use of data
In business, concepts such as Business Intelligence and Data Warehousing already emerged in the
1960s and became popular in the late 1980s

Decision support systems refer to any IT system supporting decision making


The Business Intelligence Lifecycle as cornerstone

The Business Intelligence Lifecycle
[Figure: The Business Intelligence lifecycle. Operational sources (accounting, marketing, human resources, stock management) feed a data warehouse through Extraction, Transformation and Load (ETL); the data warehouse supports decision making through OLAP, data mining and reports, all driven by the business strategy.]
The DW Exploitation
Typically, the analysis of the data has been considered at three different levels of detail
Querying & Reporting: Static report generation
OLAP: Dynamic summarizations of data
Data Mining and Machine Learning: Inference of hidden patterns or trends

The DW Exploitation
Example of the data mining / machine learning level: k-means clustering of the iris data in R.

> (kc <- kmeans(newiris, 3))
K-means clustering with 3 clusters of sizes 38, 50, 62

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.006000    3.428000     1.462000    0.246000
3     5.901613    2.748387     4.393548    1.433871

Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3
 [59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
 [88] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1
[117] 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1
[146] 1 3 1 1 3

Within cluster sum of squares by cluster:
[1] 23.87947 15.15100 39.82097

Available components:
[1] "cluster" "centers" "withinss" "size"

Example from http://www.rdatamining.com/examples/kmeans-clustering

Types of Analysis
Descriptive: Deterministic, not probabilistic
Computing summarizations, counts, min, max, etc.
Typical OLAP operations

Predictive: Probabilistic by nature. Tries to forecast what may happen based on what has happened
Linear and non-linear regression,
Classification,
Clustering,
Association rules,
etc.

Prescriptive: Given the prediction(s) of one or several model(s), understand why something is happening and
undertake automatic action(s). Examples:
Stock market indicators (to buy or sell shares)
Automatically increase / decrease prices

Data Warehousing Vs. Big Data
Both are decision support systems (DSS)
Big Data can be seen as the evolution of Data Warehousing ecosystems to incorporate external data into
the decision-making processes of the organization as a first-class citizen
External Vs. Internal data
Semi-structured and unstructured data as first-class citizens
Dynamic Vs. static data sources
Not-in-control Vs. well-controlled sources
On-demand data quality threshold
Lightweight transformations Vs. heavy transformations
On-demand Vs. static goals
Load-first model-later (Data Lake) Vs. schema-fixed approach (DW)
And many other consequences
Semantic-aware solutions
Privacy
Etc.

Big Data [Management]
Definition based on the limitations that traditional DSS (such as Data Warehousing) cannot overcome
Velocity
Volume
Variety

Afterwards, other Vs have been added to highlight other relevant issues typical of DSS
Variability
Validity / Veracity
Value

These problems map to traditional database problems


Same theory, techniques and algorithms; different approach
(Figure from IBM, "Understanding Big Data")

Big Data [Analytics]
Consider the previous slide and now recall we previously classified data analysis as:
Query & reporting, OLAP, data mining / machine learning
Descriptive, predictive, prescriptive

This classification still applies to Big Data but with a subtle change:
Query & Reporting + OLAP: "Small Analytics"
Data Mining & Machine Learning: "Big Analytics"

"What does Big Data Mean?"
Michael Stonebraker
Communications of the ACM blog, 2012
An Example: BigBench

NOSQL: A Paradigm Shift
FROM RELATIONAL TO NOSQL

The End of an Architectural Era
Navigate, Search
Small data sets; Personal Computers

WEB 1.0: the "Read" Era

The End of an Architectural Era
Real time, Big Data
Concurrency
Unstructured and structured data (from users)

WEB 2.0: the "Write" Era

RDBMS: One Size Fits All
Mainly write-intensive systems (e.g., OLTP)
Data storage
Normalization
Queries
Indexes: B+, Hash
Joins: BNL, RNL, Hash-Join, Merge-Join

Read-only systems (e.g., DW)
Data storage
Denormalized data
Queries
Indexes: Bitmaps
Joins: Star-join
Materialized views

Distributed Data Management
(to the rescue)
Too many reads? Replication
Too many writes? Fragmentation

[Figure: fragmentation splits the data across nodes; replication keeps copies of the data on several nodes]

Why Distribution?
How long does it take to read 1TB of data from disk in a centralized database compared to a
shared-nothing and a shared-memory distributed system?
Centralized: ~1,000,000 MB at 100 MB/s ~ 166 minutes
Distributed (100 nodes): each node reads 1/100 of the data in parallel ~ 90 seconds
Assuming an ideal scenario with linear scalability
In a shared-memory environment, as the size of data increases, the CPU of the computer is
typically overwhelmed at some point by the data flow and is slowed down
In a shared-nothing environment this does not happen (as 100 CPUs are available),
but network communication is required

Extracted from Web Data Management (Abiteboul et al.): An Introduction to Distributed Systems
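The back-of-envelope arithmetic can be checked directly (assuming, as the slide does, 1 TB taken as roughly 10^6 MB and a sustained single-disk transfer rate of 100 MB/s):

```python
# Back-of-envelope check of the read-time figures above
data_mb = 1_000_000        # ~1 TB expressed in MB (assumption: 10^6 MB)
rate_mb_s = 100            # sustained single-disk transfer rate (MB/s)

centralized_s = data_mb / rate_mb_s
print(centralized_s / 60)  # ~166 minutes for one disk

nodes = 100                # ideal linear scalability: each node reads 1/100
distributed_s = centralized_s / nodes
print(distributed_s)       # 100 s, the same order as the ~90 s quoted
```

The distributed figure assumes perfectly linear scalability; in practice, coordination and network transfer make it somewhat worse.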

Why Distribution? Some Reflections
With current technologies
Disk transfer rate is a bottleneck for batch processing of large scale datasets; parallelization and distribution
can help to eliminate this bottleneck
It is faster to read the memory of a computer in your LAN (1GB/s) than your own disk (100MB/s)
Distributing data (i.e., by fragmenting and allocating it on a cluster) enables parallel processing
Several machines working together to solve a problem
Direct application of the divide-and-conquer principle

So good news, isn't it? WAIT...

Now the network comes into play and, like CPUs, it may get overwhelmed (it is a shared resource)
When possible, data should be accessed where it is stored (or near it) to avoid costly data exchange over the network (data locality)
Distributed RDBMS do not scale up to hundreds of machines in a cluster
New solutions required (NOSQL, welcome!)

Distributed RDBMS
Distributed RDBMS (DRDBMS) are not flexible enough. Sources of overhead and their cause:
Logging (persistent redo log, undo log): ACID properties
CLI interfaces (JDBC, ODBC, etc.): Data structure
Concurrency control (locking): ACID properties
Latching associated with multi-threading: Architectural limitation
Two-phase commit (2PC) protocol in distributed transactions: ACID properties
Buffer management (caching disk pages): Architectural limitation
Variable-length record management (locate records in a page, locate fields in a record): Data structure

"The End of an Architectural Era. (It's Time for a Complete Rewrite)". Michael Stonebraker et al.
VLDB Proceedings, 2007

A New Type of Distributed Systems
NOSQL (Not Only SQL) systems have the following goals:
Schemaless: No explicit schema [new data structures] [distribution]
Reliability / availability: Keep delivering service even if its software or hardware components fail
Scalability: Continuously evolve to support a growing amount of tasks [distribution]
Efficiency: How well the system performs, usually measured in terms of response time (latency) and
throughput (bandwidth) [distribution]

NOSQL Three Pillars
1. Distributed Data Management
Divide-and-conquer principle
Use (and abuse) of parallelism

2. Flexible data models
Reduce the impedance mismatch
Reduce the code overhead to make transformations between different data models

3. New architectural solutions
Relational database architectures date back to the 70s
Some modules may not be needed (e.g., concurrency control in read-only environments)
Some modules need to be rethought (e.g., from disk-based to memory-based architectures)

Distributed Data Management
Distributed DB design
Node distribution
Data fragments
Data replication

Distributed DB catalog
Fragmentation trade-off: Location
Global or local for each node
Centralized in a single node or distributed
Single-copy vs. Multi-copy

Distributed query processing
Data distribution / replication
Communication overhead

Distributed transaction management
How to enforce (or not) the ACID properties
Replication trade-off: Queries vs. Data consistency between replicas (updates)
Distributed recovery system
Distributed concurrency control system

Security issues
Network security

Example table People, horizontally fragmented into F1, F2 and F3:

People   Name    Age  Gender  Birthplace
F1  I1   Oscar   34   Male    Lleida
    I2   Anna    35   Female  Barcelona
F2  I3   Mercè   45   Female  Gavà
    I4   Carme   22   Female  Terrassa
F3  I5   Miquel  27   Male    Mataró
    I6   John    55   Male    Glasgow
Distributed Data Management
The same design issues, illustrated on the People example:

[Figure: the fragments are allocated over a five-node cluster connected by a network: F1 at Node1 and Node2; F2 at Node2 and Node3; F3 at Node3 and Node4. A master node holds the catalog (F1: @N1, @N2; F2: @N2, @N3; F3: @N3, @N4). A query Q(People[Name]) is decomposed into subqueries Q1(F1[Name]), Q2(F2[Name]) and Q3(F3[Name]), each sent to a node holding the corresponding fragment, and the partial results are combined into the final answer.]
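The decomposition sketched above can be made concrete with a toy Python example. All names here (the catalog layout, node identifiers, fragment contents) are illustrative, not any real system's API: the master catalog maps each fragment to the nodes holding a replica, the query is rewritten into one subquery per fragment, and the coordinator unions the partial answers.

```python
# Hypothetical master catalog: fragment -> nodes holding a replica
catalog = {"F1": ["N1", "N2"], "F2": ["N2", "N3"], "F3": ["N3", "N4"]}

fragments = {  # horizontal fragments of People (Name, Age), toy data
    "F1": [("Oscar", 34), ("Anna", 35)],
    "F2": [("Merce", 45), ("Carme", 22)],
    "F3": [("Miquel", 27), ("John", 55)],
}

def decompose():
    """Pick one replica per fragment. A real optimizer would also weigh
    data locality and network cost when choosing among replicas."""
    return {frag: nodes[0] for frag, nodes in catalog.items()}

def execute_name_query():
    """Q(People[Name]): each chosen node scans its local fragment and
    projects Name; the coordinator unions the partial results."""
    plan = decompose()
    result = []
    for frag in plan:
        result.extend(name for name, _age in fragments[frag])
    return result

print(execute_name_query())
```

In a real system the per-fragment scans would run in parallel on the remote nodes; here they are simulated sequentially for clarity.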
Distributed Database Design
Given a DB and its workload, how
should the DB be split and allocated to
sites so as to optimize certain objective
functions
Minimize resource consumption for query
processing

Two main issues:
Data fragmentation
Data allocation
Data replication

Global Catalog Management
Centralized version (@master)
Accessing it is a bottleneck
Single-point failure
May add a mirror
Poorer performance

Distributed version (several masters)
Replica synchronization
Potential inconsistencies

Global Catalog Example: HDFS

Global Catalog Example: HBase
LSM: Distributed B+ Tree
A B+ Tree variant with a better balance to deal with writes

Data is horizontally fragmented
Dynamic fragmentation, distributed on a cluster of machines or the cloud
Specific catalog
Tuples are lexicographically sorted according to the key
Key of the LSM: each row (entry) consists of <key, loc>. Key: the last key value in that fragment. Loc: the physical address of that fragment
Insertions are performed in memory
Delta store: stores insertions in memory
Every time the delta store is full, it is flushed to disk
Compactions are periodically run

http://www.cs.umb.edu/~poneil/lsmtree.pdf
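The delta store / flush / compaction cycle can be sketched in a few lines of Python. This is a deliberately tiny illustration of the LSM idea, not HBase's actual implementation; all class and parameter names are invented:

```python
class TinyLSM:
    """Toy LSM store: an in-memory delta store plus immutable sorted runs."""

    def __init__(self, flush_threshold=3):
        self.delta = {}                  # in-memory delta store
        self.runs = []                   # sorted, immutable "on-disk" runs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.delta[key] = value
        if len(self.delta) >= self.flush_threshold:
            # Delta store full: flush it to disk as a sorted run
            self.runs.append(sorted(self.delta.items()))
            self.delta = {}

    def get(self, key):
        if key in self.delta:            # delta store holds the newest values
            return self.delta[key]
        for run in reversed(self.runs):  # then check runs, newest first
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Periodic compaction: merge all runs, keeping the newest value per key
        merged = {}
        for run in self.runs:
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    db.put(k, v)
print(db.get("a"))  # 9: the delta store shadows the older flushed value
```

Real LSM trees flush to sorted files with index blocks and merge runs level by level; the shadowing of old values by newer ones is the essential idea.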

Decide Data Allocation
Given a set of fragments, a set of sites on which a number of applications are running, allocate
each fragment such that some optimization criterion is met (subject to certain constraints)
It is known to be an NP-hard problem
The optimal solution depends on many factors
Location in which the query originates
The query processing strategies (e.g., join methods)
Furthermore, in a dynamic environment the workload and access pattern may change
The problem is typically simplified with certain assumptions (e.g., only communication cost
considered)
Typical approaches build cost models and any optimization algorithm can be adapted to solve it
Heuristics are also available: (e.g., best-fit for non-replicated fragments)
Sub-optimal solutions
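The best-fit heuristic mentioned above can be sketched in one line of Python. The access counts below are made up for illustration: under the simplifying assumption that only communication cost matters, each non-replicated fragment is placed at the single site that accesses it most, minimizing remote accesses.

```python
# accesses[fragment][site] = number of accesses to that fragment from that site
# (hypothetical workload statistics)
accesses = {
    "F1": {"S1": 40, "S2": 5,  "S3": 10},
    "F2": {"S1": 2,  "S2": 30, "S3": 25},
    "F3": {"S1": 8,  "S2": 8,  "S3": 50},
}

# Best-fit: allocate each fragment to the site with the most accesses to it
allocation = {frag: max(sites, key=sites.get) for frag, sites in accesses.items()}
print(allocation)  # {'F1': 'S1', 'F2': 'S2', 'F3': 'S3'}
```

This is sub-optimal in general (it ignores join co-location and changing workloads), which is exactly why the slide calls it a heuristic.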

Managing Data Replication
Replicating fragments improves the system throughput but raises some other issues:
Consistency
Update performance

Most used replication protocols
Eager Vs. Lazy replication
Primary Vs. Secondary copy versioning
Strong consistency Vs. Eventual consistency

Trade-off: Performance Vs. Consistency
[Figure: trade-off curve; relaxing consistency (ratio of correct answers) increases performance (system throughput), and vice versa.]

CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C), equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).

Example: an update under eager replication during a network partition (ACID):
C -> OK
A -> NO
P -> OK
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C), equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).

Example: an update under lazy replication (BASE):
C -> NO
A -> OK
P -> OK
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C), equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).

Example: an update under eager replication, assuming the network never partitions (????):
C -> OK
A -> OK
P -> NO
CAP Theorem Revisited
The CAP theorem is not about choosing two out of the three forever and ever
Distributed systems are not always partitioned

Without partitions: CA
Otherwise:
Detect a partition
Normally by means of latency (time-bound connection)
Enter an explicit partition mode limiting some operations, choosing either:
CP (i.e., ACID by means of, e.g., 2PC or Paxos): if a partition is detected, the operation is aborted
AP (i.e., BASE): the operation goes on and the inconsistencies are tackled later
If AP was chosen, enter a recovery process commonly known as partition recovery (e.g., compensate mistakes and get rid of
the inconsistencies introduced)
Achieve consistency: roll back to a consistent state and apply the operations in a deterministic way (e.g., using timestamps)
Reduce complexity by only allowing certain operations (e.g., Google Docs)
Commutative operations (concatenate logs, sort and execute them)
Repair mistakes: restore the violated invariants
Last writer wins
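The "last writer wins" repair can be sketched in a few lines of Python (illustrative only; the timestamps and values are made up). After a partition heals, conflicting versions of a value collected from different replicas are reconciled by keeping the one with the newest timestamp:

```python
def reconcile(versions):
    """versions: list of (timestamp, value) pairs from different replicas.
    Last-writer-wins: keep the value with the newest timestamp."""
    return max(versions, key=lambda tv: tv[0])[1]

# Two replicas accepted conflicting writes independently during the partition
divergent = [(1001, "alice@old.example"), (1005, "alice@new.example")]
print(reconcile(divergent))  # alice@new.example: the later write wins
```

Note that last-writer-wins silently discards the older write, which is exactly the kind of "repair mistake" trade-off the slide alludes to; it is acceptable only when losing a concurrent update is tolerable.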

Flexible Data Models (Schemaless)
Relational databases take the relational model for granted
The NOSQL wave presents other data models to boost performance and flexibility
I. Key-value: Hadoop (HDFS, HBase), Cassandra, Voldemort, etc.
II. Document (a kind of key-value): MongoDB, CouchDB, etc.
III. Graph-based: Neo4J, Giraph, GraphX, etc.
IV. Streams: Apache Flink, Spark Streaming, etc.
V. Etc.
V. Etc.

New Architectures
NOSQL introduces a critical reasoning on the reference database architecture
ALL relational databases follow the System R architecture (late 70s!)

11 July 2016 40
Most Typical Architectural Solutions
Primary indexes to implement the global catalog
Distributed B+: WiredTiger, HBase, etc.
Consistent Hashing: Voldemort, MongoDB (until 2.X), etc.
Bloom filters to avoid distributed look ups
In-memory processing
Columnar block iteration: Vertical fragmentation + fixed-size values + compression (run length
encoding)
Heavily exploited by column-oriented databases
Only for read-only workloads
Sequential reads
Key design

11 July 2016 41
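The Bloom filters listed above avoid distributed lookups because a negative answer is always exact. A minimal sketch (hypothetical class; the bit-array size and the k hash functions are assumptions, with the k hashes simulated by salting `hashCode()`): `mightContain` returning `false` means the key is definitely not stored on that node, so the remote lookup can be skipped; `true` may be a false positive.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: no false negatives, occasional false positives.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    public BloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Simulate k independent hash functions by salting the key.
    private int hash(String key, int seed) {
        return Math.floorMod((key + "#" + seed).hashCode(), size);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(hash(key, i));
    }

    // false => key definitely absent (skip the distributed lookup);
    // true  => key probably present (the lookup is still needed).
    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(hash(key, i))) return false;
        return true;
    }
}
```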
Most Typical Architectural Solutions
Random vs. Sequential Reads
Get the most out of databases by boosting sequential reads
Enables pre-fetching
Option to maximize the effective read ratio (through a good db design)

http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/

11 July 2016 42
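The consistent hashing listed earlier among the primary-index options can be sketched as a ring of nodes (a hypothetical minimal version; real implementations place many virtual nodes per server and use a stronger hash than `hashCode()`): a key is routed to the first node found clockwise from its hash position, so adding a node only moves the keys of one ring segment.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a consistent-hashing ring used as a primary index.
public class HashRing {
    // Ring positions (0..359 here, for illustration) mapped to node names.
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(Math.floorMod(node.hashCode(), 360), node);
    }

    // Route a key to the first node clockwise from its hash position.
    public String nodeFor(String key) {
        int h = Math.floorMod(key.hashCode(), 360);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey())
                              : tail.get(tail.firstKey());
    }
}
```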
Example: NewSQL
IDEA: For OLTP systems RDBMS can also be outperformed

Main memory DB
A DB less than 1 TB fits in memory
20 nodes x 32 GB (or more) cost less than 50,000 US$
Undo log is in-memory and discarded on commit

One-thread systems
Process incoming SQL commands to completion, without interruption
One transaction takes less than 1 ms
No isolation needed

Grid computing
Enjoy horizontal partitioning and parallelism
Add new nodes to the grid without going down

High availability
Cannot wait for the recovery process
Multiple machines in a peer-to-peer configuration

Reduce costs
Human costs are higher than Hw and Sw
An expert DBA is expensive and rare
The alternative is brute force: automatic horizontal partitioning and replication; execute queries at any replica and send updates to all of them; optimize queries at compile time
11 July 2016 43
Example: the Data Lake
IDEA: Load-First, Model-Later
Modeling at load time restricts the potential
analysis that can be done later (Big Analytics)
Store raw data and create on-demand views to
handle precise analysis needs

11 July 2016 44
Example: The Lambda Architecture
IDEA: Accommodate volume and velocity
Real-time vs. batch processing
Precise vs. approximate results

Batch layer: holds the master dataset, from which batch views are (re)computed
Serving layer: exposes the batch views for querying
Speed layer (Stream Processing Engine): processes new data (stream) into real-time views; its storage manager keeps temporary (working data), summary (synopses) and static (metadata) stores
Queries combine batch views and real-time views
11 July 2016 45
Example: Polyglot Systems
IDEA: Federate different kinds of NOSQL databases in a single system

Coined by Martin Fowler: http://martinfowler.com/bliki/PolyglotPersistence.html

46
Is It About New Things?
No, it is not. Current architectures do not introduce a single new concept
Old theoretical findings are now used, or used more broadly
Primary indexes
Sequential reads
Vertical partitioning
Compression (e.g., run length encoding)
Fixed-size values
In-memory processing
Bloom filters
Polyglot systems

Big Data brings critical thinking on traditional assumptions and tears them down. Nevertheless, Big
Data solutions are fully built on top of the database theory

47
Watch Out! The Problem is NOT SQL
SQL is a query language. SQL is cool! Not a problem
The problem is that relational systems are too generic
OLTP: stored procedures and simple queries
OLAP: ad-hoc complex queries
Documents: large objects
Streams: time windows with volatile data
Scientific: uncertainty and heterogeneity
But the overhead of RDBMS has nothing to do with SQL
Low-level, record-at-a-time interface is not the solution
SQL Databases vs. NoSQL Databases
Michael Stonebraker
Communications of the ACM, 53(4), 2010

11 July 2016 49
Data Management
DATA LIFECYCLE

11 July 2016 50
Data Management
Data management refers to the tasks a database management system (DBMS) must provide to cover the
data lifecycle within an IT system. In this context, we focus on:
Ingestion: the means provided to insert / upload / put data into the DBMS (in SQL, the INSERT command)
Modeling: the conceptual data structures used to arrange data within the DBMS (e.g., tables in the relational
model, graphs in graph modelling, etc.)
Storage: the physical data structures used to persist data in the DBMS (e.g., hash, b-tree, heap files, etc.)
Processing: the means provided (many times, in algebraic form) to manipulate data once stored in the DBMS
physical data structures (in SQL, the relational algebra)
Querying / fetching: the means provided to allow the DBMS user to specify the data processing she would like to
perform (in SQL, SELECT queries). It is typically triggered in terms of the conceptual data structures
In Big Data settings, it refers to the same concepts but assuming a NOSQL system is behind
Typically, a distributed system
Possibly with an alternative data model to the relational one
Implementing ad-hoc architectural solutions

11 July 2016 51
Data Management

Ingestion: "John, BCN, 23; Maria, BCN, 22"

Conceptual Schema:
Name | City | Age
John | BCN | 23
Maria | BCN | 22

Physical Schema (hash-based storage):
H(age) = age % 20
Bucket2 = {Maria|BCN|22}
Bucket3 = {John|BCN|23}

Querying: Get(name = John) triggers Data Processing over the physical schema
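The slide's example can be sketched in code (a hypothetical class mirroring the H(age) = age % 20 physical design): note that Get(name = John) cannot use the hash and must scan every bucket, which is why key design matters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hash-based storage: records are hashed on age, H(age) = age % 20.
public class AgeHashStore {
    private final Map<Integer, List<String[]>> buckets = new HashMap<>();

    public int bucketOf(int age) { return age % 20; }

    public void insert(String name, String city, int age) {
        buckets.computeIfAbsent(bucketOf(age), b -> new ArrayList<>())
               .add(new String[]{name, city, String.valueOf(age)});
    }

    // Query on the hash key: reading a single bucket is enough.
    public List<String[]> getByAge(int age) {
        return buckets.getOrDefault(bucketOf(age), List.of());
    }

    // Query on a non-hashed attribute: a full scan of all buckets is needed.
    public String[] getByName(String name) {
        for (List<String[]> bucket : buckets.values())
            for (String[] rec : bucket)
                if (rec[0].equals(name)) return rec;
        return null;
    }
}
```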
Apache Hadoop as Example

(Figure: the Hadoop ecosystem annotated with the data lifecycle: Ingestion, Modeling and Storage, Processing, and Querying, with Querying split into Small Analytics and Big Analytics)

The whole ecosystem maps to a DBMS!

By Víctor Herrero. Big Data Management & Analytics (UPC School)

11 July 2016 53
Data Ingestion
RELATIONAL DBMS
Data must be in the same machine
INSERT INTO (DML command)
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o ODBC (JDBC in the general case)
o Database command (typically from a file)

HADOOP ECOSYSTEM
Data must reach the cluster first (SCP, FTP, etc.) and from there be inserted into the Big Data system
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o API (less standardised than ODBC)
o Database command (typically from a file)

11 July 2016 54
Data Modeling
RELATIONAL DBMS
Based on the relational model
Tables, rows and columns
Each row represents an instance; columns, attributes / features
Constraints are limited: PK, FK, Check, ...
When creating the tables (i.e., at creation time) you MUST specify the table schema (i.e., columns and constraints)

HADOOP ECOSYSTEM
No single reference model
Key-value
Document
Stream
Graph
The closer the model looks to the way data is later stored internally, the better (impedance mismatch)
Further, ideally, the schema should be defined at insertion time and not at definition time (schemaless databases)

11 July 2016 55
Impedance Mismatch

Petra Selmer, Advances in Data Management 2012

11 July 2016 56
Key-Value
Characteristics
Entries in form of key-values (can be seen as enormous hash tables)
One key maps only to one value
Query on key only
Schemaless
Example entry: key = Bob, value = Michael_Elisabeth_30_Bobby_2010

Suitable for
Very large repositories of data (scalability is a must)
Unstructured data whose schema is difficult to predict
Main way to access data is by key

Systems
Hadoop ecosystem
11 July 2016 58
Document
Characteristics
Entries in form of key-values where the value is an XML or JSON document
One key maps only to one document
It is possible to index any of the document entries (in JSON, the JSON keys; in XML, the XML tags)
Query on key or indexed document entries
Schemaless
Example key: SocialNetwork

Suitable for
Data produced in XML / JSON format (web data)
Unstructured data whose schema is difficult to predict
Main way to access data is by key and the whole document is processed together

Systems
MongoDB, CouchDB

11 July 2016 59
Graph
Characteristics
Data stored as nodes and edges
Relationships as first-class citizens
Schemaless

Suitable for
Unstructured data whose schema is difficult to predict
Topology queries (based on the shape of the graph)

Systems
Neo4J, Titan, Sparksee

http://grouplens.org/datasets/movielens/

11 July 2016 60
Streams
Characteristics
Strong temporal locality
The whole dataset is not available but a portion of it (window)
Stream (window)

Suitable for
Real-time applications
Data per se is not important; the goal is to gain insight (e.g., from sensor data)
Approximate answers are enough

Systems
Apache Flink, Spark Streaming, Storm

11 July 2016 61
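The window idea above can be sketched as a count-based sliding window (a hypothetical minimal version; real engines such as Apache Flink also offer time-based and session windows): only the last N readings are kept, matching the point that the whole stream is never available.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical count-based sliding window over a stream of readings.
public class SlidingWindow {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;

    public SlidingWindow(int size) { this.size = size; }

    public void push(double reading) {
        if (window.size() == size) window.removeFirst(); // evict the oldest reading
        window.addLast(reading);
    }

    // An approximate answer computed over the window only (a moving average).
    public double average() {
        return window.stream().mapToDouble(Double::doubleValue)
                     .average().orElse(0.0);
    }
}
```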
Data Storage
RELATIONAL DBMS
Generic architecture that can be tuned according to the needs:
Mainly write-only systems (e.g., OLTP)
Normalization
Indexes: B+, Hash
Joins: BNL, RNL, Hash-Join, Merge-Join
Read-only systems (e.g., DW)
Denormalized data
Indexes: Bitmaps
Joins: Star-join
Materialized views

HADOOP ECOSYSTEM
Concepts that gained weight:
Primary indexes
Sequential reads
Vertical partitioning
Compression (e.g., run length encoding)
Fixed-size values
In-memory processing
Bloom filters
Polyglot systems
Such architectures are very specific and very good at solving a specific problem
11 July 2016 62
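The run-length encoding mentioned among the storage concepts can be sketched as follows (a hypothetical helper; column stores apply it to sorted, fixed-size column values, where long runs of equal values are common): each run of equal values becomes a single (value, count) pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical run-length encoding of a column of values.
public class RunLength {
    // Encode a column as (value, runLength) pairs.
    public static List<Map.Entry<String, Integer>> encode(String[] column) {
        List<Map.Entry<String, Integer>> runs = new ArrayList<>();
        for (String v : column) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).getKey().equals(v)) {
                // Extend the current run instead of storing the value again.
                runs.set(last, new SimpleEntry<>(v, runs.get(last).getValue() + 1));
            } else {
                runs.add(new SimpleEntry<>(v, 1));
            }
        }
        return runs;
    }
}
```

On a sorted city column such as {BCN, BCN, BCN, MAD}, four values compress to two pairs, and operators can even iterate over the runs without decompressing.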
Data Processing
RELATIONAL DBMS
The relational algebra as cornerstone
Data processing is hidden from us and the DBMS takes care of it (query optimizer)
Data Mining and Machine Learning is not performed inside the DBMS
Data is dumped to files that are later loaded in specific tools (R, SAS)

HADOOP ECOSYSTEM
Several data processing engines
Each of them works better under certain assumptions
40% to 80% of current Spark / MR analytical jobs triggered in a cluster require a single machine to be executed
However, Big Data workflows (i.e., migrating data among systems) require the power of the cluster for large datasets
For large volumes, the data locality principle must be applied and DM / ML performed without moving data
11 July 2016 63
The Ancestor of them All: MapReduce

(Figure: SQL sits on top of an RDBMS and its file system; in the Hadoop stack, MapReduce sits on top of HDFS, with HBase in between)
11 July 2016 64
MapReduce Basics

Map Merge-Sort Reduce

Simple model to express relatively sophisticated distributed programs


Processes pairs [key, value]
Signature:
map: (k1, v1) -> list(k2, v2)
reduce: (k2, list(v2)) -> list(k3, v3)

11 July 2016 65
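The Map -> Merge-Sort -> Reduce pipeline can be simulated in memory (a hypothetical single-machine sketch, not the Hadoop API): the sorted map plays the role of the merge-sort phase, grouping the map outputs by key before they are reduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the MapReduce signature:
//   map:    (#line, text)   -> list(word, 1)
//   reduce: (word, [1,1..]) -> (word, sum)
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Merge-sort / shuffle phase: group map outputs by key, in key order.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            // Map phase: emit (word, 1) for every token of the line.
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty())
                    grouped.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: sum the list of 1s of each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) sum += one;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }
}
```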
MapReduce: WordCount Example

<#line, text> -> Map -> (word, 1) pairs -> Merge-Sort -> [The, [1,1,1,1,...]] -> Reduce -> (The, 57631)

Map output:
The 1
Project 1
Gutenberg 1
Ebook 1
of 1
The 1
Outline 1
Of 1
Science, 1
Vol. 1
1 1
(of 1
4), 1
by 1
...

11 July 2016 66
MapReduce: WordCount Example
public void map(LongWritable key, Text value) {
    // KEY: the line offset; VALUE: the line of text
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        // Emit KEY: the word; VALUE: a count of 1
        write(new Text(tokenizer.nextToken()), new IntWritable(1));
    }
}
public void reduce(Text key, Iterable<IntWritable> values) {
    // KEY: a word; VALUE: the list of 1s grouped by the merge-sort phase
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    // Emit KEY: the word; VALUE: its total count
    write(key, new IntWritable(sum));
}

11 July 2016 67
Main Processing Approaches
Distributed Frameworks: thought to run on a cluster and exploit parallelism
MapReduce, Spark (bear in mind SparkR and Spark MLlib are built on top of Spark), Presto, ...

Centralized Processing on Distributed Storage: precise results on small sets
Metis, Hadoop-R, ...

Graph Processing: thought to deal with graph-like data
Distributed: Pregel / Giraph, GraphLab, GraphLINQ, ...
Centralized: GraphChi, Neo4J, ...

Stream Processing: deal with data streams
Spark Streaming, Apache Flink, Storm, ...

Recall some issues due to Big Data characteristics:
In distributed settings, approximations are fine and real-time answers are preferred
Centralized approaches yield precise results in batch processing but need to deal with small inputs (<= 0.5 GB as a rule of thumb)

11 July 2016 68
Workflow Orchestrators
There is no single data processing framework that is always the best (we learnt that from NOSQL!)
Workflow orchestrators are another abstraction layer, above data processing frameworks, that,
given a query, decide which data processing framework is the most adequate
Workflow Orchestrators (Oozie, Musketeer)

Ingestion Querying

Processing

Modeling and
Storage
11 July 2016 69
Workflow Orchestrators
Current workflow orchestrators are rather poor: Oozie
But there are attempts at smarter approaches: the ideas behind Musketeer deserve special attention
(Figure: Musketeer maps query frameworks to data processing frameworks)
In short, it does a similar job to the global query optimizers of traditional distributed RDBMS

11 July 2016 70
Data Querying
RELATIONAL DBMS
SQL is the only way to query the database
No matter if through ODBC or built-in commands or syntactic sugar
SELECT ... FROM ... WHERE ...
Declarative languages are COOL because they lower the entry barrier
Learn the language and use the database
Did you hear about data processing before? Thank SQL for that

HADOOP ECOSYSTEM
Querying is done in a programmatic way using the operators provided by the data processing framework (i.e., programming in MapReduce, Spark, etc.)
Programming is typically done in Java, Python or Scala
The translation from the query to a procedural access plan is not transparent as it was before (now you do the job)
Unfortunately, very few attempts to develop declarative languages like SQL for Big Data
Hive / Spark SQL and Pig / Spork for Hadoop
Cypher for Neo4J (graph database)
SparkR / MLlib / MLBase

11 July 2016 71
Cloud Services
PROVIDING ACCESS TO INFRASTRUCTURE

11 July 2016 72
Analogy: Electricity as a Utility

Own production Pay-per-use

11 July 2016 73
Computation as a Utility

Private Data Centre Public Cloud


(Own production) (Pay-per-use)

11 July 2016 74
Cloud Computing (Definition)
Cloud computing is a model for enabling convenient, on-demand
network access to a shared pool of configurable computing resources
(e.g., networks, servers, storage, applications, and services) that can
be rapidly provisioned and released with minimal management effort
or service provider interaction.

NIST (National Institute of Standards and Technology)

11 July 2016 75
Management Improvement

Daniel Abadi, UC Berkeley

11 July 2016 76
Undercapacity Risk

Daniel Abadi, UC Berkeley

11 July 2016 77
Benefits of Cloud Computing
Benefits for deploying in a cloud environment:

Reduce time to value
Resolve problems related to updating/upgrading 39%
Able to scale IT resources to meet needs 39%
Relieve pressure on internal resources 39%
Rapid development 39%
Able to take advantage of latest functionality 40%

Cost reduction
Reduce IT support needs 40%
Lower outside maintenance costs 42%
Lower labor costs 44%
Software license savings 46%
Hardware savings 47%
Pay only for what we use 50%

IBM global survey of IT and line-of-business decision makers 2012

11 July 2016 78
Levels of Service
The company outsources some responsibility to the service provider. These levels are
incremental and thus, SaaS implies PaaS and PaaS implies IaaS
Infrastructure as a Service (IaaS)
You get a server to which you can connect through remote connection protocols (VPN, SSH, FTP, etc.)
Typically it covers the hardware (computers, network, virtualization, etc.)
Platform as a Service (PaaS)
You get those software modules needed to run applications (databases, web servers, security, etc.)
Software as a Service (SaaS)
Besides IaaS and PaaS some software is there ready to be used (e.g., Google Docs, Dropbox, etc.)

Sometimes, one may talk about Business as a Service (BaaS):
a whole business process is outsourced (e.g., Paypal, Amadeus, etc.)

11 July 2016 79
Share of Responsibility

(Figure: four stacks — on-premise, IaaS, PaaS and SaaS — over the layers Application, Runtime, Security, Integration, Databases, Servers, Virtualization, Server Hw, Storage and Networking. On-premise you manage everything; with IaaS the provider manages the lower layers, roughly from Servers down; with PaaS the provider also manages the Databases layer; with SaaS the provider manages the whole stack.)
11 July 2016 80
Service Providers
International
Amazon Web Services
Redshift
Microsoft Azure
Google Cloud Platform
Etc.

11 July 2016 81
Thanks! Any Question?
oromero@essi.upc.edu
Homepage: http://www.essi.upc.edu/dtim/people/oromero
Twitter: @romero_m_oscar

11 July 2016 82
