Professional Documents
Culture Documents
OSCAR ROMERO
DTIM Research Group (http://www.essi.upc.edu/dtim/)
Universitat Politècnica de Catalunya - BarcelonaTech
Alicante, 11th July, 2016
11 July 2016 2
Introduction to Big Data
11 July 2016 3
A New Business Model
Traditionally, databases have been seen as a passive asset
OLTP systems: Data gathered is structured to facilitate (automate) daily operations
The relational model as de facto standard
Soon, many realized data is a valuable asset for any organization. So, use it!
Decisional systems: Stored data is analysed to better understand our activity (I want to know)
Data warehousing as de facto standard
11 July 2016 4
Instagram's Fable
11 July 2016 5
Other Examples
The overkilling approach
Facebook: Facebook + Instagram + WhatsApp +
Google: Android + Google Search + Calendar + Gmail + Doodle +
Even if most of us are not Facebook or Google we can still benefit from data
Cross available data (companies fusion, buying data, open data, agreements, etc.)
Digitalise the organization processes (e.g., the national health system, tax collection, e-banking, etc.)
Monitor the user (phone apps, internet navigation, service usage, wearables, RFID, etc.)
Sometimes, in an indirect way (e.g., provide (free) services to learn habits; such as free wi-fi to geolocate the user)
Sensors (Smart Cities, Internet of Things, etc.)
Bottom line: most of the time, the most interesting data is not available (innovation comes into play!)
11 July 2016 6
Data as the New Cornerstone
We have witnessed the boom of a new business model based on data analytics: Data is not a passive
but an active asset
"Data is the new oil!" (Clive Humby, 2006)
"No! Data is the new soil" (David McCandless, 2010)
The effective use of data to make decisions gave rise to the data-driven society concept
The confluence of three major socio-economic and technological trends makes data-driven
innovation a new phenomenon today. These three trends include (OECD):
The exponential growth in data generated and collected,
the widespread use of data analytics including start-ups and small and medium enterprises (SMEs), and
the emergence of a paradigm shift in knowledge
Organizations must adapt their infrastructures to benefit from the data deluge
Digital data doubling every 18 months (IDC)
Innovation is mandatory!
11 July 2016 7
Same Purpose; Different Means
The economic and social role of data is not new
Economic and social activities have long evolved around the analysis and use of data
In business, concepts such as Business Intelligence and Data Warehousing already emerged in the
1960s and became popular in the late 1980s
11 July 2016 8
The Business Intelligence Lifecycle
Business strategy
11 July 2016 9
The DW Exploitation
Typically, the analysis of the data has been considered at three different levels of detail
Querying & Reporting: Static report generation
OLAP: Dynamic summarizations of data
Data Mining and Machine Learning: Inference of hidden patterns or trends
11 July 2016 10
The DW Exploitation
The same analysis levels, illustrated with a data-mining example: k-means clustering of the iris dataset in R
> (kc <- kmeans(newiris, 3))
K-means clustering with 3 clusters of sizes 38, 50, 62
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.006000    3.428000     1.462000    0.246000
3     5.901613    2.748387     4.393548    1.433871
Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3
 [59] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
 [88] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1
[117] 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1
[146] 1 3 1 1 3
11 July 2016 11
Types of Analysis
Descriptive: Deterministic, not probabilistic
Computing summarizations, counts, min, max, etc.
Typical OLAP operations
Predictive: Probabilistic by nature. Tries to forecast what may happen according to what has happened
Linear and non-linear regression,
Classification,
Clustering,
Association rules,
etc.
Prescriptive: Given the prediction(s) of one or several models, understand why something is happening and
undertake automatic action(s). Examples:
Stock market indicators (to buy or sell shares)
Automatically increase / decrease prices
11 July 2016 12
Data Warehousing Vs. Big Data
Both are decision support systems (DSS)
Big Data can be seen as the evolution of Data Warehousing ecosystems to incorporate external data into
the decision-making processes of the organization as a first-class citizen
External Vs. Internal data
Semi-structured and unstructured data as first-class citizens
Dynamic Vs. static data sources
Not in control Vs. well-controlled sources
On-demand data quality threshold
Lightweight transformations Vs. heavy transformations
On-demand Vs. static goals
Load-first model-later (Data Lake) Vs. schema-fixed approach (DW)
And many other consequences
Semantic-aware solutions
Privacy
Etc.
11 July 2016 13
Big Data [Management]
Definition based on the limitations traditional DSS (such as Data Warehousing) cannot overcome
Velocity
Volume
Variety
11 July 2016 14
Big Data [Analytics]
Consider the previous slide and now recall we previously classified data analysis as:
Query & reporting, OLAP, data mining / machine learning
Descriptive, predictive, prescriptive
This classification still applies to Big Data but with a subtle change:
Query & Reporting + OLAP -> Small Analytics
Data Mining & Machine Learning -> Big Analytics
11 July 2016 15
An Example: BigBench
11 July 2016 16
NOSQL: A Paradigm Shift
FROM RELATIONAL TO NOSQL
11 July 2016 17
The End of an Architectural Era
Navigate
Search
11 July 2016 18
The End of an Architectural Era
Real time
Big Data
Concurrency
Unstructured and structured data (users)
11 July 2016 19
RDBMS: One Size Fits All
Mainly write-intensive systems (e.g., OLTP)
Data storage
Normalization
Queries
Indexes: B+, Hash
Joins: BNL, RNL, Hash-Join, Merge Join
11 July 2016 20
Distributed Data Management
(to the rescue)
Too many reads? Replication
FRAGMENTATION
REPLICATION
11 July 2016 21
Why Distribution?
How long does it take to read 1TB of data from disk in a centralized database compared to a
shared-nothing and shared-memory distributed system?
Centralized: 1,000,000 MB / 100 MB/s = 10,000 s ~ 166 minutes
Distributed (100 nodes): 1,000,000 MB / (100 nodes x 100 MB/s) = 100 s ~ 90 seconds
Assuming an ideal scenario with linear scalability
In a shared-memory environment, as the size of data increases, the CPU of the
computer is typically overwhelmed at some point by the data flow and is slowed down
In a shared-nothing environment this does not happen (as 100 CPUs are available),
but network communication is required
Extracted from Web Data Management (Abiteboul et al.): An Introduction to Distributed Systems
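The arithmetic above can be sanity-checked in a few lines. This is our own sketch; it hard-codes the slide's assumptions of 1 TB = 1,000,000 MB and a sustained transfer rate of 100 MB/s per disk.

```java
public class ScanTime {
    // seconds to scan totalMB from a single disk
    static double centralizedSeconds(double totalMB, double mbPerSec) {
        return totalMB / mbPerSec;
    }
    // seconds under ideal linear scalability across `nodes` disks
    static double distributedSeconds(double totalMB, double mbPerSec, int nodes) {
        return totalMB / (mbPerSec * nodes);
    }
    public static void main(String[] args) {
        double c = centralizedSeconds(1_000_000, 100);      // 10,000 s (~166 min)
        double d = distributedSeconds(1_000_000, 100, 100); // 100 s
        System.out.printf("centralized: %.0f s (%.1f min)%n", c, c / 60);
        System.out.printf("distributed: %.0f s%n", d);
    }
}
```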
11 July 2016 22
Why Distribution? Some Reflections
With current technologies
Disk transfer rate is a bottleneck for batch processing of large scale datasets; parallelization and distribution
can help to eliminate this bottleneck
It is faster to read the memory of a computer in your LAN (1GB/s) than your own disk (100MB/s)
Distributing data (i.e., by fragmenting and allocating it on a cluster) enables parallel processing
Several machines working together to solve a problem
Direct application of the divide-and-conquer principle
11 July 2016 23
Distributed RDBMS
Distributed RDBMS (DRDBMS) are not flexible enough:
Logging (persistent redo log, undo log) - ACID properties
CLI interfaces (JDBC, ODBC, etc.) - Data structure
Concurrency control (locking) - ACID properties
Latching associated with multi-threading - Architectural limitation
Two-phase commit (2PC) protocol in distributed transactions - ACID properties
Buffer management (caching disk pages) - Architectural limitation
Variable-length record management (locate records in a page, locate fields in a record) - Data structure
"The End of an Architectural Era (It's Time for a Complete Rewrite)". Michael Stonebraker et al.
VLDB Proceedings, 2007
11 July 2016 24
A New Type of Distributed Systems
NOSQL (Not Only SQL) systems have the following goals:
Schemaless: No explicit schema [new data structures] [distribution]
Reliability / availability: Keep delivering service even if its software or hardware components fail
Scalability: Continuously evolve to support a growing amount of tasks [distribution]
Efficiency: How well the system performs, usually measured in terms of response time (latency) and
throughput (bandwidth) [distribution]
11 July 2016 25
NOSQL Three Pillars
1. Distributed Data Management
Divide-and-conquer principle
Use (and abuse) of parallelism
11 July 2016 26
Distributed Data Management
Distributed DB design
Node distribution
Data fragments
Data replication
Distributed DB catalog
Fragmentation trade-off: Location
Global or local for each node
Centralized in a single node or distributed
Single-copy vs. Multi-copy
Distributed query processing
Data distribution / replication
Communication overhead
Security issues
Network security
Example: horizontal fragmentation of a People(Name, Age, Gender, Birthplace) table
F1: (Oscar, 34, Male, Lleida), (Anna, 35, Female, Barcelona)
F2: (Mercè, 45, Female, Gavà), (Carme, 22, Female, Terrassa)
F3: (Miquel, 27, Male, Mataró), (John, 55, Male, Glasgow)
11 July 2016 27
Distributed Database Design
Given a DB and its workload, how should the DB be split and allocated to sites so as to optimize
certain objective functions?
Minimize resource consumption for query processing
11 July 2016 28
Global Catalog Management
Centralized version (@master)
Accessing it is a bottleneck
Single-point failure
May add a mirror
Poorer performance
11 July 2016 29
Global Catalog Example: HDFS
11 July 2016 30
Global Catalog Example: HBase
LSM: Distributed B+ Tree
Key
11 July 2016 31
Decide Data Allocation
Given a set of fragments, a set of sites on which a number of applications are running, allocate
each fragment such that some optimization criterion is met (subject to certain constraints)
It is known to be an NP-hard problem
The optimal solution depends on many factors
Location in which the query originates
The query processing strategies (e.g., join methods)
Furthermore, in a dynamic environment the workload and access pattern may change
The problem is typically simplified with certain assumptions (e.g., only communication cost
considered)
Typical approaches build cost models and any optimization algorithm can be adapted to solve it
Heuristics are also available (e.g., best-fit for non-replicated fragments)
Sub-optimal solutions
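The best-fit heuristic mentioned above can be sketched in a few lines. Everything here is illustrative: the cost model is reduced to access frequencies only, and the matrix values are made up.

```java
import java.util.Arrays;

// Best-fit for non-replicated fragments: place each fragment at the site
// whose applications access it most often, ignoring all other cost factors.
public class BestFit {
    // accesses[site][fragment] = access frequency of that fragment from that site
    static int[] allocate(int[][] accesses) {
        int sites = accesses.length, fragments = accesses[0].length;
        int[] placement = new int[fragments];
        for (int f = 0; f < fragments; f++) {
            int best = 0;
            for (int s = 1; s < sites; s++)
                if (accesses[s][f] > accesses[best][f]) best = s;
            placement[f] = best;   // site with the highest local access count
        }
        return placement;
    }
    public static void main(String[] args) {
        int[][] acc = { {90, 5}, {10, 70} };   // 2 sites, 2 fragments (made up)
        System.out.println(Arrays.toString(allocate(acc)));  // [0, 1]
    }
}
```

A sub-optimal answer, as the slide notes: the heuristic ignores join co-location and communication costs entirely.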
11 July 2016 32
Managing Data Replication
Replicating fragments improves the system throughput but raises some other issues:
Consistency
Update performance
Spectrum: Strong Consistency <-> Eventually Consistent
11 July 2016 33
Trade-off: Performance Vs. Consistency
Consistency (ratio of correct answers) vs. Performance (system throughput)
11 July 2016 34
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P)
Example: Eager replication -> ACID
Update: C -> OK; A -> NO; P -> OK
35
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).
Example: Lazy replication -> BASE
Update: C -> NO; A -> OK; P -> OK
36
CAP Theorem
Eric Brewer. CAP Theorem: any networked shared-data system can have at most two of three
desirable properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).
Example: Eager replication, giving up partition tolerance -> ????
Update: C -> OK; A -> OK; P -> NO
37
CAP Theorem Revisited
The CAP theorem is not about choosing two out of the three forever and ever
Distributed systems are not always partitioned
Without partitions: CA
Otherwise
Detect a partition
Normally by means of latency (time-bound connection)
Enter an explicit partition mode limiting some operations choosing either:
CP (i.e., ACID by means of, e.g., 2PC or Paxos) or,
If a partition is detected, the operation is aborted
AP (i.e., BASE)
The operation goes on and we will tackle this next
If AP was chosen, enter a recovery process commonly known as partition recovery (e.g., compensate mistakes and get rid of
inconsistencies introduced)
Achieve consistency: Roll-back to consistent state and apply ops in a deterministic way (e.g., using time-stamps)
Reduce complexity by only allowing certain operations (e.g., Google Docs)
Commutative operations (concatenate logs, sort and execute them)
Repair mistakes: Restore invariants violated
Last writer wins
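The "last writer wins" repair rule above can be sketched as follows. The class and names are ours for illustration, not any particular system's API: after an AP partition heals, conflicting versions of a value are reconciled by keeping the one with the latest timestamp.

```java
import java.util.Comparator;
import java.util.List;

public class LastWriterWins {
    // A replica's version of some value, stamped at write time
    record Version(String value, long timestamp) {}

    // Keep the version with the highest timestamp; all others are discarded
    static Version reconcile(List<Version> conflicting) {
        return conflicting.stream()
                .max(Comparator.comparingLong(Version::timestamp))
                .orElseThrow();
    }
    public static void main(String[] args) {
        Version merged = reconcile(List.of(
                new Version("written-in-partition-A", 1000),
                new Version("written-in-partition-B", 1500)));
        System.out.println(merged.value());   // written-in-partition-B
    }
}
```

Note the trade-off: the rule is deterministic and cheap, but the losing write is silently lost, which is why the slide lists it under "repair mistakes" rather than as a general solution.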
11 July 2016 38
Flexible Data Models (Schemaless)
Relational databases take the relational model for granted
The NOSQL wave presents other data models to boost performance and flexibility
I. Key-value: Hadoop (HDFS, HBase), Cassandra, Voldemort, etc.
II. Document (kind of key-value): MongoDB, CouchDB, etc.
III. Graph-based: Neo4J, Giraph, GraphX, etc.
IV. Streams: Apache Flink, Spark Streaming, etc.
V. Etc.
11 July 2016 39
New Architectures
NOSQL introduces critical reasoning about the reference database architecture
ALL relational databases follow the System R architecture (late 70s!)
11 July 2016 40
Most Typical Architectural Solutions
Primary indexes to implement the global catalog
Distributed B+: WiredTiger, HBase, etc.
Consistent Hashing: Voldemort, MongoDB (until 2.X), etc.
Bloom filters to avoid distributed look ups
In-memory processing
Columnar block iteration: Vertical fragmentation + fixed-size values + compression (run-length encoding)
Heavily exploited by column-oriented databases
Only for read-only workloads
Sequential reads
Key design
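Consistent hashing, named above as one way to implement the global catalog, can be sketched with a sorted ring: keys and nodes are hashed onto the same circular space, and a key lives on the first node clockwise from its hash. The hash function and node names here are illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHash {
    // ring position -> node; TreeMap gives us clockwise lookups via ceilingEntry
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // Owner of a key: first node at or after the key's position, wrapping around
    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
    // Toy non-negative hash; real systems use a stronger function (e.g., MD5)
    private int hash(String s) { return s.hashCode() & 0x7fffffff; }

    public static void main(String[] args) {
        ConsistentHash ch = new ConsistentHash();
        ch.addNode("node-1"); ch.addNode("node-2"); ch.addNode("node-3");
        String owner = ch.nodeFor("user:42");
        ch.removeNode(owner);   // only the keys owned by that node move
        System.out.println(owner + " -> " + ch.nodeFor("user:42"));
    }
}
```

The point of the structure: adding or removing a node relocates only the keys in one arc of the ring, so there is no central index to update and no global rehash.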
11 July 2016 41
Most Typical Architectural Solutions
Random Vs. Sequential Reads
Get the most out of databases by boosting sequential reads
Enables pre-fetching
Option to maximize the effective read ratio (through good DB design)
http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
11 July 2016 42
Example: NewSQL
IDEA: For OLTP systems, RDBMS can also be outperformed
11 July 2016 43
Example: the Data Lake
IDEA: Load-first, model-later
Modeling at load time restricts the potential analysis that can be done later (Big Analytics)
Store raw data and create on-demand views to address precise analysis needs
11 July 2016 44
Example: The Lambda-Architecture
Batch layer Serving layer
Batch view
11 July 2016 45
Example: Polyglot Systems
IDEA: Federate different kinds of NOSQL databases in a single system
46
Is It About New Things?
No, it is not. Current architectures do not introduce a single new concept
Old theoretical findings are now used or broadly used
Primary indexes
Sequential reads
Vertical partitioning
Compression (e.g., run length encoding)
Fixed-size values
In-memory processing
Bloom filters
Polyglot systems
Big Data brings critical thinking on traditional assumptions and tears them down. Nevertheless, Big
Data solutions are fully built on top of database theory
47
11 July 2016 48
Watch Out! The Problem is NOT SQL
SQL is a query language. SQL is cool! Not a problem
The problem is that relational systems are too generic
OLTP: stored procedures and simple queries
OLAP: ad-hoc complex queries
Documents: large objects
Streams: time windows with volatile data
Scientific: uncertainty and heterogeneity
But the overhead of RDBMS has nothing to do with SQL
Low-level, record-at-a-time interface is not the solution
SQL Databases vs. NoSQL Databases
Michael Stonebraker
Communications of the ACM, 53(4), 2010
11 July 2016 49
Data Management
DATA LIFECYCLE
11 July 2016 50
Data Management
Data management refers to the tasks a database management system (DBMS) must provide to cover the
data lifecycle within an IT system. In this context, we focus on:
Ingestion: the means provided to insert /upload / put data into the DBMS (in SQL, the INSERT command)
Modeling: the conceptual data structures used to arrange data within the DBMS (e.g., tables in the relational
model, graphs in graph modelling, etc.)
Storage: the physical data structures used to persist data in the DBMS (e.g., hash, b-tree, heap files, etc.)
Processing: the means provided (many times, in algebraic form) to manipulate data once stored in the DBMS
physical data structures (in SQL, the relational algebra)
Querying / fetching: the means provided to allow the DBMS user to specify the data processing she would like to
perform (in SQL, SELECT queries). It is typically triggered in terms of the conceptual data structures
In Big Data settings, it refers to the same concepts but assuming a NOSQL system is behind
Typically, a distributed system
Possibly with an alternative data model to the relational one
Implementing ad-hoc architectural solutions
11 July 2016 51
Data Management
Ingestion: e.g., "John, BCN, 33; Maria, BCN, 22" is loaded into the system
Querying: e.g., Get(name = John) triggers Data Processing over the stored data
11 July 2016 52
Apache Hadoop as Example
Ingestion
Modeling and Storage
Processing (Big Analytics)
Querying (Small Analytics)
11 July 2016 53
Data Ingestion
RELATIONAL DBMS
Data must be in the same machine
INSERT INTO (DML command)
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o ODBC (JDBC in the general case)
o Database command (typically from a file)
HADOOP ECOSYSTEM
Data must reach the cluster first (SCP, FTP, etc.) and from there be inserted into the Big Data system
Main options:
o Bulk load (aka batch loading)
o Record-at-a-time
Connection options:
o API (less standardised than ODBC)
o Database command (typically from a file)
11 July 2016 54
Data Modeling
RELATIONAL DBMS HADOOP ECOSYSTEM
11 July 2016 55
Impedance Mismatch
11 July 2016 56
Impedance Mismatch
11 July 2016 57
Key-Value
Characteristics
Entries in form of key-values (can be seen as enormous hash tables)
One key maps only to one value
Query on key only
Schemaless
key -> value
11 July 2016 58
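The "enormous hash table" view of the key-value model above can be made concrete with a minimal in-memory sketch. This is ours for illustration; real key-value stores persist the table and distribute it across a cluster.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class KeyValueStore {
    // One key maps to exactly one opaque value
    private final Map<String, byte[]> table = new HashMap<>();

    void put(String key, byte[] value) { table.put(key, value); }   // upsert

    // Query on key only: no secondary access paths exist in this model
    Optional<byte[]> get(String key) {
        return Optional.ofNullable(table.get(key));
    }
    public static void main(String[] args) {
        KeyValueStore kv = new KeyValueStore();
        kv.put("user:1", "Oscar,34,Lleida".getBytes());   // value is schemaless bytes
        System.out.println(new String(kv.get("user:1").orElseThrow()));
    }
}
```

Note what is missing by design: no schema on the value and no way to ask "which keys hold users from Lleida?" without scanning everything, which is exactly the trade-off the slide describes.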
Document
Characteristics
Entries in form of key-values where the value is an XML or JSON document
One key maps only to one document
It is possible to index any of the document entries (in JSON, the JSON keys; in XML, the XML tags)
Query on key or indexed document entries
Schemaless
Suitable for
Data produced in XML / JSON format (web data)
Unstructured data whose schema is difficult to predict
The main way to access data is by key, and the whole document is processed together
Systems
MongoDB, CouchDB
11 July 2016 59
Graph
Characteristics
Data stored as nodes and edges
Relationships as first-class citizens
Schemaless
Suitable for
Unstructured data whose schema is difficult to predict
Topology queries (based on the shape of the graph)
Systems
Neo4J, Titan, Sparksee
http://grouplens.org/datasets/movielens/
11 July 2016 60
Streams
Characteristics
Strong temporal locality
The whole dataset is not available, only a portion of it (the window)
Suitable for
Real-time applications
Data per se is not important; the goal is to gain insight (e.g., sensor data)
Approximate answers are enough
Systems
Apache Flink, Spark Streaming, Storm
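The window concept above can be sketched with a count-based sliding window over a stream of readings; the window size and sensor values are made up, and real engines (Flink, Spark Streaming) also offer time-based windows.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindow {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();

    SlidingWindow(int size) { this.size = size; }

    // Ingest one reading and return the average over the current window only:
    // the rest of the (unbounded) stream is never materialized
    double push(double reading) {
        window.addLast(reading);
        if (window.size() > size) window.removeFirst();   // evict the oldest reading
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }
    public static void main(String[] args) {
        SlidingWindow w = new SlidingWindow(3);
        double avg = 0;
        for (double r : new double[]{10, 20, 30, 40}) avg = w.push(r);
        System.out.println(avg);   // average of {20, 30, 40} = 30.0
    }
}
```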
11 July 2016 61
Data Storage
RELATIONAL DBMS
Generic architecture that can be tuned according to the needs:
Mainly write-intensive systems (e.g., OLTP): Normalization; Indexes: B+, Hash; Joins: BNL, RNL, Hash-Join, Merge Join
Read-only systems (e.g., DW): Denormalized data; Indexes: Bitmaps; Joins: Star-join; Materialized views
HADOOP ECOSYSTEM
Concepts that gained weight: Primary indexes; Sequential reads; Vertical partitioning; Compression (e.g., run-length encoding); Fixed-size values; In-memory processing; Bloom filters; Polyglot systems
Such architectures are very specific and very good at solving a specific problem
11 July 2016 62
Data Processing
RELATIONAL DBMS HADOOP ECOSYSTEM
11 July 2016 63
The Ancestor of them All: MapReduce
MapReduce
HDFS
File System
RDBMS
HBase
11 July 2016 64
MapReduce Basics
11 July 2016 65
MapReduce: WordCount Example
<#line, text> -> Map -> (word, 1) pairs -> Merge-Sort -> [The,[1,1,1,1,]] -> Reduce -> (The, 57631)
Sample Map output: (The,1) (Project,1) (Gutemberg,1) (Ebook,1) (of,1) (The,1) (Outline,1) (Of,1) (Science,1) (Vol.,1) (1,1) ((of,1) (4),1) (by,1)
11 July 2016 66
MapReduce: WordCount Example
// Mapper: input key = byte offset of the line (LongWritable),
// input value = the line itself (Text); emits (word, 1) for every token
public void map(LongWritable key, Text value) {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    write(new Text(tokenizer.nextToken()), new IntWritable(1));
  }
}
// Reducer: input key = a word, input values = all the 1s emitted for it;
// emits (word, total count)
public void reduce(Text key, Iterable<IntWritable> values) {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  write(key, new IntWritable(sum));
}
11 July 2016 67
while (tokenizer.hasMoreTokens()) {
KEY
write(new Text(tokenizer.nextToken()), VALUE
new IntWritable(1));
}
}
KEY
public void reduce(Text key, VALUE
Iterable<IntWritable> values) {
int sum = 0;
for (IntWritable val : values) {
BLACKBOX
sum += val.get();
}
KEY
write(key, VALUE
new IntWritable(sum));
}
11 July 2016 67
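The map, merge-sort and reduce phases above can be simulated in a single process with plain Java, which makes the key/value flow easy to trace. This is a minimal sketch, not Hadoop code: the class and method names (`WordCountSketch`, `wordCount`) are invented for illustration.

```java
import java.util.*;

// Single-process sketch of the MapReduce WordCount data flow:
// map emits <word, 1>, the framework groups values by key (merge-sort),
// and reduce sums the grouped values. Illustrative only; no Hadoop involved.
public class WordCountSketch {

    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a <word, 1> pair per token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                pairs.add(Map.entry(tokenizer.nextToken(), 1));
            }
        }
        // Merge-sort (shuffle) phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: sum the values grouped under each key
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            counts.put(g.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("The Project Gutenberg Ebook of", "The Outline Of Science");
        // prints {Ebook=1, Gutenberg=1, Of=1, Outline=1, Project=1, Science=1, The=2, of=1}
        System.out.println(wordCount(lines));
    }
}
```

Note that, as in real MapReduce, "The", "Of" and "of" count as different keys: the grouping is purely by key equality, and any normalization (lowercasing, punctuation stripping) is the map function's job.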
Main Processing Approaches
Distributed frameworks: Thought to run on a cluster and exploit parallelism
MapReduce, Spark (bear in mind Spark-R and Spark MLlib are built on top of Spark), Presto, ...
Centralized processing on distributed storage: Precise results on small sets
Metis, Hadoop-R, ...
Graph processing: Thought to deal with graph-like data
Distributed: Pregel / Giraph, GraphLab, GraphLINQ, ...
Centralized: GraphChi, Neo4J, ...
Stream processing: Deal with data streams
Spark Streaming, Apache Flink, Storm, ...
Recall some issues due to Big Data characteristics: in distributed settings, approximations are fine and real-time answers are preferred; centralized approaches yield precise results in batch processing but need to deal with small inputs (<= 0.5 GB as a rule of thumb)
11 July 2016 68
Workflow Orchestrators
There is no single data processing framework that is always the best (we learnt that from NOSQL!)
Workflow orchestrators are another abstraction layer, above the data processing frameworks, that, given a query, decide which data processing framework is the most adequate
Workflow Orchestrators (Oozie, Musketeer)
Ingestion | Processing | Modeling and Storage | Querying
11 July 2016 69
Workflow Orchestrators
Current workflow orchestrators are rather poor: Oozie
But there are attempts at smarter approaches: the ideas behind Musketeer deserve special attention
[Diagram: Musketeer sits between the query frameworks and the data processing frameworks]
11 July 2016 70
Data Querying
RELATIONAL DBMS
SQL is the only way to query the database, no matter if through ODBC or built-in commands or syntactic sugar
SELECT ... FROM ... WHERE ...
Declarative languages are COOL because they lower the entry barrier: learn the language and use the database
Did you hear about data processing before? Thank SQL for that
HADOOP ECOSYSTEM
Querying is done in a programmatic way using the operators provided by the data processing framework (i.e., programming in MapReduce, Spark, etc.)
Programming is typically done in Java, Python or Scala
The translation from the query to a procedural access plan is not transparent as it was before (now you do the job)
Unfortunately, very few attempts to develop declarative languages like SQL for Big Data:
Hive / Spark SQL and Pig / Spork for Hadoop
Cypher for Neo4J (graph database)
SparkR / MLlib / MLBase
11 July 2016 71
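To see what "now you do the job" means, consider a declarative query such as SELECT city, COUNT(*) FROM users WHERE age >= 18 GROUP BY city. Without a declarative layer, the programmer writes the access plan (filter, group, aggregate) by hand. A minimal sketch in plain Java streams, with an invented table (`User`) and invented column names purely for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Hand-written procedural plan for:
//   SELECT city, COUNT(*) FROM users WHERE age >= 18 GROUP BY city
// In an RDBMS the optimizer derives this plan from the declarative query;
// in a programmatic framework the programmer spells out each operator.
public class QueryByHand {

    public record User(String city, int age) {}

    public static Map<String, Long> adultsPerCity(List<User> users) {
        return users.stream()
                .filter(u -> u.age() >= 18)                    // WHERE age >= 18
                .collect(Collectors.groupingBy(User::city,     // GROUP BY city
                        TreeMap::new,
                        Collectors.counting()));               // COUNT(*)
    }

    public static void main(String[] args) {
        List<User> users = List.of(
                new User("Barcelona", 25), new User("Barcelona", 17),
                new User("Alacant", 30));
        System.out.println(adultsPerCity(users)); // prints {Alacant=1, Barcelona=1}
    }
}
```

The same operator pipeline (filter, group by key, aggregate) is what one would code in MapReduce or Spark, only distributed across a cluster: this is exactly the entry barrier that Hive, Spark SQL and Pig try to remove by accepting a declarative query and generating the plan automatically.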
Cloud Services
PROVIDING ACCESS TO INFRASTRUCTURE
11 July 2016 72
Analogy: Electricity as a Utility
Pay-per-use
11 July 2016 73
Computation as a Utility
11 July 2016 74
Cloud Computing (Definition)
"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." (NIST definition)
11 July 2016 75
Management Improvement
11 July 2016 76
Undercapacity Risk
11 July 2016 77
Benefits of Cloud Computing
Benefits for deploying in a cloud environment
Resolve problems related to updating/upgrading 39%
11 July 2016 78
Levels of Service
The company outsources some responsibility to the service provider. These levels are
incremental and thus, SaaS implies PaaS and PaaS implies IaaS
Infrastructure as a Service (IaaS)
You get a server to which you can connect through remote connection protocols (VPN, SSH, FTP, etc.)
Typically it covers the hardware (computers, network, virtualization, etc.)
Platform as a Service (PaaS)
You get those software modules needed to run applications (databases, web servers, security, etc.)
Software as a Service (SaaS)
Besides IaaS and PaaS some software is there ready to be used (e.g., Google Docs, Dropbox, etc.)
11 July 2016 79
Share of Responsibility
[Diagram: four stacks (on-premises, IaaS, PaaS, SaaS) showing layers such as the application, security and the servers shifting from "you manage" to "provider manages" as the service level increases]
11 July 2016 80
Service Providers
International:
Amazon Web Services (e.g., Redshift)
Microsoft Azure
Google Cloud Platform
Etc.
11 July 2016 81
11 July 2016 81
Thanks! Any Question?
OROMERO@ESSI.UPC.EDU
HOMEPAGE: HTTP://WWW.ESSI.UPC.EDU/DTIM/PEOPLE/OROMERO
TWITTER: @ROMERO_M_OSCAR
11 July 2016 82