&
Data Science
By
Prof.Dr. O.P.Vyas
DAAD Fellow (Germany)
Indian Institute of Information Technology Allahabad (India)
Email: ranjanavyas@gmail.com
Outline
Data, Information and Knowledge.
DBMS Overview: DBMS Models, SQL & NoSQL
Operational Data Vs Informational Data
VLDBMS Vs Data Warehouse: Can DBMS grow to become Data Warehouse
Data Warehouse
Why a separate Data Warehouse: Data Mining & Data Warehousing
Multidimensional Data
Data Cube for Visualization
Data Mining Functionalities
Descriptive Analytics & Predictive Analytics
Big Data Analytics
How Big is Big Data: Tools and Technologies for Big Data
Netflix Case Study & Recommender System Researches
Internship Areas & Opportunities in Uni.Paderborn (Germany)
Database System Concepts 7.3 ©Silberschatz, Korth and Sudarshan
“We are pouring in data but starving in knowledge.”
Jiawei Han
Data Science-Chronological Development
Evolutionary steps | Business query | Enabling technologies | Product vendors | Characteristics
Database
Technology Statistics
Why Data Science…
Over the last decade there has been a
massive collection of data, and we
need tools and techniques to derive
meaningful information and actionable
knowledge from this continuously
generated data.
We need a thorough
understanding of how to manipulate,
design and manage these massive
databases.
One of the most productive uses is to
encourage the use of data analytics for
effective (business) decision making.
The demand for data scientists is
increasing so quickly that McKinsey
predicts that by 2018 there will be a
50 percent gap in the supply of data
scientists versus demand.
Three existing approaches in Data
Science..
Database
Database Management System (DBMS)
A DBMS is designed to manage large bodies of data containing
information about a particular enterprise.
DBMS provides an environment that is both convenient and
efficient to use.
Database Applications:
Banking: all transactions use a DBMS
Railways: reservations, train schedules (among the first uses of
geographically distributed databases)
Universities: registration, grades
Sales: customers, products, purchases
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions
Accessing Data through Mobile phones via Internet…
Internet
Databases touch all aspects of our lives.
data
data relationships
data semantics
data constraints
Entity-Relationship model
Relational model
Other models:
Object-Oriented model
Semi-structured data models
NoSQL: NoRel Models
Project Leader
Developers DBAs
Three Techniques in Data Science..
Database Management system: Limitations
Can a DBMS grow into a Data Warehouse?
What is Data Warehouse?
Defined in many different ways, but not rigorously.
Data warehousing:
Figure labels: Client, Metadata, Warehouse, Integration
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed
in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
In data warehousing we form a data cube which has three
dimensions, each representing one aspect of the Data…
Figure: data cube with axes Product (TV, PC, VCR, sum) and Country (U.S.A, Canada, Mexico, sum); other labels: Office, Day
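As a rough illustration of the data cube idea, the sketch below aggregates a small fact table along three dimensions and stores a "sum" cell for every roll-up, mirroring the "sum" labels of the cube. The fact rows, the choice of product, country and day as dimensions, and the name build_cube are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (product, country, day, dollars_sold)
FACTS = [
    ("TV", "U.S.A", "Mon", 100),
    ("TV", "Canada", "Mon", 40),
    ("PC", "U.S.A", "Tue", 250),
    ("VCR", "Mexico", "Tue", 30),
    ("PC", "Canada", "Mon", 80),
]

ALL = "sum"  # marker for "aggregated over this dimension"


def build_cube(facts):
    """Return a dict mapping (product, country, day) cells to total dollars_sold.

    Each fact contributes to its own cell and to every roll-up cell in which
    one or more dimensions are replaced by the ALL marker.
    """
    cube = defaultdict(int)
    for prod, country, day, dollars in facts:
        for cell in product((prod, ALL), (country, ALL), (day, ALL)):
            cube[cell] += dollars
    return dict(cube)


cube = build_cube(FACTS)
print(cube[("TV", ALL, ALL)])        # TV sales over all countries and days
print(cube[(ALL, "Canada", "Mon")])  # Canadian sales on Monday, all products
print(cube[(ALL, ALL, ALL)])         # grand total
```

Materialising every roll-up cell is only practical for tiny examples; a real warehouse would typically precompute a subset of these aggregates.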
Figure: Transactional Database -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining
Data Mining Approach
Data Mining
Descriptive Predictive
Data Mining Functionalities
Diagram labels: Data Mining, Association Mining, Clustering, Classification, Classification Techniques, Asso. Classifiers
Association Rule Mining is one of
the most common and useful types
of data mining.
•The purpose is to determine the
“inter-dependence” among various
items.
Support-Confidence Framework
•Apriori (Agrawal et al., 1993)
•Sequential pattern mining (Agrawal et al., 1995)
•Hashing in Apriori (Park et al., 1995a)
•FP Growth (Han et al., 2000): ARM
without candidate generation.
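To make the support-confidence framework concrete, here is a minimal sketch that computes support and confidence over a handful of made-up market-basket transactions and then lists the frequent item pairs. The transactions, thresholds and function names are assumptions for illustration; the pair search prunes infrequent single items in the spirit of Apriori but is not a full implementation of that algorithm.

```python
from itertools import combinations

# Hypothetical market-basket transactions
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Pairs of frequent single items that are themselves frequent."""
    items = sorted(set().union(*transactions))
    frequent_items = [i for i in items if support({i}, transactions) >= min_support]
    return [
        pair
        for pair in combinations(frequent_items, 2)
        if support(pair, transactions) >= min_support
    ]


print(support({"bread", "milk"}, TRANSACTIONS))
print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
print(frequent_pairs(TRANSACTIONS, min_support=0.5))
```

A rule such as bread => milk would be reported only if both its support and its confidence exceed user-chosen thresholds.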
Clustering
Diagram labels: Data Mining, Descriptive Approach, Clustering, Association Mining, Classification, Application Domain, Classification Techniques, Asso. Classification
Clustering is the process of grouping a set of
(physical or abstract) objects into classes of
similar objects
Spatial Data Mining & Web Mining
PageRank Algorithm 1998 changed the world !!
Techniques
•K-Means Algo (MacQueen, 1967)
•CLARANS (Ng and Han, 1994)
•DBSCAN (Ester et al., 1996)
Search Engine Result Mining
•Categorizes documents using phrases in titles and
snippets.
•Clustering Search Result (Leouski and Croft, 1996;
Zamir and Etzioni, 1997).
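As one illustration of clustering, the following is a minimal sketch of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its cluster). The sample points, the starting centroids and the function names are assumptions chosen so the example is deterministic; practical implementations usually pick starting centroids randomly or with a seeding heuristic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=100):
    """Run k-means from the given starting centroids.

    Returns (centroids, clusters), where clusters[i] lists the points
    assigned to centroids[i] in the final iteration.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(POINTS, [(0, 0), (10, 10)])
print(final_centroids)
print(final_clusters)
```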
Data Analytics
Data Analytic
Descriptive Predictive
Data Mining Approaches
Predictive Modeling
Prediction based on past history
Predict if a credit card applicant poses a good credit risk, based on some
attributes (income, job type, age, ..) and past history
Predict if a customer is likely to switch brand loyalty
Predict if a customer is likely to respond to “junk mail”
Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification Techniques, Assoc. Classification
Business Intelligence…
BI is simply Data Mining applied to
Business.
Right?
No, not always!
Data Mining works as a good foundation for
many BI solutions, but BI is more than DM.
How exactly is BI different from Data Mining?
BI is an interdisciplinary area and requires an
understanding of business processes,
appropriately combined to make BI work.
Areas like Digital Marketing and
Marketing Data Analytics,
combined with business processes and
innovative approaches, make up BI!
Examples of Classification Task
Relational Database
Anomalies & Design issues
DBMS: Design & Implementation
User Groups and Queries expected
1. Identify business requirements and then Model.
E-R diagrams were used to model the given real-life scenario as a
database system.
So far we have assumed that entity attributes are grouped to
form a relation schema using the common sense of the database
designer or by mapping a schema defined by the ER model.
Functional Dependencies
Second Normal Form & Third Normal Form
Decomposition & Properties of Decomposition
Overall Database Design Process
Constraints enforce limits to the
data or type of data that can be
inserted/updated/deleted from a
table.
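A minimal sketch of constraints in action, using Python's built-in sqlite3 module with an in-memory database. The employee table, its columns and the helper try_insert are hypothetical names chosen for this example; the point is only that rows violating a PRIMARY KEY, NOT NULL or CHECK constraint are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with three kinds of constraints
conn.execute(
    """
    CREATE TABLE employee (
        eid    INTEGER PRIMARY KEY,          -- unique identifier
        name   TEXT NOT NULL,                -- a name must be supplied
        salary REAL CHECK (salary >= 0)      -- no negative salaries
    )
    """
)


def try_insert(row):
    """Insert (eid, name, salary); return False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert((1, "Asha", 50000.0)))  # valid row
print(try_insert((2, None, 42000.0)))    # violates NOT NULL
print(try_insert((3, "Ravi", -5.0)))     # violates CHECK
print(try_insert((1, "Meena", 100.0)))   # violates PRIMARY KEY (duplicate eid)
```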
Database System Concepts 04/19/18 Lecture 8 7.69 ©Silberschatz, Korth and Sudarshan
Schema Analysis: EmpDept
The first problem students usually identify with the EmpDept
schema is that it combines two different ideas: employee
information and department information. But what is wrong
with this?
Also, if we want to open a new department and thus want to
insert a DeptName but do not yet have any employees, the EID
(primary key) and Name fields will have no values.
Without a value for the primary key, the record cannot exist,
so we cannot insert a new DeptName in this scheme.
This is an insert anomaly.
Delete anomaly: e.g. if we remove the row for course_no 351 from the above table, the
details of room C320 are deleted as well.
Update anomaly: e.g. Room H940 has been improved and is now of RSize = 500. To
update this single fact, we have to update every row where
Room=H940.
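The two situations above can be reproduced on a tiny in-memory table. The table referred to in the text is not reproduced here, so the rows below are hypothetical; only course 351, rooms C320 and H940, and the new RSize of 500 come from the text.

```python
# Hypothetical single-table design mixing course and room facts
COURSE_ROOM = [
    {"course_no": 351, "room": "C320", "rsize": 100},
    {"course_no": 353, "room": "H940", "rsize": 400},
    {"course_no": 355, "room": "H940", "rsize": 400},
]


def delete_course(rows, course_no):
    """Return the table without the rows for the given course."""
    return [dict(r) for r in rows if r["course_no"] != course_no]


def known_rooms(rows):
    """Rooms about which the table still holds any information."""
    return {r["room"] for r in rows}


def set_room_size(rows, room, new_size):
    """Change a room's size; return (new table, number of rows touched)."""
    updated = [
        dict(r, rsize=new_size) if r["room"] == room else dict(r) for r in rows
    ]
    touched = sum(r["room"] == room for r in rows)
    return updated, touched


# Deleting course 351 also loses every fact about room C320.
print(known_rooms(delete_course(COURSE_ROOM, 351)))

# One real-world change (H940 resized) forces an update to several rows.
_, rows_touched = set_room_size(COURSE_ROOM, "H940", 500)
print(rows_touched)
```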
To Identify 1 NF ?
1NF, 2NF, 3NF, BCNF: the concept of FDs (Functional Dependencies) is required
This table is not in first normal form because the [Color] column can
contain multiple values.
For example, the first row includes values "red" and "green."
To bring this table to first normal form, we split the table into two
tables and now we have the resulting tables:
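The original tables are not reproduced here, so the sketch below uses hypothetical rows to show the split: the multi-valued color column is moved into a separate table with one row per (product, color) pair. All column names, values and the function name are assumptions for illustration.

```python
# Hypothetical table that violates 1NF: "color" holds several values
PRODUCTS_UNNORMALIZED = [
    {"product_id": 1, "color": "red, green", "price": 15.99},
    {"product_id": 2, "color": "yellow", "price": 23.99},
]


def to_first_normal_form(rows):
    """Split into (products, product_colors) so every field is atomic."""
    products = [
        {"product_id": r["product_id"], "price": r["price"]} for r in rows
    ]
    product_colors = [
        (r["product_id"], color.strip())
        for r in rows
        for color in r["color"].split(",")
    ]
    return products, product_colors


products, product_colors = to_first_normal_form(PRODUCTS_UNNORMALIZED)
print(products)
print(product_colors)
```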
Composite
Primary Key
STUDENT_COURSE
To understand 4NF and 5NF, the concept of MVD (Multi-Valued Dependency) is required
Step III. Identify & Analyse the FDs of the given Scheme.
(2NF/3NF) then Step IV.
Functional Dependencies and Keys
What are
FDs ??
List them....
Design Anomalies
Identifying FDs : When & How ?
Design Analyses
Teaches Schema
Common pitfalls in FDs in the Schema
Are the FDs
Course -> Professor
Professor -> Course
satisfied in the following schema? Why?
An FD X -> Y holds if a given set of values for the attributes in X uniquely determines
the values of the attributes in Y.
Then verify whether R, in the real-world scenario, satisfies
Professor -> Course or not.
Can the same value of the Professor attribute have more than one value for
Course?
Normal Form Test for TEACHES
Let us consider the TEACHES relation and test whether it
satisfies any normal form.
What are the FDs in this scheme?
The relation scheme for the relation TEACHES is (Prof, Course, Room,
Room_Cap, Enrol_Lmt), where Enrol_Lmt is the enrolment limit.
The domain of the attribute Prof is all the faculty members of the university.
The domain of the attribute Course is the courses offered by the university.
The domain of Room is the rooms in the buildings of the university.
The domain of Room_Cap is an integer value indicating the seating capacity of the room.
FDs?
The course is scheduled in a given room, so each Course uniquely
identifies a Room:
Course -> Room
Since the room has a given maximum number of available seats, there
is a functional dependency
Room -> Room_Cap
and hence, by transitivity,
Course -> Room -> Room_Cap
Thus the functional dependencies in this relation are
{Course -> (Prof, Room, Room_Cap, Enrol_Lmt),
Room -> Room_Cap}
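The transitivity argument can be checked mechanically with the standard attribute-closure computation: start from a set of attributes and keep adding the right-hand side of every FD whose left-hand side is already covered. The sketch below applies it to the TEACHES dependencies; the function name and the tuple encoding of FDs are assumptions of this example.

```python
def closure(attributes, fds):
    """Return every attribute functionally determined by `attributes`.

    `fds` is a list of (lhs, rhs) pairs, each a tuple of attribute names.
    """
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


TEACHES = {"Prof", "Course", "Room", "Room_Cap", "Enrol_Lmt"}
TEACHES_FDS = [
    (("Course",), ("Prof", "Room", "Room_Cap", "Enrol_Lmt")),
    (("Room",), ("Room_Cap",)),
]

# Course determines every attribute, so Course is a key of TEACHES.
print(closure({"Course"}, TEACHES_FDS) == TEACHES)

# Room determines Room_Cap but is not a key: the chain
# Course -> Room -> Room_Cap is a transitive dependency.
print(sorted(closure({"Room"}, TEACHES_FDS)))
```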
Figure: an entity with attributes A to F, labelled as key attributes and non-key attributes.
2NF: Every non-key attribute is fully functionally dependent on the key attributes. Partial dependency is not allowed.
3NF: No non-key attribute is functionally dependent upon any other non-key attribute. Transitive dependency is not allowed.
K is a candidate key for R if and only if K -> R, and
for no α ⊂ K, α -> R
Functional dependencies allow us to express constraints that
cannot be expressed using superkeys. Consider the schema:
Loan-info-schema = (customer-name, loan-number,
branch-name, amount).
We expect this set of functional dependencies to hold:
loan-number -> amount
loan-number -> branch-name
but would not expect the following to hold:
loan-number -> customer-name
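Whether a dependency is satisfied by a particular relation instance can be tested directly: no two rows may agree on the left-hand side and differ on the right-hand side. The sketch below runs that test on hypothetical Loan-info rows (the customer names, loan numbers, branches and amounts are invented for illustration). Note that an instance can only refute a dependency; whether an FD truly holds is a statement about the real-world rules.

```python
def fd_holds(rows, lhs, rhs):
    """True if, in this instance, equal lhs values always imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical instance: loan L-17 is held jointly by two customers
LOAN_INFO = [
    {"customer-name": "Jones", "loan-number": "L-17", "branch-name": "Downtown", "amount": 1000},
    {"customer-name": "Smith", "loan-number": "L-17", "branch-name": "Downtown", "amount": 1000},
    {"customer-name": "Hayes", "loan-number": "L-15", "branch-name": "Perryridge", "amount": 1500},
]

print(fd_holds(LOAN_INFO, ["loan-number"], ["amount"]))
print(fd_holds(LOAN_INFO, ["loan-number"], ["branch-name"]))
print(fd_holds(LOAN_INFO, ["loan-number"], ["customer-name"]))
```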
Partial Dependency
Figure: BOOK relation
Product_ID Description
All attributes are directly or
indirectly determined by the
primary key; therefore, the relation
is at least in 1 NF
Figure: ORDER and PART relations
STUDENT_COURSE (composite primary key: Stud_ID, Course_ID)
Stud_ID   Course_ID
101       MSI 250
101       MSI 415
125       MSI 331

COURSE
Course_ID   Units
MSI 250     3.00
MSI 415     3.00
MSI 331     3.00
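A minimal sketch of this 2NF decomposition, assuming the original relation held (Stud_ID, Course_ID, Units) with Units depending only on Course_ID, which is a partial dependency on the composite key. The starting relation and the function names are assumptions; the data values are the ones shown in the tables. Joining the two projections back on Course_ID recovers the original rows, so no information is lost.

```python
# Assumed original relation: (Stud_ID, Course_ID, Units)
ENROLMENT = [
    (101, "MSI 250", 3.00),
    (101, "MSI 415", 3.00),
    (125, "MSI 331", 3.00),
]


def decompose(rows):
    """Project onto STUDENT_COURSE(Stud_ID, Course_ID) and COURSE(Course_ID, Units)."""
    student_course = sorted({(stud, course) for stud, course, _ in rows})
    course = sorted({(course_id, units) for _, course_id, units in rows})
    return student_course, course


def natural_join(student_course, course):
    """Rebuild (Stud_ID, Course_ID, Units) by joining on Course_ID."""
    units_of = dict(course)
    return sorted((stud, cid, units_of[cid]) for stud, cid in student_course)


student_course, course = decompose(ENROLMENT)
print(student_course)
print(course)
print(natural_join(student_course, course) == sorted(ENROLMENT))
```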
Bringing a Relation to 3NF
Transitive Dependency
DEPARTMENT
Dept_ID   Dept_Name
1         Acct
2         Mktg