
Data Centric Systems & Data Science

Can we get all data-related answers from a DBMS?

What happens if the database grows too large?

Can a DBMS grow into a Data Warehouse?


Data Science: from Database Management System to Big Data Analytics

By

Prof.Dr. O.P.Vyas
DAAD Fellow (Germany)
Indian Institute of Information Technology Allahabad (India)
Email: ranjanavyas@gmail.com
Outline
• Data, Information and Knowledge
• DBMS Overview: DBMS Models, SQL & NoSQL
• Operational Data vs. Informational Data
• VLDBMS vs. Data Warehouse: Can a DBMS grow to become a Data Warehouse?
• Data Warehouse
• Why a separate Data Warehouse: Data Mining & Data Warehousing
• Multidimensional Data
• Data Cube for Visualization
• Data Mining Functionalities
• Descriptive Analytics & Predictive Analytics
• Big Data Analytics
• How Big is Big Data: Tools and Technologies for Big Data
• Netflix Case Study & Recommender System Research
• Internship Areas & Opportunities at Uni Paderborn (Germany)

Database System Concepts 7.3 ©Silberschatz, Korth and Sudarshan
“We are pouring in data but starving in knowledge.”
Jiawei Han

Data Science-Chronological Development
Data Collection (1960s)
   Business query: "What was my total revenue in the last five years?"
   Enabling technologies: computers, tapes, disks
   Product vendors: IBM
   Characteristics: retrospective, static data delivery

Data Access (1980s)
   Business query: "What were A.C. unit sales in New England last March?"
   Enabling technologies: relational databases, SQL, ODBC
   Product vendors: Oracle, Sybase, Informix, IBM, Microsoft
   Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support Systems (1990s)
   Business query: "What were A.C. unit sales in CG state last March? Drill down to Allahabad."
   Enabling technologies: OLAP, multidimensional databases, DW
   Product vendors: Pilot, Comshare, Arbor, Cognos, Microstrategy
   Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (2000s)
   Business query: "What's likely to happen to Raipur's A.C. unit sales next month? Why?"
   Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
   Product vendors: Pilot, Lockheed, IBM, SGI, Mineset etc.
   Characteristics: prospective, proactive information delivery

Big Data Analytics (emerging)
   Business query: "What's the best strategy to win the US Presidential Election?" (Barack Obama won in 2012)
   Enabling technologies: distributed file system, MapReduce and Hadoop for the 3 Vs of data: Volume, Velocity, Variety
   Product vendors: Google, Seisint Inc. (now LexisNexis Group)
   Characteristics: prospective, proactive information delivery
Data Science is an interdisciplinary field
about processes and systems to
extract knowledge or insights from data in
various forms, either structured or
unstructured, which is a continuation of
some of the data analysis fields such as
statistics, data mining, and predictive
analytics, similar to Knowledge Discovery
in Databases (KDD)
Data Science: Confluence of Multiple Disciplines

[Figure: Data Science at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Data Mining Algorithms and Big Data Analytics]
Why Data Science?
• Over the last decade there has been a massive collection of data, yet we still need tools and techniques to derive meaningful information and actionable knowledge from this continuously generated data.
• We need a thorough understanding of how to design, manipulate and manage these massive databases.
• One of the most productive uses is to encourage data analytics for effective (business) decision making.
• The demand for data scientists is increasing so quickly that McKinsey predicts that by 2018 there will be a 50 percent gap between the supply of data scientists and the demand.
Three existing approaches in Data Science
• Database Management System (data stored in tables)
• Data Warehousing (data stored in data cubes)
• Data Mining (structured and unstructured data)
Data, Information & Knowledge
• Data is a collection of raw facts, figures and symbols, such as numbers, words, images, video and sound, given during the input phase.
• The data must be processed to create information, which is data that is organized, meaningful, and useful.
• During the output phase, the information that has been created is put into some form, such as a printed report.
• A DBMS processes the data and provides information.
• Knowledge is not simply the information presented; it is information further processed with intelligent mechanisms incorporating experience, domain knowledge and specialized techniques.
• The goal is to generate (business) insight and strategy based on the knowledge acquired from IT-based systems: KDD (Knowledge Discovery in Databases).
• The database is a significant concept related to Data Science.


DBMS: Definition
• Database: a collection of data stored in a standardized format, designed to be shared by multiple users.
• Database Management System: software that defines a database, stores the data, supports a query language, produces reports, and creates data entry screens.
Database Management System (DBMS)
• A DBMS is designed to manage large bodies of data containing information about a particular enterprise.
• A DBMS provides an environment that is both convenient and efficient to use.
• Database applications:
   • Banking: all transactions use a DBMS
   • Railways: reservations, train schedules (an early use of databases in a geographically distributed manner)
   • Universities: registration, grades
   • Sales: customers, products, purchases
   • Manufacturing: production, inventory, orders, supply chain
   • Human resources: employee records, salaries, tax deductions
   • Accessing data through mobile phones via the Internet
• Databases touch all aspects of our lives.
• Purpose of database systems: developing information systems
Developing Information Systems
• The purpose of implementing a database system in any organization is to develop an effective information system in the organization.
What is an information system?
• An information system is a "computerization" of some existing manual system which, after automation, provides useful information to the targeted user.
• One of the most popular examples of an information system is the computerization of "Railway Reservation Systems".
• Developed by CRIS (Centre for Railway Information Systems), it provides very valuable information such as:
   1. Availability of trains between two stations.
   2. Availability of tickets on a particular train on a particular date.
   3. Booking a ticket online.
• Used as a web-based system, it is a sound implementation of a DBMS!
DBMS Language: SQL (Structured Query Language)



Web Enabled Databases

Real life examples of Web Enabled Databases……?



Data Models
• A collection of tools for describing:
   • data
   • data relationships
   • data semantics
   • data constraints
• Entity-Relationship model
• Relational model
• Other models:
   • Object-oriented model
   • Semi-structured data models
   • NoSQL: NoRel models


Relational Database Management System
• The RDBMS is one of the most successful models of database management.
• The RDBMS uses the concepts first introduced by the researcher Edgar F. Codd.
• An RDBMS is a DBMS in which data is stored in tables and the relationships among the data are also stored in tables.
• The data can be accessed or reassembled in many different ways without having to change the table forms.
• In an RDBMS a large body of data can be modeled and described as a group of tables which are connected to each other through relations.
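The idea that data lives in tables and is reassembled through relationships can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the department and employee tables, their column names and the sample rows are assumptions made for the example, not a schema from the slides.

```python
import sqlite3

# Hypothetical schema: two tables linked by a shared dept_id column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
    CREATE TABLE employee (
        eid     TEXT PRIMARY KEY NOT NULL,
        name    TEXT NOT NULL,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    INSERT INTO department VALUES (10, 'Tail'), (12, 'Wing');
    INSERT INTO employee VALUES ('A01', 'Ali', 12), ('A12', 'Eric', 10);
""")

# Reassemble the data with a join; the table definitions stay unchanged.
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employee AS e JOIN department AS d ON e.dept_id = d.dept_id
    ORDER BY e.eid
""").fetchall()
print(rows)
```

A different query (for example, counting employees per department) would reuse the same two tables without any change to their definitions.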


DBMS Project Team

Project Leader
   • Data Specialists: Modeler, Architect
   • Front End (VB, VC++, Java): Developers
   • Back End (Oracle, SQL Server, DB2): DBAs
Three Techniques in Data Science..



DBMS & Data Warehouse

OLTP & OLAP

Database Management System: Limitations
• Purpose of database systems: developing information systems
• Data is growing at a rapid pace, leading to the VLDBMS (Very Large DBMS)
• Should transaction management and analytical processing both run on the SAME system?
• The solution was multidimensional data, with separate systems for day-to-day transactions and for analytical processing
• Does a DBMS, after growing, become a VLDBMS and then a Data Warehouse?
Can a DBMS grow into a Data Warehouse?

Written on the back of an auto rickshaw:

"When I grow up, I will become a truck!!"


Why Separate Data Warehouse?
• High performance for both systems
   • DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
   • Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
   • Missing data: decision support requires historical data which operational DBs do not typically maintain
   • Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
   • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis directly on relational databases
Data Mining: Concepts
and Techniques 23
Database System Concepts April 19, 2018 7.23 ©Silberschatz, Korth and Sudarshan
Data Warehousing…

What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
   • A decision support database that is maintained separately from the organization's operational database
   • Supports information processing by providing a solid platform of consolidated, historical data for analysis
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
   • Uses OLAP (On-Line Analytical Processing) systems
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
   • Major task of traditional relational DBMS
   • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
   • Major task of data warehouse system
   • Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
   • User and system orientation: customer vs. market
   • Data contents: current, detailed vs. historical, consolidated
   • Database design: ER + application vs. star + subject
   • View: current, local vs. evolutionary, integrated
   • Access patterns: update vs. read-only but complex queries
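The difference in access patterns can be sketched on a toy table. The sales table, its columns and its rows below are hypothetical; in practice the two workloads would run on separate systems, and a single SQLite database is used here only to contrast the query shapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("North", 2016, 100.0), ("North", 2017, 150.0),
     ("South", 2016, 80.0), ("South", 2017, 120.0)],
)

# OLTP-style access: a short transaction that updates one current record.
with conn:
    conn.execute("UPDATE sales SET amount = 155.0 WHERE sale_id = 2")

# OLAP-style access: a read-only query that consolidates historical data.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```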
Data Warehouse Architecture

[Figure: layered architecture, bottom to top: multiple Sources feed an Integration layer, which loads the Warehouse (with its Metadata); Clients reach the Warehouse through a Query & Analysis layer]
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
   • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
   • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing we form a data cube, typically drawn with three dimensions, each representing one aspect of the data
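A fact table with keys into dimension tables can be sketched as follows. The item and time attributes and the dollars_sold measure follow the slide; the surrogate key columns (item_key, time_key), the table names sales_fact and time_dim, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "by what" of each measurement.
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
    -- The fact table holds the measure plus one key per dimension.
    CREATE TABLE sales_fact (
        item_key     INTEGER REFERENCES item(item_key),
        time_key     INTEGER REFERENCES time_dim(time_key),
        dollars_sold REAL
    );
    INSERT INTO item VALUES (1, 'TV', 'BrandA', 'electronics'), (2, 'PC', 'BrandB', 'electronics');
    INSERT INTO time_dim VALUES (1, 'Jan', 'Q1', 2017), (2, 'Apr', 'Q2', 2017);
    INSERT INTO sales_fact VALUES (1, 1, 500.0), (1, 2, 700.0), (2, 1, 900.0);
""")

# View the measure along two dimensions at once: item and quarter.
by_item_quarter = conn.execute("""
    SELECT i.item_name, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item AS i ON f.item_key = i.item_key
    JOIN time_dim AS t ON f.time_key = t.time_key
    GROUP BY i.item_name, t.quarter
    ORDER BY i.item_name, t.quarter
""").fetchall()
print(by_item_quarter)
```

Grouping by t.year instead of t.quarter would summarize the same facts at a coarser level of the time dimension.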
Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
   • Product: Industry > Category > Product
   • Location: Region > Country > City > Office
   • Time: Year > Quarter > Month > Day, with Week > Day as an alternative path
[Figure: cube with axes Product, Region and Month]
A Sample Data Cube

[Figure: a three-dimensional data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum). The highlighted cell holds the total annual sales of TV in U.S.A.]
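The "sum" margins of such a cube can be sketched in plain Python: every base cell contributes to each combination of keeping a coordinate or rolling it up to "sum". The sales figures and the build_cube helper are hypothetical and only illustrate the aggregation idea.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base cells: (quarter, product, country) -> sales.
facts = {
    ("1Qtr", "TV", "U.S.A."): 10,
    ("2Qtr", "TV", "U.S.A."): 20,
    ("1Qtr", "PC", "U.S.A."): 5,
    ("1Qtr", "TV", "Canada"): 7,
}

def build_cube(cells):
    """Add each base cell into all 2^n cells obtained by replacing coordinates with 'sum'."""
    cube = defaultdict(int)
    for key, value in cells.items():
        for mask in product([False, True], repeat=len(key)):
            rolled = tuple("sum" if m else k for k, m in zip(key, mask))
            cube[rolled] += value
    return dict(cube)

cube = build_cube(facts)
print(cube[("sum", "TV", "U.S.A.")])  # total annual sales of TV in U.S.A.
print(cube[("sum", "sum", "sum")])    # grand total over all dimensions
```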
Data Mining



Data Mining using Large Databases
• What is Data Mining?
   • Methods for finding interesting structure in large databases
      • E.g. patterns, prediction rules, unusual cases
   • Focus on efficient, scalable algorithms
   • Related to data warehousing, machine learning
• Why is Data Mining important?
   • Well marketed; now a large industry; pays well
   • Handles large databases directly
   • Can make data analysis more accessible to end users
      • Semi-automation of analysis
      • Results can be easier to interpret than e.g. regression models
      • Strong focus on decisions and their implementation


Data Mining: A KDD Process
Data mining: the core of the Knowledge Discovery in Databases process.
[Figure: KDD pipeline, bottom to top: Transactional Databases > Data Cleaning and Data Integration > Data Warehouse > Selection > Task-relevant Data > Data Mining > Pattern Evaluation]
Data Mining Approach

[Figure: taxonomy. Data Mining splits into Descriptive (Association Rule Mining, Clustering Analysis) and Predictive (Classification Techniques, Associative Classifiers)]
Data Mining Functionalities

[Figure: Data Mining taxonomy with Association Mining highlighted, alongside Clustering, Classification Techniques and Associative Classifiers]
Association Rule Mining is one of the most common and useful types of data mining.
• The purpose is to determine the "inter-dependence" among various items.
Support-Confidence Framework
• Apriori (Agrawal et al., 1993)
• Sequential pattern mining (Agrawal et al., 1995)
• Hashing in Apriori (Park et al., 1995a)
• FP-Growth (Han et al., 2000): ARM without candidate generation
Data Mining

[Figure: Data Mining taxonomy, descriptive approach, with Clustering highlighted alongside Association Mining, Classification Techniques and Associative Classification]
Clustering is the process of grouping a set of (physical or abstract) objects into classes of similar objects.
Application domains: Spatial Data Mining & Web Mining
• The PageRank algorithm (1998) changed the world!
Techniques
• K-Means algorithm (MacQueen, 1967)
• CLARANS (Han et al., 1994)
• DBSCAN (Ester et al., 1996)
Search Engine Result Mining
• Categorizes documents using phrases in titles and snippets.
• Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997)
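Grouping similar objects can be illustrated with a small k-means sketch. This is the standard assign-then-recompute iteration written for 2-D points with fixed starting centroids; the points, the starting centroids and the function name kmeans are assumptions chosen so the example is deterministic, not the slide author's implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Run k-means iterations on 2-D points from the given starting centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Real implementations usually pick starting centroids randomly (or with a seeding heuristic) and stop when assignments no longer change; the fixed iteration count here keeps the sketch short.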
Data Analytics

[Figure: Data Analytics taxonomy: Descriptive (Association Rule Mining) and Predictive (Classification, Associative Classification), with their application domains]
Classification mining analyzes a set of training data (i.e. a set of objects whose class labels are known) and constructs a model for each class based on the features in the data.
A set of classification rules is generated by the classification process, and these can be used to classify future data, as well as to develop a better understanding of each class in the database.
Data Mining Approaches
Predictive Modeling
• Prediction based on past history
   • Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
   • Predict whether a customer is likely to switch brand loyalty
   • Predict whether a customer is likely to respond to "junk mail"
   • Predict whether a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
   • Classification (discrete data)
      • Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to
   • Regression formulae (continuous data)
      • Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value
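The two prediction mechanisms can be contrasted in a short sketch. A 1-nearest-neighbour rule stands in for classification and a least-squares line for regression; both are simple stand-ins chosen for brevity, and the applicant data, labels and function names are hypothetical.

```python
def nearest_neighbour_class(training, item):
    """Classification: return the label of the closest training example (1-NN)."""
    _, label = min(
        training,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], item)),
    )
    return label

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical applicants: (income in thousands, age) -> credit risk class.
training = [((20, 25), "high"), ((25, 30), "high"), ((80, 45), "low"), ((90, 50), "low")]
predicted_class = nearest_neighbour_class(training, (85, 40))
print(predicted_class)

# Known parameter-value to function-result pairs, then a prediction for a new value.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_value = slope * 5 + intercept
print(predicted_value)
```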
Applications of Data Mining (Cont.)
• Descriptive patterns: explain the trends and patterns in the data
   • Associations
      • Find books that are often bought by the same customers. If a new customer buys one such book, suggest the others too.
      • Other similar applications: camera accessories, clothes, etc.
   • Associations may also be used as a first step in detecting causation
      • E.g. association between exposure to chemical X and cancer, or new medicine and cardiac problems
   • Clusters
      • E.g. typhoid cases were clustered in an area surrounding a contaminated well
      • Detection of clusters remains important in detecting epidemics


Association Rules
• Retail shops are often interested in associations between the different items that people buy. Supermarkets use this pattern:
   • Someone who buys bread is quite likely also to buy milk
• Association information can be used in several ways.
   • E.g. when a customer buys a particular book, an online shop may suggest associated books.
• A.R.M. was initially applied to Market Basket Analysis on transaction data of supermarket sales using the Apriori algorithm.
Association Rule Mining provides the customer purchase patterns; marketing strategists formulate discount policies accordingly.
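How likely "bread implies milk" is can be quantified with the support and confidence measures used by Apriori-style algorithms. The sketch below computes both for a handful of made-up baskets; the transactions and the helper names support and confidence are illustrative assumptions.

```python
# Hypothetical market baskets; each basket is the set of items in one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "sugar"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually reported only when both values exceed user-chosen minimum thresholds.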
MARKET BASKET ANALYSIS
• INPUT: list of purchases by customer
   • records do not have names, only items bought
• Identify purchase patterns
   • what items tend to be purchased together
      • obvious: Bread-Butter; Milk-Sugar
   • what items are purchased sequentially
      • obvious: house-furniture; car-tires
   • what items tend to be purchased by season
• Marketing experts generate Business Intelligence


Application of ARM
Market Basket Analysis
Various products bought by a customer in a supermarket were analysed for "interdependency".
The mining result is used for shelf space organization, formulating discounting policies and selective marketing.
Business Intelligence
• BI is simply Data Mining applied to business.
• Right?
• No, not always!
• Data Mining works as a good foundation for many BI solutions, but BI is more than DM.
• How exactly is BI different from Data Mining?
• BI is an interdisciplinary area and requires an understanding of business processes, appropriately combined, to make BI work.
• Areas like Digital Marketing and Marketing Data Analytics, combined with business processes and innovative approaches, make up BI!
Data Mining Approach

[Figure: taxonomy. Data Mining splits into Descriptive (Association Rule Mining, Clustering) and Predictive (Classification, Associative Classification)]
Classification mining analyzes a set of training data (i.e. a set of objects whose class labels are known) and constructs a model for each class based on the features in the data.
A set of classification rules is generated by the classification process, and these can be used to classify future data, as well as to develop a better understanding of each class in the database.
Examples of Classification Task
• Classifying whether a new credit card applicant will be of High, Medium or Low credit risk
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil


DBMS transforms Data to Information.

Data Warehousing provides Data Visualization at multiple levels.

Data Mining generates Knowledge and insight from the raw data.

So what more do we need?


Why Big Data Analytics?
Big Data Analytics: What & Why?
Does Big Data just mean a huge amount of data?
Big Data: how big is it?
• Big Data means the data has simply grown too big to be handled with existing technologies.
• Right?
• No.
• Big Data does pose problems that many current technologies cannot solve, but Big Data is more than just being too big.
• How exactly is Big Data different from Data Warehousing?
• Big Data is an interdisciplinary area, and many different fields, such as Distributed Systems, are combined with existing technologies to make Big Data work.
• More important are the 4 Vs: innovative approaches to handling them turn Big Data into Big Data Analytics!
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.
What is Big Data?
• Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
• Big data may be as important to business, and society, as the Internet has become.
• Why? More data may lead to more accurate analyses.
• More accurate analyses may lead to more effective decision making.
• Better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
• As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now mainstream definition of big data in terms of three Vs: Volume, Velocity and Variety. Veracity is commonly added as a fourth V.


Big Data characteristics
• Big data can be described by the following characteristics:
   • Volume: the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not.
   • Variety: the type and nature of the data. Knowing this helps the people who analyze the data to use the resulting insight effectively.
   • Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
   • Veracity: the quality of captured data can vary greatly, affecting accurate analysis. Noise and incompleteness of the data have to be considered.


Developing Information Systems with Database Management

Relational Database Anomalies & Design Issues
DBMS: Design & Implementation
1. Identify business requirements (user groups and expected queries) and then model them.
2. Define tables and relationships (design tables with the constraints imposed).
3. Create input forms and reports.
4. Develop applications with different users in mind.
We have to focus on design issues for the DBMS first.
RDBMS Design Issues
• To model a given real-life scenario as a database system, E-R diagrams are used.
• So far we have assumed that entity attributes are grouped to form a relation schema by using the common sense of the database designer or by mapping a schema defined by the ER model.
• We still need some formal measure of why one grouping of attributes into a relation schema may be better than another.
• Without careful design, relation schemas are liable to cause problems in the database: problems of data integrity and/or anomalies.


Database Design Phases

The Normalization Theory in Database also relates to Logical Database Design



What are constraints in a database?
When and why are they required?
Database Management System
• DBMS Constraints
   • What are database constraints
   • Why constraints in DBMS
   • Data integrity in DBMS
• Design & Anomalies: various anomalies
   • Database schema and anomaly examples
   • Insert, Delete and Update anomalies
• Why anomalies?
   • ER diagrams may not yield a good design
   • Pitfalls in relational database design
• Normalization & overview of Normal Forms
   • First Normal Form
   • Functional dependencies
   • Second Normal Form & Third Normal Form
   • Decomposition & properties of decomposition
• Overall database design process
Constraints enforce limits on the data, or the type of data, that can be inserted into, updated in, or deleted from a table.

The whole purpose of constraints is to maintain data integrity during an update, delete or insert on a table.
Data integrity is the maintenance of, and the assurance of, the accuracy and consistency of data over its entire life-cycle.
Data integrity is usually imposed during the database design phase through the use of standard procedures and rules.
Why Constraints in DBMS
• Constraints are used to make sure that the integrity of data is maintained in the database.
• Data integrity is the overall completeness, accuracy and consistency of data.
• This can be indicated by the absence of alteration between two instances or between two updates of a data record, meaning the data is intact and unchanged.
• Data integrity can be maintained through the use of various error-checking methods and validation procedures.
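How a DBMS uses constraints to reject data that would break integrity can be sketched with sqlite3. The room and course tables, the specific CHECK rules and the try_insert helper are assumptions made for this illustration; other database systems express the same ideas with slightly different syntax and defaults.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE room (
        room      TEXT PRIMARY KEY NOT NULL,
        room_size INTEGER NOT NULL CHECK (room_size > 0)
    );
    CREATE TABLE course (
        course_no INTEGER PRIMARY KEY,
        room      TEXT NOT NULL REFERENCES room(room),
        en_limit  INTEGER CHECK (en_limit >= 0)
    );
    INSERT INTO room VALUES ('A532', 45);
""")

def try_insert(sql):
    """Return 'ok' if the statement succeeds, 'rejected' if a constraint is violated."""
    try:
        with conn:
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert("INSERT INTO course VALUES (353, 'A532', 40)"),  # valid row
    try_insert("INSERT INTO room VALUES ('C320', -5)"),         # violates CHECK
    try_insert("INSERT INTO course VALUES (351, 'Z999', 60)"),  # violates FOREIGN KEY
]
print(results)
```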


Relational Database Design & Database Anomalies

Often ER diagrams are directly converted into tables without considering database design aspects, but this may introduce anomalies and inconsistencies that are felt throughout the database's entire life-cycle.
It is better to consider normalization during the database design phase, through the use of standard procedures and rules.
Database Design
Designing the database is the first step in developing an information system:
• Design: database design requires that we find a “good” collection of table (relation) schemas.
   • What attributes should we record in the database so that our expected queries are answered efficiently?
   • What relation schemas should we have, and how should the attributes be distributed among the various relation schemas, so that:
      • all significant aspects of the system are included in the database design (completeness), and
      • no aspects of the system are repeated (no redundancy)?
• Measure of design appropriateness: how do we measure whether our database design is optimal?


RDBMS Design Issues
• So far we have assumed that attributes are grouped to form a relation schema by using the common sense of the database designer or by mapping a schema defined by the ER model.
• We still need some formal measure of why one grouping of attributes into a relation schema may be better than another.
• Unsatisfactory relation schemas that do not meet certain conditions (the normal form tests) are liable to cause problems in the database, known as anomalies.


Normalization & Design objectives
The basic objective of normalization is to reduce the various anomalies in the database.
• Database design and normalization can be looked upon as a process of analyzing the given relation schemas, based on their functional dependencies (FDs) and primary keys, to achieve the desirable properties of:
   • minimizing redundancy
   • minimizing the insertion, deletion, and update anomalies.
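A functional dependency X -> Y says that rows agreeing on X must also agree on Y. The sketch below checks this against sample rows of a small course and room table; the holds helper is hypothetical, and passing the check on sample data only shows the FD is not contradicted, since real FDs come from the meaning of the attributes.

```python
def holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in `rows`.

    `rows` is a list of dicts; `lhs` and `rhs` are tuples of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same lhs value must always map to the same rhs value.
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {"Course_no": 353, "Tutor": "Smith", "Room": "A532", "Room_size": 45, "En_limit": 40},
    {"Course_no": 351, "Tutor": "Smith", "Room": "C320", "Room_size": 100, "En_limit": 60},
    {"Course_no": 355, "Tutor": "Clark", "Room": "H940", "Room_size": 400, "En_limit": 300},
    {"Course_no": 456, "Tutor": "Turner", "Room": "H940", "Room_size": 400, "En_limit": 45},
]

room_determines_size = holds(rows, ("Room",), ("Room_size",))
room_determines_limit = holds(rows, ("Room",), ("En_limit",))
print(room_determines_size, room_determines_limit)
```

Here Room -> Room_size holds although Room is not a key of the table, which is exactly the kind of dependency that causes redundancy.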


Anomalies are problems that can occur in poorly planned, un-normalized databases where all the data is stored in one table, so that database operations like insert, delete and update result in errors or discrepancies in the database.
Types of Anomalies
• Insert Anomaly
• Delete Anomaly
• Update Anomaly
Anomaly & Schema Analysis
• Schema analysis and refinement is the study of which schemas best describe an application.
• For example, consider this schema EmpDept describing employees and departments:

EmpDept
EID   Name    DeptID   DeptName
A01   Ali     12       Wing
A12   Eric    10       Tail
A13   Eric    12       Wing
A03   Tyler   12       Wing

• What do you think of this schema: is it OK or not? Why?
Schema Analysis: EmpDept
• The first problem students usually identify with the EmpDept schema is that it combines two different ideas: employee information and department information. But what is wrong with this?
• Suppose we want to open a new department and therefore insert a DeptName, but the department has no employees yet. Then EID (the primary key) and Name will have no values.
• Without a value in the primary key, how can the record exist?
• So we cannot insert a new DeptName in this schema.
• This is an insert anomaly.

EID   Name    DeptID   DeptName
A01   Ali     12       Wing
A12   Eric    10       Tail
A13   Eric    12       Wing
A03   Tyler   12       Wing
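The insert anomaly in EmpDept can be reproduced with sqlite3. The table and rows follow the slide; the new department (DeptID 14, 'Engine') is a made-up example, and NOT NULL is stated explicitly on the key because SQLite, unlike most systems, otherwise tolerates NULL in a non-integer primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmpDept (
        EID      TEXT PRIMARY KEY NOT NULL,
        Name     TEXT NOT NULL,
        DeptID   INTEGER,
        DeptName TEXT
    )
""")
conn.executemany(
    "INSERT INTO EmpDept VALUES (?, ?, ?, ?)",
    [("A01", "Ali", 12, "Wing"), ("A12", "Eric", 10, "Tail"),
     ("A13", "Eric", 12, "Wing"), ("A03", "Tyler", 12, "Wing")],
)

# Try to record a new department that has no employees yet.
try:
    conn.execute("INSERT INTO EmpDept VALUES (NULL, NULL, 14, 'Engine')")
    inserted = True
except sqlite3.IntegrityError as exc:
    inserted = False
    print("insert rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM EmpDept").fetchone()[0]
print(inserted, row_count)
```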


Insert Anomaly
• An Insert Anomaly occurs when certain attributes cannot be inserted into the database without the presence of other attributes.
• Take another example: a schema for allotting courses to rooms and teachers.
• What do you think? Is it OK?

Course_no   Tutor    Room   Room_size   En_limit
353         Smith    A532   45          40
351         Smith    C320   100         60
355         Clark    H940   400         300
456         Turner   H940   400         45


Insert Anomaly

Course_no   Tutor    Room   Room_size   En_limit
353         Smith    A532   45          40
351         Smith    C320   100         60
355         Clark    H940   400         300
456         Turner   H940   400         45

E.g. we have built a new room (B123), but it has not yet been timetabled for any courses or members of staff.
Inserting B123 is a problem here!


Delete Anomaly

A Delete Anomaly exists when certain attributes are lost because of the deletion of other attributes.
The presence of such an anomaly clearly suggests that we have not designed our database properly.


Delete Anomaly

Course_no   Tutor    Room   Room_size   En_limit
353         Smith    A532   45          40
351         Smith    C320   100         60
355         Clark    H940   400         300
456         Turner   H940   400         45

E.g. if we remove the course with Course_no 351 from the above table, the details of room C320 are deleted as well.

Deleting a course therefore also deletes the only record of its room.
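The loss of room C320 can be observed directly. The sketch loads the table from the slide into an in-memory SQLite database; the table name course and the known_rooms helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course (Course_no INTEGER PRIMARY KEY, Tutor TEXT, "
    "Room TEXT, Room_size INTEGER, En_limit INTEGER)"
)
conn.executemany("INSERT INTO course VALUES (?, ?, ?, ?, ?)", [
    (353, "Smith", "A532", 45, 40),
    (351, "Smith", "C320", 100, 60),
    (355, "Clark", "H940", 400, 300),
    (456, "Turner", "H940", 400, 45),
])

def known_rooms():
    """Rooms the database still knows about."""
    return [r[0] for r in conn.execute("SELECT DISTINCT Room FROM course ORDER BY Room")]

before = known_rooms()
conn.execute("DELETE FROM course WHERE Course_no = 351")  # cancel one course
after = known_rooms()
print(before, after)
```

Room C320 disappears from the second list even though only a course was meant to be removed.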


Update Anomaly

An Update Anomaly exists when one or more instances of duplicated data are updated, but not all.


Update Anomaly

Course_no Tutor Room Room_size En_limit


353 Smith A532 45 40
351 Smith C320 100 60
355 Clark H940 400 300
456 Turner H940 400 45

e.g. Room H940 has been improved; its Room_size is now 500. To
update this single fact, we have to update every row where
Room = H940.
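
The delete and update anomalies can be sketched on the same four rows in plain Python. The helper room_sizes is a hypothetical name introduced for illustration only.

```python
# The four rows of the course/room table from the slides.
rows = [
    {"course_no": 353, "tutor": "Smith", "room": "A532", "room_size": 45, "en_limit": 40},
    {"course_no": 351, "tutor": "Smith", "room": "C320", "room_size": 100, "en_limit": 60},
    {"course_no": 355, "tutor": "Clark", "room": "H940", "room_size": 400, "en_limit": 300},
    {"course_no": 456, "tutor": "Turner", "room": "H940", "room_size": 400, "en_limit": 45},
]


def room_sizes(table):
    """Map each room to the set of sizes recorded for it."""
    sizes = {}
    for row in table:
        sizes.setdefault(row["room"], set()).add(row["room_size"])
    return sizes


# Delete anomaly: dropping course 351 also drops the only record of room C320.
after_delete = [r for r in rows if r["course_no"] != 351]
c320_lost = "C320" not in room_sizes(after_delete)

# Update anomaly: changing H940 in only one of its two rows leaves two sizes.
partial = [dict(r) for r in rows]
for r in partial:
    if r["course_no"] == 355:
        r["room_size"] = 500
conflicting = {room: s for room, s in room_sizes(partial).items() if len(s) > 1}

print("C320 lost:", c320_lost)
print("Rooms with conflicting sizes:", conflicting)
```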



Update, deletion, and insertion anomalies are very
undesirable in any database.
Anomalies are avoided by the process of normalization.

Anomalies and Normalization
 Normalization is the process of splitting relations into well structured
relations that allow users to insert, delete, and update tuples without
introducing database inconsistencies. Without normalization many problems
can occur when trying to load an integrated conceptual model into the
DBMS.
 These problems arise from relations that are generated directly from E-R
Diagrams without Design considerations.
 An update anomaly is a data inconsistency that results from
data redundancy and a partial update.
 A deletion anomaly is the unintended loss of data due to
deletion of other data
 An insertion anomaly is the inability to add data to the
database due to absence of other data
 These Anomalies can be eliminated with appropriate Design
with the help of Normalization.



Unsatisfactory relation schemas that do not meet
certain conditions – the normal form tests – are
decomposed into smaller relation schemas that
meet the tests and hence possess the desirable
properties.
Thus, the normalization procedure provides
database designers with:
 A formal framework for analyzing relation
schemas based on their keys and on the functional
dependencies among their attributes.
 A series of Normal Form tests that can be
carried out on individual relation schemas so that
the relational database can be normalized to any
desired degree.

Database Design

The Normalization Theory in Database also relates to Logical Database Design



Database Design



RDBMS Design issues

 So far we have assumed that attributes are grouped to form a
relation schema by using the common sense of the database designer
or by mapping a schema defined by an ER model.
 We still need some formal measure of why one grouping of attributes
into a relation schema may be better than another.
 Unsatisfactory relation schemas that do not meet certain
conditions – the normal form tests – are decomposed into
smaller relation schemas that meet the tests and hence
possess the desirable properties.
 Thus, the normalization procedure provides database
designers with:
 A formal framework for analyzing relation schemas based on their
keys and on the functional dependencies among their attributes.
 A series of normal form tests that can be carried out on individual
relation schemas so that the relational database can be normalized
to any desired degree.



Normalization & Design objectives
The basic objective of normalization is to
reduce the various anomalies in the
database.
Normalization can be looked upon as a
process of analyzing the given relation
schemas based on their FDs and primary
keys to achieve the desirable properties of:
 Minimizing redundancy
 Minimizing the insertion, deletion, and update
anomalies.



Relationship Between Normal
Forms



Normalization…
 The normal form of a relation refers to the highest normal form
condition that it meets, and hence indicates the degree to which
it has been normalized.
 Normal forms, when considered in isolation from other factors, do
not guarantee a good database design.
 It is generally not sufficient to check separately that each relation
schema in the database is, say, in BCNF or 3NF.
 Rather, the process of normalization through decomposition
must also confirm the existence of additional properties that the
relation schemas, taken together, should possess:
 The lossless join property,
 The dependency preservation property, which ensures that each
functional dependency is represented in some individual relation
resulting after decomposition.



First Normal Form

 A relational database table that adheres to 1NF is one that
meets a certain minimum set of criteria.
These criteria are basically concerned with
ensuring that the table is a faithful
representation of a relation and that it is
free of repeating groups.
First Normal Form is the minimum normal
form a schema should follow in order to
fulfill design objectives.



1 NF

 Some definitions of 1NF, most notably that of Edgar F. Codd,
make reference to the concept of atomicity.
 Codd states that the "values in the domains on which each
relation is defined are required to be atomic with respect to the
DBMS."
 Codd defines an atomic value as one that "cannot be
decomposed into smaller pieces by the DBMS (excluding certain
special functions)."
 This means a field should not be divided into parts with more than
one kind of data in it such that what one part means to the
DBMS depends on another part of the same field.



First Normal Form

 A domain is atomic if its elements are considered to be indivisible
units
 Examples of non-atomic domains:
 Set of names, composite attributes
 Identification numbers like CS101 that can be broken up into
parts
 A relational schema R is in first normal form if the domains of all
attributes of R are atomic
 Non-atomic values complicate storage and encourage redundant
(repeated) storage of data
 Example: Set of accounts stored with each customer, and set
of owners stored with each account
 We assume all relations are in first normal form (and revisit this
again!)



First Normal Form
 Atomicity is actually a property of how the elements of the domain are
used.
 Example: Strings would normally be considered indivisible
 Suppose that students are given roll numbers which are
strings of the form CS0012 or EE1127
 If the first two characters are extracted to find the department,
the domain of roll numbers is not atomic.
 What is wrong with this approach?
 Doing so is a bad idea: such an approach does
not conform to database theory and leads
to encoding of information in the application
program rather than in the database.



Atomicity in 1 NF
 "Atomic" has never really meant "indivisible", which is why that term is
finally falling out of favor. Loosely speaking, "atomic" means that if a value
has component parts, the DBMS either ignores the existence of those
parts, or it provides functions to manipulate them. For example, a
timestamp value has these parts:
 Year
 Month
 Day
 Hours
 Minutes
 Seconds
 Milliseconds

That kind of value is obviously divisible, and database management
systems typically provide functions to manipulate those parts. They also
provide a way to select a timestamp as a single value. (Which, of
course, it is.)



Normal Forms & FDs: Review

 Unnormalized – There are multivalued attributes or repeating
groups, and records cannot be uniquely identified by the primary key
 1 NF – No composite attributes or repeating groups
 2 NF – 1 NF plus no partial dependencies
 3 NF – 2 NF plus no transitive dependencies

First Normal Form

 How to identify 1 NF?
If a relation is not in 1 NF, then bring it to 1 NF.
Decompose the schema under some conditions.
After decomposition, check the normal form again.



How to identify Normal Forms

 Unnormalized – There are multivalued attributes or repeating
groups, and no key attributes.

 1NF
 2NF
 3NF
 BCNF
(The concept of FDs, Functional Dependencies, is required.)

 What are the Key Attributes ?
 What are the Non-Key Attributes ?
 Find out Functional Dependencies ?
 What about Higher Normal Forms?



First Normal Form

 A relational database table that adheres to 1NF is one that meets a
certain minimum set of criteria.
 These criteria are basically concerned with ensuring that the table is a
faithful representation of a relation and that it is free of repeating
groups.
 Some definitions of 1NF, most notably that of Edgar F. Codd, make
reference to the concept of atomicity.
 Codd states that the "values in the domains on which each relation is
defined are required to be atomic with respect to the DBMS."
 First Normal Form Scheme should be such that in a given table there
should be
 No Composite Attributes
 No Repeating groups and
 All the attributes can be uniquely identified by Key Attributes



Normal Forms

This table is not in first normal form because the [Color] column can
contain multiple values.

For example, the first row includes values "red" and "green."

To bring this table to first normal form, we split the table into two
tables and now we have the resulting tables:



Decomposition for Normalization



Relation in 1NF ?

 If the primary key cannot uniquely identify the records…!
 There are repeating groups…!!
 What is to be done to bring it to 1 NF…??



Bringing a Relation to 1NF
 Option 1: Make a determinant of the repeating group (or the
multivalued attribute) a part of the primary key.

Composite
Primary Key



Bringing a Relation to 1NF
 Option 2: Remove the entire repeating group from the relation.
 Create another relation which would contain all the attributes of
the repeating group, plus the primary key from the first relation.
In this new relation, the primary key from the original relation
and the determinant of the repeating group will comprise a
primary key.



Bringing a Relation to 1NF

STUDENT_COURSE

Stud_ID Course Units


101 MSI 250 3
101 MSI 415 3
125 MSI 331 3
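
Option 2 can be sketched in Python as follows. The unnormalized input layout, the name values and the function name split_repeating_group are assumptions for illustration; only the student IDs, courses and units come from the STUDENT_COURSE table above.

```python
# Unnormalized input: each student row carries a repeating group of courses.
# The 'name' values are invented; IDs, courses and units follow the slide.
students = [
    {"stud_id": 101, "name": "Student A", "courses": [("MSI 250", 3), ("MSI 415", 3)]},
    {"stud_id": 125, "name": "Student B", "courses": [("MSI 331", 3)]},
]


def split_repeating_group(rows):
    """Return (STUDENT, STUDENT_COURSE) relations as lists of tuples."""
    # STUDENT keeps the attributes that do not repeat.
    student = [(r["stud_id"], r["name"]) for r in rows]
    # STUDENT_COURSE gets one row per element of the repeating group,
    # prefixed with the primary key of the original relation.
    student_course = [
        (r["stud_id"], course, units)
        for r in rows
        for course, units in r["courses"]
    ]
    return student, student_course


student, student_course = split_repeating_group(students)
# (stud_id, course) acts as the composite primary key of STUDENT_COURSE.
composite_keys = [(sid, course) for sid, course, _ in student_course]
for row in student_course:
    print(row)
```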



Example Schema
 Consider the relation schema:
Lending-schema = (branch-name, branch-city, assets,
customer-name, loan-number, amount)

 What are the Problems this design will face??


 Redundancy:
 Data for branch-name, branch-city, assets are repeated for each loan that a
branch makes
 Wastes space
 Complicates updating, introducing possibility of inconsistency of assets value
 Null values
 Cannot store information about a branch if no loans exist
 Can use null values, but they are difficult to handle.
Normal Forms: FDs & MVDs
 1NF ( First Normal Form)

To understand
 2NF
 3NF
 BCNF
the concept of FDs (Functional Dependencies) is required.

To understand
 4NF
 5NF
the concept of MVDs (Multivalued Dependencies) is required.



DBMS Design & Normalization :

 Step I : Test the given relation scheme for any normal form.
 Step II : If the given relation scheme is not even in 1 NF,
decompose it into smaller relation schemas and test again.
 Step III : Identify and analyse the FDs of the given scheme
(2NF/3NF), then go to Step IV.
 Step IV : Decompose the relation scheme into smaller relation schemas.
 Step V : Test the decomposed schemas for normal forms.



Goal — Devise a Theory for the Following

 Decide whether a particular relation R is in “good” form.


 In the case that a relation R is not in “good” form, decompose it
into a set of relations {R1, R2, ..., Rn} such that
 each relation is in good form
 the decomposition is a lossless-join decomposition
 Our theory is based on:
 functional dependencies
 multivalued dependencies



Decomposition

 Decompose the relation schema Lending-schema into:


Branch-schema = (branch-name, branch-city,assets)
Loan-info-schema = (customer-name, loan-number,
branch-name, amount)
 All attributes of an original schema (R) must appear in
the decomposition (R1, R2):
R = R1 ∪ R2
 Lossless-join decomposition.
For all possible relations r on schema R
r = Π_R1(r) ⋈ Π_R2(r)
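
The lossless-join condition can be checked on a concrete instance with a small Python sketch. The sample rows and the helper names project and natural_join are invented for illustration. A check on one instance only demonstrates the idea; it does not prove the property for all legal relations r.

```python
def project(relation, attrs):
    """Projection onto attrs: keep only attrs and drop duplicate tuples."""
    return {frozenset((a, t[a]) for a in attrs) for t in map(dict, relation)}


def natural_join(r1, r2):
    """Combine tuples of r1 and r2 that agree on all shared attributes."""
    out = set()
    for t1 in map(dict, r1):
        for t2 in map(dict, r2):
            shared = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in shared):
                out.add(frozenset({**t1, **t2}.items()))
    return out


# Invented sample instance of Lending-schema.
lending = [
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000,
     "customer_name": "Jones", "loan_number": "L-17", "amount": 1000},
    {"branch_name": "Brighton", "branch_city": "Brooklyn", "assets": 7100000,
     "customer_name": "Smith", "loan_number": "L-23", "amount": 2000},
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000,
     "customer_name": "Hayes", "loan_number": "L-15", "amount": 1500},
]
r = project(lending, list(lending[0]))

# Decomposition from the slide: the schemas share branch_name.
branch = project(lending, ["branch_name", "branch_city", "assets"])
loan_info = project(lending, ["customer_name", "loan_number", "branch_name", "amount"])
lossless = natural_join(branch, loan_info) == r

# A deliberately poor alternative that shares only branch_city.
loan_by_city = project(lending, ["customer_name", "loan_number", "amount", "branch_city"])
lossy_result = natural_join(branch, loan_by_city)

print("Slide decomposition recovers r:", lossless)
print("Tuples after the poor decomposition:", len(lossy_result), "vs", len(r))
```

In the second decomposition the join produces spurious tuples, because two branches share the same city.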



RDBMS Design
&
Higher Normal forms
Design & Normalization
 In RDBMS design, normalization plays a significant
role:
 Normalization is a process of analyzing the given relation
schemas based on their FDs and primary keys to achieve
the desirable properties
 Normalization provides tests of whether a given schema is in
1NF / 2NF / 3NF etc.
 Any design in a lower normal form can be further improved (by
decomposition) and again tested for 1NF / 2NF / 3NF etc.
 The concept of FDs is required!
 Decomposition properties need to be understood and
adhered to.
 Normalization helps ensure that the design is proper and does
not have any anomaly.





Overview of Normal Forms

 1NF ( First Normal Form)

To understand
 2NF
 3NF
 BCNF
the concept of FDs (Functional Dependencies) is required.

To understand
 4NF
 5NF
the concept of MVDs (Multivalued Dependencies) is required.



Keys & Functional Dependency
 A key attribute is a special attribute which can
uniquely identify the record of an entity
 A super key of an entity set is a set of one or more
attributes whose values uniquely determine each
entity.
Ex. Super key : ( customer-id, customer-name)
 A candidate key of an entity set is a minimal super key
 customer-id is a candidate key of customer
 account-number is a candidate key of account
 Although several candidate keys may exist, one of the
candidate keys is selected to be the primary key.
 Functional dependencies show us the inter-relation
among attributes
Functional Dependencies and Keys

 An FD is a generalization of the notion of a key.
 How ?
 X → Y holds if a given set of values for the attributes in X uniquely
determines the values of the attributes in Y.
 For example, in the relation schema
Student (sid, name, supervisor_id, specialization),
we write the functional dependency:
{sid} → {name, supervisor_id, specialization}
 The sid determines all attributes (i.e., the entire record)
 If two tuples in the relation student have the same sid, then they must
have the same values on all attributes.
 In other words they must be the same tuple (since the relational model
does not allow duplicate records)



Functional Dependency
 Inter-relation among attributes of an entity.
 Let r be a relation on the relation scheme R; then r satisfies the
functional dependency X → Y
 if a given set of values for the attributes in X uniquely determines
the values of the attributes in Y.
 ( X determines Y;
 Y is functionally dependent on X )
 FDs can be used to group the attributes into relation schemes
that are in a particular normal form
Approach of Normalization :
 Step I : Test the given relation scheme for any normal form.
 Step II : Check whether the given relation scheme fails the normal form tests (2NF/3NF).
 Step III : If so, decompose the relation scheme into smaller relation
schemas.
 Step IV : Test the decomposed schemas for normal forms.



FDs example

What are
FDs ??
List them....



FDs example

Are there any other FDs ??
List them....



How to identify FDs

 Identifying the correct FDs is key to efficient DBMS design.
 There are many commonly committed errors in
identifying FDs.
R satisfies the functional dependency X → Y
 if a given set of values for the attributes in X uniquely
determines the values of the attributes in Y.
X determines Y
 While identifying FDs we should ensure that X
uniquely determines Y in every legal instance, and not
merely that some tuples satisfy it !



Procedure to identify FDs
How to identify the FDs in a scheme:
We start by finding any two tuples with the same
X value; the Y values in these tuples
must then be the same.
Repeat this procedure until all such pairs of
tuples with the same X values have been examined
for their Y values.
Then verify whether R, in the real-world
scenario, satisfies X → Y or not.
 Should FDs be identified only by observing the tuples/data ?
 Analyse the FDs and see whether they will work.
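
The tuple-comparison procedure can be sketched as a small Python function. The function name satisfies_fd is hypothetical; the sample rows are the course/room table used earlier in these slides. The check only tells us whether one instance satisfies X → Y. It cannot confirm that the FD holds in the real world.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if no two rows agree on the lhs attributes but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y seen for each X; any later mismatch breaks the FD.
        if seen.setdefault(x, y) != y:
            return False
    return True


rows = [
    {"course_no": 353, "tutor": "Smith", "room": "A532", "room_size": 45},
    {"course_no": 351, "tutor": "Smith", "room": "C320", "room_size": 100},
    {"course_no": 355, "tutor": "Clark", "room": "H940", "room_size": 400},
    {"course_no": 456, "tutor": "Turner", "room": "H940", "room_size": 400},
]

room_determines_size = satisfies_fd(rows, ["room"], ["room_size"])
tutor_determines_room = satisfies_fd(rows, ["tutor"], ["room"])
# Holds in this instance only by coincidence: two rooms could share a size.
size_determines_room = satisfies_fd(rows, ["room_size"], ["room"])

print(room_determines_size, tutor_determines_room, size_determines_room)
```

The last check illustrates the pitfall: the data happen to satisfy Room_size → Room, yet nothing in the real world guarantees it.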



Approach of Normalization :

 Step I : Test the given relation scheme for any normal form.
 Step II : Identify and analyse the FDs of the given scheme.
 Step III : Check whether the given relation scheme fails the normal form tests (2NF/3NF).
 Step IV : If so, decompose the relation scheme into smaller relation schemas.
 Step V : Test the decomposed schemas for normal forms.



Functional Dependency

 Inter-relation among attributes of an entity.
 Let r be a relation on the relation scheme R; then r
satisfies the functional dependency X → Y
 if a given set of values for the attributes in X uniquely
determines the values of the attributes in Y.
X determines Y
 Y is functionally dependent on X
 FDs allow us to express constraints that we
cannot express with superkeys !
 FDs can be used to group the attributes into relation
schemes that are in a particular normal form



FDs

 Note: A specific instance of a relation schema may satisfy a functional
dependency even if the functional dependency does not hold on all legal
instances.
 For example, a specific instance of Loan may, by chance, satisfy
amount → customer_name.

 Functional dependencies allow us to express constraints that
cannot be expressed using superkeys. Consider the schema:
bor_loan = (customer_id, loan_number, amount, customer_name).
We expect this functional dependency to hold:
loan_number → amount
but would not expect the following to hold:
amount → customer_name



Common pitfalls in identifying FDs

 Design Goals & Normalization

Design Anomalies
 Identifying FDs : When & How ?

Procedure of FDs identifications ?


 Pitfalls in FDs identification

Limitations with Normal Forms


 Decompose the Relation Scheme

FDs to be taken care of while decomposition


Dependency preservation, Lossless join.



Functional Dependencies (Cont.)

 K is a superkey for relation schema R if and only if K → R
 K is a candidate key for R if and only if
 K → R, and
 for no α ⊂ K, α → R
 Functional dependencies allow us to express constraints that cannot be
expressed using superkeys. Consider the schema:
bor_loan = (customer_id, loan_number, amount ).
We expect this functional dependency to hold:
loan_number → amount
but would not expect the following to hold:
amount → customer_name
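
The superkey and candidate-key conditions can be tested mechanically through attribute closure. The sketch below assumes that loan_number → amount is the only FD on bor_loan, which the slide does not state; the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(k, schema, fds):
    """K is a superkey iff K -> R, i.e. the closure of K covers the schema."""
    return closure(k, fds) >= set(schema)


def is_candidate_key(k, schema, fds):
    """A candidate key is a superkey from which no attribute can be removed."""
    k = set(k)
    return is_superkey(k, schema, fds) and not any(
        is_superkey(k - {a}, schema, fds) for a in k
    )


BOR_LOAN = {"customer_id", "loan_number", "amount"}
# Assumption: this is the only FD that holds on bor_loan.
FDS = [({"loan_number"}, {"amount"})]

print(sorted(closure({"loan_number"}, FDS)))
print(is_superkey({"loan_number"}, BOR_LOAN, FDS))
print(is_candidate_key({"customer_id", "loan_number"}, BOR_LOAN, FDS))
print(is_candidate_key(BOR_LOAN, BOR_LOAN, FDS))
```

Here loan_number → amount is a constraint that holds although loan_number is not a superkey, which is the point made on the slide.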



1NF
1NF :- A relation scheme is said to be in 1NF if there
are no composite attributes, and every attribute
has atomic or indivisible values.

2NF :- A relation is said to be in 2NF if:
 1. It is in 1NF, and
 2. Non-key attributes are functionally dependent
on the key attributes.
Further, if the key has more than one attribute, then no
non-key attribute should be functionally dependent
upon a part of the key attributes.



Design Goals & Pitfalls with FDs

 The basic objective of normalization is to reduce the various
anomalies in the database.
 Normalization can be looked upon as a process of analyzing
the given relation schemas based on their FDs and primary
keys to achieve the desirable properties of:
 Minimizing redundancy
 Minimizing the insertion, deletion, and update anomalies.
 How to identify FDs and prime / nonprime attributes?
 An attribute A in a relation is called a prime attribute if A is a part of
any candidate key of the relation.
 There are some pitfalls in identifying FDs !



Decomposition
 The bad design suggests that we should decompose that table.
 It is generally not sufficient to check separately that each
relation schema in the database is, say, in 2NF or 3NF.
 Rather, the process of normalization through
decomposition must also confirm the existence of
additional properties that the relation schemas, taken
together, should possess:
 The lossless join property,
 The dependency preservation property, which ensures that
each functional dependency is represented in some individual
relation resulting after decomposition.



Normal Forms & FDs: Review

 Unnormalized – There are multivalued attributes or repeating
groups
 1 NF – No composite attributes or repeating groups
 2 NF – 1 NF plus no partial dependencies
 3 NF – 2 NF plus no transitive dependencies



Design Analyses

It is simpler to test Normal Forms when the FDs of the
schema are known
Given a schema: Identify FDs and then Normal Forms
Need for Higher Normal Forms

Design Analyses
Teaches Schema
Common pitfalls in FDs in the Schema
Are the FDs
Course → Professor
Professor → Course
satisfied in the following schema ? Why so ?

X → Y holds if a given set of values for the attributes in X uniquely determines
the values of the attributes in Y.
Then verify whether R, in the real-world scenario, satisfies
Professor → Course or not.
Can the same value of the Professor attribute have more than one value for Course ?
Normal Form Test for TEACHES
 Let us consider the TEACHES relation and test whether it
fulfills any normal form.
What are the FDs in this scheme ?
TEACHES contains the attributes Professor, Course, Room,
Room_Cap, Enrol_Lmt (Enrolment Limit).

 The relation scheme for the relation TEACHES is (Prof, Course, Room,
Room_Cap, Enrol_Lmt)
 The domain of the attribute Prof is all the faculty members of the university.
 The domain of the attribute Course is the courses offered by the university.
 The domain of Room is the rooms in the buildings of the university.
 The domain of Room_Cap is an integer value indicating the seating capacity of the room.



Design - Analysis

 Go for Normal Form Tests ?
Is the TEACHES relation in First Normal Form or not ?
 Identify the FDs in the scheme.
 Should FDs be identified only by observing the tuples/data ?
 Analyse the FDs and see how they will work.
 Go for higher Normal Form Tests ?
 What is the remedy for the problem ?



Teaches Relation

Which Normal Form ?
What are the Key Attributes ?
What are the Non-Key Attributes ?
Find out Functional Dependencies ?
Are the non-key attributes functionally dependent
on the key attributes ?



FDs in Scheme
 The domain of Enrol_Lmt is also an integer value, which should be less
than or equal to the corresponding value of Room_Cap.
 The TEACHES relation is in first normal form since its attributes contain
only atomic values and there are key attributes uniquely identifying each record.

 FDs ?
 The course is scheduled in a given room, and each Course uniquely
identifies a Room:
Course → Room
 Since the room has a given maximum number of available seats, there
is a functional dependency
Room → Room_Cap, and hence by transitivity
Course → Room → Room_Cap
Thus the functional dependencies in this relation are
 {Course → (Prof, Room, Room_Cap, Enrol_Lmt),
 Room → Room_Cap}



FDs Analysis

The Given Schema named TEACHES Relation : Normal Form?



Analysis of Scheme
 There is another transitive dependency (here we assume that Enrol_Lmt is
the upper limit on registration for a course and is based solely on the room
capacity):
 Room → Room_Cap → Enrol_Lmt
 The presence of these transitive dependencies in TEACHES will cause the
following problems:
 The capacity of a room cannot be entered in the database
unless a course is scheduled in the room (Insert
Anomaly), and
 The capacity of a room in which only one course is
scheduled will be deleted if the only course scheduled in
that room is deleted (Delete Anomaly).
 Because the same room can appear more than once in the
database, there could be inconsistencies between the
multiple occurrences of the attribute pair Room and
Room_Cap (Update Anomaly).



Why Anomalies?

TEACHES ( Professor, Course, Room, Room_Cap, Enrol_Lmt )

 Anomalies occur in the given TEACHES scheme because a non-key
attribute is not determined directly by the key:
Room_Cap depends on the non-key attribute Room
(Room → Room_Cap), and on the key Course only through Room.
Since the key Course is a single attribute, this is strictly a
transitive dependency rather than a partial one.
Either way, the relation is in First Normal Form but is
not yet a good design.
We have to decompose the relation.



First Normal Form



Second Normal Form

2NF :- A relation is said to be in 2NF if:
 1. It is in 1NF, and
 2. Non-key attributes are functionally dependent
on the key attributes.
Further, if the key has more than one attribute, then no non-key
attribute should be functionally dependent upon a part of the
key attributes.
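
The "part of the key" condition can be sketched as a simple check over a list of FDs. The example reuses the STUDENT_COURSE relation from the earlier slide; the FD Course → Units is an assumption made for illustration, and the function name is hypothetical. Only the FDs listed explicitly are examined, not the ones they imply.

```python
def partial_dependencies(key, fds):
    """Return listed FDs whose left side is a proper part of the key
    and which determine at least one non-key attribute."""
    key = frozenset(key)
    found = []
    for lhs, rhs in fds:
        non_key = set(rhs) - key
        if frozenset(lhs) < key and non_key:
            found.append((set(lhs), non_key))
    return found


# STUDENT_COURSE(Stud_ID, Course, Units) with composite key (Stud_ID, Course).
KEY = {"Stud_ID", "Course"}

# Assumption: the units of a course do not depend on the student.
with_partial = partial_dependencies(KEY, [({"Course"}, {"Units"})])

# If Units depended on the whole key instead, there would be no violation.
without_partial = partial_dependencies(KEY, [({"Stud_ID", "Course"}, {"Units"})])

print(with_partial)
print(without_partial)
```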



2 NF & 3 NF…

[Figure: an entity with attributes A, B, C, D, E, F, labelled as key and non-key attributes]

 2NF: Non-key attributes are functionally dependent on the key attributes.
Partial dependency is not allowed.
 3NF: No non-key attribute is functionally dependent upon any other non-key attribute.
Transitive dependency is not allowed.





Decomposition Relation & 2NF ?

Here Course is the key attribute for COURSE_DETAILS
and Room is the key attribute for ROOM_DETAILS.
The decomposed relations do not have any partial dependency.
Thus the schema is now in Second Normal Form.
Is Second Normal Form capable of removing all anomalies
and giving a good database implementation ?



No Anomaly ?
 Although a 2 NF scheme has no partial dependency, there can still
be anomalies because of other dependencies.
 We cannot insert a Room value in the relation along with an
Enrol_Lmt value unless we have a Course value to go along with the
Room value.
 There is an interrelation join dependency between
COURSE_DETAILS and ROOM_DETAILS to enforce the constraint
that Enrol_Lmt be less than or equal to Room_Cap.
 Relation COURSE_DETAILS has a transitive dependency
Course → Room → Enrol_Lmt
 A non-key attribute is functionally dependent on another non-key
attribute !

Database System Concepts 7.148 ©Silberschatz, Korth and Sudarshan


Further decomposition & 3 NF ?



Normal Forms & Decomposition

 Case Study of TEACHES Relation
   First Normal Form?
   Anomalies in 1NF
   Decomposition for 2NF
   Is 2NF without anomalies?
   Decomposition for 3NF
   Decomposition principles
   Definition of 3NF
 Need for Closure
   Armstrong Axioms for Closure


Third Normal Form:
 A relation scheme is said to be in 3NF if:
 It is in 2NF.
 No non-key attribute is functionally dependent upon any other
non-key attribute.
 Thus, there should be no transitive dependency of a non-key attribute
on the primary key.
 A 3NF scheme disallows partial dependencies, as 2NF does, and
additionally disallows transitive dependencies.

Another definition of Third Normal Form:

 A relation scheme R<S, F> is in 3NF if, for every non-trivial functional
dependency X → A in F+ (the closure of F), either X
contains a key (i.e. X is a superkey) or A is a prime attribute
(an attribute that belongs to some candidate key).
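
The closure-based definition can be turned into a small check. The following is a minimal sketch, not a definitive implementation: it finds candidate keys by brute force, which is only practical for small schemas, and it tests the dependencies listed in F rather than enumerating all of F+, which is the usual shortcut for this test. The function names are hypothetical, and the BOOK schema with its dependencies is taken from Example 1 in these slides.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(schema, fds):
    """Brute-force search for minimal attribute sets whose closure is the schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            attrs = set(combo)
            if any(key <= attrs for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(attrs, fds) >= schema:
                keys.append(attrs)
    return keys


def third_nf_violations(schema, fds):
    """List (X, A) pairs where X -> A breaks the 3NF condition."""
    prime = set().union(*candidate_keys(schema, fds))
    violations = []
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) >= schema
        for attr in sorted(rhs - lhs):  # rhs - lhs skips the trivial part
            if not is_superkey and attr not in prime:
                violations.append((sorted(lhs), attr))
    return violations


book_schema = {"ISBN", "Title", "Publisher", "Address"}
book_fds = [
    ({"ISBN"}, {"Title"}),
    ({"ISBN"}, {"Publisher"}),
    ({"Publisher"}, {"Address"}),
]

print(candidate_keys(book_schema, book_fds))
print(third_nf_violations(book_schema, book_fds))
```

For BOOK the only candidate key is ISBN, and Publisher → Address is reported because Publisher is not a superkey and Address is not a prime attribute.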

First Normal Form

 A relational database table that adheres to 1NF is one that meets a
certain minimum set of criteria.
 These criteria are basically concerned with ensuring that the table is
a faithful representation of a relation and that it is free of repeating
groups.
 Some definitions of 1NF, most notably that of Edgar F. Codd, make
reference to the concept of atomicity.
 Codd states that the "values in the domains on which each relation is
defined are required to be atomic with respect to the DBMS."
 A First Normal Form scheme should be such that a given table has
 No composite attributes
 No repeating groups, and
 All attributes uniquely identified by the key attributes


Second Normal Form

2NF: A relation is said to be in 2NF if

 1. it is in 1NF, and
 2. its non-key attributes are functionally dependent
on the key attributes.
Further, if the key has more than one attribute, then no non-key
attribute should be functionally dependent upon a part of the
key attributes.


2NF Cont…

[Figure: an entity with attributes A to F, each marked as a key or a non-key attribute]

Partial dependency is not allowed.

Why? What is the effect of allowing partial dependency?
Is Second Normal Form able to give a good database implementation?




Functional Dependencies (Cont.)

 K is a superkey for relation schema R if and only if K → R
 K is a candidate key for R if and only if
 K → R, and
 for no α ⊂ K, α → R
 Functional dependencies allow us to express constraints that
cannot be expressed using superkeys. Consider the schema:
Loan-info-schema = (customer-name, loan-number,
branch-name, amount).
We expect this set of functional dependencies to hold:
loan-number → amount
loan-number → branch-name
but would not expect the following to hold:
loan-number → customer-name
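
To make the loan example concrete, a functional dependency can be tested against a relation instance: X → Y fails as soon as two rows agree on X but differ on Y. The sketch below assumes hypothetical sample rows (the slides give only the schema), and fd_holds is an illustrative helper name.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the dependency lhs -> rhs is satisfied by every pair of rows."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left-hand value
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical rows (illustrative): loan L-17 is held jointly by two customers.
loans = [
    {"customer-name": "Jones", "loan-number": "L-17",
     "branch-name": "Downtown", "amount": 1000},
    {"customer-name": "Smith", "loan-number": "L-17",
     "branch-name": "Downtown", "amount": 1000},
    {"customer-name": "Hayes", "loan-number": "L-15",
     "branch-name": "Perryridge", "amount": 1500},
]

print(fd_holds(loans, ["loan-number"], ["amount"]))
print(fd_holds(loans, ["loan-number"], ["branch-name"]))
print(fd_holds(loans, ["loan-number"], ["customer-name"]))
```

An instance can only refute a dependency. That a dependency holds in one set of rows does not prove it is a constraint of the schema, which remains a design decision.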



Dependencies
 Multivalued attributes (or repeating groups): non-key
attributes, or groups of non-key attributes, whose values
are not uniquely identified, directly or indirectly, by the value
of the primary key or a part of it (that is, they are not
functionally dependent on it).
 As we can see, Course_ID and Units (credits) are not
uniquely identified by Stud_ID and Name.


Dependency Types
 Partial dependency: when a non-key attribute is determined by a
part, but not the whole, of a COMPOSITE primary key.
 Name is a non-key attribute that is determined by Cust_ID alone and does
not require the other part of the primary key, namely Order_ID.

[Figure: partial dependency of Name on Cust_ID]


Dependency Types
 Transitive dependency: when a non-key attribute
determines another non-key attribute.
 EMPLOYEE is the relation in which Emp_ID is the key attribute;
Dept_ID and Dept_Name are both non-key attributes, and Dept_ID
determines Dept_Name.

[Figure: transitive dependency]


Example 1: Determine NF with FDs

 ISBN → Title
 ISBN → Publisher
 Publisher → Address

There are no composite attributes. All attributes are directly or indirectly
determined by the primary key; thus the relation is at least in 1NF.

BOOK

ISBN Title Publisher Address


Example 1: Determine NF

 ISBN → Title
 ISBN → Publisher
 Publisher → Address

The relation is at least in 1NF. There is no COMPOSITE
primary key, therefore there can't be partial dependencies.
Therefore, the relation is at least in 2NF.

BOOK

ISBN Title Publisher Address


Example 1: Determine NF

 ISBN → Title
 ISBN → Publisher
 Publisher → Address

Publisher is a non-key attribute, and it determines Address,
another non-key attribute. Therefore, there is a transitive
dependency, which means that the relation is NOT in 3NF.

BOOK

ISBN Title Publisher Address


Example 1: Determine NF

 ISBN → Title
 ISBN → Publisher
 Publisher → Address

We know that the relation is at least in 2NF, and it is not in 3NF.
Therefore, we conclude that the relation is in 2NF.

BOOK

ISBN Title Publisher Address


Example 1: Determine NF

 ISBN → Title
 ISBN → Publisher
 Publisher → Address

In our solution we will write the following justification:
1) No composite attributes, therefore at least 1NF
2) No partial dependencies, therefore at least 2NF
3) There is a transitive dependency (Publisher → Address),
therefore not 3NF
Conclusion: The relation is in 2NF

BOOK

ISBN Title Publisher Address


Example 2: Determine NF

 Product_ID → Description

All attributes are directly or indirectly determined by the
primary key; therefore, the relation is at least in 1NF.

ORDER

Order_No Product_ID Description


Example 2: Determine NF

 Product_ID → Description

The relation is at least in 1NF.
There is a COMPOSITE primary key (PK), (Order_No, Product_ID),
therefore there can be partial dependencies. Product_ID, which is a part
of the PK, determines Description; hence, there is a
partial dependency. Therefore, the relation is not in 2NF.
There is no point in checking for transitive dependencies!

ORDER

Order_No Product_ID Description


Example 2: Determine NF

 Product_ID → Description

We know that the relation is at least in 1NF, and it is not in 2NF.
Therefore, we conclude that the relation is in 1NF.

ORDER

Order_No Product_ID Description


Example 2: Determine NF

 Product_ID → Description

In your solution you will write the following justification:
1) No multivalued (M/V) attributes, therefore at least 1NF
2) There is a partial dependency (Product_ID → Description),
therefore not in 2NF
Conclusion: The relation is in 1NF

ORDER

Order_No Product_ID Description


Example 3: Determine NF

 Part_ID → Description
 Part_ID → Price
 Part_ID, Comp_ID → No

Comp_ID and No are not determined by the primary key; therefore,
the relation is NOT in 1NF.
There is no point in looking at partial or transitive dependencies.

PART

Part_ID Descr Price Comp_ID No


Example 3: Determine NF

 Part_ID → Description
 Part_ID → Price
 Part_ID, Comp_ID → No

In your solution you will write the following justification:
1) There are multivalued (M/V) attributes; therefore, not 1NF
Conclusion: The relation is not normalized.

PART

Part_ID Descr Price Comp_ID No




Bringing a Relation to 1NF
 Option 1: Make a determinant of the repeating group (or the
multivalued attribute) a part of the primary key.

[Figure: relation with a composite primary key]


Bringing a Relation to 1NF
 Option 2: Remove the entire repeating group from the relation.
 Create another relation that contains all the attributes of
the repeating group, plus the primary key from the first relation.
In this new relation, the primary key from the original relation
and the determinant of the repeating group together form the
primary key.


Bringing a Relation to 1NF

STUDENT_COURSE

Stud_ID  Course   Units
101      MSI 250  3
101      MSI 415  3
125      MSI 331  3


Bringing a Relation to 2NF
 Goal: Remove partial dependencies

[Figure: relation with a composite primary key and its partial dependencies]


Bringing a Relation to 2NF
 Remove from the original relation the attributes that
depend on a part, but not the whole, of the primary key.
For each partial dependency, create a new
relation, with the corresponding part of the primary
key from the original as its primary key.
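
The rule above can be sketched on the STUDENT_COURSE rows shown in these slides, where Units depends only on the course part of the composite key (Stud_ID, Course_ID). This is an illustrative sketch, not a general normalizer: the helper names are hypothetical, and the caller is assumed to already know which attributes are partially dependent.

```python
def project(rows, attrs):
    """Project dict rows onto attrs, keeping each distinct tuple once."""
    seen = set()
    out = []
    for row in rows:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            out.append(values)
    return out


def split_partial_dependency(rows, primary_key, determinant, dependents):
    """Move dependents, which depend only on determinant, into a new relation."""
    # The determinant must be a proper part of the composite primary key.
    assert set(determinant) < set(primary_key)
    remaining = [a for a in rows[0] if a not in dependents]
    original = project(rows, remaining)
    new_relation = project(rows, list(determinant) + list(dependents))
    return remaining, original, new_relation


# Rows from the STUDENT_COURSE table in these slides.
student_course = [
    {"Stud_ID": 101, "Course_ID": "MSI 250", "Units": 3},
    {"Stud_ID": 101, "Course_ID": "MSI 415", "Units": 3},
    {"Stud_ID": 125, "Course_ID": "MSI 331", "Units": 3},
]

columns, enrolment, course = split_partial_dependency(
    student_course,
    primary_key=["Stud_ID", "Course_ID"],
    determinant=["Course_ID"],
    dependents=["Units"],
)

print(columns)    # attributes left in STUDENT_COURSE
print(enrolment)  # rows left in STUDENT_COURSE
print(course)     # rows of the new COURSE relation, keyed by Course_ID
```

Course_ID stays in the original relation as part of its key, so the two relations can still be joined.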



Bringing a Relation to 2NF

STUDENT_COURSE

Stud_ID  Course_ID
101      MSI 250
101      MSI 415
125      MSI 331

COURSE

Course_ID  Units
MSI 250    3.00
MSI 415    3.00
MSI 331    3.00

Bringing a Relation to 3NF

 Goal: Get rid of transitive dependencies.

[Figure: transitive dependency]


Bringing a Relation to 3NF
 Remove from the original relation the attributes that
depend on a non-key attribute. For each
transitive dependency, create a new relation with the
non-key attribute that is the determinant in the
transitive dependency as its primary key, and the
dependent non-key attribute as a dependent attribute.
The determinant stays in the original relation as a foreign key.
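
At the schema level, the same rule can be sketched as a function that takes the attribute list, the primary key and the functional dependencies, and returns the reduced original schema plus one new schema per transitive dependency. This is a minimal sketch with a hypothetical function name. The EMPLOYEE attributes come from these slides, and the dependency Dept_ID → Dept_Name is assumed from the transitive dependency example.

```python
def split_transitive_dependencies(schema, primary_key, fds):
    """Move attributes that depend on a non-key determinant into new relations."""
    key = set(primary_key)
    original = list(schema)
    new_relations = []
    for lhs, rhs in fds:
        if set(lhs) & key:
            continue  # determinant involves the key, so this is not transitive
        dependents = [a for a in rhs if a not in key and a not in lhs]
        if not dependents:
            continue
        # Drop the dependents from the original; the determinant stays behind.
        original = [a for a in original if a not in dependents]
        new_relations.append({
            "attributes": list(lhs) + dependents,
            "primary_key": list(lhs),
        })
    return original, new_relations


employee_schema = ["Emp_ID", "Dept_ID", "Dept_Name"]
employee_fds = [
    (["Emp_ID"], ["Dept_ID"]),
    (["Dept_ID"], ["Dept_Name"]),  # assumed transitive dependency
]

employee, extracted = split_transitive_dependencies(
    employee_schema, ["Emp_ID"], employee_fds
)

print(employee)   # remaining EMPLOYEE attributes
print(extracted)  # new relation matching DEPARTMENT (Dept_ID, Dept_Name)
```

The function only applies the rule as worded on the slide and does not verify dependency preservation or handle overlapping determinants.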



Bringing a Relation to 3NF

DEPARTMENT

Dept_ID  Dept_Name
1        Acct
2        Mktg


Database Management System

 DBMS Design Analyses: 1NF, 2NF & 3NF
 Why Higher Normal Forms?
 Closure of FDs & Armstrong Axioms
 3NF and BCNF: Achievability
 Third Normal Form & BC Normal Form
 Multivalued Dependencies and Fourth Normal Form
 Overall Database Design Process
