ANSWER SHEET
5TH SEMESTER REGULAR EXAMINATION 2017-18
B.TECH
PCS5H002
DATA MINING & DATA WAREHOUSING
BRANCH: CSE
MAX MARKS: 100
Q. CODE: B307
SUBMITTED TO:-
BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ROURKELA
Prepared By: -
MR. ASWINI KUMAR PALO
(ASST. PROF)
(COMPUTER SCIENCE & ENGINEERING DEPARTMENT)
(GHANASHYAM HEMALATA INSTITUTE OF TECHNOLOGY AND MANAGEMENT, PURI)
1)
SLNO CORRECT OPTION CORRECT ANSWER Marks
a) (b) Data access tools to be used 2
b) (c) Cleaning up of data 2
c) (b) OLAP 2
d) (c) Cluster 2
e) (a) KDD process 2
f) (e) All (a), (b), (c) and (d) above 2
g) (d) Visual Studio 2
h) (e) All (a), (b), (c) and (d) above 2
i) (b) Rapid changing dimension policy 2
j) (c) Cluster 2
2)
A) How does a data warehouse differ from a database? (2 marks)
Database
Used for Online Transaction Processing (OLTP), though it can also serve other purposes such as data warehousing. It records current transactional data from users.
Tables and joins are complex because they are normalized (in an RDBMS). Normalization reduces redundant data and saves storage space.
Entity-Relationship modeling techniques are used for RDBMS database design.
Optimized for write operations.
Performance is low for analytical queries.
Data Warehouse
Used for Online Analytical Processing (OLAP). It reads historical data to support business decisions.
Tables and joins are simple because they are de-normalized. This reduces the response time for analytical queries.
Dimensional data-modeling techniques are used for data warehouse design.
Optimized for read operations.
High performance for analytical queries.
It is usually itself a database.
C) List data warehouse back-end tools and utilities and their functions. (2 marks)
Data Warehouse Back-End Tools and Utilities
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the warehouse
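As a sketch only, the back-end steps above could be wired together like this in Python; the sources, field names, and cleaning rule are all invented for illustration.

```python
# Toy ETL pipeline: extract -> clean -> transform -> load.
# In-memory lists stand in for real source systems.

def extract(*sources):
    """Pull rows from multiple heterogeneous sources into one list."""
    rows = []
    for src in sources:
        rows.extend(src)
    return rows

def clean(rows):
    """Drop rows with a missing measurement (a simple error-detection rule)."""
    return [r for r in rows if r.get("amount") is not None]

def transform(rows):
    """Convert a legacy field name ('prod') to the warehouse format."""
    return [{"product": r["prod"], "amount": r["amount"]} for r in rows]

def load(rows):
    """Sort and summarize: total amount per product (a simple view)."""
    summary = {}
    for r in sorted(rows, key=lambda r: r["product"]):
        summary[r["product"]] = summary.get(r["product"], 0) + r["amount"]
    return summary

src_a = [{"prod": "pen", "amount": 5}, {"prod": "ink", "amount": None}]
src_b = [{"prod": "pen", "amount": 3}]
warehouse = load(transform(clean(extract(src_a, src_b))))
print(warehouse)  # {'pen': 8}
```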
D) What is Business Intelligence? (2 marks)
The term Business Intelligence (BI) refers to technologies, applications and practices for the collection,
integration, analysis, and presentation of business information. The purpose of Business Intelligence is to
support better business decision making. Essentially, Business Intelligence systems are data-driven Decision
Support Systems (DSS). Business Intelligence is sometimes used interchangeably with briefing books, report
and query tools and executive information systems.
G) What is the drawback of using a separate set of samples to evaluate pruning? (2 marks)
The drawback of using a separate set of samples to evaluate pruning is that it may not be representative of
the training samples used to create the original decision tree. If the separate sets of samples are skewed, then
using them to evaluate the pruned tree would not be a good indicator of the pruned tree’s classification
accuracy. Furthermore, using a separate set of samples to evaluate pruning means there are fewer samples to
use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not
be so in data mining due to the availability of larger data sets.
H) List any two software tools associated with data mining and highlight their features. (2 marks)
Two software tools associated with data mining are:
1. Sisense: Sisense allows companies of any size and industry to mash up data sets from various sources and
build a repository of rich reports that are shared across departments.
2. RapidMiner: RapidMiner is an integrated environment dedicated to machine learning and text mining,
and one of the best-rated predictive analysis systems available on the market. The tool can be used for
business intelligence, research, training and education, and application development.
The dimensional modeling schema resembles a star and is hence called a Star Schema.
Dimensional Modeling – Fact Table
In a Dimensional Model, Fact table contains the measurements or metrics or facts of your business processes.
If your business process is Sales, then a measurement of this business process such as “monthly sales
number” is captured in the fact table. In addition to the measurements, the only other things a fact table
contains are foreign keys for the dimension tables.
Dimensional Modeling – Dimension Table
In a Dimensional Model, the context of the measurements is represented in dimension tables. You can also
think of the context of a measurement as its characteristics: the who, what, where, when, and how of a
measurement (the subject). In the Sales business process, the characteristics of the 'monthly sales number'
measurement can be a Location (where), Time (when), and Product Sold (what).
The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the
attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in
report labels, and query constraints such as where Country=’India’. The dimension attributes also contain
one or more hierarchical relationships. Before designing your data warehouse, you need to decide what this
data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers
across multiple store locations, across time and across products then your dimensions are:
Location
Time
Product
Each dimension table contains data for one dimension. In the above example, you gather all your store
location information into one single table called Location. Your store location data may span multiple
tables in your OLTP system; you need to de-normalize all that data into one single dimension table.
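To illustrate, a star schema can be mimicked with plain Python dictionaries; the table contents, keys, and column names below are made up.

```python
# Star schema sketch: one fact table with foreign keys into two
# dimension tables (Location and Product).

location_dim = {1: {"state": "Odisha", "country": "India"}}
product_dim = {10: {"name": "Laptop"}, 11: {"name": "Mouse"}}

# Each fact row holds a measurement plus foreign keys to the dimensions.
sales_fact = [
    {"loc_id": 1, "prod_id": 10, "monthly_sales": 200},
    {"loc_id": 1, "prod_id": 11, "monthly_sales": 50},
]

# Resolving the foreign keys answers "what was sold where?"
report = [
    (location_dim[f["loc_id"]]["state"],
     product_dim[f["prod_id"]]["name"],
     f["monthly_sales"])
    for f in sales_fact
]
print(report)  # [('Odisha', 'Laptop', 200), ('Odisha', 'Mouse', 50)]
```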
4a) Explain the algorithm for constructing a decision tree from training samples. (10 marks)
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node
represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
In 1980, the machine learning researcher J. Ross Quinlan developed a decision tree algorithm known as ID3
(Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a
greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down
recursive divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a splitting_attribute and either a split point or a
splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class, C, then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
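The Attribute_selection_method step in ID3 picks the attribute with the highest information gain. The sketch below computes that on a tiny invented buy_computer data set; the attribute names and values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): Shannon entropy of the class-label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, attr, label="buys"):
    """Gain(attr): Info(D) minus the weighted Info of the partitions."""
    base = entropy([r[label] for r in rows])
    split = 0.0
    for v in {r[attr] for r in rows}:
        part = [r[label] for r in rows if r[attr] == v]
        split += len(part) / len(rows) * entropy(part)
    return base - split

# Toy training tuples for the buy_computer concept (made-up data).
data = [
    {"age": "youth",  "student": "no",  "buys": "no"},
    {"age": "youth",  "student": "yes", "buys": "yes"},
    {"age": "senior", "student": "no",  "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
]
best = max(["age", "student"], key=lambda a: info_gain(data, a))
print(best)  # 'student' perfectly separates the classes here
```

In a full ID3 run this selection would happen once per recursive call, on the current partition Dj.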
4b) Describe the k-Mean Clustering algorithm. (5 marks)
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data
without defined categories or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works iteratively to assign each data point
to one of K groups based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both
attempt to find the centers of natural clusters in the data.
It assumes that the object attributes form a vector space.
An algorithm for partitioning (or clustering) N data points into K disjoint subsets Sj, each containing Nj
data points, so as to minimize the sum-of-squares criterion

    J = sum_{j=1..K} sum_{x_n in Sj} || x_n - u_j ||^2

where x_n is a vector representing the nth data point and u_j is the geometric centroid of the data
points in Sj.
Simply speaking, k-means clustering is an algorithm to classify or group objects, based on their
attributes/features, into K groups, where K is a positive integer.
The grouping is done by minimizing the sum of squared distances between the data points and the
corresponding cluster centroid.
How does the K-Means clustering algorithm work?
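A minimal sketch of the algorithm's assign-and-update loop (Lloyd's algorithm), shown here on one-dimensional points with made-up data; a real implementation would use multi-dimensional vectors and a convergence test instead of a fixed iteration count.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """K-means on 1-D points: repeat assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)  # initial centroids: random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid wins
            j = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        for j, c in enumerate(clusters):  # update step: recompute centroids
            if c:
                centroids[j] = sum(c) / len(c)
    return sorted(centroids)

# Two obvious groups, one around 1 and one around 10 (illustrative data).
pts = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
centroids = kmeans(pts, 2)
print(centroids)  # the centroids settle near 1.0 and 10.0
```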
7a) What is the role of data mining in spatial database? (10 marks)
A spatial database is a database that is optimized for storing and querying data that represents objects
defined in a geometric space. Most spatial databases allow representing simple geometric objects such as
points, lines and polygons. Some spatial databases handle more complex structures such as 3D objects,
topological coverages, linear networks, and TINs. While typical databases have developed to manage various
numeric and character types of data, such databases require additional functionality to process spatial data
types efficiently, and developers have often added geometry or feature data types.
A geodatabase (also geographical database and geospatial database) is a database of geographic data, such as
countries, administrative divisions, cities, and related information. Such databases can be useful for websites
that wish to identify the locations of their visitors for customization purposes.
Features of spatial databases
Database systems use indexes to quickly look up values and the way that most databases index data is
not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database
operations.
In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety
of spatial operations.
Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
Spatial Functions: Modify existing features to create new ones, for example by providing a buffer
around them, intersecting features, etc.
Spatial Predicates: Allows true/false queries about spatial relationships between geometries.
Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes)
which define the shape.
Observer Functions: Queries which return specific information about a feature such as the location of
the center of a circle.
Some databases support only simplified or modified sets of these operations, especially in cases of
NoSQL systems like MongoDB and CouchDB.
Spatial index
Spatial indices are used by spatial databases (databases which store information related to objects in
space) to optimize spatial queries. Conventional index types do not efficiently handle spatial queries
such as how far apart two points are, or whether points fall within a spatial area of interest. Common
spatial index methods include the grid (spatial index), R-tree, and quadtree.
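A grid spatial index can be sketched in a few lines of Python: points are bucketed into fixed-size cells, and a range query only inspects the cells that overlap the query region. The cell size and sample points below are arbitrary.

```python
from collections import defaultdict

CELL = 10.0  # cell size (arbitrary choice)

def cell_of(x, y):
    """Map a point to the integer coordinates of its grid cell."""
    return (int(x // CELL), int(y // CELL))

def build_index(points):
    """Bucket every point into its grid cell."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(*p)].append(p)
    return grid

def query(grid, x, y, r):
    """Return points within distance r of (x, y), checking only the
    grid cells that overlap the query circle's bounding box."""
    found = []
    cx0, cy0 = cell_of(x - r, y - r)
    cx1, cy1 = cell_of(x + r, y + r)
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for (px, py) in grid.get((cx, cy), []):
                if (px - x) ** 2 + (py - y) ** 2 <= r * r:
                    found.append((px, py))
    return found

pts = [(1, 1), (2, 3), (50, 50)]
idx = build_index(pts)
near = query(idx, 0, 0, 5)
print(near)  # only the two nearby points; (50, 50) is never inspected
```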
Spatial database systems
SpatiaLite extends Sqlite with spatial datatypes, functions, and utilities.
IBM DB2 Spatial Extender can spatially enable any edition of DB2, including the free DB2 Express-C,
with support for spatial types.
7b) Explain data warehouse metadata in detail. (5 marks)
Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. For example, the index of a book serves as metadata for the contents of the book. In other words,
we can say that metadata is the summarized data that leads us to detailed data. In terms of data warehouse,
we can define metadata as follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support system to locate the contents of
a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data
warehouse. Along with this, additional metadata is created for time-stamping any extracted data and
recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes, data
types and allowed values. Technical metadata also includes structural information such as primary
and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data migrated
and transformation applied on it.
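As an illustration, the three categories might be recorded for a single warehouse table as follows; every table, column, and source name here is invented.

```python
# Toy metadata record for one warehouse table, grouped by the three
# categories described above.

metadata = {
    "table": "sales_fact",
    "business": {            # business metadata
        "owner": "Sales Dept",
        "definition": "Monthly sales figures per store and product",
    },
    "technical": {           # technical metadata
        "columns": {"loc_id": "INT", "prod_id": "INT", "amount": "DECIMAL"},
        "primary_key": ["loc_id", "prod_id"],
    },
    "operational": {         # operational metadata
        "currency": "active",                    # active / archived / purged
        "lineage": ["oltp.orders", "etl.cleanse", "etl.load"],
    },
}
categories = sorted(k for k in metadata if k != "table")
print(categories)  # ['business', 'operational', 'technical']
```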
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is
different from the warehouse data, yet it plays an important role. The various roles of metadata are
explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps in decision support system for mapping of data when data is transformed from
operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized data.
Metadata is used for query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
8b) How is web usage mining different from web structure mining and web content mining? (5 marks)
Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure
mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to
extract/mine useful information or knowledge from web page contents.
Web Content Mining
Web content mining targets the knowledge discovery, in which the main objects are the traditional
collections of multimedia documents such as images, video, and audio, which are embedded in or linked to
the web pages.
Web Structure Mining
Web Structure Mining focuses on analysis of the link structure of the web, and one of its purposes is to
identify the more preferable documents. The different objects are linked in some way. The intuition is that a
hyperlink from document A to document B implies that the author of document A thinks document B
contains worthwhile information. Web structure mining helps in discovering similarities between web sites,
discovering important sites for a particular topic or discipline, and discovering web communities.
Web Usage Mining
Web Usage Mining focuses on techniques that can predict the behavior of users while they are interacting
with the WWW. Web usage mining discovers user navigation patterns from web data, trying to extract
useful information from the secondary data derived from users' interactions while surfing the Web. It
collects data from Web log records to discover user access patterns of web pages. Several research
projects and commercial tools analyze those patterns for different purposes. The resulting knowledge can
be used in personalization, system improvement, site modification, business intelligence, and usage
characterization.
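At its core, web usage mining starts from server access logs. The sketch below counts page accesses from a few simplified, made-up log lines; real logs would need proper parsing (e.g., the Combined Log Format).

```python
from collections import Counter

# Simplified, invented log lines: client IP, method, requested page.
logs = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /products.html",
    "10.0.0.1 GET /products.html",
    "10.0.0.1 GET /cart.html",
]

# Access frequency per page: the simplest "usage pattern".
pages = Counter(line.split()[2] for line in logs)
print(pages.most_common(1))  # the most visited page and its count
```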
9a) Write Short Note On:
i. Issues regarding classification and prediction (5 marks)
1. Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency, and
scalability of the classification or prediction process.
Data Cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most probable value
based on statistics.) Although most classification algorithms have some mechanisms for handling
noisy or missing data, this step can help reduce confusion during learning.
Relevance Analysis: Many of the attributes in the data may be irrelevant to the classification or
prediction task. For example, data recording the day of the week on which a bank loan application
was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may
be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any
irrelevant or redundant attributes from the learning process. In machine learning, this step is known
as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the
learning step.
Ideally, the time spent on relevance analysis, when added to the time spent on learning from the
resulting “reduced” feature subset should be less than the time that would have been spent on
learning from the original set of features. Hence, such analysis can help improve classification
efficiency and scalability.
Data Transformation: The data can be generalized to higher-level concepts. Concept hierarchies
may be used for this purpose. This is particularly useful for continuous-valued attributes. For
example, numeric values for the attribute income may be generalized to discrete ranges such as low,
medium, and high. Similarly, nominal-valued attributes like street can be generalized to higher-level
concepts, like city. Since generalization compresses the original training data, fewer input/output
operations may be involved during learning.
2. Comparing Classification Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:
Predictive Accuracy: This refers to the ability of the model to correctly predict the class label of new
or previously unseen data.
Speed: This refers to the computation costs involved in generating and using the model.
Robustness: This is the ability of the model to make correct predictions given noisy data or data with
missing values.
Scalability: This refers to the ability to construct the model efficiently given large amount of data.
Interpretability: This refers to the level of understanding and insight that is provided by the model.
ii. Outlier Analysis (5 marks)
A database may contain data objects that do not comply with the general behavior or model of the data. Such
data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more interesting than
the more regularly occurring ones.
The analysis of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the
data, or using distance measures where objects that are a substantial distance from any other cluster
are considered outliers.
Rather than using statistical or distance measures, deviation-based methods identify outliers by
examining differences in the main characteristics of objects in a group.
Outliers can be caused by measurement or execution error.
Outliers may be the result of inherent data variability.
Many data mining algorithms try to minimize the influence of outliers or eliminate them all together.
This, however, could result in the loss of important hidden information because one person’s noise
could be another person’s signal.
Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
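A minimal statistical outlier test along these lines flags values far from the mean; the two-standard-deviation threshold and the data are chosen only for illustration (a distribution assumption is implicit).

```python
import statistics

def outliers(data, z=2.0):
    """Flag values more than z population standard deviations from the mean."""
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) > z * sd]

data = [10, 11, 9, 10, 12, 10, 11, 100]
print(outliers(data))  # only the grossly different value is flagged
```

Note that the extreme value inflates the mean and standard deviation themselves, which is one reason distance-based and deviation-based methods are used as alternatives.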
9b) Discuss the social impacts and various trends in data mining. (5 marks)
Data mining is a young discipline with wide and diverse applications. There is still a nontrivial gap between
the general principles of data mining and domain-specific, effective data mining tools for particular
applications. Some application domains:
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
Social Impacts: Threat to Privacy and Data Security?
Is data mining a threat to privacy and data security?
"Big Brother", "Big Banker", and "Big Business" are carefully watching you.
Profiling information is collected every time you use a credit card, debit card, supermarket loyalty card,
or frequent flyer card, or apply for any of the above; every time you surf the Web, rent a video, or fill out
a contest entry form; and every time you pay for prescription drugs or present your medical care number
when visiting the doctor.
While the collection of personal data may be beneficial for companies and consumers, there is also
potential for misuse of medical records, employee evaluations, etc.