
TEACHER CODE: - T121722101

ANSWER SHEET
5TH SEMESTER REGULAR EXAMINATION 2017-18
B.TECH
PCS5H002
DATA MINING & DATA WAREHOUSING
BRANCH: CSE
MAX MARKS: 100
Q. CODE: B307

SUBMITTED TO:-
BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ROURKELA

Prepared By: -
MR. ASWINI KUMAR PALO
(ASST. PROF)
(COMPUTER SCIENCE & ENGINEERING DEPARTMENT)
(GHANASHYAM HEMALATA INSTITUTE OF TECHNOLOGY AND MANAGEMENT, PURI)
1)
SLNO CORRECT OPTION CORRECT ANSWER Marks
a) (b) Data access tools to be used 2
b) (c) Cleaning up of data 2
c) (b) OLAP 2
d) (c) Cluster 2
e) (a) KDD process 2
f) (e) All (a), (b), (c) and (d) above 2
g) (d) Visual Studio 2
h) (e) All (a), (b), (c) and (d) above 2
i) (b) Rapid changing dimension policy 2
j) (c) Cluster 2
2)
A) How does a data warehouse differ from a database? (2 marks)
Database
 Used for Online Transaction Processing (OLTP), but can also be used for other purposes such as Data
Warehousing. It records the data generated by users' day-to-day transactions.
 The tables and joins are complex since they are normalized (for an RDBMS). This is done to reduce
redundant data and to save storage space.
 Entity-Relationship (ER) modeling techniques are used for RDBMS database design.
 Optimized for write operation.
 Performance is low for analysis queries.
Data Warehouse
 Used for Online Analytical Processing (OLAP). This reads the historical data for the Users for
business decisions.
 The Tables and joins are simple since they are de-normalized. This is done to reduce the response
time for analytical queries.
 Data-modeling techniques are used for the Data Warehouse design.
 Optimized for read operations.
 High performance for analytical queries.
 Is usually a Database.

B) Distinguish the features of OLAP & OLTP. (2 marks)


 OLTP stands for On-line Transaction processing while OLAP stands for On-line Analytical
Processing.
 OLTP provides data to the data warehouse, while OLAP analyzes this data.
 OLTP deals with operational data, while OLAP deals with historical data.
 In OLTP, queries are simple, while in OLAP, queries are relatively complex.
 The processing speed of OLTP is very fast, while in OLAP the processing speed depends upon the amount of data.
 The database design of OLTP is highly normalized with many tables, while in OLAP the database design is de-normalized with few tables.
 In OLTP, database transactions are short, while in OLAP, database transactions are long.
 In OLTP, the volume of transactions is high, while in OLAP the volume of transactions is low.
 In OLTP, transaction recovery is necessary, while in OLAP transaction recovery is not necessary.
 OLTP focuses on updating data, while OLAP focuses on reporting and retrieval of data.

C) List data warehouse backend tools and the utilities and their functions. (2 marks)
Data Warehouse Back-End Tools and Utilities
 Data extraction: get data from multiple, heterogeneous, and external sources
 Data cleaning: detect errors in the data and rectify them when possible
 Data transformation: convert data from legacy or host format to warehouse format
 Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
 Refresh : propagate the updates from the data sources to the warehouse
D) What is Business Intelligence? (2 marks)
The term Business Intelligence (BI) refers to technologies, applications and practices for the collection,
integration, analysis, and presentation of business information. The purpose of Business Intelligence is to
support better business decision making. Essentially, Business Intelligence systems are data-driven Decision
Support Systems (DSS). Business Intelligence is sometimes used interchangeably with briefing books, report
and query tools and executive information systems.

E) What do you mean by neural clustering? (2 marks)


Neural clustering refers to a pattern recognition methodology for machine learning. The resulting model from
neural clustering is often called an artificial neural network (ANN) or a neural network. Neural networks have
been used in many business applications for pattern recognition, forecasting, prediction, and classification.
Neural network clustering is a key component of any data mining tool kit.

F) Mention the utility of knowledge base. (2 marks)


In relation to Information technology (IT), a knowledge base is a machine-readable resource for the
dissemination of information, generally online or with the capacity to be put online. An integral component
of knowledge management systems, a knowledge base is used to optimize information collection,
organization, and retrieval for an organization, or for the general public. A well-organized knowledge base
can save an enterprise money by decreasing the amount of employee time spent trying to find information
about - among myriad possibilities - tax laws or company policies and procedures.

G) What is the drawback of using a separate set of samples to evaluate pruning? (2 marks)
The drawback of using a separate set of samples to evaluate pruning is that it may not be representative of
the training samples used to create the original decision tree. If the separate sets of samples are skewed, then
using them to evaluate the pruned tree would not be a good indicator of the pruned tree’s classification
accuracy. Furthermore, using a separate set of samples to evaluate pruning means there are fewer samples to
use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not
be so in data mining due to the availability of larger data sets.

H) List any two software tools associated with data mining and highlight their features. (2 marks)
Two Software Tools Associated With Data Mining Are:-
1. Sisense: - Sisense allows companies of any size and industry to mash up data sets from various sources and
build a repository of rich reports that are shared across departments.
2. RapidMiner: - RapidMiner is an integrated environment dedicated to machine learning and text mining,
and one of the best rated predictive analysis systems available on the market. The tool can be used for
business intelligence, research, training and education, and application development.

I) What are the steps involved in KDD processes? (2 marks)


Here is the list of steps involved in the knowledge discovery process −
 Data Cleaning − In this step, the noise and inconsistent data is removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
 Data Transformation − In this step, data is transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.

J) Define Metadata. (2 marks)


Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and
business intelligence system:
Reports, Cubes, Tables (Records, Segments, Entities, etc.), Columns (Fields, Attributes, Data Elements, etc.)
3a) Describe the architecture and implementation of data warehouse? (10 marks)
Generally, a data warehouse adopts a three-tier architecture. The following are the three tiers of the data
warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational
database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end
tools and utilities perform the extract, clean, load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of the
following ways.
 By Relational OLAP (ROLAP), which is an extended relational database management system. The
ROLAP maps the operations on multidimensional data to standard relational operations.
 By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data
and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools,
analysis tools and data mining tools.

The following diagram depicts the three-tier architecture of data warehouse –

DATA WAREHOUSE IMPLEMENTATION


Implementation steps
1. Requirements analysis and capacity planning: As in other projects, the first step in data warehousing
involves defining enterprise needs, defining architecture, carrying out capacity planning and selecting the
hardware and software tools. This step will involve consulting senior management as well as the various
stakeholders.
2. Hardware integration: Once the hardware and software have been selected, they need to be put together
by integrating the servers, the storage devices and the client software tools.
3. Modeling: Modeling is a major step that involves designing the warehouse schema and views. This may
involve using a modeling tool if the data warehouse is complex.
4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is required. This
involves designing the physical data warehouse organization, data placement, data partitioning, deciding on
access methods and indexing.
5. Sources: The data for the data warehouse is likely to come from a number of data sources. This step
involves identifying and connecting the sources using gateways, ODBC drivers or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL process. The step of designing and
implementing the ETL process may involve identifying a suitable ETL tool vendor and purchasing and
implementing the tool. This may include customizing the tool to suit the needs of the enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be
required, perhaps using a staging area. Once everything is working satisfactorily, the ETL tools may be used
in populating the warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be useful there must be end-user applications. This step
involves designing and implementing applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the end-user
applications tested, the warehouse system and the applications may be rolled out for the user community to
use.
3b) Briefly explain the basic dimensional modeling techniques. (5 marks)
Dimensional Modeling
Many data warehouse designers use Dimensional modeling design concepts to build data warehouses.
Dimensional model is the underlying data model used by many of the commercial OLAP products available
today in the market. In this dimensional model, we store all data in just two types of tables. They are Fact
Tables and Dimension Tables. The fact table contains the main facts or measures, and it links to many
dimension tables through foreign keys. The resulting schema is called a star schema because it looks like a star.
Because these multiple dimension tables all connect to a single fact table, this design concept is named
dimensional modeling.

Dimensional Modeling Schema, resembles a Star and hence called Star Schema
Dimensional Modeling – Fact Table
In a Dimensional Model, Fact table contains the measurements or metrics or facts of your business processes.
If your business process is Sales, then a measurement of this business process such as “monthly sales
number” is captured in the fact table. In addition to the measurements, the only other things a fact table
contains are foreign keys for the dimension tables.
Dimensional Modeling – Dimension Table
In a Dimensional Model, the context of the measurements is represented in dimension tables. You can also
think of the context of a measurement as its characteristics, such as the who, what, where, when and how of a
measurement (subject). For the Sales business process, the characteristics of the ‘monthly sales number’
measurement can be a Location (where), Time (when) and Product Sold (what).
The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the
attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in
report labels, and query constraints such as where Country=’India’. The dimension attributes also contain
one or more hierarchical relationships. Before designing your data warehouse, you need to decide what this
data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers
across multiple store locations, across time and across products then your dimensions are:
 Location
 Time
 Product
Each dimension table contains data for one dimension. In the above example, you gather all your store
location information and put it into one single table called Location. Your store location data may span
multiple tables in your OLTP system; you need to de-normalize all that data into one single dimension table.
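As an illustration of the star-schema idea, the following minimal pandas sketch joins a hypothetical fact table to a hypothetical dimension table through a foreign key and answers a typical query; the table names, keys and values are invented for the example:

import pandas as pd

# Hypothetical dimension table: one row per store location.
dim_location = pd.DataFrame({
    "location_key": [1, 2],
    "state": ["Odisha", "Delhi"],
    "country": ["India", "India"],
})

# Hypothetical fact table: the measure plus foreign keys into the dimensions.
fact_sales = pd.DataFrame({
    "location_key": [1, 1, 2],
    "time_key": [201801, 201802, 201801],
    "monthly_sales": [120000, 95000, 143000],
})

# A typical star-schema query: join fact to dimension, constrain on a
# dimension attribute (country = 'India') and aggregate the measure by state.
report = (fact_sales.merge(dim_location, on="location_key")
          .query("country == 'India'")
          .groupby("state")["monthly_sales"].sum())
print(report)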

4a) Explain the algorithm for constructing a decision tree from training samples. (10 marks)
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node
represents a class.

The benefits of having a decision tree are as follows −
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan in 1980 developed a decision tree algorithm known as ID3
(Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a
greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down
recursive divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split
point or a splitting subset.
Output:
A decision tree.
Method:
create a node N;
if tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove splitting_attribute
for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
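The pseudocode above is the generic greedy induction procedure. As a hedged, concrete illustration, the sketch below uses scikit-learn (an assumed external library) with criterion="entropy", so that splits are chosen by information gain as in ID3/C4.5; the tiny buys_computer-style training set is hypothetical:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples in the style of the buys_computer example.
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle"],
    "income":  ["high",  "high",  "high",   "medium", "low",    "low"],
    "student": ["no",    "no",    "no",     "no",     "yes",    "yes"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Encode the categorical attributes as indicator columns for the learner.
X = pd.get_dummies(data.drop(columns="buys_computer"))
y = data["buys_computer"]

# criterion="entropy" selects splitting attributes by information gain.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))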
4b) Describe the k-Mean Clustering algorithm. (5 marks)
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data
without defined categories or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works iteratively to assign each data point
to one of K groups based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
 The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, where k < n.
 It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both
attempt to find the centers of natural clusters in the data.
 It assumes that the object attributes form a vector space.
 An algorithm for partitioning (or clustering) N data points into K disjoint subsets Sj containing the data
points, so as to minimize the sum-of-squares criterion

J = Σ_{j=1}^{K} Σ_{n ∈ Sj} || x_n − μ_j ||²

where x_n is a vector representing the nth data point and μ_j is the geometric centroid of the data
points in Sj.
 Simply speaking, k-means clustering is an algorithm to classify or group objects based on
attributes/features into K groups.
 K is positive integer number.
 The grouping is done by minimizing the sum of squares of distances between data and the
corresponding cluster centroid.
How the K-Mean Clustering algorithm works?

Step 1: Begin with a decision on the value of k = number of clusters .


Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples
randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each
assignment, recompute the centroid of the gaining cluster.
Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a
sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update
the centroid of the cluster gaining the new sample and the cluster losing the sample.
Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes
no new assignments.
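The steps above can be sketched directly in NumPy (an assumed library). This is only a minimal illustration: the initial centroids are simply the first k points, and the 2-D data are hypothetical:

import numpy as np

def k_means(points, k, max_iter=100):
    # Steps 1-2: decide k and form an initial partition (first k points as centroids).
    centroids = points[:k].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 3: assign every sample to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # Step 4: convergence, a pass caused no new assignments.
        labels = new_labels
        # Recompute each centroid as the mean of the samples assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Hypothetical 2-D data with two obvious groups.
pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(k_means(pts, k=2))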
5a) What do you mean by data mining functionality? Explain with suitable examples. (10 marks)
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data
mining tasks can be categorized into two categories.
Descriptive Task:
These tasks present the general properties of data stored in database. The descriptive tasks are used to find
out patterns in data i.e. cluster, correlation, trends and anomalies etc.
Predictive Tasks:
Predictive data mining tasks predict the value of one attribute on the bases of values of other attributes,
which is known as target or dependent variable and the attributes used for making the prediction are known
as independent variables.
Data mining functionalities are described as follows:-
1) Prediction: A predictive model determines a future outcome rather than present behavior. The
predicted attribute of a predictive model can be numeric or categorical. It involves finding the set
of attributes relevant to the attribute of interest and predicting the value distribution based on
the set of data similar to the selected object(s). For example, one may predict the kind of disease
based on the symptoms of a patient.
2) Classification: Classification builds models from data with predefined classes; the model is then
used to classify new instances whose class is not known. The instances used to create the
model are known as training data. A decision tree or a set of classification rules is produced by this
kind of classification mechanism and can be applied to identify future data. For example, one
may classify an employee's potential salary on the basis of the salary classification of similar employees
in the company.
3) Clustering: Clustering is the process of partitioning a set of objects or data into groups called
clusters, such that the objects in a cluster are more similar (in some sense or another) to each other than
to those in other clusters. Clustering is used in many fields, including machine learning, pattern recognition,
bioinformatics, image analysis and information retrieval.
4) Mining Frequent Patterns, Associations and Correlations: A frequent pattern is a pattern (a set of
items, subsequences, substructures, etc.) that appears frequently in data. A frequent itemset is a set of
items that occur frequently together in a transaction data set, for example, a set of items such as
table and chair. A frequent subsequence means, for example, first buying a computer system, then a
UPS, and thereafter a printer; such a pattern that appears frequently in a shopping-history database
is called a frequent sequential pattern. A substructure refers to particular structural forms such as
subgraphs or subtrees; if a substructure appears frequently, it is called a frequent structural
pattern. Discovering such frequent patterns plays an important role in mining associations,
correlations, clustering and other data mining tasks (a small support-counting sketch follows this list).
5) Outlier Analysis: An outlier is an object in a database which is significantly different from the
remaining data. “An outlier is an observation which deviates so much from the other observations as to
arouse suspicions that it was generated by a different mechanism”. Deviants, abnormalities,
discordants and anomalies are other terms used for outliers in the data mining and statistics literature.
An outlier can be diagnosed with the help of statistical tests that assume a probability model for the data.
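As referenced in point 4, here is a small brute-force support-counting sketch for frequent itemsets over a hypothetical transaction data set (plain support counting over 1- and 2-itemsets, not the full Apriori algorithm):

from itertools import combinations
from collections import Counter

# Hypothetical transaction data set (market baskets).
transactions = [
    {"table", "chair"},
    {"table", "chair", "lamp"},
    {"table", "lamp"},
    {"chair", "lamp"},
    {"table", "chair"},
]
min_support = 3  # an itemset is "frequent" if it occurs in at least 3 transactions

# Count the support of every 1-itemset and 2-itemset.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
print(frequent)  # ('chair', 'table') appears together in 3 transactions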

5b) Explain OLAP Operation In Multidimensional Data Model. (5 marks)


OLAP servers are based on a multidimensional view of data, so we will discuss OLAP operations on
multidimensional data.
Here is the list of OLAP operations: roll-up, drill-down, slice and dice, and pivot (rotate).
Roll-up : Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension, By dimension reduction
Drill-down: Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
 By stepping down a concept hierarchy for a dimension, By introducing a new dimension.
Slice: The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Dice : Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Pivot : The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data.
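As an illustration, the following pandas sketch mimics roll-up, slice and pivot on a tiny hypothetical sales cube; the dimensions (city, quarter, item) and the measure (sales) are invented for the example:

import pandas as pd

# Hypothetical cube cells: (city, quarter, item) -> sales.
cube = pd.DataFrame({
    "city":    ["Chicago", "Chicago", "Toronto", "Toronto"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "phone", "laptop"],
    "sales":   [400, 350, 300, 500],
})

# Roll-up: aggregate the quarter dimension away (climb the time hierarchy).
rollup = cube.groupby(["city", "item"])["sales"].sum()

# Slice: select one value on a single dimension to obtain a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Pivot (rotate): swap the axes used for rows and columns in the view.
pivoted = cube.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")
print(rollup, slice_q1, pivoted, sep="\n\n")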
6a) Explain the classification of major clustering methods. (10 marks)
Clustering Methods
The clustering methods can be classified into following categories:
K-means, Partitioning Method, Hierarchical Method, Density-based Method, Grid-based Method,
Model-based Method and Constraint-based Method.
1. K-means :-
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments.
2. Partitioning Method :-
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data.
Each partition will represent a cluster, and k ≤ n. It means that it will classify the data into k groups, which
satisfy the following requirements:
 Each group contains at least one object.
 Each object must belong to exactly one group.
Typical methods:
K-means, k-medoids, CLARANS
3. Hierarchical Methods :-
This method creates the hierarchical decomposition of the given set of data objects.:
 Agglomerative Approach :- This approach is also known as bottom-up approach. In this we start with
each object forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the termination
condition holds (a small SciPy sketch of this bottom-up approach appears at the end of this answer).
 Divisive Approach :- This approach is also known as top-down approach. In this we start with all of
the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This
is done until each object is in a cluster of its own or the termination condition holds.
4 Density-based Method
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
5. Grid-based Method
Using multi-resolution grid data structure
Advantage
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the quantized space.
Typical methods: STING, WaveCluster, CLIQUE
6 Model-based methods :-
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption: Data are generated by a mixture of underlying probability distribution
In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model.
This method also serves as a way of automatically determining the number of clusters based on standard
statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
7 Constraint-based Method :-
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
Need user feedback: Users know their applications the best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles & desired
clusters
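To complement the k-means example earlier, here is a minimal sketch of the agglomerative (bottom-up) hierarchical approach using SciPy (an assumed library); the 2-D points are hypothetical:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points forming two natural groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])

# Agglomerative clustering: start with each point as its own group and
# repeatedly merge the closest groups (average linkage here).
merges = linkage(points, method="average")

# Cut the resulting hierarchy into 2 flat clusters.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]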
6b) Explain briefly about various steps of Data Mining process. (5 marks)
Steps of Data Mining
There are various steps involved in mining data, as described below.
 Data Integration: First of all the data are collected and integrated from all the different sources.
 Data Selection: We may not need all the data we have collected in the first step, so in this step we select
only those data which we think are useful for data mining.
 Data Cleaning: The data we have collected are not clean and may contain errors, missing values,
noisy or inconsistent data. So we need to apply different techniques to get rid of such anomalies.
 Data Transformation: Even after cleaning, the data are not ready for mining, as we need to transform
them into forms appropriate for mining. The techniques used to accomplish this are smoothing,
aggregation, normalization, etc. (a small normalization sketch follows this list).
 Data Mining: Now we are ready to apply data mining techniques on the data to discover the
interesting patterns. Techniques like clustering and association analysis are among the many different
techniques used for data mining.
 Pattern Evaluation and Knowledge Presentation: This step involves visualization, transformation,
removing redundant patterns, etc., from the patterns we generated.
 Decisions / Use of Discovered Knowledge: This step helps the user make use of the knowledge acquired
to take better decisions.
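As a small illustration of the Data Transformation step referenced above, the following sketch applies min-max normalization to a hypothetical numeric attribute:

# Min-max normalization: rescale a numeric attribute into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = (old_max - old_min) or 1.0  # avoid division by zero for a constant attribute
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 35000, 58000, 98000]  # hypothetical income values
print(min_max_normalize(incomes))       # rescaled into [0, 1]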

7a) What is the role of data mining in spatial database? (10 marks)
A spatial database is a database that is optimized for storing and querying data that represents objects
defined in a geometric space. Most spatial databases allow representing simple geometric objects such as
points, lines and polygons. Some spatial databases handle more complex structures such as 3D objects,
topological coverages, linear networks, and TINs. While typical databases have developed to manage various
numeric and character types of data, such databases require additional functionality to process spatial data
types efficiently, and developers have often added geometry or feature data types.
A geodatabase (also geographical database and geospatial database) is a database of geographic data, such as
countries, administrative divisions, cities, and related information. Such databases can be useful for websites
that wish to identify the locations of their visitors for customization purposes.
Features of spatial databases
 Database systems use indexes to quickly look up values and the way that most databases index data is
not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database
operations.
In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety
of spatial operations.
 Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
 Spatial Functions: Modify existing features to create new ones, for example by providing a buffer
around them, intersecting features, etc.
 Spatial Predicates: Allows true/false queries about spatial relationships between geometries.
 Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes)
which define the shape.
 Observer Functions: Queries which return specific information about a feature such as the location of
the center of a circle
 Some databases support only simplified or modified sets of these operations, especially in cases of
NoSQL systems like MongoDB and CouchDB.
Spatial index
 Spatial indices are used by spatial databases (databases which store information related to objects in
space) to optimize spatial queries. Conventional index types do not efficiently handle spatial queries
such as how far two points differ, or whether points fall within a spatial area of interest. Common
spatial index methods include the grid (spatial index), R-tree, quadtree and k-d tree.
Spatial database systems
 SpatiaLite extends Sqlite with spatial datatypes, functions, and utilities.
 IBM DB2 Spatial Extender can spatially-enable any edition of DB2, including the free DB2 Express-
C, with support for spatial types
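As an illustration of spatial measurements, functions and predicates, here is a minimal sketch using the Shapely library in Python (an assumption made for the example; a spatial database would typically expose similar operations as SQL functions). The geometries are hypothetical:

from shapely.geometry import Point, Polygon

# Hypothetical geometries: a point of interest and a rectangular boundary.
poi = Point(2.0, 3.0)
boundary = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

# Spatial measurement: the distance between two geometries.
print(poi.distance(Point(5.0, 7.0)))  # 5.0

# Spatial function: build a buffer zone around the point of interest.
zone = poi.buffer(1.5)

# Spatial predicates: true/false questions about spatial relationships.
print(poi.within(boundary), zone.intersects(boundary))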
7b) Detail on Data Warehouse MetaData. (5 marks)
Metadata is simply defined as data about data. The data that is used to represent other data is known as
metadata. For example, the index of a book serves as a metadata for the contents in the book. In other words,
we can say that metadata is the summarized data that leads us to detailed data. In terms of data warehouse,
we can define metadata as follows.
 Metadata is the road-map to a data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to locate the contents of
a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a given data
warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted
data and for recording the source of the extracted data.
Categories of Metadata
Metadata can be broadly categorized into three categories −
 Business Metadata − It has the data ownership information, business definition, and changing
policies.
 Technical Metadata − It includes database system names, table and column names and sizes, data
types and allowed values. Technical metadata also includes structural information such as primary
and foreign key attributes and indices.
 Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data migrated
and transformation applied on it.
Role of Metadata
 Metadata has a very important role in a data warehouse. Its role is different from that of the warehouse
data, yet it is equally important. The various roles of metadata are explained below.
 Metadata acts as a directory.
 This directory helps the decision support system to locate the contents of the data warehouse.
 Metadata helps the decision support system map data when it is transformed from the operational
environment to the data warehouse environment.
 Metadata helps in summarization between current detailed data and highly summarized data.
 Metadata also helps in summarization between lightly detailed data and highly summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.

8a) Explain in details about Text Mining application. (10 marks)


Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of
deriving high-quality information from text. High-quality information is typically derived through the
devising of patterns and trends through means such as statistical pattern learning. Text mining usually
involves the process of structuring the input text (usually parsing, along with the addition of some derived
linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns
within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text
mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining
tasks include text categorization, text clustering, concept/entity extraction, production of granular
taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning
relations between named entities).
Text mining can be used to make the large quantities of unstructured data accessible and useful, thereby
generating not only value, but delivering ROI from unstructured data management as we’ve seen with
applications of text mining for Risk Management Software and Cybercrime applications.
Some Application Of Text Mining
 Fraud detection through claims investigation
Text analytics is a tremendously effective technology in any domain where the majority of
information is collected as text. Insurance companies are taking advantage of text mining technologies
by combining the results of text analysis with structured data to prevent frauds and swiftly process
claims.
 Business intelligence
This process is used by large companies to uphold and support decision making. Here, text mining
really makes the difference, enabling the analyst to quickly jump at the answer even when analyzing
petabytes of internal and open source data. Applications such as the Cogito Intelligence Platform
(link to CIP) are able to monitor thousands of sources and analyze large data volumes to extract from
them only the relevant content.
 Spam filtering
E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a dark side:
spam. Today, spam is a major issue for internet service providers, increasing their costs for service
management and hardware/software updating; for users, spam is an entry point for viruses and
impacts productivity. Text mining techniques can be implemented to improve the effectiveness of
statistical-based filtering methods.
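As a hedged illustration of the spam-filtering application above, the following sketch trains a simple statistical filter with scikit-learn (an assumed library); the labelled messages are hypothetical:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled messages: 1 = spam, 0 = legitimate.
messages = ["win a free prize now", "cheap meds free offer",
            "meeting moved to 3pm", "please review the attached report"]
labels = [1, 1, 0, 0]

# Structure the input text as term counts, then fit a statistical model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

# Classify new, unseen mail.
new_mail = vectorizer.transform(["free prize offer inside"])
print(model.predict(new_mail))  # expected to be flagged as spam ([1])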

8b) How is web usage mining different from web structure mining and web content mining? (5 marks)
Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure
mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to
extract/mine useful information or knowledge from web page contents.
Web Content Mining
Web content mining targets the knowledge discovery, in which the main objects are the traditional
collections of multimedia documents such as images, video, and audio, which are embedded in or linked to
the web pages.
Web Structure Mining
Web Structure Mining focuses on analysis of the link structure of the web and one of its purposes is to
identify more preferable documents. The different objects are linked in some way. The intuition is that a
hyperlink from document A to document B implies that the author of document A thinks document B
contains worthwhile information. Web structure mining helps in discovering similarities between web sites
or discovering important sites for a particular topic or discipline or in discovering web communities.
Web Usage Mining
Web Usage Mining focuses on techniques that could predict the behavior of users while they are interacting
with the WWW. Web usage mining discovers user navigation patterns from web data; it tries to discover
useful information from the secondary data derived from the interactions of users while surfing the
Web. Web usage mining collects data from Web log records to discover user access patterns of web
pages. There are several available research projects and commercial tools that analyze those patterns for
different purposes. The insight knowledge could be utilized in personalization, system improvement, site
modification, business intelligence and usage characterization.
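For illustration, here is a minimal sketch of discovering access and navigation patterns from hypothetical web-server log records (the simplified "user page" log format is invented for the example):

from collections import Counter

# Hypothetical access-log records: "user_id page_visited", in time order.
log_lines = [
    "u1 /home", "u1 /products", "u2 /home",
    "u2 /products", "u2 /checkout", "u3 /home", "u3 /products",
]

# Usage pattern 1: how often each page is requested.
page_hits = Counter(line.split()[1] for line in log_lines)
print(page_hits.most_common())

# Usage pattern 2: frequent page-to-page transitions per user (navigation paths).
by_user = {}
for line in log_lines:
    user, page = line.split()
    by_user.setdefault(user, []).append(page)
transitions = Counter((a, b) for pages in by_user.values()
                      for a, b in zip(pages, pages[1:]))
print(transitions.most_common())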
9a) Write Short Note On:
i. Issues regarding classification and prediction (5 marks)
1. Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency, and
scalability of the classification or prediction process.
 Data Cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute, or with the most probable value
based on statistics.) Although most classification algorithms have some mechanisms for handling
noisy or missing data, this step can help reduce confusion during learning.
 Relevance Analysis: Many of the attributes in the data may be irrelevant to the classification or
prediction task. For example, data recording the day of the week on which a bank loan application
was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may
be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any
irrelevant or redundant attributes from the learning process. In machine learning, this step is known
as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the
learning step.
 Ideally, the time spent on relevance analysis, when added to the time spent on learning from the
resulting “reduced” feature subset should be less than the time that would have been spent on
learning from the original set of features. Hence, such analysis can help improve classification
efficiency and scalability.
 Data Transformation: The data can be generalized to higher – level concepts. Concept hierarchies
may be used for this purpose. This is particularly useful for continuous – valued attributes. For
example, numeric values for the attribute income may be generalized to discrete ranges such as low,
medium, and high. Similarly, nominal – valued attributes like street, can be generalized to higher –
level concepts, like city. Since generalization compresses the original training data, fewer input /
output operations may be involved during learning.
2. Comparing Classification Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:
 Predictive Accuracy: This refers to the ability of the model to correctly predict the class label of new
or previously unseen data.
 Speed: This refers to the computation costs involved in generating and using the model.
 Robustness: This is the ability of the model to make correct predictions given noisy data or data with
missing values.
 Scalability: This refers to the ability to construct the model efficiently given large amount of data.
 Interpretability: This refers to the level of understanding and insight that is provided by the model.
ii. Outlier Analysis (5 marks)
 A database may contain data objects that do not comply with the general behavior or model of the data. Such
data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
 Most data mining methods discard outliers as noise or exceptions.
 However, in some applications such as fraud detection, the rare events can be more interesting than
the more regularly occurring ones.
 The analysis of outlier data is referred to as outlier mining.
 Outliers may be detected using statistical tests that assume a distribution or probability model for the
data, or using distance measures where objects that are a substantial distance from any other cluster
are considered outliers.
 Rather than using statistical or distance measures, deviation-based methods identify outliers by
examining differences in the main characteristics of objects in a group.
 Outliers can be caused by measurement or execution error.
 Outliers may be the result of inherent data variability.
 Many data mining algorithms try to minimize the influence of outliers or eliminate them all together.
 This, however, could result in the loss of important hidden information because one person’s noise
could be another person’s signal.
 Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
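As a small illustration of a statistical outlier test that assumes an approximately normal distribution, the following sketch flags values lying far from the mean of a hypothetical measurement set:

import statistics

# Hypothetical measurements with one suspicious value.
values = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 9.7, 10.4, 9.6, 10.0, 25.0]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean as outliers.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)  # expected: [25.0]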
9b) Discuss about social impacts and various trends in data mining. (5 marks)
Data mining is a young discipline with wide and diverse applications.
There is still a nontrivial gap between general principles of data mining and domain-specific, effective data
mining tools for particular applications. Some application domains:
 Biomedical and DNA data analysis
 Financial data analysis
 Retail industry
 Telecommunication industry
Social Impacts: Threat to Privacy and Data Security?
 Is data mining a threat to privacy and data security?
o “Big Brother”, “Big Banker”, and “Big Business” are carefully watching you.
o Profiling information is collected every time:
 you use a credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for
any of the above;
 you surf the Web, rent a video, or fill out a contest entry form;
 you pay for prescription drugs, or present your medical care number when visiting the
doctor.
o While the collection of personal data may be beneficial for companies and consumers, there is also
potential for misuse:
 Medical records, employee evaluations, etc.
