
I Mid Key

1. Describe General Characteristics of Data sets in detail.


Data Set: A data set is a collection of data objects, also referred to as records, events, patterns, entities, or vectors. A data object is described by a number of attributes that capture its basic characteristics, such as the time at which an event occurred.
The common characteristics of data sets that influence data mining techniques include
a) Dimensionality
b) Resolution
c) Sparsity
a) Dimensionality:
The number of attributes possessed by the objects in a data set is called its dimensionality. Objects with few dimensions tend to be qualitatively different from objects with many dimensions. High-dimensional data is also considerably more difficult to analyze, a problem known as the curse of dimensionality. Dimensionality reduction is therefore an important part of data preprocessing.
b) Resolution:
Data can be obtained at different levels of resolution, and the level of resolution influences the patterns that can be seen in the data. If the resolution is too coarse, a pattern may become invisible, while if the resolution is too fine, a pattern may be buried in noise. Different resolutions can therefore produce different analysis results.
Example
If the Earth's surface is analyzed at a resolution of fifteen kilometers it appears relatively smooth; however, when analyzed at a resolution of three kilometers the surface appears uneven.
c) Sparsity:
In some data sets, most of the attributes of a data object have the value zero; such data is said to be sparse. Asymmetric attributes, for which only the non-zero values matter, are a common source of sparsity. Sparsity is an advantage in data mining because only the non-zero values need to be stored and processed, which saves memory and computation time.
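As a minimal sketch of how sparsity saves memory and computation, a sparse data object can be stored as a dictionary of its non-zero attribute values instead of a full list (the vectors and function names below are invented for this illustration):

```python
def to_sparse(values):
    """Keep only the non-zero entries, keyed by attribute index."""
    return {i: v for i, v in enumerate(values) if v != 0}

def sparse_dot(a, b):
    """Dot product of two sparse vectors; only shared non-zero keys matter."""
    return sum(v * b[i] for i, v in a.items() if i in b)

dense_x = [0, 0, 3, 0, 0, 0, 1, 0]   # mostly zeros -> sparse
dense_y = [0, 2, 4, 0, 0, 0, 0, 0]

x, y = to_sparse(dense_x), to_sparse(dense_y)
print(len(x))            # only 2 non-zero values stored instead of 8
print(sparse_dot(x, y))  # 3 * 4 = 12
```

Only the entries that are non-zero in both vectors contribute to the dot product, which is why sparse representations cut both storage and computing cost.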
2. Explain any three different Data Preprocessing techniques.
Three different data preprocessing techniques are
1. Aggregation
2. Sampling
3. Attribute Transformation
Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object). It is performed for the following purposes.
Data reduction
Reduce the number of attributes or objects

Change of scale
Cities aggregated into regions, states, countries, etc
More stable data
Aggregated data tends to have less variability
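A small sketch of aggregation in Python: city-level sales records are combined into one total per country, illustrating both data reduction and change of scale (the records below are made-up sample data):

```python
from collections import defaultdict

# Made-up city-level records to aggregate.
records = [
    {"city": "Mumbai",  "country": "India",  "sales": 120},
    {"city": "Delhi",   "country": "India",  "sales": 200},
    {"city": "Toronto", "country": "Canada", "sales": 150},
]

# Combine many city objects into one object per country.
totals = defaultdict(int)
for r in records:
    totals[r["country"]] += r["sales"]

print(dict(totals))  # {'India': 320, 'Canada': 150}
```

Three objects become two, and the country-level totals vary less than the individual city figures would.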

Sampling
Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming; sampling is used in data mining for the same reason, since processing the entire data set is often too expensive or time consuming.
The key principle for effective sampling is the following:
Using a sample will work almost as well as using the entire data set if the sample is representative.
A sample is representative if it has approximately the same properties (of interest) as the original set of data.
Types of Sampling:
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for the sample.

In sampling with replacement, the same object can be picked up more than once
Stratified sampling
Split the data into several partitions; then draw random samples from each
partition
Sample Size
Choosing a sample size involves a trade-off: larger samples are more likely to be representative, but they are more expensive to obtain and process.
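The sampling schemes above can be sketched with Python's standard `random` module (the population and strata below are made-up for the example):

```python
import random

random.seed(0)  # fixed seed so the example is reproducible
population = list(range(100))

# Simple random sampling WITHOUT replacement: selected items are
# removed from the population, so no item is repeated.
without = random.sample(population, 10)

# Sampling WITH replacement: items stay in the population, so the
# same item can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split the data into partitions (strata),
# then draw a random sample from each partition.
strata = {"low": list(range(50)), "high": list(range(50, 100))}
stratified = [x for group in strata.values()
              for x in random.sample(group, 5)]

print(len(stratified))  # 10 (5 from each stratum)
```

Stratified sampling guarantees that each partition is represented, which simple random sampling cannot promise for small samples.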

Attribute Transformation
An attribute transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
Simple functions:
Standardization and Normalization
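The two simple transformations named above can be sketched as follows: min-max normalization rescales an attribute's values to the range [0, 1], while standardization (the z-score) recenters them to mean 0 and standard deviation 1 (the sample values are invented for the illustration):

```python
import statistics

def min_max_normalize(xs):
    """Map each value linearly onto the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Map each value to its z-score: (x - mean) / std-dev."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

values = [10, 20, 30, 40, 50]
print(min_max_normalize(values))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(values))        # mean 0, standard deviation 1
```

In both cases each old value maps to exactly one new value, as the definition above requires.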
3. Explain different OLAP operations in multidimensional data.
Different OLAP Operations in multidimensional data are
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
ROLL-UP
This operation performs aggregation on a data cube in either of the following ways:
By climbing up a concept hierarchy for a dimension.
By dimension reduction.
Consider the following diagram showing the roll-up operation.

The roll-up operation is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.


After the roll-up, the data is grouped by country rather than by city.
When the roll-up operation is performed by dimension reduction, one or more dimensions are removed from the data cube.
DRILL-DOWN
The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Consider the following diagram showing the drill-down operation:

The drill-down operation is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When the drill-down operation is performed, one or more dimensions are added to the data cube.
It navigates the data from less detailed data to more detailed data.

SLICE
The slice operation performs a selection on one dimension of a given cube, resulting in a new sub-cube. Consider the following diagram showing the slice operation.

The slice operation is performed for the dimension time using the criterion time = "Q1".
It forms a new sub-cube by selecting a single dimension.
DICE
The dice operation performs a selection on two or more dimensions of a given cube, giving a new sub-cube. Consider the following diagram showing the dice operation:

The dice operation on the cube is based on the following selection criteria, which involve three dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
PIVOT
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.

In this example, the item and location axes of the 2-D slice are rotated.
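The operations above can be sketched on a tiny cube stored as a Python dict keyed by (location, time, item); the cells and figures are made-up sample data, not from the diagrams:

```python
from collections import defaultdict

# A toy 3-D cube: (location, time, item) -> sales.
cube = {
    ("Toronto",   "Q1", "Mobile"): 100,
    ("Toronto",   "Q2", "Modem"):  150,
    ("Vancouver", "Q1", "Mobile"): 200,
    ("Vancouver", "Q2", "Modem"):  250,
}

# ROLL-UP by dimension reduction: aggregate away the item dimension.
rolled = defaultdict(int)
for (loc, time, item), v in cube.items():
    rolled[(loc, time)] += v

# SLICE: select on ONE dimension (time = "Q1") -> a 2-D sub-cube.
sliced = {k: v for k, v in cube.items() if k[1] == "Q1"}

# DICE: select on TWO OR MORE dimensions, as in the criteria above.
diced = {k: v for k, v in cube.items()
         if k[0] in ("Toronto", "Vancouver")
         and k[1] in ("Q1", "Q2")
         and k[2] in ("Mobile", "Modem")}

print(dict(rolled))  # item dimension removed
print(len(sliced))   # 2 cells with time == "Q1"
```

Roll-up shrinks the cube by removing a dimension, while slice and dice keep all dimensions but restrict the cells selected.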
4. Write an algorithm of Decision Tree Induction.
Generating a decision tree from the training tuples of data partition D.
Algorithm : Generate_decision_tree
Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split-point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if tuples in D are all of the same class C, then
    return N as a leaf node labeled with class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list)
    to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
    multiway splits allowed then
    attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
for each outcome j of splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
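The pseudocode above can be sketched as a compact ID3-style implementation in Python, using information gain (entropy reduction) as the attribute-selection method; the weather-style tuples at the bottom are made-up toy data for illustration, and the sketch omits the empty-partition leaf case for brevity:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Pick the attribute whose split gives the lowest weighted entropy."""
    def split_entropy(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total
    return min(attrs, key=split_entropy)

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all tuples same class -> leaf
        return labels[0]
    if not attrs:                             # attribute list empty -> majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)   # attribute selection method
    node = {"attr": a, "branches": {}}
    for v in set(r[a] for r in rows):         # one subtree per outcome
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node["branches"][v] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [x for x in attrs if x != a])     # remove splitting attribute
    return node

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

rows = [{"outlook": "sunny", "windy": "no"},  {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},  {"outlook": "rain",  "windy": "yes"}]
labels = ["play", "stay", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(predict(tree, {"outlook": "sunny", "windy": "no"}))  # play
```

On this toy data the selection method splits on windy (its split yields zero entropy), so the tree has a single internal node with one leaf per outcome, exactly as the pseudocode constructs.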
