Volume 2, Issue 10, October 2015. ISSN 2348-4853, Impact Factor 1.317
I. INTRODUCTION
A dataset is a collection of homogeneous objects, and an object is an instance in the dataset. A dimension is a property that defines an object. Dimensionality reduction is the process in which, at each step, irrelevant dimensions are removed without substantial loss of information and without affecting the final output. Feature extraction is the process of deriving a new, reduced set of features from the original features through some transformation of the attributes. Feature selection is the process of choosing an optimal subset of features according to an objective function. Because real-world large datasets contain irrelevant, redundant, and noisy dimensions, dimensionality reduction should be applied as a pre-processing step before clustering, classification, or regression algorithms are run on such datasets.
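To make the distinction between extraction and selection concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the dataset sizes and the constant "irrelevant" column are illustrative assumptions, not part of the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 3] = 0.0  # a constant (irrelevant) dimension

# Feature extraction: transform the original features into a
# smaller set of new features (principal components).
extracted = PCA(n_components=4).fit_transform(X)

# Feature selection: keep a subset of the original features,
# here by dropping near-zero-variance (irrelevant) dimensions.
selected = VarianceThreshold(threshold=1e-6).fit_transform(X)

print(extracted.shape)  # (100, 4)
print(selected.shape)   # (100, 9) -- the constant column is removed
```

Extraction produces new composite features, while selection merely discards original columns; both reduce the dimensionality before a learning algorithm is applied.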
A. High Dimensionality Data Reduction Challenges In Big Data
Big data is relentless: it is continuously generated on a massive scale by online interactions between people and systems and by sensor-enabled devices. It can be related, linked, and integrated to provide highly detailed information; such detail makes it possible for banking, health care, and public safety to provide specific services. It is creating new businesses and transforming traditional markets, and so it poses a challenge to the statistical community. Additional information can be obtained from a single large dataset as opposed to separate smaller sets, allowing correlations to be found, for instance to spot business trends. Big data involves increasing volume (the amount of data), velocity (the speed at which data moves in and out), and variety (the range of data types and sources). This requires new forms of processing for decision making. It produces massive sample sizes that allow us to discover hidden patterns associated with small subsets of a big dataset. High dimensionality and big data have special features such as noise accumulation and spurious correlation. Spurious correlation occurs because many uncorrelated random variables may exhibit high sample correlation coefficients in high dimensions, and such correlation leads to wrong inferences. High dimensional data is generated in sectors such as biotech, finance, and satellite imagery, and can be stored in the form of a data matrix: web term-document data, sensor-array data, consumer financial data, and so on. It is computationally infeasible to make inferences directly from the raw data, so from both the statistical and the computational viewpoints dimensionality reduction is an important step before processing of big data begins. High dimensional data can then be analyzed through classification, clustering, and regression.
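The spurious-correlation effect described above can be demonstrated with a short simulation (the sample size and dimension counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50  # a small sample size relative to the dimension

def max_spurious_corr(d):
    # d mutually independent variables plus an independent target y:
    # every true (population) correlation is exactly zero.
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    return max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d))

m_few = max_spurious_corr(5)      # few variables scanned
m_many = max_spurious_corr(1000)  # many variables scanned
print(m_few, m_many)
```

Although every true correlation is zero, the largest sample correlation grows as more uncorrelated variables are scanned; this is exactly the spurious correlation that leads to wrong inferences.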
62 | 2015, IJAFRC All Rights Reserved
www.ijafrc.org
Loading of dataset - The very first step is to load a dataset into the machine.
Extraction of features - Apply an algorithm to extract the relevant features.
Stopping criterion - Set a threshold that the features have to satisfy.
Provide results - The features that satisfy the criterion come out as the output.
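The four steps above can be sketched as a generic pipeline; the variance-based relevance function and the threshold value here are illustrative assumptions:

```python
import numpy as np

def feature_selection_pipeline(X, relevance_fn, threshold):
    """Steps 2-4 above: score every feature of the already-loaded
    dataset X, then keep the features whose relevance satisfies
    the stopping criterion (a fixed threshold)."""
    scores = np.array([relevance_fn(X[:, j]) for j in range(X.shape[1])])
    selected = [j for j, s in enumerate(scores) if s >= threshold]
    return selected, scores

# Step 1: load a toy dataset (only column 0 actually varies).
X = np.array([[1.0, 5.0, 7.0],
              [2.0, 5.0, 7.0],
              [3.0, 5.0, 7.0]])

selected, scores = feature_selection_pipeline(X, np.var, threshold=0.1)
print(selected)  # [0] -- the constant columns are filtered out
```

Any relevance measure (variance, correlation with the target, information gain) can be substituted for `np.var` without changing the structure of the pipeline.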
If the above example were solved using any data mining algorithm, that algorithm would need to approximate the XOR function in order to accurately predict Y. An alternative is the MDR constructive induction algorithm, which changes the representation of the data. MDR does so by selecting two attributes, such as X1 and X2 in the example above. Each combination of values for X1 and X2 is examined, and the number of times Y=1 and/or Y=0 is counted. With MDR, the ratio of these counts is computed and compared with a fixed threshold. Here, the ratio of counts is 0/1, which is less than our fixed threshold of 1; since 0/1 < 1, we encode the new attribute Z as 0. When the ratio is greater than one, we encode Z as 1. This process is repeated for all unique combinations of values for X1 and X2.
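A minimal sketch of this encoding step, written as a simplified reading of the description above rather than a full MDR implementation (the handling of an empty Y=0 count is an assumption):

```python
from collections import Counter

def mdr_encode(x1, x2, y, threshold=1.0):
    # Count Y=1 and Y=0 occurrences per (X1, X2) combination.
    ones, zeros = Counter(), Counter()
    for a, b, label in zip(x1, x2, y):
        (ones if label == 1 else zeros)[(a, b)] += 1
    z = []
    for a, b in zip(x1, x2):
        # Ratio of the counts; an empty denominator counts as "high".
        if zeros[(a, b)] == 0:
            ratio = float("inf") if ones[(a, b)] > 0 else 0.0
        else:
            ratio = ones[(a, b)] / zeros[(a, b)]
        # Encode the new attribute Z by comparing against the threshold.
        z.append(1 if ratio > threshold else 0)
    return z

# XOR-like data: Y = X1 xor X2, one observation per combination.
print(mdr_encode([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 1, 0]
```

The two-dimensional attribute (X1, X2) is collapsed into the single attribute Z, which a simple one-dimensional model can then use directly.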
3. Independent Component Analysis (ICA)
ICA finds the independent components (also called factors, latent variables, or sources) by maximizing the statistical independence of the estimated components. ICA uses two broad definitions for measuring the independence of components: 1) Minimization of mutual information - the minimization-of-mutual-information (MMI) family of ICA algorithms uses measures such as Kullback-Leibler divergence and maximum entropy. 2) Maximization of non-Gaussianity - the non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. Typical ICA algorithms use centering (subtracting the mean to create a zero-mean signal), whitening (usually via eigenvalue decomposition), and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimensionality reduction can be achieved with principal component analysis or singular value decomposition.
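As a sketch of these steps, scikit-learn's FastICA (assumed available) can recover two linearly mixed sources; the source signals and mixing matrix here are illustrative:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent sources mixed by an (illustrative) matrix A.
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * np.pi * t),            # sinusoid
                     np.sign(np.sin(3 * np.pi * t))])  # square wave
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T  # observed mixed signals

# FastICA performs the centering and whitening preprocessing
# internally before its iterative non-Gaussianity maximization.
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)
print(S_est.shape)  # (2000, 2): two estimated independent components
```

Both sources are strongly non-Gaussian, which is the property the FastICA iteration exploits; the recovered components may come back permuted and rescaled, since ICA cannot fix their order or amplitude.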
1.Goal
Table 2
MDR
ICA
The goal is to transform the data, using a linear transformation W, into maximally independent components, measured by some statistic of independence, i.e. to minimize the statistical dependence between the vectors.
NN
The goal is to train the neural network to minimize the network error.
2.Methodology
Table 3
PCA
MDR
ICA
NN
It uses multiple hidden layers for the dimensionality reduction operation, where each dimension is a logistic function of the input.
3.Merits
Table 4
PCA
MDR
ICA
NN
It uses the gradient descent method to locally minimize the squared output error.
4.Demerits
Table 5
PCA
Vectors are less spatially localized.
MDR
Implementation of mining patterns with MDR from real data is computationally complex.
ICA
Vectors are neither orthogonal nor ranked in order.
NN
Neural networks are difficult to model because a small change in a single input will affect the entire network.
Step 3 - We then take the average of all relevance values and name it the threshold value. We check each feature's relevance against the threshold value one by one: if the relevance is greater than the threshold value, we mark that feature as relevant; otherwise, as irrelevant.
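Step 3 can be sketched as follows; the relevance scores in the example are hypothetical, since the paper does not list concrete values:

```python
import numpy as np

def sufs_threshold_step(relevance):
    """Step 3 above: the threshold is the average of all relevance
    values; features above it are relevant, the rest irrelevant."""
    relevance = np.asarray(relevance, dtype=float)
    threshold = relevance.mean()
    relevant = [j for j, r in enumerate(relevance) if r > threshold]
    irrelevant = [j for j, r in enumerate(relevance) if r <= threshold]
    return threshold, relevant, irrelevant

# Hypothetical relevance scores for five features.
threshold, relevant, irrelevant = sufs_threshold_step([0.9, 0.1, 0.7, 0.2, 0.1])
print(relevant)  # [0, 2] -- only features 0 and 2 exceed the mean (0.4)
```

Using the mean as the threshold makes the cutoff adapt to the dataset: no fixed constant has to be tuned by hand.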
Fig 6. SUFS Algorithm
VI. IMPLEMENTATION RESULTS
A. Dataset Description
The proposed algorithm uses the insurance dataset, which comprises 69 dimensions. The attributes of the dataset are as follows:
1.Contact No
2.Product Code
3.Prod long des
4.Sum Assured
5.Bill Frequency
6.Premium cessation term
7.Policy_term
8.Status_code
9.Premium_Paying_status
10.Premium_status_description
11.med nonmed
12.Mode_of_payment
More attributes follow likewise, and some of them can be visualized through the graphs given below.
1. Age (Fig 7)
2. Agent branch short description (Fig 8)
3. Agent Client Id (Fig 9)
4. Agent Full Name (Fig 10)
The above dataset is given as input to the SUFS algorithm described above, and the output is the reduced set of relevant features that satisfy the criteria.
B. Output