Sameer S. Prabhune and S.R. Sathe

Abstract
Preprocessing is a crucial step in a variety of data warehousing and mining tasks. Real-world data is noisy and often suffers from corruption or incomplete values that may affect the models built from it. The accuracy of any mining algorithm depends greatly on the input data set. Incomplete data sets have become almost ubiquitous in a wide variety of application domains; common examples include climate, image, sensor, and medical data sets. The incompleteness in these data sets may arise from a number of factors: in some cases it is simply a reflection of certain measurements not being available at the time; in others, information is lost due to partial system failure; or it may be a result of users being unwilling to specify attributes due to privacy concerns. When a significant fraction of the entries are missing in all of the attributes, it becomes very difficult to perform any kind of reasonable extrapolation on the original data. For such cases, we introduce the idea of attribute weightage, in which every attribute is assigned a weight that is used to predict a complete data set from an incomplete one, so that data mining algorithms can then be applied directly. The appeal of the idea lies in weighting the attribute values and finally averaging them. We demonstrate the effectiveness of the approach on a variety of real data sets. This paper describes the theory and implementation of a new filter, ARA (Attribute Relation Analysis), added to the WEKA workbench for reconstructing a complete dataset from an incomplete dataset.
1. INTRODUCTION
Many data analysis applications, such as data mining, web mining, and information retrieval systems, require various forms of data preparation. Most of them work on the assumption that the data they process is complete, but in practice this assumption rarely holds.
In data preparation, one takes the data in its raw form, removes as much noise, redundancy, and incompleteness as possible, and extracts the core data for further processing. Common solutions to the missing data problem include imputation and statistical or regression-based procedures [11]. We note that the missing data mechanism relies on the fact that the attributes in a data set are not independent of one another: one attribute has some predictive value for another [1]. We therefore use the well-known principle of weightage on attribute instances [11] to predict the missing values. This paper gives the implementation details of adding an ARA filter to the WEKA workbench for imputing missing values.
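As a point of reference for the imputation procedures mentioned above, the simplest baseline is mean imputation: replace each missing entry in a numeric column by the mean of the observed entries. The sketch below illustrates this baseline only; the class and method names are hypothetical and it is not part of the ARA filter itself.

```java
import java.util.Arrays;

public class MeanImputation {
    // Replace missing entries (encoded here as Double.NaN) in a numeric
    // column by the mean of the observed values -- the simplest baseline
    // against which weighted schemes such as ARA can be compared.
    static double[] imputeMean(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = sum / count;
        double[] out = Arrays.copyOf(column, column.length);
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] col = {45, Double.NaN, 33, 34, 12};
        // the missing entry becomes (45 + 33 + 34 + 12) / 4 = 31.0
        System.out.println(Arrays.toString(imputeMean(col)));
    }
}
```

Mean imputation ignores the position of the missing value within the column; the ARA approach described in this paper instead weights nearby instances more heavily.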
We have also studied the working of a simple filter by referring to the filter classes available in the WEKA Java packages [9,10].
2.2 JAVA
We used Java as our coding language for two reasons:
1. The WEKA workbench is written entirely in Java and supports Java packages, so Java is the natural choice for extending it.
2. We could reuse classes from both the standard Java packages and the WEKA packages to create the filter.
3 PSEUDO CODE
This pseudo code is intended to give the reader an understanding of the ARA algorithm, which predicts missing values by exploiting the relations between attributes in high-dimensional data [1,2].
Procedure ARA( )
Output: the completed dataset, with every missing value replaced by a predicted value
Where
  I_k   - instance in AV_k
  J     - element in AV_k
  It    - iteration
  CV[ ] - array of instances

01. For each iteration {A_i1 ... A_in}, scan the entire set of attribute values.
02. If the iteration contains missing attributes, add it to the missing-attribute set MAK; if it contains no missing attribute, leave it unchanged.
03. Replace every missing value in the iteration by a question mark: A_ik = ?, ..., A_jk = ?
04. If an instance in the iteration is not classified correctly, jump to step 07 and restore its value from the given dataset {A_i1 ... A_in}; chgnum++
05.-07. Store the iteration and its changed attribute in the array CV[ ]; if any instance changed, write the new iteration {A_i1 ... A_in}; chgnum++
08. iteration++
    // after completing one iteration, move to the next and continue until all iterations are processed
09. if (chgnum != 0)
10.     // if the stored change value is zero, the missing entry is replaced by zero; otherwise it is replaced using the weighted averages:

        AV_ik = ( A_i1 w(A_in) + A_i2 w(A_i,n-1) + ... + A_in w(A_i1) ) / SUM_{t=1..n} w(A_it)

        AV_jk = ( A_j1 w(A_jn) + A_j2 w(A_j,n-1) + ... + A_jn w(A_j1) ) / SUM_{t=1..n} w(A_jt)

        MV_k = ( AV_ik + AV_jk ) / 2

        // AV_ik and AV_jk are the weighted averages of the instances above and below the missing value; the nearest instance receives the largest weight
11. return ( MV_k )
12. return ( the completed dataset )
Figure 1 shows the ARA algorithm for prediction of the missing values.
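The core averaging computation of the ARA pseudo code can be sketched in plain Java as follows. Note that the weight function w(.) is not fully specified in the pseudo code; this sketch assumes linearly decreasing weights with distance from the missing cell (nearest instance weighted most), and the class and method names are hypothetical.

```java
public class AraAverage {
    // Weighted average of neighbor values, nearest first: the nearest value
    // gets weight n, the next n-1, and so on down to 1 -- one plausible
    // reading of w(.) in the pseudo code, which leaves the weights open.
    static double weightedAvg(double[] neighbors) {
        int n = neighbors.length;
        double num = 0, den = 0;
        for (int t = 0; t < n; t++) {
            double w = n - t;          // nearest neighbor weighted most
            num += neighbors[t] * w;
            den += w;
        }
        return num / den;
    }

    // MV_k = (AV_ik + AV_jk) / 2 : average of the estimates computed from
    // the instances above (upper) and below (lower) the missing value.
    static double predictMissing(double[] upper, double[] lower) {
        return (weightedAvg(upper) + weightedAvg(lower)) / 2.0;
    }

    public static void main(String[] args) {
        // values above the missing cell (nearest first) and values below it
        double[] upper = {45};
        double[] lower = {33, 34, 12, 13};
        System.out.println(predictMissing(upper, lower));
    }
}
```

With these assumed weights, the lower estimate is (33*4 + 34*3 + 12*2 + 13*1) / 10 = 27.1, and the prediction is the mean of 45 and 27.1; the actual filter's weight function may differ.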
4. IMPLEMENTATION
Coding Details
We use datasets written in ARFF format as input to this algorithm and the ARA filter [2,7,8]. The filter takes an ARFF dataset as input and locates the missing values in it. It then applies the ARA algorithm to predict the missing values and reconstructs the whole dataset by inserting the predicted values into the given dataset.
We have created an ARA filter class, which extends the abstract SimpleBatchFilter class [9]. Our algorithm first takes an ARFF format database as input and reads how many attributes the data set contains. It takes each attribute individually and writes it into an array. It then inserts all instances, including missing ones, into that array and searches for the first missing instance; when one is found, it is temporarily replaced by zero. The filter then calculates the weighted average of all upper instances according to their influence on that particular instance: the nearest instance receives the largest weight, and the weights decrease instance by instance. It calculates the weighted average of all lower instances in the same way. Finally, we calculate the average of the lower and upper estimates to obtain the predicted value.
5 EXPERIMENTAL RESULTS
The objective of our experiment is to build the filter as a preprocessing step in the WEKA workbench that completes data sets containing missing values.
We intentionally did not select those UCI data sets [12] that originally come with missing values, because for them the true values are unknown and the accuracy of our approach cannot be measured. For the experimental setup, we instead take complete data sets from the UCI repository [12] and evaluate our approach on them.
In Table 1, we use the UCI [12] HorseColic data set; the original data set has seven numeric attributes, a to g. The first column of Table 1 gives the original values. For the second column, we purposely deleted eight values to make the data set incomplete. The third column shows the projected values obtained after applying the ARA filter. The projected values lie in the same domain as the original values, which is the expected result.
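The same-domain observation can be checked mechanically: a projected value should lie within the range of the originally observed values of its attribute. The sketch below (hypothetical class and method names) performs this check for attribute c of Table 1, where the deleted value 59 was projected as 32.466667.

```java
public class DomainCheck {
    // Returns true when v lies within the [min, max] range of the
    // observed values of an attribute column.
    static boolean inDomain(double v, double[] observed) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double x : observed) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        return v >= min && v <= max;
    }

    public static void main(String[] args) {
        // Original attribute-c values from Table 1; 32.466667 is the value
        // the ARA filter projected for the deleted entry (originally 59).
        double[] originalC = {45, 59, 33, 34, 12, 13, 20, 21, 22, 25};
        System.out.println(inDomain(32.466667, originalC)); // prints true
    }
}
```

A range check of this kind is a sanity test, not a full accuracy measure; Table 1 compares the projected values against the deleted originals directly.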
Original dataset:

@relation testPDF
@attribute a numeric
@attribute b numeric
@attribute c numeric
@attribute d numeric
@attribute e numeric
@attribute f numeric
@attribute g numeric
@data
85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2
86,54,33,16,54,0.0,2
91,78,34,24,36,0.0,2
87,70,12,28,10,0.0,2
98,55,13,17,17,0.0,2
88,62,20,17,9,0.5,1
88,67,21,11,11,0.5,1
92,54,22,20,7,0.5,1
90,60,25,19,5,0.5,1

Dataset with 8 missing values:

@relation testPDF
@attribute a numeric
@attribute b numeric
@attribute c numeric
@attribute d numeric
@attribute e numeric
@attribute f numeric
@attribute g numeric
@data
85,92,45,27,31,0.0,1
85,64,?,32,23,0.0,2
86,54,33,16,54,?,2
?,78,34,24,36,0.0,2
87,70,12,28,?,0.0,2
98,?,13,17,17,0.0,2
88,62,20,?,9,0.5,1
88,67,21,11,11,0.5,?
92,54,?,20,7,0.5,1
90,60,25,19,5,0.5,1

Output after applying filter:

@relation HorseColic-weka.filters.unsupervised.attribute.AraFilter
@attribute a numeric
@attribute b numeric
@attribute c numeric
@attribute d numeric
@attribute e numeric
@attribute f numeric
@attribute g numeric
@data
85,92,45,27,31,0,1
85,64,32.466667,32,23,0,2
86,54,33,16,54,0.357143,2
88.25,78,34,24,36,0,2
87,70,12,28,25.35,0,2
98,65.514286,13,17,17,0,2
88,62,20,19.42381,9,0.5,1
88,67,21,11,11,0.5,1.357143
92,54,23.485185,20,7,0.5,1
90,60,25,19,5,0.5,1

TABLE 1
6 CONCLUSION
In this paper, we provided the implementation details of adding a new filter, ARA, to the WEKA workbench. As seen from the results, the ARA filter works by averaging the weighted upper and lower instances and thereby predicts the missing data.
We demonstrated the complete procedure of building the filter with the available technologies, and of adding it as an extension to the WEKA workbench.
ACKNOWLEDGMENTS
Our special thanks to Mr. Peter Reutemann of the University of Waikato (fracpete@waikato.ac.nz) for providing support as and when required.
REFERENCES
1. S. Parthasarathy and C.C. Aggarwal, "On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets," IEEE Trans. Knowledge and Data Eng., pp. 1512-1521, 2003.
2. J. Quinlan, C4.5: Programs for Machine Learning, San Mateo, Calif.: Morgan Kaufmann, 1993.
3. http://weka.sourceforge.net/wiki/index.php/Writing_your_own_Filter
6. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Publishers. ISBN: 81-312-0050-7.
7. http://weka.sourceforge.net/wiki/index.php/CVS
8. http://weka.sourceforge.net/wiki/index.php/Eclipse_3.0.x
9. weka.filters.SimpleBatchFilter
10. weka.filters.SimpleStreamFilter
11. R. Little and D. Rubin, Statistical Analysis with Missing Data, Wiley Series in Probability and Statistics, 2002.