International Journal of Data Warehousing and Mining, 10(1), 55-76, January-March 2014 55
ABSTRACT
In this study, a new model for customized privacy in privacy preserving data mining is provided, in which the data owners define different levels of privacy for different features. Additionally, in order to improve perturbation methods, a method combining singular value decomposition (SVD) and feature selection is defined, so as to benefit from the advantages of both domains. Also, to assess the amount of distortion created by the proposed perturbation method, new distortion criteria are defined in which the distortion created in the process of feature selection is weighted according to the privacy value of each feature. Various tests and analyses of the results show that the method based on this model, compared to previous approaches, improves the privacy, the accuracy of the mining results, and the efficiency of privacy preserving data mining systems.
Keywords: Customized Privacy, Data Owners, Perturbation, Privacy Preserving Data Mining, Singular Value Decomposition (SVD)
DOI: 10.4018/ijdwm.2014010104
Copyright © 2014, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
of this scope are used by various governmental, industrial, commercial, medical, financial, and scientific domains due to several advantages. In fact, the wide range of data mining applications has made it an important field of research (Keyvanpour, Javadieh, & Ebrahimi, 2011).

As privacy is an issue of individual perception, an infallible and general solution to this dichotomy is infeasible. However, there are measures that can be undertaken to raise privacy protection (Wahlstrom, Roddick, Sarre, Estivill-Castro, & de Vries, 2009). Accordingly, in recent years, due to increasing concerns related to privacy, data mining methods have faced a serious challenge: preserving the privacy of sensitive data. Data mining itself is under attack from privacy advocates because of misunderstandings about what it really is and credible concerns about how it is generally done (Vaidya & Clifton, 2004). Organizations, on the one hand, should publish their customized information so as to access the benefits of data mining and, on the other hand, are unwilling to share their data in order to preserve privacy. The occurrence of such problems in data collection can be detrimental to the success of data mining methods in achieving their goals (Seifi & Keyvanpour, 2012).

Hence, a new aspect in the development of data mining is the set of approaches that address concerns about privacy, in particular the question of whether data mining methods can produce accurate models, and yield valid mining results, without access to the precise information in individual records (Clifton, Kantarcioglu, & Vaidya, 2002). In response to such anxieties, data mining researchers started to work on methods that preserve privacy along with data mining. As a result of this research, various privacy preserving data mining (PPDM) approaches have been defined.

Data modification is one of the most popular approaches to privacy preserving data mining, especially for applications that require data owners to publish their personal and sensitive data. In this approach, the data are changed prior to publication through certain methods so as to hide sensitive information (Keyvanpour & Seifi, 2010).

Approaches based on data modification usually have good computational efficiency but offer few guarantees of preserving privacy, and they balance with difficulty between ensuring privacy and data utility (the important information and patterns existing in the data, which should be preserved during data modification so that the accuracy of the data mining results remains acceptable). As a result, the main challenge of data modification based methods is to strike a good and fair balance between privacy and data utility (Liu, Giannella, & Kargupta, 2006).

Recently, one of the most effective approaches to meet the challenges in privacy preserving data mining is the use of methods based on dimension reduction. These methods operate on the idea that they first identify worthless information in the dataset and then eliminate these worthless data, which perturbs the dataset. Moreover, since in data mining applications the eliminated parts are considered noise, in many cases the use of these methods can produce better results in terms of accuracy compared to mining on the original dataset (Xu, Zhang, Han, & Wang, 2006). One of the dimension reduction based methods used in PPDM is the Singular Value Decomposition (SVD) method (Keyvanpour & Seifi, 2010).

Generally, there are two approaches in the dimension reduction area: the feature extraction approach and the feature selection approach. Feature extraction strategies are procedures that produce new features based on transformations or combinations of the original feature set. On the other side, feature selection methods are those that determine the best subset of features out of the feature set based on certain criteria (Estévez, Tesmer, Perez & Zurada, 2009).

Singular value decomposition (SVD) is also a popular feature extraction method. One of the major challenges in using this approach in the supervised learning area is that some of the new features created may be unrelated to the supervised learning task (Rakotomalala
& Mhamdi, 2006). In fact, the SVD method is an unsupervised method and therefore, to create new features, all features are used, whether or not they are related to the supervised learning task. To solve this problem, this research tries to achieve better results in the area of supervised learning by combining feature selection strategies with the SVD method. Therefore, before applying the SVD method, a feature selection step is added. At this stage, only features associated with the data mining task are chosen.

Another advantage of combining SVD and feature selection is a significant reduction in computational cost. The reason is that one of the main challenges of the SVD method is its high computational cost for large-scale data matrices (Wang, Zhang, Xu, & Zhong, 2008). So, by removing features irrelevant to the data mining task before applying the SVD method, the computational cost can be reduced. Also, by removing irrelevant features of the data mining task, which may nevertheless be very important in terms of privacy, privacy can be improved as well. Overall, the benefits of combining feature selection and SVD methods include: 1) increasing privacy protection, 2) improving data mining results, and 3) reducing the computational cost.

In addition, another major challenge of privacy preserving data mining methods based on data modification is that they assume the same level of privacy for all features, while privacy is a social concept and there may be different concerns regarding privacy for distinct features (Liu, Kantarcioglu, & Thuraisingham, 2008). For example, in a medical dataset, features such as age and illness are much more important than the patient's zip code.

Also, the privacy level of each feature can vary from one dataset to another based on the application of data mining and the sensitiveness of the private information. In addition, the owners' beliefs about the computational power and the background knowledge of adversaries can affect their privacy expectations (Kamalika, 2009). Thus, the possibility of defining personal privacy in PPDM is of utmost importance. Based on this, another goal of this research is to provide a privacy model that allows the definition of different levels of privacy for each feature. The customized privacy model, which is the basis of the proposed perturbation model in this study, is determined by the data's owner based on his/her privacy requirements.

One of the major advantages of this customized privacy model is the reduction in cost, because it is usually costly for owners to preserve privacy, whether the cost of executing calculations or the reduction in the quality or utility of the data needed to access the mining results (Kamalika, 2009). So the data owners, based on their own resources, attempt to minimize the cost of privacy protection through the definition of different levels of privacy for each feature, while maximizing privacy and data utility.

The aim of this study is to provide a customized privacy model based on the data owner's opinions. Based on this model, data owners can determine the value and significance of each feature in terms of privacy protection, according to their concerns and attitudes. Also, based on this privacy model, a hybrid model will be provided based on a feature selection method and the SVD perturbation method in the area of privacy preserving data mining. This hybrid model will be able to resolve challenges of previous methods such as high computational cost, low data utility, and weak privacy guarantees.

The Overall Function of Privacy Preserving Data Mining System Based on the Proposed Method

In this part, the overall function of the privacy preserving data mining system based on the proposed method is described. The input of the proposed model is a dataset that contains confidential information of individuals or organizations. In the proposed method, it is assumed that the collected data is a numeric matrix that contains a set of records with numerical features and a categorical feature (the class label). Also, the feature values are continuous, from the domain of real numbers. The output of
the proposed model is a perturbed dataset that is published by the owner to go through data mining.

As can be seen in Figure 1, the proposed perturbation method has two main phases: the features valuation phase and the data perturbation phase. In the features valuation phase, which is one of the important parts of the proposed method, all the features of the original matrix are valued according to their degree of importance in terms of privacy for the data owner. In the next phase of this model, the operations of perturbation and data distortion are performed and, based on them, a perturbed dataset is produced as the system output and used for data mining.

The main purpose of the proposed perturbation method in this research is to maximize the difference between the original dataset and the perturbed dataset and also to minimize the difference between the data mining results on the initial dataset and on the perturbed dataset. In the following sections, each phase of the proposed privacy preserving data mining system will be described.

The Features Valuation Phase

The valuation phase of the features is one of the most important parts of the proposed model. The output of this step is a matrix in which each feature is assigned a weight. The weights in the matrix indicate the importance of privacy preservation for each feature for the owner. With respect to the existing demands for defining a personal privacy model, in this research a model called the Customized Privacy Model (CPM) is provided to designate values for the features, which will be explained thoroughly in the next section.

Proposed Privacy Model of CPM

The CPM privacy model is proposed based on the idea that the data owner may show different concerns regarding privacy in relation to different features, so the privacy model should be defined based on their comments.

This model assumes that the highest level of privacy protection that the existing perturbation methods provide is equal to 1. Accordingly, in this model, the range of the importance of privacy of the features is defined on [0, 1].
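As a small illustration, a privacy-level assignment under this model is simply a vector of per-feature weights in [0, 1]; the feature names below are hypothetical, echoing the earlier medical-dataset example:

```python
import numpy as np

# Hypothetical owner-assigned privacy levels: 1 = highest importance
# to protect, 0 = no privacy importance for the owner.
W = {"age": 0.9, "illness": 1.0, "zip_code": 0.2}

alphas = np.array(list(W.values()))
assert ((alphas >= 0.0) & (alphas <= 1.0)).all()  # weights lie in [0, 1]
```

An owner worried mainly about medical attributes would push `age` and `illness` toward 1 while leaving attributes they consider harmless near 0.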
That is, the highest degree of the importance of a feature can be 1 in terms of privacy preservation and the lowest degree of importance can be 0, in which case the feature is not important for the owner in terms of privacy.

In the CPM model, the data owner, based on the degree of importance of the features' privacy, and also to create a balance between the accuracy of the mining results and the amount of privacy preservation, assigns a weight in [0, 1] to each feature. In this case, the matrix of the levels of privacy is:

W = {α_1, α_2, …, α_m} : 0 ≤ α_i ≤ 1    (1)

Data Perturbation Phase

The perturbation phase is the most important part of the proposed method. In this research, the Sparse Singular Value Decomposition method (SSVD) is used as the basic method of distortion (Xu, Zhang, Han, & Wang, 2006; Wang, Zhang, Xu, & Zhong, 2008; Xu, Zhang, Han, & Wang, 2005). As was discussed in the first section, with respect to the inherent nature of the methods based on matrix decomposition and the methods based on feature selection, these two approaches in privacy preserving data mining are able to reinforce each other and are somehow complementary. The reason is that the use of feature selection methods before the SVD perturbation method will lead to the removal of features irrelevant to the data mining task. The removal of such features will improve data mining results, reduce the computational cost of the SVD method and, most important of all, if the removed features possess higher importance in terms of privacy, the amount of privacy protection of the given dataset will also increase significantly. Accordingly, in this study a new method is proposed based on combining a feature selection method with the SSVD method, which is briefly called FS-SSVD.

The Proposed FS-SSVD Method

The overall operation of the FS-SSVD method is as follows: first, the superior features of the initial data matrix are selected using feature selection methods. In this method,
the aim is to choose features that are associated with supervised learning. Then, in the next step, by applying the SSVD method, the matrix containing the superior features is perturbed. The general framework of the FS-SSVD approach is shown in Figure 2 and its related details are shown in Algorithm 1.

As can be seen in the algorithm, first, the NSF superior features of matrix A are selected through a feature selection algorithm and, based on this, matrix B including the selected features and two arrays containing the removed features' indices (RemIndex) and the selected features' indices (SelIndex) are produced. The perturbed matrix Bk is produced by using the SSVD method based on rank k. After implementation of the perturbation process, the amount of created distortion is calculated.

In the proposed method, new criteria are used to calculate the amount of distortion created in the perturbation process of FS-SSVD. At this stage, after implementing the SSVD perturbation method, a learning algorithm is used to assess the amount of data utility. In fact, since choosing a suitable rank k in the SVD method that leads to optimal performance of data mining algorithms is still an open topic, different tests are used to investigate the impact of the SVD rank on the accuracy of a learning algorithm.

Next, the process of superior features selection in the FS-SSVD method is discussed. Finally, the new criteria, which are created and used to evaluate the distortion created by the above method, are precisely defined. In order to define the new measures of distortion, the privacy levels of the features, which are defined in the CPM model, are used.

The Process of Superior Features Selection

In this process, the features of the dataset that are most relevant to supervised learning are selected. In fact, the proposed method tries to improve the efficiency of the perturbation method in terms of computation, privacy and accuracy of mining results by eliminating features irrelevant to the target data before distorting the data.

Feature selection methods aim to select the features that have the highest relation to the data mining goals, while the irrelevant data are removed; this decreases the size of the dataset and therefore lowers the computational cost and gives access to better accuracy of data mining results. Feature selection algorithms can be classified into two general categories: the filter model and the wrapper model (John, Kohavi & Pfleger, 1994).

Filter models, independent of the learning algorithms and based on general characteristics of the training data, select a subset of features as a preprocessing step. The wrapper models use classifiers (learning machines) to evaluate the merit of and to choose a subset of the features. Since the wrapper model requires learning a particular classifier for each new set of features, it chooses features with high prediction power, estimated based on the accuracy of the specific learning algorithm. On the other hand, this model is computationally very expensive and thus of lesser generality (Langley, 1994). Accordingly, in this research, feature selection methods based on the filter model are used.

In the filter model, the different feature selection methods, based on whether they evaluate the goodness of the features individually or as a subset, are classified into two groups: feature ranking and subset search algorithms. In feature ranking methods, each feature is independently assigned a weight based on its individual merits (such as relevance, entropy or information content), and then the features with the highest ranking are selected. These methods are computationally efficient, but one of their main disadvantages is that they cannot remove duplicate or redundant information.

In contrast, in the subset search methods, all possible subsets of the features are searched and the quality of each subset is assessed using assessment criteria. However, this method may be computationally very expensive for problems with high dimensions, because the size of the search space corresponds to the sum
Input: Initial data matrix A[n][m], class labels C[n][1], learning algorithm L, feature selection algorithm FS, number of selected features NSF, array of privacy levels of features W[m][1], sparsification parameters εV, εU.
Output: Perturbed data matrix B̄[n][d] : d ≤ m.
Begin
1. Select the superior features of A by the feature selection algorithm FS: {B, RemIndex, SelIndex} = FS(A, C, NSF);
2. Determine the rank of matrix B: r = Rank(B);
3. for k = 1 to r do
4. Apply the SSVD decomposition to B to compute a perturbed dataset: Bk = SSVD(B, k, εV, εU);
5. Calculate the data distortion criteria on Bk: {WVD, WRP, WRK, WCP, WCK} = Calculate distortion value of FS-SSVD method (A, B, Bk, W, RemIndex, SelIndex);
6. Calculate the mining accuracy on Bk by the learning algorithm L;
7. end for
8. Select one Bk as the final perturbed dataset B̄.
End
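The flow of Algorithm 1 can be sketched as below; this is only an illustration, with a variance-based selector standing in for the paper's feature selection algorithm, and the thresholding of the U and V factors a simplified reading of the SSVD sparsification controlled by εU and εV:

```python
import numpy as np

def fs_ssvd(A, nsf, k, eps_u=0.02, eps_v=0.02):
    """Sketch of Algorithm 1: select NSF features, then perturb them
    with a sparsified rank-k SVD."""
    # 1. Feature selection (variance score stands in for the paper's FS step).
    scores = A.var(axis=0)
    sel = np.sort(np.argsort(scores)[-nsf:])          # SelIndex
    rem = np.setdiff1d(np.arange(A.shape[1]), sel)    # RemIndex
    B = A[:, sel]
    # 2. Rank-k decomposition of B.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    # 3. Sparsification: zero out small entries of the factor matrices.
    Uk = np.where(np.abs(Uk) < eps_u, 0.0, Uk)
    Vtk = np.where(np.abs(Vtk) < eps_v, 0.0, Vtk)
    # 4. Perturbed matrix Bk = Uk Σk Vtk.
    return Uk @ np.diag(sk) @ Vtk, sel, rem

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 10))            # 30 records, 10 numeric features
Bk, sel, rem = fs_ssvd(A, nsf=6, k=3)    # keep 6 features, rank-3 perturbation
```

In a full pipeline, a classifier would then be trained on `Bk` for each candidate rank k before one perturbed matrix is published.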
(Polat & Du, 2005). In this study, to calculate the amount of distortion created in the FS-SSVD method, new measures are used. The distortion measures provided for SSVD methods in Xu, Zhang, Han, and Wang (2006), Wang, Zhang, Xu, and Zhong (2008), and Xu, Zhang, Han, and Wang (2005) calculate the average amount of distortion based on the initial matrix B and the perturbed matrix B̄, which have the same dimensions. Moreover, in these criteria, the privacy value of the features is not considered.

But some of the features are removed in the FS-SSVD method during the selection of superior features. Thus, the initial matrix (A) of this method does not have the same dimensions as its perturbed matrix (B̄), so the distortion amount cannot be calculated directly based on the SSVD criteria. It is therefore necessary to change the distortion criteria in a way that considers the amount of perturbation created in the process of eliminating irrelevant features. To do so, since in this research the degree of importance of the different features is valued in terms of privacy based on the CPM model, these privacy levels of the features are used when defining the new criteria, so as to capture the extent of perturbation created in the process of eliminating irrelevant features.

The details of the calculation of the distortion amount of the FS-SSVD method are shown in Algorithm 2. As seen in the algorithm, first the array of privacy levels of the features is normalized. The normalization algorithm normalizes the matrix of privacy levels in a way that the total weight of all its elements is equal to one. Based on this normalization, if no feature is removed, the distortion value obtained in the FS-SSVD process becomes equal to the distortion obtained in the SSVD process. Details of the normalization process are shown in Algorithm 3.

Then, the total of the normalized privacy weights of the selected superior features (WSSF) and the total of the normalized privacy weights of the eliminated features (WSRF), which are used in determining the distortion amount of the proposed method, are calculated. Next, the amount of perturbation created in the perturbation process of the SSVD method is
Input: Initial data matrix A[n][m], intermediate matrix containing the selected superior features B[n][d], perturbed data matrix B̄[n][d], array of privacy levels of features W[m][1], array of removed features' indices RemIndex, array of selected features' indices SelIndex.
Output: Weighted Value Difference WVD, Weighted Rank Position WRP, Weighted Rank Maintenance WRK, Weighted Feature Rank Change WCP, Weighted Feature Rank Maintenance WCK.
Begin
1. Normalize the array of privacy levels of features: NW = Normalize the privacy levels of features (W).
2. Calculate the total of normalized privacy weights of the selected superior features: WSSF ← Σ_i NW(i) : ∀i ∈ SelIndex
3. Calculate the total of normalized privacy weights of the removed features: WSRF ← Σ_i NW(i) : ∀i ∈ RemIndex
4. Calculate the data distortion criteria of the SSVD method based on the matrices B and B̄: VD, RP, RK, CP, CK.
5. Calculate the data distortion criteria of FS-SSVD based on the matrices A and B̄: WVD, WRP, WRK, WCP, WCK.
End
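The normalization step (Algorithm 3, as described) and the two weight sums of Algorithm 2 can be sketched as follows; the weight values and index sets are hypothetical:

```python
import numpy as np

def normalize_privacy_levels(W):
    """Algorithm 3, as described: scale the privacy levels so that
    the weights of all features sum to one."""
    W = np.asarray(W, dtype=float)
    return W / W.sum()

# Hypothetical privacy levels for 5 features; features 1 and 3 selected.
W = [0.9, 0.5, 1.0, 0.2, 0.4]
sel_index = [1, 3]
rem_index = [0, 2, 4]

NW = normalize_privacy_levels(W)
WSSF = NW[sel_index].sum()   # total normalized weight of selected features
WSRF = NW[rem_index].sum()   # total normalized weight of removed features
# WSSF + WSRF == 1 by construction; if nothing is removed, WSSF == 1.
```

This is why FS-SSVD reduces to SSVD when no feature is removed: with `rem_index` empty, `WSRF` is 0 and the weighted criteria collapse to the unweighted ones.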
calculated based on the matrices B and B̄. Then, in the next stage, based on the matrices A, B and B̄, the amounts WSSF and WSRF, the amount of distortion obtained from the previous stage, and the new distortion criteria that will be introduced in the next section, the overall distortion of the FS-SSVD method is calculated.

The distortion criteria which will be used for assessing the perturbation created in the perturbation process of SSVD are described first, and then the new distortion criteria defined for the FS-SSVD method are presented.

Distortion Measures of SSVD Method

Xu, Zhang, Han, and Wang (2006) and Wang, Zhang, Xu, and Zhong (2008) have defined five criteria to evaluate the extent of distortion created by the SSVD method, which are: Value Difference (VD), Rank Position (RP), Rank Maintenance (RK), Feature Rank Change (CP) and Feature Rank Maintenance (CK). These criteria are defined based on the initial matrix B and the perturbed matrix B̄, which have the same dimensions. The values of VD, RP and CP correlate with the level of distortion. In contrast, the criteria RK and CK are inversely related to the distortion level (Wang, Zhang, Xu, & Zhong, 2008).

Value Difference (VD)

After a data perturbation, the change in the value of the dataset is represented by the relative value difference in the Frobenius norm. Thus VD is the ratio of the Frobenius norm of the difference of B from B̄ to the Frobenius norm of B:

VD = || B − B̄ || / || B ||    (5)

Rank Position (RP)

After a data distortion, the order of the values of the data elements changes, too. Then, in order to measure the difference of value positions, first the values of each feature must be sorted in ascending order, and then the RP and RK metrics are used to measure the position difference of the values.

RP is used to denote the average change of rank over all the features. After the elements of a feature are distorted, the rank of each element in the ascending order of its values changes. Assume that the dataset B has n data objects and d features. Rank_ij denotes the rank of the element B_ij, and (Rank_ij)* denotes the rank of the distorted element B̄_ij. Then RP is defined as:

RP = ( Σ_{i=1}^{m} Σ_{j=1}^{n} | Rank_ij − (Rank_ij)* | ) / (m × n)    (6)
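Equation (5) translates directly into code; a minimal sketch:

```python
import numpy as np

def value_difference(B, B_bar):
    """VD (Equation (5)): relative Frobenius-norm difference between
    the original matrix B and the perturbed matrix B_bar."""
    return np.linalg.norm(B - B_bar, "fro") / np.linalg.norm(B, "fro")

rng = np.random.default_rng(3)
B = rng.normal(size=(8, 4))
assert value_difference(B, B) == 0.0                  # no perturbation
assert value_difference(B, np.zeros_like(B)) == 1.0   # all value removed
```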
Rank Maintenance (RK)

RK = ( Σ_{i=1}^{m} Σ_{j=1}^{n} Rk_ij ) / (m × n)    (7)

Where Rk_ij indicates whether an element keeps its position in the order of values:

Rk_ij = { 1, if Rank_ij = (Rank_ij)*
          0, otherwise }    (8)

Distortion Measures of FS-SSVD Method

As suggested in this study, to evaluate the distortion amount created by using the FS-SSVD method, the new criteria WVD, WRP, WRK, WCP and WCK are used. Each of the above criteria, from different aspects, considers both the extent of distortion created in the process of eliminating the irrelevant features and the amount of distortion created in the perturbation process of SSVD.
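The rank-based measures RP and RK can be sketched as follows, ranking each feature's values in ascending order per column as the definitions require:

```python
import numpy as np

def rank_matrix(B):
    """Rank of each element within its feature (column), ascending."""
    return np.argsort(np.argsort(B, axis=0), axis=0)

def rank_position(B, B_bar):
    """RP: average absolute change of rank over all elements (Eq. (6))."""
    return np.abs(rank_matrix(B) - rank_matrix(B_bar)).mean()

def rank_maintenance(B, B_bar):
    """RK: fraction of elements that keep their rank (Eqs. (7)-(8))."""
    return (rank_matrix(B) == rank_matrix(B_bar)).mean()

rng = np.random.default_rng(4)
B = rng.normal(size=(6, 3))
B2 = B.copy()
B2[[0, 1], 0] = B2[[1, 0], 0]   # swap two values within feature 0
```

With identical matrices RP is 0 and RK is 1; the single swap above lowers RK and raises RP, since exactly two elements change rank.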
VD / WSSF = VDRF / WSRF  ⇒  VDRF = VD × WSRF / WSSF    (12)

Therefore, WVD is calculated as follows:

WVD = VD + VDRF
WVD = VD × (1 + WSRF / WSSF)    (13)

WRK = { RK,               if m = d
        RK × 1/(m − d),   otherwise }    (16)

Where m is the number of initial features in the matrix A, d is the number of selected features in matrix B, and RK is the average rank maintenance of the elements over all distorted features during the conversion of B to B̄, which is derived based on Equation (7).
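A sketch of the weighted criteria WVD and WRK as code; note that the `otherwise` branch of WRK is read from a fragmentary equation in the source and should be treated as an assumption:

```python
def weighted_value_difference(vd, wssf, wsrf):
    """WVD = VD * (1 + WSRF / WSSF), per Equation (13)."""
    return vd * (1.0 + wsrf / wssf)

def weighted_rank_maintenance(rk, m, d):
    """WRK per Equation (16) as read here: unchanged when no feature is
    removed (m == d), otherwise scaled down by 1 / (m - d). The scaling
    branch is an assumption recovered from a damaged formula."""
    return rk if m == d else rk / (m - d)

# With no removed features (WSRF = 0, m = d) both reduce to the SSVD values.
assert weighted_value_difference(0.5, 1.0, 0.0) == 0.5
assert weighted_rank_maintenance(0.8, 10, 10) == 0.8
```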
• NSF: number of selected superior features

In this study, for all tests, the default values of the sparsification parameters are set as follows: εU = 0.038 and εV = 0.02. In the following, it is described how the other parameters are set.

The Initial Value of the Privacy Importance of Features

As was discussed in a previous section, in this study a new model called CPM is proposed for the privacy of the features, based on which the data owners can determine a weight 0 ≤ w_i ≤ 1 for each feature according to their anxieties regarding the privacy protection of the features.

Based on this model, if one of the features has w_i = 1, it means that this feature is of high importance for the owner in terms of privacy protection; in contrast, the lower a feature's weight is below 1, the lower the importance of this feature for the owner. In this study, in order to run various tests and to define matrices of different levels of feature privacy, for every feature of the WDBC dataset, 10 random but uniformly distributed numbers in [0, 1] are generated. In fact, for each feature, 10 privacy weights are defined randomly.

The Initial Value of the Number of Selected Features

The parameter NSF denotes the number of superior features selected by the Information Gain algorithm. The NSF, or the number of selected features, has a significant impact on the accuracy of the mining results. A graph of the SVM classification accuracy for different values of the NSF parameter is depicted in Figure 3.

The mining accuracy of SVM on the WDBC dataset with all features is 98.067%. As can be seen in this figure, with the removal of some of the irrelevant features, the mining accuracy does not decrease, or decreases negligibly. For example, by eliminating the three irrelevant features with the least weight, the mining accuracy is 98.067%, and through removing eight irrelevant features, the accuracy decreases by about 0.1%. But the accuracy of the mining results does decrease when more than 8 irrelevant features are eliminated. In this study, to compare the proposed method, the amount of NSF is set equal to 22.

Evaluation of the Proposed Method of FS-SSVD with SSVD Method

To compare the proposed FS-SSVD method with other methods, tests for different distortion ranks k = 1 to 30 are done; for example, the results for k = 7 are shown in Table 1. In the experiments carried out, in accordance with the procedure shown in Table 1, the average amount of distortion and the accuracy of the data mining result in the FS-SSVD method are obtained for different ranks, so that for every k and for every set of privacy levels (each of 10 different random weight sets) the amount of distortion and the accuracy of the perturbed dataset are obtained and the average value for each criterion is calculated. In the following, the proposed FS-SSVD method and the SSVD method are compared on the basis of the distortion and accuracy criteria.

Comparison of Mining Accuracy

As can be seen in Figure 4, the mining accuracy of the FS-SSVD method has not decreased compared to the SSVD method; the accuracy has even increased for some distortion ranks, including 11, 14, 15, 16, 18, 19, 20, 21 and 22. An important reason for maintaining the mining accuracy in the proposed method is that it tries to eliminate the information which is not important for accessing the data mining result. On the other hand, since some of the removed information and features are noise and have a negative effect on the mining result, in many cases better results in terms of accuracy have been produced.

Comparison of the Distortion Amount of WVD

To compare the mining accuracy with the distortion extent simultaneously, a comparison graph of
Table 1. The result of the FS-SSVD method on the WDBC (569 × 22) dataset for different levels of privacy and k = 7
accuracy and WVD distortion extent is provided in Figure 5. In this figure, the two lower plots show the WVD distortion amount and the two upper plots show the accuracy of the FS-SSVD and SSVD methods. As the figure shows, the two upper curves almost overlap, so the accuracy of the two methods is nearly identical, while the distortion extent of the proposed FS-SSVD method has increased by about 0.2 compared to the SSVD method, which is quite significant given that accuracy is maintained.

Since, according to this graph, the WVD amount and the mining accuracy remain unchanged as the distortion rank k increases, we can conclude that in the FS-SSVD method the WVD amount and the accuracy of the mining result are independent of the degree of distortion. The best distortion rank, for which both the accuracy rate and the distortion extent are highest, is therefore k = 18, where the accuracy equals 93.84% and the distortion is 0.71136.
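The role of the distortion rank k can be illustrated with a minimal rank-k SVD perturbation sketch. This is a simplified stand-in, not the authors' implementation: the function name and the random WDBC-sized matrix are ours, and the sketch omits the sparsification step that SSVD adds on top of truncation.

```python
import numpy as np

def rank_k_perturb(X: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of X; the discarded trailing
    singular components act as the injected distortion."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

X = np.random.default_rng(1).normal(size=(569, 22))  # WDBC-sized stand-in
X_k = rank_k_perturb(X, k=18)
err18 = np.linalg.norm(X - X_k)
err7 = np.linalg.norm(X - rank_k_perturb(X, 7))
print(err7 > err18)  # True: a lower rank discards more singular components
```

Shrinking k discards more trailing singular components, so distortion grows as the rank drops; the experiments above sweep k to find the rank where accuracy and distortion are jointly best.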
Figure 4. Comparison of accuracy of the SSVD and FS-SSVD methods for different ranks
Figure 5. Comparison of accuracy and distortion of the SSVD and FS-SSVD methods for different ranks
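The Information Gain ranking behind the NSF selection step can be sketched as follows. This is a self-contained illustration under simplifying assumptions, not the paper's exact procedure: continuous features are discretized by equal-width binning, the data are a synthetic stand-in for WDBC, and the function names are ours.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """IG of one continuous feature after equal-width binning."""
    binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    gain = entropy(y)
    for b in np.unique(binned):
        mask = binned == b
        gain -= mask.mean() * entropy(y[mask])
    return gain

def select_top_nsf(X: np.ndarray, y: np.ndarray, nsf: int) -> np.ndarray:
    """Indices of the NSF features with the highest information gain."""
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(gains)[::-1][:nsf]

# Synthetic stand-in for a labeled dataset; feature 3 is made informative.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 25))
X[:, 3] += 2.0 * y
top = select_top_nsf(X, y, nsf=22)
print(3 in top)  # the informative feature survives selection
```

Features with low gain carry little class-relevant information, which is why dropping them before perturbation tends to leave the mining accuracy in Figure 4 intact.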
Figure 6. Comparison of WRP of the SSVD and FS-SSVD methods for different ranks
Figure 7. Comparison of WRK in the SSVD and FS-SSVD methods for different ranks
Figure 8. Comparison of WCP in the SSVD and FS-SSVD methods for different ranks
Figure 9. Comparison of WCK in the SSVD and FS-SSVD methods for different ranks
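Figures 6 through 9 plot the weighted rank-based criteria (WRP, WRK, WCP, WCK) whose exact definitions appear earlier in the paper. As a rough illustration of the idea only, a WRK-like criterion (the privacy-weighted fraction of elements that keep their within-attribute rank) could be sketched as below; the formula is our plausible reading, not the paper's verbatim definition.

```python
import numpy as np

def column_ranks(X: np.ndarray) -> np.ndarray:
    """Rank (0 = smallest) of each element within its own column."""
    return np.argsort(np.argsort(X, axis=0), axis=0)

def weighted_rank_keep(X: np.ndarray, X_pert: np.ndarray,
                       w: np.ndarray) -> float:
    """Privacy-weight-averaged fraction of elements that keep their
    within-attribute rank after perturbation (a WRK-like criterion)."""
    keep = (column_ranks(X) == column_ranks(X_pert)).mean(axis=0)
    return float((w * keep).sum() / w.sum())

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
w = rng.uniform(0.0, 1.0, size=5)          # per-feature privacy weights
noisy = X + rng.normal(scale=5.0, size=X.shape)
print(weighted_rank_keep(X, X.copy(), w))  # 1.0: nothing perturbed
print(weighted_rank_keep(X, noisy, w))     # < 1.0 once ranks shuffle
```

Weighting by w makes rank changes in high-privacy features count more, which is the point of the customized-privacy versions of these criteria.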
Table 2. Comparison of Thin-SVD and SSVD algorithm execution times on the WDBC dataset with different dimensions (in seconds)

DB                 Thin-SVD    SSVD
WDBC (569 × 30)    3.6431      3.7653
WDBC (569 × 22)    2.0914      1.9835
WDBC (569 × 11)    0.1623      0.1753
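The qualitative trend in Table 2, execution time falling as the feature count shrinks, can be reproduced with a quick timing sketch. The matrix sizes mirror the table, but the absolute times are hardware-dependent and will not match the reported values.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
times = {}
for n_features in (30, 22, 11):
    X = rng.normal(size=(569, n_features))
    t0 = time.perf_counter()
    for _ in range(200):  # repeat so the duration is measurable
        np.linalg.svd(X, full_matrices=False)
    times[n_features] = time.perf_counter() - t0
print({k: round(v, 4) for k, v in times.items()})
```

For a tall matrix, the SVD cost scales roughly with the square of the column count, which is why pre-selecting features speeds up the perturbation step so markedly.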
To evaluate the execution time of the proposed method FS-SSVD, the running times of the Thin-SVD and SSVD algorithms on the original matrix and on the matrices derived from the feature selection methods, with different numbers of features and for all ranks of the input matrix, are compared; the results of this evaluation are shown in Table 2. As the results show, the running time of the SSVD algorithm depends on the number of features of the input matrix: the fewer the features, the shorter the execution time of the algorithm. For example, when the dimensions of the input matrix shrink to roughly one third (30 features down to 11), the algorithm execution time is reduced by a factor of about 22.

On the other hand, since the computational cost of learning algorithms often depends on the number of input features, removing irrelevant features improves the computation time of the learning algorithm. Accordingly, the CPU times needed to run four learning algorithms (SVM, NaiveBayes, J48 and DecisionTable) on the original data matrix, on the matrix perturbed by the SSVD method and on the matrix perturbed by the proposed FS-SSVD method are compared with each other; the results of this evaluation are shown in Table 3.

As can be seen in the results of this table, applying the SSVD perturbation method reduces the execution time of the learning algorithms. In the proposed FS-SSVD method, the computational cost decreases further, depending on the number of eliminated features.

Overall Evaluation of the Proposed Method of FS-SSVD with Other Methods

In this section, the FS-SSVD method is compared with the two previous methods in this area, Thin-SVD and SSVD. To compare the above methods, the averages of the distortion criteria and the mining accuracy over all distortion ranks are compared. Also, so that the privacy criteria are compared correctly, the basic parameters are set as follows:

• Sparsification values: εU = 0.038 and εV = 0.02
• Number of selected features: NSF = 22 in the FS-SSVD method
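The sparsification step of SSVD can be sketched as follows, using the εU and εV threshold values listed above. This is a simplified reading of selective SSVD (Wang et al., 2008), not the authors' exact implementation; the function names and the random test matrix are ours.

```python
import numpy as np

def ssvd_perturb(X: np.ndarray, k: int,
                 eps_u: float = 0.038, eps_v: float = 0.02) -> np.ndarray:
    """Rank-k SVD reconstruction after zeroing small factor entries:
    |U| entries below eps_u and |V| entries below eps_v are dropped,
    injecting distortion beyond plain rank-k truncation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, Vtk = U[:, :k].copy(), Vt[:k, :].copy()
    Uk[np.abs(Uk) < eps_u] = 0.0
    Vtk[np.abs(Vtk) < eps_v] = 0.0
    return (Uk * s[:k]) @ Vtk

def rank_k(X: np.ndarray, k: int) -> np.ndarray:
    """Plain best rank-k truncation, for comparison."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

X = np.random.default_rng(3).normal(size=(569, 22))  # WDBC-sized stand-in
# By Eckart-Young, the sparsified rank-k matrix can only be farther
# from X than the optimal rank-k truncation.
print(np.linalg.norm(X - ssvd_perturb(X, 7))
      >= np.linalg.norm(X - rank_k(X, 7)))
```

Raising εU or εV zeroes more factor entries and therefore increases distortion, which is how the sparsification thresholds trade privacy against utility.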
Table 3. Comparison of the execution times of the learning algorithms under the different SVD-based methods (in seconds)
The results of this assessment are given in Table 4. To ease comparison of the above methods, the comparison diagram of the WVD distortion value is shown in Figure 10 and the comparison diagram of SVM mining accuracy is shown in Figure 11.

As can be seen in Figure 10, the WVD (or VD) distortion extent of the Thin-SVD method has the least amount in comparison with the other
Figure 10. Comparison of average WVD of FS-SSVD method with methods of Thin-SVD and SSVD
Figure 11. Comparison of average mining accuracy of FS-SSVD method with methods of Thin-SVD and SSVD
methods and is insignificant. The SSVD method was therefore proposed to distort the values further and improve privacy preservation. Accordingly, as shown in Figure 11, in the SSVD method the distortion extent increases considerably, by about 0.5, along with a reduction of about 4% in the accuracy of the data mining result.

According to Figure 10, the distortion amount of FS-SSVD has increased by about 0.2 compared to SSVD, while, as can be seen in Figure 11, the mining accuracy remains unchanged. In fact, since the proposed FS-SSVD method deletes information irrelevant to the data mining task in order to preserve privacy, access to valid data mining results is not harmed, and the deletion may even improve the accuracy of data mining.

To compare the accuracy and the privacy of the proposed method simultaneously, a comparison chart of the three methods is shown in Figure 12. Based on the graphs presented in this figure, the FS-SSVD method provides the highest level of privacy protection and data utility compared to the Thin-SVD and SSVD methods. We can therefore conclude that, for classification tasks, eliminating irrelevant features before perturbation can improve both data privacy and the accuracy of the mining results.

Results and Future Works

With respect to the efforts made to remove the barriers facing privacy preserving data mining methods, this study provided a customized privacy model called CPM, based on which data owners can determine the value and degree of importance of each feature in terms of privacy preservation according to their own concerns and views. This study also presented the FS-SSVD perturbation method based on the CPM privacy model.

The FS-SSVD perturbation method is based on the combined idea of SSVD and the Information Gain feature selection method; it increases the amount of privacy preservation without negatively affecting data utility. In this proposed method, to improve privacy, the features irrelevant to supervised learning tasks are eliminated using feature selection methods before the SSVD perturbation method is applied. Also, to evaluate the amount of data distortion, new distortion criteria were introduced in which the perturbation created in the process of eliminating irrelevant features is accounted for according to the privacy value of each feature. The elimination of information and features irrelevant to the data mining task in the proposed FS-SSVD method has many benefits:
Figure 12. Simultaneous comparison of accuracy and distortion amount of FS-SSVD method
with methods of Thin-SVD and SSVD
Keyvanpour, M., & Seifi Moradi, S. (2010). Classification and evaluation the privacy preserving data mining techniques by using a data modification-based framework. International Journal on Computer Science and Engineering, 3(2).

Keyvanpour, M. R., Javadieh, M., & Ebrahimi, M. R. (2011). Detecting and investigating crime by means of data mining: A general crime matching framework. Journal of Procedia Computer Science, 3, 872–880. doi:10.1016/j.procs.2010.12.143

Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance.

Liu, K., Giannella, C., & Kargupta, H. (2006). An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principle and Practice of Knowledge Discovery in Databases (pp. 297-308).

Liu, L., Kantarcioglu, M., & Thuraisingham, B. (2008). The applicability of the perturbation based privacy preserving data mining for real-world data. International Journal of Data & Knowledge Engineering, 65(1), 5–21. doi:10.1016/j.datak.2007.06.011

Polat, H., & Du, W. (2005). SVD-based collaborative filtering with privacy. In Proceedings of the 20th ACM Symposium on Applied Computing, Track on E-commerce Technologies, Santa Fe, NM.

Rakotomalala, R., & Mhamdi, F. (2006). Combining feature selection and feature reduction for protein classification. In Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal (pp. 444-451).

Seifi Moradi, S., & Keyvanpour, M. R. (2012). Classification and evaluation the privacy preserving distributed data mining techniques. Journal of Theoretical and Applied Information Technology, 37(2), 204–210.

Vaidya, J., & Clifton, C. (2004). Privacy-preserving data mining: Why, how and when. IEEE Security and Privacy, 2(6), 19–27. doi:10.1109/MSP.2004.108

Wahlstrom, K., Roddick, J. F., Sarre, R., Estivill-Castro, V., & de Vries, D. (2009). Legal and technical issues of privacy preservation in data mining. Encyclopedia of Data Warehousing and Mining, 1158-1163.

Wang, J., Zhang, J., Xu, S., & Zhong, W. (2008). A novel data distortion approach via selective SSVD for privacy protection. International Journal of Information and Computer Security, 2(1), 48–70. doi:10.1504/IJICS.2008.016821

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementations (2nd ed.). Morgan Kaufmann Publishers.

Xu, S., Zhang, J., Han, D., & Wang, J. (2005). Data distortion for privacy protection in a terrorist analysis system. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Atlanta, GA (pp. 459-464).

Xu, S., Zhang, J., Han, D., & Wang, J. (2006). Singular value decomposition based data distortion strategy for privacy protection. Journal of Knowledge and Information Systems, 10(3), 383–397. doi:10.1007/s10115-006-0001-2

Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, Washington DC.

ENDNOTES

1. UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/mlre