International Journal of Data Warehousing and Mining, 10(1), 55-76, January-March 2014 55

A Perturbation Method Based on Singular Value Decomposition and Feature Selection for Privacy Preserving Data Mining

Mohammad Reza Keyvanpour, Department of Computer Engineering, Alzahra University, Tehran, Iran
Somayyeh Seifi Moradi, Department of Information and Communication Technology, Ports and Maritime Organization, Tehran, Iran

ABSTRACT
In this study, a new model is provided for customized privacy in privacy preserving data mining, in which data owners define different levels of privacy for different features. Additionally, in order to improve perturbation methods, a method combining singular value decomposition (SVD) and feature selection is defined so as to benefit from the advantages of both domains. Also, to assess the amount of distortion created by the proposed perturbation method, new distortion criteria are defined in which the distortion created in the feature selection process is weighted by the privacy value of each feature. Different tests and analysis of the results show that, compared to previous approaches, the method based on this model improves privacy, the accuracy of mining results, and the efficiency of privacy preserving data mining systems.

Keywords: Customized Privacy, Data Owners, Perturbation, Privacy Preserving Data Mining, Singular Value Decomposition (SVD)

INTRODUCTION

Data mining or knowledge discovery is a process that analyzes voluminous digital data in order to discover hidden but effective patterns from digital data (Ashrafi, Taniar, & Smith, 2005). In other words, it is a powerful tool for data analysis that, with the goal of accurate and efficient identification of hidden and valuable patterns in the data, can facilitate the process of decision making, improve the allocation of resources, reduce costs, and support the exploitation of opportunities. Data mining is best described as the union of historical and recent developments in statistics, artificial intelligence, and machine learning. These methods are then used together to study information and find previously hidden trends or patterns within (Daly & Taniar, 2004). Data mining applications have profoundly altered the strategic decision-making procedures of organizations (Tjioe & Taniar, 2005). Hence, the various applications

DOI: 10.4018/ijdwm.2014010104

Copyright © 2014, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
of this scope are used by various governmental, industrial, commercial, medical, financial, and scientific sectors due to several advantages. In fact, the wide range of data mining applications has made it an important field of research (Keyvanpour, Javadieh, & Ebrahimi, 2011).

As privacy is an issue of individual perception, an infallible and general solution to this dichotomy is infeasible. However, there are measures that can be undertaken to raise privacy protection (Wahlstrom, Roddick, Sarre, Estivill-Castro, & de Vries, 2009). Accordingly, in recent years, due to increasing concerns related to privacy, data mining methods have faced a serious challenge, which is to preserve the privacy of sensitive data. Data mining is under attack from privacy advocates because of a misunderstanding about what it really is and a credible concern about how it is generally done (Vaidya & Clifton, 2004). Organizations should, on the one hand, publish their customized information so as to access the benefits of data mining, and on the other hand are unwilling to share their data in order to preserve privacy. The occurrence of such problems in data collection can be undesirable for the success of data mining methods in achieving their goals (Seifi & Keyvanpour, 2012).

Hence, a new aspect in the development of data mining concerns the approaches that address privacy, in particular the issue of whether data mining methods can produce accurate models, and valid mining results, without access to the precise information in the given records (Clifton, Kantarcioglu, & Vaidya, 2002). In response to such anxieties, data mining researchers started to work on methods that preserve privacy along with data mining. As a result of this research, various approaches to privacy preserving data mining (PPDM) have been defined.

Data modification is one of the most popular approaches to privacy preserving data mining, especially for applications that require data owners to publish their personal and sensitive data. In this approach, the data are changed through certain methods prior to publication so as to hide sensitive information (Keyvanpour & Seifi, 2010).

Approaches based on data modification usually have good computational efficiency but offer few guarantees of preserving privacy, and they balance with difficulty between ensuring privacy and data utility (the important information and patterns existing in the data which should be preserved during data modification so that the accuracy of the data mining results remains at an acceptable level). As a result, the main challenge of data modification based methods is to strike a good and fair balance between privacy and data utility (Liu, Giannella, & Kargupta, 2006).

Recently, one of the most effective approaches to meet the challenges in privacy preserving data mining is the use of methods based on dimension reduction. These methods operate on the idea that they first identify worthless information in the dataset and then eliminate these worthless data, so that the dataset is perturbed. Moreover, since in data mining applications the eliminated parts are considered as noise, in many cases the use of these methods can produce better results in terms of accuracy compared to mining on the original dataset (Xu, Zhang, Han, & Wang, 2006). One of the dimension reduction based methods used in PPDM is the Singular Value Decomposition (SVD) method (Keyvanpour & Seifi, 2010).

Generally, there are two approaches in the dimension reduction area: the feature extraction approach and the feature selection approach. Feature extraction strategies are procedures which produce new features based on transformations or combinations of the original feature set. On the other side, feature selection methods are those that determine the best subset of features out of the feature set based on certain criteria (Estévez, Tesmer, Perez & Zurada, 2009).

Singular value decomposition (SVD) is also a popular feature extraction method. One of the major challenges in using this approach in the supervised learning area is that some of the newly created features may be unrelated to the supervised learning task (Rakotomalala


& Mhamdi, 2006). In fact, SVD is an unsupervised method and therefore, to create new features, all features are used whether they are related to the supervised learning task or not. To solve this problem, this research tries to achieve better results in the area of supervised learning by combining feature selection strategies with the SVD method. Therefore, before applying the SVD method, a feature selection step is added. At this stage, only features associated with the data mining task are chosen.

Another advantage of combining SVD and feature selection is the significant reduction in computational cost. The reason is that one of the main challenges of the SVD method is its high computational cost on large-scale data matrices (Wang, Zhang, Xu, & Zhong, 2008). So by removing features irrelevant to the data mining task before applying the SVD method, the computational cost can be reduced. Also, by removing irrelevant features, which may nevertheless be very important in terms of privacy, privacy can be improved as well. Overall, the benefits of combining feature selection and SVD methods include: 1) increasing privacy protection, 2) improving data mining results, and 3) reducing the computational cost.

In addition, another major challenge of privacy preserving data mining methods based on data modification is that they assume the same level of privacy for all features, while privacy is a social concept and there may be different concerns regarding privacy for distinct features (Liu, Kantarcioglu, & Thuraisingham, 2008). For example, in a medical dataset, features such as age and illness are much more important than the patient's zip code.

Also, the privacy level of each feature can vary from one dataset to another based on the application of data mining and the sensitiveness of the private information. In addition, the owners' beliefs about the computational power and background knowledge of adversaries can affect their privacy expectations (Kamalika, 2009). Thus, the possibility of defining personal privacy in PPDM is of utmost importance. Based on this, another goal of this research is to provide a privacy model that allows the definition of different levels of privacy for each feature. The customized privacy model, which is the basis of the proposed perturbation model in this study, is determined by the data owner based on his or her privacy requirements.

One of the major advantages of this customized privacy model is the reduction in cost, because it is usually costly for owners to preserve privacy, whether as the cost of executing calculations or as the reduction in the quality or utility of the data needed to access the mining results (Kamalika, 2009). So data owners, based on their own resources, attempt to minimize the cost of privacy protection through the definition of different levels of privacy for each feature, while maximizing privacy and data utility.

The aim of this study is to provide a customized privacy model based on the data owner's opinions. Based on this model, data owners can determine the value and the significance of each feature in terms of privacy protection, according to their concerns and attitudes. Also based on this privacy model, a hybrid model is provided, based on a feature selection method and the SVD perturbation method, in the area of privacy preserving data mining. This hybrid model is able to resolve challenges such as high computational costs, low data utility, and the low privacy guarantees of previous methods.

The Overall Function of Privacy Preserving Data Mining System Based on the Proposed Method

In this part, the overall function of the privacy preserving data mining system based on the proposed method is described. The input of the proposed model is a dataset that contains confidential information of individuals or organizations. In the proposed method, it is assumed that the data collection is a numeric matrix that contains a set of records with numerical features and a categorical feature (the class label). Also, the feature values are continuous, from the domain of real numbers. The output of


the proposed model is a perturbed dataset that is published by the owner to undergo data mining.

As can be seen in Figure 1, the proposed perturbation method has two main phases: the features valuation phase and the data perturbation phase. In the features valuation phase, which is one of the important parts of the proposed method, all the features of the original matrix are valued according to their degree of importance in terms of privacy for the data owner. In the next phase of the model, the operations of perturbation and data distortion are performed and, based on them, a perturbed dataset is produced as the system output and used for data mining.

The main purpose of the proposed perturbation method in this research is to maximize the difference between the original dataset and the perturbed dataset, and also to minimize the difference between the data mining results on the initial dataset and on the perturbed dataset. In the following sections, each phase of the proposed privacy preserving data mining system is described.

The Features Valuation Phase

The valuation phase of the features is one of the most important parts of the proposed model. The output of this step is a matrix in which each feature is assigned a weight. The weights in the matrix indicate the importance of privacy preservation of each feature for the owner. With respect to the existing demands for defining a personal privacy model, in this research a model called the Customized Privacy Model (CPM) is provided to designate values for the features; it is explained thoroughly in the next section.

Proposed Privacy Model of CPM

The CPM privacy model is proposed based on the idea that the data owner may show different concerns regarding privacy in relation to different features, so the privacy model should be defined based on the owner's comments.

This model assumes that the highest level of privacy protection that the existing perturbation methods provide is equal to 1. Accordingly, in this model, the range of the importance of privacy of the features is defined as [0, 1].

Figure 1. Basic architecture of proposed privacy preserving data mining system


That is, the highest degree of importance of a feature in terms of privacy preservation can be 1, and the lowest degree of importance can be 0, in which case the feature is not important for the owner in terms of privacy.

In the CPM model, the data owner, based on the degree of importance of each feature's privacy, and also to create a balance between the accuracy of the mining results and the amount of privacy preservation, assigns a weight in [0, 1] to each feature. In this case, the matrix of privacy levels is:

W = {α₁, α₂, …, αₘ} : 0 ≤ αᵢ ≤ 1    (1)

Data Perturbation Phase

The perturbation phase is the most important part of the proposed method. In this research, the Sparse Singular Value Decomposition (SSVD) method is used as the basic distortion method (Xu, Zhang, Han, & Wang, 2006; Wang, Zhang, Xu, & Zhong, 2008; Xu, Zhang, Han, & Wang, 2005). As was discussed in the first section, with respect to the inherent nature of methods based on matrix decomposition and methods based on feature selection, these two approaches to privacy preserving data mining are able to reinforce each other and are somehow complementary. The reason is that the use of feature selection methods before the SVD perturbation method leads to the removal of features irrelevant to the data mining task. The removal of such features improves data mining results and reduces the computational cost of the SVD method; most important of all, if the removed features possess higher importance in terms of privacy, the amount of privacy protection of the given dataset also increases significantly. Accordingly, in this study a new method is proposed, based on combining a feature selection method with the SSVD method, which is briefly called FS-SSVD.

The Proposed FS-SSVD Method

The overall operation of the FS-SSVD method is that first the superior features of the initial data matrix are selected using feature selection methods. In this method,
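As a small illustration, the CPM weight vector of Equation (1) can be sketched as a mapping from feature names to privacy levels. The feature names and the weight values below are hypothetical, echoing the medical-dataset example from the introduction:

```python
# Hypothetical CPM privacy-weight vector per Equation (1); the feature
# names and alpha values are illustrative, not taken from the paper.
W = {"age": 0.9, "illness": 1.0, "zip_code": 0.2}

# Every weight must lie in [0, 1]: 1 requests the strongest protection the
# perturbation method can provide, 0 marks a feature the owner ignores.
assert all(0.0 <= alpha <= 1.0 for alpha in W.values())

most_sensitive = max(W, key=W.get)  # feature with the highest privacy level
```

Here the owner asks for near-maximal protection of age and illness while leaving the zip code nearly unprotected, trading utility for privacy only where it matters.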

Figure 2. The proposed framework of FS-SSVD method


the aim is to choose features that are associated with supervised learning. In the next step, by applying the SSVD method, the matrix containing the selected features is perturbed. The general framework of the FS-SSVD approach is shown in Figure 2 and its details are shown in Algorithm 1.

As can be seen in the algorithm, first the NSF superior features of matrix A are selected through a feature selection algorithm and, based on this, matrix B including the selected features and two arrays containing the removed feature indices (RemIndex) and the selected feature indices (SelIndex) are produced. The perturbed matrix Bk is produced by using the SSVD method with rank k. After the perturbation process, the amount of created distortion is calculated.

In the proposed method, new criteria are used to calculate the amount of distortion created in the perturbation process of FS-SSVD. At this stage, after applying the SSVD perturbation method, a learning algorithm is used to assess the data utility. In fact, since choosing a suitable rank k in the SVD method that leads to optimal performance of data mining algorithms is still an open topic, different tests are used to investigate the impact of the SVD rank on the accuracy of a learning algorithm.

Next, the process of superior feature selection in the FS-SSVD method is discussed. Finally, the new criteria, which are created and used to evaluate the distortion created by the above method, are precisely defined. In order to define the new distortion measures, the privacy levels of the features defined in the CPM model are used.

The Process of Superior Features Selection

In this process, the features of the dataset that are most relevant to supervised learning are selected. In fact, the proposed method tries to improve the efficiency of the perturbation method in terms of computation, privacy, and accuracy of mining results by eliminating features irrelevant to the target data before distorting the data.

Feature selection methods aim to select the features which have the highest relation to the data mining goals, while the irrelevant data are removed; this decreases the size of the dataset, lowers the computational cost, and gives access to better accuracy of the data mining results. Feature selection algorithms can be classified into two general categories: the filter model and the wrapper model (John, Kohavi & Pfleger, 1994).

Filter models, independent of the learning algorithms and based on general characteristics of the training data, select a subset of features as a preprocessing step. Wrapper models use classifiers (learning machines) to evaluate the merit of and choose a subset of the features. Since the wrapper model requires learning a particular classifier for each new set of features, it chooses features which have high predictive power, estimated based on the accuracy of the specific learning algorithm. On the other hand, this model is computationally very expensive and thus of lesser generality (Langley, 1994). Accordingly, in this research, feature selection methods based on the filter model are used.

In the filter model, the feature selection methods, based on whether they evaluate the goodness of the features individually or as a subset, are classified into two groups: feature ranking and subset search algorithms. In feature ranking methods, each feature is independently assigned a weight based on its individual merits (such as relevance, entropy, or information content), and then the features with the highest ranking are selected. These methods are computationally efficient, but one of their main disadvantages is that they cannot remove duplicate or redundant information.

In contrast, in the subset search methods, all possible subsets of the features are searched and the quality of each subset is assessed using assessment criteria. However, this method may be computationally very expensive for problems with high dimensions, because the size of the search space corresponds to the number


Algorithm 1. FS-SSVD based perturbation method

Input: Initial data matrix A[n][m], class labels C[n][1], learning algorithm L, feature selection algorithm FS, number of selected features NSF, array of privacy levels of features W[m][1], sparsification parameters εV, εU.
Output: Perturbed data matrix B̄[n][d] : d ≤ m.
Begin
     1. Select the superior features of A by feature selection algorithm FS: {B, RemIndex, SelIndex} = FS(A, C, NSF);
     2. Determine the rank of matrix B: r = Rank(B);
     3. for k = 1 to r do
     4.      Apply the SSVD decomposition on B to compute a perturbed dataset: Bk = SSVD(B, k, εV, εU);
     5.      Calculate the data distortion criteria on Bk: {WVD, WRP, WRK, WCP, WCK} = Calculate distortion value of FS-SSVD method (A, B, Bk, W, RemIndex, SelIndex);
     6.      Calculate the mining accuracy on Bk by learning algorithm L;
     7.      k = k + 1;
     8. end for
     9. Select one Bk as the final perturbed dataset B̄.
End
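A minimal numerical sketch of the steps above, under stated assumptions: the feature-selection step FS is reduced to picking the NSF columns with the highest pre-computed relevance scores, and the SSVD step is approximated by a rank-k truncated SVD whose factor entries below the sparsification thresholds are zeroed. The function and variable names are ours, not the paper's:

```python
import numpy as np

def fs_ssvd(A, scores, nsf, k, eps_u=1e-3, eps_v=1e-3):
    """Feature selection followed by a sparsified rank-k SVD perturbation."""
    sel = np.argsort(scores)[::-1][:nsf]             # SelIndex: top-NSF features
    rem = np.setdiff1d(np.arange(A.shape[1]), sel)   # RemIndex: removed features
    B = A[:, sel]                                    # matrix of selected features
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Uk = np.where(np.abs(Uk) < eps_u, 0.0, Uk)       # sparsify U (threshold eps_u)
    Vtk = np.where(np.abs(Vtk) < eps_v, 0.0, Vtk)    # sparsify V (threshold eps_v)
    Bk = Uk @ np.diag(sk) @ Vtk                      # perturbed matrix for this rank
    return Bk, sel, rem

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7])         # hypothetical relevance scores
Bk, sel, rem = fs_ssvd(A, scores, nsf=3, k=2)
```

In the full algorithm this call would sit inside the loop over k = 1 … rank(B), with the distortion criteria and the mining accuracy evaluated for each Bk before one rank is chosen as B̄.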

of all possible combinations of the features (Yu & Liu, 2003).

In the following, the supervised feature selection algorithm used in this study, based on the ranking scheme of information gain, is described.

Information Gain Algorithm

Information gain (Cover & Thomas, 1991) is a dependence criterion between features and class labels. Because this algorithm is computationally simple and easy to understand, it is one of the most popular feature selection methods. The information gain (IG) of a feature X with respect to the class labels Y is defined as follows:

IG(X, Y) = H(X) − H(X | Y)    (2)

Entropy (H) is a measure of the uncertainty of a random variable. H(X) is the entropy of X, and H(X | Y) is the entropy of X after observing Y:

H(X) = − Σᵢ P(xᵢ) log₂(P(xᵢ))    (3)

H(X | Y) = − Σⱼ P(yⱼ) Σᵢ P(xᵢ | yⱼ) log₂(P(xᵢ | yⱼ))    (4)

The maximum amount of information gain is 1. In this algorithm, the features which have high information gain are identified as related. The algorithm evaluates each feature independently and, in the end, the NSF features with the highest values are selected as the related features.

The Process of Distortion Value Calculation

After applying the perturbation method, a series of distortion criteria is used to evaluate the created distortion. These criteria should be able to clearly determine how far the original value of a given element can be accurately approximated from its distorted data


(Polat & Du, 2005). In this study, new measures are used to calculate the amount of distortion created by the FS-SSVD method. The distortion measures provided for SSVD methods in Xu, Zhang, Han, and Wang (2006), Wang, Zhang, Xu, and Zhong (2008), and Xu, Zhang, Han, and Wang (2005) calculate the average amount of distortion based on the initial matrix B and the perturbed matrix B̄, which have the same dimensions. Moreover, in these criteria, the privacy value of the features is not considered.

But some of the features are removed in the FS-SSVD method during the selection of the superior features. Thus, the initial matrix (A) of this method does not have the same dimensions as its perturbed matrix (B̄), as required to calculate the distortion amount based on the SSVD criteria. It is therefore necessary to change the distortion criteria in a way that considers the amount of perturbation created in the process of eliminating irrelevant features. To do so, since in this research the degree of importance of the different features is valued in terms of privacy based on the CPM model, these privacy levels of the features are used when defining the new criteria so as to account for the extent of perturbation created in the process of eliminating irrelevant features.

The details of the calculation of the distortion amount of the FS-SSVD method are shown in Algorithm 2. As seen in the algorithm, first the array of privacy levels of the features is normalized. The normalization algorithm normalizes the matrix of privacy levels of the features in such a way that the total weight of all its elements is equal to one. Based on this normalization, if no feature is removed, the distortion value obtained in the FS-SSVD process becomes equal to the distortion obtained in the SSVD process. The details of the normalization process are shown in Algorithm 3.

Then, the total of the normalized privacy weights of the selected superior features (WSSF) and the total of the normalized privacy weights of the eliminated features (WSRF), which are used in determining the distortion amount of the proposed method, are calculated. Next, the amount of perturbation created in the perturbation process of the SSVD method is

Algorithm 2. Calculate distortion value of FS-SSVD method

Input: Initial data matrix A[n][m], intermediate matrix containing the selected superior features B[n][d], perturbed data matrix B̄[n][d], array of privacy levels of features W[m][1], array of removed feature indices RemIndex, array of selected feature indices SelIndex.
Output: Weighted Value Difference WVD, Weighted Rank Position WRP, Weighted Rank Maintenance WRK, Weighted Feature Rank Change WCP, Weighted Feature Rank Maintenance WCK.
Begin
     1. Normalize the array of privacy levels of features: NW = Normalize the privacy levels of features (W).
     2. Calculate the total of normalized privacy weights of the selected superior features: WSSF ← Σᵢ NW(i) : ∀i ∈ SelIndex
     3. Calculate the total of normalized privacy weights of the removed features: WSRF ← Σᵢ NW(i) : ∀i ∈ RemIndex
     4. Calculate the data distortion criteria of the SSVD method based on the matrices B and B̄: VD, RP, RK, CP, CK.
     5. Calculate the data distortion criteria of FS-SSVD based on the matrices A, B and B̄: WVD, WRP, WRK, WCP, WCK.
End


Algorithm 3. Normalize the privacy levels of features

Input: Array of privacy levels of features W[m][1].
Output: Normalized array of privacy levels of features NW[m][1].
Begin
     1. Calculate the total of privacy weights of all features: sum ← Σᵢ W(i) : 1 ≤ i ≤ m
     2. for i = 1 to m do
     3.      NW(i) ← W(i) / sum
     4.      i = i + 1;
     5. end for
End
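Algorithm 3, together with steps 2-3 of Algorithm 2, amounts to a few lines of code; the numeric weights and index sets below are illustrative:

```python
# Algorithm 3: scale the privacy levels so that they sum to one.
def normalize_privacy_levels(W):
    total = sum(W)
    return [w / total for w in W]

W = [0.9, 1.0, 0.2, 0.5]                  # hypothetical privacy levels
NW = normalize_privacy_levels(W)

# Steps 2-3 of Algorithm 2: total normalized weights of the selected
# (WSSF) and removed (WSRF) features.
sel_index, rem_index = [0, 1], [2, 3]
WSSF = sum(NW[i] for i in sel_index)
WSRF = sum(NW[i] for i in rem_index)
```

By construction WSSF + WSRF = 1, and when no feature is removed WSRF = 0, which is what makes the FS-SSVD distortion reduce to the plain SSVD distortion in that case.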

calculated based on the matrices B and B̄. Then, in the next stage, based on the matrices A, B and B̄, the amounts WSSF and WSRF, the distortion obtained in the previous stage, and the new distortion criteria introduced in the next section, the amount of overall distortion of the FS-SSVD method is calculated.

The distortion criteria used for assessing the perturbation created in the perturbation process of SSVD are described first, and then the new distortion criteria defined for the FS-SSVD method are presented.

Distortion Measures of SSVD Method

Xu, Zhang, Han, and Wang (2006) and Wang, Zhang, Xu, and Zhong (2008) have defined five criteria to evaluate the extent of the distortion created by the SSVD method, which are: Value Difference (VD), Rank Position (RP), Rank Maintenance (RK), Feature Rank Change (CP) and Feature Rank Maintenance (CK). These criteria are defined based on the initial matrix B and the perturbed matrix B̄, which have the same dimensions. The values of VD, RP and CP correlate with the level of distortion. In contrast, the criteria RK and CK are inversely related to the distortion level (Wang, Zhang, Xu, & Zhong, 2008).

Value Difference (VD)

After a data matrix is distorted, the values of its elements change. The value difference (VD) of the dataset is represented by the relative value difference in the Frobenius norm. Thus VD is the ratio of the Frobenius norm of the difference of B from B̄ to the Frobenius norm of B:

VD = ‖B − B̄‖ / ‖B‖    (5)

Rank Position (RP)

After a data distortion, the order of the values of the data elements changes, too. Then, in order to measure the difference of value positions, the values of each feature must first be sorted in ascending order, and then the RP and RK metrics are used to measure the position difference of the values.

RP is used to denote the average change of rank over all the features. After the elements of a feature are distorted, the rank of each element in an ascending order of its values changes. Assume that the dataset B has n data objects and d features. Rankᵢⱼ denotes the rank of the element Bᵢⱼ, and (Rankᵢⱼ)* denotes the rank of the distorted element B̄ᵢⱼ. Then RP is defined as:

RP = ( Σᵢ₌₁ᵈ Σⱼ₌₁ⁿ | Rankᵢⱼ − (Rankᵢⱼ)* | ) / (d × n)    (6)
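Equations (5) and (6) can be sketched with NumPy as follows; ranks are computed per column via a double argsort, and the tie-breaking rule for equal values is omitted for brevity:

```python
import numpy as np

def value_difference(B, B_pert):
    """VD of Equation (5): relative Frobenius-norm difference."""
    return np.linalg.norm(B - B_pert) / np.linalg.norm(B)

def ranks(M):
    """Rank of each element within its column, in ascending order of value."""
    return np.argsort(np.argsort(M, axis=0), axis=0)

def rank_position(B, B_pert):
    """RP of Equation (6): average rank change over all elements."""
    n, d = B.shape
    return np.abs(ranks(B) - ranks(B_pert)).sum() / (d * n)

B = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])
B_pert = np.array([[1.1, 11.0], [2.9, 29.0], [2.0, 21.0]])
```

In this toy pair, the perturbation swaps the order of the last two elements of the first column only, so two of the six elements change rank by one position each.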


If two elements have the same value, we define the element with the lower row index to have the higher rank.

Rank Maintenance (RK)

RK represents the percentage of elements that keep their rank of value in each column after the distortion. It is computed as:

RK = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} Rk_{ij}}{m \times n} \qquad (7)

Where Rk_{ij} indicates whether an element keeps its position in the order of values:

Rk_{ij} = \begin{cases} 1 & Rank_{ij} = (Rank_{ij})^{*} \\ 0 & \text{otherwise} \end{cases} \qquad (8)

Change of Rank of Features (CP)

One may infer the content of a feature from its relative value difference compared with the other features. Thus it is desirable that the order of the average value of each feature varies after the data distortion. Here we use the metric CP to measure the change of rank of the average value of the features:

CP = \frac{\sum_{i=1}^{d} |RAV_{i} - RAV_{i}^{*}|}{d} \qquad (9)

Where RAV_{i} is the rank of the average value of feature i, while RAV_{i}^{*} denotes its rank after the distortion.

Maintenance of Rank of Features (CK)

CK is defined to measure the percentage of features that keep their rank of average value after the distortion. It is calculated as:

CK = \frac{\sum_{i=1}^{d} CK_{i}}{d} \qquad (10)

Where CK_{i} is computed as:

CK_{i} = \begin{cases} 1 & RAV_{i} = RAV_{i}^{*} \\ 0 & \text{otherwise} \end{cases} \qquad (11)

Distortion Measures of the FS-SSVD Method

To evaluate the distortion created by the FS-SSVD method suggested in this study, the new criteria WVD, WRP, WRK, WCP and WCK are used. Each of these criteria considers, from a different aspect, both the distortion created in the process of eliminating the irrelevant features and the distortion created in the SSVD perturbation process.

Weighted Value Difference (WVD)

VD is the value difference of the distorted features during the SSVD perturbation process converting B to B̄, obtained from Equation (5). To calculate the value difference of all features (WVD), which arises during the FS-SSVD perturbation process converting A to B̄, the distortion created by the removal of irrelevant features must also be considered. If the eliminated features have a high degree of privacy in the owner's view, the amount of distortion is greatly increased. So the new value difference criterion should calculate the final amount of distortion based on both the privacy importance level of the eliminated features and the distortion created in the selected features.

To calculate the value difference of the eliminated features (VD_{RF}), it can be assumed that the eliminated elements have the same value difference as VD, scaled by the ratio of the total privacy of the eliminated features to the total privacy of the selected features; so we have:
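Assuming RAV_i is read as the rank of feature i's column mean among all features (a direct reading of the definitions above, not the authors' code), Equations (9)–(11) can be sketched as:

```python
def average_value_ranks(matrix):
    """Rank of each feature (column) by its average value, 1 = largest mean."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
             for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: -means[j])
    ranks = [0] * n_cols
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

def cp_ck(original, perturbed):
    """CP (Eq. 9): mean rank change of the feature averages.
    CK (Eqs. 10-11): fraction of features whose average-value rank is kept."""
    r0, r1 = average_value_ranks(original), average_value_ranks(perturbed)
    d = len(r0)
    cp = sum(abs(a - b) for a, b in zip(r0, r1)) / d
    ck = sum(1 for a, b in zip(r0, r1) if a == b) / d
    return cp, ck
```

An undistorted matrix yields CP = 0 and CK = 1; swapping the contents of two feature columns yields CP = 1 and CK = 0 for a two-feature matrix.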


\frac{VD}{WS_{SF}} = \frac{VD_{RF}}{WS_{RF}} \Rightarrow VD_{RF} = VD \times WS_{RF} / WS_{SF} \qquad (12)

Therefore, WVD is calculated as follows:

WVD = VD + VD_{RF}
WVD = VD (1 + WS_{RF} / WS_{SF}) \qquad (13)

Weighted Rank Position (WRP)

RP is the average change in rank position of the elements of the distorted features during the SSVD perturbation process converting B to B̄, derived from Equation (6), and RP_{RF} is the average change in rank position of the elements of the eliminated features. To calculate WRP, the average change in rank of the elements of all features during the FS-SSVD perturbation process converting A to B̄, we proceed as in the WVD calculation:

\frac{RP}{WS_{SF}} = \frac{RP_{RF}}{WS_{RF}} \Rightarrow RP_{RF} = RP \times WS_{RF} / WS_{SF} \qquad (14)

Therefore, WRP is calculated as follows:

WRP = RP + RP_{RF}
WRP = RP (1 + WS_{RF} / WS_{SF}) \qquad (15)

Weighted Rank Maintenance (WRK)

Because the rank of the eliminated features will not be maintained, WRK, the average rank maintenance of the elements of all features during the conversion of A to B̄, is calculated as follows:

WRK = \begin{cases} RK & m = d \\ RK \times \frac{1}{m - d} & \text{otherwise} \end{cases} \qquad (16)

Where m is the number of initial features in the matrix A, d is the number of selected features in matrix B, and RK is the average rank maintenance of the elements of all distorted features during the conversion of B to B̄, derived from Equation (7).

Weighted Feature Rank Change (WCP)

WCP is the average rank change of the average value of all features during the FS-SSVD perturbation process, and is defined as follows:

\frac{CP}{WS_{SF}} = \frac{CP_{RF}}{WS_{RF}} \Rightarrow CP_{RF} = CP \times WS_{RF} / WS_{SF} \qquad (17)

Therefore, WCP is calculated as follows:

WCP = CP + CP_{RF}
WCP = CP (1 + WS_{RF} / WS_{SF}) \qquad (18)

CP is the average rank change of the average value of the distorted features, derived during the SSVD perturbation process converting B to B̄, based on Equation (9).

Weighted Feature Rank Maintenance (WCK)

As with WRK, since the rank of the average value of the eliminated features will not be maintained, WCK, the average rank maintenance of the average value of all features during the conversion of A to B̄, is calculated as follows:

WCK = \begin{cases} CK & m = d \\ CK \times \frac{1}{m - d} & \text{otherwise} \end{cases} \qquad (19)
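The weighted criteria of Equations (13), (15) and (18) all reduce to one scaling rule once the total privacy weights of the selected features (WS_SF) and of the removed features (WS_RF) are known. A minimal Python sketch:

```python
def weighted_measure(base, ws_selected, ws_removed):
    """Scale a base distortion measure (VD, RP or CP) by the privacy-weight
    ratio of removed to selected features, per Eqs. (13), (15) and (18)."""
    if ws_removed == 0:  # no features removed: the criterion reduces to SSVD's
        return base
    return base * (1.0 + ws_removed / ws_selected)
```

For instance, a VD of 0.5 with WS_SF = 10 and WS_RF = 5 gives WVD = 0.75, and with WS_RF = 0 the weighted measure coincides with the plain SSVD measure.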


CK is the average rank maintenance of the average value of the distorted features, derived during the SSVD perturbation process converting B to B̄, based on Equation (10).

Based on the newly defined measures, it can be seen that if no feature is eliminated in the proposed method during the process of choosing the superior features (i.e., m = d), the value of WS_RF will be zero and the values of these criteria will be equal to those of the SSVD method, i.e. WVD = VD, WRP = RP, WRK = RK, WCK = CK and WCP = CP.

Experiments and Results

In this section, the experiments conducted on the proposed method are described, along with the details of each test, including the evaluation criteria, the test methods, the dataset used, and the analysis of each test.

Evaluation Measures

The aim of the proposed perturbation method is to hide the sensitive and private data in such a way that the data patterns and concepts still remain in the dataset, so that an accurate decision model can be established from the perturbed dataset. Therefore, two groups of evaluation criteria are used to evaluate the proposed method: data distortion measures and a data utility measure (Wang, Zhang, Xu, & Zhong, 2008).

Data Distortion Measures

Data distortion criteria are used to evaluate the amount of distortion introduced by perturbation methods, that is, the dissimilarity between the original dataset and the perturbed dataset. In this study, to compare the proposed FS-SSVD method with the previous methods, five measures are used: Weighted Value Difference (WVD), Weighted Rank Position (WRP), Weighted Rank Maintenance (WRK), Weighted Feature Rank Change (WCP) and Weighted Feature Rank Maintenance (WCK), discussed in the previous section.

Data Utility Measure

This measure reflects the accuracy of data mining algorithms on the perturbed data (Wang, Zhang, Xu, & Zhong, 2008). In this study, Support Vector Machine (SVM) classification on the perturbed dataset is used as the data utility criterion. SVM classification tries to find the best hyperplane, the one separating the samples with the highest margin (Joachims, 1999). Also, to adjust the required parameters of the proposed method, to conduct evaluations, and to compare the proposed method with the other methods introduced in this field, several experiments were carried out whose results are explained in the upcoming sections. In all of these experiments, 10-fold cross-validation is used to estimate the classification accuracy when evaluating data utility.

Experimental Environments and the Dataset

In this study, the tests were run on a personal computer with a 2.27 GHz CPU and 3 GB of RAM. The proposed method was implemented in Matlab 7.11, and the classification and feature selection algorithms were run with the Weka 3.7 software package (Witten & Frank, 2005).

In addition, to conduct the tests and compare the proposed method with the other methods, the real WDBC dataset, selected from the UCI repository¹, is used. The WDBC dataset is used for diagnosis purposes and includes 569 samples and 30 real-valued features. It has two classes that indicate benign and malignant types of cancer: 357 benign cases and 212 malignant cases.

Intermediate Experiments to Set the Parameters

The important parameters of the proposed method include:

• εV and εU: sparsification parameters of matrices V and U
• W: matrix of privacy levels of features


• NSF: number of selected superior features

In this study, for all tests, the default values of the sparsification parameters are set as εU = 0.038 and εV = 0.02. How the other parameters are set is described in the following.

The Initial Value of the Privacy Importance of Features

As discussed in a previous section, a new model called CPM is proposed in this study for the privacy of the features, based on which the data owners can determine a weight 0 ≤ wi ≤ 1 for each feature according to their concerns regarding its privacy protection. Based on this model, if a feature has wi = 1, that feature is of high importance for the owner in terms of privacy protection; conversely, the lower a feature's weight, the less important that feature is for the owner. In this study, in order to run the various tests and define matrices of different feature privacy levels, 10 random but uniformly distributed numbers in [0, 1] are generated for each feature of the WDBC dataset. In effect, 10 privacy weights are defined randomly for each feature.

The Initial Value of the Number of Selected Features

The parameter NSF is the number of superior features selected by the Information Gain algorithm. NSF, the number of selected features, has a significant impact on the accuracy of the mining results. The SVM classification accuracy for different values of the NSF parameter is depicted in Figure 3. The mining accuracy of SVM on the WDBC dataset with all features is 98.067%. As can be seen in this figure, removing some of the irrelevant features does not decrease the mining accuracy, or decreases it only negligibly. For example, after eliminating the three irrelevant features with the least weight, the mining accuracy is still 98.067%, and removing eight irrelevant features decreases the accuracy by only about 0.1%. But the accuracy of the mining results does decrease when more than 8 irrelevant features are eliminated. In this study, for the comparisons of the proposed method, NSF is set to 22.

Evaluation of the Proposed FS-SSVD Method against the SSVD Method

To compare the proposed FS-SSVD method with the other methods, tests for the different distortion ranks k = 1:30 are done; for example, the results for k = 7 are shown in Table 1. In the experiments, carried out in accordance with the procedure of Table 1, the average distortion and data mining accuracy of the FS-SSVD method are obtained for the different ranks: for every k and for every set of privacy levels (each of 10 different random weightings), the distortion and accuracy of the perturbed dataset are obtained and the average value of each criterion is calculated. In the following, the proposed FS-SSVD method and the SSVD method are compared on the basis of the distortion and accuracy criteria.

Comparison of Mining Accuracy

As can be seen in Figure 4, the mining accuracy of the FS-SSVD method has not decreased compared to the SSVD method; the accuracy has even increased for some distortion ranks, including 11, 14, 15, 16, 18, 19, 20, 21 and 22. An important reason for maintaining the mining accuracy in the proposed method is that it tries to eliminate the information that is unimportant for reaching the data mining result. On the other hand, since some of the removed information and features are noise and have a negative effect on the mining result, in many cases better results are produced in terms of accuracy.

Comparison of the Distortion Amount of WVD

To compare the mining accuracy and the distortion extent simultaneously, a comparison graph of
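The superior-feature ranking step described above can be illustrated with a small information-gain scorer. This is an illustrative sketch using a single binary split per numeric feature; the study presumably relied on Weka's InfoGainAttributeEval, which discretizes attributes differently:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Information gain of splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder

def select_top_features(feature_columns, labels, thresholds, nsf):
    """Rank features by information gain and keep the NSF best
    (a ranking-style selection, as used for the NSF parameter)."""
    scored = sorted(range(len(feature_columns)),
                    key=lambda j: -information_gain(
                        feature_columns[j], labels, thresholds[j]))
    return sorted(scored[:nsf])
```

A feature that perfectly separates the classes scores an information gain equal to the class entropy, so it is ranked ahead of an uninformative one.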


Table 1. The results of the FS-SSVD method on the WDBC (569 × 22) dataset for different levels of privacy and k = 7

W | WVD | WRP | WRK | WCK | WCP | Mining Accuracy (SVM)
weighting 1 | 0.6463 | 218.8545 | 0.0013 | 0.0511 | 3.1487 | 92.2671
weighting 2 | 0.7309 | 247.509 | 0.0013 | 0.0511 | 3.561 | 92.2671
weighting 3 | 0.7729 | 261.7144 | 0.0013 | 0.0511 | 3.7654 | 92.2671
weighting 4 | 0.6887 | 233.1946 | 0.0013 | 0.0511 | 3.355 | 92.2671
weighting 5 | 0.7265 | 246.0151 | 0.0013 | 0.0511 | 3.5395 | 92.2671
weighting 6 | 0.6921 | 234.3424 | 0.0013 | 0.0511 | 3.3715 | 92.2671
weighting 7 | 0.7148 | 242.0606 | 0.0013 | 0.0511 | 3.4826 | 92.2671
weighting 8 | 0.631 | 213.6531 | 0.0013 | 0.0511 | 3.0739 | 92.2671
weighting 9 | 0.7744 | 262.2314 | 0.0013 | 0.0511 | 3.7728 | 92.2671
weighting 10 | 0.736 | 249.2407 | 0.0013 | 0.0511 | 3.5859 | 92.2671
Average | 0.71136 | 240.8816 | 0.0013 | 0.0511 | 3.46563 | 92.2671

accuracy and the WVD distortion extent is provided in Figure 5. In this figure, the two lower plots show the WVD distortion amount and the two upper plots show the accuracy of the FS-SSVD and SSVD methods. As can be seen, the two upper graphs almost overlap each other, so the accuracies of the two methods are almost identical, while the distortion extent of the proposed FS-SSVD method has increased by about 0.2 compared to the SSVD method, which is significant given that the accuracy is maintained.

Since, based on this graph, the amount of WVD and the mining accuracy remain unchanged as the distortion rank k increases, we can conclude that in the FS-SSVD method the amount of WVD and the accuracy of the mining result are independent of the degree of distortion. Therefore, the best distortion rank, for which the accuracy rate and distortion extent are both highest, is k = 18, where the accuracy equals 93.84% and the distortion is 0.71136.

Figure 3. The diagram of calibration of the NSF parameter value


Figure 4. Comparison of accuracy of the SSVD and FS-SSVD methods for different ranks

Comparison of the Distortion Amount of WRP

A diagram comparing the WRP distortion amount of the FS-SSVD and SSVD methods is shown in Figure 6. As seen in this figure, the average change in rank position of the elements has increased in the FS-SSVD method compared to the SSVD method. In addition, the figure shows that the WRP value decreases with increasing distortion rank. So we can conclude that the WRP extent depends on the distortion rank, and that the distortion amount in terms of WRP can be controlled by setting the value of k in the above methods.

Comparison of the Distortion Amount of WRK

A comparison diagram of the WRK distortion criterion for the FS-SSVD and SSVD methods is shown in Figure 7. As previously mentioned, the lower the distortion levels of WRK and WCK, the higher the level of privacy; in fact, these two criteria have an inverse relation with the level of privacy. It can be seen in this diagram that the WRK of the FS-SSVD method is significantly lower than that of the SSVD method, so its level of privacy is proportionately higher. It is also observed that in the SSVD method, WRK (or RK) increases along with the distortion rank, so the amount of WRK in that method is dependent on the k

Figure 5. Comparison of accuracy and distortion of the SSVD and FS-SSVD methods for different ranks


Figure 6. Comparison of WRP of the SSVD and FS-SSVD methods for different ranks

rank. But in the FS-SSVD method, the WRK value is not very dependent on the rank.

Comparison of the Distortion Amount of WCP

As can be seen in Figure 8, in the FS-SSVD method the average rank change of the average value of all features (WCP) has decreased compared to the SSVD method. One of the reasons is that, in this study, the average WCP of the selected features is used for the WCP amount of the eliminated features, and since the largest changes in element rank occurred in the eliminated features, this amount has decreased.

Comparison of the Distortion Amount of WCK

As can be seen in Figure 9, similar to WRK, the WCK distortion amount of the FS-SSVD method is much lower than that of SSVD, so its privacy level is proportionately higher as well. It is also observed that in the SSVD method, WCK (or CK) increases with the distortion rank, so the amount of WCK in that method depends on the rank k, while in the FS-SSVD method, WCK is not very dependent on the distortion rank.

Figure 7. Comparison of WRK in the SSVD and FS-SSVD methods for different ranks


Figure 8. Comparison of WCP in the SSVD and FS-SSVD methods for different ranks

Comparison of Computational Cost

One of the challenges of SVD-based methods is their high computational cost on large-scale data matrices. The computational cost of the SVD of a matrix A with n samples and m features using a Lanczos-type procedure can be expressed as follows (Berry, Drmac, & Jessup, 1995):

Total Cost = I \times cost(A^{T}Ax) + k \times cost(Ax) \qquad (20)

The highest computational cost is related to the complexity of multiplying A by A^{T}; the time complexity of the SVD with the Lanczos algorithm is of order O(nm^{2}). So computing the SVD on only a part of the original matrix reduces the computational cost and improves the efficiency of the data mining algorithms, because in this case the matrix multiplications involve a sub-matrix of A with fewer dimensions instead of the original matrix A (Wang, Zhang, Xu, & Zhong, 2008). Accordingly, since in the proposed FS-SSVD method the features irrelevant to the data mining task are eliminated before the data is distorted, the computational cost of this method is reduced compared to SSVD.

In order to evaluate the reduction in the computational cost of the proposed method of

Figure 9. Comparison of WCK in the SSVD and FS-SSVD methods for different ranks


Table 2. Comparison of the execution times (in seconds) of the Thin-SVD and SSVD algorithms on the WDBC dataset with different dimensions

DB | Thin-SVD | SSVD
WDBC (569 × 30) | 3.6431 | 3.7653
WDBC (569 × 22) | 2.0914 | 1.9835
WDBC (569 × 11) | 0.1623 | 0.1753

FS-SSVD, the execution times of the Thin-SVD and SSVD algorithms on the original matrix and on the matrices derived from the feature selection method with different numbers of features are compared for all ranks of the input matrix; the results of this evaluation are shown in Table 2. As seen in the results, the running time of the SSVD algorithm depends on the number of features of the input matrix: the fewer the features, the lower the execution time of the algorithm. For example, when the dimensions of the input matrix are reduced to almost one third, the algorithm execution time is reduced by a factor of about 22.

On the other hand, since the computational cost of the learning algorithms often depends on the number of input features, removing irrelevant features also improves the computation time of the learning algorithms. Accordingly, the CPU times of four learning algorithms, SVM, NaiveBayes, J48 and DecisionTable, on the original data matrix, the matrix perturbed with the SSVD method, and the matrix perturbed with the proposed FS-SSVD method are compared with each other; the results of this evaluation are shown in Table 3. As can be seen in the results of this table, applying the SSVD perturbation method reduces the execution time of the learning algorithms, and in the proposed FS-SSVD method the computational cost decreases further, depending on the number of eliminated features.

Overall Evaluation of the Proposed FS-SSVD Method against Other Methods

In this section, the FS-SSVD method is compared with the two previous methods in this area, Thin-SVD and SSVD. To compare these methods, the averages of the distortion criteria and the mining accuracy over all the distortion ranks are compared with each other. Also, for a correct comparison of the privacy criteria, the basic parameters are set as follows:

• Sparsification values: εU = 0.038 and εV = 0.02
• Number of selected features: NSF = 22 in the FS-SSVD method
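For reference, the SSVD-style perturbation whose runtimes these comparisons measure can be sketched with NumPy. This is a hypothetical reconstruction of the general idea only: truncate the SVD to rank k, then zero out entries of the singular vectors below the sparsification thresholds; the exact sparsification and perturbation rules of the original SSVD papers may differ:

```python
import numpy as np

def ssvd_perturb(A, k, eps_u=0.038, eps_v=0.02):
    """Rank-k SVD approximation of A with small entries of the
    singular vectors zeroed out (illustrative SSVD-style sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Uk = np.where(np.abs(Uk) < eps_u, 0.0, Uk)    # sparsify left singular vectors
    Vtk = np.where(np.abs(Vtk) < eps_v, 0.0, Vtk)  # sparsify right singular vectors
    return Uk @ np.diag(sk) @ Vtk
```

Running this on the n × d matrix of selected features instead of the n × m original is what yields the cost reduction reported in Table 2, since the dominant O(nm²) term shrinks with the number of features.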

Table 3. Comparison of the execution times (in seconds) of the learning algorithms for the different SVD-based methods

Classifiers | Original (All Features) | SSVD, rank=7 (All Features) | FS-SSVD, rank=7 (22 Features) | FS-SSVD, rank=7 (11 Features)
SVM | 0.2 | 0.1 | 0.06 | 0.04
Naïve Bayes | 0.07 | 0 | 0 | 0
J48 | 0.17 | 0.02 | 0.02 | 0.01
Decision Table | 0.41 | 0.17 | 0.11 | 0.05


Table 4. Comparison of the proposed method with the SVD-based methods

Methods | Average WVD | Average WRP | Average WRK | Average WCK | Average WCP | Average SVM Accuracy (%)
Thin-SVD | 0.0050 | 42.25415 | 0.3553 | 0.9855 | 0.0155 | 96.373
SSVD | 0.50402 | 146.0979 | 0.0095 | 0.2844 | 3.875 | 92.325
FS-SSVD | 0.7117 | 209.2639 | 0.0013 | 0.0413 | 3.944 | 92.315

The results of this assessment are given in Table 4. For the easiest comparison of the above methods, a comparison diagram of the WVD distortion value is shown in Figure 10 and a comparison diagram of the SVM mining accuracy is shown in Figure 11.

As can be seen in Figure 10, the WVD (or VD) distortion extent of the Thin-SVD method is the smallest in comparison with the other

Figure 10. Comparison of average WVD of FS-SSVD method with methods of Thin-SVD and SSVD

Figure 11. Comparison of average mining accuracy of FS-SSVD method with methods of Thin-SVD and SSVD


methods and is insignificant. Hence, the SSVD method was introduced to distort the values further and improve privacy preservation. Accordingly, as shown in Figure 11, in the SSVD method, along with a reduction of about 4% in the accuracy of the data mining result, the distortion extent increases considerably, by about 0.5.

According to Figure 10, the distortion amount of FS-SSVD has increased by about 0.2 compared to SSVD, while, as can be seen in Figure 11, the mining accuracy remains unchanged. In fact, since the proposed FS-SSVD method deletes information irrelevant to the data mining task with the aim of preserving privacy, this had no negative impact on obtaining valid data mining results, and may even have led to an improvement in the data mining accuracy.

In order to compare the accuracy and privacy of the proposed method simultaneously, a comparison chart of these three methods is shown in Figure 12. Based on the graphs presented in the figure, the FS-SSVD method has the highest level of privacy protection and data utility compared to the Thin-SVD and SSVD methods. So we can conclude that for classification tasks, the elimination of irrelevant features before the perturbation can improve both data privacy and the accuracy of the mining results.

Results and Future Works

With respect to the efforts that have been made to remove the barriers to privacy preserving data mining methods, in this study a personalized privacy model called CPM was provided, based on which the data owners can determine the value and degree of importance of each feature in terms of privacy preservation, according to their own concerns and views. Also, the FS-SSVD perturbation method, based on the CPM privacy model, was presented.

The FS-SSVD perturbation method is based on the combined idea of SSVD and the Information Gain feature selection method, and increases the amount of privacy preservation without leaving a negative impact on the data utility. In this method, to improve privacy, the features irrelevant to the supervised learning task are eliminated using feature selection methods before the SSVD perturbation method is applied. Also, to evaluate the amount of data distortion, new distortion criteria were introduced in which the perturbation created in the process of eliminating the irrelevant features is also accounted for, based on the privacy value of each feature. The elimination of information and features irrelevant to the data mining task in the proposed FS-SSVD method has many benefits:

Figure 12. Simultaneous comparison of accuracy and distortion amount of FS-SSVD method
with methods of Thin-SVD and SSVD


1. Increasing the level of privacy protection: the more important the eliminated features are in terms of privacy, the greater the amount of privacy preservation.
2. Improving the accuracy of the mining results: since in data mining applications the eliminated information is considered noise, it is possible to get better results in terms of accuracy compared to mining on the original dataset.
3. Reducing computational cost: the implementation time of the SSVD-based perturbation method depends on the size of the input matrix, so the fewer the features of the dataset, the lower the running time of this algorithm. On the other hand, since the computational cost of the learning algorithms often depends on the number of input features, the removal of irrelevant features improves the computation time of the learning algorithms.

Therefore, it can be concluded that the objectives pursued in this study, improving the level of privacy preservation and the data utility, were realized efficiently through the proposed perturbation method and the personalized privacy model.

Some of the approaches that can be used to improve and develop the proposed method in the future are:

1. Employing feature selection methods based on subset search instead of ranking-based feature selection methods
2. Adopting other matrix-decomposition-based methods such as NMF
3. Investigating the proposed perturbation method for clustering tasks
4. Investigating the proposed method on large-scale real datasets with substantial dimensions, such as medical datasets
5. Determining the optimal rank k in the SVD-based methods and the threshold values in the SSVD-based methods in various applications

ACKNOWLEDGEMENT

The complete source code implementation and dataset can be downloaded from http://users.monash.edu/~dtaniar/IJDWM.

REFERENCES

Ashrafi, M. Z., Taniar, D., & Smith, K. A. (2005). PPDAM: Privacy-preserving distributed association-rule-mining algorithm. International Journal of Intelligent Information Technologies, 1(1), 49–69. doi:10.4018/jiit.2005010104

Berry, M. W., Drmac, Z., & Jessup, E. R. (1995). Matrices, vector spaces, and information retrieval. SIAM Review, 41(2), 335–362. doi:10.1137/S0036144598347035

Clifton, C., Kantarcioglu, M., & Vaidya, J. (2002). Defining privacy for data mining. In Proceedings of the National Science Foundation Workshop on Next Generation Data Mining, Baltimore, MD (pp. 126-133).

Cover, M., & Thomas, J. A. (1999). Elements of information theory. New York, NY: Wiley.

Daly, O., & Taniar, D. (2004). Exception rules mining based on negative association rules. In Proceedings of Computational Science and Its Applications (ICCSA), Part IV, Lecture Notes in Computer Science (Vol. 3046, pp. 543-552). Heidelberg, Germany: Springer-Verlag.

Estévez, P. A., Tesmer, M., Perez, C. A., & Zurada, J. M. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2), 189–201. doi:10.1109/TNN.2008.2005601 PMID:19150792

Joachims, T. (1999). Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press.

John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning (pp. 121–129).

Kamalika, D. (2009). Privacy preserving distributed data mining based on multi-objective optimization and algorithmic game theory. PhD dissertation, University of Maryland.


Keyvanpour, M., & Seifi Moradi, S. (2010). Classification and evaluation the privacy preserving data mining techniques by using a data modification–based framework. International Journal on Computer Science and Engineering, 3(2).

Keyvanpour, M. R., Javadieh, M., & Ebrahimi, M. R. (2011). Detecting and investigating crime by means of data mining: A general crime matching framework. Procedia Computer Science, 3, 872–880. doi:10.1016/j.procs.2010.12.143

Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance.

Liu, K., Giannella, C., & Kargupta, H. (2006). An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 297-308).

Liu, L., Kantarcioglu, M., & Thuraisingham, B. (2008). The applicability of the perturbation based privacy preserving data mining for real-world data. Data & Knowledge Engineering, 65(1), 5–21. doi:10.1016/j.datak.2007.06.011

Polat, H., & Du, W. (2005). SVD-based collaborative filtering with privacy. In Proceedings of the 20th ACM Symposium on Applied Computing, Track on E-commerce Technologies, Santa Fe, NM.

Rakotomalala, R., & Mhamdi, F. (2006). Combining feature selection and feature reduction for protein classification. In Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal (pp. 444-451).

Seifi Moradi, S., & Keyvanpour, M. R. (2012). Classification and evaluation the privacy preserving distributed data mining techniques. Journal of Theoretical and Applied Information Technology, 37(2), 204–210.

Tjioe, H. C., & Taniar, D. (2005). Mining association rules in data warehouses. International Journal of Data Warehousing and Mining, 1(3), 28–62. doi:10.4018/jdwm.2005070103

Vaidya, J., & Clifton, C. (2004). Privacy-preserving data mining: Why, how and when. IEEE Security and Privacy, 2(6), 19–27. doi:10.1109/MSP.2004.108

Wahlstrom, K., Roddick, J. F., Sarre, R., Estivill-Castro, V., & de Vries, D. (2009). Legal and technical issues of privacy preservation in data mining. In Encyclopedia of Data Warehousing and Mining (pp. 1158-1163).

Wang, J., Zhang, J., Xu, S., & Zhong, W. (2008). A novel data distortion approach via selective SSVD for privacy protection. International Journal of Information and Computer Security, 2(1), 48–70. doi:10.1504/IJICS.2008.016821

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementations (2nd ed.). Morgan Kaufmann Publishers.

Xu, S., Zhang, J., Han, D., & Wang, J. (2005). Data distortion for privacy protection in a terrorist analysis system. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Atlanta, GA (pp. 459-464).

Xu, S., Zhang, J., Han, D., & Wang, J. (2006). Singular value decomposition based data distortion strategy for privacy protection. Knowledge and Information Systems, 10(3), 383–397. doi:10.1007/s10115-006-0001-2

Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC.

ENDNOTES

1. UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/mlre
