
On the Design and Analysis of the Privacy-Preserving SVM Classifier

Keng-Pei Lin and Ming-Syan Chen, Fellow, IEEE

• K.-P. Lin is with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan. E-mail: kplin@arbor.ee.ntu.edu.tw.
• M.-S. Chen is with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, and the Research Center of Information Technology Innovation, Academia Sinica, Taipei, Taiwan. E-mail: mschen@cc.ee.ntu.edu.tw.

Abstract—The support vector machine (SVM) is a widely used tool for classification problems. The SVM trains a classifier by solving an optimization problem that decides which instances of the training dataset are support vectors, the informative instances that form the SVM classifier. Since support vectors are intact tuples taken from the training dataset, releasing the SVM classifier for public use or shipping it to clients discloses the private content of the support vectors, which may violate privacy-preservation requirements imposed for legal or commercial reasons. In other words, the classifier learned by the SVM inherently violates privacy, and this restricts the applicability of the SVM. To the best of our knowledge, no prior work has extended the notion of privacy-preservation to tackle this inherent privacy violation of the SVM classifier. In this paper, we investigate this privacy violation problem and propose an approach that post-processes the SVM classifier to transform it into a privacy-preserving classifier which does not disclose the private content of the support vectors. The post-processed SVM classifier, which does not expose the private content of the training data, is called the Privacy-Preserving SVM Classifier (abbreviated as PPSVC). The PPSVC is designed for the commonly used Gaussian kernel function. It precisely approximates the decision function of the Gaussian kernel SVM classifier without exposing the sensitive attribute values of the support vectors. By applying the PPSVC, the SVM classifier can be publicly released while preserving privacy. We prove that the PPSVC is robust against adversarial attacks, and experiments on real datasets show that its classification accuracy is comparable to that of the original SVM classifier.

Index Terms—Privacy-preserving data mining, classification, support vector machines

1 INTRODUCTION

THERE is an increasing degree of concern about the privacy protection of personal information due to the growing amount of electronic data held by commercial corporations. Data mining techniques [1] have been viewed as a threat to the sensitive content of personal information, and this privacy concern has led to research on privacy-preserving data mining techniques [2], [3], [4]. One of the important data mining tasks is classification. A classification algorithm learns a classification model (i.e., the classifier) from labeled training data for the future use of classifying unseen data. Many privacy-preserving schemes have been designed for various classification algorithms [2], [5]. The support vector machine (SVM) [6], [7], a powerful classification algorithm with state-of-the-art performance, has also attracted much attention from researchers studying privacy-preserving data mining [8], [9], [10], [11], [12].

However, one problem has not been addressed by existing privacy-preserving SVM work: the classifier learned by the SVM contains some intact instances of the training data, so the classification model of the SVM inherently violates privacy. Revealing the classifier will also reveal the private content of some individuals in the training data. Consequently, the classifier learned by the SVM cannot be publicly released or shipped to clients while preserving privacy.

There is a significant difference between the SVM and other popular classification algorithms: the classifier learned by the SVM contains some intact instances of the training data. The subset of the training data kept in the SVM classifier consists of the support vectors, the informative entries that make up the classifier. Support vectors are intact instances taken from the training data. Their inclusion prevents the SVM classifier from being publicly released or shipped to client users, since releasing the classifier discloses individual privacy and may violate privacy-preservation requirements imposed for legal or commercial reasons. For instance, HIPAA regulations require that medical data not be released without appropriate anonymization [13]. The leakage of personal information is also prohibited by laws in many countries.

Most popular classification algorithms do not suffer from such a direct violation of individual privacy. For example, in the decision tree classifier, each node of the decision tree stands for an attribute and denotes the splitting points of the attribute values for proceeding to the next level [14]. The naive Bayesian classifier consists of prior probabilities of each class and class-conditionally independent probabilities of each value [15]. The neural network classifier possesses simply a set of weights and biases, accompanied by an activation function [15].


Unlike the SVM classifier, which contains some intact training instances, these classifiers merely have aggregate statistics of the training data. Disclosing aggregate statistics also breaches privacy to some extent, since the actual content of some training instances may be derived from the aggregate statistics with the help of external information sources [16]. However, the direct privacy violation of the SVM classifier, which discloses some intact training instances without any protection, is much more severe. As far as privacy-preservation is concerned, this is the fundamental difference between the SVM and other popular classification algorithms. The classifier of the SVM inherently violates privacy: it incorporates a subset of the training data, so releasing the classifier violates the privacy of individuals. The other classification algorithm that also directly violates individual privacy in its classification model is the k-nearest neighbor (kNN) classifier, which requires all training instances to be kept in the classifier [15].

The violation of privacy in the classification model restricts the applicability of the SVM. Consider the following application scenario: a hospital, or a medical institute, has collected a large number of medical records. The institute intends to use those collected medical records to learn an SVM classifier for predicting whether a patient is subject to a disease or not. Due to the inclusion of some medical records in the classifier, releasing the classifier to other hospitals or research institutes will expose the sensitive information of some patients. This violation of privacy limits the applicability of the learned SVM classifier. Although the identifier field of each record has been removed, the identity of an individual record may still be recognized from quasi-identifiers like gender, blood type, age, date of birth, and zip code [17].

There is also an increasing trend to outsource IT services to external service providers. Major IT companies like Google and Microsoft are constructing infrastructures to run Software as a Service, which enables small companies to run applications in the cloud-computing environment. Outsourcing can save much hardware, software, and personnel investment, but data privacy is a critical concern in outsourcing since the external service providers may be malicious or compromised. For using SVM classifiers in the cloud-computing environment, the private information of the training data should not be disclosed to unauthorized parties. Fig. 1 illustrates a general application scenario: the training data owner trains a classifier, and then publishes or ships the classifier to client users, or puts it in the cloud-computing environment.

Fig. 1. Application scenario: Releasing the learned SVM classifier to clients or outsourcing to cloud-computing service providers without exposing the sensitive content of the training data.

Although the anonymous data publishing technique k-anonymity [18] can be applied to data mining tasks [19], the performance may be degraded due to the distortion of data caused by generalized and suppressed quasi-identifiers. Furthermore, k-anonymity can still breach privacy since the identity may be recognized from generalized quasi-identifiers and unmodified attributes with the help of external information sources.

Existing works on privacy-preserving SVMs [8], [9], [10], [11], [12] mainly focus on privacy-preservation at training time. The privacy violation of the classification model of the SVM, and the release of the SVM classifier, have not been addressed. The methods proposed in [9], [10], [11], [12] aim to prevent the training data from being revealed to the other parties when the training data are separately held by several parties. Testing must be cooperatively done by the holders of the training data. The work of [8] considered a scenario in which the training data owner delivers perturbed training data to an untrustworthy third party to learn an SVM classifier.

To the best of our knowledge, there has been no work extending the notion of privacy-preservation to the release of the SVM classifier. In this paper, we propose the Privacy-Preserving SVM Classifier (abbreviated as PPSVC) to protect the sensitive content of support vectors in the SVM classifier. The PPSVC is designed for the SVM classifier trained with the commonly used Gaussian kernel function. It post-processes the SVM classifier to destroy the attribute values of support vectors, and outputs a function which precisely approximates the decision function of the original SVM classifier to act as a privacy-preserving SVM classifier. Fig. 2 shows the concept of the PPSVC. The support vectors in the decision function of the SVM classifier are transformed into a Taylor polynomial of linear combinations of monomial feature mapped support vectors, where the sensitive content of individual support vectors is destroyed by the linear combination. We prove that the PPSVC is robust against adversarial attacks, and in the experiments we verify with real data that the PPSVC achieves classification accuracy comparable to the original SVM classifier.

Fig. 2. The PPSVC post-processes the SVM classifier to transform it into a privacy-preserving SVM classifier which does not disclose the private content of the training data.
The PPSVC can be viewed as a general scheme which offers a proper compromise between the approximating precision and the computational complexity of the resulting classifier. A higher degree of approximation results in a classifier whose classification accuracy is closer to the original, at the cost of higher computational complexity. A PPSVC with a low approximation degree, i.e., low computational complexity, is enough to precisely approximate the SVM classifier and hence achieves comparable classification accuracy. In the experiments, we demonstrate that a Taylor polynomial of degree ≤ 5 in the PPSVC is able to obtain almost the same accuracy as the original SVM classifier.

The privacy-preserving release of the SVM classifier enabled by the PPSVC can benefit users other than the data owner without compromising privacy. For example, in addition to learning an SVM from medical records, learning from the financial transactions collected by a bank is useful for predicting the credit of customers, and learning a spam filter from a mail server or a network intrusion detector from a network server's logs are also important applications of classification. The privacy violation of the SVM classifier restricts its use to those who can collect the data, but collecting the data is usually an expensive task or can only be performed by professional institutes. Since the PPSVC makes it possible to release the SVM classifier without violating privacy, SVM classifiers are no longer restricted to the data owners, but can also benefit users who are unable to collect a large amount of training data.

The following summarizes our contributions:
• We address the privacy violation problem of releasing or publishing the SVM classifier. We propose the PPSVC, which precisely approximates the decision function of the Gaussian kernel SVM classifier in a privacy-preserving form. The PPSVC is realized by transforming the original decision function of the SVM classifier into an infinite series of linear combinations of monomial feature mapped support vectors, and the infinite series is then approximated by a Taylor polynomial. The releasable PPSVC provides classifier users with the good classification performance of the SVM without violating the individual privacy of the training data.
• We study the influence of the SVM kernel parameter on the approximating precision of the PPSVC, and provide a simple but subtle strategy for selecting the kernel parameter to obtain good approximating precision. We also study the security of the PPSVC by considering adversarial attacks aided by external information sources.
• Extensive experiments are conducted to evaluate the performance of the PPSVC. Experimental results on real data show that the PPSVC achieves almost the same accuracy as the original SVM classifier. The effect of the kernel parameter selection strategy is also evaluated, and the results validate the claim that it does not noticeably affect classification performance.

The rest of this paper is organized as follows: Section 2 briefly reviews related work on privacy-preserving data mining and privacy-preserving SVMs. Section 3 reviews the SVM and discusses the privacy violation of its classification model. Section 4 constructs the PPSVC. In Section 5, we discuss the security and approximating precision issues of the PPSVC. Section 6 shows the experimental results. Section 7 concludes this paper.

2 RELATED WORK

In this section, we first briefly review some privacy-preserving data mining works, and then focus on the works related to privacy-preserving SVMs.

The work of [2] utilized a randomization-based perturbation approach to perturb the data. The data are individually perturbed by adding noise randomly drawn from a known distribution. A decision tree classifier is then learned from the reconstructed aggregate distributions of the perturbed data. In [5], a condensation-based approach is proposed. Data are first clustered into groups, and then pseudo-data are generated from those clustered groups. Data mining tasks are then performed on the generated synthetic data instead of the original data.

The k-anonymity model [18] is an anonymous data publishing technique. It makes each quasi-identifier value indistinguishably map to at least k records by generalizing or suppressing the values in the quasi-identifier attributes. The l-diversity model [20] enhances k-anonymity by making each sensitive value appear no more than m/l times in a quasi-identifier group with m tuples. The k-anonymity model has been successfully utilized in data mining. For example, the work of [19] studied the performance of the SVM built upon anonymized data and upon anonymized data with additional statistics of the generalized fields. The distortion of data in k-anonymity may degrade the data mining performance, and privacy can still be breached due to the disclosure of generalized values and unmodified sensitive attributes, which may incur the risk of identification with the help of external information sources.

Another family of privacy-preserving data mining algorithms is the distributed methods [21]. The distributed methods perform data mining over the entire dataset which is separately held by several parties, without compromising the data privacy of each party. The dataset may be horizontally partitioned, vertically partitioned, or arbitrarily partitioned. A distributed privacy-preserving data mining algorithm exchanges the necessary information between parties to compute aggregate results without sharing the actual private content with each other. This approach capitalizes on secure multi-party computation from cryptography. Several privacy-preserving SVM works [9], [10], [11], [12] also belong to this family.

In the following, we detail the works on privacy-preserving SVMs. The works of [9], [10], [11], [12]
designed privacy-preserving protocols to exchange the necessary information for training the SVM on data partitioned among different parties without revealing the actual content of each party's data to the others. In [9], [10], [11], secure multi-party integer sum protocols are utilized to cooperatively compute the Gram matrix in the SVM formulation from the data separately held by several parties. In [12], a privacy-preserving protocol that performs the kernel adatron algorithm for training the SVM on data separately held by different parties is designed based on an additively homomorphic public-key cryptosystem. In these distributed methods, at the end of running the protocols, each party holds a share of the learned SVM classifier. Testing must be cooperatively performed by all involved parties since the support vectors, which come from the training data, are separately held. The goal of these distributed methods is to train an SVM classifier from the whole dataset separately held by different parties without compromising each party's privacy, which is orthogonal to our work of releasing the learned SVM classifier without violating the privacy of the support vectors.

The work of [8] exploits the rotation-invariant property of common kernel functions and applies a rotation matrix to transform the data, so that the training of the SVM can be outsourced to an external service provider without revealing the actual content of the data. The purpose of this work is also orthogonal to our work on the privacy-preserving release of the SVM classifier. The privacy-preserving scheme used in [8] for outsourcing the SVM training cannot be utilized for the privacy-preserving release of the SVM classifier, since it requires the testing data to be rotationally transformed by the same matrix applied to the training data; the matrix must be kept secret, or the original content of the rotationally transformed support vectors could be recovered by multiplying by the inverse of the matrix.

Compared to existing privacy-preserving SVM works, where [8] aims at outsourcing the SVM training without revealing the actual content of the data and [9], [10], [11], [12] aim at cooperatively training the SVM without revealing each party's own data when data are separately held, our work addresses the inherent privacy violation problem of the SVM classifier, which incorporates a subset of the training data, and designs a mathematical transformation method to protect the private content of the support vectors so as to make the release of the SVM classifier possible. Compared to anonymous data publishing techniques, our scheme achieves better performance and provides stronger privacy protection by hiding all the feature values.

3 SVM AND PRIVACY-PRESERVATION

We first briefly review the SVM in Section 3.1 to give the preliminaries of this work. Then in Section 3.2, we discuss the privacy violation problem of the SVM classifier, namely that a subset of the training data will inevitably be disclosed.

3.1 Review of the SVM

The SVM is a statistically robust learning method based on structural risk minimization [6]. It trains a classifier by finding an optimal separating hyperplane which maximizes the margin between two classes of data in the kernel-induced feature space. Without loss of generality, suppose that there are m instances of training data. Each instance consists of an (x_i, y_i) pair, where x_i ∈ R^N is a vector containing the attributes of the i-th instance, and y_i ∈ {+1, −1} is the class label of the instance. The objective of the SVM is to find the optimal separating hyperplane w · x + b = 0 between the two classes of data. To classify a testing instance x, the decision function is

f(x) = w \cdot x + b \qquad (1)

The corresponding classifier is sgn(f(x)).

Fig. 3. The SVM maximizes the margin between two classes of data. Squared points are support vectors.

The SVM finds the optimal separating hyperplane by solving the following quadratic programming optimization problem:

\arg\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0, \text{ for } i = 1,\ldots,m \qquad (2)

In the objective function, minimizing \frac{1}{2}\|w\|^2 corresponds to maximizing the margin between w · x + b = 1 and w · x + b = −1. The constraints aim to put the instances with positive labels on one side of the margin, w · x + b ≥ 1, and the ones with negative labels on the other side, w · x + b ≤ −1. The variables ξ_i, i = 1, ..., m are called slacks. Each ξ_i denotes the extent to which x_i falls outside its corresponding region. C is called the cost parameter; it is a positive constant specified by the user and denotes the penalty on the slacks. The objective function of the optimization problem is a trade-off between maximizing the margin and minimizing the slacks. A larger C corresponds to assigning a higher penalty to slacks, which results in fewer slacks but a smaller margin. The value of the cost parameter C is usually determined by cross-validation. Fig. 3 gives an example to illustrate the concept of the SVM formulation.
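As a concrete illustration of the formulation above, and of why the fitted model raises the privacy issue discussed in Section 3.2, the following minimal Python sketch (our own example using scikit-learn and synthetic data, not code from the paper) trains a soft-margin Gaussian-kernel SVM and shows that the learned classifier stores rows of the training set verbatim as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data standing in for a sensitive training set.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Soft-margin SVM with the Gaussian (RBF) kernel K(x, y) = exp(-g ||x - y||^2);
# C is the cost parameter penalizing the slacks xi_i in (2).
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)

# The fitted classifier keeps intact training instances: every row of
# clf.support_vectors_ is copied verbatim from X.
print("number of support vectors:", len(clf.support_))
print("first support vector:     ", clf.support_vectors_[0])
print("identical training row:   ", X[clf.support_[0]])
```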
The optimization problem of the SVM is usually solved in its dual form, derived by applying Lagrange multipliers and the KKT conditions [6], [7]. Solving the dual problem is equivalent to solving the primal problem. The dual form of the SVM's optimization problem implies the applicability of the kernel trick, since the data vectors of the training instances {x_1, x_2, ..., x_m} and the testing instance x appear only in dot product computations, both in the optimization problem and in the decision function. A kernel function K(x, y) implicitly maps the data x and y into some high-dimensional space and computes their dot product there without actually mapping the data [6]. By replacing the dot products with kernel functions, the kernelized dual form of the SVM's optimization problem is

\arg\min_{\alpha} \; \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{m}\alpha_i \quad \text{subject to} \quad \sum_{i=1}^{m}\alpha_i y_i = 0,\; 0 \le \alpha_i \le C \text{ for } i = 1,\ldots,m \qquad (3)

Since w = \sum_{i=1}^{m}\alpha_i y_i x_i in the duality, the kernelized decision function in the dual form is

f(x) = \sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b \qquad (4)

The bias term b can be calculated from the KKT complementarity conditions [6], [7] after solving the optimization problem.

By applying the kernel trick, the SVM implicitly maps data into a high-dimensional space and finds an optimal separating hyperplane there. Testing is also done in the kernel-induced high-dimensional space through the kernelized decision function. The kernel-induced mapping and high-dimensional space are usually called the feature mapping and the feature space, respectively. The original dot product is called the linear kernel, i.e., K(x, y) = x · y. With the linear kernel, the optimal separating hyperplane is found in the original space without feature mapping. The feature mapping of a nonlinear kernel function can be very complex, and we may not even know the actual mapping. A commonly used kernel function is the Gaussian kernel

K(x, y) = \exp(-g\|x - y\|^2) \qquad (5)

where g > 0 is a parameter. The Gaussian kernel represents each instance by a kernel-shaped function sitting on the instance; each instance is represented by its similarity to all other instances. The induced mapping of the Gaussian kernel is infinite-dimensional [6], [22].

In the decision function of the dual form (4), it is seen that only the non-zero α_i's and the corresponding (x_i, y_i) pairs are required to be kept in the decision function. Those (x_i, y_i) pairs with non-zero α_i are called support vectors. They are the instances falling outside their corresponding region after solving the optimization problem (the squared points in Fig. 3). Support vectors are the informative points that make up the SVM classifier. All training data except the support vectors are discarded after training. For ease of exposition, we denote the support vectors, i.e., the (x_i, y_i) pairs with nonzero α_i after training, as (SV_i, y_i), and the number of support vectors as m'. In the following, the decision function will therefore be represented as

f(x) = \sum_{i=1}^{m'}\alpha_i y_i K(SV_i, x) + b \qquad (6)

3.2 Privacy Violation of the SVM Classifier

From the decision function of the SVM classifier (6), we note that the support vectors existing in the SVM classifier are a subset of the training data. Part of the training data is kept in its original content in the decision function for performing kernel evaluations with the testing instance. Releasing the SVM classifier therefore violates privacy due to the inclusion of this sensitive content.

The linear kernel SVM is an exception: the SVM classifier learned with the linear kernel is inherently privacy-preserving. With the linear kernel, the support vectors incorporated in the decision function f(x) = \sum_{i=1}^{m'}\alpha_i y_i SV_i \cdot x + b can be linearly combined into one vector w = \sum_{i=1}^{m'}\alpha_i y_i SV_i, so

f(x) = w \cdot x + b \qquad (7)

Hence the classifier sgn(f(x)) of the linear kernel SVM can simply be represented by the hyperplane w · x + b = 0. The vector w is a linear combination of all support vectors. The sensitive content of each individual support vector is destroyed by the weighted summation, and therefore the classifier does not include individual private information of the training data. Fig. 3 shows a linear kernel SVM classifier. The separating hyperplane w · x + b = 0 alone is enough to classify the data; no individual support vector (the squared points in Fig. 3) needs to be kept in the classifier. Hence the linear kernel SVM classifier is inherently privacy-preserving.

Since the linear kernel SVM is only suitable for learning a classifier on linearly separable data, its usability in classification is limited. For linearly inseparable data, the linear kernel is inappropriate. A large part of the power of the SVM comes from the kernel trick. Without applying kernel functions, the SVM is merely a linear separator suited only to linearly separable data. By replacing the dot products with kernel functions in the SVM formulation, data are nonlinearly mapped into a high-dimensional feature space, and the SVM learns a linear classifier there. Since data in a high-dimensional space are highly sparse, it is easy to separate them there by a linear separator.
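To make the contrast concrete, the short sketch below (our own illustration with scikit-learn; the variable names are ours) collapses a linear-kernel SVM into the single vector w of (7), which can be released without the support vectors, whereas the Gaussian-kernel decision function can only be evaluated by keeping every SV_i for the kernel evaluations K(SV_i, x).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(80, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # linearly separable labels
x_test = np.array([0.3, -0.1])

# Linear kernel: the support vectors collapse into w = sum_i alpha_i y_i SV_i,
# so releasing (w, b) is enough and no individual SV_i is exposed.
lin = SVC(kernel="linear", C=1.0).fit(X, y)
w = (lin.dual_coef_ @ lin.support_vectors_).ravel()
b = lin.intercept_[0]
print(w @ x_test + b, lin.decision_function([x_test])[0])    # same value

# Gaussian kernel: no such collapse; every SV_i is needed at testing time.
g = 0.5
rbf = SVC(kernel="rbf", gamma=g, C=1.0).fit(X, y)
k = np.exp(-g * np.sum((rbf.support_vectors_ - x_test) ** 2, axis=1))
print((rbf.dual_coef_ @ k)[0] + rbf.intercept_[0],
      rbf.decision_function([x_test])[0])                     # same value
```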
However, the inherent privacy-preserving property of the linear kernel SVM classifier disappears when a nonlinear kernel is applied. In the nonlinear kernel SVM, the w in the decision function f(x) = w · x + b cannot be computed explicitly as with the linear kernel. The vector w exists in the kernel-induced feature space as w = \sum_{i=1}^{m'}\alpha_i y_i \Phi(SV_i), where Φ() denotes the feature mapping induced by the kernel function. Since the feature mapping is done implicitly, w can only be stated as a linear combination of kernel evaluations, w = \sum_{i=1}^{m'}\alpha_i y_i K(SV_i, \cdot), and the decision function is f(x) = \sum_{i=1}^{m'}\alpha_i y_i K(SV_i, x) + b. This restriction prevents us from linearly combining the support vectors into one vector w. The classifier must keep all support vectors in their original content to make possible the kernel evaluations K(SV_i, x) between the testing instance x and each support vector.

Fig. 4 illustrates a Gaussian kernel SVM trained on a small dataset. The three curves are the points evaluated to f(x) = −1, +1, and 0 in the figure. They correspond to the hyperplanes w · Φ(x) + b = −1, +1, and 0 in the kernel-induced feature space. The support vectors are the instances falling into the wrong region in the feature space. The curve corresponding to f(x) = 0 is the decision boundary in the original space, which is the optimal separating hyperplane in the feature space. All support vectors (the squared points) are required to be kept in the classifier in order to do kernel evaluations with the testing instance, i.e., to compute the decision function (6) and decide on which side of the separating hyperplane in the feature space the testing instance falls. Releasing the classifier will expose the private content of the support vectors, which are intact tuples of a subset of the training data, therefore violating privacy.

Fig. 4. A Gaussian kernel SVM classifier: All support vectors must be kept in the classifier, which violates privacy.

4 PRIVACY-PRESERVING SVM CLASSIFIER

The objective of our work is to construct a method which makes possible the release of the Gaussian kernel SVM classifier with privacy-preservation. In this section, we construct the privacy-preserving SVM classifier (PPSVC). The PPSVC precisely approximates the Gaussian kernel SVM classifier while protecting the private content of the support vectors, and therefore enables the release of the SVM classifier with privacy-preservation.

4.1 Construction of the Privacy-Preserving Decision Function

The decision function of the Gaussian kernel SVM classifier is

f(x) = \sum_{i=1}^{m'}\alpha_i y_i \exp(-g\|SV_i - x\|^2) + b \qquad (8)

The components of the classifier are the kernel parameter g, the bias term b, the support vectors {(SV_1, y_1), ..., (SV_{m'}, y_{m'})}, and their corresponding supports {α_1, ..., α_{m'}}. The content of each attribute vector SV_i is considered to be sensitive, but the class labels y_i usually are not. We intend to destroy the content of all support vectors' attribute vectors in the decision function in an irreversible way, similar to the effect the linear combination has in the linear kernel SVM classifier, as mentioned in Section 3.2.

The value of the Gaussian kernel function K(x, y) = exp(−g||x − y||²) depends on the relative distance between two instances, ||x − y||. In the decision function (8), the term ||SV_i − x||², which is the square of the distance between the testing instance x and a support vector SV_i, can be computed as ||SV_i − x||² = ||SV_i||² − 2(SV_i · x) + ||x||². So the decision function (8) can be equivalently formulated as

f(x) = b + \exp(-g\|x\|^2)\sum_{i=1}^{m'}\alpha_i y_i \exp(-g\|SV_i\|^2)\exp(2g\,SV_i \cdot x) \qquad (9)

In (9), the expanded form of the decision function, there are two terms containing support vectors inside the summation operator: exp(−g||SV_i||²) and exp(2g SV_i · x). The former depends merely on the magnitude of SV_i and hence can be computed a priori. All exp(−g||SV_i||²), i = 1 to m', can be combined with α_i y_i into constants {c_1, c_2, ..., c_{m'}} as

c_i = \alpha_i y_i \exp(-g\|SV_i\|^2) \quad \text{for } i = 1 \text{ to } m'

Then the decision function becomes

f(x) = \exp(-g\|x\|^2)\sum_{i=1}^{m'} c_i \exp(2g\,SV_i \cdot x) + b \qquad (10)

The term exp(−g||x||²) extracted from the summation operator in (9) and (10) is a scalar related only to the testing instance x; it has no connection with the privacy of the training data. Now the support vectors exist only in the term exp(2g SV_i · x) inside the summation operator. We proceed to tackle it by replacing the exponential function with its infinite series representation:

\exp(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{d=0}^{\infty}\frac{x^d}{d!} \qquad (11)
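The step from (8) to (9)-(10) is purely algebraic, and it is easy to check numerically; the sketch below (a toy example with made-up numbers, not code from the paper) evaluates both forms of the decision function and confirms they coincide.

```python
import numpy as np

rng = np.random.RandomState(2)
g, b = 0.25, -0.1
SV = rng.uniform(-1, 1, size=(5, 3))        # 5 toy "support vectors" in R^3
alpha_y = rng.uniform(-1, 1, size=5)        # the products alpha_i * y_i
x = rng.uniform(-1, 1, size=3)              # a testing instance

# Original Gaussian-kernel decision function (8).
f_orig = alpha_y @ np.exp(-g * np.sum((SV - x) ** 2, axis=1)) + b

# Expanded form (10): ||SV_i - x||^2 = ||SV_i||^2 - 2 SV_i.x + ||x||^2,
# with c_i = alpha_i y_i exp(-g ||SV_i||^2) folded into constants.
c = alpha_y * np.exp(-g * np.sum(SV ** 2, axis=1))
f_expanded = np.exp(-g * x @ x) * (c @ np.exp(2 * g * SV @ x)) + b

print(f_orig, f_expanded)                   # the two values coincide
```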

By replacing exp(2g SV_i · x) with its infinite series representation \exp(2g\,SV_i \cdot x) = \sum_{d=0}^{\infty}\frac{(2g\,SV_i \cdot x)^d}{d!}, the term \sum_{i=1}^{m'} c_i \exp(2g\,SV_i \cdot x) of the decision function (10) becomes

\sum_{i=1}^{m'} c_i \exp(2g\,SV_i \cdot x) = \sum_{i=1}^{m'} c_i \sum_{d=0}^{\infty}\frac{(2g\,SV_i \cdot x)^d}{d!} = c_1\sum_{d=0}^{\infty}\frac{(2g)^d (SV_1 \cdot x)^d}{d!} + \cdots + c_{m'}\sum_{d=0}^{\infty}\frac{(2g)^d (SV_{m'} \cdot x)^d}{d!} = \sum_{i=1}^{m'} c_i + \sum_{d=1}^{\infty}\frac{(2g)^d}{d!}\left(\sum_{i=1}^{m'} c_i (SV_i \cdot x)^d\right) \qquad (12)

It follows that the support vectors exist only in the term (SV_i · x)^d of the inner summation operator. We next take a key step by applying the monomial feature mapping. The form (x · y)^d corresponds to the monomial feature kernel [23], which can be defined as the dot product of the monomial feature mapped x and y, (x \cdot y)^d = \Phi_d(x) \cdot \Phi_d(y), where Φ_d() is the order-d monomial feature mapping (the rationale of the monomial feature mapping is given in Section 4.1.1).

Thus the (SV_i · x)^d in (12), in its monomial feature kernel form, can be equivalently computed by the dot product of the order-d monomial feature mapped support vector SV_i and testing instance x:

(SV_i \cdot x)^d = \Phi_d(SV_i) \cdot \Phi_d(x)

A key step arises from writing the monomial feature kernel as the dot product of monomial feature mapped instances. By replacing (SV_i · x)^d with Φ_d(SV_i) · Φ_d(x) in (12), we have

\sum_{i=1}^{m'} c_i + \sum_{d=1}^{\infty}\frac{(2g)^d}{d!}\left(\sum_{i=1}^{m'} c_i\,\Phi_d(SV_i) \cdot \Phi_d(x)\right) = \sum_{i=1}^{m'} c_i + \sum_{d=1}^{\infty}\Phi_d(x) \cdot \left(\frac{(2g)^d}{d!}\sum_{i=1}^{m'} c_i\,\Phi_d(SV_i)\right) \qquad (13)

It is noted that in each order-d monomial feature mapped space, all the order-d monomial feature mapped support vectors {Φ_d(SV_1), ..., Φ_d(SV_{m'})} can be linearly combined into one vector:

w_d = \frac{(2g)^d}{d!}\sum_{i=1}^{m'} c_i\,\Phi_d(SV_i) \qquad (14)

In each w_d, all support vectors are mapped into the order-d monomial feature space and linearly combined, and hence the content of each support vector SV_i has been destroyed in the linear combination, similar to w = \sum_{i=1}^{m'}\alpha_i y_i SV_i in the linear kernel SVM classifier (7).

We then let

w_0 = \sum_{i=1}^{m'} c_i \qquad (15)

By substituting both (14) and (15) into (13), which is equivalent to the term \sum_{i=1}^{m'} c_i \exp(2g\,SV_i \cdot x) of the decision function (10), (13) can be represented as

\sum_{i=1}^{m'} c_i + \sum_{d=1}^{\infty}\Phi_d(x) \cdot \left(\frac{(2g)^d}{d!}\sum_{i=1}^{m'} c_i\,\Phi_d(SV_i)\right) = w_0 + \sum_{d=1}^{\infty}\Phi_d(x) \cdot w_d \qquad (16)

By feeding (16) into the decision function (10), the decision function becomes

f(x) = \exp(-g\|x\|^2)\left(w_0 + \sum_{d=1}^{\infty}\Phi_d(x) \cdot w_d\right) + b \qquad (17)

This is the privacy-preserving form of the decision function of the Gaussian kernel SVM classifier. In this new form of the decision function, the data which need to be preserved in the classifier are the w_d of each order d instead of the support vectors in the original decision function. The private content of the support vectors has been destroyed by the linear combinations, and the information needed to perform classification, originally provided by the support vectors, is now given by the w_d's, which are linear combinations of monomial feature mapped support vectors.

The privacy-preserving decision function (17) contains an infinite series: the linear combinations w_d of monomial feature mapped support vectors from order 1 to order ∞, and the monomial feature mapped testing instance Φ_d(x) from order 1 to order ∞. The infinite complexity of the privacy-preserving decision function is surely impractical. However, since the infinite series in the privacy-preserving decision function is a Taylor series, it can be precisely approximated near the evaluating point by only a small number of low-order terms, which makes practical use possible. Later we will study the precision of approximating by the Taylor polynomial, both in theoretical analyses and in empirical experiments, to show that the privacy-preserving decision function can be precisely approximated by using merely a few low-order terms of the infinite series. Before going to the approximation of the privacy-preserving decision function, we first present the monomial feature mapping.
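Before introducing the monomial feature mapping, it may help to see how quickly the Taylor series behind (12) and (17) converges. The sketch below (our own illustration; the constants are arbitrary) truncates the inner series of (12) at a degree d_u and compares the result against the exact value of Σ_i c_i exp(2g SV_i · x).

```python
import numpy as np
from math import factorial

rng = np.random.RandomState(3)
g = 1.0 / 3                                  # g = 1/n with n = 3 attributes
SV = rng.uniform(-1, 1, size=(5, 3))         # toy "support vectors"
c = rng.uniform(-1, 1, size=5)               # stands for c_i = alpha_i y_i exp(-g||SV_i||^2)
x = rng.uniform(-1, 1, size=3)

exact = c @ np.exp(2 * g * SV @ x)           # sum_i c_i exp(2g SV_i . x)

# Truncation of (12): sum_i c_i + sum_{d=1}^{du} (2g)^d/d! * sum_i c_i (SV_i . x)^d
for du in range(1, 9):
    approx = c.sum() + sum((2 * g) ** d / factorial(d) * (c @ (SV @ x) ** d)
                           for d in range(1, du + 1))
    print(du, abs(exact - approx))           # the error shrinks rapidly with du
```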

4.1.1 Monomial Feature Mapping

Lemma 1 below states how the monomial feature mapping replaces (x · y)^d by the dot product of Φ_d(x) and Φ_d(y), the order-d monomial feature mapped x and y.

Lemma 1: For x, y ∈ R^N and d ∈ N, the monomial feature kernel K(x, y) = (x · y)^d generates the order-d monomial features of x and y. Suppose x and y are n-dimensional. The feature map of this kernel can be defined coordinate-wise as

\Phi_{\mathbf{m}}(x) = \sqrt{\frac{d!}{\prod_{i=1}^{n} m_i!}}\;\prod_{i=1}^{n} x_i^{m_i} \qquad (18)

for every \mathbf{m} \in \mathbb{N}^n with \sum_{i=1}^{n} m_i = d. Every such m corresponds to one dimension of the monomial features [22], [23].

Proof: All terms in the expansion of (x \cdot y)^d = (x_1 y_1 + \cdots + x_n y_n)^d are of the form (x_1 y_1)^{m_1}(x_2 y_2)^{m_2}\cdots(x_n y_n)^{m_n}, where each m_i is an integer with 0 ≤ m_i ≤ d and \sum_{i=1}^{n} m_i = d. By the multinomial theorem, the coefficient of each (x_1 y_1)^{m_1}(x_2 y_2)^{m_2}\cdots(x_n y_n)^{m_n} term is \frac{d!}{m_1! m_2! \cdots m_n!}. Thus each dimension of the monomial feature mapped x is \sqrt{\frac{d!}{m_1! m_2! \cdots m_n!}}\, x_1^{m_1} x_2^{m_2}\cdots x_n^{m_n} for every m ∈ N^n with \sum_{i=1}^{n} m_i = d.

A simple example to illustrate the monomial feature mapping is given as follows. The order-2 monomial feature kernel of x, y ∈ R^2 [6], [23] is

(x \cdot y)^2 = ((x_1, x_2) \cdot (y_1, y_2))^2 = (x_1 y_1 + x_2 y_2)^2 = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\,y_1 y_2, y_2^2)

From Lemma 1, the m's satisfying \sum_{i=1}^{2} m_i = 2 with 0 ≤ m_i ≤ 2 are (2, 0), (1, 1), and (0, 2). So the order-2 monomial features of x ∈ R^2 are x_1^2, x_1 x_2, and x_2^2. The corresponding coefficients are 1, \sqrt{2}, and 1, from (18) of Lemma 1. Hence the order-2 monomial feature mapping of x = (x_1, x_2) (and of y, respectively) is (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2).

The dimensionality of the order-d monomial feature mapping for n-dimensional vectors is stated in Lemma 2 below.

Lemma 2: For x ∈ R^n, the dimensionality of x's order-d monomial feature mapping is \binom{d+n-1}{d}.

Proof: From Lemma 1, every m ∈ N^n with \sum_{i=1}^{n} m_i = d, where each m_i is an integer with 0 ≤ m_i ≤ d, corresponds to one dimension of the monomial features. Enumerating all such m's is equivalent to finding all integer solutions of the equation m_1 + m_2 + \cdots + m_n = d with m_i ≥ 0 for i = 1 to n, which in turn is equivalent to enumerating all size-d combinations with repetition from n kinds of objects, and the number of such combinations is \binom{d+n-1}{d}.

To generate the monomial feature mapping, an algorithm to enumerate all size-d combinations with repetition from n kinds of objects is required. Due to the space limit, its detail is omitted. Notice that the monomial feature mappings do not need to be generated at testing time. To make a more efficient classifier, those mappings can be generated off-line, and the classifier simply uses the corresponding mapping.

Fig. 5. The PPSVC's approximation of the decision boundary from du = 1 to du = 8. The solid curve in each sub-figure is the original decision boundary, and the dotted curve is the approximation of the PPSVC. From du = 5, the two curves are almost overlapped together.

4.2 PPSVC: Approximation of the Privacy-Preserving Decision Function

The privacy-preserving decision function (17), an equivalent of the Gaussian kernel SVM's decision function (8), contains an infinite series. That infinite series comes from the exponential function, and hence it can be approximated by a finite number of low-order terms, just as exp(x) is approximated by the low-order terms of its infinite series representation (11). The approximation of the privacy-preserving decision function is done by taking the summation of the low-order terms from d = 1 to a user-specified approximation degree (denoted as d_u), and the approximated privacy-preserving decision function is then

f(x) = \exp(-g\|x\|^2)\left(w_0 + \sum_{d=1}^{d_u}\Phi_d(x) \cdot w_d\right) + b \qquad (19)

This is the decision function of our privacy-preserving SVM classifier (PPSVC). Users may intend to have an approximated classifier that provides a decision boundary similar to the original one. The higher the user-specified approximation degree d_u is, the closer the approximated decision function gets to the original decision function. The approximated privacy-preserving decision function becomes equivalent to the original Gaussian kernel SVM's decision function when d_u approaches infinity. However, a high approximation degree d_u results in a classifier of high computational complexity due to the high dimensionality of the monomial feature mappings. From Lemma 2, for n-dimensional data the dimensionality of the order-d monomial feature mapping is \binom{d+n-1}{d}; if the value of d_u is too big, the monomial feature mapping will become intractable. However, the property of quick approximation of exp(x) by low-order terms of its infinite series representation also holds in the PPSVC. We will show that a PPSVC with a low approximation degree d_u (usually d_u ≤ 5) is enough to obtain a precise approximation, and hence classification accuracy close to the original, both by theoretical reasoning and by empirical experiments.

Fig. 5 illustrates a series of approximated decision boundaries generated by the PPSVC from d_u = 1 to d_u = 8 to approximate the decision boundary of the Gaussian kernel SVM classifier trained in Fig. 4. In each sub-figure, the solid curve is the decision boundary of the original Gaussian kernel SVM classifier, and the dotted curve is the approximated decision boundary generated by the PPSVC. We can see that after d_u = 3, the approximated decision boundary generated by the PPSVC becomes very close to the original one. From d_u = 5, the PPSVC provides almost the same decision boundary as the original. Similar to the approximation of exp(x) by low-order terms of its infinite series representation, the PPSVC (19) well approximates the Gaussian kernel SVM classifier by the low-order terms of the privacy-preserving decision function (17).

Fig. 6. The flow diagram of deriving the privacy-preserving SVM classifier (PPSVC).

Figure 6 shows the flow of deriving the PPSVC. After training the SVM, a subset of the training data is selected as support vectors, accompanied by the coefficients (α_i y_i, SV_i), i = 1, ..., m', to form the decision function of the classifier. To transform and approximate the decision function by the approximated privacy-preserving decision function (19), the support vectors are first mapped to the monomial feature spaces from order 1 to order d_u, where d_u is pre-determined. Then the components of the approximated privacy-preserving decision function, w_0 and the w_d, each of which is the linear combination of the order-d monomial feature mapped support vectors, for d = 1 to d_u, can be computed by (14) and (15).

The components of the original Gaussian kernel SVM classifier are the support vectors {SV_1, ..., SV_{m'}} and their corresponding coefficients {α_1 y_1, ..., α_{m'} y_{m'}}, the bias term b, and the kernel parameter g. The b and g are common components of both the original SVM classifier and the PPSVC. Compared to the original SVM classifier, the PPSVC does not incorporate intact tuples of support vectors and their coefficients, but rather the linear combinations of monomial feature mapped support vectors w_0 and w_1 to w_{d_u}. The w_d's comprise the essential information of the support vectors needed to precisely approximate the Gaussian kernel SVM classifier without exposing the support vectors' sensitive content, since the attribute values of the support vectors have been destroyed by the linear combinations. The approximation of the PPSVC is similar to image compression, which stores only low-frequency components to compress images. In the PPSVC, the significant information provided by the support vectors to form a classifier is compressed into the low-order w_d terms.

4.3 Complexity of the PPSVC

The complexity of the resulting privacy-preserving classifier (PPSVC) depends on the dimensionality of the monomial feature mappings from d = 1 to d_u. The dimensionality of the monomial feature mapping of order d is \binom{d+n-1}{d}, where n is the dimensionality of the input data. Hence the complexity of the PPSVC is O\!\left(\sum_{d=1}^{d_u}\binom{d+n-1}{d}\right), which is the summation of the dimensions of the monomial feature mappings from order 1 to d_u. If d_u is large, the complexity of the PPSVC will be very high. However, the good approximation property benefits the PPSVC, and a large d_u is hence unnecessary. We will verify this claim on real data in the experiments. On most benchmark data a small d_u (d_u ≤ 5) obtains nearly the same classification accuracy as the original SVM classifier. Therefore it is empirically suggested to set d_u = 5 to obtain a good trade-off between the approximating precision and the complexity of the PPSVC. A user who intends to have a more precise approximation may estimate the resulting complexity and select a d_u as high as possible within the tolerable complexity.

If the dimensionality of the data is large (n > 100), the dimensionality of the monomial feature mapping may become intractable. In this case, feature selection needs to be applied to select the important features (attributes) which affect the classification performance most, and this subset of attributes is then used to train the SVM classifier. The work of [24] suggests several feature selection strategies which work well with the SVM. For example, the F-score can be used to measure the discriminative ability of an attribute. For high-dimensional data, by selecting an appropriate number of attributes with high F-scores (for instance, the top 50 attributes) to train an SVM classifier, the PPSVC can then be well constructed from this classifier.

For an SVM classifier which has a large number of support vectors but low-dimensional data, the PPSVC provides an extra benefit: the complexity of the PPSVC can be lower than that of the original SVM classifier. The complexity of the original Gaussian kernel SVM classifier is O(nm'), where m' denotes the number of support vectors. For large-scale training datasets or a small cost parameter C, the SVM may result in a large number of support vectors. Unlike the original SVM classifier, the complexity of the PPSVC is independent of the number of support vectors. Thus with a small d_u, the PPSVC can be more efficient than the original SVM classifier.

5 SECURITY AND APPROXIMATING PRECISION OF THE PPSVC

In this section, we first show the PPSVC's security with respect to protecting the private content of the support vectors. Then we discuss the precision of the PPSVC's approximation to the original SVM classifier.

5.1 Security of the PPSVC Against Adversarial Attacks

Consider the case where an adversarial attacker knows the content of part of the support vectors and wants to recover the content of the remaining support vectors from the components of the PPSVC. Without loss of generality, suppose that there are in total m' support vectors, and the attacker knows the content of m'−1 of them. Lemma 3 below proves that by keeping secret the supports α_i, i = 1, ..., m' of the support vectors {(SV_1, y_1), ..., (SV_{m'}, y_{m'})} of the original SVM classifier, the remaining instance cannot be recovered with the help of the information disclosed by the PPSVC.

The supports are the optimal solution of the SVM's optimization problem (3). The support vectors are the instances assigned non-zero supports to form the optimal separating hyperplane in the feature space. To obtain the supports, knowledge of the complete training data (including the instances with zero support) and of the cost and kernel parameters is required. Due to the optimizing property of the SVM's formulation, one who does not have exactly the same data and parameters is in general unable to arrive at the same optimal solution. Especially with the Gaussian kernel, whose implicit mapping is infinite-dimensional, a slight difference in the training data may lead to a very different solution, and hence different supports and support vectors. Reverse engineering from the approximated decision boundary of the PPSVC is also difficult since the attacker knows only the content of part of the support vectors. Hence it is reasonable to assume that an attacker who knows only part of the training data cannot obtain the supports.

Lemma 3: By keeping secret the supports α_1, ..., α_{m'} of the support vectors, an adversarial attacker who knows the content of m'−1 support vectors is not able to recover the content of the remaining one from the components of the PPSVC.

Proof: Suppose that the support vectors known by the attacker are {(SV_1, y_1), ..., (SV_{m'-1}, y_{m'-1})}, and the attacker wants to derive the content of SV_{m'} from the known instances and the components of the PPSVC. The components of the PPSVC are w_0, w_1 to w_{d_u}, the kernel parameter g, and the bias term b. Each w_d is the linear combination of the order-d monomial feature mapped support vectors Φ_d(SV_1), Φ_d(SV_2), ..., Φ_d(SV_{m'}). Let w_{d,k}, k = 1, ..., \binom{d+n-1}{d}, denote the elements of the \binom{d+n-1}{d}-dimensional vector w_d, 1 ≤ d ≤ d_u. Each w_{d,k} corresponds to a unique monomial feature (m_{d,k,1}, ..., m_{d,k,n}), which satisfies \sum_{j=1}^{n} m_{d,k,j} = d and m_{d,k,j} \in \mathbb{N} \cup \{0\}, j = 1, ..., n. Let v_{i,j}, j = 1, ..., n denote the n attribute values of the i-th support vector SV_i, 1 ≤ i ≤ m'. A w_{d,k}, 1 ≤ d ≤ d_u, 1 ≤ k ≤ \binom{d+n-1}{d}, is computed by

w_{d,k} = \frac{(2g)^d}{d!}\sqrt{\frac{d!}{\prod_{j=1}^{n} m_{d,k,j}!}}\;\sum_{i=1}^{m'} c_i \prod_{j=1}^{n} v_{i,j}^{m_{d,k,j}}

First, consider a single w_{d,k}. Since w_d is a linear combination of the order-d monomial feature mapped support vectors and w_{d,k} is the k-th element of w_d, w_{d,k} is a linear combination of the k-th monomial features of the order-d monomial feature mapped support vectors. The coefficient paired with SV_i in the linear combination is c_i, 1 ≤ i ≤ m'. Since w_{d,k} is a linear combination of the k-th monomial features of SV_1, ..., SV_{m'} respectively, to derive the k-th monomial feature of SV_{m'} from w_{d,k} when SV_1, ..., SV_{m'-1} are known, the coefficients c_1, ..., c_{m'} of the linear combination are required. The coefficients c_i come from c_i = α_i y_i exp(−g||SV_i||²), i = 1, ..., m'. Since the supports α_i, i = 1, ..., m' are kept secret, the attacker does not know the coefficients c_1, ..., c_{m'} of the linear combination. Hence the content of SV_{m'} cannot be derived from a single w_{d,k}.

The following considers all the w_{d,k}'s to show that the content of SV_{m'} cannot be derived by eliminating the c_i's between w_{d,k}'s with the help of the known instances SV_1, ..., SV_{m'-1}. Without loss of generality, suppose m' = 2, i.e., there are two support vectors {SV_1, SV_2} ⊂ R^n, where the content of SV_1 = (v_{1,1}, ..., v_{1,n}) is known by the attacker, and the attacker intends to obtain the content of SV_2 = (v_{2,1}, ..., v_{2,n}). The following discussion of this m' = 2 case can be generalized to m' > 2 where m'−1 of the total m' support vectors are known by the attacker. Let c_1 and c_2 denote the corresponding coefficients of SV_1 and SV_2 respectively. Then w_{d,k} is computed by

w_{d,k} = \frac{(2g)^d}{d!}\sqrt{\frac{d!}{\prod_{j=1}^{n} m_{d,k,j}!}}\left(c_1\prod_{j=1}^{n} v_{1,j}^{m_{d,k,j}} + c_2\prod_{j=1}^{n} v_{2,j}^{m_{d,k,j}}\right)

where (m_{d,k,1}, ..., m_{d,k,n}) corresponds to the (d, k) pair. Since v_{1,j}, j = 1, ..., n are known, it is possible to obtain formulas having the same c_1-paired terms by multiplying or dividing some w_{d,k}'s by some \prod_{j=1}^{n} v_{1,j}^{q_j} with certain (q_1, ..., q_n). The terms paired with c_1 are then identical and hence can be eliminated by subtraction between the formulas. For example, let n = 2, i.e., the data is 2-dimensional. Then w_{1,1} = 2g(c_1 v_{1,1} + c_2 v_{2,1}) and w_{1,2} = 2g(c_1 v_{1,2} + c_2 v_{2,2}), so w_{1,1} v_{1,2} − w_{1,2} v_{1,1} = 2g c_2 (v_{2,1} v_{1,2} − v_{2,2} v_{1,1}). With another formula in which the c_1-paired terms have been eliminated, the coefficient c_2 can then be eliminated by division between the two formulas.

where $(m_{(d,k)_1,1}, \ldots, m_{(d,k)_1,n}) \ne (m_{(d,k)_2,1}, \ldots, m_{(d,k)_2,n})$ and $(q_{1,1}, \ldots, q_{1,n}) \ne (q_{2,1}, \ldots, q_{2,n})$. Eliminating the identical $c_1$-paired terms by subtraction will result in the combination of $\prod_{j=1}^{n} v_{2,j}^{m_{d,k,j}}$ with different $(m_{d,k,1}, \ldots, m_{d,k,n})$'s, where the coefficients in the combination are $\prod_{j=1}^{n} v_{1,j}^{q_j}$ with different $(q_1, \ldots, q_n)$'s, as

$$w_{(d,k)_1} \prod_{j=1}^{n} v_{1,j}^{q_{1,j}} - w_{(d,k)_2} \prod_{j=1}^{n} v_{1,j}^{q_{2,j}} = c_2 \left( \prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_1,j}} \prod_{j=1}^{n} v_{1,j}^{q_{1,j}} - \prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_2,j}} \prod_{j=1}^{n} v_{1,j}^{q_{2,j}} \right)$$

(the factor $\frac{(2g)^d}{d!} \frac{d!}{\prod_{j=1}^{n} m_{d,k,j}!}$ of $w_{d,k}$ is omitted for simplicity since it can be removed by division). Since $(m_{(d,k)_1,1}, \ldots, m_{(d,k)_1,n}) \ne (m_{(d,k)_2,1}, \ldots, m_{(d,k)_2,n})$ and $(q_{1,1}, \ldots, q_{1,n}) \ne (q_{2,1}, \ldots, q_{2,n})$, the terms $\prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_1,j}}$ and $\prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_2,j}}$ from $SV_2$ cannot be separated from the terms $\prod_{j=1}^{n} v_{1,j}^{q_{1,j}}$ and $\prod_{j=1}^{n} v_{1,j}^{q_{2,j}}$ from $SV_1$. For example, in $w_{1,1} v_{1,2} - w_{1,2} v_{1,1} = 2g c_2 (v_{2,1} v_{1,2} - v_{2,2} v_{1,1})$ illustrated above, $v_{2,1}$ and $v_{2,2}$ cannot be separated from $v_{1,1}$ and $v_{1,2}$. Therefore, the value of either $\prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_1,j}}$ or $\prod_{j=1}^{n} v_{2,j}^{m_{(d,k)_2,j}}$, i.e., the monomial features of $SV_2$, cannot be derived. The above discussion can be extended in a similar way to $m > 2$ support vectors where $m - 1$ are known by the attacker. This concludes the proof that removing the coefficients $c_i$'s cannot extract the content of $SV_m$.
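To make the coefficient formula above concrete, the following sketch (not the authors' implementation; the names SV, c, g, and d_u are hypothetical, and c_i stands for whatever per-support-vector coefficient the decision function pairs with $SV_i$) enumerates the exponent vectors $(m_{d,k,1}, \ldots, m_{d,k,n})$ of each order $d \le d_u$ and computes the corresponding $w_{d,k}$ from the support vectors. It is these aggregated sums over all support vectors, rather than the support vectors themselves, that a released PPSVC would contain.

```python
from itertools import combinations_with_replacement
from math import factorial
import numpy as np

def ppsvc_coefficients(SV, c, g, d_u):
    """Sketch of the w_{d,k} formula above.
    SV: (m, n) array of support vectors; c: the m coefficients c_i;
    g: Gaussian kernel parameter; d_u: approximation degree."""
    n = SV.shape[1]
    w = {}  # maps (d, (m_{d,k,1}, ..., m_{d,k,n})) -> w_{d,k}
    for d in range(1, d_u + 1):
        # every multiset of d attribute indices gives one exponent vector summing to d
        for combo in combinations_with_replacement(range(n), d):
            expo = np.bincount(combo, minlength=n)
            multinom = factorial(d) // np.prod([factorial(int(e)) for e in expo])
            coef = (2 * g) ** d / factorial(d) * multinom  # (2g)^d/d! * d!/prod_j m_{d,k,j}!
            # the support vectors enter only through this aggregated sum
            total = float(np.sum(c * np.prod(SV ** expo, axis=1)))
            w[(d, tuple(int(e) for e in expo))] = coef * total
    return w
```

For example, with n = 10 attributes and $d_u = 5$ this enumeration yields 10 + 55 + 220 + 715 + 2002 = 3002 coefficients (excluding the constant term), regardless of how many support vectors the SVM has.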
5.2 Approximating Precision Issues of the PPSVC

The PPSVC is an approximation of the Gaussian kernel SVM classifier. It approximates the exponential function

$$\exp(2g\, SV_i \cdot x) \qquad (20)$$

by a Taylor polynomial. Therefore, in addition to the degree of the Taylor polynomial, there is another factor affecting the approximating precision of the PPSVC: the evaluating point of the exponential function in (20).

The infinite series representation of (20) adopted in the PPSVC is the Taylor series of exp(x) at zero. According to Taylor's theorem, the Taylor series at 0 evaluated at x will be equal to the original function if x is sufficiently close to 0. If the evaluating point of the exponential function (20) is too far from zero, the approximation of the PPSVC will be degraded.

This potential precision problem can be prevented by taking a careful look at the guidelines for the practical use of the SVM [22], [25] and the properties of the SVM with the Gaussian kernel. One of the factors which influence the evaluating point of the exponential function (20) is the dot product between the testing instance x and the support vector $SV_i$. The guidelines for the practical use of the SVM [22], [25] suggest scaling the value of each attribute to an appropriate range such as [0, 1] or [−1, 1] in the pre-processing step, to prevent attributes with greater numeric ranges from dominating those with smaller ranges. Scaling the data also avoids numerical difficulty and prevents overflow. The other factor is the value of the kernel parameter g of the Gaussian kernel function. The value of g is usually suggested to be small [22]. One reason is to prevent the numerical values from getting extremely large as the dimension of the data increases. The other reason is that a large g may cause the overfitting problem. The Gaussian kernel function represents each instance by a bell-shaped function sitting on the instance, which represents its similarity to all other instances. A large g means that the instance is more dissimilar to the others: the kernel memorizes the data and becomes local, and the resulting classifier tends to overfit the data [22]. To prevent the overfitting problem and numerical difficulty, a simple strategy is to set g = 1/n, where n denotes the dimension of the data; g = 1/n is also the default setting of LIBSVM [26]. With the attribute values of the data scaled to [−1, 1] and g = 1/n, the argument of the exponential function (20) is at most within ±2 of the defined point 0 of the Taylor series, since $|2g\, SV_i \cdot x| \le 2g \sum_{j=1}^{n} |v_{i,j}||x_j| \le 2gn = 2$. Fig. 7 shows the approximation of the exponential function by its 5-order Taylor polynomial at 0. For evaluating points within [−2, 2], the value of the 5-order Taylor polynomial almost overlaps with the actual value of the exponential function.

Fig. 7. The approximation of exp(x) by its 5-order Taylor polynomial at 0.

It is noted that the values of both the kernel parameter g and the cost parameter C of the SVM are usually chosen by cross-validation to select an appropriate parameter combination for training the SVM [22]. LIBSVM [25], [26] suggests grid-search over combinations of (C, g) in exponential growth using cross-validation. For example, the default grid-search range of LIBSVM is C = $2^{-5}$ to $2^{15}$ and g = $2^{-15}$ to $2^{3}$. In order to have good approximating precision in the PPSVC, we constrain the upper bound of g's search range to be 1/n, which keeps the evaluating point of (20)'s Taylor polynomial within ±2 of 0.

This constraint on g's grid-search range will not appreciably affect the classification performance, since the value of g chosen by grid-search using cross-validation is usually very small and does not exceed 1/n. This comes from the fact that a Gaussian kernel with large g is prone to overfit the data, which usually results in poor accuracy in cross-validation. We validate this claim by experimenting on real data to show that the classification performance does not vary much when the 1/n upper bound is imposed on g in grid-search using cross-validation. The experimental results are shown in Section 6.2.

With the data scaled to [−1, 1] as suggested by the SVM practitioner's guidelines and with g constrained to be smaller than 1/n, the evaluating point of (20)'s Taylor polynomial at 0 can be guaranteed to be within ±2, and therefore the potential precision problem which may be caused by a far evaluating point of the Taylor polynomial is prevented.
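As a quick numeric illustration of this guarantee (a sketch, not part of the original experiments), one can check how closely the 5-order Taylor polynomial of exp(x) at 0 follows exp(x) on [−2, 2], the range of the evaluating point established above:

```python
import numpy as np
from math import factorial

x = np.linspace(-2.0, 2.0, 401)                          # the range guaranteed above
taylor5 = sum(x ** k / factorial(k) for k in range(6))   # 1 + x + x^2/2! + ... + x^5/5!
err = np.abs(np.exp(x) - taylor5)
print(err.max(), err.max() / np.exp(2.0))                # worst absolute and relative error on [-2, 2]
```

The worst-case error on this interval stays small relative to exp(2), consistent with the near-overlap of the two curves in Fig. 7.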
6 EXPERIMENTAL ANALYSIS

In this section, we evaluate the effectiveness of the PPSVC by comparing the classification accuracy of the original SVM classifier and its privacy-preserving approximation by the PPSVC. We also evaluate the influence of the kernel parameter upper bound suggested in Section 5.2, and study the scalability of the PPSVC. Additionally, we compare the classification performance with the anonymous data publishing technique k-anonymity.

6.1 Approximating the SVM Classifier

The objective of the PPSVC is to precisely approximate the SVM classifier without compromising the privacy of the training instances which are selected as support vectors. We test the approximation ability of the PPSVC by comparing the accuracy of the original SVM classifiers and their corresponding PPSVCs with different approximation degrees $d_u$.

We consider several public real datasets available in the UCI machine learning repository [27] to evaluate the performance of the PPSVC. We select some medical datasets to test the effectiveness of the PPSVC in medical applications, as mentioned in Section 1. The classifiers trained from such medical datasets are for predicting whether a patient is subject to a specific disease. The Wisconsin breast cancer dataset, which contains clinical cases of breast cancer detection, is for predicting whether a tumor is benign or malignant. The liver disorders dataset contains various blood test and drinking behavior records to learn a classifier for predicting liver disorders arising from excessive alcohol consumption. The Pima Indian diabetes dataset consists of medical records of females of Pima Indian heritage, which are used to learn classifiers to predict whether a patient is subject to diabetes, and the Statlog heart dataset is a heart disease database. We also select two credit datasets to test the effectiveness of the PPSVC for predicting the credit of customers. The Statlog Australian credit approval dataset is for credit card applications. The Statlog German credit dataset is for classifying people as good or bad credit risks. This dataset comes in two formats, one containing both categorical and numeric attributes and the other purely numeric; we adopt the purely numeric version for ease of use with the SVM. Two physical datasets, ionosphere and sonar, are also selected to test the effectiveness of the PPSVC on various applications. Targets of the radar data in the ionosphere dataset are free electrons in the ionosphere, and the label indicates whether the signal shows evidence of some type of structure in the ionosphere. The sonar dataset is for training a classifier to discriminate whether sonar signals bounced off metal or rock.

For ease of experimentation, the chosen datasets are all binary-class. For multi-class problems, the popular one-against-one or one-against-all methods [26] can also be applied to the PPSVC. The statistics of the datasets are given in the table below.

Dataset        Heart  Ionosphere  Liver  Diabetes  Australian  German  Sonar  Breast
# instances    270    351         345    768       690         1000    208    683
# attributes   13     34          6      8         14          24      60     10

All attribute values have been scaled to [−1, 1] or [0, 1] in the preprocessing step to prevent the effects of differing value ranges and numerical difficulty. We use LIBSVM [26] as our tool to train the SVM classifiers. The values of the cost parameter C and the kernel parameter g used to train the SVM are determined by grid-search using cross-validation, where the upper bound of g's search range is the reciprocal of the number of attributes of each dataset, as discussed in Section 5.2. The SVM classifiers trained by LIBSVM are then transformed into PPSVCs to protect the support vectors. The experimental results comparing the classification accuracy of the original SVM classifier and the PPSVCs with $d_u = 1$ to $d_u = 5$ are shown in Figure 8. The classification accuracy reported in the figure is the 5-fold cross-validation average.

Fig. 8. Classification accuracy of the original SVM classifier and PPSVCs with $d_u$ = 1 to 5.
liver disorders from excessive alcohol consumption. The It is seen that the PPSVC with du = 1 usually does not
Pima Indian diabetes dataset are medical records of have good approximation to the original SVM classifier.
female Pima Indian heritage, which are used to learn Because with du = 1, the approximation of the infinite
classifiers to predict if a patient is subject to diabetes, series in the privacy-preserving decision function is sim-
and the Statlog heart dataset is a heart disease database. ilar to that of approximating exp(x) by 1 + x. This linear
We also select two credit datasets to test the effectiveness approximation usually cannot achieve good precision.
of the PPSVC for predicting the credit of customers. The On medical datasets, except the liver disorder dataset,
Statlog Australian credit approval dataset is for credit the PPSVC achieves the same accuracy with the original
card applications. The Statlog German credit dataset is SVM classifier in du =2 on the breast cancer, heart disease,
for classifying people to good or bad credit risks. This and diabetes datasets. The PPSVC can handle these
dataset comes with two formats, where one contains problems well in low approximation degree. On the liver
both categorical and numeric attributes, and the other disorder dataset, the PPSVC achieves the same accuracy
is pure numeric. We adopt the pure numeric version for until du = 5. Since the two classes of instances in this
the ease of using with the SVM. Two physical datasets, dataset are highly overlapped, a bit of difference in
ionosphere and sonar, are also selected to test the effec- the decision boundary will result in much variation in
tiveness of the PPSVC on various applications. Targets of the classifying results. This causes it to require higher
the radar data in the ionosphere dataset are free electrons precision in the approximation of the decision boundary
in the ionosphere. The label indicates if the signal shows to obtain a better classifying performance.
evidence of some type of structure in the ionosphere. The On the credit datasets German and Australian, the
sonar dataset is for training a classifier to discriminate PPSVC in low approximation degree is enough to give
whether the sonar signals bounced off metal or rock. very good approximation. In these datasets, many at-
For the ease of experiments, the chosen datasets are tributes are indicator variables which are transformed
all binary classes. For multi-class problems, the popular from original categorical attributes in preprocessing. The

The values of the indicator variables in the two classes of instances are separated clearly, which leaves only a few instances in the region close to the decision boundary. Hence a rough approximation is enough to achieve similar classification accuracy. Note that the better accuracy obtained by the PPSVC with $d_u = 1$ on Australian does not mean that the PPSVC achieves better performance; it is caused by the poor linear approximation at $d_u = 1$, where some overlapped instances are accidentally classified to their correct labels by the imprecise approximate decision boundary.

On the physical problems, the ionosphere dataset achieves a satisfying approximation at $d_u = 2$. The sonar dataset needs $d_u = 4$ to obtain similar accuracy. This may come from the larger kernel parameter determined by cross-validation on this dataset, which touches the upper bound 1/#attributes. A larger kernel parameter results in lower approximating precision, and hence requires a higher approximation degree.

In general, with $d_u = 2$ the classification accuracy of the PPSVC quickly gets close to that of the original SVM classifier. With $d_u = 3$, the PPSVC obtains almost the same classification accuracy as the original SVM classifier on most datasets. On all datasets, the PPSVC reaches the same classification accuracy as the original SVM classifier with $d_u \le 5$. This verifies our claim that the PPSVC can precisely approximate the original SVM classifier with a low approximation degree $d_u$, and hence results in a classifier of moderate complexity. The PPSVC with a low approximation degree can effectively approximate the SVM classifier while possessing the privacy-preserving property which protects the private content of support vectors.

6.2 Effect of the Constraint on Kernel Parameter

The following tests the influence of constraining the search upper bound of the kernel parameter g on the above datasets. In order to have better approximating precision in the PPSVC, an upper bound of 1/#attributes is imposed on the grid-search range of the kernel parameter g in the parameter search process of training the SVM. We have argued in Section 5.2 that this constraint will not appreciably affect the classification performance. Fig. 9 shows the comparison of classification accuracy between the SVM classifiers trained with grid-search-selected parameters with and without the 1/#attributes upper bound constraint on g. The accuracy shown in the figure is the 5-fold cross-validation average.

Fig. 9. Classification accuracy comparison of g with and without upper bound 1/#attributes in parameter search.

It is seen that the accuracy is similar between the two parameter selection schemes. The reason is that the value of g chosen by grid-search using cross-validation is usually very small and not bigger than 1/#attributes. The results validate the claim that the constraint on g's upper bound imposed by the PPSVC has almost no effect on the classification performance.

6.3 Scalability to Large-Scale Datasets

In the following, we test the scalability of the PPSVC on large-scale synthetic datasets. The datasets are 10-dimensional, sized from 1000 to 10000 instances. The two classes of data are of equal size, and on each dimension the values of the two classes follow normal distributions with means +0.5 and −0.5 respectively, both with variance 0.6, to cause partial overlapping. The parameters for training the SVM are the defaults of LIBSVM. Fig. 10 shows the comparison of classifier complexity between the original SVM classifier and the PPSVCs with $d_u = 1$ to $d_u = 5$. The horizontal axis denotes the size of the dataset, and the vertical axis denotes the complexity of the classifiers in units of the number of double-precision floating point numbers. It is seen that the classifier complexity of the original SVM classifier increases with the size of the dataset, since its complexity is proportional to the number of support vectors, and the number of support vectors increases with the size of the dataset. On the contrary, the complexity of the PPSVC is independent of the number of support vectors. With a small $d_u$, the PPSVC is well scalable to large-scale datasets which may result in a large number of support vectors.

Fig. 10. Comparison of classifier complexity between the original SVM classifier and the PPSVC.
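A rough sketch of this comparison is given below (not the authors' experiment code; the double counts follow one plausible bookkeeping, and the data are generated under the description above, with standard deviation sqrt(0.6) for variance 0.6). The SVM stores every support vector plus its coefficient, whereas the PPSVC stores one coefficient per monomial feature up to degree $d_u$, independently of the number of support vectors.

```python
import numpy as np
from math import comb
from sklearn.svm import SVC

def classifier_sizes(size, n=10, d_u=5, seed=0):
    rng = np.random.default_rng(seed)
    half = size // 2
    X = np.vstack([rng.normal(+0.5, np.sqrt(0.6), (half, n)),   # class +1
                   rng.normal(-0.5, np.sqrt(0.6), (half, n))])  # class -1
    y = np.r_[np.ones(half), -np.ones(half)]
    svm = SVC(kernel="rbf", C=1.0, gamma=1.0 / n).fit(X, y)     # LIBSVM-style defaults: C = 1, g = 1/n
    num_sv = svm.support_vectors_.shape[0]
    svm_doubles = num_sv * (n + 1)                               # each support vector plus its coefficient
    ppsvc_doubles = sum(comb(d + n - 1, d) for d in range(d_u + 1))  # one double per monomial feature
    return svm_doubles, ppsvc_doubles

for size in (1000, 4000, 7000, 10000):
    print(size, classifier_sizes(size))
```

The exact counts depend on bookkeeping details not spelled out here, but the trend matches Fig. 10: the SVM figure grows with the dataset size, while the PPSVC figure stays fixed (3003 monomial features for n = 10 and $d_u = 5$).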
6.4 Performance Comparison with k-Anonymity

In this section, we compare the performance of the PPSVC with the anonymous data publishing technique k-anonymity [18], [19]. Since the application scenario of the PPSVC is similar to classifying unanonymized data using SVM classifiers built upon anonymized data [19], we evaluate the performance of k-anonymity by using SVM classifiers trained from anonymized training data to classify unanonymized testing data. We compare the accuracy on three of the above datasets which include quasi-identifier attributes: Statlog heart has {age, sex}, Pima Indian diabetes has {age, number of times pregnant, body mass index}, and German credit has {purpose, credit amount, personal status and sex, present residence since, age, job}. Value generalization hierarchies are first built on the quasi-identifiers of each dataset. Then the Datafly algorithm [18] is adopted to achieve k-anonymity.

Since the SVM is a value-based algorithm, for numerical attributes each generalized range is represented by its mean value, and for categorical data the generalized category is represented by exhibiting all of its child categories [19]. The cost parameter and the Gaussian kernel parameter used to train the SVMs are determined by cross-validation. The performance comparison between SVMs trained from k-anonymized training data with k = 32 and k = 128 and the PPSVC with $d_u = 5$ is shown in Fig. 11. The reported accuracy is the 5-fold cross-validation average.

Fig. 11. Performance comparison with k-anonymity.

The PPSVCs with $d_u = 5$ achieve almost the same accuracy as the original SVM classifiers on these datasets, as reported in the preceding subsections. On the diabetes and German datasets, the accuracy of applying the k-anonymity technique with k = 32 is a bit lower than that of the PPSVC, and it falls further when k = 128. On the heart dataset, since this dataset is smaller (270 instances), k = 32 is already enough to significantly distort its quasi-identifier values. It is seen that the distortion of data to achieve k-anonymity slightly hurts the performance of the SVM, and the performance may get worse when a large k is needed for better privacy protection. Comparing the PPSVC to applying k-anonymity to the SVM, the PPSVC hardly hurts the performance of the SVM, and it can provide better protection of data privacy since all attribute values are hidden.
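A minimal sketch of how such generalized values could be fed to a value-based learner like the SVM is given below (assumed helper functions, not code from [18] or [19]): a generalized numeric range is replaced by its mean, and a generalized categorical value is expanded into indicators over all leaf categories it covers.

```python
def numeric_feature(generalized_range):
    # a generalized numeric range, e.g. an age generalized to (30, 39), becomes its mean value
    lo, hi = generalized_range
    return (lo + hi) / 2.0

def categorical_features(generalized_value, children, leaves):
    # `children` maps a generalized category to the set of leaf categories it covers;
    # the generalized value is exhibited as indicators over all leaf categories
    covered = children.get(generalized_value, {generalized_value})
    return [1.0 if leaf in covered else 0.0 for leaf in leaves]
```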
7 CONCLUSION

In this paper, we propose the PPSVC to tackle the privacy violation problem of the classification model of the SVM, which includes some intact instances of the training data called support vectors. The PPSVC post-processes the Gaussian kernel SVM classifier to transform it into a privacy-preserving classifier which precisely approximates the SVM classifier and does not disclose the private content of support vectors. We prove its security against adversarial attacks, and the precision issue is also addressed to guarantee a good approximation. The experimental results validate our claim that the PPSVC achieves classification accuracy similar to the original SVM classifier. By protecting the sensitive content of support vectors, the resulting privacy-preserving SVM classifier can be publicly released or shipped to clients without violating privacy. Future directions for this work include the challenge of applying the PPSVC to high-dimensional data, and exploring other common kernel functions such as the polynomial kernel.

REFERENCES

[1] M.-S. Chen, J. Han, and P. S. Yu, "Data mining: An overview from a database perspective," IEEE Trans. Knowl. Data Eng., 1996.
[2] R. Agrawal and R. Srikant, "Privacy preserving data mining," in SIGMOD'00.
[3] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in PODS'01.
[4] Y. Lindell and B. Pinkas, "Privacy preserving data mining," Journal of Cryptology, 2002.
[5] C. C. Aggarwal and P. S. Yu, "A condensation approach to privacy preserving data mining," in EDBT'04.
[6] V. N. Vapnik, Statistical Learning Theory, 1998.
[7] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 1998.
[8] K. Chen and L. Liu, "Privacy preserving data classification with rotation perturbation," in ICDM'05.
[9] H. Yu, X. Jiang, and J. Vaidya, "Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data," in SAC'06.
[10] H. Yu, J. Vaidya, and X. Jiang, "Privacy-preserving SVM classification on vertically partitioned data," in PAKDD'06.
[11] J. Vaidya, H. Yu, and X. Jiang, "Privacy-preserving SVM classification," Knowledge and Information Systems, 2008.
[12] S. Laur, H. Lipmaa, and T. Mielikäinen, "Cryptographically private support vector machines," in KDD'06.
[13] HIPAA, Standard for Privacy of Individually Identifiable Health Information, 2001.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, 1993.
[15] J. Han and M. Kamber, Data Mining, 2006.
[16] B. Mozafari and C. Zaniolo, "Publishing naive bayesian classifiers: Privacy without accuracy loss," in VLDB'09.
[17] L. Sweeney, "Uniqueness of simple demographics in the U.S. population," LIDAP-WP4, CMU, Lab for Int'l Data Privacy, 2000.
[18] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
[19] A. Inan, M. Kantarcioglu, and E. Bertino, "Using anonymized data for classification," in ICDE'09.
[20] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," in ICDE'06.
[21] B. Pinkas, "Cryptographic techniques for privacy-preserving data mining," ACM SIGKDD Explorations Newsletter, 2002.
[22] B. Schölkopf and A. J. Smola, Learning with Kernels, 2002.
[23] A. J. Smola, B. Schölkopf, and K.-R. Müller, "The connection between regularization operators and support vector kernels," Neural Networks, 1998.
[24] Y.-W. Chen and C.-J. Lin, "Combining SVMs with various feature selection strategies," in Feature Extraction, Foundations and Applications, 2006.
[25] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification, 2003.
[26] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[27] A. Asuncion and D. Newman, "UCI machine learning repository," 2007.

Keng-Pei Lin is currently working toward the PhD degree in the Department of Electrical Engineering, National Taiwan University, Taiwan. His research interests include data mining and databases.

Ming-Syan Chen received the B.S. degree in electrical engineering from National Taiwan University, Taiwan, and the M.S. and Ph.D. degrees in Computer, Information and Control Engineering from The University of Michigan, Ann Arbor, USA, in 1985 and 1988, respectively. He is now a Distinguished Research Fellow and the Director of the Research Center of Information Technology Innovation at Academia Sinica, Taiwan, and is also a Distinguished Professor jointly appointed by the EE Department, the CSIE Department, and the Graduate Institute of Communication Engineering (GICE) at National Taiwan University. He was a research staff member at the IBM Thomas J. Watson Research Center from 1988 to 1996, the Director of GICE from 2003 to 2006, and the President/CEO of the Institute for Information Industry, one of the largest organizations for information technology in Taiwan, from 2007 to 2008. His research interests include databases, data mining, cloud computing, and multimedia networking, and he has published more than 290 papers in his research areas. Dr. Chen is a Fellow of the ACM and a Fellow of the IEEE.
