1. INTRODUCTION
According to [1], there has been a significant improvement in information processing technology and storage capacity, and as a result organizations of many types generate large amounts of data. Data mining techniques are applied to extract hidden patterns from these huge data sets. Unexpected and interesting patterns are the main result of the process, because expected patterns give no new information. Hence data mining is also known as knowledge mining; a good analogy is mining gold from rocks, where the rocks are tons of data and the gold is the important information extracted from them.
Let us take an example to clearly understand the problem of privacy during data mining. A hospital is an organization that holds a data set of all its patients, containing sensitive information about each individual. Now consider a third party such as a drug manufacturer. This third party would like to obtain the data set prepared by the hospital in order to analyze the current trend of diseases, the most affected age groups, the drugs in demand, details of the doctors referred, and so on. But this data set needs to be edited properly before being released, because the third party may not be trusted. If the third party is able to identify a record completely, the privacy of that individual is breached, and the third party may also use the sensitive information for malicious purposes.
One solution to this privacy breach would be to not send any details to the third party at all. But there are advantages to sharing details for knowledge mining: in the above example, the drug manufacturer, after mining the data set, may come to know about the prevalent diseases and can focus further research on them. Hence the process of data mining cannot be denied, and its advantages are enormous. But privacy also plays an important role, and it is a matter of growing concern given rising issues related to information privacy.
Another solution would be to alter the data set in such a way that sensitive information is not disclosed, yet the data can still be mined for useful patterns. This is by far the best solution, because the privacy of the individual remains intact while the data is still mined and analyzed for the various advantages it has to offer. It provides a win-win situation for both parties (in this case the hospital and the drug manufacturer): each can carry on with its work and also expect a more effective solution to the current problems.
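The idea of altering the data before release can be illustrated with a minimal sketch (the ages and noise magnitude below are hypothetical, chosen only for illustration): adding zero-mean random noise masks each individual value, while aggregate statistics such as the mean survive approximately.

```python
import random
import statistics

random.seed(7)

# Hypothetical patient ages held by the hospital (illustrative values only).
ages = [23, 45, 36, 52, 61, 29, 48, 55, 40, 33]

# Add zero-mean Gaussian noise to each record before release, so no
# individual age is disclosed exactly but aggregates remain useful.
noise_sd = 4.0
released = [a + random.gauss(0.0, noise_sd) for a in ages]

print("original mean:", round(statistics.mean(ages), 1))
print("released mean:", round(statistics.mean(released), 1))
```

Because the noise has zero mean, the released mean stays close to the original, yet no single released record equals the true value.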
According to [2], there are more than 20 state-of-the-art techniques dealing with privacy preserving data mining (PPDM). Research on PPDM started in the year 2000 with R. Agrawal and R. Srikant in their paper [3]. Research has been fruitful since then, with various techniques such as data perturbation [4], association rule mining [5], histogram-based approaches, decision tree techniques [6], cryptographic techniques [7], and k-anonymity [8], among others, having dealt with the issue of privacy.
According to [9], the classical PPDM approach assumes a single level of trust in data miners (the third parties). Under this assumption, a data owner generates only one perturbed copy of the data with a fixed amount of uncertainty. This assumption is limiting in applications where a data owner trusts different data miners at different levels. This is a believable scenario, as there may be more than one drug manufacturer who wants the hospital's data in the example mentioned above, and a particular data miner (third party) may be more trusted than the others. Hence a less perturbed copy should be sent to a more trusted data miner and a more perturbed copy to a less trusted one, so more than one perturbed copy must be released for data mining. A malicious data miner could, however, gain access to multiple perturbed copies through various other means.
By utilizing the diversity across differently perturbed copies, such a data miner may be able to produce a more accurate reconstruction of the original data than what is allowed by the data owner. We refer to this attack as a diversity attack. It includes the colluding attack scenario, where adversaries combine their copies to mount an attack, as well as the scenario where an adversary utilizes public information to perform the attack on its own. Preventing diversity attacks is the key challenge in solving the problem, and it defines the research problem: trust issues concerning multiple parties in privacy preserving data mining.
The rest of the report is organized as follows: in section 2, the literature survey, we discuss data mining concepts and privacy preserving data mining in detail; in section 3 we discuss the existing system; in section 4 we suggest improvements to the existing system, thus introducing our proposed system; and section 5 concludes the report.
2. LITERATURE SURVEY
Here we look at the definition and concepts of data mining, along with all the preliminary information required to understand the report.
2.1 DATA MINING
Data mining can be seen as a process for extracting hidden and valid knowledge from huge databases [10]. It extracts knowledge that was previously unknown [11]; generally, the more unexpected the knowledge, the more interesting it is, and there is no benefit in mining a data set to extract knowledge that is already obvious. The extracted knowledge also needs to be valid, and extracting knowledge from a data set with a small number of records is not a viable option. Here a doubt may arise: is data mining similar to statistical data analysis? Among all traditional forms of data analysis, statistical analysis is the most similar to data mining. Many data mining tasks, such as building predictive models and discovering associations, could also be done through statistical analysis. An advantage of data mining is its assumption-free approach, whereas statistical analysis still needs some predefined hypothesis. Additionally, statistical analysis is restricted to numerical attributes, while data mining can handle both numerical and categorical attributes. Moreover, data mining techniques are generally easy to use.
2.1.1
deleted along the whole record. An attribute Soft Drink may have the
values such as pepsi,cola or pepsi cola which necessarily refers to
the same drink. They need to be consistent before data mining technique.
2.1.2
while
records
belonging to
different
cluster
have
high
2.1.3
2.2
Germany, USA and UK illustrates public concern over privacy [14]. 80% of the
respondents feel that consumers have lost all control over how personal information is
collected and used by companies. 94% of the respondents are concerned about the
possible misuse of their personal information. This survey also shows that when it comes
to the confidence that their personal information is properly handled, customers have
most trust in health care providers and banks and the least trust in credit card agencies
and internet companies.
A Harris Poll survey illustrates the growing public awareness of, and apprehension regarding, privacy, based on results obtained in 1999, 2000, 2001 and 2003 [15]. The public awareness is shown in Table 1.

Table 1: Public concern about privacy (Harris Poll)

              1999   2000   2001   2003
Concerned      78%    88%    92%    90%
Unconcerned    22%    12%     8%    10%
Given the enormous benefits of data mining, yet the high public concern regarding individual privacy, implementing privacy preserving data mining techniques has become the demand of the moment. A privacy preserving data mining technique protects individual privacy while still allowing useful knowledge to be extracted from the data.
There are several different methods that can be used to enable privacy preserving data mining. One particular class of such techniques modifies the collected data before its release, in an attempt to protect individual records from being re-identified. An intruder, even with supplementary knowledge, cannot be certain about the correctness of a re-identification when the data set has been modified in this way. This class of privacy preserving techniques relies on the fact that the data sets used for data mining purposes do not necessarily contain 100% accurate data; in fact this is almost never the case, due to the existence of natural noise in data sets. In the context of data mining it is important to maintain the patterns in the data set. Additionally, maintenance of statistical parameters, namely the means and covariances of attributes, is important in the context of statistical databases.
High quality and privacy/security are two important requirements that a good privacy preserving technique needs to satisfy. There is no single agreed-upon definition of privacy; therefore measuring privacy/security is a challenging task.
3. EXISTING SYSTEM
The government or a business might do internal data mining on a lightly perturbed copy, but also release a more heavily perturbed copy to the public. The mining department, which receives the less perturbed internal copy, also has access to the more perturbed public copy. It is desirable that this department gains no more power in reconstructing the original data by utilizing both copies than when it has only the internal copy.
Conversely, if the internal copy is leaked to the public, then obviously the public has all the power of the mining department. However, it is desirable that the public cannot reconstruct the original data more accurately by using both copies than by using only the leaked internal copy.
This new dimension of MLT-PPDM (multilevel trust privacy preserving data mining) poses new challenges: in contrast to the single-trust-level scenario, multiple perturbed copies of the same data are available to the data miners, and a malicious data miner may try to reconstruct the original data through a diversity attack. This challenge is addressed by MLT-PPDM services. The focus is on the additive perturbation approach, where random Gaussian noise is added to the original data (which may have an arbitrary distribution), and a systematic solution is provided. Through a one-to-one mapping, the data owner generates distinctly perturbed copies of its data according to different trust levels.
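A minimal sketch of this one-to-one mapping (the trust-level names and noise magnitudes are assumptions for illustration, not from [9]): each trust level is assigned its own perturbation magnitude, so a less trusted miner receives a more distorted copy.

```python
import random

random.seed(1)

# Original numeric records (illustrative synthetic data).
x = [random.uniform(5.0, 30.0) for _ in range(200)]

# Hypothetical one-to-one mapping: trust level -> noise standard deviation.
sigma_by_trust = {"high": 1.0, "medium": 3.0, "low": 6.0}

# One perturbed copy per trust level: more trusted miners get less noise.
copies = {
    level: [v + random.gauss(0.0, sd) for v in x]
    for level, sd in sigma_by_trust.items()
}

def distortion(orig, pert):
    """Average squared difference between two data sets."""
    return sum((p - o) ** 2 for o, p in zip(orig, pert)) / len(orig)

for level in ("high", "medium", "low"):
    print(level, round(distortion(x, copies[level]), 2))
```

The printed distortions grow with decreasing trust, reflecting the intended ordering of the released copies.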
Here a question may arise: why do we use random perturbation among all the techniques? The first category of PPDM techniques is Secure Multiparty Computation (SMC), which makes use of cryptographic techniques. However, these techniques are rarely put to use, as they are extraordinarily expensive in practice and impractical for real deployments. Various other solutions have been proposed: solutions to build decision trees over horizontally partitioned data were proposed in [16]; for vertically partitioned data, algorithms have been proposed to address association rule mining [17], k-means clustering [18] and frequent pattern mining problems [19]; and the work of [20] uses a secure coprocessor for privacy preserving collaborative data mining and analysis.
3.1 PRELIMINARIES
3.1.1 Jointly Gaussian
If multiple random variables are jointly Gaussian, then conditional on a subset of them, the remaining variables are still jointly Gaussian. Specifically, partition a jointly Gaussian vector as G = (G1, G2), with mean (mu_1, mu_2) and covariance matrix

K = | K_11  K_12 |
    | K_21  K_22 |

Then, conditional on G2 = g2, G1 is Gaussian with mean mu_1 + K_12 K_22^-1 (g2 - mu_2) and covariance K_11 - K_12 K_22^-1 K_21.

We assume that X, Y and Z are all N-dimensional vectors, where N is the number of attributes in X. Let K_X be the N x N covariance matrix of X, given by

K_X = E[(X - mu_X)(X - mu_X)^T] -----------------------------(2)

and let K_Z, the N x N covariance matrix of the zero-mean noise Z, be given by

K_Z = E[Z Z^T] ----------------------------------------------(3)

Huang et al. [21] point out that there must be some correlation in the added noise, otherwise it may be filtered out. Hence the noise of the i-th copy is generated as

Z_i ~ N(0, sigma_i^2 K_X) -----------------------------------(4)

with sigma_i^2 denoting the perturbation magnitude.

Stacking the M perturbed copies as Y = (Y_1, ..., Y_M), with H = (I_N, ..., I_N)^T a stack of M identity matrices and Z = (Z_1, ..., Z_M) the corresponding noise vector, all copies can be written together as

Y = HX + Z --------------------------------------------------(5)
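For a single attribute, K_X reduces to a scalar k_x, and the joint noise covariance used in the MLT-PPDM construction of [9] (the corner-wave structure, where entry (i, j) is min(sigma_i^2, sigma_j^2) times k_x) can be sketched directly; the numeric values below are arbitrary examples.

```python
def corner_wave_cov(sigmas_sq, k_x):
    """Joint noise covariance for M perturbed copies of a single attribute
    with variance k_x: entry (i, j) is min(sigma_i^2, sigma_j^2) * k_x."""
    m = len(sigmas_sq)
    return [[min(sigmas_sq[i], sigmas_sq[j]) * k_x for j in range(m)]
            for i in range(m)]

# Three trust levels with perturbation magnitudes 1, 4 and 9 (illustrative).
K_Z = corner_wave_cov([1.0, 4.0, 9.0], k_x=2.0)
for row in K_Z:
    print(row)
# -> [2.0, 2.0, 2.0]
#    [2.0, 8.0, 8.0]
#    [2.0, 8.0, 18.0]
```

Each row rises toward the diagonal and then stays flat, which is the "corner-wave" shape the construction relies on.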
3.3 THREAT MODEL
We assume malicious data miners who always attempt to reconstruct a more accurate estimate of the original data X from the perturbed copies Y available to them. For jointly Gaussian X and Y, the best such estimate is the linear least squares error (LLSE) estimate

X_hat(Y) = mu_X + K_XY K_Y^-1 (Y - mu_Y) --------------------(6)

where K_Y is the covariance matrix of Y and the cross covariance is

K_XY = E[(X - mu_X)(Y - mu_Y)^T] ----------------------------(7)

The estimation error covariance is

E[(X_hat(Y) - X)(X_hat(Y) - X)^T] = K_X - K_XY K_Y^-1 K_YX --(8)

The smaller this error, the more accurately the data miner reconstructs the original data.
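The threat can be made concrete in the single-attribute case, where all covariances are scalars and the 2 x 2 inverse can be written by hand. The sketch below (all variances are illustrative) computes the LLSE error of an attacker holding two copies: with independent noises, the joint estimate beats the best single copy (a diversity attack), while correlating the noises as min(sigma_i^2, sigma_j^2) removes the advantage.

```python
def llse_error_one_copy(kx, v):
    """LLSE error when estimating scalar X (variance kx) from Y = X + Z,
    where v = Var(Z): error = kx - kx^2 / (kx + v)."""
    return kx - kx * kx / (kx + v)

def llse_error_two_copies(kx, v11, v22, v12):
    """LLSE error of X from Y1 = X + Z1 and Y2 = X + Z2, where
    v11 = Var(Z1), v22 = Var(Z2), v12 = Cov(Z1, Z2)."""
    # Entries of K_Y for the stacked observation (Y1, Y2).
    a, b = kx + v11, kx + v12
    c, d = kx + v12, kx + v22
    det = a * d - b * c
    # error = kx - K_XY K_Y^{-1} K_YX with K_XY = (kx, kx).
    return kx - kx * kx * (a - b - c + d) / det

kx = 10.0          # variance of the original attribute
v1, v2 = 2.0, 8.0  # noise variances of the trusted and untrusted copies

print(round(llse_error_one_copy(kx, v1), 3))                     # -> 1.667
print(round(llse_error_two_copies(kx, v1, v2, 0.0), 3))          # -> 1.379
print(round(llse_error_two_copies(kx, v1, v2, min(v1, v2)), 3))  # -> 1.667
```

With independent noises, combining the two copies lowers the attacker's error from about 1.667 to about 1.379; with the min-correlated noises, the combined error stays at 1.667, i.e. the less trusted copy contributes nothing extra.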
3.4 DISTORTION
We define the distortion D between two data sets as the average expected squared difference between them. For example, the distortion between the original data X and a perturbed copy Y is given by

D(X, Y) = (1/N) E[(Y - X)^T (Y - X)]

Based on the above definition, since Y_i = X + Z_i with Z_i ~ N(0, sigma_i^2 K_X), the distortion of the i-th perturbed copy with respect to X is

D(X, Y_i) = (1/N) Tr(K_Z_i) = (sigma_i^2 / N) Tr(K_X) -------(9)

where Tr(.) is the trace of the matrix. For ease of analysis, we initially assume that the data owner wants to release M copies, the i-th having target distortion

D_i = D(X, Y_i) ---------------------------------------------(10)

Intuitively, achieving the privacy goal requires that, given the copy with the least privacy among any subset of these M perturbed copies, the remaining copies in that subset contain no extra information about X.
Assume the case of two perturbed copies Y_1 = X + Z_1 and Y_2 = X + Z_2, with noises Z_1 and Z_2 such that

D(X, Y_1) <= D(X, Y_2) --------------------------------------(11)

If Z_2 is constructed as Z_2 = Z_1 + Z', so that Y_2 = X + Z_1 + Z' with Z' independent of both X and Z_1, then Y_2 is merely a noisier version of Y_1 and has no extra information about X beyond what Y_1 already reveals. In that case the cross covariance of the two noises equals the smaller noise covariance,

Cov(Z_1, Z_2) = sigma_1^2 K_X -------------------------------(13)

Generalizing to M copies, the stacked noise is drawn as

Z ~ N(0, K_Z) -----------------------------------------------(15)

where the privacy goal is achieved when K_Z has the block structure

Cov(Z_i, Z_j) = min(sigma_i^2, sigma_j^2) K_X ---------------(16)

a shape referred to as the corner-wave property.
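The construction Z_2 = Z_1 + Z' can be checked numerically in the scalar case (the variances and seed below are illustrative): the sample covariance of the two noises converges to the smaller variance, exactly the min-correlation required above.

```python
import random

random.seed(3)

n = 200_000
var1, var2 = 2.0, 8.0  # noise variances with var1 <= var2 (illustrative)

# Draw the weaker noise, then build the stronger noise on top of it
# by adding an independent increment of variance var2 - var1.
z1 = [random.gauss(0.0, var1 ** 0.5) for _ in range(n)]
z2 = [a + random.gauss(0.0, (var2 - var1) ** 0.5) for a in z1]

# Sample covariance of Z1 and Z2 (both are zero mean).
cov12 = sum(a * b for a, b in zip(z1, z2)) / n
print(round(cov12, 2))
```

The printed covariance is close to var1 = 2.0, i.e. min(var1, var2), as the corner-wave structure demands.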
Algorithm 1: Batch Generation (parallel)
1. Input: X, K_X, and sigma_1^2 to sigma_M^2
2. Output: Y_1, ..., Y_M
3. Construct H = (I_N, ..., I_N)^T
4. Generate Z = (Z_1, ..., Z_M) with Z ~ N(0, K_Z), according to (15), where K_Z is constructed according to (16)
5. Generate Y = HX + Z, producing all copies in parallel
6. Output Y_1, ..., Y_M

Algorithm 2: Batch Generation (sequential)
1. Input: X, K_X, and sigma_1^2 to sigma_M^2, sorted in increasing order
2. Output: Y_1, ..., Y_M
3. Construct Z_1 ~ N(0, sigma_1^2 K_X)
4. Generate Z_1 and set Y_1 = X + Z_1
5. Output Y_1
6. For i = 2 to M, construct noise Z_i = Z_{i-1} + Z'_i, where the independent increment satisfies Z'_i ~ N(0, (sigma_i^2 - sigma_{i-1}^2) K_X)
7. Generate Y_i = X + Z_i
8. Output Y_i
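The sequential batch idea can be sketched for a single numeric attribute (the function name and values are illustrative, not from the report): each successive copy reuses the previous copy's noise and adds an independent increment, so the generated noises automatically have the min-correlation structure.

```python
import random

def batch_generate(x, sigmas_sq, seed=0):
    """Generate one perturbed copy of the record list x per trust level.
    sigmas_sq must be sorted in increasing order (most trusted first)."""
    rng = random.Random(seed)
    copies = []
    noise = [0.0] * len(x)
    prev_var = 0.0
    for var in sigmas_sq:
        inc_sd = (var - prev_var) ** 0.5  # sd of the independent increment
        noise = [z + rng.gauss(0.0, inc_sd) for z in noise]
        prev_var = var
        copies.append([v + z for v, z in zip(x, noise)])
    return copies

x = [10.0, 20.0, 30.0, 40.0]
copies = batch_generate(x, [1.0, 4.0, 9.0])
print(len(copies), len(copies[0]))
# -> 3 4
```

Copy i differs from copy i-1 only by the fresh increment, mirroring step 6 of the sequential algorithm.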
The main disadvantage of the batch generation approach is that it requires the data owner to foresee all possible trust levels a priori.
3.6.3 On-Demand Generation
To give the data owner maximum flexibility, perturbed copies can also be generated on demand. We assume that there are L existing copies and that the data owner generates M - L new ones, so that there are M copies in total. By the conditional Gaussian property of section 3.1.1, the new noise is drawn from a Gaussian whose mean is

mu = K_21 K_11^-1 z -----------------------------------------(17)

and whose covariance is

K_22 - K_21 K_11^-1 K_12 ------------------------------------(18)

where z is the realization of the existing noise, K_11 is its covariance, K_22 is the covariance of the new noise, and K_21 is the cross covariance between the two.

Algorithm 3: On-Demand Generation
1. Input: X, K_X, the existing noises Z_1 to Z_L, and the values of sigma_1^2 to sigma_M^2
2. Output: new copies Y_{L+1}, ..., Y_M
3. Construct the covariance blocks K_11, K_21 and K_22 according to (16)
4. Extract the realization z of the existing noises
5. for i = L+1 to M
6. Generate Z_i from the conditional Gaussian with mean (17) and covariance (18)
7. Generate Y_i = X + Z_i
8. Output Y_i
9. end for

Algorithm 3 offers more flexibility than batch generation, since new trust levels need not be known in advance.
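The on-demand step can be sketched for a single attribute, where the conditional mean and variance of (17) and (18) reduce to scalars (the function name, variances and seed are illustrative). Given an already released noise value z1 with variance v1, a new noise with variance v2 is drawn so that Cov(Z1, Z2) = min(v1, v2), matching the batch construction.

```python
import random

def on_demand_noise(z1, v1, v2, rng):
    """Draw Z2 with variance v2, conditional on the released noise value z1
    (variance v1), so that Cov(Z1, Z2) = min(v1, v2)."""
    c = min(v1, v2)          # required cross covariance
    mean = (c / v1) * z1     # conditional mean (scalar analogue of (17))
    var = v2 - c * c / v1    # conditional variance (scalar analogue of (18))
    return mean + rng.gauss(0.0, var ** 0.5)

rng = random.Random(5)
n = 200_000
v1, v2 = 4.0, 1.0  # the new copy is LESS perturbed than the existing one

z1s = [rng.gauss(0.0, v1 ** 0.5) for _ in range(n)]
z2s = [on_demand_noise(z, v1, v2, rng) for z in z1s]

# Sample covariance of the old and new noises (both zero mean).
cov12 = sum(a * b for a, b in zip(z1s, z2s)) / n
print(round(cov12, 2))
```

The printed covariance is close to min(v1, v2) = 1.0, so the on-demand copy fits seamlessly into the corner-wave structure of the earlier releases.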
4. PROPOSED SYSTEM
The following configuration is required for the execution of the proposed system.

Hardware Requirements:
Processor    : Pentium III
Speed        : 1.1 GHz
RAM          : 256 MB (min)
Hard Disk    : 20 GB
Floppy Drive : 1.44 MB
Keyboard
Mouse
Monitor      : SVGA

Software Requirements:
Operating System      : Windows 95/98/2000/XP
Application Server    : Tomcat 5.0/6.x
Front End             : HTML, Java, JSP
Scripts               : JavaScript
Database Connectivity : MySQL
System Architecture:

[Figure: system architecture. Data owners supply the original data to the MLT-PPDM algorithms, which apply random rotation perturbation; the perturbed data is then released to the data miners.]
Implementation Modules:

Data owners
The bank customers are the data owners. They can register themselves with their account number and create a username and password. A user can view the original data that they provided when they opened the account.

Admin
The admin can also log in and view the original data stored in the database.

We would first like to implement the existing system and then further enhance it. The existing system considers only linear attacks; we would like to continue the research by looking into non-linear attacks used to derive the original data and recover more information.
5. CONCLUSION
REFERENCES
[1] Md Zahidul Islam, Privacy Preserving Data Mining through Noise Addition, 2008.
[9] Yaping Li, Minghua Chen, Qiwei Li and Wei Zhang, Enabling Multilevel Trust in Privacy Preserving Data Mining, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1598-1613, September 2012.
[10] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi. Discovering Data
Mining from Concept to Implementation. Prentice Hall PTR, New Jersey 07458,USA,
1998.
[11] A. Cavoukian. Data mining: Staking a claim on your privacy, Information and
Privacy Commissioner Ontario. Available from http://www.ipc.on.ca/ docs/datamine.pdf,
Accessed on 21 May, 2008, 1998.
[12] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann
Publishers, San Diego, CA 92101-4495,USA, 2001.
[13] R. Groth. Data Mining A Hands-On Approach For Business Professionals. Prentice
Hall PTR, New Jersey 07458, USA, 1998.
[14] IBM Multi-National Privacy Survey Consumer Report. Available from http://www1.ibm.com/services/files/privacy_survey_oct991.pdf. Visited on 01.07.03.
[15] D. Agrawal and C.C. Aggarwal, On the Design and Quantification of Privacy
Preserving Data Mining Algorithms, Proc. 20th ACM SIGMOD-SIGACT-SIGART
Symp. Principles of Database Systems (PODS 01), pp. 247-255, May 2001.
[16] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Proc.Intl Cryptology
Conf. (CRYPTO), 2000.
[17] J. Vaidya and C.W. Clifton, Privacy Preserving Association Rule Mining in
Vertically Partitioned Data, Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and
Data Mining, 2002.