1. INTRODUCTION
According to [1], there has been a significant improvement in information processing technology and storage capacity, and as a result organizations of many types generate large amounts of data. Data mining techniques are applied to extract hidden patterns from these huge data sets. Unexpected and interesting patterns are the main result of the process, because expected patterns give no new information. Hence data mining is also known as knowledge mining; a good analogy is mining gold from rocks, where the rocks are tons of data and the gold is the important information extracted from them.
Let us take an example to clearly understand the problem of privacy during data mining. A hospital is an organization that holds a data set of all its patients, containing sensitive information about each individual. Now consider a third party such as a drug manufacturer. This third party would like to obtain the data set prepared by the hospital in order to analyze the current trend of diseases, the most affected age groups, the drugs in demand, details of the doctors referred, and so on. But this data set needs to be edited properly before being released, because the third party may not be trusted. If the third party is able to identify a record completely, the privacy of that individual is breached, and the third party may also use the sensitive information for malicious purposes.
One solution to this privacy breach would be to not send any details to the third party at all. But there are advantages to sharing details for knowledge mining: in the above example, the drug manufacturer, after mining the data set, may come to know about the prevalent diseases and can focus further research on them. Hence the process of data mining cannot be denied, and its advantages are enormous. But privacy also plays an important role, and it is a matter of growing concern given rising issues related to information privacy.
Another solution would be to alter the data set in such a way that sensitive information is not disclosed, yet the data can still be mined for useful patterns. This is by far the best solution, because the privacy of the individual remains intact while the data is still mined and analyzed for the various advantages it has to offer. It provides a win-win situation for both parties (in this case the hospital and the drug manufacturer): each can carry on with its work and also expect a more effective solution to the current problems.
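The idea of altering the data before release can be illustrated with a minimal sketch (the ages and noise magnitude below are hypothetical, chosen only for illustration): adding zero-mean random noise masks each individual value, while aggregate statistics such as the mean survive approximately.

```python
import random
import statistics

random.seed(7)

# Hypothetical patient ages held by the hospital (illustrative values only).
ages = [23, 45, 36, 52, 61, 29, 48, 55, 40, 33]

# Add zero-mean Gaussian noise to each record before release, so no
# individual age is disclosed exactly but aggregates remain useful.
noise_sd = 4.0
released = [a + random.gauss(0.0, noise_sd) for a in ages]

print("original mean:", round(statistics.mean(ages), 1))
print("released mean:", round(statistics.mean(released), 1))
```

Because the noise has zero mean, the released mean stays close to the original, yet no single released record equals the true value.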
According to [2], there are more than 20 state-of-the-art techniques dealing with privacy preserving data mining (PPDM). Research on PPDM started in the year 2000 with R. Agrawal and R. Srikant in their paper [3]. Research has been fruitful since then, with various techniques such as data perturbation [4], association rule mining [5], histogram-based approaches, decision tree techniques [6], cryptographic techniques [7], and k-anonymity [8], among others, having dealt with the issue of privacy.
According to [9], the classical PPDM approach assumes a single level of trust in data miners (the third parties). Under this assumption, a data owner generates only one perturbed copy of the data with a fixed amount of uncertainty. This assumption is limiting in applications where a data owner trusts different data miners at different levels. This is a believable scenario, as there may be more than one drug manufacturer who wants the hospital's data in the example mentioned above, and a particular data miner (third party) may be more trusted than the others. Hence a less perturbed copy should be sent to a more trusted data miner and a more perturbed copy to a less trusted one, so more than one perturbed copy must be released for data mining. A malicious data miner could, however, gain access to multiple perturbed copies through various other means.
By utilizing the diversity across differently perturbed copies, such a data miner may be able to produce a more accurate reconstruction of the original data than what is allowed by the data owner. We refer to this attack as a diversity attack. It includes the colluding attack scenario, where adversaries combine their copies to mount an attack, as well as the scenario where an adversary utilizes public information to perform the attack on its own. Preventing diversity attacks is the key challenge in solving the problem, and it defines the research problem: trust issues concerning multiple parties in privacy preserving data mining.
The rest of the report is organized as follows: in section 2, the literature survey, we discuss data mining concepts and privacy preserving data mining in detail; in section 3 we discuss the existing system; in section 4 we suggest improvements to the existing system, thus introducing our proposed system; and section 5 concludes the report.
2. LITERATURE SURVEY
Here we look at the definition and concepts of data mining, along with all the preliminary information required to understand the report.
2.1 DATA MINING
Data mining can be seen as a process for extracting hidden and valid knowledge from huge databases [10]. It extracts knowledge that was previously unknown [11]; generally, the more unexpected the knowledge, the more interesting it is, and there is no benefit in mining a data set to extract knowledge that is already obvious. The extracted knowledge also needs to be valid, and extracting knowledge from a data set with a small number of records is not a viable option. Here a doubt may arise: is data mining similar to statistical data analysis? Among all traditional forms of data analysis, statistical analysis is the most similar to data mining. Many data mining tasks, such as building predictive models and discovering associations, could also be done through statistical analysis. An advantage of data mining is its assumption-free approach, whereas statistical analysis still needs some predefined hypothesis. Additionally, statistical analysis is restricted to numerical attributes, while data mining can handle both numerical and categorical attributes. Moreover, data mining techniques are generally easy to use.
2.1.1
deleted along the whole record. An attribute Soft Drink may have the
values such as pepsi,cola or pepsi cola which necessarily refers to
the same drink. They need to be consistent before data mining technique.
2.1.2
while
records
belonging to
different
cluster
have
high
2.1.3
2.2
Germany, USA and UK illustrates public concern over privacy [14]. 80% of the
respondents feel that consumers have lost all control over how personal information is
collected and used by companies. 94% of the respondents are concerned about the
possible misuse of their personal information. This survey also shows that when it comes
to the confidence that their personal information is properly handled, customers have
most trust in health care providers and banks and the least trust in credit card agencies
and internet companies.
A Harris Poll survey illustrates the growing public awareness of, and apprehension regarding, privacy, based on results obtained in 1999, 2000, 2001 and 2003 [15]. The public awareness is shown in Table 1.

Table 1: Public concern about privacy (Harris Poll)

              1999   2000   2001   2003
Concerned      78%    88%    92%    90%
Unconcerned    22%    12%     8%    10%
Given the enormous benefits of data mining, yet the high public concern regarding individual privacy, implementing privacy preserving data mining techniques has become the demand of the moment. A privacy preserving data mining technique protects individual privacy while still allowing useful knowledge to be extracted from the data.
There are several different methods that can be used to enable privacy preserving data mining. One particular class of such techniques modifies the collected data before its release, in an attempt to protect individual records from being re-identified. An intruder, even with supplementary knowledge, cannot be certain about the correctness of a re-identification when the data set has been modified in this way. This class of privacy preserving techniques relies on the fact that the data sets used for data mining purposes do not necessarily contain 100% accurate data; in fact this is almost never the case, due to the existence of natural noise in data sets. In the context of data mining it is important to maintain the patterns in the data set. Additionally, maintenance of statistical parameters, namely the means and covariances of attributes, is important in the context of statistical databases.
High quality and privacy/security are two important requirements that a good privacy preserving technique needs to satisfy. There is no single agreed-upon definition of privacy; therefore measuring privacy/security is a challenging task.
3. EXISTING SYSTEM
The government or a business might do internal data mining on a lightly perturbed copy, but also release a more heavily perturbed copy to the public. The mining department, which receives the less perturbed internal copy, also has access to the more perturbed public copy. It is desirable that this department gains no more power in reconstructing the original data by utilizing both copies than when it has only the internal copy.
Conversely, if the internal copy is leaked to the public, then obviously the public has all the power of the mining department. However, it is desirable that the public cannot reconstruct the original data more accurately by using both copies than by using only the leaked internal copy.
This new dimension of MLT-PPDM (multilevel trust privacy preserving data mining) poses new challenges: in contrast to the single-trust-level scenario, multiple perturbed copies of the same data are available to the data miners, and a malicious data miner may try to reconstruct the original data through a diversity attack. This challenge is addressed by MLT-PPDM services. The focus is on the additive perturbation approach, where random Gaussian noise is added to the original data (which may have an arbitrary distribution), and a systematic solution is provided. Through a one-to-one mapping, the data owner generates distinctly perturbed copies of its data according to different trust levels.
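A minimal sketch of this one-to-one mapping (the trust-level names and noise magnitudes are assumptions for illustration, not from [9]): each trust level is assigned its own perturbation magnitude, so a less trusted miner receives a more distorted copy.

```python
import random

random.seed(1)

# Original numeric records (illustrative synthetic data).
x = [random.uniform(5.0, 30.0) for _ in range(200)]

# Hypothetical one-to-one mapping: trust level -> noise standard deviation.
sigma_by_trust = {"high": 1.0, "medium": 3.0, "low": 6.0}

# One perturbed copy per trust level: more trusted miners get less noise.
copies = {
    level: [v + random.gauss(0.0, sd) for v in x]
    for level, sd in sigma_by_trust.items()
}

def distortion(orig, pert):
    """Average squared difference between two data sets."""
    return sum((p - o) ** 2 for o, p in zip(orig, pert)) / len(orig)

for level in ("high", "medium", "low"):
    print(level, round(distortion(x, copies[level]), 2))
```

The printed distortions grow with decreasing trust, reflecting the intended ordering of the released copies.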
Here a question may arise: why do we use random perturbation among all the techniques? The first category of PPDM techniques is Secure Multiparty Computation (SMC), which makes use of cryptographic techniques. However, these techniques are rarely put to use, as they are extraordinarily expensive in practice and impractical for real deployments. Various other solutions have been proposed: solutions to build decision trees over horizontally partitioned data were proposed in [16]; for vertically partitioned data, algorithms have been proposed to address association rule mining [17], k-means clustering [18] and frequent pattern mining problems [19]; and the work of [20] uses a secure coprocessor for privacy preserving collaborative data mining and analysis.
3.1 PRELIMINARIES
3.1.1 Jointly Gaussian
If multiple random variables are jointly Gaussian, then conditional on a subset of them, the remaining variables are still jointly Gaussian. Specifically, partition a jointly Gaussian vector as G = (G1, G2), with mean (mu_1, mu_2) and covariance matrix

K = | K_11  K_12 |
    | K_21  K_22 |

Then, conditional on G2 = g2, G1 is Gaussian with mean mu_1 + K_12 K_22^-1 (g2 - mu_2) and covariance K_11 - K_12 K_22^-1 K_21.

We assume that X, Y and Z are all N-dimensional vectors, where N is the number of attributes in X. Let K_X be the N x N covariance matrix of X, given by

K_X = E[(X - mu_X)(X - mu_X)^T] -----------------------------(2)

and let K_Z, the N x N covariance matrix of the zero-mean noise Z, be given by

K_Z = E[Z Z^T] ----------------------------------------------(3)

Huang et al. [21] point out that there must be some correlation in the added noise, otherwise it may be filtered out. Hence the noise of the i-th copy is generated as

Z_i ~ N(0, sigma_i^2 K_X) -----------------------------------(4)

with sigma_i^2 denoting the perturbation magnitude.

Stacking the M perturbed copies as Y = (Y_1, ..., Y_M), with H = (I_N, ..., I_N)^T a stack of M identity matrices and Z = (Z_1, ..., Z_M) the corresponding noise vector, all copies can be written together as

Y = HX + Z --------------------------------------------------(5)
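For a single attribute, K_X reduces to a scalar k_x, and the joint noise covariance used in the MLT-PPDM construction of [9] (the corner-wave structure, where entry (i, j) is min(sigma_i^2, sigma_j^2) times k_x) can be sketched directly; the numeric values below are arbitrary examples.

```python
def corner_wave_cov(sigmas_sq, k_x):
    """Joint noise covariance for M perturbed copies of a single attribute
    with variance k_x: entry (i, j) is min(sigma_i^2, sigma_j^2) * k_x."""
    m = len(sigmas_sq)
    return [[min(sigmas_sq[i], sigmas_sq[j]) * k_x for j in range(m)]
            for i in range(m)]

# Three trust levels with perturbation magnitudes 1, 4 and 9 (illustrative).
K_Z = corner_wave_cov([1.0, 4.0, 9.0], k_x=2.0)
for row in K_Z:
    print(row)
# -> [2.0, 2.0, 2.0]
#    [2.0, 8.0, 8.0]
#    [2.0, 8.0, 18.0]
```

Each row rises toward the diagonal and then stays flat, which is the "corner-wave" shape the construction relies on.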
3.3 THREAT MODEL
We assume malicious data miners who always attempt to reconstruct a more accurate estimate of the original data X from the perturbed copies Y available to them. For jointly Gaussian X and Y, the best such estimate is the linear least squares error (LLSE) estimate

X_hat(Y) = mu_X + K_XY K_Y^-1 (Y - mu_Y) --------------------(6)

where K_Y is the covariance matrix of Y and the cross covariance is

K_XY = E[(X - mu_X)(Y - mu_Y)^T] ----------------------------(7)

The estimation error covariance is

E[(X_hat(Y) - X)(X_hat(Y) - X)^T] = K_X - K_XY K_Y^-1 K_YX --(8)

The smaller this error, the more accurately the data miner reconstructs the original data.
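The threat can be made concrete in the single-attribute case, where all covariances are scalars and the 2 x 2 inverse can be written by hand. The sketch below (all variances are illustrative) computes the LLSE error of an attacker holding two copies: with independent noises, the joint estimate beats the best single copy (a diversity attack), while correlating the noises as min(sigma_i^2, sigma_j^2) removes the advantage.

```python
def llse_error_one_copy(kx, v):
    """LLSE error when estimating scalar X (variance kx) from Y = X + Z,
    where v = Var(Z): error = kx - kx^2 / (kx + v)."""
    return kx - kx * kx / (kx + v)

def llse_error_two_copies(kx, v11, v22, v12):
    """LLSE error of X from Y1 = X + Z1 and Y2 = X + Z2, where
    v11 = Var(Z1), v22 = Var(Z2), v12 = Cov(Z1, Z2)."""
    # Entries of K_Y for the stacked observation (Y1, Y2).
    a, b = kx + v11, kx + v12
    c, d = kx + v12, kx + v22
    det = a * d - b * c
    # error = kx - K_XY K_Y^{-1} K_YX with K_XY = (kx, kx).
    return kx - kx * kx * (a - b - c + d) / det

kx = 10.0          # variance of the original attribute
v1, v2 = 2.0, 8.0  # noise variances of the trusted and untrusted copies

print(round(llse_error_one_copy(kx, v1), 3))                     # -> 1.667
print(round(llse_error_two_copies(kx, v1, v2, 0.0), 3))          # -> 1.379
print(round(llse_error_two_copies(kx, v1, v2, min(v1, v2)), 3))  # -> 1.667
```

With independent noises, combining the two copies lowers the attacker's error from about 1.667 to about 1.379; with the min-correlated noises, the combined error stays at 1.667, i.e. the less trusted copy contributes nothing extra.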
3.4 DISTORTION
We define the distortion D between two data sets as the average expected squared difference between them. For example, the distortion between the original data X and a perturbed copy Y is given by

D(X, Y) = (1/N) E[(Y - X)^T (Y - X)]

Based on the above definition, since Y_i = X + Z_i with Z_i ~ N(0, sigma_i^2 K_X), the distortion of the i-th perturbed copy with respect to X is

D(X, Y_i) = (1/N) Tr(K_Z_i) = (sigma_i^2 / N) Tr(K_X) -------(9)

where Tr(.) is the trace of the matrix. For ease of analysis, we initially assume that the data owner wants to release M copies, the i-th having target distortion

D_i = D(X, Y_i) ---------------------------------------------(10)

Intuitively, achieving the privacy goal requires that, given the copy with the least privacy among any subset of these M perturbed copies, the remaining copies in that subset contain no extra information about X.
Assume the case of two perturbed copies Y_1 = X + Z_1 and Y_2 = X + Z_2, with noises Z_1 and Z_2 such that

D(X, Y_1) <= D(X, Y_2) --------------------------------------(11)

If Z_2 is constructed as Z_2 = Z_1 + Z', so that Y_2 = X + Z_1 + Z' with Z' independent of both X and Z_1, then Y_2 is merely a noisier version of Y_1 and has no extra information about X beyond what Y_1 already reveals. In that case the cross covariance of the two noises equals the smaller noise covariance,

Cov(Z_1, Z_2) = sigma_1^2 K_X -------------------------------(13)

Generalizing to M copies, the stacked noise is drawn as

Z ~ N(0, K_Z) -----------------------------------------------(15)

where the privacy goal is achieved when K_Z has the block structure

Cov(Z_i, Z_j) = min(sigma_i^2, sigma_j^2) K_X ---------------(16)

a shape referred to as the corner-wave property.
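The construction Z_2 = Z_1 + Z' can be checked numerically in the scalar case (the variances and seed below are illustrative): the sample covariance of the two noises converges to the smaller variance, exactly the min-correlation required above.

```python
import random

random.seed(3)

n = 200_000
var1, var2 = 2.0, 8.0  # noise variances with var1 <= var2 (illustrative)

# Draw the weaker noise, then build the stronger noise on top of it
# by adding an independent increment of variance var2 - var1.
z1 = [random.gauss(0.0, var1 ** 0.5) for _ in range(n)]
z2 = [a + random.gauss(0.0, (var2 - var1) ** 0.5) for a in z1]

# Sample covariance of Z1 and Z2 (both are zero mean).
cov12 = sum(a * b for a, b in zip(z1, z2)) / n
print(round(cov12, 2))
```

The printed covariance is close to var1 = 2.0, i.e. min(var1, var2), as the corner-wave structure demands.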
Algorithm 1: Batch Generation (parallel)
1. Input: X, K_X, and sigma_1^2 to sigma_M^2
2. Output: Y_1, ..., Y_M
3. Construct H = (I_N, ..., I_N)^T
4. Generate Z = (Z_1, ..., Z_M) with Z ~ N(0, K_Z), according to (15), where K_Z is constructed according to (16)
5. Generate Y = HX + Z, producing all copies in parallel
6. Output Y_1, ..., Y_M

Algorithm 2: Batch Generation (sequential)
1. Input: X, K_X, and sigma_1^2 to sigma_M^2, sorted in increasing order
2. Output: Y_1, ..., Y_M
3. Construct Z_1 ~ N(0, sigma_1^2 K_X)
4. Generate Z_1 and set Y_1 = X + Z_1
5. Output Y_1
6. For i = 2 to M, construct noise Z_i = Z_{i-1} + Z'_i, where the independent increment satisfies Z'_i ~ N(0, (sigma_i^2 - sigma_{i-1}^2) K_X)
7. Generate Y_i = X + Z_i
8. Output Y_i
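The sequential batch idea can be sketched for a single numeric attribute (the function name and values are illustrative, not from the report): each successive copy reuses the previous copy's noise and adds an independent increment, so the generated noises automatically have the min-correlation structure.

```python
import random

def batch_generate(x, sigmas_sq, seed=0):
    """Generate one perturbed copy of the record list x per trust level.
    sigmas_sq must be sorted in increasing order (most trusted first)."""
    rng = random.Random(seed)
    copies = []
    noise = [0.0] * len(x)
    prev_var = 0.0
    for var in sigmas_sq:
        inc_sd = (var - prev_var) ** 0.5  # sd of the independent increment
        noise = [z + rng.gauss(0.0, inc_sd) for z in noise]
        prev_var = var
        copies.append([v + z for v, z in zip(x, noise)])
    return copies

x = [10.0, 20.0, 30.0, 40.0]
copies = batch_generate(x, [1.0, 4.0, 9.0])
print(len(copies), len(copies[0]))
# -> 3 4
```

Copy i differs from copy i-1 only by the fresh increment, mirroring step 6 of the sequential algorithm.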
The main disadvantage of the batch generation approach is that it requires the data owner to foresee all possible trust levels a priori.
3.6.3 On-Demand Generation
To give the data owner maximum flexibility, perturbed copies can also be generated on demand. We assume that there are L existing copies and that the data owner generates M - L new ones, so that there are M copies in total. By the conditional Gaussian property of section 3.1.1, the new noise is drawn from a Gaussian whose mean is

mu = K_21 K_11^-1 z -----------------------------------------(17)

and whose covariance is

K_22 - K_21 K_11^-1 K_12 ------------------------------------(18)

where z is the realization of the existing noise, K_11 is its covariance, K_22 is the covariance of the new noise, and K_21 is the cross covariance between the two.

Algorithm 3: On-Demand Generation
1. Input: X, K_X, the existing noises Z_1 to Z_L, and the values of sigma_1^2 to sigma_M^2
2. Output: new copies Y_{L+1}, ..., Y_M
3. Construct the covariance blocks K_11, K_21 and K_22 according to (16)
4. Extract the realization z of the existing noises
5. for i = L+1 to M
6. Generate Z_i from the conditional Gaussian with mean (17) and covariance (18)
7. Generate Y_i = X + Z_i
8. Output Y_i
9. end for

Algorithm 3 offers more flexibility than batch generation, since new trust levels need not be known in advance.
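The on-demand step can be sketched for a single attribute, where the conditional mean and variance of (17) and (18) reduce to scalars (the function name, variances and seed are illustrative). Given an already released noise value z1 with variance v1, a new noise with variance v2 is drawn so that Cov(Z1, Z2) = min(v1, v2), matching the batch construction.

```python
import random

def on_demand_noise(z1, v1, v2, rng):
    """Draw Z2 with variance v2, conditional on the released noise value z1
    (variance v1), so that Cov(Z1, Z2) = min(v1, v2)."""
    c = min(v1, v2)          # required cross covariance
    mean = (c / v1) * z1     # conditional mean (scalar analogue of (17))
    var = v2 - c * c / v1    # conditional variance (scalar analogue of (18))
    return mean + rng.gauss(0.0, var ** 0.5)

rng = random.Random(5)
n = 200_000
v1, v2 = 4.0, 1.0  # the new copy is LESS perturbed than the existing one

z1s = [rng.gauss(0.0, v1 ** 0.5) for _ in range(n)]
z2s = [on_demand_noise(z, v1, v2, rng) for z in z1s]

# Sample covariance of the old and new noises (both zero mean).
cov12 = sum(a * b for a, b in zip(z1s, z2s)) / n
print(round(cov12, 2))
```

The printed covariance is close to min(v1, v2) = 1.0, so the on-demand copy fits seamlessly into the corner-wave structure of the earlier releases.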
4. PROPOSED SYSTEM
The following configuration is required for the execution of the proposed system.

Hardware Requirements:
Processor    : Pentium III
Speed        : 1.1 GHz
RAM          : 256 MB (min)
Hard Disk    : 20 GB
Floppy Drive : 1.44 MB
Keyboard
Mouse
Monitor      : SVGA

Software Requirements:
Operating System      : Windows 95/98/2000/XP
Application Server    : Tomcat 5.0/6.x
Front End             : HTML, Java, JSP
Scripts               : JavaScript
Database Connectivity : MySQL
System Architecture:

[Figure: system architecture. Data owners supply the original data to the MLT-PPDM algorithms, which apply random rotation perturbation; the perturbed data is then released to the data miners.]
Implementation Modules:

Data owners
The bank customers are the data owners. They can register themselves with their account number and create a username and password. A user can view the original data that they provided when they opened the account.

Admin
The admin can also log in and view the original data stored in the database.

We would first like to implement the existing system and then further enhance it. The existing system considers only linear attacks; we would like to continue the research by looking into non-linear attacks used to derive the original data and recover more information.
5. CONCLUSION
REFERENCES
[1] Md Zahidul Islam, Privacy Preserving Data Mining through Noise Addition, 2008.
[9] Yaping Li, Minghua Chen, Qiwei Li and Wei Zhang, Enabling Multilevel Trust in Privacy Preserving Data Mining, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1598-1613, September 2012.
[10] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi. Discovering Data
Mining from Concept to Implementation. Prentice Hall PTR, New Jersey 07458,USA,
1998.
[11] A. Cavoukian. Data mining: Staking a claim on your privacy, Information and
Privacy Commissioner Ontario. Available from http://www.ipc.on.ca/ docs/datamine.pdf,
Accessed on 21 May, 2008, 1998.
[12] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann
Publishers, San Diego, CA 92101-4495,USA, 2001.
[13] R. Groth. Data Mining A Hands-On Approach For Business Professionals. Prentice
Hall PTR, New Jersey 07458, USA, 1998.
[14] IBM Multi-National Privacy Survey Consumer Report. Available from http://www1.ibm.com/services/files/privacy_survey_oct991.pdf. Visited on 01.07.03.
[15] D. Agrawal and C.C. Aggarwal, On the Design and Quantification of Privacy
Preserving Data Mining Algorithms, Proc. 20th ACM SIGMOD-SIGACT-SIGART
Symp. Principles of Database Systems (PODS 01), pp. 247-255, May 2001.
[16] Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Proc.Intl Cryptology
Conf. (CRYPTO), 2000.
[17] J. Vaidya and C.W. Clifton, Privacy Preserving Association Rule Mining in
Vertically Partitioned Data, Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and
Data Mining, 2002.