
Protect Privacy of Medical Informatics using K-Anonymization Model

Asmaa H. Rashid 1, Arab Academy for Science and Technology, College of Computing and Information Technology, Sheraton Heliopolis, Cairo, Egypt, Rashid.asmaa@yahoo.com
Prof. Dr. Abd-Fatth Hegazy 2, Arab Academy for Science and Technology, College of Computing and Information Technology, Sheraton Heliopolis, Cairo, Egypt, abdheg@yahoo.com

Abstract:
While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing anonymization techniques, but these efforts focus exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques, without taking advantage of the research developments in the data privacy community. This paper attempts to fill these gaps and presents a framework and prototype system for de-identifying health information including both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization based technique for de-identifying the extracted data to preserve maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.

Keywords: Anonymization - Medical text - Named entity recognition - Conditional random fields - Cost-proportionate sampling - Data linkage.

1. Introduction

Current information technology enables many organizations to collect, store, and use various types of information about individuals. The government and organizations are increasingly recognizing the critical value of sharing such a wealth of information. However, individually identifiable information is protected under the Health Insurance Portability and Accountability Act (HIPAA).1

1.1. Motivating Scenarios

The National Cancer Institute initiated the Shared Pathology Informatics Network (SPIN)2 for researchers throughout the country to share pathology-based data sets annotated with clinical information, in order to discover and validate new diagnostic tests and therapies. Fig. 1 shows a sample pathology report section with personally identifying information, such as age and medical record number, highlighted. Each institution must de-identify or anonymize its data before making it accessible to the network. The shared data consist of both structured and unstructured data in various formats. Most medical data are heterogeneous: even structured data from different institutions are labeled differently, and unstructured data are inherently heterogeneous. We use the terms heterogeneous data and unstructured data interchangeably throughout this paper.

CLINICAL HISTORY: 55 year old female with a history of B-cell lymphoma (marginal zone, sh-02 22222, 2/6/01). Flow cytometry and molecular diagnostics drawn.

Figure 1: A sample pathology report section.

1 Health Insurance Portability and Accountability Act (HIPAA). http://www.hhs.gov/ocr/hipaa/. State law or institutional policy may differ from the HIPAA standard and should be considered as well.

2 Shared Pathology Informatics Network. http://www.cancerdiagnosis.nci.nih.gov/spin/.

1.2. Existing and potential solutions

Currently, investigators or institutions wishing to use medical records for research purposes have three options: obtain permission from the patients, obtain a waiver of informed consent from their Institutional Review Boards (IRB), or use a data set that has had all or most of the identifiers removed. The last option can be generalized into the problem of de-identification or anonymization (both terms are used interchangeably throughout this paper), where a data custodian distributes an anonymized view of the data that does not contain individually identifiable information to a data recipient. It provides a scalable way of sharing medical information in large-scale environments while preserving the privacy of patients. The general problem of data anonymization has been extensively studied in recent years in the data privacy community [1]. The seminal work by Sweeney et al. shows that a dataset that simply has identifiers removed is subject to linking attacks [2]. Since then, a large body of work has contributed to data anonymization that transforms a dataset to meet a privacy principle such as k-anonymity, using techniques such as generalization, suppression (removal), permutation, and swapping of certain data values so that it does not contain individually identifiable information [3,4,5,1,6,7,8,9,10,11]. While the research on data anonymization has made great progress, its practical utilization in medical fields lags behind. An overarching complexity of medical data, often overlooked in data privacy research, is data heterogeneity. A considerable amount of medical data resides in unstructured text forms such as clinical notes, radiology and pathology reports, and discharge summaries. While some identifying attributes can be clearly defined in structured data, an extensive set of identifying information is often hidden or has multiple, different references in the text. Unfortunately, the bulk of data privacy research focuses exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community [12,13,14,15,16,17,18,19] are mostly specialized for specific document types or a subset of HIPAA identifiers. Most importantly, they rely on simple identifier removal techniques without taking advantage of the research developments from the data privacy community that guarantee a more formalized notion of privacy while maximizing data utility.
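To make the linking attack concrete, the sketch below joins a table whose explicit identifiers have been removed against a public voter list on shared quasi-identifier attributes; the data, column names, and the use of pandas are illustrative assumptions, not part of any system discussed in this paper.

```python
# Illustrative sketch of a linking (re-identification) attack [2];
# the data and column names are hypothetical.
import pandas as pd

# "De-identified" medical records: explicit identifiers removed, but the
# quasi-identifier (date of birth, zip code, sex) is retained.
medical = pd.DataFrame({
    "dob": ["1957-08-02", "1979-03-24"],
    "zip": ["07030", "07030"],
    "sex": ["F", "F"],
    "diagnosis": ["B-cell lymphoma", "pharyngitis"],
})

# Public voter registration list with names and the same quasi-identifier.
voters = pd.DataFrame({
    "name": ["Alice Smith", "Carol Jones"],
    "dob": ["1957-08-02", "1979-03-24"],
    "zip": ["07030", "07030"],
    "sex": ["F", "F"],
})

# Joining on the quasi-identifier re-attaches names to diagnoses.
linked = medical.merge(voters, on=["dob", "zip", "sex"])
print(linked[["name", "diagnosis"]])
```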

1.3. Contributions.
Our work attempts to fill the above gaps and bridge the data privacy and medical informatics communities by developing a framework and prototype system, HIDE, for Health Information DE-identification of both structured and unstructured data. The contributions of our work are twofold. First, our system advances the medical informatics field by adopting information extraction (also referred to as attribute extraction) and data anonymization techniques for de-identifying heterogeneous health information. Second, the conceptual framework of our system advances the data privacy field by integrating the anonymization process for both structured and unstructured data. The specific components and contributions of our system are as follows.

Identifying and sensitive information extraction. We leverage and empirically study existing named entity extraction techniques [20,21], in particular simple Bayesian classifier and sampling-based techniques, and conditional random fields-based techniques, to effectively extract identifying and sensitive information from unstructured data.

Data linking. In order to preserve privacy for individuals and apply advanced anonymization techniques in the heterogeneous data space, we propose a structured identifier view with identifying attributes linked to each individual.

Anonymization. We perform data suppression and generalization on the identifier view to anonymize the data, with different options including full de-identification, partial de-identification, and statistical anonymization based on k-anonymization. While we utilize off-the-shelf techniques for some of these components, the main contribution of our system is that it bridges the research on data privacy and text management and provides an integrated framework that allows the anonymization of heterogeneous data for practical applications.
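As a rough illustration of these three options (a minimal sketch under assumed field names and generalization rules, not HIDE's actual implementation):

```python
# Sketch of three de-identification options over an extracted identifier view;
# field names and rules are illustrative assumptions.
record = {"name": "Jane Doe", "age": 55, "zip": "07030", "mrn": "sh-02 22222"}

IDENTIFIERS = {"name", "mrn"}         # direct identifiers
QUASI_IDENTIFIERS = {"age", "zip"}    # identifying in combination

def full_deidentify(rec):
    # Remove both direct and quasi-identifying attributes.
    return {k: v for k, v in rec.items()
            if k not in IDENTIFIERS | QUASI_IDENTIFIERS}

def partial_deidentify(rec):
    # Remove direct identifiers only; keep quasi-identifiers untouched.
    return {k: v for k, v in rec.items() if k not in IDENTIFIERS}

def statistical_anonymize(rec):
    # Coarsen quasi-identifiers (age bucket, zip prefix) so the record can
    # blend into a k-anonymous group; direct identifiers are still removed.
    out = partial_deidentify(rec)
    decade = (rec["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    out["zip"] = rec["zip"][:3] + "**"
    return out

print(full_deidentify(record))
print(partial_deidentify(record))
print(statistical_anonymize(record))
```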

2. Novelty and technical contributions


In the following, we describe substantive contributions to data privacy research. In view of space constraints, we focus on six pieces of work that we consider the most interesting in data privacy [26,27,28,29].

2.1 Personalized Privacy Preservation


In [30], we investigated the publication of sensitive data using generalization, the most popular anonymization methodology in the literature. Existing privacy models for generalized tables (i.e., noisy microdata obtained through generalization) exert the same amount of protection on all individuals in the dataset, without catering to their concrete needs. For example, in a set of medical records, a patient who has contracted flu would receive the same degree of privacy protection as a patient suffering from cancer, even if the former is willing to disclose her symptom directly (since flu is a common disease), while the latter requires a higher privacy guarantee than others (due to the sensitivity of her medical condition). Motivated by this, we proposed a personalized framework that allows each individual to specify the amount of privacy protection required for her data. Based on this framework, we devised the first privacy model that takes into account personalized privacy requirements. We also developed an efficient algorithm for computing generalized tables that conform to the model.

2.2 Republication of Dynamic Datasets

Data collection is often a continuous process, where tuples are inserted into (or deleted from) the microdata as time evolves. Hence, a data publisher may need to republish the microdata multiple times to reflect its recent changes. Such republication is not supported by conventional generalization techniques, since they assume that the microdata is static. We addressed the above issue by proposing an innovative privacy model, m-invariance [31], which secures the privacy of any individual involved in the republication process, even against an adversary who exploits the correlations between multiple releases of the microdata. The model was accompanied by a generalization algorithm whose space and time complexity is independent of the number n of generalized tables that have been released by the publisher. This property of the algorithm is essential in the republication scenario, where n increases monotonically with time.

2.3 Complexity of Data Anonymization

In [32], we presented a study on the complexity of producing generalized tables that conform to l-diversity, the most widely adopted privacy model. We showed that it is NP-hard to achieve l-diversity with minimum information loss, for any l larger than two and any dataset that contains at least three distinct sensitive values. We then developed an O(l·d)-approximation algorithm, where d is the number of QI attributes contained in the microdata. Besides its theoretical guarantee, the proposed algorithm also worked fairly well in practice, and considerably outperformed the state of the art in several aspects.

2.4 Transparent Anonymization

Existing solutions for data publication consider that the adversary possesses certain prior knowledge about each individual, but overlook the possibility that the adversary may also know the anonymization algorithm adopted by the data publisher. Consequently, an attacker can compromise the privacy protection enforced by those solutions by exploiting various characteristics of the anonymization approach. To remedy this problem, we proposed an analytical model [33] for evaluating the disclosure risks in generalized tables, under the assumption that everything involved in the anonymization process, except the input dataset, is public knowledge. Based on this model, we developed three generalization algorithms that ensure privacy protection even against an adversary who has a thorough understanding of the algorithms. Compared with the state-of-the-art generalization techniques, our algorithms not only provided a higher degree of privacy protection, but also attained satisfactory performance in terms of information distortion and computation overhead.

2.5 Anonymization via Anatomy

While most previous work adopts generalization to anonymize data, in [34] we proposed a novel anonymization methodology, anatomy, which provides almost the same privacy guarantee as generalization does, and yet significantly outperforms generalization in terms of the accuracy of data analysis on the distorted microdata. We provided theoretical justifications for the superiority of anatomy over generalization, and developed a linear-time algorithm for anonymizing data via anatomy. The efficiency and effectiveness of the solution were verified through extensive experiments.

2.6 Dynamic Anonymization

In [35], we proposed dynamic anonymization, which produces a tailor-made anonymized version of the dataset for each query given by the users, such that the anonymized data maximizes the accuracy of the query result. Privacy preservation is achieved by ensuring that the combination of all anonymized data does not reveal private information, i.e., even if the adversary obtains every anonymized version of the dataset, s/he would not be able to infer the sensitive value of any individual. Through extensive experiments, we showed that dynamic anonymization significantly improved the accuracy of queries on the anonymized data, when compared with existing techniques.

3. Privacy protection models


What is privacy protection? Dalenius [1977] provided a very stringent definition: access to the published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even in the presence of any background knowledge the attacker may have obtained from other sources. Dwork [2006] [57] showed that such absolute privacy protection is impossible due to the presence of background knowledge. Suppose the age of an individual is sensitive information, and assume an attacker knows that Alice's age is 5 years younger than the average age of American women. If the attacker has access to a statistical database that discloses the average age of American women, then Alice's privacy is considered compromised according to Dalenius' definition, regardless of whether or not Alice's record is in the database [Dwork 2006].

Surveying the literature, we identified a range of models used to protect data privacy; they help improve privacy-preserving data dissemination so that information can be utilized in various disciplines while privacy is maintained. The models are as follows: k-Anonymity, MultiR k-Anonymity, l-Diversity, Confidence Bounding, (α, k)-Anonymity, (X, Y)-Privacy, (k, e)-Anonymity, (ε, m)-Anonymity, Personalized Privacy, t-Closeness, δ-Presence, (c, t)-Isolation, ε-Differential Privacy, (d, γ)-Privacy, and Distributional Privacy [57].
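For concreteness, the ε-differential privacy model listed above has the following standard definition [57]; the notation below is a restatement for the reader's convenience, not taken from this paper:

```latex
% A randomized mechanism K satisfies \epsilon-differential privacy if, for every
% pair of datasets D_1, D_2 differing in at most one record, and for every set S
% of possible outputs,
\Pr[\,K(D_1) \in S\,] \;\le\; e^{\epsilon} \cdot \Pr[\,K(D_2) \in S\,].
```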

4. k-Anonymization Approach

Organizations (such as the Census Bureau or hospitals) collect large amounts of personal information. This data has high value for the public, for example, to study social trends or to find cures for diseases. However, careless publication of such data poses a danger to the privacy of the individuals who contributed it. There has been much research over the last decades on methods for limiting disclosure in data publishing; in particular, the computer science community has made important contributions over the last ten years. The research in this area investigated various adversary models and proposed different anonymization techniques that provide rigorous guarantees against attacks. However, to the best of our knowledge, none of these techniques has so far been implemented as part of a usable tool. This is mainly due to the non-interactive nature of these techniques: the only interface they provide to data publishers is a set of parameters that controls the degree of privacy protection to be enforced in the anonymized data. The publishers, however, seldom have enough knowledge to decide appropriate values for the parameters; setting these values requires not only a deep understanding of the underlying privacy model but also a thorough understanding of possible adversaries. Furthermore, even if the data publisher had such knowledge, she would much prefer an interactive anonymization process instead of fixing the algorithm and its parameters before seeing the anonymized output data. The data publisher will select the final anonymized version of the data only after she has explored the space of anonymization parameters and adversary models. Existing anonymization techniques have not been put into such a progressive, user-centric anonymization process. In this demonstration, we bring the theory of data anonymization to practice. We developed CAT, the Cornell Anonymization Toolkit, which not only incorporates state-of-the-art formal privacy protection methods, but also provides an intuitive interface that can interactively guide users through the data publishing workflow. CAT was designed with two objectives in mind. First, the toolkit should help users acquire an intuitive understanding of the disclosure risk in the anonymized data, so that they can make educated decisions on releasing appropriate data. Second, the toolkit should offer users full control of the anonymization process, allowing them to adjust various parameters and to examine the quality of the anonymized data (in terms of both privacy and utility) in a convenient manner. To the best of our knowledge, this is the first effort that employs existing anonymization techniques to provide a practical tool for data publication [22,23,24,25].

5. k-Anonymity: a model for protecting privacy


Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field-structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from that of at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus, and k-Similar provide guarantees of privacy protection.

In today's information society, given the unprecedented ease of finding and accessing information, the protection of privacy has become a very important concern. In particular, large databases that include sensitive information (e.g., health information) have often been made available for public access, frequently with identifiers stripped off in an attempt to protect privacy. However, if such information can be associated with the corresponding people's identifiers, perhaps using other publicly available databases, then privacy can be seriously violated. For example, Sweeney [36] pointed out that one can find out who has what disease using a public database and voter lists. To solve such problems, Samarati and Sweeney [37] proposed a technique called k-anonymization. In this paper, we study how to enhance privacy in carrying out the process of k-anonymization. Consider a table that provides health information of patients for medical studies, as shown in Table 1. Each row of the table consists of a patient's date of birth, zip code, allergy, and history of illness. Although the identifier of each patient does not explicitly appear in this table, a dedicated adversary may be able to derive the identifiers of some patients using the combinations of date of birth and zip code. For example, he may be able to find that his roommate is the patient of the first row, who has an allergy to penicillin and a history of pharyngitis.

In this example, the set of attributes {date of birth, zip code} is called a quasi-identifier [38,39], because these attributes in combination can be used to identify an individual with a significant probability. In this paper, we say an attribute is a quasi-identifier attribute if it is in the quasi-identifier. Attributes like allergy and history of illness are called sensitive attributes. (There may be other attributes in a table besides the quasi-identifier attributes and the sensitive attributes; we ignore them in this paper since they are not relevant to our investigation.) The privacy threat we consider here is that an adversary may be able to link the sensitive attributes of some rows to the corresponding identifiers using the information provided in the quasi-identifiers. A proposed strategy to solve this problem is to make the table k-anonymous [40].

Table 1: A table of health data

Date of Birth | Zip Code | Allergy    | History of Illness
03-24-79      | 07030    | Penicillin | Pharyngitis
08-02-57      | 07028    | No Allergy | Stroke
11-12-39      | 07030    | No Allergy | Polio
08-02-57      | 07029    | Sulfur     | Diphtheria
08-01-40      | 07030    | No Allergy | Colitis

Table 2: Anonymized table of health data

Date of Birth | Zip Code | Allergy    | History of Illness
*             | 07030    | Penicillin | Pharyngitis
08-02-57      | 0702*    | No Allergy | Stroke
*             | 07030    | No Allergy | Polio
08-02-57      | 0702*    | Sulfur     | Diphtheria
*             | 07030    | No Allergy | Colitis

In a k-anonymous table, each value of the quasi-identifier appears at least k times. Therefore, if the adversary only uses the quasi-identifiers to link sensitive attributes to the identifiers, then each involved entity (the patient, in our example) is "hidden" among at least k peers. The procedure of making a table k-anonymous is called k-anonymization. It can be achieved by suppression (i.e., replacing some entries with "*") or generalization (e.g., replacing some or all occurrences of "07028" and "07029" with "0702*"). Table 2 shows the result of 2-anonymization on Table 1. Several algorithmic methods have been proposed describing how a central authority can k-anonymize a table before it is released to the public. In this research, we consider a related but different scenario: distributed customers holding their own data interact with a miner and use k-anonymization in this process to protect their own privacy. For example, imagine the above-mentioned health data are collected from customers by a medical researcher. The customers will feel comfortable if the medical researcher does not need to be trusted and only sees a k-anonymized version of their data. To solve this problem, we show methods by which k-anonymization can be jointly performed by the involved parties in a private manner, such that no single participant, including the miner, learns extra information that could be used to link sensitive attributes to corresponding identifiers.
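The sketch below illustrates this on the running example: it checks whether a table is k-anonymous with respect to the quasi-identifier {date of birth, zip code} and applies the suppression and generalization that produce Table 2. It is a minimal single-party illustration, not the distributed, privacy-preserving protocol discussed in this section.

```python
# Minimal k-anonymity check and anonymization over the quasi-identifier
# {date of birth, zip code}, using the small table from the running example.
from collections import Counter

rows = [
    {"dob": "03-24-79", "zip": "07030"},
    {"dob": "08-02-57", "zip": "07028"},
    {"dob": "11-12-39", "zip": "07030"},
    {"dob": "08-02-57", "zip": "07029"},
    {"dob": "08-01-40", "zip": "07030"},
]

def is_k_anonymous(table, quasi_ids, k):
    # Every combination of quasi-identifier values must occur at least k times.
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in table)
    return all(c >= k for c in counts.values())

def anonymize(table):
    # Suppress dates of birth that appear only once and generalize the
    # 07028/07029 zip codes, mirroring the transformation behind Table 2.
    dob_counts = Counter(r["dob"] for r in table)
    out = []
    for r in table:
        dob = r["dob"] if dob_counts[r["dob"]] >= 2 else "*"
        zip_code = "0702*" if r["zip"] in ("07028", "07029") else r["zip"]
        out.append({"dob": dob, "zip": zip_code})
    return out

print(is_k_anonymous(rows, ["dob", "zip"], 2))             # False
print(is_k_anonymous(anonymize(rows), ["dob", "zip"], 2))  # True
```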

6. State-of-the-Art Privacy Preservation


We briefly review the most relevant areas below and discuss how our work leverages and advances the current state-of-the-art.

6.1. Privacy preserving data publishing


Privacy-preserving data publishing for centralized databases has been studied extensively in recent years. One thread of work aims at devising privacy principles, such as k-anonymity and later principles that remedy its problems, that serve as criteria for judging whether a published dataset provides sufficient privacy protection [41,42,43,44,45]. Another large body of work contributes algorithms that transform a dataset to meet one of the above privacy principles (dominantly k-anonymity). The bulk of this work has focused exclusively on structured data.

6.2. Medical text de-identification


In the medical informatics community, there have been some efforts on de-identifying medical text documents [41,43,39]. Most of them use a two-step approach which extracts the identifying attributes first and then removes or masks the attributes for de-identification purposes. Most are specialized for specific document types (e.g., pathology reports only [46,47,48]). Some focus on a subset of HIPAA identifiers (e.g., names only [49,50]), while others focus on differentiating protected health information (PHI) from non-PHI [51]. Most importantly, most of these works rely on simple identifier removal or grouping techniques and do not take advantage of the recent research developments that guarantee a more formalized notion of privacy while maximizing data utility.

6.3. Information extraction


Extracting atomic identifying and sensitive attributes (such as name, address, and disease name) from unstructured text such as pathology reports can be seen as an application of the named entity recognition (NER) problem [53]. NER systems can be roughly classified into two categories, and both have been applied in medical domains for de-identification. The first uses grammar-based or rule-based techniques [54]. Unfortunately, such hand-crafted systems may require months of work by experienced domain experts, and the rules will likely need to change for different data repositories. The second uses statistical learning approaches such as support vector machine (SVM)-based classification methods. However, an SVM-based method such as [51] only performs binary classification of terms into PHI or non-PHI and does not allow statistical de-identification, which requires knowledge of the different types of identifying attributes.
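As a simple illustration of the rule-based flavor of extraction (not the Bayesian or CRF classifiers studied in this paper), the sketch below pulls candidate identifying attributes such as ages, dates, and record numbers out of free text with regular expressions; the patterns are simplistic assumptions rather than a production system.

```python
# Illustrative rule-based extraction of identifying attributes from free text;
# the regular expressions are simplistic assumptions, not a deployed system.
import re

PATTERNS = {
    "age":  re.compile(r"\b(\d{1,3})\s*year[- ]old\b", re.IGNORECASE),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "mrn":  re.compile(r"\bsh-\d{2}\s*\d+\b", re.IGNORECASE),  # record-number style seen in Fig. 1
}

def extract_identifiers(text):
    # Map each attribute type to the list of matches found in the text.
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

report = ("CLINICAL HISTORY: 55 year old female with a history of B-cell "
          "lymphoma (marginal zone, sh-02 22222, 2/6/01).")
print(extract_identifiers(report))
```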

7. Real application: using anonymization (open source)

As discussed in Section 4, organizations such as the Census Bureau or hospitals collect large amounts of personal information whose publication is valuable to the public but poses a danger to the privacy of the individuals who contributed the data. CAT, the Cornell Anonymization Toolkit, addresses this by combining state-of-the-art formal privacy protection methods with an intuitive interface that interactively guides users through the data publishing workflow [25]. This section describes the system and a demonstration of its use.

7.1 System Overview

Figure 2 illustrates the major components of the system. The anonymizer uses an algorithm that, given some user-defined parameters, produces anonymized data that adheres to a user-selected privacy model. Data is anonymized using generalization [25], which transforms attribute values of non-sensitive attributes (e.g., gender, date of birth, ZIP code) into value ranges, so as to prevent an adversary from identifying individuals by linking these attributes with publicly available information. Currently, the system implements the Incognito algorithm [22] and the l-diversity model [23]. To ensure responsiveness, the dataset to be anonymized is kept in main memory, and all algorithms run against this main-memory resident data. In addition to the anonymizer, a risk analyzer evaluates the disclosure risks of records in the anonymized data, based on user-specified assumptions about the adversary's background knowledge, which can be specified through the user interface. Following the l-diversity model, the adversary is assumed to have information about the non-sensitive attributes of every individual, as well as several pieces of additional knowledge about the sensitive attributes. Each of these pieces of knowledge is modeled as a negated atom, i.e., a statement declaring that an individual is not associated with a certain sensitive value, such as "Alice does not have diabetes" or "Bob does not have cancer". The disclosure risk of an individual is quantified as the adversary's posterior probability of inferring the correct values of the individual's sensitive attributes after combining the anonymized data with the background knowledge.

Figure 2: System architecture (user interface, anonymizer engine, risk analyzer, and data storage).
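Since the toolkit builds on the l-diversity model [23], the sketch below shows the simplest variant of the underlying check, distinct l-diversity over equivalence classes; the table and attribute names are hypothetical, and this is not CAT's actual code.

```python
# Distinct l-diversity check: every equivalence class (records sharing the same
# generalized quasi-identifier values) must contain at least l distinct
# sensitive values. Attribute names and data are hypothetical.
from collections import defaultdict

def is_l_diverse(rows, quasi_ids, sensitive, l):
    groups = defaultdict(set)
    for r in rows:
        key = tuple(r[a] for a in quasi_ids)
        groups[key].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

anonymized = [
    {"age": "50-59", "zip": "0702*", "income": "30-40K"},
    {"age": "50-59", "zip": "0702*", "income": "60-70K"},
    {"age": "30-39", "zip": "0703*", "income": "30-40K"},
    {"age": "30-39", "zip": "0703*", "income": "30-40K"},
]

# False: the second equivalence class has only one distinct income value.
print(is_l_diverse(anonymized, ["age", "zip"], "income", 2))
```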

7.2 Demonstration Description

We will demonstrate our system by showing the process of anonymizing a real dataset, as illustrated in Figure 3. We begin by loading the dataset into the system, upon which the tuples in the dataset are shown in the upper-left panel of the user interface. After that, we interact with the system to produce an anonymized table by repeating the following four steps.

Figure 3: Anonymization process.
Figure 4: User interface.

Step 1: Preliminary Anonymization.

We first visit the middle-right panel, where there are two sliders that control the two parameters l and c of the l-diversity algorithm. Suppose that we do not have a clear idea of how these parameters should be set. Then we simply select some initial values for l and c, and click the Generalize button. The system now computes a new anonymized table which serves as a starting point of the anonymization process. In the steps that follow, we evaluate the quality of this anonymized table in terms of both privacy and utility. In case the table is unsatisfactory, we can refine it by adjusting the values of l and c.

Step 2: Utility Evaluation.

To get an understanding of the utility of an anonymization, we first click the Contingency Tables tab in the lower-left panel to compare the contingency tables that correspond to the original and anonymized data, respectively. A contingency table shows the frequencies for combinations of two attributes; for example, Table 3 illustrates a contingency table of gender and marital status. Intuitively, contingency tables show correlations between pairs of attributes. By examining the changes in the contingency tables before and after anonymization, we can get an idea of how the anonymization affects the characteristics of the data beyond looking at individual attributes. The two combo boxes at the top of the lower-left panel enable us to specify the two dimensions of the contingency tables. After that, we click on the Density Graphs tab, and the system depicts two density graphs that correspond to the contingency table, as shown in Figure 3. This provides a more intuitive way to evaluate the differences between the original and anonymized data. In general, the more similar the graphs are, the more useful information is retained in the anonymized table.

Table 3: Contingency table of gender and marital status

         | Male   | Female | Total
Married  | 284421 | 48590  | 333011
Divorced | 37453  | 56581  | 94034
Widowed  | 13546  | 61549  | 75095
Single   | 48105  | 49755  | 97860
Total    | 383525 | 216475 | 600000
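The before/after comparison in this step can be approximated offline; the sketch below (an assumption about how one might measure utility, not CAT's internal code) builds gender-by-marital-status cross tabulations for original and anonymized data and sums the change in cell counts.

```python
# Sketch: compare contingency tables (gender x marital status) before and after
# anonymization; the data and column names are illustrative assumptions.
import pandas as pd

original = pd.DataFrame({
    "gender":  ["male", "female", "female", "male", "male"],
    "marital": ["Married", "Single", "Married", "Divorced", "Married"],
})
# Suppose the anonymization suppressed the last record.
anonymized = original.drop(index=4)

ct_orig = pd.crosstab(original["gender"], original["marital"])
ct_anon = pd.crosstab(anonymized["gender"], anonymized["marital"])

# Total absolute change in cell counts: smaller means more utility retained.
diff = (ct_orig - ct_anon.reindex_like(ct_orig).fillna(0)).abs().to_numpy().sum()
print(ct_orig, ct_anon, f"Total cell-count change: {diff}", sep="\n\n")
```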

Step 3: Risk Evaluation.

We can now evaluate the privacy protection provided by the anonymized data. We begin by visiting the lower-right panel and specifying the amount of background knowledge that the adversary is expected to have. For example, in Figure 3, if we consider that the adversary is able to learn the ages of the individuals from publicly available information, then we can specify such knowledge by ticking the checkbox associated with Background Knowledge about Age. In addition, we can use the slider at the bottom of the panel to define the number of negated atoms that the adversary may have about the sensitive attribute. Once the background knowledge of the adversary is decided, we click the Evaluate Risk button, which triggers an update in the upper-left panel. The system first calculates the disclosure risk of every record in the dataset based on the background knowledge, and thus makes the risks of the tuples available. For example, in Figure 3, the first tuple has a 4% disclosure risk, which means that an adversary with the specified background knowledge would have 4% confidence in inferring the income of the individual corresponding to the first tuple. In addition, the system plots a histogram in the upper-right panel that illustrates the distribution of the disclosure risks of all individuals in the dataset. For the case in Figure 3, the histogram shows that the adversary has less than 20% confidence in inferring the incomes of most individuals. After inspecting the disclosure risks of the tuples, we have an intuitive understanding of the amount of privacy that is guaranteed by the anonymized table. In case both the privacy guarantee and the utility of the table are deemed sufficient, we request the system to output the table. Otherwise, we move on to the next step.
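A rough sketch of this risk computation, under our simplified reading of the model (posterior confidence within the record's equivalence class after removing sensitive values ruled out by negated atoms), is shown below; the data are hypothetical and this is not the toolkit's exact estimator.

```python
# Simplified disclosure-risk sketch: the adversary's posterior confidence that a
# target in a given equivalence class has a particular sensitive value, after
# applying negated atoms ("the target does not have X"). Hypothetical data.
from collections import Counter

def disclosure_risk(class_sensitive_values, target_value, negated_atoms):
    # Drop sensitive values the adversary knows the target cannot have.
    remaining = [v for v in class_sensitive_values if v not in negated_atoms]
    if not remaining:
        return 0.0
    return Counter(remaining)[target_value] / len(remaining)

# Sensitive values of the five records in the target's equivalence class.
values = ["flu", "flu", "cancer", "diabetes", "flu"]
# The adversary knows the target does not have diabetes.
print(disclosure_risk(values, "cancer", {"diabetes"}))  # 0.25
```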

Step 4: Manipulating Sensitive Tuples.

In this step, we have the option of applying special treatment to particular records in the table, i.e., records whose disclosure risks are much higher than those of most other tuples. Such tuples could be outliers in the dataset, and their existence may severely degrade the quality of the anonymization if they are not treated separately. To eliminate such tuples, we first specify a threshold using the slider in the upper-right panel, and then click the Delete button to remove all tuples whose disclosure risks are above the threshold. All deleted tuples can be reviewed in the Deleted Tuples tab of the panel, and can be restored whenever necessary. We can now return to Step 1 and re-adjust the parameters in the middle-right panel to generate a new anonymized table. We apply this process iteratively until we obtain a satisfactory anonymization.

Figure 5: Query precision using different de-identification options.

8. Conclusion and future works

We presented a conceptual framework as well as a prototype system for anonymizing heterogeneous health information including both structured and unstructured data. Our initial experimental results show that the system effectively detects a variety of identifying attributes with high precision, and provides flexible de-identification options that anonymize the data under a given privacy guarantee while maximizing data utility for researchers. While our work is a convincing proof of concept, several aspects remain to be explored. First, we are exploring anonymization approaches that prioritize attributes based on how important and critical they are to the privacy-preserving requirements as well as to the application needs. Second, in addition to enhancing the accuracy of (atomic) attribute extraction, a deeper and more challenging problem that we will investigate is the extraction of indirect identifying information.

Our objective in this paper has been to protect and enhance the privacy of medical data using the k-anonymization model: to enable the sharing of medical information so that knowledge can be derived and sound decisions made about patients, while maintaining patient privacy, and to compare the k-anonymity and l-diversity privacy models.

References:
1: B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey on recent developments, ACM Computing Surveys, 2010.

2: L. Sweeney, k-Anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10 (5) (2002).
3: V.S. Iyengar, Transforming data to satisfy privacy constraints, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 279-288.
4: K. Wang, P.S. Yu, S. Chakraborty, Bottom-up generalization: a data mining solution to privacy protection, in: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), November 2004.
5: B.C.M. Fung, K. Wang, P.S. Yu, Top-down specialization for information and privacy preservation, in: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, 2005, pp. 205-216.
6: I. Bhattacharya, L. Getoor, Iterative record linkage for cleaning and integration, in: DMKD'04: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2004.
7: S. Zhong, Z. Yang, R.N. Wright, Privacy-enhancing k-anonymization of customer data, in: Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2005.
8: K. LeFevre, D. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, in: ACM SIGMOD International Conference on Management of Data, 2005.
9: K. LeFevre, D. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, in: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), 2006.
10: X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, in: Thirty-second International Conference on Very Large Data Bases (VLDB), 2006, pp. 139-150.
11: Q. Zhang, N. Koudas, D. Srivastava, T. Yu, Aggregate query answering on anonymized tables, in: Proceedings of the 23rd International Conference on Data Engineering (ICDE 2007), 2007, pp. 116-125.
12: L. Sweeney, Replacing personally-identifying information in medical records: the Scrub system, Journal of the American Medical Informatics Association (1996) 333-337.
13: L. Sweeney, Guaranteeing anonymity when sharing medical data: the Datafly system, in: Proceedings of the AMIA Annual Fall Symposium, 1997.
14: S.M. Thomas, B. Mamlin, G.S. Adn, C. McDonald, A successful technique for removing names in pathology reports, in: Proceedings of the AMIA Symposium, 2002, pp. 777-781.
15: R.K. Taira, A.A. Bui, H. Kangarloo, Identification of patient name references within medical documents using semantic selectional restrictions, in: Proceedings of the AMIA Symposium, 2002, pp. 757-761.
16: D. Gupta, M. Saul, J. Gilbertson, Evaluation of a de-identification (De-Id) software engine to share pathology reports and clinical documents for research, American Journal of Clinical Pathology (2004) 176-186.
17: T. Sibanda, O. Uzuner, Role of local context in de-identification of ungrammatical, fragmented text, in: North American Chapter of the Association for Computational Linguistics / Human Language Technology, 2006.
18: R.M.B.A. Beckwith, U.J. Balis, F. Kuo, Development and evaluation of an open source software tool for de-identification of pathology reports, BMC Medical Informatics and Decision Making 6 (12) (2006).
19: O. Uzuner, Y. Luo, P. Szolovits, Evaluating the state-of-the-art in automatic de-identification, Journal of the American Medical Informatics Association 14 (5) (2007).
20: C. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
21: D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Linguisticae Investigationes 30 (7) (2007).
22: K. LeFevre, D.J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, in: SIGMOD, 2005, pp. 49-60.
23: A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-Diversity: privacy beyond k-anonymity, ACM TKDD 1 (1) (2007).
24: A. Machanavajjhala, D. Kifer, J.M. Abowd, J. Gehrke, L. Vilhuber, Privacy: theory meets practice on the map, in: ICDE, 2008, pp. 277-286.
25: P. Samarati, Protecting respondents' identities in microdata release, IEEE TKDE 13 (6) (2001) 1010-1027.
26: X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, in: VLDB, 2006, pp. 139-150.
27: X. Xiao, Y. Tao, Dynamic anonymization: accurate statistical analysis with privacy preservation, in: SIGMOD, 2008, pp. 107-120.
28: X. Xiao, Y. Tao, N. Koudas, Title suppressed due to double-blind review requirements, submitted to ACM TODS.
29: X. Xiao, K. Yi, Y. Tao, The hardness and approximation algorithms for l-diversity, submitted to the VLDB Journal.
30: X. Xiao, Y. Tao, Personalized privacy preservation, in: SIGMOD, 2006, pp. 229-240.
31: X. Xiao, Y. Tao, m-Invariance: towards privacy preserving re-publication of dynamic datasets, in: SIGMOD, 2007, pp. 689-700.
32: X. Xiao, K. Yi, Y. Tao, The hardness and approximation algorithms for l-diversity.
33: X. Xiao, Y. Tao, N. Koudas, Title suppressed due to double-blind review requirements, submitted to ACM TODS.
34: X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, in: VLDB, 2006, pp. 139-150.
35: X. Xiao, Y. Tao, Dynamic anonymization: accurate statistical analysis with privacy preservation, in: SIGMOD, 2008, pp. 107-120.
36: L. Sweeney, k-Anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5) (2002) 557-570.
37: P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information (abstract), in: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM Press, 1998, p. 188.
38: T. Dalenius, Finding a needle in a haystack, or identifying anonymous census records, Journal of Official Statistics 2 (3) (1986) 329-336.
39: L. Sweeney, k-Anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5) (2002) 557-570.
40: P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information (abstract), in: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM Press, 1998, p. 188.
41: L. Sweeney, k-Anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10 (5) (2002).
42: A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-Diversity: privacy beyond k-anonymity, in: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), 2006, p. 24.
43: T.M. Truta, B. Vinay, Privacy protection: p-sensitive k-anonymity property, in: Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDE 2006), 2006, p. 94.
44: N. Li, T. Li, t-Closeness: privacy beyond k-anonymity and l-diversity, in: Proceedings of the 23rd International Conference on Data Engineering (ICDE 2007), 2007.
45: X. Xiao, Y. Tao, m-Invariance: towards privacy preserving re-publication of dynamic datasets, in: SIGMOD Conference, 2007, pp. 689-700.
46: S.M. Thomas, B. Mamlin, G.S. Adn, C. McDonald, A successful technique for removing names in pathology reports, in: Proceedings of the AMIA Symposium, 2002, pp. 777-781.
47: D. Gupta, M. Saul, J. Gilbertson, Evaluation of a de-identification (De-Id) software engine to share pathology reports and clinical documents for research, American Journal of Clinical Pathology (2004) 176-186.
48: R.M.B.A. Beckwith, U.J. Balis, F. Kuo, Development and evaluation of an open source software tool for de-identification of pathology reports, BMC Medical Informatics and Decision Making 6 (12) (2006).
49: R.K. Taira, A.A. Bui, H. Kangarloo, Identification of patient name references within medical documents using semantic selectional restrictions, in: Proceedings of the AMIA Symposium, 2002, pp. 757-761.
50: S.M. Thomas, B. Mamlin, G.S. Adn, C. McDonald, A successful technique for removing names in pathology reports, in: Proceedings of the AMIA Symposium, 2002, pp. 777-781.
51: T. Sibanda, O. Uzuner, Role of local context in de-identification of ungrammatical, fragmented text, in: North American Chapter of the Association for Computational Linguistics / Human Language Technology, 2006.
52: C. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
53: D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Linguisticae Investigationes 30 (7) (2007).
54: R.M.B.A. Beckwith, U.J. Balis, F. Kuo, Development and evaluation of an open source software tool for de-identification of pathology reports, BMC Medical Informatics and Decision Making 6 (12) (2006).
55: A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, in: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Paris, France, June 2004.
56: L. Sweeney, Guaranteeing anonymity when sharing medical data: the Datafly system, in: Proceedings of the AMIA Annual Fall Symposium, 1997.
57: C. Dwork, Differential privacy, in: Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), Venice, Italy, 2006, pp. 1-12.
