1 Introduction
Part of the difficulty of working with data that come from sensitive sources, such
as health or financial data, is protecting the privacy of the individuals or
organizations to which the data relate. Such data need to be anonymized using data
anonymization techniques and methods, which is a prerequisite for utilizing the data
while at the same time preserving privacy. Various means are used for data
anonymization, including algorithms and physical equipment. Cormode
and Srivastava [5] state that the result of data anonymization is anonymized data
which are, essentially, a set of possible worlds, one of which corresponds to the
original data.
Data anonymization has been the subject of research and patenting activities in re-
cent years. Development related to data anonymization is being reinforced across
many industrial applications. That makes it important for both researchers and busi-
ness practitioners to be aware of patents in the field of data anonymization across
different companies, industries, and countries, which is the purpose of this paper.
According to the World Intellectual Property Organization, a patent is an exclusive
right granted for an invention - a product or a process which provides a new way of
doing something, or a new technical solution to a problem [14]. It offers the exclusive
right to stop or prevent others from commercially making, using, distributing,
importing or selling the patented invention without the patent owner's permission.
Those rights are only valid in the country or region where a patent has been filed or
granted [11]. Also, the protection is granted for a limited period, generally 20 years
from the filing date of the application [14].
Many countries use national patent systems based on "world patent applications"
that are made under the Patent Cooperation Treaty [14]. The World Intellectual
Property Organization maintains a database of published international patent
applications, using the International Patent Classification (IPC) system, which was
established in 1971 by the Strasbourg Agreement and is nowadays used in more than
100 countries worldwide [14, 9]. Additionally, there are two important classification
systems used by the largest patent offices, both based on the IPC: 1) the Cooperative
Patent Classification (CPC) system, developed by the European Patent Office (EPO)
and the United States of America, and 2) the File Index (FI) Japanese patent
classification system [11].
Patent documents are highly structured, providing a rich source of information
[9]. They contain fields such as patent title, description, simple family ID,
publication/issue year, filing/application year, assignee country, assignee
original/inventors, IPC codes, CPC codes, and FI codes. Patent landscape, also known
as a Competitive Technical Intelligence Report, White Space Analysis or Technical
Gap Analysis, is a study which uses a large set of patents data to extract useful
information for understanding a particular field [13]. It aims to give an overview of a
particular field and provide insights to decision makers. Such insights include, for
example, the publication trend of patents over time or the filing trend by technology;
the top assignees, i.e. which companies are filing how many patents;
and how patents are spread across countries [13]. Other approaches are also often
used [7]. For example, Noh, Jo and Lee [10] focused on keyword strategies for patent
analysis and offered guidelines on the selection and processing of keywords for
patent analysis, and Brügmann et al. [2] presented an operational prototype of a
workbench for patent document analysis and summarization. Text mining and
visualization based approaches have also been used for analyzing patent content in
the vast body of literature [1].
Numerous researchers have developed patent landscapes for different technology
fields. Some examples are given here with a brief presentation of the methodology.
Han and Sohn [8] identified technological convergence in standards related to
information and communication technology by applying social network analysis
and association rules analysis. Choi and Hwang [4] analyzed patents related to the
Light Emitting Diode and wireless broadband fields by using trend analysis and a
method that combines network-based and keyword-based research. Patent
analysis was used to explore virtualization technology development in the USA [12],
analyzing the technology life cycle, assignee organization and country, patent
classification and patent citations. In [3], the authors investigated technological
pervasiveness and variety of innovators in Green ICT, using network analysis.
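The goal of this paper is to develop a patent landscape of the data anonymization
related patents during the period from 2001 to 2015 (15th August 2015) by providing
answers to the following research questions:
RQ1: What is the trend in data anonymization patenting?
RQ2: What technical content related to data anonymization is protected, as classified
by the IPC system (at the sub-class and group level)?
RQ3: Which organizations, from which countries, patented their innovations related to
data anonymization?
RQ4: What themes emerge most often as the subject of the patenting process related to
data anonymization?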
To the best of our knowledge, there has been no analysis of the patents related to
data anonymization. As an attempt to develop a patent landscape of data
anonymization approaches, this study is expected to help in understanding this area
and to shed more light on the means of protecting data privacy.
2 Methodology
The development of the patent landscape consists of four stages related to: (i) the
patent selection and trend analysis, (ii) the areas of technology analysis, (iii) assignee
country and organization analysis, and (iv) text mining analysis.
As a source for the patent search and selection, we have used the PatSeer database,
which is an online global patent database covering the patent activity in 121 countries,
stored in the form of simple patents and patent families. A patent family consists of a
set of patent applications filed in different countries in order to protect the
innovation in a wider geographical area.
In order to detect the patents related to data anonymization, we have searched for
patents that have in their title the word "data" and one of the following words:
"anonymizing", "anonymization", "anonymized", "anonymizy" and "anonymize". The
British English spelling of the words was also covered, e.g. "anonymisation". The
PatSeer database was searched on 15th August 2015, using the search string
(TA:(data AND anonym*)), with the option for searching simple patent families.
Possible statuses of a patent are: active, inactive-rejected, refused, suspended, and
inactive withdrawn/surrendered. In our analysis we have focused only on the active
simple patent families.
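The title search can be approximated offline on an exported list of titles. The following is a minimal sketch, assuming that the wildcard anonym* matches any word beginning with "anonym" (which covers both the -ize and -ise spellings) and that "data" must appear as a separate word. The function name and the sample titles are hypothetical, and the exact matching semantics of the PatSeer query engine are not reproduced here.

```python
import re

# Assumption: "anonym*" matches any word starting with "anonym", and "data"
# must occur as a whole word; matching is case-insensitive.
DATA_RE = re.compile(r"\bdata\b", re.IGNORECASE)
ANONYM_RE = re.compile(r"\banonym\w*", re.IGNORECASE)


def matches_title_query(title: str) -> bool:
    """Return True if the title contains 'data' AND a word starting with 'anonym'."""
    return bool(DATA_RE.search(title)) and bool(ANONYM_RE.search(title))


# Hypothetical sample of titles, for illustration only.
titles = [
    "Anonymizer data collection device",
    "Patient community system with anonymized electronic medical data",
    "Method for data anonymisation in distributed databases",
    "Database indexing method",
    "Anonymous routing protocol",
]

# Keep only the titles that satisfy both conditions of the query.
selected = [t for t in titles if matches_title_query(t)]
```

Note that such a wildcard also admits words like "anonymous", so the result set can be broader than the five keywords listed above.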
Fig. 1. IPC hierarchical levels (Source: According to World Intellectual Property Organisation,
2015, p. 6)
The section is the highest level of the IPC hierarchy. It is considered a
very broad indication of the technological content [14]. The IPC contains eight
sections, divided into classes, and each class contains one or more subclasses. Finally,
each subclass is broken down into groups [14]. Patents related to data anonymization
are most often classified under the sections G-Physics and H-Electricity. Some
examples of subclasses with data anonymization patents are: G06F-Electric
digital data processing, and G06Q-Data processing systems or methods. Some
examples of groups are: H04L9/00-Arrangements for secret or secure communication
and G06F21/62-Security arrangements for protecting computers, components thereof,
programs or data against unauthorized activity - Protecting access to data.
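The hierarchy described above can be read directly from a group-level code. The sketch below splits a code such as G06F21/62 into its section, class, subclass and group. It assumes codes of the form: section letter, two-digit class, subclass letter, main group digits, a slash, and subgroup digits. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple


class IpcCode(NamedTuple):
    section: str    # e.g. "G"
    ipc_class: str  # e.g. "G06"
    subclass: str   # e.g. "G06F"
    group: str      # e.g. "G06F21/62"


# Assumed pattern for a group-level IPC code; an optional space is tolerated
# between the subclass and the main group (as in "H04L 9/00").
IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(code: str) -> IpcCode:
    """Split a group-level IPC code into its hierarchical levels."""
    m = IPC_RE.match(code.strip().upper())
    if m is None:
        raise ValueError(f"not a group-level IPC code: {code!r}")
    section, cls, sub, main, subgroup = m.groups()
    return IpcCode(
        section=section,
        ipc_class=section + cls,
        subclass=section + cls + sub,
        group=f"{section}{cls}{sub}{main}/{subgroup}",
    )


example = parse_ipc("G06F21/62")
```

Grouping patent families by `example.subclass` or `example.group` then gives counts at the corresponding level of the hierarchy.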
In this research, we analyze the active simple patent families related to data
anonymization according to the sections, subclasses and groups. In addition, we use
association rules analysis at the IPC group level in order to determine how
heterogeneous the technical content protected by the patenting process is.
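To illustrate what support and confidence mean at the IPC group level, the sketch below derives one-to-one rules (group A implies group B) from sets of IPC groups, one set per simple patent family. It is a simplified illustration and not the algorithm implemented in the software used in this study. The function name `pairwise_rules`, the toy families and the thresholds in the usage line are assumptions made for the example.

```python
from itertools import permutations


def pairwise_rules(transactions, min_support=0.01, min_confidence=0.01):
    """Return rules (A, B, support, confidence) for single-item A -> B.

    support(A -> B)    = share of families that contain both A and B
    confidence(A -> B) = share of families containing A that also contain B
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for t in transactions:
        items = set(t)
        for a in items:
            item_count[a] = item_count.get(a, 0) + 1
        # Ordered pairs, because A -> B and B -> A are different rules.
        for a, b in permutations(sorted(items), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        confidence = c / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Strongest rules first; ties are broken alphabetically.
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))


# Hypothetical toy data: the IPC groups assigned to four patent families.
families = [
    {"G06F21/60", "G06F21/62"},
    {"G06F21/62", "G06F17/30"},
    {"G06F21/60", "G06F21/62", "G06F17/30"},
    {"H04L29/06"},
]

rules = pairwise_rules(families, min_support=0.5, min_confidence=0.6)
```

With few groups per family, only a small number of pairs can pass even low thresholds, which is why a short rule list points to weak dependencies between codes.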
2.3 Stage 3: Patent analysis according to the assignee country and
organization
According to [13], the assignee is the entity that has the property right to the
patent. The assignee is not necessarily the inventor of the new knowledge, since it is
more likely that the patent is assigned to the organization in which the inventor is
employed. In this research we conduct an extensive analysis of organizations and
countries, focusing on the longitudinal trend when possible. The aim was to determine
which countries and organizations are the top assignees of patents related to data
anonymization.
In order to detect the main themes that emerge in patents related to data
anonymization, a text mining approach was utilized. Text mining of the titles of simple
patent families was used in order to determine what themes emerge most often as the
subject of the patenting process related to data anonymization. In order to reduce the
variability of the words, different approaches such as filtering, lemmatization or
stemming can be used [9]. We have used the Statistica Text Mining software in order to
apply the stemming method. Examples of stemming techniques are removing
"ing" from words and "s" from the plural of nouns. By using the stemming algorithm, we
have built the stems, which are natural groups of words with similar or even equal
meaning. For example, the stemming algorithm develops the stem "analy", which
represents the words "analysis" and "analytics".
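The idea of suffix stripping can be shown with a deliberately naive stemmer. This is only a sketch: the suffix list and the minimum stem length are hypothetical choices made so that the examples in the text work, and they do not reproduce the stemming rules of the software used in this study.

```python
import re
from typing import List

# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("tics", "sis", "ing", "ed", "es", "s")
MIN_STEM_LEN = 3  # assumption: never cut a word down to fewer than 3 letters


def naive_stem(word: str) -> str:
    """Strip the first matching suffix from a lower-cased word."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LEN:
            return w[: -len(suffix)]
    return w


def stem_title(title: str) -> List[str]:
    """Tokenize a title into alphabetic words and stem each of them."""
    tokens = re.findall(r"[A-Za-z]+", title)
    return [naive_stem(t) for t in tokens]


stems = stem_title("Protecting data systems")
```

Under these assumed rules, "analysis" and "analytics" both reduce to "analy", and "devices" reduces to "devic".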
3 Results
Figure 2 presents the patent dynamics for the period between 2001 and August
15th, 2015. An increasing trend is present: in the period from 2001 to 2010,
fewer than 10 simple patent families were registered per year. After that period, the
number of simple patent families increased, reaching 83 simple patent families
registered in 2014. Our data is missing patents that may have been assigned later than
15th August 2015.
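[Figure 2: chart of simple patent families per year; data labels in extraction order: 83, 83, 35, 26, 18, 9, 6, 8, 10, 3, 2, 1, 5, 3, 4]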
Fig. 2. Number of data anonymization simple patent families (2001- August 15th, 2015)
Source: Authors; Patseer [15th August 2015]
The IPC group analysis revealed the following results. The largest number of data
anonymization simple patent families were assigned within the group G06F17/30-Digital
computing or data processing equipment or methods adapted for specific functions -
Information retrieval; Database structures (69 simple patent families). The second
most frequent IPC group is G06F21/60-Security arrangements for protecting computers,
components, programs or data against unauthorized activity - Protecting data (53
simple patent families). The third most frequent is the group H04L29/06-Arrangements,
apparatus, circuits or systems, not covered by a single one of groups H04L 1/00-H04L
27/00 - Characterized by a protocol (39 simple patent families). More detailed
information on the number of patents related to data anonymization at the IPC group
level is presented in Appendix 2.
In order to detect the level of heterogeneity of the technical content related to
data anonymization protected by the patenting process, we have used association
rules analysis [9]. Most of the patents were assigned to more than one IPC group, and
616 IPC group assignments were identified for the 296 simple patent families. This
indicates that, on average, one simple patent family is registered under approximately
2 IPC groups (616/296 is about 2.1). Only 12 rules were generated under minimal
support and confidence at the 1% level, which indicates that it is difficult to find
dependencies between different IPC group level codes. The results are presented in
Figure 3.
Fig. 3. Association rules network at the IPC group level (Source: Authors; PatSeer [15th
August 2015]; Statistica Text Miner)
According to the set limitations and generated rules, we conclude that heterogeneity
does not characterize the protected technical content related to data anonymization.
The results reveal that the following IPC groups were most often registered together:
G06F21/60-Security arrangements for protecting computers, components, programs
or data against unauthorized activity - Protecting data; G06F21/62-Security
arrangements for protecting computers, components thereof, programs or data against
unauthorized activity - Protecting access to data via a platform; G06F17/30-Digital
computing or data processing equipment or methods adapted for specific functions -
Information retrieval; Database structures; and G06F17/00-Digital computing or data
processing equipment or methods, specially adapted for specific functions.
3.3 Patent assignee organization and country analysis
Our search revealed that patenting activity is spread across different countries,
but the majority of patents related to data anonymization have been assigned to
organizations from the USA and Japan. In some cases, organizations from two or more
countries were joint assignees. Figure 4 outlines the patent dynamics according to
countries for the period between 2001 and August 15th, 2015. The USA is the leading
country, and its organizations began publishing patents on data anonymization as
early as 2001. Other countries followed later. European countries with more
than five patents in the given period are Germany (29 simple patent families),
Switzerland (18 simple patent families), France (13 simple patent families) and
Ireland (8 simple patent families). Our data is missing patents that may have been
assigned later than 15th August 2015. Appendix 3 lists the number of patents related
to data anonymization by assignee country for the period between 2001 and August
15th, 2015.
Fig. 4. Number of data anonymization simple patent families per country (2001- August 15th,
2015); countries with more than 5 simple patent families; Source: Authors; Patseer [15th
August 2015]
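The country shares are plain proportions of the 296 analyzed simple patent families. As a minimal sketch, the counts below are the single-country figures given in the appendix for the countries with more than five simple patent families. The function and variable names are hypothetical.

```python
# Counts of simple patent families per assignee country code, as listed in
# the appendix (single-country rows only, more than five families each).
counts = {"US": 131, "JP": 59, "DE": 29, "CH": 18, "FR": 13, "KR": 9, "IE": 8}
TOTAL_FAMILIES = 296  # all active simple patent families in the analysis


def shares(country_counts, total):
    """Return the percentage share of each country, rounded to two decimals."""
    return {code: round(100 * n / total, 2) for code, n in country_counts.items()}


country_share = shares(counts, TOTAL_FAMILIES)

# Combined share of the two leading countries.
us_jp_share = round(100 * (counts["US"] + counts["JP"]) / TOTAL_FAMILIES, 2)
```

The combined value for the USA and Japan is what supports the statement that the majority of the families come from these two countries.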
Table 1 presents the number of simple patent families related to data anonymization
according to assignee organization and country. The assignee of a patent is an
organization, i.e. a company or an academic institution, or in some cases an
individual person. The organization with the largest number of simple patent families
related to data anonymization in the observed period is NEC, registered in Japan
(22 simple patent families or 7.61%), followed by IBM, registered in the United States
of America (21 simple patent families or 7.27%). Other organizations that registered
a larger number of simple patent families are also multinational organizations, such
as Microsoft, Alcatel, Google, Siemens, Mastercard, and Amazon.
Table 1. Number of simple patent families related to data anonymization according to assignee
organization and country (Source: Authors; PatSeer [15th August, 2015])
In order to provide a more intuitive insight into the themes that occur in the titles of
the simple patent families related to data anonymization, a tag cloud analysis was
conducted [6]. The tag cloud has become a common way of visualizing the most
frequently occurring themes, since it displays the most frequent words in the analyzed
text, relating the size of each word to its relative frequency. Therefore, the words
that occur more often are larger.
The Wordle software was used in order to generate a tag cloud of the stems that
occurred most often in the titles of the simple patent families related to data
anonymization. In order to increase the readability of the cloud, we have applied the
tag cloud algorithm to the stems that occurred more than 5 times in the titles of
the simple patent families. We have excluded the stems "data" and "anonym", since
they occurred in every title due to the fact that these words were the criteria for
the selection of the simple patent families in the analysis. Two other stems that
occurred often were also omitted from the analysis: "system" and "method".
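The filtering and sizing steps described above can be sketched as follows. The linear scaling between a minimum and a maximum font size is an assumption made for illustration and is not the layout algorithm of Wordle; the function name, the point sizes and the toy stem list are hypothetical.

```python
from collections import Counter

# Stems excluded from the cloud, as described in the text.
EXCLUDED = {"data", "anonym", "system", "method"}


def tag_cloud_sizes(stems, min_count=6, min_pt=10, max_pt=48):
    """Map each retained stem to a font size that grows with its frequency.

    Stems must occur more than 5 times (min_count=6) to be retained.
    Assumed scaling: the most frequent stem gets max_pt, and the other stems
    are scaled linearly by their frequency relative to it.
    """
    freq = Counter(s for s in stems if s not in EXCLUDED)
    kept = {s: c for s, c in freq.items() if c >= min_count}
    if not kept:
        return {}
    top = max(kept.values())
    return {s: round(min_pt + (max_pt - min_pt) * c / top) for s, c in kept.items()}


# Hypothetical stem list: "medic" is too rare and "data" is excluded.
toy_stems = ["devic"] * 12 + ["protect"] * 6 + ["medic"] * 3 + ["data"] * 20
sizes = tag_cloud_sizes(toy_stems)
```

Excluding the selection keywords keeps them from dominating the cloud, since they would otherwise receive the largest size in every case.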
Table 2. Most often used words in titles of the patents related to data anonymization; ≥ 10
patents (Source: Authors; PatSeer [15th August 2015]; Statistica Text Miner)
Table 2 indicates that the stems "method" and "system" occurred most often
within the titles of simple patent families related to data anonymization. The
following groups of topics were also identified: (i) themes related to physical
equipment, such as "devic", "comput" or "apparatus"; (ii) themes related to software,
such as "program", "process", "analy" or "manag"; (iii) themes related to the goal of
the patent, such as "protect", "ident", "encrypt" or "privac"; and (iv) some specific
themes related to the areas of implementation, such as "commun", "medic", or
"service". Examples of patents related to the above-mentioned groups are: (i)
US20040199789A1: Anonymizer data collection device; (ii) US20080287118A1: Method,
apparatus and computer program for anonymization of identification data; (iii)
DE102007033667A1: Method and apparatus for an anonymous encrypted mobile data
- and voice communication, and (iv) US20100070306A1: Patient community system
with anonymized electronic medical data.
Fig. 5. Tag cloud of the most often used words in patent titles related to data anonymization; >
5 simple patent families (Source: Authors; PatSeer [15th August, 2015]; Wordle.org)
4 Conclusion
The paper presents an examination of data anonymization related simple patent
families, based on data gathered from PatSeer. We have analyzed 296 active simple
patent families related to data anonymization assigned from 2001 to 15th August
2015. The analysis was conducted in four stages: (i) detecting the trend in data
anonymization patenting, (ii) patent classification according to the areas of
technology, (iii) assignee organization and country analysis, and (iv) text mining
utilization for theme identification. The analysis revealed answers to the research
questions that provide insights into the data anonymization patent landscape.
The first research question (RQ1) aimed at detecting the trend in data
anonymization patenting. The number of simple patent families has been growing, with a
strong increase after 2010, and especially after 2014, thus indicating a positive
trend in the area of patenting data anonymization solutions. Such an increase is the
result of increased awareness of the necessity of data privacy protection, and also
of the new challenges (e.g. big data) that lie ahead in this area.
The second research question (RQ2) aimed at detecting protected technical content
related to data anonymization classified using the IPC system (at the sub-class and
group level). The majority of simple patent families related to data anonymization were
assigned to section G - Physics of the IPC system. The G section sub-classes with the
most patents are G06F - Electric digital data processing and G06Q - Data processing
systems or methods. Within the sub-class G06F, the largest number were assigned to the
group G06F17/30 - Digital computing or data processing equipment or methods adapted
for information retrieval and database structures. Therefore, the protection of data
privacy in databases and for information retrieval has attracted the most attention
from inventors, which is the result of the omnipresent digitization of information.
Association rules analysis revealed that the patents with more than one IPC group
were homogeneous, since all of the co-occurring IPC groups were from the sub-class
G06F - Electric digital data processing.
The third research question (RQ3) aimed at determining which organizations from which
countries patented their innovations related to data anonymization. According to the
patent analysis, data anonymization technology is spread across different
countries, but the majority of simple patent families related to data anonymization
have been assigned to organizations from the USA and Japan. NEC, registered in
Japan, holds the greatest number of patents in the observed period, followed by IBM,
registered in the USA. Numerous multinational corporations, such as Google,
Microsoft, Amazon and MasterCard, have also registered a substantial number of
patents related to data anonymization.
The fourth research question (RQ4) aimed at detecting what themes emerge most
often as the subject of the patenting process related to data anonymization. The most
often used word in titles of the patents related to data anonymization was "anonym*",
followed by "method", "data" and "system". Several additional groups that indicate
the most frequent themes related to data anonymization were detected: physical
equipment, software, protection, identification, encryption or privacy, and specific
themes such as community, medical, or service.
Limitations of this work result from the fact that we have focused only on
simple patent families that have in the title the word "data" and one of the following
words: "anonymizing", "anonymization", "anonymized", "anonymizy" and "anonymize".
Hence, patents that have these words in the abstract, but not in the title, are omitted
from the analysis. Furthermore, the analysis was conducted for only part of the
year 2015, which prevented us from drawing conclusions for the most recent period.
Further research recommendations emerge from these limitations, pointing to the need
to also include the abstract and full text in the analysis. Since this would lead to a
much larger number of results, a text mining approach should be fully utilized in such
research in order to automate the process of analysis.
Appendices
Appendix 1. Number of patents related to data anonymization according to the IPC system -
Sub-class level (Source: Authors; Patseer [15th August 2015])
Code  Code description                                                        Simple patent families
A     Human necessities
A61B  Diagnosis; Surgery; Identification                                      3
B     Performing operations; transporting
B60Q  Arrangement of signaling or lighting devices                            1
B65G  Transport or storage devices                                            1
C     Chemistry; Metallurgy
C12N  Micro-organisms or enzymes; compositions                                1
G     Physics
G01C  Measuring distances, levels or bearings; surveying; navigation;         1
      gyroscopic instruments; photogrammetry; videogrammetry
G01D  Measuring not specially adapted for a specific variable and variables   2
      not covered by a single other subclass
G01R  Measuring electric and magnetic variables                               3
G05B  Control or regulating systems                                           2
G06F  Electric digital data processing                                        283
G06K  Recognition and presentation of data; record carriers                   9
G06N  Computer systems based on specific computational models                 6
G06Q  Data processing systems or methods                                      124
G06T  Image data processing or generation                                     4
G07B  Ticket-issuing apparatus; taximeters; apparatus for collecting fares,   1
      tolls or entrance fees; franking apparatus
G07C  Time or attendance registers; registering or indicating the working of  5
      machines; generating random numbers; voting or lottery apparatus;
      arrangements, systems or apparatus for checking
G07G  Registering the receipt of cash, valuables, or tokens                   6
G08B  Signalling or calling systems; order telegraphs; alarm systems          2
G08C  Transmission systems for measured values and control signals            2
G08G  Traffic control systems                                                 2
G09B  Educational or demonstration appliances                                 1
G09C  Ciphering or deciphering apparatus                                      8
G10L  Speech analysis or synthesis, recognition, processing, coding or        3
      decoding
G11B  Information storage based on relative movement between record           1
      carrier and transducer
H     Electricity
H04H  Broadcast communication                                                 5
H04L  Transmission of digital information                                     101
H04M  Telephonic communication                                                7
H04N  Pictorial communication                                                 6
H04Q  Selecting                                                               3
H04W  Wireless communication networks                                         21
Missing data                                                                  2
Appendix 2. Number of patents related to data anonymization according to the IPC system -
Group level (Source: Authors; PatSeer [15th August 2015])
Appendix 3. Number of patents related to data anonymization according to the assignee
country (2001 - August 15th, 2015)
Assignee country / region                Assignee country code  Number of patents  %
The United States of America             US                     131                44.26%
Japan                                    JP                     59                 19.93%
Germany                                  DE                     29                 9.80%
Switzerland                              CH                     18                 6.08%
France                                   FR                     13                 4.39%
South Korea                              KR                     9                  3.04%
Ireland                                  IE                     8                  2.70%
United Kingdom                           GB                     4                  1.35%
Sweden                                   SE                     4                  1.35%
Australia                                AU                     3                  1.01%
Finland                                  FI                     3                  1.01%
India                                    IN                     2                  0.68%
Spain                                    ES                     2                  0.68%
United States of America-United Kingdom  US-GB                  2                  0.68%
The United States of America-Japan       US-JP                  2                  0.68%
Austria                                  AT                     1                  0.34%
Denmark-United States of America         DE-US                  1                  0.34%
Ireland-United States of America         IE-US                  1                  0.34%
Israel                                   IL                     1                  0.34%
Norway                                   NO                     1                  0.34%
Russia                                   RU                     1                  0.34%
The United States of America-Australia   US-AU                  1                  0.34%
Total                                                           296                100.00%
References
1. Abbas, A., Zhang, L., Khan, S.U.: A literature review on the state-of-the-art in patent
analysis. World Patent Information. 37, 3-13 (2014)
2. Brügmann, S. et al.: Towards content-oriented patent document processing: Intelligent
patent analysis and summarization. World Patent Information. 40, 30-42 (2015)
3. Cecere, G., Corrocher, N., Gossart, C., Ozman, M.: Technological pervasiveness and
variety of innovators in Green ICT: A patent-based analysis. Research Policy. 43(10),
1827-1839 (2014)
4. Choi, J., Hwang, Y.-S.: Patent keyword network analysis for improving technology
development efficiency. Technological Forecasting and Social Change. 83, 170-182
(2014)
5. Cormode, G., Srivastava, D.: Anonymized Data: Generation, models, usage. In: 26th IEEE
International Conference on Data Engineering, pp. 1211-1212. IEEE, Long Beach (2010)
6. De Spindler, A., Leone, S., Nebeling, M., Geel, M., Norrie, M.C.: Using
Synchronised Tag Clouds for Browsing Data Collections. In: Mouratidis, H., Rolland, C.
(eds.) Advanced Information Systems Engineering. LNCS, vol. 6741, pp. 214-228.
Springer, Heidelberg (2011)
7. Grant, E., Van den Hof, M., Gold, E.R.: Patent landscape analysis: A methodology in
need of harmonized standards of disclosure. World Patent Information. 39, 3-10 (2014)
8. Han, E.J., Sohn, S.Y.: Technological convergence in standards for information and
communication technologies. Technological Forecasting and Social Change. 106, 1-10
(2016)
9. Kim, J., Lee, S.: Patent databases for innovation studies: A comparative analysis of
USPTO, EPO, JPO and KIPO. Technological Forecasting and Social Change. 92, 332-345
(2015)
10. Noh, H., Jo, Y., Lee, S.: Keyword selection and processing strategy for applying text
mining to patent analysis. Expert Systems with Applications. 42(9), 4348-4360 (2015)
11. Patent Lens, http://www.bios.net/daisy/patentlens/ip/around-the-world.html
12. Sheau-Pyng, J., Ming-Fong, L., Chin-Yuan, F.: Using Patent Analysis to Analyze the
Technological Developments of Virtualization. Procedia-Social and Behavioral Sciences.
57, 146-154 (2012)
13. Sinha, M., Pandurangi, A.: Guide to Practical Patent Searching and how to use PatSeer for
Patent Search and Analysis. Gridlogics Technologies, Pune (2015)
14. World Intellectual Property Organization: Guide to the IPC. WIPO (2015). Available at:
http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
[Accessed April 21, 2016]