
PRACTITIONER'S TOOLKIT

BEING SMART WITH DATA, USING INNOVATIVE SOLUTIONS

Employment, Social Affairs and Inclusion

MARCH 2017
Europe Direct is a service to help you find answers to your questions about the European Union.

Freephone number (*): 00 800 6 7 8 9 10 11
(*) The information given is free, as are most calls (though some operators, phone boxes or hotels may charge you).

More information on the European Union is available on the internet (http://europa.eu).

Luxembourg: Publications Office of the European Union, 2017

ISBN 978-92-79-66962-0
doi:10.2767/797031

© European Union, 2017

Reproduction is authorised provided the source is acknowledged.

Cover picture: © European Union

The European Network of Public Employment Services was created following a Decision of the European Parliament and Council in June 2014 (DECISION No 573/2014/EU). Its objective is to reinforce PES capacity, effectiveness and efficiency. This activity has been developed within the work programme of the European PES Network. For further information: http://ec.europa.eu/social/PESNetwork.

This activity has received financial support from the European Union Programme for Employment and Social Innovation "EaSI" (2014-2020). For further information please consult: http://ec.europa.eu/social/easi

LEGAL NOTICE
This document has been prepared for the European Commission; however, it reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.
Written by Dr. Willem Pieterson, the Center for e-Government Studies, in collaboration with ICF

MARCH 2017
Contents

CHAPTER 1. INTRODUCTION
1.1 What is this toolkit about?
1.2 Who this toolkit is for
1.2.1 PES with little or no experience working with data
1.2.2 PES with some or intermediate levels of experience
1.2.3 PES with advanced levels of experience
1.2.4 Tactical or operational managers
1.3 Strategic managers
1.4 Scope of this toolkit
1.5 Reading guide

CHAPTER 2. GETTING STARTED WITH DATA
2.1 Creating a plan
2.1.1 Deductive approaches
2.1.2 Inductive approaches
2.2 Creating your data team
2.2.1 Leadership
2.2.2 Data team members
2.3 Setting up the data infrastructure
2.4 Creating a data catalogue
2.5 Costs and budgeting

CHAPTER 3. ORGANISING DATA
3.1 Cleaning & Sanitising
3.2 Describing data & data characteristics
3.3 Quality control
3.4 Integrating data sources
3.5 Security and Data Protection

CHAPTER 4. ANALYSING DATA
4.1 Overview
4.2 Statistics
4.3 Data mining & KDD
4.4 Advanced Analytics
4.4.1 Artificial Intelligence
4.4.2 Machine Learning
4.4.3 Deep Learning
4.5 Combinations & Derivations

CHAPTER 5. PRESENTING & REPORTING
5.1 Why move away from traditional reports?
5.2 (Interactive) Visualisations
5.3 Interactive Tools & Dashboards
5.4 Open data

CHAPTER 6. EVALUATION & CONTINUATION
6.1 Evaluation
6.2 Continuation and scale-up of pilots

APPENDICES
Appendix 1 | Safe Harbor De-identification types

Chapter 1. Introduction

The role and importance of data in our society is growing fast. Not only do we collect more and more data, but advancements in computing power, as well as in the tools and algorithms to analyse data, allow organisations to use data in entirely new ways. This also applies to Public Employment Services (PES). A few PES are exploring the use of Big Data to improve the efficiency and effectiveness of processes, improve customer satisfaction and/or innovate in order to transform how the PES functions. This toolkit could help them on this journey. Most PES, however, are at the start of this journey, and this toolkit can also help PES who want to start using their data in better and smarter ways.

1.1 What is this toolkit about?

This toolkit is about data and the use of data to create a better functioning PES. Among the chief reasons to have a toolkit about data is that PES, like most other organisations, are collecting more and more data. And as data is being stored in production systems and/or data warehouses, the possibility arises to use this data for all kinds of purposes. The infographic below illustrates a) the tremendous growth in (world-wide) data production, b) the potential (cost) benefits of using so-called big data and c) the anticipated future growth of data production (e.g. due to the growth of the Internet of Things).

Figure 1: Infographic about big data. The infographic highlights: 100,000,000,000 in potential savings in operational efficiency at governments in developed European countries due to big data*; only 0.5% of all data are being analysed, and that percentage is shrinking**; the amount of data produced doubles every year***; a 44x growth in data production between 2009 and 2020****; and 26,000,000,000 connected devices on the Internet of Things (IoT) in 2020*****.

* http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode
** http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
*** https://www.technologyreview.com/business-report/big-data-gets-personal/
**** http://www.gartner.com/newsroom/id/2684616
***** http://www.emc.com/about/news/press/2011/20110628-01.htm

With the expected future growth of data production, several challenges arise:
- How to unlock the potential of the data we already have to improve efficiency, effectiveness, customer satisfaction or any other goal?
- How to set up the infrastructure in the organisation now, so that we gain experience with data analytics before we drown in a sea of data?
- How to integrate data analytics into the DNA of the organisation, so that organisational decision making improves and organisational agility increases?

This toolkit aims to help organisations, specifically PES, to get started with finding answers to these questions.

Even though the topic of this toolkit is data, the true goal is to help PES transform their existing data into Information, Knowledge, and subsequently Wisdom. In each stage of transformation, the data is enriched, as the schema below illustrates:

Data (Raw) → Information (Meaning) → Knowledge (Context) → Wisdom (Application)

Figure 2: Data transformation process

STEP: Data (raw data)
TRANSFORMATION: None
EXPLANATION: This is plainly data, the numbers as you would extract them from any system.
EXAMPLE(S): 63 [just the plain number]

STEP: Information
TRANSFORMATION: Adding meaning
EXPLANATION: When transforming data into information, we add basic meaning to the data. Very often this happens by adding units, variables and definitions.
EXAMPLE(S): 63% of lower educated clients are unhappy with PES service levels.

STEP: Knowledge
TRANSFORMATION: Adding context
EXPLANATION: When adding context, we are able to make sense of the data; it starts telling a story.
EXAMPLE(S): 63% of lower educated clients are unhappy with PES service levels, compared to 48% of higher educated clients.

STEP: Wisdom
TRANSFORMATION: Adding application
EXPLANATION: Turning knowledge into action is the last step of the process. By combining data points, you can create actionable results.
EXAMPLE(S): 63% of lower educated clients are unhappy with PES service levels, compared to 48% of higher educated clients. This correlates with their evaluation of language difficulty on the PES website. Changing the language level could solve this problem.

The toolkit is based on the thematic review workshop on modernising PES through data and IT systems¹. The workshop revealed a need to understand the topic of data in more detail and explore how PES can benefit from advancements in this field. A core concept in this toolkit is that of smart data. By using the concept of smart data, we want to prevent a bias towards data being a normative goal in itself. Smart data is seen as the sum of:
- (Big) Data [the data itself]
- Utility [the potential utility derived from the data]
- Semantics [the semantic understanding of the data]
- Data Quality [the quality of the data collected]
- Security [the ways data are managed securely]
- Data Protection [how privacy and confidentiality are guarded]

These different topics are woven throughout the body of this toolkit. Similarly, to stay focused, we streamline this toolkit along the lines of the PDCA (Plan, Do, Check, Act) cycle. The first content chapter [2] focuses on how to get started and create a plan. The following chapters [3-5] focus on the actual doing. While we focus throughout the toolkit on the proper checks and balances, chapter 6 is specifically devoted to the topic of checking and evaluating. While this is a practitioner's toolkit, most content in this toolkit is actionable; but once again, specific action points after the analytical process are discussed in the final chapter.

1.2 Who this toolkit is for

The primary audience for this toolkit consists of PES who have little to no experience working with (big) data. As a practitioner's toolkit, managers on tactical and operational levels may benefit the most from the content in this toolkit, for example managers who have been tasked with analytics or data science. However, there will be uses for other audiences as well. Below we lay out how the different audiences could benefit from this toolkit.

1.2.1 PES with little or no experience working with data

This toolkit can serve as a starting point for those PES who want to get started with data. It gives an overview of relevant actions and provides practical tips on how to get started and where.

PES with little or no experience are advised to start reading at: Chapter 2 'Getting started with data'.

1.2.2 PES with some or intermediate levels of experience

For PES with some experience with data, especially the sections on advanced analytics and the various examples provided throughout this toolkit may be of interest. Furthermore, the sections on reporting and presenting data may provide new insights. Lastly, the parts on data security and protection could serve as a good refresher on matters related to security.

PES with some experience are advised to start reading at: Chapter 3 'Organising data'.

1.2.3 PES with advanced levels of experience

PES that have much experience working with data may still benefit from this toolkit. On the one hand, the toolkit may provide some new insights, especially regarding more novel developments in the advanced analytics section. Furthermore, the toolkit can serve as an introduction for new employees who join data teams. Lastly, even advanced PES may find interesting examples from other PES throughout this toolkit.

The most relevant section for PES with advanced levels of experience will be section 4.4.

1.2.4 Tactical or operational managers

Tactical or operational managers tasked with setting up data functions and/or capabilities within PES may benefit the most from this toolkit. The entire toolkit should be relevant for this audience. The toolkit will help you get familiar with much of the jargon used in the world of data analytics and should give you enough guidance to get started. While this toolkit is meant to provide a generic introduction and practical tips and tricks, most sections will provide links to other resources that could help you further.

1 The Thematic Review Workshop took place in Croatia on 6-7 July 2016; it was developed under the PES Mutual Learning Programme.

1.3 Strategic managers

The biggest value of this toolkit for strategic managers is twofold. The first is that it helps higher level executives in familiarising themselves with the possibilities of data analytics. The second is that the toolkit can support strategic managers in their decision making processes around data analytics, for example in terms of hiring a data team, and the goals where data analytics could make a difference.

Specific strategic points that are relevant for strategic managers are added throughout this toolkit. Strategic managers will benefit the most from reading chapters 2 & 6.

1.4 Scope of this toolkit

This toolkit is about the use of data that PES collect in their systems or through other methods. Data currently stored in production systems or data warehouses are examples of this type of data. We also include data that are available in a PES that do not stem from primary processes, such as research data (e.g. data sets from survey research), data shared from other organisations (e.g. educational data) or even reports, books, etc. In sum: we focus on all data the PES already has, and less so on data the PES would need to collect to achieve certain goals. Conducting research (e.g. how to conduct surveys, interviews, focus groups, etc.) is not part of this toolkit.

In terms of analytics, the focus is on the more novel types of analytics and specifically those more commonly associated with big data. Where appropriate, we will discuss more traditional types of analytics (such as more common statistical methods) and tools (such as Excel, SPSS, SAS, etc.), but given the abundance of (online) resources, we will link to those resources instead of providing more comprehensive information in this toolkit.

The same applies to the presentation of information. While traditional research reports (e.g. print, PDFs, etc.) are still widely used and can be good ways to present data, and while (static) graphics created in, for example, spreadsheet software can provide compelling insights, we again choose to focus on the more novel solutions to present data. This includes tools to create interactive graphs and online dashboards.

1.5 Reading guide

The content of this toolkit is organised as follows:

1. Getting started with analytics. Creating a plan, as well as a data team. Setting up the data infrastructure and the creation of an inventory or data catalogue.

2. Organising data. Cleaning and describing data. Ensuring the quality of data and the integration of data sources. How to secure data and protect privacy and confidentiality.

3. The actual analysis of data. Different types of data analysis, including statistical methods, data mining and advanced topics such as artificial intelligence and machine learning.

4. Presenting and reporting of data. What are novel ways to present results and what are considerations when reporting outcomes? Also a discussion of open data.

5. Evaluation, continuation and implementation. How to go from a small pilot or experiment to a broader implementation? What are the key technical and organisational considerations?

Chapter 2. Getting started with data

This chapter deals with the question of how to get started. What are the main things to think about when an organisation wants to start doing more with the existing data? What kind of skills and knowledge would you need?

The following topics form the heart of this chapter:
- Provide an overview of things to think of when starting to work with data
- Help create a team of people who can work with data
- Setting up the organisational infrastructure

The following strategic questions are answered in this chapter:
- What is the best way to get started and what can I expect from working with data?
- Who should be in charge and where in the organisation should we position the data function?
- What can I do to ensure success?

The following tactical questions are answered in this chapter:
- What are the things we need to do to actually start working with data?

2.1 Creating a plan

The most important question to answer when starting to work with data is: why? Depending on the answer to that question, the approach taken when implementing analytics and working with data will be different.
- Is there a concrete (organisational) problem that needs to be solved?
- Does the organisation have quantifiable goals and data available to track progress towards meeting those goals?
- Does the organisation simply want to learn about working with data, so that experience is gained that could be useful for the future?
- Does the organisation want to innovate and see what improvements can be made from tinkering with the (data) resources available?

When the why question of the organisation looks like the former two questions, the organisation is better off starting with a deductive approach to the use of data. When it looks like the latter two questions, an inductive approach may be better. The utility derived from each approach will be different, as well as the composition of the data team involved and the place of this team in the organisation.

Smart data = (Big) Data + Utility + Semantics + Data Quality + Security + Data Protection
Utility refers to the usefulness of smart data. The more it aids in solving business problems and/or the increase of knowledge in the organisation, the higher the utility is. However, depending on the approach taken when starting with analytics (deductive or inductive), expectations regarding utility can be different and should be managed as such.

2.1.1 Deductive approaches

In deductive approaches, the organisation has a (relatively) clear understanding of what it wants to do, for example what the problem is that needs to be solved.

Figure 3: Deductive research approaches
Theory → Hypothesis → Observation → Confirmation

For instance, a PES may find that a new automated vacancy matching system yields lower rates of successful matches than the previous manual matching process. In this case, the PES could formulate the expectation or theory that the new matching system is working improperly. This theory could be translated into a series of concrete expectations or hypotheses that could be tested using data (e.g. "the matching algorithms do not include certain important variables" or "the weight of certain variables in calculating matches is too high/low"). If such hypotheses exist, a data team could start working on testing these hypotheses and work closely with the business process owners in the PES to collect and analyse data.

Concept: Data team
In the context of this toolkit, we define the data team as the group of people within the PES tasked with data analytics. This could overlap (partially) with existing research teams (whose focus typically is more on statistics (see section 4.2) and/or data mining (see section 4.3)). The focus of the data team is on the analysis of (big) data that can be extracted from systems or imported from other sources.

In such a situation, the analytics function would be more embedded within the relevant business processes to ensure smooth communication to solve the problem at hand. Furthermore, higher levels of validity of, and confidence in, the results are required if outcomes are being used in production processes.

Concept: Data analytics²
The whole process of formulating data goals, collecting, organising, analysing and presenting (big) data.

2.1.2 Inductive approaches

Figure 4: Inductive research approaches
Observation → Pattern → Hypothesis → Theory

The set-up of inductive approaches is different. There is no clear understanding of a problem and the focus is on:

1. Learning from the data and/or finding ways to generate value from this data.

2. Learning about analytics and gaining knowledge about the process and (smart) data applications.

Therefore, in inductive approaches there are fewer (if any) explicit expectations or hypotheses regarding outcomes. The focus is on letting the data speak and discovering interesting patterns in the data. For example, suppose a PES gets high volumes of unstructured emails from clients (for example with questions, complaints, comments); a data team could start analysing these emails to see if there are any interesting patterns in this data, for example regarding:
- The word choice of people (which could be used to change the tone of voice in communication or the use of certain synonyms in written communication).
- The questions people have (which could be used to change the way information is being displayed on websites).
- The language (skill) level of certain groups of job seekers (which could be used to create customer segments that could be targeted using different communication styles).

Only after certain patterns have been discovered in the data, after analysing high numbers of observations, can certain hypotheses and eventually theories be formulated regarding the discovered patterns (which in subsequent rounds of deductive analysis could be further tested).

In such a situation, the analytics function would take more the shape of a laboratory or experimental unit. This means it operates more on a stand-alone basis, has more freedom to experiment and is less tied to specific business processes. The following table compares the differences between the two approaches.

2 Some equate data analytics to data analysis; we decide to differentiate for the sake of clarity (and in line with many publications dealing with big data). We see analysis as the process of analysing data, whereas we see analytics as a much broader term involving the setting of goals, collection, organisation, analysis and presentation of data.

Deductive vs Inductive

DEDUCTIVE
- Primary focus: Solving business problems
- Management focus: Integration with the business, bridging parts of the organisation. Introducing the team to the organisation (so they understand the processes).
- Position in the organisation: Close to/part of business processes
- Team values: Validity, robustness, value driven
- Team composition: Focused, mostly on data engineers and scientists

INDUCTIVE
- Primary focus: Learning and innovation
- Management focus: Shielding the data team from the business, to make sure they can focus on their work. Making sure data is accessible and the team works with the data.
- Position in the organisation: Independent/removed from the business processes
- Team values: Creativity, making mistakes, trial and error
- Team composition: Broad, including social scientists & people with creative profiles
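To make the inductive idea of 'letting the data speak' concrete, the short sketch below counts the most frequent words in a handful of client e-mails; recurring words can point to patterns worth turning into hypotheses. This is a minimal illustration in Python: the e-mail texts and the stop-word list are invented for the example.

    from collections import Counter
    import re

    emails = [
        "I cannot find the form for my unemployment benefit",
        "Where do I upload my resume? The website keeps failing",
        "The form on the website does not work",
    ]
    stopwords = {"i", "the", "my", "do", "for", "on", "does", "not"}

    words = []
    for text in emails:
        # Lower-case and keep only alphabetic tokens outside the stop-word list
        words += [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

    # The most common words hint at recurring themes (here: website/form problems)
    print(Counter(words).most_common(5))

In a real setting the same logic would run over thousands of e-mails with more refined text-mining techniques (see chapter 4), but the principle is the same: observation first, hypothesis later.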

PES wanting to start with analytics are typically better off starting with inductive approaches. This typically allows for smaller scale experiments that allow both the data team and the PES to grow accustomed to working with data and slowly turn into a data driven organisation. Creating a small team that operates relatively independently from the organisation in order to prove value in the long term is a good starting point. Once the team has experience and has shown value, the team could be brought into the organisation more and start shifting to more deductive approaches.

2.2 Creating your data team

Crucial to the success of working with data is the composition of the data team. Relevant questions in this respect are: what is the approach we are taking (inductive, deductive, or a combination)? How many resources can we make available? What is the time pressure to deliver results? In this section we discuss team leadership (and variations therein) and several tiers of potential positions within the team.

2.2.1 Leadership

In smaller scale settings, the team will be smaller and team leadership will generally have a less senior position. In this case a senior data scientist or manager of data analytics would be a fitting role to lead the team. As for the position in the organisation, a role under the following functional leadership roles is possible:

CDO (Chief Data Officer) [or equivalent]
In practice, only large (data) mature organisations will have a leading data position. The CDO is responsible for governance and utilisation of data across the entire organisation. This means that the CDO oversees all data initiatives and coordinates all analytics activities within the organisation.

CIO (Chief Innovation Officer) [or equivalent]
The first interpretation of CIO is that of Innovation Officer. This role is concerned with innovation and change management within the organisation. If the data team is positioned under the Innovation Officer, the focus of the data team will most likely be on more inductive approaches, trying to create innovative data-driven solutions.

CIO (Chief Information Officer) [or equivalent]
The second interpretation is that of Information Officer. This role is reserved for the highest ranking officer responsible for information technology and computer systems inside the organisation. If positioned under a CIO, the data team will probably be focused more on supporting the technology role in the organisation and hence have a more deductive orientation.

CTO (Chief Technology Officer) [or equivalent]
The CTO is in charge of (the broader) technology used by the organisation, but could also focus on core technologies if technology is important in client facing processes (with the increasing levels of automation used in PES, that seems to apply here). If positioned under a CTO, the data team will probably focus on deductive, client oriented and technology related issues.

Director of R&D [or equivalent]
The last role is that of the director of R&D. While there is some overlap with the Chief Innovation Officer's role, that officer is typically occupied with more short term change management and the implementation of innovations; the director of R&D is typically charged with longer term research and development. If positioned here, the data team will be more experimental and focused on the development of longer term innovations.

Strategic insight
The purpose of this overview of roles is not to prescribe where the data team should be positioned. Rather, it is meant to raise awareness that the position in the organisation will impact the expectations one should have of the team and what the focus of the team will be. This could impact the hiring of team members as well.

2.2.2 Data team members

Depending on the (desired) size, workload and focus of the team, different types of team members are needed to start a successful data analytics practice. We divide these types in three tiers:
- Key roles: the 'must have' members as you start building your data team.
- Secondary roles: the 'good to have' roles; these will become more relevant as the team grows in size.
- Tertiary roles: the 'nice to have' roles that add value to the team, but are less critical than the others. They will most likely become relevant once the team reaches high levels of maturity and has a large size.

We can define three key roles in working with the actual data:

Data engineers
The data engineer typically sets up and works with the data infrastructure. That means that they set up databases and work on Extraction, Transformation and Loading (ETL, see below) tasks, they support data analysts and data scientists in their roles, and lastly they make sure the system works smoothly and performs well. Very often they have a background in software engineering.

Data analysts
Data analysts are the professionals who query and process data in their organisation. Furthermore, they typically create data reports, summaries and visualisations. This role is more closely associated with data mining and statistical analysis.

Data scientists
The data scientist is the most important role in the context of this toolkit. The key role of the data scientist is to generate valuable and actionable insights from the data and to help solve problems in the organisation using data. Data scientists apply (mostly) advanced analytics, such as machine learning, to data.

Secondary roles:

Social scientists
Team members with a social science background (e.g. sociology, psychology, communication, marketing, etc.) can perform two important roles on the team. The first is to help create theories and hypotheses that can be answered/tested using deductive approaches. The second is to help make sense of the outcomes of analysis when more inductive approaches are taken. In this way, social scientists hold a key role in translating organisational goals and/or problems into the actual data work, and subsequently translating the outcomes of the data work into implications and actions for the organisation.

Software engineers
Even though there is a wealth of tools available to organise, analyse and present data, very often an organisation will discover that the tools do not fit their needs entirely. Software engineers build custom solutions (in conjunction with the rest of the team) that help maximise results across the organisation. The difference with data engineers is that the software engineer in this context typically works on more front end (or customer) facing solutions (for example dashboards or mobile applications to access data and outcomes).

Some tertiary roles (that we won't discuss in detail):

Graphical and/or interface designers
To help design useful and usable applications and visualisations.

Philosophers
To help interpret data and help create theories and hypotheses.

Mathematicians & statisticians
To aid in the further development of complicated mathematical models and/or complicated statistical analyses.

2.3 Setting up the data infrastructure

Once a team is in place, it is necessary to create the data infrastructure on which the data team can work. The most important challenge here is to create (a) database(s) from which the data team can pull the data needed to perform their analyses. In computing, Extract, Transform, Load (ETL) refers to the process in database usage, and especially in data warehousing, of performing:
- Extraction: extracting data from different data sources.
- Transformation: transforming the data for storing it in the proper format or structure for the purposes of querying and analysis.
- Loading: loading the data into the final target (a database; more specifically an operational data store, data mart, data warehouse or, in this case, an analytics database from which analytics are performed).

Many configurations of ETL are possible and exist in practice. Most of these approaches have pros and cons:

Working from production systems/databases
In this setup, data engineers collect data directly from production systems (e.g. live profiling or matching systems). While this allows the team to work with the most current data, the downside is that it creates security risks and the risk of interfering with live data. Furthermore, if data from multiple processes are needed, the team has to extract and transform data multiple times.

Working from a central data warehouse
If the organisation has (one or more) central (virtual³) warehouse(s), it is not uncommon for data teams to work directly with the data from the warehouse. As the data in a warehouse typically is not used in production settings, production processes should not be impacted. However, this setup does create problems that are (especially) relevant in the government context. The first is that in many cases (for example for legal archiving reasons) a data warehouse serves as an important data store or archive; working with data from the warehouse directly could pose a risk to the integrity of this archive. The second is that the warehouse may contain sensitive data or PII [see below] that should not be used for analytics purposes.

Creating a dedicated analytics database
The third and last scenario is that of (a) dedicated database(s) solely for the purpose of analytics. In this case, relevant data are being pulled from the warehouse (and other sources) and, after being sanitised, loaded into an analytics database. This database would then be the primary source for the data team to work from.

3 It is fairly common to have a virtual database that uses wrappers to gather data from multiple different data sources.
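As a minimal illustration of the three ETL steps, the Python sketch below extracts records from a (hypothetical) CSV export, transforms the registration date into one ISO format, and loads the result into a dedicated SQLite analytics database. File, table and column names are invented; production ETL would typically use dedicated tooling (see the 'Getting started' box below).

    import csv
    import sqlite3
    from datetime import datetime

    def to_iso(value):
        # Transform: normalise DD/MM/YYYY date strings to ISO (YYYY-MM-DD)
        return datetime.strptime(value, "%d/%m/%Y").date().isoformat()

    conn = sqlite3.connect("analytics.db")
    conn.execute("CREATE TABLE IF NOT EXISTS jobseekers (id TEXT, registered TEXT)")

    with open("export_from_production.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):                            # Extract
            conn.execute("INSERT INTO jobseekers VALUES (?, ?)",
                         (row["id"], to_iso(row["registered"])))  # Transform + Load

    conn.commit()
    conn.close()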

Figure 5: Potential database architecture. [Diagram: process servers feed process databases, which feed a central warehouse; analytics databases draw on the warehouse and serve an analytics server and a distributed computing server; a presentation server, an open data database and a web server expose the results.]

Getting started

PACKAGE/TOOL — EXPLANATION — LINK
- Hadoop: Apache Hadoop is a software library that enables the distributed processing of large data sets across clusters of computers using simple programming models. http://hadoop.apache.org
- Spark: Apache Spark is an engine for (distributed) large scale data processing. http://spark.apache.org/
- Scriptella ETL: Scriptella is an open source ETL tool. http://scriptella.org/

This database could be supported by several types of servers to aid in the analytics:

Analytics servers
Servers whose sole purpose is to run (computational) analytics. These servers typically have lots of processing power to run heavy and complicated analytics.

Distributed computing servers
When datasets become too large, it may no longer be feasible to run them on a single server. In this situation it is common to set up a distributed computing environment. What this entails is (conceptually) straightforward: a dataset is broken down into smaller pieces, sent to different servers (computers), analysed on those servers, and the results of the analysis are bundled back into one single outcome. Many commercial distributed environments are available (such as Amazon Web Services (AWS), Google Compute Engine, and many others), but depending on the sensitivity of the data, a PES could consider setting up a distributed environment of its own.

Presentation and visualisation servers
These servers typically run applications to show results (such as dashboards) and/or the data catalogue.

Concept: PII
PII stands for Personal Identifiable Information. This is information that helps identify individual people. Classic examples of PII are names, addresses and unique person identifiers such as social security or citizen numbers. One problem with PII and analytics is that, as more and more different types of data are being combined, it becomes increasingly possible to indirectly pinpoint individuals. It is important, especially as data are being opened (see section 5.4), to ensure that individuals cannot be identified.

2.4 Creating a data catalogue

In order for any organisation to start answering questions, it is important to know what information the organisation already has. This prevents the organisation from collecting the same information again, and it yields an overview of data stored within the organisation that can be used for other purposes.

A data catalogue is an instrument that provides an overview of (all) data present in the organisation, (semantically) explains the data and variables, and adds meta-data to these data sets. This means that the catalogue contains descriptions of the variables and the nature of the data.
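As an illustration, one (entirely hypothetical) catalogue entry for a vacancy database could look as follows; the field names are invented and not a standard:

    # A sketch of a single data-catalogue entry, expressed as a Python dict
    vacancies_entry = {
        "name": "vacancies",
        "description": "All vacancies registered by employers since 2012",
        "source": "vacancy production database (nightly copy)",
        "records": 1_250_000,
        "update_frequency": "daily",
        "contains_pii": False,
        "variables": {
            "vacancy_id": "unique identifier (text)",
            "posted_on": "date the vacancy was posted (ISO date)",
            "sector": "economic sector (coded; see code list)",
        },
    }

Stored in a database with a searchable front-end, a collection of such entries forms the catalogue described in the concept boxes below.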

Concept: Data catalogue
A data catalogue is an overview of the existing data in databases and provides descriptions, using meta-data, of the nature and state of the data, such as the base tables, volume, definitions, properties, synonyms, annotations and tags. As such, the data catalogue is an important part of the semantics component of the Smart Data equation.

Concept: Meta-data
Meta-data is data that describes other data. For example, meta-data about a vacancy database could include descriptions of all variables stored in the database, the nature of these variables (e.g. are they text or numbers (and what kind of number format), etc.) and such things as the number of records. In essence, the data catalogue is an overview of all the meta-data.

Smart data = (Big) Data + Utility + Semantics + Data Quality + Security + Data Protection
Semantics refers to the meaning of data. Knowing exactly what data you have and what the data can tell you about reality is an important aspect of smart data. Semantics here does not just refer to descriptions of data and variables, but also to their meaning in real life. For example, does a measure or proxy of customer satisfaction really signify satisfaction of clients in real life?

A data catalogue can be compared to an index in a library. A library is a collection of books, written by different people, on different subjects, at different points in time, of different lengths, for different audiences. An index in the library contains the overview of exactly what you can find in the library. Furthermore, the index allows you to find the resources you need.

The process of creating a data catalogue is pretty straightforward. Members of the data team will have to document all available data in the organisation and use meta-data to describe this data. For example, they have to:
- Create an overview of all data sources, databases and datasets available and the nature of these data sets (e.g. what kind of databases do we have?).
- Document and describe all variables within these datasets.
- Document the number of records and changes in these records (e.g. how often are databases updated?).

While the primary function of a data catalogue is to document the data stored in the organisation's databases, more types of data could be included in a data catalogue, such as:
- Research data (such as survey data sets).
- Externally available (analytics) data (such as what is collected through Google Analytics or other trackers).
- Other relevant data (such as social media use, or relevant (technical) documents).

While a static data catalogue is possible (literally a catalogue describing data), most organisations choose to create a database-backed data catalogue with a searchable front-end (often using web technologies). While technically an open data catalogue, the catalogue at http://catalog.data.gov/dataset provides a good example of what a typical data catalogue looks like.

2.5 Costs and budgeting

We finish this chapter by focusing on the costs associated with starting an analytics practice. Budgeting for analytics activities proves to be a challenging task. A study by Gartner⁴ found that more than half of all analytics projects failed because they were not completed within budget or on schedule. A key reason for this is that it is very difficult to estimate the costs of early analytics projects, for example because:
- Early on, it is often unclear which information the organisation has and how useful this information is.
- Organising and cleaning data take up considerable amounts of time (and resources), and organisations tend to underestimate the amount of time it takes to get all data organised.

4 http://www.gartner.com/newsroom/id/2637615


- Especially with inductive projects, the expected outcomes are difficult to foresee; this creates many uncertainties regarding timeline and costs.

Nevertheless, it is possible to create a budget and allocate costs for the execution of analytics. The cost of any analytics endeavour typically breaks down into the following:
- The set-up cost for the relevant infrastructure (databases, analytics servers, data extraction, transformation & loading, other hardware and software). The good news regarding these costs is that, while still significant, the cost of acquiring, storing and managing data keeps on going down. Especially using scalable solutions (see 2.3) it is possible to set up a relatively affordable infrastructure.
- The recurring/ongoing costs for the infrastructure (power, licensing fees, maintenance, etc.). The total sum of these costs depends on many factors, such as the use of freely available or commercial software, the size of the server park, etc.
- Personnel costs. This entails the costs of the data team, such as hiring and salary costs. While analytics is driven by technology, human labour still makes up a large portion of the total cost of any analytics initiative (in most cases the bulk). The reason for this is that organising, sanitising, cleaning and subsequently analysing and interpreting data is a very labour intensive process (also see chapter 3). The more ambitious the project, the higher the labour costs will be.

However, the degree to which all costs occur depends on a number of factors; most notably the following choices will impact the cost:

In-house or outsourcing?
The list of costs above assumes the organisation will want to own its own infrastructure and have all personnel on staff. There are, however, a number of alternative scenarios:

Use of external or cloud infrastructure?
Many organisations host their analytics servers elsewhere, often in the cloud. The benefits of using external providers for storage and/or analytics solutions are that a) you only pay for the capacity needed, b) it is easy to scale the capacity needed up or down and c) you don't have to worry (as much) about administration, maintenance, etc. However, there are a number of drawbacks. The first is a lower level of control over the infrastructure (which could, for example, conflict with the needs of the organisation when certain requirements cannot be met [for example when a certain tool does not work on the platform provided]). The second is that certain legal requirements could prohibit the organisation from storing data outside of the organisation, the government and/or the country. It is wise to consider the specific legal requirements (see also section 3.5) before making this decision.

Internal data team or external service provider?
Organisations that want to engage in analytics activities face an important (cost related) decision when it comes to the personnel aspect: hire a data team or outsource the personnel aspect. Several (consulting) providers offer analytics services that could take care of the personnel requirement (and often the infrastructure as well). Furthermore, the organisation could consider using freelance personnel or virtual (job) marketplaces, but given the sensitivity of the data often involved, these options are often not realistic. The benefits of using an external provider are that a) it could be cost effective if analytics are only needed on a project basis (and not continuously), and b) you don't have to worry about training or capacity. The downsides are that a) the organisation builds little experience and knowledge about analytics internally, and b) costs can be (much) higher in the long run. So if the organisation plans to seriously build analytics capabilities, creating the capacity internally seems the right course of action.

Hybrid or not?
Lastly, the organisation could choose a mix of doing part of the infrastructure internally (such as data storage) and part externally (such as running certain analyses on an external server provided by a third party), and hire part of the data team while using an external provider for specific expertise areas. In practice, most advanced organisations use some kind of hybrid version. The reasons for this are that a) analytics needs can become very specific (or advanced), and using an external provider is more cost-effective than building the capability internally for just one specific type of analysis, and b) depending on the nature and scope of the analytics activities, the organisation may only need a certain amount of capacity to run the normal business, but during peak times or for specific activities extra capacity could be needed, in which case partnering with a third party is reasonable.

Furthermore, the following are relevant considerations when it comes to costs:
- As mentioned above, many projects overrun their budgets. This means that there is a tendency to underestimate costs. For this reason it is advisable to either lower project requirements or not be overly optimistic when starting with analytics.
- Costs of analytics do not scale linearly. Not only is there typically a large start-up cost involved, but as analytics activities become more advanced, so does the complexity of the work. Especially when a multitude of data sources are being used and models become more complicated, costs can rise at a higher than linear rate.
- The best way to keep costs manageable is to start small. Smaller datasets can be analysed on relatively cheap hardware, and smaller teams are needed for smaller projects. In that sense, if budgets are non-existent or small, it is advisable to start a smaller analytics practice and grow this over time as the team proves its value.

PRACTICAL EXAMPLE

In May 2016, the Executive Office of the United States President released a report on the opportunities of Big Data. The report contains a case study on the potential of Big Data for employment. As a key problem in employment related to data, the report recognises that traditional hiring practices may unnecessarily filter out applicants whose skills match the job opening. To solve this problem, big data is seen as an opportunity: "Big data can be used to uncover or possibly reduce employment discrimination." For example, big data analytics can be used:
- To prevent 'affinity bias' or 'like me' bias in the hiring process (for example where hiring managers tend to select candidates like them or whom they like).
- To find potential job candidates who otherwise might have been overlooked based on the more traditional educational or workplace-experience related job requirements. For example, by looking at the skills and knowledge areas that have made other employees successful, a matching system could use pattern matching to recognise the characteristics that made current employees successful and thus need to be looked for in future employees.
- To help prevent biases often seen in traditional hiring practices that could lead to discrimination. An algorithm could be designed to not look at factors like age, gender or race, whereas it is much more difficult for a human to block out such (implicit) factors.
- Beyond supporting or recommending matching/hiring decisions, advanced algorithms create the possibility of solving long-term employment challenges related to discrimination, such as the wage gap or occupational segregation, for example by going beyond formal job qualifications and finding the person for the job based on cultural or other factors.
- Using data analytics, new kinds of candidate scores or matching scores can be created by using diverse and new sources of information on job candidates. The report mentions how one employment research firm found the distance employees commute to work to be one of the strongest predictors of how long customer service employees will stay with their jobs. Such variables and data could be used to improve matching algorithms for specific (customer service) job vacancies.
- Finally, machine learning based algorithms could help decide what kinds of employees are likely to be successful by reviewing the past performance of existing employees of certain companies, or of job seekers who worked for certain firms, or by analysing the preferences of hiring managers as shown by their past decisions. This could also apply to such things as employee turnover and the likelihood that certain people will retain jobs in certain industries.

Chapter 3. Organising data

In this chapter we focus on the organisation of the data. We discuss how data can be cleaned and sanitised, and how existing data from within and outside of the organisation can be extracted, transformed and loaded into a database that can be used for analytics.

The following topics form the main content of this chapter:
- Provide an overview of what organising data entails.
- Focus on areas such as security, data protection and privacy.

Strategic questions this chapter answers:
- How can we guarantee security?

Tactical questions this chapter answers:
- How to organise and protect data?
- What are reasonable expectations regarding the organising of data?

3.1 Cleaning & Sanitising

Organising data is a complicated process that requires a lot of time and devotion. Data scientists can spend up to 90% of their time organising data. This has several reasons:

1. The data are in an undesired format. For example, databases can store times and dates using many different formats (e.g. think about the European DD/MM/YY format versus the US MM/DD/YY format). Importing different data sources into one database can create inconsistencies in certain variables that need to be fixed. Furthermore, data might have to be reorganised so that analytics can be run.

2. The data need to be cleaned. Many databases contain noise that could be relevant for one process, but irrelevant for analytical processes. Cleaning the analytics database helps reduce noise (and could aid database performance). In a data catalogue, certain variables that are less relevant for analytics could be marked, and deleted when creating a dedicated analytics database.

3. The data need to be sanitised. This refers to the process of ridding the database of unwanted information. For example, personal identifiable information (PII) (see section 2.3) needs to be removed from the datasets. When variables are documented properly, sanitisation can be done using scripts. When that is not the case, the data team will have to check the data manually and sanitise where needed.
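A minimal sketch of scripted sanitisation, assuming well-documented records: direct identifiers are dropped and the citizen number is replaced by a salted hash, so that records can still be linked across datasets without exposing PII. Field names and the salt handling are illustrative only; a real implementation would keep the salt in a secret store and follow the applicable data protection rules (see section 3.5).

    import hashlib

    SALT = "load-this-from-a-secret-store-not-from-source-code"

    def pseudonymise(record):
        # Replace the citizen number by a stable pseudonymous key
        token = hashlib.sha256((SALT + record["citizen_number"]).encode()).hexdigest()
        # Keep only the non-identifying attributes
        return {"person_key": token,
                "age": record["age"],
                "education": record["education"]}

    raw = [{"citizen_number": "1234567", "name": "J. Doe",
            "age": 42, "education": "lower"}]
    clean = [pseudonymise(r) for r in raw]
    print(clean)  # the name and citizen number no longer appear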

Getting started

The following might be useful to get started on organising and cleaning data:

PACKAGE/TOOL — EXPLANATION — LINK
- Encog Machine Learning Framework: Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalise and process data. http://www.heatonresearch.com/encog/

4. The data need to be transformed. Even though transformation is part of the ETL process (see section 2.3), additional transformation can be needed while preparing for data analytics. For example, decimal places could have to be fixed, or floating point numbers could have to be converted to integers. Another example is transforming unstructured data into structured data.

Concept: Structured vs. unstructured data
Structured data is data with a high level of organisation (and formatting). For example, a table with records, variables and labelled data (such as a table with jobseekers' demographic information) is structured information. In its most simple form, unstructured data is data lacking such organisation. A database with PDFs or photocopies of jobseekers' resumes is an example of unstructured data.

5. Missing data might have to be imputed. Imputation is the process of substituting missing data with calculated values. In databases where data is missing, scientists may choose to run algorithms to impute these missing data points. One way to do this is by looking at patterns in the data. For example, if people with similar characteristics consistently have the same value on a certain variable, the likelihood increases that the missing values are similar (a small sketch follows at the end of section 3.2).

It is extremely important that data are cleaned and sanitised properly. For this reason, PES starting to work with data could create peer review processes to make sure data are being reviewed after being organised by other team members.

Smart data = (Big) Data + Utility + Semantics + Data Quality + Security + Data Protection
Data quality refers to various aspects of the data itself: a) completeness (do I have enough data about everybody to make claims?), b) cleanness (are the data well cleaned, sanitised and maintained?), c) validity (do the data have high levels of validity?) and d) impact (have they been analysed in such a way that they retain their validity and create relevant meaning for the organisation?).

3.2 Describing data & data characteristics

Descriptive analytics help us understand the data. When describing data, we look at how the data is distributed (e.g. normal, power-law, linear), what the key characteristics of the data are (e.g. the mode, median and mean), and we check for such things as outliers in the data. Descriptives are important because they:
- Help us understand the nature of the data (e.g. what is the nature of the variables?).
- Help us draw initial conclusions about the data (e.g. based on the distribution of data we can observe that certain jobseekers have certain characteristics).
- Help us prepare for further analysis (e.g. by removing outliers that distort the data).

Furthermore, descriptive statistics are another way to check the quality of the data and the organisation of the data. For example, they could show inconsistencies in the data collected and areas where data have been improperly transformed.
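The pandas sketch below ties the two previous sections together on an invented jobseeker table: it first prints the descriptives discussed above, and then imputes a missing education level with the most common value among records in the same age group (one simple variant of the pattern-based imputation from section 3.1, point 5).

    import pandas as pd

    df = pd.DataFrame({
        "age_group": ["<25", "<25", "<25", "25-45", "25-45"],
        "education": ["lower", "lower", None, "higher", "higher"],
        "days_unemployed": [30, 45, 41, 200, 310],
    })

    # Descriptives: count, mean, quartiles; a quick feel for the distribution
    print(df["days_unemployed"].describe())

    # Impute missing education with the modal value within the same age group
    df["education"] = df.groupby("age_group")["education"].transform(
        lambda s: s.fillna(s.mode().iloc[0]))
    print(df)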

Getting started

The following methods, tools and/or applications can be used to calculate descriptive analytics:
- Most spreadsheet software (e.g. Microsoft Excel, LibreOffice Calc) includes basic functions for most descriptive analytics.
- Most statistical software (e.g. SPSS, SAS) has dedicated functions for descriptive analysis.
- Modelling languages (e.g. R & Python) have packages for descriptive analysis.

3.3 Quality control

High quality data is an important element of smart data. Ensuring quality and having the proper checks and balances in place can help a PES be confident in the quality of the available data. The most important way to control quality is to have control mechanisms during every single step of the data process. This means that during collection, organisation, analysis, presentation and evaluation, activities have to be deployed to check the quality of the data.

There are ways to control the quality of the data during the various stages of the process. Examples include:

When preparing for data collection:
- Checking the quality of theories or hypotheses using expert evaluations or literature reviews.

When collecting data:
- Taking multiple measurements, observations or samples.
- Using standardised methods, instruments and protocols when collecting data (which could be included in the data catalogue).

When organising data:
- Setting up validation rules when entering data.
- Using strict protocols when cleaning data.
- Documenting database structures properly (and reviewing this).
- Creating automated scripts for organisation tasks (and checking their performance using reviews).
- Having peer reviews of organisation methods and practices.

When analysing data and presenting results:
- Model testing using multiple (independent) samples.
- Code and model reviews, either by team members or external experts.
- Combining model fit measures with theoretical evaluations.
- Automated testing of code before models are executed or any code is published or pushed to a production environment.

3.4 Integrating data sources

The focus of this section is not on the integration of data in business processes. Rather, the focus is on the combination of different types of data from within and between organisations.

Data integration is the process of combining data housed in different sources and giving users a unified view of the data. While in many organisations a data warehouse will already have integrated the most important data sources, there are many scenarios in which data will have to be integrated or combined to create the unified view necessary for analytics, such as:

Missing data
Imagine job seekers registering online, but not completing all information during the registration process (e.g. their educational background). These data are missing in the database, and this could be a problem if we want to do education related analysis. If, during the registration process, the job seeker has also uploaded a resume, we may be able to extract the data from the resume and integrate the two data sources in order to create a complete record (see the sketch after this list).

Triangulation
Triangulation refers to the combination of different data sources to check the quality of one (or both). For example, data scientists could use predictive modelling to predict the satisfaction of jobseekers with a matching tool and use survey data to check the results of the model.

Increased quantity
The most important reason, however, to integrate data sources is to have more data available. For example, data on coaching or training outcomes could be used to enhance the profile of a job seeker.
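A minimal pandas sketch of the 'missing data' scenario above, using two invented tables: an education level missing from the registration record is filled in with the value extracted from the uploaded resume.

    import pandas as pd

    registrations = pd.DataFrame({
        "person_key": ["a1", "b2"],
        "education": [None, "higher"],
    })
    resume_data = pd.DataFrame({
        "person_key": ["a1"],
        "education": ["lower"],
    })

    # Combine the two sources into one unified view per person
    merged = registrations.merge(resume_data, on="person_key",
                                 how="left", suffixes=("", "_resume"))
    merged["education"] = merged["education"].fillna(merged["education_resume"])
    print(merged[["person_key", "education"]])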

Convincing different stakeholders ofopening This applies even more strongly todata integration
uptheir data silos. across organisations. Ifthe success ofthe data team
Getting cooperation from ITdepartments depends ondata from other organisations, itwill
toactually get the data. need astronger position inthe organisation, prefer-
Coordinate privacy and security risks with ably with support from the highest levels ofleader-
relevant stakeholders. ship inthe organisation. Logical data sharing partners
Working with relevant stakeholders tounder- for PES include:
stand the nature ofthe data (i.e. Add tothe
data-catalogue). Other governments:
Making agreements around updates ofthe data Ministries ofeducation (or similar)
orSLAs (i.e. how often and towhat standards For example regarding data about the future
data are being shared). workforce, which could behelpful inpredictive
models for future matching applications
orunemployment forecasting.
Concept: Service Level Agreement (SLA) Tax agencies (or similar)
An agreement specifying the quality ofservices Regarding financial data about job seekers
delivered from one (part ofan) organisation (e.g. For benefit fraud detection).
toanother. For example about the uptime Social security institutions (or similar)
ofservers orthe refresh rate ofdata. Regarding social security orbenefit information.
This isespecially relevant when there isalegal
obligation for data collection and/or sharing.
The stronger the mandate ofthe leader ofthe data Statistics bureaus
team and the higher the position inthe organisation, For various types ofinformation such aspopu-
the more (formal) organisational power the team lation mobility (which could beused tofine-tune
has ingetting the data itneeds. This isan important job recommendations) orhousehold develop-
consideration when starting with data analytics. ments (which could impact the labour force).
Amore experimental team focused oninductive Regional orlocal governments
approaches, may have more difficulty inintegrating For data regarding specific local orregional
data sources from across the organisation versus circumstances (e.g. Local employment
ateam with aposition more closely tied tobusiness initiatives).
processes.
Businesses:
For example regarding job developments
(are business going toadd orremove positions?),
their future needs.
PRACTICAL EXAMPLE
Other organisations:
Such asfoundations working inthe labour
market (for example, overviews ofactivities
The X-Road isEstonias infrastructure that connects databases could help ininterpreting specific labour
from amultitude ofgovernmental agencies. Itis best described market fluctuations).
asadistributed service bus which allows databases tointeract, making
integrated e-services possible. This, however, also creates opportunities 3.5 Security and Data Protection
for data integration that can beused for data analytics.
Currently, 219 databases are connected toX-Road and these result Security and Data Protection are two other key ele-
inover 1700 services being offered. Byintegrating data sources, ments from the Smart Data equation. Protecting (user)
citizens only have tosupply many pieces ofinformation once and data and having good security should beamong the
italready allows fraud prevention through analytics. highest priorities ofboth the data team, aswell asthe
Please, find further information inthe following link. leadership ofthe team and the parts ofthe organisa-
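To make the "increased quantity" idea above concrete, the minimal sketch below enriches jobseeker profiles with training outcomes held in another system. It assumes Python with the pandas package; the column names (case_key, months_unemployed, course) and the data are invented for illustration, not a prescribed implementation.

# Sketch of the "increased quantity" idea: enriching jobseeker
# profiles with training outcomes from another system. Assumes
# both extracts share a common case key.
import pandas as pd

profiles = pd.DataFrame({
    "case_key": ["k1", "k2", "k3"],
    "months_unemployed": [4, 11, 2],
})
training = pd.DataFrame({
    "case_key": ["k1", "k3"],
    "course": ["welding", "forklift"],
    "completed": [True, True],
})

# A left join keeps every jobseeker, adding training data where it exists
enriched = profiles.merge(training, on="case_key", how="left")
print(enriched)

A left join is used deliberately: it mirrors the common situation where only part of the caseload appears in a partner organisation's data.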
3.5 Security and Data Protection

Security and Data Protection are two other key elements from the Smart Data equation. Protecting (user) data and having good security should be among the highest priorities of the data team, as well as of the leadership of the team and the parts of the organisation involved in the data used by the team. Several types of security are important:

Smart data = (Big) Data + Utility + Semantics + Data Quality + Security + Data Protection
Security refers to the ways the data are being securely stored and managed. This applies not only to physical security (who can access servers and related systems?) but also virtual security (who has access to data secured on systems?). Security is important to make sure systems are not hacked and data does not leak or get stolen.

Physical security
Where are the data stored? How easily can these machines be accessed? Storing data in safe locations where only authorised personnel have access is one of the key steps to ensure. The following can help improve physical security:
- Have strict access controls to rooms containing data servers
- Have protocols for the use of (sensitive) data on desktops, laptops and other devices
- Have guidelines for the use of mobile devices (e.g. laptops, tablets, phones) outside of secure areas
- Have protocols for the use of removable media (e.g. USB drives, removable hard drives, etc.).

Virtual access security
Once being close to a machine (or accessing it remotely), how easy is it to gain access? Are (safe) passwords in place? Is data encrypted? The following can help in improving virtual access security:
- Develop guidelines for the encryption of data, especially on those devices with easier physical access (see the sketch below)
- Develop policies for the use of firewalls and anti-virus software
- Have strict protocols regarding passwords, password sharing and password changes
- Limit the use of APIs (and other ways to access data) to non-sensitive data and/or open data.

In addition, the following can help with security in general:
- Make sure software is always up to date
- Have a regular security meeting to discuss and refresh data-team members' memories on security-related matters
- Make security training a standard part of onboarding (new) data-team members, so that a culture of security awareness is instilled from the start.
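As a small illustration of the encryption guideline above, the sketch below encrypts a data extract with the Python "cryptography" package (an assumption; any vetted encryption tooling will do). Key management is the hard part: the key belongs in a secure key store, never next to the data itself.

# Tiny encryption sketch using the "cryptography" package
# (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # store this in a secure key store
cipher = Fernet(key)

sensitive = b"case_key;months_unemployed\nk1;4\nk2;11"
token = cipher.encrypt(sensitive)  # token is safe to write to disk
print(cipher.decrypt(token) == sensitive)  # True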
The second topic we are discussing in this toolkit is that of data protection.

Smart data = (Big) Data + Utility + Semantics + Data Quality + Security + Data Protection
Data protection refers to the ways privacy and confidentiality are safeguarded. Are PII replaced by other identifiers? Can data be traced (in combination or not) to individuals? Are users aware of the use of their data? Protecting data properly will protect (vulnerable) individuals and minimise organisational risks.

For the purposes of this toolkit, we break down data protection into two topics: privacy and confidentiality. While privacy applies to the person that needs to be protected, confidentiality applies to the person's data. When data can be used to identify a person, privacy issues may arise. When data about a person can be used maliciously (for example by judging a person using data that was supposed to be collected anonymously), confidentiality issues can arise.

Guarding the privacy of individuals and the confidentiality of their data is important to ensure no harm is done to any individual or organisation. The main consideration regarding privacy consists of the applicable laws and regulations. On an EU level, the following are important:
- Directive 95/46/EC | Protection of personal data. "Directive 95/46/EC sets up a regulatory framework which seeks to strike a balance between a high level of protection for the privacy of individuals and the free movement of personal data within the European Union (EU). To do so, the Directive sets strict limits on the collection and use of personal data and demands that each Member State set up an independent national body responsible for the supervision of any activity linked to the processing of personal data." [quoted from the URL below]
URL: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=URISERV%3Al14012
- Regulation (EU) 2016/679 | General Data Protection Regulation. This Regulation is set to replace Directive 95/46/EC. It was adopted on 27 April 2016 and enters into application on 25 May 2018. More info: http://ec.europa.eu/justice/data-protection/index_en.htm

Besides the applicable European regulations, every single Member State will have its own applicable laws and regulations. These should be consulted before starting analytics projects. When data across multiple countries are being collected, laws from multiple countries may apply. Special care should also be given to the (cloud) storage of data in countries other than the home country and/or outside of the EU.

Next to abiding by the law, the following good practices can help to establish good data practices and ensure privacy protection:
- Implement solid de-identification protocols and capabilities. This consists of the removal (or replacement) of PII (see section 2.3), as well as having checks and balances in place that ensure that no PII enters the process at any time and that no individuals can be identified using analytics (e.g. by combining a multitude of variables, it could be possible to narrow data sets down to individuals). The U.S. Department of Health and Human Services (HHS) describes two (commonly accepted) methods to de-identify information5:
  - Expert Determination method: qualified experts apply statistical or scientific principles to render information not individually identifiable.
  - Safe Harbor method: the complete removal of 18 types of data related to individuals from the dataset (see Appendix 1 for an overview, and the sketch at the end of this section).
- Always use privacy impact assessments (PIA). PIAs are tools used to identify and mitigate privacy risks. Chapter IV, Section 3 of the new EU General Data Protection Regulation [Data protection impact assessment and prior consultation, Articles 35-36] already stipulates that controllers and processors carry out a data protection impact assessment prior to risky processing operations. However, a good practice could be to assess the privacy impact for any project related to individual people and/or cases. At the very least, such a PIA should:
  - Assess whether the information used complies with all (privacy-related) legal and regulatory requirements.
  - Make an inventory of potential risks of working with PII.
  - Assess processes for handling information to reduce or mitigate potential privacy risks.
  - Investigate the consent methods (see below) used to ask individuals permission for the use of their data.
  - Record the outcomes of the assessment and make them available.
  - Implement solutions for any risks or problems discovered.
- Implement privacy by design principles (also required according to Article 25 of the new EU General Data Protection Regulation). This means that any project, process or service needs to be designed from the start to adhere to the strictest possible privacy considerations. For example, when creating an analytics database, PII should never be included in such a database in the first place. This prevents the data team from working with PII at all.

To safeguard confidentiality, two actions are important:
- Use consent procedures when collecting information. For example, ask people for consent to use their data when they register as unemployed, or when they fill out surveys.
- Actively inform individuals of the purposes for which their data is being used (very often this happens in conjunction with the consent procedure).

5 See http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
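The following minimal sketch illustrates the Safe Harbor-style removal of direct identifiers described above, combined with a salted one-way hash so records remain linkable across tables. The column names and salt handling are hypothetical; a real implementation should follow the full HHS guidance and the outcome of a PIA.

# Minimal de-identification sketch (hypothetical column names).
# Direct identifiers are dropped entirely (Safe Harbor style);
# the case identifier is replaced by a salted hash so records can
# still be linked across tables without exposing the original ID.
import hashlib
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "email", "phone", "street_address"]
SALT = "replace-with-a-secret-salt-kept-outside-the-dataset"

def pseudonymise(case_id: str) -> str:
    """One-way hash of the case ID plus a secret salt."""
    return hashlib.sha256((SALT + case_id).encode("utf-8")).hexdigest()

def de_identify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    out["case_key"] = out.pop("case_id").astype(str).map(pseudonymise)
    return out

jobseekers = pd.DataFrame({
    "case_id": ["A123", "B456"],
    "name": ["Jane Doe", "John Smith"],
    "email": ["jane@example.org", "john@example.org"],
    "age": [34, 51],
})
print(de_identify(jobseekers))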
Chapter 4. Analysing data

In the fourth chapter we discuss the analysis of the data. We discuss more traditional types of analytics (such as statistical methods and data mining), but also novel and innovative types of analytics such as machine learning and artificial intelligence. The focus of these innovative types is not necessarily on how PES are using them today, but on potential use cases for the future.

The following topics form the main content of this chapter:
- Provide an overview of analytical techniques and tools
- Provide guidance on what goals can be achieved using what methods.

These strategic questions are answered in this chapter:
- How can analytics help achieve specific strategic organisational goals?

The following tactical questions are central in this chapter:
- What types of analytics are available and how can they help PES?
- Which analytics to choose and what are their pros/cons?

4.1 Overview

Before we start our overview of techniques to work with data, we first give an overview of some common analytical approaches and their differences/overlap6. As figure 6 makes clear, there is an abundance of methods, tools and approaches available to transform data into value. The specific use case of each approach depends on the goal (see Chapter 2) the PES wants to achieve.

6 The goal is not to be complete in this overview, but to provide an overview of relevant approaches and to show their relations.
Figure 6: Key data analytics concepts

[Diagram of overlapping concepts:]
- Data Mining: (semi-)automatic analysis of (big) data to extract new information and transform data into an understandable structure for further use
- KDD (Knowledge Discovery in Databases): the process of discovering useful knowledge from a collection of data
- Statistics: collection, organisation, analysis, interpretation and presentation of data; covers both quantitative and qualitative data collection and analysis
- Machine Learning: constructing algorithms that learn from and make predictions based on data
- Deep Learning: subset of machine learning that attempts to model high-level abstractions in data
- Artificial Intelligence: creation of intelligence exhibited by machines

4.2 Statistics

Statistics refers to (traditional types of) research to collect data from people or other units of research. Typically, statistics is seen as merely traditional social science research or just the statistical analysis of data. While neither view is entirely correct, we will try to explain the concept in more detail in this section. For the purpose of this toolkit, our focus is on the data collection and analysis part, and we limit ourselves to statistics in the more traditional social science sense. Statistical analysis is being used in pretty much every other method of analysis, hence our focus.

We focus on two key approaches within statistics:

1. Quantitative methods
These methods are typically used to collect and analyse data that has to be asked from entities and cannot be gathered using other methods. Put simply, quantitative methods are used to test hypotheses. Most commonly, quantitative methods are associated with surveys and questionnaires; however, there are many more quantitative methods of collecting data, such as large-scale observations or quantifications of qualitative data (for example, if many interviews are being held, certain responses could be quantified and used for quantitative analysis).

Quantitative methods are, even in the age of big data, still very useful. While most data pulled from databases are objective, factual data, quantitative methods can be used to measure more subjective elements. The most typical example is customer satisfaction surveys. In the context of innovation, surveys are a good way to test assumptions derived from theories or get structured (quantified) feedback on prototypes or ideas. Other methods, apart from surveys, are:
- A/B tests7: Where different groups of respondents are exposed to different versions of a product (or treatment). Measuring behaviour after exposure can help infer effects of the treatment. This could be used by PES to experiment with different versions of websites or other online tools (see the sketch below).

7 Depending on the design of the study, most of the methods explained can take a more qualitative or quantitative nature. We mention them where, in our view, they are used more commonly.
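As a worked illustration of the A/B test described above, the sketch below applies a two-proportion z-test to invented completion rates for two versions of an online tool. It uses only the Python standard library; dedicated statistical software offers the same test with more diagnostics.

# Minimal A/B test evaluation: did version B of an online tool lead
# to a higher completion rate? Two-proportion z-test; all numbers
# below are made up.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 412 of 5 000 visitors completed registration with version A,
# 487 of 5 000 with version B
z, p = two_proportion_z(412, 5000, 487, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")  # p < 0.05 suggests a real difference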
The following table gives more detail on each approach and potential use cases for PES.

APPROACH: Statistics
DESCRIPTION: Collection, organisation, analysis, interpretation and presentation of data.
POTENTIAL USE CASE: Traditional types of research to collect data from people or other units of research.
EXAMPLE(S): Surveys, observations, interviews and the subsequent analysis of this data.

APPROACH: Quantitative statistics8
DESCRIPTION: The use of quantitative methods to collect and analyse data. Surveys are a common method to collect quantitative data.
POTENTIAL USE CASE: Quantitative methods of data collection are typically used to collect data that has to be asked from entities and cannot be gathered using other methods.
EXAMPLE(S): Surveys to measure customer satisfaction.

APPROACH: Qualitative statistics
DESCRIPTION: Use of qualitative methods to gather data that focus on quality rather than quantity. Common methods are interviews, focus groups, etc.
POTENTIAL USE CASE: Qualitative methods of data collection are typically used to gain a deeper descriptive understanding of a topic.
EXAMPLE(S): Interviews to understand why people contact a PES.

APPROACH: Data mining
DESCRIPTION: Data mining is the application of specific algorithms in order to extract patterns from data.
POTENTIAL USE CASE: Data mining is used to condense large amounts of data and/or transform them.
EXAMPLE(S): Mining usage statistics of service channels and showing trends/developments.

APPROACH: KDD (Knowledge Discovery in Databases)
DESCRIPTION: KDD is the process of discovering knowledge and patterns in large amounts of data.
POTENTIAL USE CASE: KDD is used to build upon data mining and transform data into knowledge.
EXAMPLE(S): Detection of benefit fraud.

APPROACH: Artificial intelligence
DESCRIPTION: The goal of AI is to create intelligence exhibited by machines.
POTENTIAL USE CASE: Create smarter technologies that can make decisions or support decision making.
EXAMPLE(S): Smart job search systems that work based on customer profiles.

APPROACH: Machine learning
DESCRIPTION: Algorithms that learn from the data processed to make predictions and improve outcomes.
POTENTIAL USE CASE: Machine learning is used to create better functioning algorithms and models by learning from ongoing analysis.
EXAMPLE(S): Improvement of matching systems by analysing reasons for previous matches and/or mismatches.

APPROACH: Deep learning
DESCRIPTION: Algorithms that learn and model based on high-level abstractions, layers or manifestations of unstructured data (such as written text, pictures, videos and many combinations of data sources).
POTENTIAL USE CASE: Deep learning is used to explore data that is highly unstructured and abstracted and tries to create abstractions from this data.
EXAMPLE(S): Drawing inferences from writing styles and formatting of resumes to improve vacancy matching or training.

- Eye tracking: Where respondents use a device that tracks their eye movements. This helps understand how respondents navigate products, which parts draw attention, etc. This could be used in development stages of new (online) tools to help understand how people navigate pages.

8 One could argue that most approaches are quantitative and therefore fall in this bucket. However, we use quantitative statistics in a social science context, i.e. referring to the analysis of quantitative data collected through social science research methods, such as surveys.
PRACTICAL EXAMPLE
The US State of New Mexico Department of Workforce Solutions (DWS) noticed that many benefits applicants made mistakes (purposefully or not) while applying for (unemployment) benefits, resulting in improper benefits payments.
DWS partnered with Deloitte to conduct a two-stage project. The first stage was to use quantitative statistical analysis to model suspect behaviours. The next step was to gently nudge individuals into more desirable behaviours. The key to this was to design and test communications and notifications for claimants at three moments: 1) during the vetting process for eligibility, 2) when individuals report work and earnings, and 3) while determining an action plan to seek new employment.
By field testing different types of communications, DWS was able to analyse the best working solution and subsequently implement it. DWS was able to substantially influence claimants' behaviour. The State successfully increased accurate reporting while reducing improper payments.
(see https://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/business-analytics-case-studies.html)

2. Qualitative methods
These methods are typically used to gain a deeper descriptive understanding of a topic. Put simply, qualitative methods are commonly used to create hypotheses. Most often, people associate qualitative methods with (group) interviews but, as with quantitative methods, there are many more approaches. Relevant examples in this context are:
- Think aloud methods: Where people are asked to perform a task and explain their thought process while performing these tasks. This could be used to gain deeper insights into the choices people make when using (online) tools or applications (such as mobile apps).
- Plus-minus methods: Where respondents are asked to mark parts of a product (e.g. an online tool, or physical product) they like or dislike. Subsequently respondents are asked to elaborate on their choices. Often these are used to test brochures and other physical products, but they could be used in online product environments as well.

Like quantitative methods, qualitative research remains valuable in the age of big and/or smart data. The key function of qualitative methods is to help make sense of the world and/or get a deeper understanding of phenomena that simply cannot be generated through other means of analysis. In innovation settings, qualitative methods are most commonly used in conjunction with other methods, throughout the innovation process.

Getting started

The following methods, tools and/or applications can be used to collect and analyse (statistical) data:
- There are many online (free) survey tools available, such as SurveyMonkey [surveymonkey.com]. Non-free products (such as Qualtrics [qualtrics.com]) often have more functionality and/or better support
- Most spreadsheet software (e.g. Microsoft Excel, LibreOffice Calc) includes basic statistical functions
- Dedicated statistical software (e.g. SPSS, SAS, PSPP) can be used to perform (complex) statistical analysis on (mostly) quantitative data
- Various tools exist for the transcription and analysis of qualitative data. Examples are:

PACKAGE/TOOL: QDA Miner Lite
EXPLANATION: Free computer-assisted qualitative analysis software. Can be used for the analysis of textual data such as interview and news transcripts, open-ended responses, etc.
LINK: https://provalisresearch.com/products/qualitative-data-analysis-software/freeware/

PACKAGE/TOOL: Weft QDA
EXPLANATION: Free tool to analyse qualitative data such as interview transcripts, fieldnotes and other documents.
LINK: http://www.pressure.to/qda/

PACKAGE/TOOL: Transana
EXPLANATION: Software to a) combine various types of data in a single analysis, b) categorise and code segments of the data, and c) explore coded data through text reports, graphical reports, and searches.
LINK: https://www.transana.com/
Open source and free dedicated tools and packages can be found in the table below:

PACKAGE/TOOL: R
EXPLANATION: R is a free software environment for analytics and visualisations. It is one of the most popular programming languages for analytics. It is highly modular and has a large support/documentation community online.
LINK: www.r-project.org

PACKAGE/TOOL: Python
EXPLANATION: Python is a programming language that is used heavily for analytics (sometimes in conjunction with R). Like R, it is very modular and various packages exist for specific types of analytics (or visualisations).
LINK: https://www.python.org

PACKAGE/TOOL: DataMelt
EXPLANATION: DataMelt is free mathematics software. It can be used for numeric computation, statistics, symbolic calculations, data analysis and data visualisation.
LINK: http://jwork.org/dmelt/

4.3 Data mining & KDD

Data Mining and Knowledge Discovery in Databases (KDD) are closely related (yet different) types of analytics. Their relationship can be seen in figure 7. We could argue that KDD is a way of turning the information (see figure 7) gathered through data mining into valuable knowledge.

Data mining refers to the application of specific algorithms in order to extract patterns from data. Data mining was invented when datasets became too large to be analysed by humans, so researchers invented ways to condense large datasets and extract useful types of information. The key difference (in this context) with the statistical methods discussed in the previous section is therefore that data mining is aimed at the automation of the analysis and presentation of results.

Figure 7: The KDD process (ETL → Data → [Data Mining] → Patterns → [KDD] → Knowledge)


Two common applications of data mining are:

Automated prediction of trends and behaviours.
For example, based on previous purchases, marketers can estimate the likelihood of customers buying other products or when they are likely to buy the same product again.
Within PES, this could for example be used to:
- Estimate the likelihood that (certain) jobseekers find (certain) jobs in a certain period of time.
- Estimate the probability of benefit fraud occurring among certain groups of people with benefits.
- Forecast unemployment based on historical data.

Automated discovery of previously unknown patterns.
By combining different variables and many types of data, data mining can be used to discover patterns in data that were previously unknown. For example, this type of data mining is used in marketing to discover if customers who purchase certain products are also buying other products (for example, Amazon uses this to give recommendations to customers: people who bought this product also bought that product).
Within PES, this could for example be used to discover:
- Whether jobseekers with certain similar aspects on their resumes are more likely to find jobs quicker.
- Whether certain combinations of jobseeker characteristics would also make them a good fit for vacancies not directly fitting with their past experiences.

Nowadays, data mining is often used in conjunction with more advanced types of analytics (as we will discuss below). For example, certain probabilities of occurrences discovered using data mining can be used as inputs for machine learning models. Likewise, machine learning could be used to improve the algorithms used for data mining.

KDD in many ways is a follow-up step to data mining. It is important to mention here as it illustrates two points:
- While data mined can have value in itself, the true value lies in the interpretation of data and its transformation into knowledge.
- It requires extra effort to turn data into interpretative knowledge (and actionable wisdom).

To move from data mining to KDD, PES can do two things:
- Enrich the data mined (e.g. by combining variables) so that more value is created or patterns become more obvious. For example, by combining unemployment trends with labour market seasonality, it becomes possible to correct for seasonal variations in unemployment and assess the true trend, if any (see the sketch after the tools table below).
- Use experts to interpret results. Making sense of results can be extremely difficult without the proper subject matter expertise and knowledge of the context in which the analysis takes place. It is for these reasons that having social scientists (and other experts) on the data team can enhance its value multifold.

Getting started

The following methods, tools and/or applications can be used for data mining and/or KDD:

PACKAGE/TOOL: RapidMiner
EXPLANATION: Open source platform to mine data (and run data science applications).
LINK: https://rapidminer.com/

PACKAGE/TOOL: Data Applied
EXPLANATION: Cloud-based, free data mining and visualisation platform.
LINK: http://www.data-applied.com/

PACKAGE/TOOL: Orange
EXPLANATION: Open source data mining and visualisation platform (with machine learning capabilities).
LINK: http://orange.biolab.si/
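As a worked illustration of the enrichment step described above (correcting unemployment trends for labour market seasonality), the rough sketch below computes a simple seasonal index from invented monthly figures and removes it. Statistical offices use far more sophisticated procedures (e.g. X-13ARIMA-SEATS); this only shows the idea.

# Rough sketch: removing a seasonal pattern from a monthly
# unemployment series so the underlying trend becomes visible.
# The figures below are invented for illustration.
monthly = [
    102, 98, 95, 90, 88, 86, 91, 93, 89, 92, 97, 105,   # year 1
    104, 99, 97, 91, 90, 87, 92, 95, 91, 94, 99, 108,   # year 2
]

n_years = len(monthly) // 12
overall_mean = sum(monthly) / len(monthly)

# Seasonal index per calendar month: average of that month across
# years, relative to the overall mean.
seasonal = []
for m in range(12):
    month_mean = sum(monthly[m + 12 * y] for y in range(n_years)) / n_years
    seasonal.append(month_mean / overall_mean)

# Divide each observation by its month's seasonal index.
adjusted = [value / seasonal[i % 12] for i, value in enumerate(monthly)]
print([round(v, 1) for v in adjusted])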
4.4 Advanced Analytics

4.4.1 Artificial Intelligence

Artificial intelligence (AI) is used to create smarter technologies that can make decisions or support decision making. The main goal of AI is to create technologies that are so smart that they can think and act like humans. The Turing test is the benchmark for AI, after which it can be considered 'human smart'.

Artificial Intelligence is a broad concept that encompasses machine learning and deep learning and intersects with other types of analytics, such as data mining and statistics. In this toolkit we restrict ourselves to the intelligent applications of AI, where the application exhibits certain levels of smartness based on learning and creativity.

The Turing test
The Turing test was developed by mathematician Alan Turing in 1950 in order to test the ability of a machine to exhibit intelligent behaviour. The common version of the Turing test requires a computer to pretend to be a human. A human interrogates the machine (without knowing it is one) and subsequently decides whether he/she was talking to a human or a machine. If the machine wins, it has passed the Turing test.

Understanding AI works best by looking at real-world applications. Perhaps the earliest application of AI that received world-wide attention was when IBM's chess computer Deep Blue beat world champion chess player Garry Kasparov in 1996. This computer was able to process and analyse enormous amounts of data (previous chess games) and, based on Kasparov's playing style, reason which move was best out of all available moves. This ability to reason is a key characteristic of AI.

More recently (March 2016), AlphaGo (by Google DeepMind) was the first artificial intelligence application to beat a human at the game Go. This is an important feat for a number of reasons. The first is that, compared to chess, Go is a much more complicated game with many more moves. Analysing all the potential moves of a game of Go requires tremendous processing power, and the ability of AlphaGo to crunch the numbers is a testament to the progress made in computational power. The second is that AlphaGo beat a human against the expectations of (many) AI experts, who did not expect the application to be sophisticated enough in terms of reasoning and decision making. If you want to see more current examples of how AI is developing, https://aiexperiments.withgoogle.com/ is a good website to get inspiration.

Figure 8: Components of Artificial Intelligence ((Big) Data, learning algorithms and computational power feeding into Artificial Intelligence)

These two examples, and the progress they showcase, help us understand the more foundational properties of AI (see figure 8). Furthermore, they help us understand what a PES needs in order to start using AI:
- Large amounts of data that are available for the AI to process.
- The availability of sufficient amounts of processing power.
- Algorithms that can learn from the data and become increasingly smart.
EXAMPLES OF AI: Chat bots are more and more commonplace in the private sector. These bots are able to provide basic customer service and help clients solve basic problems.
POTENTIAL FOR PES: Similarly, bots could be used in service settings. They could ease the burden on call centre agents by responding to simple inquiries, leaving the complex and ambiguous tasks for human agents.

EXAMPLES OF AI: Virtual assistants such as Siri, OK Google and Cortana, that assist people throughout their days by giving (localised and personalised) recommendations, pro-actively serving information and answering questions.
POTENTIAL FOR PES: Virtual assistants that guide the jobseeker through the entire process, from registering to finding a job and exiting the system. The assistant reminds jobseekers of when to do things, gives them advice on how to do things and answers questions about the process (e.g. gives writing tips when a jobseeker creates a resume or application letter).

EXAMPLES OF AI: Video games, where artificial intelligence controls the behaviour of opponents in shooter games. These opponents can learn from the player's behaviour and subsequently change their routines.
POTENTIAL FOR PES: Serious gaming, where jobseekers can practice job interviews and, based on AI, the interviewer asks relevant questions (and subsequently gives feedback).

Examples and potential for PES

In the table above, we describe several existing applications of AI and how PES could develop similar applications.

Getting started

Given the novelty of AI and the lack of PES (or other government) experiences that could be readily implemented within PES, it seems advisable to start small. Smaller scale experiments with AI allow PES to explore the possibilities and reduce both the amount of data needed and the complexity of the algorithms. The following tools could be of help.

PACKAGE/TOOL: OpenCog
EXPLANATION: OpenCog is an open-source software project aimed at creating general Artificial Intelligence applications.
LINK: http://opencog.org/

PACKAGE/TOOL: Watson
EXPLANATION: Watson is marketed by IBM as a relatively generic AI application for businesses and governments and is used in various areas such as (structured and unstructured) analytics, virtual assistants, data integration and search.
LINK: http://www.ibm.com/watson/
4.4.2 Machine Learning

Machine learning is used to create better functioning algorithms and models by learning from ongoing analysis. Machine learning is a subset of artificial intelligence, and there is disagreement about the exact difference between the two concepts. We see the difference as machine learning being mostly used to analyse large volumes of data, discover patterns in these data and subsequently learn from the data. Artificial intelligence goes one step further and includes systems that can make decisions, combine elements, reason and thus show behaviour comparable to human thinking. Herein also lies a key difference between machine learning and data mining/KDD: in machine learning there is a clear emphasis on learning from the data and the applied analysis for future iterations.

Several types of machine learning exist; in the table below we list a number of common types (a small sketch of the first type follows after the table):
TYPE: Supervised learning
EXPLANATION: The team feeds clearly labelled data with desired outcomes into the computer, and the machine learns how to turn inputs into outputs.
PES POTENTIAL APPLICATION: Learn how jobseekers best match certain jobs based on pre-defined inputs.

TYPE: Unsupervised learning
EXPLANATION: In this type of learning, the machine tries to create inferences by itself based on unlabelled or unstructured data.
PES POTENTIAL APPLICATION: Analyse jobseekers' resumes, create certain skill categories based on these resumes and match these skills to vacancies.

TYPE: Reinforcement learning
EXPLANATION: The machine must achieve a goal and has to create its own feedback mechanisms to assess whether it is getting closer to its goal (such as self-driving cars).
PES POTENTIAL APPLICATION: Running simulations to minimise unemployment over time.
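A minimal supervised learning sketch follows, using the scikit-learn Python package (an assumption; R offers equivalents). The features, labels and data are invented: the point is only the pattern of feeding labelled examples to an algorithm and checking its predictions on held-out cases.

# Minimal supervised-learning sketch (hypothetical data): predicting
# whether a jobseeker finds work within six months from a few profile
# features. Requires scikit-learn (pip install scikit-learn).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Features: [age, months_unemployed, years_experience]; label: 1 = found work
X = [[25, 2, 3], [48, 14, 20], [33, 5, 8], [56, 22, 30],
     [29, 1, 4], [41, 9, 15], [38, 18, 12], [23, 3, 1]]
y = [1, 0, 1, 0, 1, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))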

Machine learning is being widely applied in commercial settings and healthcare. Virus scanners on computers are based on machine learning, and so are smart thermostats that learn about your heating preferences and adjust heating cycles accordingly. In the following table we list some more applications of machine learning from other domains and how these could be applied within PES.

EXAMPLES OF MACHINE LEARNING: Recommender systems from companies such as Netflix and Amazon that become smarter by learning from users' behaviours and from people similar to the user.
POTENTIAL FOR PES: Improvement of job matching systems based on technology similar to (commercial) recommender systems. For example, based on successful matches of (previous) jobseekers with similar characteristics, more tailored recommendations for vacancies could be made.

EXAMPLES OF MACHINE LEARNING: Fraud detection systems, such as those used by banks, insurance, and credit card companies, rely on machine learning to detect potentially fraudulent transactions.
POTENTIAL FOR PES: Detection of potentially fraudulent benefit applications by analysing patterns in past known applications and learning the difference between fraudulent and non-fraudulent applications.

EXAMPLES OF MACHINE LEARNING: Chat application Skype9 uses machine learning to translate voice or instant messaging conversations in a multitude of languages in real time. Its machine learning algorithms make it better as it gets used more frequently.
POTENTIAL FOR PES: Facilitate international job interviews or cross-regional interviews in countries with multiple languages. It could also be used for job (language) training and/or counselling.

9 See https://www.skype.com/en/features/skype-translator/
Examples and potential for PES

Within PES, machine learning has not been widely applied. One notable exception is the application of learning algorithms at the Flemish PES (VDAB) (see below).

Getting started

Several of the tools already mentioned can be used for machine learning (such as R & Python). The table below lists more relevant tools.

PACKAGE/TOOL: Apache Mahout
EXPLANATION: An environment for quickly creating scalable machine learning applications, as well as a free library of machine learning algorithms.
LINK: https://mahout.apache.org/

PACKAGE/TOOL: OpenNN
EXPLANATION: A C++ library implementing neural networks.
LINK: http://www.opennn.net/

PACKAGE/TOOL: TensorFlow
EXPLANATION: A software library for machine learning.
LINK: https://www.tensorflow.org/

PACKAGE/TOOL: Encog Machine Learning Framework
EXPLANATION: Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalise and process data.
LINK: http://www.heatonresearch.com/encog/

PRACTICAL EXAMPLE
The Flemish PES (VDAB) is working to improve job matching by using Big Data to recommend vacancies to jobseekers when they access the VDAB's vacancy system. In 2016 a recommender system was developed based on a twofold objective: finding out which users are interested in which vacancies (by looking at what they click on, what they read, open and look at); and predicting a jobseeker's interest in other vacancies (by looking at what similar users have looked at, analysing behaviours). VDAB seeks to make both accurate and extended recommendations to jobseekers through this system, looking to open up the pool of jobs that a jobseeker could find interesting based on a range of preferences that are expressed in the vacancy.
The recommender system is currently being tested on a set of sub-users. This system has been developed by VDAB's Innovation Lab. More information about the Lab can be found on a specific fiche accessible on the PES Practices website.

4.4.3 Deep Learning

Deep learning is used to explore data that is highly unstructured and abstracted and tries to create abstractions from this data. Deep learning is a subset of machine learning (which, as explained above, is a subset of artificial intelligence). In our view10, the key difference between machine learning and deep learning is that deep learning focuses heavily on unstructured and abstract data as well as the combination of many layers of data. Machine learning tends to focus on structured data and discovering patterns in data that are well organised.

Deep learning is for example used to automatically organise and tag photos. Companies like Google and Facebook can recognise people and locations in photos and can use this information to tag and categorise the photos.

Thinking along these lines, a potential application of deep learning in PES is to have algorithms learn from jobseekers' CVs or resumes and discover useful information. For example, formatting styles, fonts used, colours and pictures could tell us something about the jobseeker that could be useful when recommending jobs or to help them optimise their resumes. Similarly, recordings (with consent and for research purposes) could be used to analyse tone of voice and emotions, and this could be used to personalise service delivery processes.

Getting started

As with machine learning, many of the tools already mentioned can be used for deep learning (such as R & Python). The table below lists more relevant tools and/or specific packages.

PACKAGE/TOOL: Deeplearning4j
EXPLANATION: Open-source, distributed deep learning framework.
LINK: https://deeplearning4j.org/

PACKAGE/TOOL: Keras
EXPLANATION: Python library for development and evaluation of deep learning models.
LINK: https://keras.io/

10 Once again, many different interpretations exist, so the reader may have come across different definitions. We have tried to create an easy-to-understand common definition.

4.5 Combinations & Derivations

In the previous sections we have described what currently (in our view) are the most important and promising types of analytics for PES. However, many subtypes, combinations and derivations of these main classes exist. In this section we briefly mention several of these types of analytics. For each type, we define and explain the concept, describe how it relates to other types of analytics, and outline how it may be of value for PES. (A small sketch illustrating the NLP idea follows at the end of this section.)

1. Predictive/Prescriptive analytics

Based on models, trying to extrapolate from previous data points to future data points. Many recommender systems are based on predictive analytics, and so are well-known examples such as weather forecasts.

Within PES, this could, for example, be used in:
- Unemployment or labour market forecasting
- Jobseeker profiling applications (e.g. by estimating jobseekers' developments and training needs)
- Matching applications (e.g. by predicting the ease with which vacancies can be filled).

2. Natural language processing (NLP)

NLP refers to a broad class of methods to interpret normal people's or 'natural' language (and translate it into other types of language). This is used by speech recognition and translation software.

Within PES, this could be used to:
- Better understand customer service communication (and for example create content that better aligns with clients' language)
- Interpret jobseekers' resumes, and better match the language used by jobseekers and employers
- Create better classification schemes (e.g. ESCO) by mapping jargon and technical terms to human language.

3. Image recognition

This is a special class of machine/deep learning focused on understanding or learning from (digital) images. As mentioned above, the underlying deep learning algorithms are used to tag or categorise content.

Within PES, this could be used to:
- Analyse profile pictures used for resumes and make recommendations for jobseekers' pictures to better match certain jobs.

4. Speech recognition

Closely related to natural language processing, speech recognition is used to understand spoken language. Combined with natural language processing and other types of machine learning and AI, this could be used to create social robots and/or chat bots. Currently, speech recognition is used to transcribe spoken communication. This helps create archives of communication and allows organisations to understand content (such as the questions customers have and their accompanying emotions).

Within PES, this has the following potential applications:
- Understanding tone of voice and emotions in customer service interactions to better understand perceived problems and obstacles
- Allowing communication to be continued and stored on other channels
- Better understanding word choices and using these to update web and written content.
Chapter 5. Presenting & Reporting

In this chapter we focus on presenting results and creating reports. We do not focus on traditional (written) reports (given the abundance of resources on reporting). We do focus on (interactive) visualisations and dashboards, as well as sharing data with the public using open data. Throughout, we discuss pros, cons and considerations.

The following topics form the main content of this chapter:
- Understand why traditional reports may not be the best way to present results from smart data analytics
- Provide an overview of (novel) ways to present and report data
- Considerations when reporting data and who to open results up to.

Strategic questions this chapter answers:
- How can I get the best possible insights to help my (strategic) work?

Tactical questions this chapter answers:
- What are available ways to present and report data?
- What are the use cases for each of those and how will they help PES?

5.1 Why move away from traditional reports?

While the traditional report is still widely used and in many cases a good way to convey information and report research findings, many organisations are moving away from written reports when reporting about smart data. Several reasons for this exist:
- Keep results up to date: One problem with traditional reports is their static nature. This creates problems with dynamic data, and especially high-velocity data that creates a continuous stream of insights with no natural stopping points. To benefit from the possibility to refresh results, more dynamic ways to report outcomes are desired.
- Say more with less: Text is often not the best way to describe results, and the static nature of many graphs in reports does not allow readers to dig deeper into the data. This is a problem solved by interactive visualisations that allow users to filter, sort and search through the data and show those insights that matter.
- Be more appealing and actionable: One of the bigger problems with traditional reports is their linear nature and the often dry style of writing, which does not compel readers to read the entire report, let alone follow up on its recommendations. This is a problem that many interactive tools try to solve by offering personalised insights and more dynamic routes to explore outcomes.

5.2 (Interactive) Visualisations

We start this overview with a short discussion of the role of visualisations. Visualisations are a powerful way to show findings and interesting patterns in data. A key benefit of visualisations is that they allow to:
- More easily show relationships and developments (over time) than using text.
- Provide a better way of conveying and remembering information than text for many people (depending on their learning styles).

However, visualisations also have drawbacks, most notably:
- Visual representations alone may lead to false conclusions, for example when a graph suggests a relationship between variables while in real life there is no significant relationship.
- Complex graphs can be overwhelming and distract from the message that the sender wants to convey. Especially the volume aspect of big data can cause problems when trying to visualise too many data points at once.
- Very often it still requires skills and (domain) expertise to interpret results. When just visualisations are presented, it can be difficult to judge the value of the results presented.
- While great for presenting results and data, visualisations are not the best vehicle to raise concerns, present points for discussion and describe context.

In general, when information is very complex and much contextual information is needed, simple visualisations as a stand-alone way to report information are not the best way to convey a message. For these reasons, the following seem good use cases for (interactive) visualisations:
- In production environments, when the information is simple enough to be understood without too much contextual information. For example, visual elements could be used to show the extent to which jobseekers match certain jobs.
- In situations when summaries of information are presented (for example in evaluations of pilots). Condensed versions of information (like summaries) reduce the complexities of information and make it easier to use visualisations.
- When room for contextual information is available (e.g. in dashboards or interactive web pages), more complex information can be transmitted using visualisations, provided the contextual information allows the information to be interpreted correctly.
- When experts (e.g. from the data team) are available to help interpret information, more complex visualisations can be used and the experts can help resolve any ambiguities.

Getting started

There are many ways to create visualisations; the following methods, tools and/or applications can be used (a small plotting sketch follows below):
- Most spreadsheet software (e.g. Microsoft Excel, LibreOffice Calc) includes basic graph functionality as well as rudimentary capabilities for interactivity.
- Most statistical software (e.g. SPSS, SAS) has rudimentary visualisation capabilities that allow for basic manipulations.
- Many packages or programming languages can be used for data visualisations. Some of the most common ones are JavaScript libraries such as jQuery and Chart.js. For some (interactive) examples see http://bl.ocks.org/mbostock.
- Many online services are built on top of these packages (and other code) and allow users to create visualisations online without the need for coding (e.g. Datawrapper.de).
- Certain dedicated software tools exist for data visualisations. Tableau (tableau.com) is a well-known example, but others exist, such as Sisense (sisense.com). Very often these tools blur the line between simple visualisation tools and (online) dashboards that can be used to analyse and manipulate data.
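As a minimal "getting started" sketch, the following plots an invented monthly unemployment series with a moving average using the Python matplotlib package (one option among the tools listed above).

# Minimal visualisation sketch (pip install matplotlib): a monthly
# unemployment series with a 3-month moving average. Figures invented.
import matplotlib.pyplot as plt

months = list(range(1, 13))
unemployed = [102, 98, 95, 90, 88, 86, 91, 93, 89, 92, 97, 105]

moving_avg = []
for i in range(len(unemployed)):
    window = unemployed[max(0, i - 2): i + 1]
    moving_avg.append(sum(window) / len(window))

plt.plot(months, unemployed, marker="o", label="Registered unemployed (x1000)")
plt.plot(months, moving_avg, linestyle="--", label="3-month moving average")
plt.xlabel("Month")
plt.ylabel("Jobseekers (thousands)")
plt.title("Illustrative unemployment trend")
plt.legend()
plt.savefig("trend.png")  # or plt.show() in an interactive session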
PRACTICAL EXAMPLE
WollyBI (wollybi.com) is a spin-off from the Department of Statistics and Quantitative Methods (CRISP), University of Milan-Bicocca. Together with regional public employment services, CRISP created the WollyBI platform as a means to visually explore the labour market in various regions in Italy based on vacancies, location and skills. It currently allows users to visualise, in a user-friendly manner, various analyses of over 1.5 million job vacancies.

5.3 Interactive Tools & Dashboards

In this section we discuss (novel) ways to present data using interactive (web) tools and online dashboards. The benefit of these tools is that they allow users to interact with the data and thus allow them to a) personalise the data to fit their needs (e.g. through searching, sorting, and filtering), b) explore patterns more easily while going from one section to another (related) section, and c) get help and contextual information more easily (e.g. through embedded help functions).

We define a dashboard in this context as an interactive tool, most often based on web technologies, working directly on top of data sources, that allows users to manipulate and visualise information and provides additional textual and contextual information.

Interactive tools and dashboards aim to resolve many of the issues existing with stand-alone visualisations. For example:
- Dashboards typically allow for more sophisticated types of data manipulations such as sorting, searching and filtering.
- Dashboards can include additional information explaining data points that helps interpret the information.
- Dashboards can include contextual information that helps understand the setting in which the data was collected and analysed. This can increase understanding of the data.
- Dashboards can include recommendations or conclusions that build upon the data presented.
- Dashboards allow links between different sections and thus allow for more dynamic routing of information.
- Dashboards can link to information outside of the dashboard, allowing links to other relevant subjects.
- Dashboards allow the integration of communication tools (e.g. chat applications), or can link to communication tools, so that the user of the dashboard can contact the data team or other support staff for help with the use of the dashboard and/or the interpretation of information.

In many cases dashboards are cloud based or through other means accessible on the internet or intranet. This allows information to be shared easily and allows flexible access policies. Furthermore, because dashboards typically work on top of analytics data sources, they allow for easy discrete or continuous data updates.

However, dashboards could suffer from the following drawbacks:
- Because of the flexible nature of dashboards and the many features that can be added, the risk exists of trying to add everything to a dashboard, leading to something called feature creep: adding every single possible data point and type of information. This could severely distract from the goal which should be achieved.
- Maintaining access rights requires resources, and solid policies for this have to be created.
- When dashboards are open (i.e. without user authorisation or accessible without credentials), extra care needs to be taken to prevent misinterpretation and abuse of data.
- Even with the possibilities to explain and add context, interpreting data remains complicated, and even with the best dashboard, experts may still be needed. The risk exists that dashboards are used to replace instead of complement experts.

Getting started

The following tools can help in getting started with more interactive visualisations or dashboards (a minimal data-endpoint sketch follows after the table).

PACKAGE/TOOL: Shiny
EXPLANATION: Web application framework to turn (R-based) analytics into interactive applications.
LINK: http://shiny.rstudio.com/

PACKAGE/TOOL: Google Charts
EXPLANATION: Web application to create (interactive) graphs from various types of data.
LINK: https://developers.google.com/chart/

PACKAGE/TOOL: Tableau
EXPLANATION: One of the most popular applications to create interactive graphs or dashboards. Available as desktop, server and cloud solution. [Has a free trial, paid afterwards.]
LINK: http://tableau.com
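To illustrate the "dashboard on top of a data source" idea, the minimal sketch below uses the Python Flask package (an assumption; any web framework will do) to expose a JSON endpoint that a charting library could query. The endpoint name and data are invented.

# Tiny data endpoint for a dashboard front end (pip install flask).
from flask import Flask, jsonify

app = Flask(__name__)

UNEMPLOYED_BY_MONTH = {"2016-01": 102, "2016-02": 98, "2016-03": 95}

@app.route("/api/unemployment")
def unemployment():
    # In a real dashboard this would query the analytics database
    return jsonify(UNEMPLOYED_BY_MONTH)

if __name__ == "__main__":
    app.run(port=5000)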
5.4 Open data

A special class of reporting is that of open data. Open data is slightly outside of the scope of this toolkit, and for this reason we only discuss the concept briefly. In open data, data sources (such as raw data or analysed data) are opened up for broader audiences to use. Four main reasons for open data exist:
- Create transparency: By opening up data about governments and processes, public organisations allow the public to see what governments do, how they function and how money is being spent.
- Accountability: A second reason, tied to the first, is that of accountability. Using open data, citizens can check up on the promises and actions of governments.
- Improve services or processes: Using open data, external parties can help governments improve processes (e.g. by finding data patterns or results not previously discovered by governments themselves). Furthermore, it creates the opportunity for external parties to compete with PES and do certain things more effectively and/or efficiently.
- Innovate: Using open data, third parties could develop new, innovative applications that could provide new or additional services for jobseekers or employers. This could indirectly benefit PES.

While open data has these potential benefits, the following points, directly tied to the topic of this toolkit, need to be kept in mind:
- Develop good policies for privacy and confidentiality when opening up data to the public. Also bear in mind that, through combinations of variables and smart inferences, it sometimes is possible to identify (possible) individuals even when PII is removed from datasets. For this reason, open data should be screened extra carefully (see the sketch at the end of this section).
- Security is always important, and when creating an open data infrastructure, the PES basically creates a (vulnerable) entrance to its data systems. To minimise security risks, it is advisable to create a separate infrastructure for open data that is not connected to critical systems or infrastructures.
- To make sure open data is used and interpreted properly, it is important that open data are documented and described properly.

The following examples of open data can be inspirational:
- LMI for All (see http://www.lmiforall.org.uk): Online data portal, containing sources of high-quality, reliable labour market information (LMI) from the UK.
- Data.gov (http://data.gov): Open data portal from the United States Government. Contains a wide variety of open data sets and a good searchable interface.
- European Union Open Data Portal (https://data.europa.eu/euodp/en/data): The EU Open Data Portal is an access point to a growing range of data produced by the institutions and other bodies of the European Union.
- Canadian Open Data Portal (http://open.canada.ca/en/open-data): Interesting because it showcases applications developed based on the open data available in the portal.
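The screening concern above can be made concrete with a rough k-anonymity check: before publishing, verify that every combination of quasi-identifiers (e.g. birth year, region, gender) occurs at least k times. The records and threshold below are invented; proper statistical disclosure control goes well beyond this sketch.

# Rough k-anonymity screen: flag quasi-identifier combinations that
# occur fewer than k times, as these rows risk re-identification.
from collections import Counter

records = [
    ("1982", "NUTS-3 region A", "female"),
    ("1982", "NUTS-3 region A", "female"),
    ("1975", "NUTS-3 region B", "male"),
]
K = 2
counts = Counter(records)
too_rare = [combo for combo, n in counts.items() if n < K]
print("combinations below k:", too_rare)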
Chapter 6. Evaluation & Continuation

In the sixth and final chapter we focus on aspects pertaining to evaluation and continuation. Evaluating both the outcomes and the process are often neglected parts of any analytical process. However, they are extremely important in judging the quality of both process and outcomes and are therefore a crucial step in deciding what to do with the outcomes of an analytics process. For example: should the results of a pilot be implemented organisation-wide? Only a thorough evaluation can help answer this question properly.

The following topics form the main content of this chapter:
- Help understand the importance of evaluation
- Create an ongoing data analytics infrastructure and culture.

In this chapter, we answer the following strategic questions:
- How do we ensure our analytics are correct?
- What do I need to do to make this part of (ongoing work in) the PES?

The following tactical questions are answered:
- How do I evaluate the process and the outcomes?
- What can I do to ensure continuation?
- How do I scale up from pilots/small-scale projects to something larger?

6.1 Evaluation

Note: this section does not focus on evaluation as a research method. It focuses on the evaluation of any analytics process.

Evaluation is the last step in any (stand-alone) analytics process and should be a part of any ongoing or continuous activity. There are many reasons why evaluations are very important, such as:
- Learning from the experience in order to improve future activities or prevent mistakes from happening again.
- Assessing the quality of the results so that they can be judged by their true value.
- Assessing the quality of the process to verify that no mistakes have crept in and the results are reflective of the desired situation.
- Documenting the experiences so that other teams can replicate the process and/or draw the same learnings.
- Understanding whether the results of the analytics activities are worth the (financial) investments.
- Determining whether one-time projects or pilots can or need to be implemented throughout the organisation or be scaled up.

Given the many reasons to conduct evaluations, it is important to include evaluations in any data-related activity, and this entails:
- Incorporation of evaluation as a step or activity in any data analytics process
- Dedication of resources to conduct the evaluation.

Not every evaluation is the same, though, and we can distinguish between four different types of evaluations11 based on the combination of the evaluation focus (what is being evaluated) and the evaluation moment (when the evaluation is taking place). The following overview shows the four types. Different types of evaluations can also take place simultaneously; they are not mutually exclusive.

EVALUATION MOMENT × EVALUATION FOCUS:
- Continuous, process oriented: Process flow evaluations
- Continuous, outcome oriented: System outcome evaluations
- Ad-hoc, process oriented: Process performance evaluations
- Ad-hoc, outcome oriented: Outcome assessments

Each of these evaluations has the following characteristics:
- Process flow evaluations: These are ongoing evaluations of analytical processes and aim to answer such questions as: Is the analytical system working as expected? Are there any errors? Are the results within anticipated margins, and are there large variations in the time it takes to run models? Process flow evaluations are typically used to determine if the process itself is functioning well. For example, when a recommender system takes much longer to generate a recommendation in certain situations, there could be a bug or error in the model or data input.
- System outcome evaluations: These evaluate the outcomes of an analytical model or system continuously. They aim to answer questions such as: Are the outcomes of the analytics as expected? Do (predictive) analytics match actual outcomes, or can we triangulate outcomes to other data sources and assess their validity? For example, when a PES implements a predictive analytics tool to predict the likelihood of jobseekers finding a new job within a certain amount of time, system outcome evaluations will focus on measuring whether the predicted result (eventually) leads to jobseekers finding jobs.
- Process performance evaluations: In these evaluations, usually when a project or pilot is concluded or a process is evaluated more thoroughly, the process itself is evaluated. Usually, these evaluations are broader than process flow evaluations, and process performance evaluations, in the context of analytics, focus on the entire analytical process. They could include such questions as: Was the team working well together? Was the process effective and efficient? Are we happy with the tools and methods used? Are outcomes being used properly? For example, when a PES conducts a pilot to implement a new profiling system, a process performance evaluation could consist of interviews with members of the data team and PES employees using the new profiling system to evaluate the process.
- Outcome assessments: This fourth type of evaluation focuses on either the actual outcomes of a pilot or one-time project, or the (overall or cumulative) outcomes of an ongoing process at clearly defined points in time. This helps answer questions such as: Are the analytics creating correct outcomes over longer periods of time (e.g. when adjusting for seasonal fluctuations)? Are the outcomes of a pilot satisfactory? Are the outcomes of a new analytical tool/method more robust, valid or reliable compared to the old method (and should we therefore implement the new method)? A small illustration follows at the end of this chapter.

11 Also see the Analytical Paper on this topic, Pieterson (2016).
For example, one could compare the quality of human-generated vacancy matches with those of a new analytics platform after a certain amount of time and a certain number of matches made. The comparison of the two ways of matching could tell the organisation which one performs better (and should therefore be implemented).
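A minimal sketch of what such a comparison could look like in code. The match counts, the placement indicator and the 90-day success window below are all invented for illustration; a real evaluation would draw outcomes from the PES register:

```python
# Illustrative sketch: compare the placement rate of counsellor-made vacancy
# matches with matches made by a new analytics platform. All figures are
# invented; a real evaluation would read outcomes from the PES register.
from math import sqrt

def placement_rate(outcomes):
    """Share of matches that led to a placement (list of booleans)."""
    return sum(outcomes) / len(outcomes)

def two_proportion_z(p1, n1, p2, n2):
    """z-statistic for the difference between two placement rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical outcomes: True = the match led to a job within 90 days.
human = [True] * 120 + [False] * 80      # 200 counsellor matches
platform = [True] * 150 + [False] * 50   # 200 platform matches

p_h, p_p = placement_rate(human), placement_rate(platform)
z = two_proportion_z(p_p, len(platform), p_h, len(human))
print(f"counsellors: {p_h:.0%}, platform: {p_p:.0%}, z = {z:.2f}")
# |z| above roughly 2 suggests the difference is unlikely to be chance alone.
```

A statistically significant difference alone does not settle the implementation question; the cost-benefit considerations in section 6.2 below still apply.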

Getting started

The following can help in getting started with evaluations:
• Include evaluations in any analytics plan or proposal. This ensures commitment to evaluation from the start. This should also include the exact evaluation moment. For continuous evaluations, many moments to assess these evaluations could be scheduled (for example, recurring evaluation meetings).
• Try to connect evaluations to KPIs or practically relevant outcomes in the organisation. For example, certain performance goals for a new tool can be specified and evaluations can be used to measure progress towards these goals (a sketch of such a KPI check follows this list). This way, evaluations become a more practical and useful tool.
• Similarly to (for example) security training, (new) data team members could be trained regularly on the importance of evaluations.
• Especially for more important initiatives, it could be helpful to enlist external auditors or evaluation agencies. Furthermore, external auditors can be useful in general, deployed (randomly) on projects to guarantee integrity and guard against bias within the data team.
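A minimal sketch of such a KPI check, assuming the PES has agreed target values in advance; the KPI names and thresholds below are invented for illustration:

```python
# Minimal sketch: compare measured evaluation results against agreed KPI
# targets for a new tool. KPI names and target values are invented.
KPI_TARGETS = {
    "placement_rate": 0.60,          # share of matches leading to a job
    "median_days_to_match": 14,      # speed of service (lower is better)
    "caseworker_satisfaction": 3.5,  # survey score on a 1-5 scale
}

def evaluate_against_kpis(measured: dict) -> list[str]:
    """Compare measured values with targets and report on each KPI."""
    findings = []
    for kpi, target in KPI_TARGETS.items():
        value = measured.get(kpi)
        if value is None:
            findings.append(f"{kpi}: not measured yet")
        elif kpi == "median_days_to_match":  # lower is better for this KPI
            met = "met" if value <= target else "not met"
            findings.append(f"{kpi}: {value} vs target {target} ({met})")
        else:  # higher is better
            met = "met" if value >= target else "not met"
            findings.append(f"{kpi}: {value} vs target {target} ({met})")
    return findings

for line in evaluate_against_kpis(
        {"placement_rate": 0.63, "median_days_to_match": 17}):
    print(line)
```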
6.2 Continuation and scale-up of pilots

While ongoing evaluation is incredibly important and an organisation should keep a pulse on any ongoing analytical activities, one type of evaluation is especially important in terms of the decision the organisation has to make:

For any data analytics project that starts small, at some point in time the organisation has to decide whether this project (whether it is a research activity, pilot or experiment) will move from being a small-scale pilot to being rolled out in the entire organisation.

This decision regarding continuation, full scale-up or implementation in the organisation is important for the organisation. It will likely impact how (parts of) the organisation work, could impact service delivery for jobseekers and/or employers, and could consume valuable resources in the organisation. Besides these organisational aspects, there are several technical considerations when scaling up initiatives. Here are some examples of both:

Technical considerations:
• From a data standpoint, one has to be absolutely certain that the results are valid, reliable and can be generalised (if that is the aim). This typically requires a lot of testing and running models to assure the model fit is right.
• When scaling up, the assumption is that the new tool or product has obvious (and measured) benefits outweighing the previous tool or product. A cost-benefit analysis can help in this decision-making process. Similarly, this can help in deciding whether to continue certain analytics projects.
• Broader implementation requires that the tool is capable of being scaled up. This is not always the case. For example, code may not always be very efficient, which is not necessarily a problem when working with small datasets in smaller settings, but could lead to performance issues in large production environments (the sketch after these lists illustrates the difference). A technical assessment and requirements analysis can help in determining what needs to happen in terms of product (performance) requirements and stability before a tool is ready for a production environment.
• The interface of any product or tool could work well in a laboratory setting but not necessarily in a production environment. Highly trained members of a data-team used to working with data and tools have different needs than PES staff members working with production tools, for example in terms of: a) the ease of UI/UX, b) explanations/context surrounding datapoints, c) help/support functionality. User interface experts can help in readying an experimental tool for a production environment.

Organisational considerations:
• Very often an expansion or scale-up of a project or implementation of a product requires training and support for new staff members. This requires resources and time that may not be directly available and need to be planned.
• When the purpose of a data-team is solely to focus on experimenting and innovation, the data-team may not always be the most logical place to own a production tool or facility. If that is the case, a consideration becomes who the functional owner of the tool should be (and who is responsible for support, further development and maintenance).
• Perhaps most importantly, and often overlooked, is dealing with resistance in the organisation, as well as creating a data-driven or innovation-oriented culture in the organisation. Without the support of the employees in the organisation, any innovation is doomed to fail. Proper communication plans and cultural initiatives are essential in the scale-up or implementation of any new tool. Current attitudes and cultures should be taken into account when making implementation decisions.
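The efficiency point above can be made concrete with a small sketch. It computes the same (invented) score twice over a synthetic one-million-row register: once row by row, as pilot code often does, and once vectorised; the column names and score formula are hypothetical:

```python
# Illustrative sketch: pilot-style row-by-row code versus vectorised code.
# The register, column names and score formula are invented.
import time
import pandas as pd

df = pd.DataFrame({"months_unemployed": [m % 48 for m in range(1_000_000)],
                   "age": [18 + m % 50 for m in range(1_000_000)]})

# Pilot-style code: a Python loop over individual rows.
start = time.perf_counter()
slow = [row.months_unemployed * 0.1 + row.age * 0.01
        for row in df.itertuples()]
print(f"row-by-row loop: {time.perf_counter() - start:.2f}s")

# Production-style code: one vectorised expression over whole columns.
start = time.perf_counter()
fast = df["months_unemployed"] * 0.1 + df["age"] * 0.01
print(f"vectorised:      {time.perf_counter() - start:.2f}s")
# On a million rows the loop typically takes many times longer, a cost that
# only becomes visible once the tool leaves the small pilot dataset.
```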

PRACTICAL EXAMPLE

The Finnish PES developed a new statistical profiling tool that was implemented in the organisation in 2007. The profiling tool was part of an integrated IT system that calculated a risk estimate for the jobseeker at registration using administrative data. The new model was found to be 90 per cent effective at estimating the likelihood of a jobseeker being unemployed for over 12 months. However, caseworkers did not think the tool was useful and did not trust the results from the tool. As a result the tool was withdrawn from the production environment (see Kurekov, 2014).
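As a minimal sketch of how such an effectiveness figure can be checked once outcomes are known (the predictions and outcomes below are invented; a real evaluation would use register data after the 12-month window has closed):

```python
# Illustrative sketch: check a profiling tool's predictions against observed
# outcomes. All data here are invented for illustration.
predicted_high_risk = [True, True, False, False, True, False, False, True]
actually_ltu = [True, False, False, False, True, False, True, True]  # long-term unemployed

correct = sum(p == a for p, a in zip(predicted_high_risk, actually_ltu))
accuracy = correct / len(actually_ltu)
print(f"accuracy: {accuracy:.0%}")

# Accuracy alone can mislead when few jobseekers become long-term unemployed,
# so also check how many actual long-term unemployed were caught (recall).
caught = sum(p and a for p, a in zip(predicted_high_risk, actually_ltu))
recall = caught / sum(actually_ltu)
print(f"recall among long-term unemployed: {recall:.0%}")
```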

In sum, while an analytics activity may result in a useful result, or even an application that could be used within the organisation, there is usually a fairly long way to go before results can be used in the everyday work of the PES. Planning the implementation process carefully and taking the considerations above into account can greatly help in creating a smooth process.
Appendices

Appendix 1 | Safe Harbor De-identification types

This list gives an overview of the HHS types of PII that need to be removed from data sets following the Safe Harbor method. For the full text, consult http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

a) Names
b) Address/geographic information (such as street address, city, county, precinct, (full) ZIP code)
c) All elements of dates (except year) directly related to an individual (such as birth date, admission date, discharge date, death date)
d) Telephone numbers
e) Vehicle identifiers and serial numbers, including license plate numbers
f) Fax numbers
g) Device identifiers and serial numbers
h) Email addresses
i) Web Universal Resource Locators (URLs)
j) Social security numbers [or any other national equivalent]
k) Internet Protocol (IP) addresses
l) Medical record numbers
m) Biometric identifiers, including finger and voice prints
n) Health plan beneficiary numbers
o) Full-face photographs and any comparable images
p) Account numbers
q) Any other unique identifying number, characteristic, or code
r) Certificate/license numbers
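As a minimal sketch of applying such a list to a tabular dataset (the column names are hypothetical, and this is no substitute for a field-by-field legal review against the full list above):

```python
# Illustrative sketch of Safe-Harbor-style de-identification on a table.
# Column names are hypothetical; review every field against the list above.
import hashlib
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "street_address", "phone", "email",
                      "national_id", "ip_address"]

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Drop direct identifiers entirely (items a, b, d, h, j, k above).
    out = out.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in out])
    # Generalise dates to the year only (item c).
    if "birth_date" in out:
        out["birth_date"] = pd.to_datetime(out["birth_date"]).dt.year
    # Replace the record key with a salted one-way hash so records stay
    # linkable across tables without exposing the original number (item q).
    if "client_number" in out:
        out["client_key"] = [hashlib.sha256(f"salt:{v}".encode()).hexdigest()
                             for v in out["client_number"]]
        out = out.drop(columns=["client_number"])
    return out
```

In practice the salt would be stored securely and the whole procedure governed by the security and data protection measures discussed in chapter 3.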