
National Research Institute

Data Integrity Workshop

Washington, D.C.
November 14-15, 2008

by
Sen-Yoni Musingo, Ph.D.
1
2
Data Integrity Components

3
Data Integrity Components

• Data integrity implies that the data system, the data process, and the content of the data are reliable, consistent, and accurate.
• Data integrity is essential in order for data to be considered credible.

4
Accuracy in Data Integrity

Accuracy in Data Integrity
• Accuracy refers to how close a measurement (e.g., gender) is to the expected value (male or female).
• In archery, an arrow represents a measurement and the bulls-eye represents the expected or accepted value. Accuracy corresponds to the distance between the arrows and the bulls-eye.
• Your data is accurate if it is clean and precise!
6
Accuracy in Data Integrity

7
Accuracy in Data Integrity
If it stinks, you need to wash it!

It Stinks!

Very Clean
8
Accuracy in Data Integrity

Data Cleansing and Scrubbing

• Detecting and removing and/or correcting dirty data (e.g., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly)
• Bringing consistency to different data sets that have been merged from separate databases
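A minimal sketch of a scrubbing pass, assuming records are simple dictionaries with hypothetical "name" and "state" fields (neither comes from the workshop material). It normalizes formatting, drops redundant rows, and sets incomplete rows aside for correction rather than silently deleting them.

```python
def scrub(records):
    """Return cleaned records plus the rows rejected as incomplete."""
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        # Fix formatting: trim whitespace and unify letter case.
        name = (rec.get("name") or "").strip().title()
        state = (rec.get("state") or "").strip().upper()
        if not name or not state:
            rejected.append(rec)   # incomplete: keep for correction
            continue
        key = (name, state)
        if key in seen:
            continue               # redundant duplicate: skip
        seen.add(key)
        cleaned.append({"name": name, "state": state})
    return cleaned, rejected


raw = [
    {"name": "  jane doe ", "state": "dc"},
    {"name": "Jane Doe", "state": "DC"},   # duplicate once normalized
    {"name": "John Roe", "state": ""},     # incomplete
]
cleaned, rejected = scrub(raw)
```

Returning the rejected rows separately reflects the "removing and/or correcting" wording above: someone still has to decide what happens to them.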

9
Accuracy in Data Integrity

Electronic Data Cleansing and Scrubbing

Enterprise Service Bus

In computing, an ESB provides fundamental services via an event-driven and standards-based messaging engine (the bus).

10
Accuracy in Data Integrity

Data Cleansing Techniques

• Data collection validation: to make sure data input meets established business rules
• Referential integrity and constraints: e.g., rules that prevent orphaned and duplicate records
• Lookup table usage: ranges of valid values to prevent entry of incorrect data
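The three techniques above can be sketched as a single validation function. Everything specific here is an assumption for illustration: the field names, the age range, the gender codes, and the facility IDs are hypothetical, not taken from any real system.

```python
VALID_GENDERS = {"M", "F"}       # lookup table of allowed codes (assumed)
FACILITIES = {"F001", "F002"}    # parent table of facility IDs (hypothetical)


def validate(record, seen_ids):
    """Return a list of rule violations for one admission record."""
    errors = []
    # Business rule: age must be a whole number in a plausible range.
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age out of range")
    # Lookup table: only known codes are accepted.
    if record.get("gender") not in VALID_GENDERS:
        errors.append("unknown gender code")
    # Referential integrity: the record must point at an existing facility.
    if record.get("facility_id") not in FACILITIES:
        errors.append("orphaned record: unknown facility")
    # Uniqueness constraint: no duplicate primary keys.
    if record.get("id") in seen_ids:
        errors.append("duplicate id")
    seen_ids.add(record.get("id"))
    return errors
```

In a database these rules would normally be declared as constraints (foreign keys, unique keys, check constraints); the function only shows the logic they enforce.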

11
Accuracy in Data Integrity

Data Cleansing Techniques (cont’d)

• Relational edits and cross checks: perform checks and balances across data elements (e.g., pregnant male; admission date before birth date)
• Null and default value management: making sure stakeholders understand the use of NULL and default values in different contexts across data elements
• Exception handling and remediation: e.g., making sure erroneous records are identified and corrected
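A hedged sketch of the two cross checks named above, with an exception queue for remediation. The field names are hypothetical, and treating None as "not reported" is one possible NULL convention, not the only one.

```python
from datetime import date


def cross_check(record):
    """Return the relational edits that a record fails."""
    problems = []
    # NULL handling: None means "not reported", which differs from False.
    if record.get("pregnant") is True and record.get("gender") == "M":
        problems.append("pregnant male")
    birth = record.get("birth_date")
    admitted = record.get("admission_date")
    if birth is not None and admitted is not None and admitted < birth:
        problems.append("admission date before birth date")
    return problems


def triage(records):
    """Split records into accepted ones and an exception queue for review."""
    accepted, exceptions = [], []
    for rec in records:
        problems = cross_check(rec)
        if problems:
            exceptions.append((rec, problems))
        else:
            accepted.append(rec)
    return accepted, exceptions
```

Records in the exception queue carry the reasons they failed, so whoever corrects them knows what to look at.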

12
Accuracy in Data Integrity
How Precise Are Your Data?

Bulls-Eye Every Time?

13
Accuracy in Data Integrity

Data Precision
• Also called reproducibility or repeatability, precision is the degree to which further reporting of the data shows the same or similar results.
• Example: a person’s gender is correctly and consistently reported as male every time rather than female or male from time to time
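One way to check repeatability, sketched under the assumption that each report is a dictionary with a hypothetical "person_id" key: collect every value reported for a field and flag the people whose value changes from report to report.

```python
from collections import defaultdict


def inconsistent_values(reports, field):
    """Map each person id to the set of values reported, where they differ."""
    seen = defaultdict(set)
    for report in reports:
        seen[report["person_id"]].add(report[field])
    # Keep only the people with more than one distinct reported value.
    return {pid: vals for pid, vals in seen.items() if len(vals) > 1}


reports = [
    {"person_id": 1, "gender": "M"},
    {"person_id": 1, "gender": "M"},
    {"person_id": 2, "gender": "M"},
    {"person_id": 2, "gender": "F"},   # changes between reports
]
flagged = inconsistent_values(reports, "gender")
```

Note that this measures consistency only. A value can be reported identically every time and still be wrong, which is the accuracy question from the earlier slides.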

14
Consistency in Data Integrity
Data Consistency is achieved through:
• Standardization
• Integration
• Automation
• Replication
• Synchronization

“Consistency is contrary to nature, contrary to life. The only completely consistent people are dead.” Aldous Huxley
15
Consistency in Data Integrity

We Need Some Standards!

16
Consistency in Data Integrity

Whose standards?

17
Consistency in Data Integrity

Standardization
• A process of achieving agreement on common data definitions, representation, and structures to which all data layers must conform
• Without standardization:
  - Data exchange and interoperability are problematic and costly
  - Data cannot be aligned with the enterprise architecture
  - Data quality and consistency are compromised
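As a small illustration of a common representation, the sketch below maps source-specific gender values onto one agreed code set. The mapping table is entirely hypothetical; a real one would come out of the agreement process described above.

```python
# Hypothetical mapping from values seen in source systems to standard codes.
STANDARD_GENDER = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def standardize_gender(value):
    """Map a source-specific gender value to the agreed standard code."""
    key = str(value).strip().lower()
    if key not in STANDARD_GENDER:
        # Refuse to guess: unmapped values need a human decision.
        raise ValueError(f"no standard mapping for {value!r}")
    return STANDARD_GENDER[key]
```

Raising on unmapped values, rather than passing them through, keeps non-standard data from leaking into downstream layers unnoticed.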
18
Consistency in Data Integrity

19
Consistency in Data Integrity

Integration
• A process of combining data from different sources and providing the user with a unified view of these data.
• ETL is an integration process used in data warehousing to extract data from outside sources, transform these data to fit business needs, and load these data into the warehouse.
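The extract, transform, and load steps can be sketched with Python's standard csv and sqlite3 modules. The CSV content, the column names, and the "admissions" table are made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Stand-in for an outside source (hypothetical data).
SOURCE_CSV = "id,name,admitted\n1, jane doe ,2008-11-14\n2,JOHN ROE,2008-11-15\n"


def extract(text):
    """Read rows from the source as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Fit the warehouse conventions: integer ids, title-cased names."""
    return [(int(r["id"]), r["name"].strip().title(), r["admitted"]) for r in rows]


def load(rows, conn):
    """Insert the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS admissions "
        "(id INTEGER PRIMARY KEY, name TEXT, admitted TEXT)"
    )
    conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT id, name, admitted FROM admissions ORDER BY id"
).fetchall()
```

Keeping the three steps as separate functions makes it possible to add more sources later by writing a new extract step and reusing the rest.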

20
Consistency in Data Integrity

Automation: Any Human in Charge?

21
Consistency in Data Integrity
Automation
• A process that uses a computerized control system to reduce or minimize the need for human intervention
• Its goal is twofold:
  - To avoid mistakes in data entry by making the initial entering of the data as automatic as possible. Different situations require different automation methods and equipment.
  - To avoid having to re-enter data to perform a different task with it.

22
Consistency in Data Integrity
How Automated Is Your Data System?

23
Consistency in Data Integrity
Replication and Synchronization

24
Consistency in Data Integrity

Replication or Mirroring
• A process used to generate and manage multiple copies of data at one or more sites, allowing employees to stay connected to essential business information and applications
• Data replication also provides a backup system in case of a catastrophic failure
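Copies are only useful as a backup if they actually match the original. A rough sketch of one way to check this, assuming each site can export its rows as Python tuples: compute an order-independent SHA-256 fingerprint per site and compare. This is an illustrative technique, not something the workshop prescribes.

```python
import hashlib


def fingerprint(rows):
    """Return a SHA-256 digest of the rows, independent of row order."""
    digest = hashlib.sha256()
    # Sort the textual form of each row so ordering differences are ignored.
    for row in sorted(repr(r) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def in_sync(primary_rows, replica_rows):
    """True when both sites hold exactly the same set of rows."""
    return fingerprint(primary_rows) == fingerprint(replica_rows)
```

A mismatch says that the copies differ but not where, so a real check would follow up with a row-level comparison.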

25
Consistency in Data Integrity
Synchronization
• A process used to consolidate data being moved from system to system
• Bad data is never spread from system to system, so the information delivered across the enterprise is up-to-date, consistent, and accurate

26
Consistency in Data Integrity
Can You Synchronize and Replicate Your Data?

27
Reliability in Data Integrity

Reliability

28
Reliability in Data Integrity
• Reliability = Accuracy + Consistency + More…
• Data are also reliable when they are:
  - Complete: contain all the data elements needed for the intended purposes of use
  - Timely: accessible and available to users as needed, when needed
  - Valid: represent what is being measured
  - Secure: protected against malicious or unintentional alterations
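Completeness and timeliness lend themselves to simple measurements. The sketch below assumes a hypothetical list of required data elements and a due date per submission; both are placeholders for whatever the business rules actually specify.

```python
from datetime import date, timedelta

# Hypothetical required data elements for the intended use.
REQUIRED = ("id", "gender", "admission_date")


def completeness(records):
    """Fraction of records in which every required element is present."""
    if not records:
        return 0.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in REQUIRED)
        for rec in records
    )
    return complete / len(records)


def is_timely(submitted, due, grace_days=0):
    """True when a submission arrived no later than its due date plus grace."""
    return submitted <= due + timedelta(days=grace_days)
```

Validity and security are harder to reduce to a single number and are not covered by this sketch.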

29
Reliability in Data Integrity

30
Reliability in Data Integrity
How Good Are Your Data Sources?

Do you trust them?

31
Reliability in Data Integrity
Your Data Are As Good As Your Sources!

32
Reliability in Data Integrity

How Complete Are Your Data?


Reliability in Data Integrity
How Valid Are Your Data?

Are they measuring what they are supposed to measure?

34
Reliability in Data Integrity
How Useful Are Your Data?

35
Reliability in Data Integrity
How Timely Are Your Data?

36
Reliability in Data Integrity
Timeliness
• Making data available in the form needed, when needed, and where needed
• Timeliness of Data Collection and Submission
• Timeliness of Data Processing, Analysis and Reporting
37
Reliability in Data Integrity
How Accessible and Visible Are Your Data?

In Black Hole?

38
Reliability in Data Integrity
……Out of the Black Hole?

39
Data Integrity Drivers

Any Training Need?

40
Data Integrity Drivers

How Frequent is Your System User Training?

41
Data Integrity Drivers
Do You Collaborate with Your Stakeholders?

42
Data Integrity Drivers
How Strong is Your Collaboration?

43
Data Integrity Drivers
Do You Have A Data Integrity Workgroup?

44
Data Integrity Drivers
Does the Workgroup Meet Regularly?

45
Data Integrity Drivers

Business Rules
• Do you have consistent and coherent business rules for collection, submission, maintenance and use of data?
• What is the role of the Data Integrity Workgroup?

46
Data Integrity Drivers

How Secure Is Your Data?

Data Meltdown

47
Data Integrity Drivers

Some Safeguards:
• Data should be physically, technically, and logically secured!
• Should have policies and procedures to ensure that sensitive data access is on a “need to know” basis
• Should have user authentication to provide assurance as to who is accessing what, when and how
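A minimal sketch of a "need to know" check with an audit trail recording who accessed what, when, and how. The roles, fields, and policy table are hypothetical, and the sketch assumes the user has already been authenticated by some other mechanism.

```python
from datetime import datetime, timezone

# Hypothetical need-to-know policy: role -> fields that role may read.
POLICY = {
    "clinician": {"name", "diagnosis"},
    "billing": {"name", "invoice_total"},
}
AUDIT_LOG = []


def read_field(user, role, record, field):
    """Return record[field] if the role needs to know it; log every attempt."""
    allowed = field in POLICY.get(role, set())
    AUDIT_LOG.append({
        "who": user,
        "what": field,
        "how": "read",
        "when": datetime.now(timezone.utc).isoformat(),
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{role} may not read {field}")
    return record[field]
```

Denied attempts are logged as well as successful ones, since refused requests are often the more interesting entries in an audit.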

48
Data Integrity Drivers

• Data integrity is compromised when data has been maliciously modified, altered, or destroyed!

49
Data Integrity Drivers

Malicious Altering
• Internal threat from users: conscious and intentional attack
• External threats: viruses, worms, and hackers from the Internet
• Theft and security breach
• Fraud

50
Data Integrity Drivers

Data Integrity is Compromised through Inadvertent/Accidental Altering

51
Data Integrity Drivers

Inadvertent/Accidental Altering
• Well-meaning users compromising information through inadvertent or ill-advised actions
• Hardware malfunction
  - Disk crashes, susceptible cables, etc.
• Environmental hazards or human error
  - Heat, dust, electrical surges
  - Improper network administration
  - Improper authorization levels

52
Data Integrity Drivers

Does your System Have a Firewall?

53
Data Integrity Drivers

Firewall:
• Prevent unauthorized electronic access to your networked computer system
• Permit, deny, encrypt, decrypt, or proxy all computer traffic between different security domains
• Prevent hackers from accessing a computer and also keep information from being sent out from your computer without your knowledge.
• Don’t prevent virus attacks but, in some circumstances, they can stop viruses from sending information from an infected computer
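The permit/deny part of a firewall can be illustrated as an ordered rule list where the first match wins and anything unmatched is refused. This is a toy model using Python's standard ipaddress module; the rules and address ranges are invented for the example, and real firewalls match on far more than source address and destination port.

```python
import ipaddress

# Hypothetical rule set, evaluated top to bottom; first match wins.
# A port of None means the rule applies to every port.
RULES = [
    ("deny", ipaddress.ip_network("203.0.113.0/24"), None),  # blocked range
    ("permit", ipaddress.ip_network("10.0.0.0/8"), 443),     # internal HTTPS
    ("permit", ipaddress.ip_network("10.0.0.0/8"), 22),      # internal SSH
]


def decide(source_ip, dest_port):
    """Return 'permit' or 'deny' for one connection attempt."""
    addr = ipaddress.ip_address(source_ip)
    for action, network, port in RULES:
        if addr in network and port in (None, dest_port):
            return action
    return "deny"   # default-deny: anything unmatched is refused
```

The default-deny fallback is a design choice: traffic nobody thought about is blocked rather than let through.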
54
Data Integrity Drivers

How Adequate Is Your Data Infrastructure?

55
Data Infrastructure
Data Collection and Submission

56
Data Infrastructure

57
Data Infrastructure

58
Got Data Integrity?

Data Integrity is collecting, processing, maintaining and using information reliably, accurately, and consistently, even if nobody is watching you!

59
