Data cleaning: hints and tips
Felicity Clemens
Stata Users’ Group meeting
London, 17 & 18th May 2005

Felicity Clemens 18 May 2005


Introduction

• Data cleaning: one of the most time-consuming jobs of all!
• There are many ways of attacking the same problem when using Stata
• The talk will describe some common problems and propose possible solutions
• These are mostly reminders!

Contents

1) Introduction to the first datasets
2) Identifying and removing duplicates by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable

The study

• A case-control study carried out across 3 central European countries
• Exposure of interest: exposure to chemicals in the environment
• Outcome of interest: cancer


Identifying duplicates in a dataset

• This can be done automatically (using the duplicates set of commands)
• We will demonstrate a manual method of identifying duplicates
• Two different possibilities:
  - The same data have been entered on more than one occasion;
  - Different data have been entered using the same identifier (id number)
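
The two possibilities above can be told apart by counting records per id and per full set of values. The following is only a sketch on made-up data, not the method shown in the talk: the variable names (id, age, sex, n_id, n_all, same_entry, id_clash) and the values are assumptions for illustration.

```stata
* Hypothetical toy data: id 2 is entered twice with identical values,
* id 3 is used for two different records.
clear
input id age str1 sex
1 34 "F"
2 51 "M"
2 51 "M"
3 47 "F"
3 62 "M"
4 29 "F"
end

* Automatic route: the duplicates set of commands
duplicates report id
duplicates tag id, generate(dup_id)

* Manual route: number of records sharing each id ...
bysort id: generate n_id = _N

* ... and number of records sharing every value
bysort id age sex: generate n_all = _N

* Same data entered more than once: all records for the id agree
generate byte same_entry = n_id > 1 & n_all == n_id

* Different data under the same id: the id repeats but contents differ
generate byte id_clash = n_id > 1 & n_all < n_id

list, sepby(id)

* Keep one copy of each fully identical record
bysort id age sex: keep if _n == 1
```

Records flagged by id_clash are left in place here, since deciding which one holds the right data usually means going back to the source forms.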
The merge command

A necessary command in data management of most big studies.
There are many different uses of the merge command. We look at two of them:
• Simple merge on id
• Multiple merge on id
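
As a rough illustration of these two uses, the sketch below merges small made-up files on id. It uses the merge 1:1 and merge m:1 syntax available in Stata 11 and later, which postdates this talk. Reading "multiple merge" as several records per id in one file matched to a single record in another is an assumption, and the dataset and variable names (subjects, outcomes, residences, case, age, town, year_from) are hypothetical.

```stata
* Hypothetical toy files, saved as temporary datasets
clear
input id byte case
1 1
2 0
3 1
end
tempfile outcomes
save `outcomes'

clear
input id age
1 64
2 58
4 71
end
tempfile subjects
save `subjects'

clear
input id town year_from
1 101 1982
1 102 1995
2 101 1982
3 103 1982
3 101 1990
3 102 1999
end
tempfile residences
save `residences'

* Simple merge on id: one record per id in each file
use `subjects', clear
merge 1:1 id using `outcomes'
* Inspect unmatched ids before relying on the merged data
tabulate _merge
drop _merge

* Merge with several records per id: each subject's residence
* records pick up that subject's single outcome record
use `residences', clear
merge m:1 id using `outcomes'
tabulate _merge
```

Tabulating _merge after every merge is a cheap check: ids that appear in only one of the two files are often the first sign of a data entry problem.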


Identifying a moving target

• Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
• Problem: we need to identify the year, counting backwards from 2002, in which the chemical changed from its 2002 level
• Why? We need to overwrite the 2002 value with a new value, and overwrite backwards until the value changed

Identifying a moving target (2)

rescode   y1990   y1991   y1992
1010113      65      32      32
1010114      41      41      41
1010115      78      23      23
1010116      44      44      44
1010117      82      82      29
1010118      25      25      25
1010119      12      12       6
1010120      40      12       7


Identifying a moving target (3)

We will use a forvalues (forval) loop to compare each year's observed value with the observed value for the previous year.
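
One way such a loop could look is sketched below, using the three-year excerpt (y1990 to y1992) from the earlier slide rather than the full 1982 to 2002 range. This is an assumed reconstruction, not the code from the talk: changeyr and newval are hypothetical names, and the replacement value 99 is invented purely for illustration.

```stata
* Three-year excerpt from the slide above
clear
input long rescode y1990 y1991 y1992
1010113 65 32 32
1010114 41 41 41
1010115 78 23 23
1010116 44 44 44
1010117 82 82 29
1010118 25 25 25
1010119 12 12 6
1010120 40 12 7
end

* changeyr: counting backwards from the last year, the first year whose
* value differs from the previous year's value, i.e. the year in which
* the current level began. It stays missing if the level never changed.
generate changeyr = .
forvalues yr = 1992(-1)1991 {
    local prev = `yr' - 1
    replace changeyr = `yr' if missing(changeyr) & y`yr' != y`prev'
}

* Hypothetical replacement value, for illustration only
generate newval = 99

* Overwrite the current level back to the year in which it began;
* if it never changed, overwrite the whole window
forvalues yr = 1990/1992 {
    replace y`yr' = newval if `yr' >= changeyr | missing(changeyr)
}

list, noobs
```

For the full dataset the same loop would presumably run over 2002(-1)1983, with the overwrite loop running over 1982/2002.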


Summary

• Identifying duplicates: can be done by hand or automatically using the "duplicates" set of commands
• Use of the merge command: to merge on a specific variable, or to multiply merge datasets
• Generating a moving target variable: the use of the "forval" loop