
This document summarizes Felicity Clemens' presentation on data cleaning techniques in Stata. It discusses identifying and removing duplicate records manually or automatically, merging datasets using Stata's merge command, and generating a moving target variable to identify the year a chemical concentration changed from its 2002 level by using a forval loop to examine relationships between years. The presentation provided hints and tips for common data cleaning problems in Stata.

Copyright:

Attribution Non-Commercial (BY-NC)


Data cleaning: hints and tips
Felicity Clemens
Stata Users’ Group meeting
London, 17 & 18th May 2005

Felicity Clemens 18 May 2005

Introduction

• Data cleaning is one of the most time-consuming jobs of all!
• There are many ways of attacking the same problem when using Stata
• The talk will describe some common problems and propose possible solutions
• These are mostly reminders!

Contents

1) Introduction to the first datasets
2) Identifying and removing duplicates – by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable

The study

• A case-control study carried out across three central European countries
• Exposure of interest: exposure to chemicals in the environment
• Outcome of interest: cancer

Identifying duplicates in a dataset

• This can be done automatically (using the duplicates set of commands)
• We will demonstrate a manual method of identifying duplicates
• Two different possibilities:
  • The same data have been entered on more than one occasion;
  • Different data have been entered using the same identifier (id number)

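The two possibilities above can be told apart by counting records per identifier and per full record. What follows is only a minimal sketch, not the method shown in the talk: the variable names (id, exposure, outcome) and the five example records are hypothetical, made up for illustration.

```stata
* Hypothetical example data: id 2 is entered twice with identical values,
* id 3 is entered twice with different values
clear
input id exposure outcome
1 10 0
2 12 1
2 12 1
3 15 0
3 18 1
end

* Manual method: number of records sharing each id
bysort id: generate n_id = _N

* Number of records that are identical on every variable
bysort id exposure outcome: generate n_all = _N

* Possibility 1: the same data entered more than once
list id exposure outcome if n_all > 1

* Possibility 2: different data entered under the same id
list id exposure outcome if n_id > 1 & n_all < n_id

* Automatic alternative: the duplicates set of commands
duplicates report id
duplicates list id
```

Only exact copies are safe to drop without checking the source forms; `duplicates drop id exposure outcome, force` would remove them, whereas records from possibility 2 need to be resolved by hand.
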
The merge command

A necessary command in the data management of most big studies.
There are many different uses of the merge command. We look at two of them:
• Simple merge on id
• Multiple merge on id
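
The two uses might look roughly as follows. This is a sketch under several assumptions: the datasets and variable names (age, exposure, visit, conc) are hypothetical, "multiple merge" is read here as merging many records per id onto one record per id, which may not be what the talk meant, and the `merge 1:1` / `merge m:1` syntax requires Stata 11 or later. At the time of the talk (2005) the syntax was `merge id using filename` on data sorted by id.

```stata
* Hypothetical subject-level data: one record per id
clear
input id age
1 54
2 61
3 47
end
tempfile subjects
save `subjects'

* Hypothetical exposure data: also one record per id
clear
input id exposure
1 10
2 12
4 9
end

* Simple merge on id: one record per id in both datasets
merge 1:1 id using `subjects'
tabulate _merge

* Hypothetical repeated measurements: several records per id
clear
input id visit conc
1 1 32
1 2 35
2 1 41
end

* Many-to-one merge on id: each measurement picks up the subject's data
merge m:1 id using `subjects'
tabulate _merge
```

Tabulating `_merge` after each merge is a useful cleaning check in itself: values 1 and 2 flag ids that appear in only one of the two datasets.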

Identifying a moving target

• Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
• Problem: counting backwards from 2002, we need to identify the year in which the chemical concentration changed from its 2002 level
• Why? We need to overwrite the 2002 value with a new value, and keep overwriting backwards until we reach the year in which the value changed

Identifying a moving target (2)

rescode   y1990  y1991  y1992
1010113      65     32     32
1010114      41     41     41
1010115      78     23     23
1010116      44     44     44
1010117      82     82     29
1010118      25     25     25
1010119      12     12      6
1010120      40     12      7

Identifying a moving target (3)

We will use the forval loop to examine the relationship between each year's observed value and the observed value for the previous year.
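
One way such a loop could be written is sketched below, using the three years shown in the example table rather than the full 1982 to 2002 range. The variable name changeyr and the exact loop logic are assumptions for illustration, not the code from the talk: the loop walks backwards from the second-to-last year, compares each year with the year after it, and records the first year at which the value differs.

```stata
* Example data from the table (rescode identifies the town)
clear
input long rescode y1990 y1991 y1992
1010113 65 32 32
1010114 41 41 41
1010115 78 23 23
1010116 44 44 44
1010117 82 82 29
1010118 25 25 25
1010119 12 12 6
1010120 40 12 7
end

* changeyr (hypothetical name): most recent year whose value differs
* from the year after it; stays missing if the value never changed
generate changeyr = .

* Walk backwards; for the full data this would be 2001(-1)1982
forvalues yr = 1991(-1)1990 {
    local next = `yr' + 1
    replace changeyr = `yr' if missing(changeyr) & y`yr' != y`next'
}

list rescode y1990 y1991 y1992 changeyr
```

For town 1010113 this gives changeyr = 1990 (the 1992 level of 32 holds back to 1991), for town 1010117 it gives 1991, and towns whose value never changes keep a missing changeyr. The years after changeyr are then the ones to overwrite along with the final year.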

Summary

• Identifying duplicates: can be done by hand or automatically using the "duplicates" set of commands
• Use of the merge command: to merge on a specific variable, and to merge multiple datasets
• Generating a moving target variable: the use of the "forval" loop
