2 views

Uploaded by Bama Raja Segaran

Ch2-DataIssues

- [Edu.joshuatly.com] Klang Trial STPM 2010 Maths T [D6180F2E]
- OPS SP006 Statistical Techniques
- 2010+12MAT3CD+Sem2+Wesley+CF
- Quiz Feedback.pdf
- Estadística - Ritchey Ch04
- Ch1 Tendency
- Microbial Distribution 2010.pdf
- Statistic Test
- Sampling Distributions
- description: tags: ak3
- GridDataReport-Data Surfer Geofisika
- 01_Report_-_INDOOR_WALK_TEST_B.pdf
- control_chart11
- 9709_s07_qp_7
- Psyc 60 Central Tendency and Variability_2.ppt
- Test 1 (for Printing)
- CacingA
- case problem 2
- cm05_univariateStatsVisuals.pdf
- Statistical Static-Timing Analysis

You are on page 1of 21

Attributes (describe objects)

Variable, field, characteristic, feature or

observation

Objects (have attributes)

Record, point, case, sample, entity or item

Data Set

Collection of objects

A1 A2 An C

Type of an Attributes

Depends on the following properties:

Distinctness: =,

Order: <, >, =

Addition: +, -

Multiplication: *, /

Types of Attributes

Attribute Type

Description

Examples

Operations

Nominal

Each value represents a

label. (Typical

comparisons between

two values are limited to

equal or not equal)

Flower color, gender,

sip code

Mode, entropy,

contingency correlation,

2

test

Ordinal

The values can be

ordered. (Typical

comparisons between

two values are equal

or greater or less

Hardness of minerals,

(good, better, best),

grades, street numbers,

rank, age

Median, percentiles,

rank correlation, run

tests, sign tests

Interval

The differences between

values are meaningful

i.e., a unit of

measurement exist. (+, -

)

Calendar dates,

temperature in Celsius

or Fahrenheit

Mean, standard

deviation, Pearsons

correlation, t anf F test

Ration

Differences and ratios

are meaningful. (*, /)

Monetary quantities,

count, age, mass, length,

electrical current

Geometric mean,

harmonic mean, percent

variation

Attribute level

Discrete and Continuous Attributes

Discrete attributes

A discrete attributes has

only a finite or countably

infinite set of values. Eg.

Zip codes, counts or the

set of words in a

document

Discrete attributes are

often represented as

integer variables

Binary attributes are a special

case of discrete attributes and

assume only two values

Eg. Yes/no, true/false,

male/female

Binary attributes are often

represented as Boolean

variables, or as integer

variables that take on the

values 0 or 1

Continuous attributes

A continuous attribute

has real number values.

Eg. Temperature, height

or weight (Practically real

values can only be

measured and

represented to finite

number of digits)

Continuous attributes are

typically represented as

floating point variables

Structured Data Sets

Common types

Record

Graph

Ordered

Three important characteristics

Dimensionality

Sparsity

resolution

Record Data

Most of the existing

data mining work is

focused around data

sets that consist of a

collection of records

(data objects), each

of which consists of

fixed set of data

fields (attributes)

Name

Gender

Height

Output

Kristina

F

1.6 m

Medium

Jim

M

2 m

Medium

Maggie

F

1.9 m

Tall

Martha

F

1.88 m

Tall

Stephanie

F

1.7 m

Medium

Bob

M

1.85 m

Medium

Kathy

F

1.6 m

Medium

Dave

M

1.7 m

Medium

Worth

M

2.2 m

Tall

Steven

M

2.1 m

Tall

Debbie

F

1.8 m

Medium

Todd

M

1.95 m

Medium

Kim

F

1.9 m

Tall

Amy

F

1.8 m

Medium

Lynette

F

1.75 m

Medium

Data Matrix

If all objects in data set have the same set of numeric

attributes, then each object represents a point

(vector) in multi-dimensional space.

Each attribute of the object corresponds to a

dimension

Projection of X

load

Projection of Y

load

Distance

Load

Thickness

10.23

5.27

15.22

2.7

1.2

12.65

6.25

16.22

2.2

1.1

Document Data

Each document becomes a term vector, where each

term is a component (attribute) of the vector, and

where the value of each component of the vector is

the number of times the corresponding term occurs in

the document

s

e

a

s

o

n

t

i

m

e

o

u

t

l

o

s

t

w

i

n

g

a

m

e

s

c

o

r

e

b

a

l

l

p

l

a

y

c

o

a

c

h

t

e

a

m

document 1

document 2

document 3

3 0 5 0 2 6 0 2 0 2

0 7 0 2 1 0 0 3 0 0

0 1 0 0 1 2 2 0 3 0

Transaction Data

Transaction data is a

special type of record

data, where each

record (transaction)

involves a set of items.

For example, consider a

grocery store. The set

of product purchased by

a customer during one

shopping trip constitute

a transaction, while the

individual products that

were purchased are

items

Items

Butter

cheese

b

d

Bread

Milk

coke

a

c

e

Database

1

4 a b c e

b c e

Transaction ID (tid)

2

3

Items bought

a b d e

a b d e

a b c d e 5

6 b c d

Graph Data

Data with relationships among objects

Likes OO modeling where node represents

object and the link representing relationship.

Eg. Linked web pages

Data with object that are graphs

Object contains another objects known as

subobjects, e.g., the structure of chemical

compounds, where the nodes are atoms and the

links between nodes are chemical bonds.

Ordered Data:

The attributes have relationships that involve order in

time or space.

Fig. 2.4 variations of ordered data

Sequential data

Sequence data

Time series data

Spatial data

Data Quality

The following are some well known

issues

Noise and outliers

Missing values

Duplicate data

Inconsistent values

Noise and Outliers

Typographical errors lead to incorrect values

Nominal attribute is misspelled

Extra possible value for that attribute

Different names for the same thing (eg: Pepsi

/ Pepsi Cola

Noise modification of original value

Random error

Non-random error

Noise can be

Temporal

Spatial

Contd

Signal processing can reduce (generally

not eliminate) noise

Outliers small number of points with

characteristics different from rest of the

data

Missing Values: Eliminate Data

Objects with Missing Values

Out-of-range entries

Negative number in a field that is normally

only positive

Zero in a field that can never normally be

zero

The reason:

Malfunctional measurement equipment

Changes in experimental design during data collection

Collection of several similar but not identical datasets

Respondents in a survey may refuse to answer certain

questions as age or income

Contd

A simple and effective strategy is to

eliminate those records which have

missing values. A related strategy is to

eliminate attribute that have missing

values

Drawback: you may end up removing a

large number of objects

Missing values:

Estimating them

Price of the IBM stock changes in a

reasonably smooth fashion. The missing

values can be estimated by interpolation

For a data set that has many similar data

points, a nearest neighbor approach can be

used to estimate the missing values. If the

attribute is continuous, then the average

attribute value of the nearest neighbors can

be used. While if the attribute is categorical,

then the most commonly occurring attribute

value can be taken

Missing values: Using the

missing value as another value

Many data mining approaches can be

modified to operate by ignoring missing

values.

E.g. Clustering similarity between pairs of

data objects need to be calculated. If one or

both objects of a pair have missing values for

some attributes, then the similarity can be

calculated by using only other attributes.

Inconsistent/ Duplicate Data

Most machine learning tools will produce

different results if data files are duplicated

repetition influence on the result.

Errors made at data entry spelling errors

Can corrected manually

Data integration

Different names of an attribute in different

databases

Also having duplication

- [Edu.joshuatly.com] Klang Trial STPM 2010 Maths T [D6180F2E]Uploaded byHaRry Chg
- OPS SP006 Statistical TechniquesUploaded bypuput utomo
- 2010+12MAT3CD+Sem2+Wesley+CFUploaded byFred
- Quiz Feedback.pdfUploaded byJonny Michal
- Estadística - Ritchey Ch04Uploaded byVictoria Liendo
- Ch1 TendencyUploaded byelune121
- Microbial Distribution 2010.pdfUploaded byGustavo Sotomayor
- Statistic TestUploaded byRebecca Gildener
- Sampling DistributionsUploaded byAhmed Kadem Arab
- description: tags: ak3Uploaded byanon-673794
- GridDataReport-Data Surfer GeofisikaUploaded byWidi Geofani
- 01_Report_-_INDOOR_WALK_TEST_B.pdfUploaded byKevin
- control_chart11Uploaded byAshok Setty
- 9709_s07_qp_7Uploaded byapi-3765617
- Psyc 60 Central Tendency and Variability_2.pptUploaded byYosef Imanuel Yulius Opi
- Test 1 (for Printing)Uploaded byChrysler Duaso
- CacingAUploaded byJasa Ketik Malang
- case problem 2Uploaded byapi-312107195
- cm05_univariateStatsVisuals.pdfUploaded byGraciela Marques
- Statistical Static-Timing AnalysisUploaded byNikhil Puri
- Ch 10 Review 2 AnswersUploaded bycamillesyp
- OutputUploaded byMutia Oktavia
- PRMO2016_answers.pdfUploaded byNimai Roy
- Consultant Fee Survey 2011Uploaded bywachutunai
- Quantitative Research DesignUploaded byAli Ahmed
- SPSS InstructionUploaded byVannizsa Ibañez
- Dunbar Some Aspects of Research Design and Their Implications in the Observational Study of BehaviourUploaded byujikol1
- Effect of Adversity Quotient, Motivation and Discipline on the Performance of Employees PT. PLN (Persero) West Sumatra Padang IndonesiaUploaded byInternational Journal of Innovative Science and Research Technology
- CHAPTER VUploaded byLorienel
- hw4Uploaded bynilyu

- Vadrevu Info Extrac WebUploaded byBama Raja Segaran
- Lecture 3Uploaded byਰਜਨੀ ਖਜੂਰਿਆ
- c03 SyntaxUploaded byBama Raja Segaran
- c03 SyntaxUploaded byBama Raja Segaran
- ForsythUploaded byBama Raja Segaran
- Kapoor 03 ExperimentalUploaded byBama Raja Segaran
- Camba Zog LuUploaded byBama Raja Segaran
- Lecture 8Uploaded byBama Raja Segaran
- Di Mascio Bozen 22Uploaded byBama Raja Segaran
- MMI 05 SlidesUploaded byBama Raja Segaran
- Ch5 Names TypeUploaded byBama Raja Segaran
- MM2004_HoiUploaded byBama Raja Segaran
- Types of PL - SajilalUploaded byBama Raja Segaran
- Payment SlipUploaded byBama Raja Segaran
- Cluster Search ResultUploaded byBama Raja Segaran
- TTMEI0809Uploaded byBama Raja Segaran
- Ch5 SequentialUploaded byBama Raja Segaran
- Ch5 SequentialUploaded byBama Raja Segaran
- Ch7 ClusteringUploaded byBama Raja Segaran
- Lecture 3Uploaded byਰਜਨੀ ਖਜੂਰਿਆ
- Chapter 2 SQ FactorsUploaded byBama Raja Segaran
- masterC-semOfferedNEWUploaded byBama Raja Segaran
- Chapter 9 SQA PlanningUploaded byBama Raja Segaran
- Ch0 ContentUploaded byBama Raja Segaran
- Ch0 ContentUploaded byBama Raja Segaran
- Senat539_Masters Programme _without Thesis_ Senate 15.07Uploaded byBama Raja Segaran
- Ch1 IntroductionUploaded byBama Raja Segaran
- 105 Weeks EnUploaded byBama Raja Segaran
- TTNOV0708-2(1)Uploaded byBama Raja Segaran

- Reading a CaseUploaded bylokesh05
- Tor for Kaps Final Draft (2)Uploaded bymarting69
- Role and Practices of Marketing in SmesUploaded byIzabela Mugosa
- 29th REVES Meeting Santiago 2017Uploaded byFelipe Chutzinski
- Dutch Test for Handling ConflictUploaded bybob
- Formulating Problem StatementsUploaded bySohail Sham
- Records of AchievmentUploaded byBob Andrepont
- Senior Business Solutions Project Manager in Los Angeles CA Resume Diane JesterUploaded byDianeJester
- Dialogues on JusticeUploaded byalexbetancourt
- Jadad ScoreUploaded byLira Mirandus Manik
- Blackboard CLTUploaded byYiLong Xu
- Oral appliances.pdfUploaded bycristianamihaila
- SO IndermediateUploaded byfatih
- Airline Delay ModelUploaded byvishalmorde
- Dacon BrochureUploaded byMohammed Ilyas Mohiuddin
- From Knowledge Sharing to Knowledge Creation a Blended Knowledge-managemenUploaded byAsalSanjabzadeh
- Operations Management (OPM530) C10 Project ManagementUploaded byazwan ayop
- Half a Century of Research on the Stropp EffectUploaded byGrace McQueen
- Lehman OOC4e Experiment CorrelationsUploaded byPreeti Gunthey Diwan
- Process Plant Engineer Activite ModelUploaded byCristian Romero
- A Review of Research on Project Based LearningUploaded byNadia Cenat Cenut
- Landscape ArchUploaded byEdita Vučić
- (Decision Engineering 1) Lixing Yeung, CKM Lee (Auth.), Hing Kai Chan, Fiona Lettice, Olatunde Amoo Durowoju (Eds.)-Decision-Making for SuUploaded byMarsha Aulia Hakim
- 2014072917023333Uploaded byEnrique Alejandro Ovando
- Some Dynamic Properties of Attitude Structures (1991)Uploaded byc_a_landini
- Chinese Cross ListingUploaded byGGUDBA
- Which Measure of Central Tendency to UseUploaded bySyah Mi
- Questioning the Value of the Research Selectivity Process in British University AccountingUploaded byvaniamar
- 1998_Adelaide Medical Uni Graduate English ProficiencyUploaded bykhoimaingoc
- Wp Residential Real Estate Appraisers Guide Accurately Valuate Residential Rooftop Solar Electric PvUploaded byjital1994