Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Data cleaning
Data integration
Data reduction
  Dimensionality reduction
  Numerosity reduction
  Data compression
Normalization
Forms of Data Preprocessing
Data Cleaning
Inconsistent data, e.g., Age=42, Birthday=03/07/2010
Noisy data, e.g., from equipment malfunction
Noisy Data
Binning
  first sort the data and partition it into (equal-frequency) bins
  then smooth by bin means, bin medians, bin boundaries, etc.
Regression
  smooth by fitting the data to regression functions
Clustering
  detect and remove outliers
Combined computer and human inspection
  detect suspicious values and have a human check them (e.g., deal with possible outliers)
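The binning steps above (sort, partition into equal-frequency bins, then smooth) fit in a few lines. A minimal sketch; the price values are illustrative.

```python
# Smoothing noisy data by binning: sort, partition into equal-frequency
# bins, then replace each value by its bin mean or nearest bin boundary.

def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (near-)equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # illustrative data
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))         # bin means: 9.0, 22.0, 29.0
print(smooth_by_boundaries(bins))    # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```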
Data Integration
Data integration: combines data from multiple sources into a coherent store
χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected
Chi-Square Calculation: An Example

               Male       Female       Sum (row)
  (row 1)      250 (90)   200 (360)    450
  (row 2)      50 (210)   1000 (840)   1050
  Sum (col.)   300        1200         1500

The numbers in parentheses are the expected counts, computed from the row and column totals (e.g., 450 × 300 / 1500 = 90).

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93
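The chi-square statistic for the 2×2 contingency table can be recomputed directly from the observed counts; a minimal sketch of the formula above.

```python
# Chi-square for a 2x2 table: expected counts come from the row and
# column totals, then sum (Observed - Expected)^2 / Expected.

observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]         # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]   # [300, 1200]
total = sum(row_sums)                             # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_sums[i] * col_sums[j] / total   # e.g. 450*300/1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(f"{chi2:.1f}")   # ~507.9, matching the slide's 507.93
```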
Correlation coefficient:

  r_{A,B} = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
Some pairs of random variables may have a covariance of 0 but still not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Covariance: An Example
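A covariance example reduces to a short calculation. A minimal sketch with illustrative stock-price data (the values of A and B are assumptions), using the population form Cov(A, B) = Σ aᵢbᵢ/n − Ā B̄:

```python
# Covariance and correlation for two small samples A and B.
from statistics import mean, pstdev

A = [2, 3, 5, 4, 6]      # assumed illustrative values
B = [5, 8, 10, 11, 14]
n = len(A)

mA, mB = mean(A), mean(B)                        # 4 and 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mA * mB
corr = cov / (pstdev(A) * pstdev(B))             # normalize by std devs

print(round(cov, 6))     # 4.0 -> positive: A and B tend to rise together
print(round(corr, 3))
```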
Curse of dimensionality
Dimensionality reduction
  Wavelet transforms (related to the Fourier transform, but localized in both time and frequency)
Wavelet Transformation
  Wavelet families: Haar-2, Daubechies-4
  Method: wavelet decomposition
[Figure: hierarchical decomposition structure (a.k.a. error tree) for the Haar wavelet transform of S = [2, 2, 0, 2, 3, 5, 4, 4]: overall average 2.75; detail coefficients -1.25, 0.5, 0, 0, -1, -1, 0]
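The Haar decomposition behind the error tree can be reproduced in a few lines: at each level, store pairwise averages and detail coefficients (half the pairwise differences). A minimal sketch:

```python
# Full Haar wavelet decomposition of a signal whose length is a power
# of two: repeatedly replace the signal by pairwise averages, keeping
# the detail coefficients (a - b) / 2 at every level.

def haar_decompose(signal):
    coeffs = []
    s = list(signal)
    while len(s) > 1:
        averages = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]
        details = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]
        coeffs = details + coeffs      # coarser details go in front
        s = averages
    return s + coeffs                  # [overall average, details...]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```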
The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.
[Figure: principal component vector e in the x1-x2 plane]
Normalize input data: each attribute falls within the same range
Since the components are sorted by decreasing variance, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
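The PCA steps above (center the data, build the covariance matrix, take its eigenvectors) can be sketched by hand for two dimensions, where the 2×2 symmetric eigenproblem has a closed form. The data points are illustrative assumptions.

```python
# Two-dimensional PCA without external libraries: center the data,
# form the 2x2 covariance matrix, and compute its top eigenpair in
# closed form; projecting onto that eigenvector reduces 2-D to 1-D.
from math import atan2, cos, sin, sqrt
from statistics import mean

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

mx = mean(p[0] for p in points)
my = mean(p[1] for p in points)
centered = [(x - mx, y - my) for x, y in points]
n = len(points)

# sample covariance matrix [[cxx, cxy], [cxy, cyy]]
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# closed-form top eigenvalue of a symmetric 2x2 matrix
d = sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
lam1 = (cxx + cyy) / 2 + d            # largest variance
theta = atan2(lam1 - cxx, cxy)        # angle of its eigenvector (cxy, lam1-cxx)
e1 = (cos(theta), sin(theta))         # unit-length strongest component

# project each centered point onto e1: the dimensionality reduction
projected = [x * e1[0] + y * e1[1] for x, y in centered]
print(e1, round(lam1, 4))
```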
Attribute subset selection: remove redundant attributes and irrelevant attributes
Attribute construction: domain-specific; mapping data to a new space (see: data reduction); data discretization
Linear regression
  Data are modeled to fit a straight line
  Often uses the least-squares method to fit the line
Multiple regression
  Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model
  Approximates discrete multidimensional probability distributions
Regression Analysis
[Figure: data points in the X1-Y1 plane with fitted line y = x + 1]
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
Linear regression: Y = w X + b
Multiple regression: Y = b0 + b1 X1 + b2 X2
Log-linear models: estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
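The coefficients w and b of the linear model Y = w X + b have a closed-form least-squares solution. A minimal sketch; the sample points are illustrative and lie exactly on the line y = x + 1 from the figure.

```python
# Least-squares fit of Y = w X + b:
#   w = sum((x - mx)(y - my)) / sum((x - mx)^2),  b = my - w * mx
from statistics import mean

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 3.0, 4.0, 5.0, 6.0]     # exactly y = x + 1

mx, my = mean(X), mean(Y)
w = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / \
    sum((x - mx) ** 2 for x in X)
b = my - w * mx

print(w, b)   # 1.0 1.0 -> recovers the line Y = X + 1
```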
Histogram Analysis
Partitioning rules:
  Equal-width: equal bucket range
  Equal-frequency (or equal-depth): each bucket holds roughly the same number of samples
[Figure: equal-width histogram over values from 10,000 to 100,000]
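Equal-width and equal-frequency partitioning can be sketched side by side; the price data below is illustrative.

```python
# Two histogram partitioning rules: equal-width (same value range per
# bucket) vs. equal-frequency (same number of values per bucket).

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

def equal_width(values, n):
    """Each bucket spans the same value range of width (max-min)/n."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n
    buckets = [[] for _ in range(n)]
    for v in values:
        i = min(int((v - lo) / w), n - 1)   # clamp the maximum value
        buckets[i].append(v)
    return buckets

def equal_frequency(values, n):
    """Each bucket holds roughly the same number of sorted values."""
    s = sorted(values)
    size = len(s) // n
    return [s[i * size:(i + 1) * size] for i in range(n)]

print(equal_width(data, 3))       # outliers leave middle buckets sparse
print(equal_frequency(data, 3))   # always 4 values per bucket here
```

Note how the two outliers (204, 215) dominate one equal-width bucket, while equal-frequency buckets stay balanced.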
Clustering
Sampling
Types of Sampling
SRSWOR (simple random sample without replacement)
SRSWR (simple random sample with replacement)
[Figure: raw data drawn via SRSWOR and SRSWR]
Cluster/Stratified Sample
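The sampling schemes above can be sketched with the standard-library random module; the population and the two strata below are illustrative assumptions.

```python
# SRSWOR, SRSWR, and a proportional stratified sample over a toy
# population of 100 items split into two assumed strata.
import random

population = list(range(100))
random.seed(42)                    # reproducible sketch

# SRSWOR: simple random sample without replacement
srswor = random.sample(population, 10)

# SRSWR: simple random sample with replacement
srswr = [random.choice(population) for _ in range(10)]

# Stratified sample: draw from each stratum in proportion to its size
strata = {"low": list(range(60)), "high": list(range(60, 100))}
stratified = []
for group in strata.values():
    k = round(10 * len(group) / len(population))   # proportional share
    stratified += random.sample(group, k)

print(srswor, srswr, stratified)
```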
String compression
  There are extensive theories and well-tuned algorithms
  Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
  Typically lossy compression, with progressive refinement
  Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Time sequences (as opposed to audio)
  Typically short and varying slowly with time
Dimensionality and numerosity reduction may also be considered as forms of data compression
Data Compression
[Figure: lossless compression maps original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Data Transformation Methods
  Attribute/feature construction
  Normalization
    min-max normalization
    z-score normalization
Min-max normalization to [new_min_A, new_max_A]:

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Z-score normalization (μ_A: mean, σ_A: standard deviation of A):

  v' = (v − μ_A) / σ_A

  Ex. For μ_A = 54,000 and σ_A = 16,000, v = 73,600 maps to (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
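The normalization formulas can be sketched together on the income example; the min and max income values (12,000 and 98,000) are assumptions for the sketch, while the mean 54,000 and standard deviation 16,000 come from the z-score example above.

```python
# Three normalization methods applied to v = 73,600.

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Linearly map [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    """Center by the mean, scale by the standard deviation."""
    return (v - mu) / sigma

def decimal_scaling(v, j):
    """j = smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

v = 73_600
print(round(min_max(v, 12_000, 98_000), 3))   # 0.716
print(z_score(v, 54_000, 16_000))             # 1.225
print(decimal_scaling(v, 5))                  # 0.736
```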
Discretization
Binning
Histogram analysis
If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N.
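The equal-width rule W = (B − A)/N is easy to check with a tiny helper; the range 0 to 100 with N = 4 is illustrative.

```python
# Equal-width interval boundaries for an attribute ranging from A to B.

def interval_boundaries(A, B, N):
    """Return the N+1 boundaries of N intervals of width (B - A) / N."""
    W = (B - A) / N
    return [A + i * W for i in range(N + 1)]

print(interval_boundaries(0, 100, 4))   # [0.0, 25.0, 50.0, 75.0, 100.0]
```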
Class Labels (Binning vs. Clustering)
[Figure: the same data partitioned by binning vs. by clustering]
Merge is performed recursively until a predefined stopping condition is met
Concept hierarchy for a location dimension:
  country (15 distinct values)
  province_or_state
  city
  street
Summary