CS490/584 Data Mining

Homework 3 (based on Han, chapter 2, sections 2.1-2.3)


NAME ___Jacob Adams____________________________ Grade _________/70_______
The following are based on questions from Han's book, pages 98 and 99. Note that for questions involving
computation, you must show your work (formula, intermediate steps, etc.).
1. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in
increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40,
45, 46, 52, 70. (50 points)
(a) What is the mean of the data? What is the median? What is the standard deviation?
Mean = (13+15+16+16+19+20+20+21+22+22+25+25+25+25+
30+33+33+35+35+35+35+36+40+45+46+52+ 70) / 27 = 809/27 = 29.96
Median = middle value = 25
Standard deviation
Deviations from mean = {-16.96, -14.96, -13.96, -13.96, -10.96, -9.96, -9.96, -8.96, -7.96, -7.96, -4.96, -4.96, -4.96, -4.96, 0.04, 3.04, 3.04, 5.04, 5.04, 5.04, 5.04, 6.04,
10.04, 15.04, 16.04, 22.04, 40.04}
Squared deviations from mean = {287.6416, 223.8016, 194.8816, 194.8816, 120.1216,
99.2016, 99.2016, 80.2816, 63.3616, 63.3616, 24.6016, 24.6016, 24.6016, 24.6016,
0.0016, 9.2416, 9.2416, 25.4016, 25.4016, 25.4016, 25.4016, 36.4816, 100.8016,
226.2016, 257.2816, 485.7616, 1603.2016}
Sum of squared deviations = 287.6416 + 223.8016 + 194.8816 + 194.8816 + 120.1216
+ 99.2016 + 99.2016 + 80.2816 + 63.3616 + 63.3616 + 24.6016 + 24.6016 + 24.6016 + 24.6016
+ 0.0016 + 9.2416 + 9.2416 + 25.4016 + 25.4016 + 25.4016 + 25.4016 + 36.4816 + 100.8016
+ 226.2016 + 257.2816 + 485.7616 + 1603.2016 = 4354.9632
Mean of squared deviations (variance, dividing by n = 27) = 4354.9632 / 27 = 161.2949
Standard deviation = sqrt(161.2949) = 12.70
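The hand computation above can be double-checked with a short Python sketch. It assumes the population form of the standard deviation (dividing by n), which is what the steps above use; the sample form (dividing by n - 1) is shown only for comparison. Variable names are illustrative.

```python
import math
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)        # sum of the values divided by n
median = statistics.median(ages)    # middle value of the sorted list

# Population standard deviation: average the squared deviations over n
sum_sq = sum((x - mean) ** 2 for x in ages)
pop_std = math.sqrt(sum_sq / len(ages))

# Sample standard deviation divides by n - 1 instead (for comparison only)
sample_std = statistics.stdev(ages)

print(round(mean, 2), median, round(pop_std, 2), round(sample_std, 2))
```

Because the script uses the unrounded mean, its sum of squared deviations differs from the hand-computed 4354.9632 in the later decimal places, but the standard deviation agrees to two decimals.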
What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
Mode = most frequent number = {25, 35}
Modality = bimodal
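As a quick check of the mode, the following sketch counts how often each age occurs and keeps every value that ties for the highest count. The `labels` lookup is only an illustrative helper.

```python
from collections import Counter

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

counts = Counter(ages)
top = max(counts.values())                              # highest frequency
modes = sorted(v for v, c in counts.items() if c == top)

# Illustrative mapping from the number of modes to a modality name
labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
modality = labels.get(len(modes), "multimodal")

print(modes, top, modality)
```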
(b) What is the midrange of the data?
Midrange = (13+70)/2 = 41.5
(c) Can you find (roughly) the 1st quartile (Q1) and the third quartile (Q3) of the data?
Q1 = median of lower half = 20
Q3 = median of upper half = 35
(d) Give the five-number summary of the data.


min, Q1, median, Q3, max = 13, 20, 25, 35, 70
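The midrange, quartiles, and five-number summary can be reproduced with the sketch below. It assumes the "median of each half" convention used above, excluding the overall median from both halves when n is odd; other quartile conventions can give slightly different values.

```python
import statistics

ages = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
               30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

n = len(ages)
half = n // 2
lower = ages[:half]                                  # values before the median position
upper = ages[half + 1:] if n % 2 else ages[half:]    # values after the median position

q1 = statistics.median(lower)
q3 = statistics.median(upper)
five_number = (ages[0], q1, statistics.median(ages), q3, ages[-1])
midrange = (ages[0] + ages[-1]) / 2

print(five_number, midrange)
```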
(e) If you plot a boxplot of this data, what will be the box length (in actual number)? What min and max
values would the whiskers extend to? Is there any outlier (i.e., elements beyond the extreme low and
high)?
Length = IQR = Q3-Q1 = 35-20 = 15
Min whisker = MAX(Q1 - 1.5 * IQR, min) = MAX(20 - 1.5*15, 13) = MAX(-2.5, 13) = 13
Max whisker = MIN(Q3 + 1.5 * IQR, max) = MIN(35 + 1.5*15, 70) = MIN(57.5, 70) = 57.5 (the whisker itself is drawn to the largest value within this limit, 52)
Outliers = anything higher than 57.5 or lower than -2.5 = {70}
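The fence arithmetic can be checked with a small sketch. It assumes the usual 1.5 * IQR rule and takes Q1 = 20 and Q3 = 35 from part (c); each whisker is drawn to the most extreme data value that still lies inside its fence.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

q1, q3 = 20, 35                  # quartiles from part (c)
iqr = q3 - q1                    # box length
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is flagged as an outlier
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

print(iqr, lower_fence, upper_fence, whisker_low, whisker_high, outliers)
```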
(g) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
Step 1: Partition the sorted data into bins of depth 3 (three values per bin)
Bin 1: {13, 15, 16}
Bin 2: {16, 19, 20}
Bin 3: {20, 21, 22}
Bin 4: {22, 25, 25}
Bin 5: {25, 25, 30}
Bin 6: {33, 33, 35}
Bin 7: {35, 35, 35}
Bin 8: {36, 40, 45}
Bin 9: {46, 52, 70}

Step 2: Replace each member of a bin with the mean of that bin
Bin 1: {14.7, 14.7, 14.7}
Bin 2: {18.3, 18.3, 18.3}
Bin 3: {21, 21, 21}
Bin 4: {24, 24, 24}
Bin 5: {26.7, 26.7, 26.7}
Bin 6: {33.7, 33.7, 33.7}
Bin 7: {35, 35, 35}
Bin 8: {40.3, 40.3, 40.3}
Bin 9: {56, 56, 56}
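The two steps above can be sketched in Python as follows. The function name `smooth_by_bin_means` is illustrative, and the means are rounded to one decimal place only to match the hand-written answer.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning: replace each value with the mean of its bin."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), depth):
        bin_values = data[start:start + depth]       # Step 1: next bin of `depth` values
        bin_mean = sum(bin_values) / len(bin_values)
        # Step 2: every member of the bin takes the bin mean
        smoothed.extend([round(bin_mean, 1)] * len(bin_values))
    return smoothed


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed = smooth_by_bin_means(ages, depth=3)
print(smoothed)
```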
2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe
the various methods for handling this problem. (20 points)
Ignore the tuple - Exclude the tuple from processing. Although simple, this can
reduce the information available for mining. It makes the most sense when the
tuple is missing several values.

Fill in the value manually - This is one of the most accurate approaches. However, it is
very time-consuming and, as a result, doesn't scale well to large datasets.

Use a global constant to fill in the missing value - Fill in all missing values with the same
constant. It is a simple approach, but it biases the data and uses less information than
the other approaches.

Use the attribute mean to fill in the missing values - Fill missing values with the mean of
that attribute over the tuples where it is present. This is slightly more complex than using
a constant, but it should provide higher-quality data.

Use the attribute mean for all samples belonging to the same class as the given tuple - The
same as the previous method, except that the mean is calculated only over tuples with the
same classification as the tuple with the missing value. It is again more complex, but
it should yield higher-quality information.

Use the most probable value to fill in the missing value - Use techniques such as
Bayesian networks or decision tree induction to predict the missing value. This is a
preferred method since it uses the most information to make the decision. However, it is
also the most complicated to implement.
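
A minimal sketch of four of these strategies in plain Python. The six tuples, the field names `age` and `cls`, and the constant -1 are all made up for illustration. Manual filling and the most-probable-value method (which needs a trained model) are not shown.

```python
from statistics import mean

# Hypothetical tuples: None marks a missing age
rows = [
    {"age": 25, "cls": "A"},
    {"age": None, "cls": "A"},
    {"age": 35, "cls": "A"},
    {"age": 50, "cls": "B"},
    {"age": None, "cls": "B"},
    {"age": 70, "cls": "B"},
]


def drop_missing(rows):
    """Ignore the tuple: keep only rows whose age is present."""
    return [r for r in rows if r["age"] is not None]


def fill_constant(rows, constant=-1):
    """Global constant: every missing age gets the same placeholder."""
    return [{**r, "age": constant if r["age"] is None else r["age"]} for r in rows]


def fill_attribute_mean(rows):
    """Attribute mean: use the mean of all known ages."""
    overall = mean(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": overall if r["age"] is None else r["age"]} for r in rows]


def fill_class_mean(rows):
    """Class mean: use the mean of known ages within the same class."""
    by_class = {}
    for r in rows:
        if r["age"] is not None:
            by_class.setdefault(r["cls"], []).append(r["age"])
    means = {c: mean(v) for c, v in by_class.items()}
    return [{**r, "age": means[r["cls"]] if r["age"] is None else r["age"]} for r in rows]


print(fill_attribute_mean(rows)[1], fill_class_mean(rows)[1], fill_class_mean(rows)[4])
```

Each function returns a new list and leaves the original rows unchanged, so the strategies can be compared side by side on the same data.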

This assignment is DUE 10AM Friday (2/6/09). Please name your file HW3yourlastname.doc and
submit it on Moodle @ classes.cs.siue.edu
