
UKP 6053: ANALISIS DATA DAN PENAKSIRAN

DESCRIPTIVE STATISTICS
AND
PROBABILITY

Lecturer:
Prof Madya Dr Abd Wahab Bin Jusoh

Prepared by:
Mr Sazliman Ismail (M20082000084)
Mr Mahmud Ahmad (M20082000083)
Mr Tho Siew Wei (M20082000303)
Mrs Norazlilah Md. Nordin (M20082000300)
Miss Wong Yew Tuang (M20082000XXX)
EXERCISES (Measures of Central Tendency)

3.1) Explain how the value of the median is determined for a data set that contains an
odd number of observations and for a data set that contains an even number of
observations.

Solution:
To determine the value of the median for a data set that contains an odd number
of observations, first rank the data in increasing order, then find the position of the
middle term in a data set with n values as follows:
Position of the middle term = $\frac{n+1}{2}$
The value at this position is the median.
For example, the marks for a sample of five students of class 5R are as follows:
76 56 85 63 91
So to find the median we must do the following steps:
STEP 1: rank the data in increasing order
56 63 76 85 91
STEP 2: find the position of middle term
where n = 5, so the position of the middle term = $\frac{n+1}{2} = \frac{5+1}{2} = 3$
Therefore, the median mark for this sample is 76.
If the number of observations is even, we also rank the data in increasing
order, and the median is the average of the values of the two middle
terms. For example, we have the following data:
4 6 11 9 20 14
STEP 1: rank the data in increasing order
4 6 9 11 14 20
STEP 2: find the position of the middle term
where n = 6, so the position of the middle term = $\frac{n+1}{2} = \frac{6+1}{2} = 3.5$
Position 3.5 falls between the third and fourth terms, so the median is the
average of these two values: $\frac{9+11}{2} = 10$
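The two cases above can be sketched in Python. This is only a minimal illustration of the ranking and middle-term steps; the function name `median` and the variable names are our own and are not part of the original solution.

```python
def median(values):
    """Median via the ranking and middle-term rule described above."""
    ranked = sorted(values)            # STEP 1: rank the data in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                     # odd n: (n + 1) / 2 is a whole position
        return ranked[mid]
    # even n: (n + 1) / 2 ends in .5, so average the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2

odd_median = median([76, 56, 85, 63, 91])      # marks of the five students
even_median = median([4, 6, 11, 9, 20, 14])    # the six-value example
print(odd_median, even_median)
```

The standard library function `statistics.median` follows the same rule and could be used instead of the hand-written function.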
3.2) Briefly explain the meaning of an outlier. Is the mean or the median a better
measure of central tendency for a data set that contains an outlier? Illustrate with
the help of an example.

Solution:
Outliers, or extreme values, are a few values that are very small or very
large relative to the majority of the values in a data set. The median is the
better measure of central tendency for a data set that contains an outlier, because
the mean is heavily influenced by outliers whereas the median is not.
Example: The table below shows the monthly salaries of 7 staff at Rahman Industries.
Staff    Salary/month
1        1500
2        1000
3        800
4        15000
5        1200
6        1000
7        1600
Mean for sample with the outlier (15000):   $\bar{x} = \frac{22100}{7} = 3157.14$
Mean for sample without the outlier:        $\bar{x} = \frac{7100}{6} = 1183.33$
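The median is not influenced by the outlier because the median gives the center of a
histogram, with half of the data values to the left of the median and half to the
right of the median. Here the median is 1200 with the outlier and 1100 without it,
a much smaller change than the change in the mean.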
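As a quick check of the salary example, the mean and median can be computed with and without the outlier using Python's standard `statistics` module. This is a sketch for illustration; the variable names are our own.

```python
from statistics import mean, median

salaries = [1500, 1000, 800, 15000, 1200, 1000, 1600]
without_outlier = [s for s in salaries if s != 15000]   # drop the outlier

mean_with = mean(salaries)
mean_without = mean(without_outlier)
median_with = median(salaries)
median_without = median(without_outlier)

print(round(mean_with, 2), round(mean_without, 2))   # the mean shifts a lot
print(median_with, median_without)                   # the median barely moves
```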
3.4) Which of the three measures of central tendency (mean, median, mode) can be
calculated for quantitative data only, and which can be calculated for both
quantitative and qualitative data? Illustrate with example.

Solution:
Mode can be calculated for both kinds of data, qualitative and quantitative,
whereas the mean and median can be calculated for only quantitative data.
Example: The races of 10 students who are members of the Sc. and Math. Club are:
Malay  Chinese  Chinese  Indian  Chinese
Malay  Chinese  Indian  Chinese  Malay
Because Chinese occurs more frequently than the other categories, it is the mode
for this data set. We cannot calculate the mean and median for this data set.
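The mode of qualitative data can be found by counting categories. The following sketch uses `collections.Counter` from the Python standard library; the list simply restates the ten races from the example, and the variable names are our own.

```python
from collections import Counter

races = ["Malay", "Chinese", "Chinese", "Indian", "Chinese",
         "Malay", "Chinese", "Indian", "Chinese", "Malay"]

counts = Counter(races)                          # frequency of each category
mode_race, mode_count = counts.most_common(1)[0]
print(mode_race, mode_count)
```

No arithmetic is performed on the categories themselves, which is why the mode works here while the mean does not.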

3.18) The following data give the 1998 profits (million dollars) of the 10 airlines listed
in ‘Fortune magazine’s top 1000 U.S Corporations’. A number with a negative
sign represents a loss for that airline. The profits are respectively,
1314 821 1001 -286 538 383 433 -120 109 124
Compute the mean and median. Do these data have a mode? Explain.

Solution:
Mean = $\frac{1314 + 821 + 1001 + (-286) + 538 + 383 + 433 + (-120) + 109 + 124}{10} = \frac{4317}{10} = 431.7$ million dollars

To find the median, we first rank the data in increasing order:
-286 -120 109 124 383 433 538 821 1001 1314
Then we find the position of the middle term = $\frac{n+1}{2} = \frac{10+1}{2} = 5.5$
So the median is the average of the fifth and sixth terms: $\frac{383 + 433}{2} = 408$ million dollars

These data have no mode because each value occurs only once.
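The three results can be verified with a short Python sketch. The rule used for "no mode" (no value occurs more than once) is the one stated in the solution; the variable names are our own.

```python
from collections import Counter
from statistics import mean, median

profits = [1314, 821, 1001, -286, 538, 383, 433, -120, 109, 124]

profit_mean = mean(profits)
profit_median = median(profits)
# A mode exists only if some value occurs more than once.
has_mode = max(Counter(profits).values()) > 1

print(profit_mean, profit_median, has_mode)
```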
3.23) The following data give the 1998 revenues (million dollars) of the eight
automotive retailing and service companies.
1410 1604 1630 2298 2616 3045 5189 17487

(a) Calculate the mean and median for these data.
(b) Do these data contain an outlier? If so, drop the outlier and recalculate the
mean and median. Which of these two summary measures changes by a larger
amount when you drop the outlier?
(c) Which is the better summary measure for these data, the mean or the median?
Explain.

Solution:
a) Mean = $\frac{35279}{8} = 4409.88$ million dollars

Median = $\frac{2298 + 2616}{2} = 2457$ million dollars

b) Yes, these data contain an outlier, which is 17487 million dollars. After dropping it:

Mean = $\frac{17792}{7} = 2541.71$ million dollars
To find the median, rank the data in increasing order. These data are already
arranged in increasing order, so we can go on to find the position of the
middle term with n = 7:
Position of the middle term = $\frac{7+1}{2} = 4$, hence the median is 2298 million dollars.
The mean changes by a large amount when we drop the outlier (from 4409.88 to
2541.71), whereas the median changes only from 2457 to 2298.
c) The median is the better summary measure for these data because the mean
is strongly affected by the outlier, whereas the median is not.

EXERCISES (Measures of dispersion : Ungrouped data)

3.37) When is the value of the standard deviation for a data set zero? Give one example.
Calculate the standard deviation for the example and show that its value is zero.

Solution:
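The standard deviation of a data set is zero when all the values in the data set are
equal, so that every deviation from the mean is zero. For example, if the marks
of 6 students are 70, 70, 70, 70, 70 and 70, then $\bar{x} = 70$, every deviation
$x - \bar{x}$ is 0, $\sum (x - \bar{x})^2 = 0$ and $s = \sqrt{\frac{0}{6-1}} = 0$.

This should not be confused with the sum of the deviations of the x values from the
mean, which is always zero for any data set. That means $\sum (x - \mu) = 0$ and
$\sum (x - \bar{x}) = 0$, even when the standard deviation is not zero.

Example: The final examination results in Physics of 6 students are 58, 76,
64, 68, 81 and 73.

Mean = $\frac{58 + 76 + 64 + 68 + 81 + 73}{6} = 70$ marks

x      $x - \bar{x}$
58     58 – 70 = -12
76     76 – 70 = +6
64     64 – 70 = -6
68     68 – 70 = -2
81     81 – 70 = +11
73     73 – 70 = +3
       $\sum (x - \bar{x}) = 0$
From the table, the sum of the deviations of the x values from the mean is zero,
although the values differ and so the standard deviation of these marks is not zero.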
3.41) The following data give the weekly food expenditures for a sample of five
families.
65 82 92 116 170
(a) Find the mean for these data. Calculate the deviations of the data values from
the mean. Is the sum of these deviations zero?
(b) Calculate the range, variance and standard deviation.

Solution:
a) Mean = $\frac{65 + 82 + 92 + 116 + 170}{5} = \frac{525}{5} = 105$ dollars
x      $x - \bar{x}$
65     65 – 105 = -40
82     82 – 105 = -23
92     92 – 105 = -13
116    116 – 105 = +11
170    170 – 105 = +65
       $\sum (x - \bar{x}) = 0$
Yes, the sum of these deviations is zero.

b) Range = 170 – 65 = 105 dollars


x      $x^2$
65     4225
82     6724
92     8464
116    13456
170    28900
$\sum x = 525$     $\sum x^2 = 61769$

Variance:
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{61769 - \frac{(525)^2}{5}}{5 - 1} = \frac{61769 - 55125}{4} = 1661$ (dollars squared)

Standard deviation:
$s = \sqrt{1661} = 40.76$ dollars
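The short-cut formula for the sample variance can be checked in Python. The sketch below computes it by hand and compares it with `statistics.variance`, which also divides by n − 1; the variable names are our own.

```python
from math import sqrt
from statistics import variance

expenditures = [65, 82, 92, 116, 170]

n = len(expenditures)
sum_x = sum(expenditures)                      # sum of x
sum_x2 = sum(x * x for x in expenditures)      # sum of x squared

# Short-cut formula: s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = sqrt(s2)
data_range = max(expenditures) - min(expenditures)

print(data_range, s2, round(s, 2))
print(variance(expenditures))                  # library value for comparison
```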
EXERCISES : (Measure of dispersion : Grouped Data)

3.66) The following table gives information on the amounts (dollars) of electric bills for
August 1999 for a sample of 50 families.

Amount of electric bill (dollars)    Number of families
0 to less than 20                    5
20 to less than 40                   16
40 to less than 60                   11
60 to less than 80                   10
80 to less than 100                  8

Find the mean, variance and standard deviation.


Give a brief interpretation of the values in the column labeled mƒ in your table of
calculations. What does Σmƒ represent?

Solution:
Since the data set includes only 50 families, it represents a sample.

Amount of electric bill (dollars)    f     m     mf     m²f
0 to less than 20                    5     10    50     500
20 to less than 40                   16    30    480    14400
40 to less than 60                   11    50    550    27500
60 to less than 80                   10    70    700    49000
80 to less than 100                  8     90    720    64800
                                     n = 50      $\sum mf = 2500$    $\sum m^2 f = 156200$

Mean:
$\bar{x} = \frac{\sum mf}{n} = \frac{2500}{50} = 50$ dollars

Variance:
$s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} = \frac{156200 - \frac{(2500)^2}{50}}{49} = \frac{31200}{49} = 636.73$ (dollars squared)

Standard deviation:
$s = \sqrt{636.73} = 25.23$ dollars

Here m is the midpoint of a class of electric bill amounts in dollars and ƒ is the
frequency of that class. Thus each value of mƒ is the approximate total amount of
the electric bills of the families in that class, and Σmƒ = 2500 dollars is the
approximate total amount of the electric bills for August 1999 of all 50 families
in the sample.
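The grouped-data calculation can be reproduced with a short Python sketch. It treats the data as a sample (divisor n − 1), as stated in the solution; the data layout and variable names are our own.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class of electric bills
classes = [(0, 20, 5), (20, 40, 16), (40, 60, 11), (60, 80, 10), (80, 100, 8)]

n = sum(f for _, _, f in classes)
sum_mf = 0.0
sum_m2f = 0.0
for low, high, f in classes:
    m = (low + high) / 2          # class midpoint
    sum_mf += m * f               # approximate total bill for this class
    sum_m2f += m * m * f

mean_bill = sum_mf / n
s2 = (sum_m2f - sum_mf ** 2 / n) / (n - 1)    # sample variance
s = sqrt(s2)

print(n, sum_mf, sum_m2f)
print(mean_bill, round(s2, 2), round(s, 2))
```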
EXERCISES (Chebyshev’s Theorem & Empirical Rule)

3.76) The mean time taken by all participants to run a road race was found to be 220
minutes with a standard deviation of 20 minutes. Using Chebyshev’s theorem,
find the percentage of runners who ran this road race in
a) 180 to 260 minutes
b) 160 to 280 minutes
c) 170 to 270 minutes

Solution:

From the given information, for this distribution:
Mean, µ = 220
Standard deviation, σ = 20

a) Each of the two points is 40 units away from the mean:
180 - 220 = -40 and 260 - 220 = 40
$k = \frac{\text{distance between the mean and each point}}{\text{standard deviation}} = \frac{40}{20} = 2$
Then,
$1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = 0.75$ or 75%
According to Chebyshev’s theorem, at least 75% of the runners ran this
road race in 180 to 260 minutes.

[Figure: at least 75% of the values lie in the shaded area between µ - 2σ = 180 and µ + 2σ = 260.]


b) Each of the two points is 60 units away from the mean:
160 - 220 = -60 and 280 - 220 = 60
$k = \frac{\text{distance between the mean and each point}}{\text{standard deviation}} = \frac{60}{20} = 3$
Then,
$1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = 0.889$ or 88.9%
According to Chebyshev’s theorem, at least 88.9% of the runners ran this
road race in 160 to 280 minutes.

[Figure: at least 88.9% of the values lie in the shaded area between µ - 3σ = 160 and µ + 3σ = 280.]


c) Each of the two points is 50 units away from the mean:
170 - 220 = -50 and 270 - 220 = 50
$k = \frac{\text{distance between the mean and each point}}{\text{standard deviation}} = \frac{50}{20} = 2.5$
Then,
$1 - \frac{1}{k^2} = 1 - \frac{1}{(2.5)^2} = 1 - \frac{4}{25} = 0.84$ or 84%
According to Chebyshev’s theorem, at least 84% of the runners ran this
road race in 170 to 270 minutes.

[Figure: at least 84% of the values lie in the shaded area between µ - 2.5σ = 170 and µ + 2.5σ = 270.]
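The three parts follow the same steps, which can be sketched as one small Python function. The function name `chebyshev_lower_bound` is our own, and the sketch assumes the interval is symmetric about the mean with k greater than 1, as Chebyshev's theorem requires.

```python
def chebyshev_lower_bound(mean, sd, low, high):
    """Minimum fraction of values within [low, high] by Chebyshev's theorem."""
    if abs((mean - low) - (high - mean)) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    k = (high - mean) / sd         # distance from the mean in standard deviations
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

part_a = chebyshev_lower_bound(220, 20, 180, 260)   # k = 2
part_b = chebyshev_lower_bound(220, 20, 160, 280)   # k = 3
part_c = chebyshev_lower_bound(220, 20, 170, 270)   # k = 2.5
print(round(part_a, 3), round(part_b, 3), round(part_c, 3))
```

The returned value is a lower bound ("at least"), not the exact percentage of runners.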
3.80) The mean life of a certain brand of auto batteries is 44 months with a standard
deviation of 3 months. Assume that the lives of all auto batteries of this brand
have a bell-shaped distribution. Using the empirical rule, find the percentage of
auto batteries of this brand that have a life of
a) 41 to 47 months
b) 38 to 50 months
c) 35 to 53 months

Solution:

We use the empirical rule to find the required percentages because the distribution
of battery lives follows a bell-shaped curve. From the given information, for this
distribution,
$\bar{x} = 44$ months and s = 3 months

a) Each of the two points, 41 and 47, is 3 units away from the mean. Therefore,
k = 3/3 = 1
Thus, the distance between 41 and 44 and between 44 and 47 is equal to s.

From the empirical rule, because the area within one standard deviation of the
mean is approximately 68% for a bell-shaped curve, approximately 68% of
the auto batteries have a life between 41 and 47 months.

[Figure: bell-shaped curve marking $\bar{x} - s = 41$, $\bar{x} = 44$ and $\bar{x} + s = 47$.]
b) Each of the two points, 38 and 50, is 6 units away from the mean. Therefore,
k = 6/3 = 2
Thus, the distance between 38 and 44 and between 44 and 50 is equal to 2s.

From the empirical rule, because the area within two standard deviations of
the mean is approximately 95% for a bell-shaped curve, approximately 95%
of the auto batteries have a life between 38 and 50 months.

[Figure: bell-shaped curve marking $\bar{x} - 2s = 38$, $\bar{x} = 44$ and $\bar{x} + 2s = 50$.]


c) Each of the two points, 35 and 53, is 9 units away from the mean. Therefore,
k = 9/3 = 3
Thus, the distance between 35 and 44 and between 44 and 53 is equal to 3s.

From the empirical rule, because the area within three standard deviations of
the mean is approximately 99.7% for a bell-shaped curve, approximately
99.7% of the auto batteries have a life between 35 and 53 months.

[Figure: bell-shaped curve marking $\bar{x} - 3s = 35$, $\bar{x} = 44$ and $\bar{x} + 3s = 53$.]
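The 68%, 95% and 99.7% figures of the empirical rule can be compared with an exact normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); treating "bell-shaped" as exactly normal is our own assumption, and the helper name `share_between` is hypothetical.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is exactly normal
# with mean 44 months and standard deviation 3 months.
lives = NormalDist(mu=44, sigma=3)

def share_between(low, high):
    """Fraction of batteries with a life between low and high months."""
    return lives.cdf(high) - lives.cdf(low)

within_1s = share_between(41, 47)
within_2s = share_between(38, 50)
within_3s = share_between(35, 53)
print(within_1s, within_2s, within_3s)
```

The exact normal areas differ slightly from the rounded percentages, which is why the empirical rule is stated as an approximation.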
EXERCISES (Measure of Position)

2.89) The following data give the speeds of 13 cars, measured by radar, traveling on
interstate highway I-84.

73 75 69 68 78 69 74 76 72 79 68 77 71

(a) Find the values of the three quartiles and the interquartile range.
(b) Calculate the (approximate) value of the 35th percentile.
(c) Compute the percentile rank of 71.

Solution:

(a) Find the values of the three quartiles and the interquartile range.

First, we rank the given speeds in increasing order. Then we calculate the three
quartiles as follows:

68 68 69 69 71 72 | 73 | 74 75 76 77 78 79
(values less than the median | median | values greater than the median)

Q1 = $\frac{69 + 69}{2} = 69$
Q2 = 73 (median)
Q3 = $\frac{76 + 77}{2} = 76.5$

Interquartile range, IQR = Q3 – Q1 = 76.5 – 69 = 7.5

(b) Calculate the (approximate) value of the 35th percentile.

First, we arrange the given speeds in increasing order:
68 68 69 69 71 72 73 74 75 76 77 78 79

The position of the 35th percentile is:
$\frac{kn}{100} = \frac{35(13)}{100} = 4.55$th term
The value of the 4.55th term can be approximated by the average of the fourth
and fifth terms in the ranked data. Therefore,
35th percentile, $P_{35} = \frac{69 + 71}{2} = 70$


(c) Compute the percentile rank of 71.

First, arrange the speeds in increasing order:
68 68 69 69 71 72 73 74 75 76 77 78 79

In this data set, 4 of the 13 speeds are less than 71.

Hence, percentile rank of 71 = $\frac{4}{13} \times 100 = 30.8\%$
Rounding this answer to the nearest integer, we can state
that about 31% of the speeds in this sample are less than 71.
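The three parts can be reproduced with the following Python sketch, which follows the method used in the solution (quartiles as medians of the lower and upper halves, and the percentile approximated by averaging the two neighbouring terms). Library routines such as `statistics.quantiles` use different interpolation rules and may return slightly different values. All variable names are our own.

```python
from statistics import median

speeds = [73, 75, 69, 68, 78, 69, 74, 76, 72, 79, 68, 77, 71]
ranked = sorted(speeds)
n = len(ranked)

# (a) quartiles: median of the values below and above the middle term
half = n // 2
q2 = median(ranked)
lower = ranked[:half]
upper = ranked[half + 1:] if n % 2 else ranked[half:]
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

# (b) 35th percentile: position kn/100, averaged over the neighbouring terms
position = 35 * n / 100                        # 4.55th term
p35 = (ranked[int(position) - 1] + ranked[int(position)]) / 2

# (c) percentile rank of 71: share of values strictly below 71
rank_71 = sum(1 for v in ranked if v < 71) / n * 100

print(q1, q2, q3, iqr, p35, round(rank_71, 1))
```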
EXERCISES (Probability)

4.19) Which of the following values cannot be probabilities of events and why ?
1/5 0.97 -0.5 1.56 5/3 0.0 -2/7 1.0

Solution:

The probability of an event always lies in the range 0 to 1:
Simple event E1: 0 ≤ P(E1) ≤ 1
Compound event A: 0 ≤ P(A) ≤ 1
The values -0.5, 1.56, 5/3 and -2/7 cannot be probabilities of events, because -0.5
and -2/7 are less than 0, while 1.56 and 5/3 are greater than 1.
The values 1/5, 0.97, 0.0 and 1.0 can be probabilities of events because they lie in
the range 0 to 1.
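The range check can be sketched in Python with the standard `fractions` module, which parses both decimal strings and fractions such as "5/3" exactly. The function name `is_probability` is our own.

```python
from fractions import Fraction

candidates = ["1/5", "0.97", "-0.5", "1.56", "5/3", "0.0", "-2/7", "1.0"]

def is_probability(text):
    """True if the value lies in the range 0 to 1 inclusive."""
    return 0 <= Fraction(text) <= 1

valid = [c for c in candidates if is_probability(c)]
invalid = [c for c in candidates if not is_probability(c)]
print(valid)
print(invalid)
```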

4.25) A hat contains 40 marbles, of which 18 are red and 22 are green. If one marble is
randomly selected out of this hat, what is the probability that this marble is
(a) red (b) green?

Solution:
Let n denote the total number of marbles in the hat, f1 the number of red
marbles and f2 the number of green marbles.
n = 40, f1 = 18, f2 = 22

a) Using the relative frequency concept of probability:

P(red marble) = $\frac{f_1}{n} = \frac{18}{40} = 0.45$

b) Using the relative frequency concept of probability:

P(green marble) = $\frac{f_2}{n} = \frac{22}{40} = 0.55$

Marbles    Frequency, f    Relative frequency
Red        18              18/40 = 0.45
Green      22              22/40 = 0.55
           n = 40          Sum = 1.00

4.26) A die is rolled once. What is the probability that
(a) a number less than 5 is obtained?
(b) a number from 3 to 6 is obtained?

Solution:

a) The experiment has a total of six outcomes: 1, 2, 3, 4, 5 and 6.
All these outcomes are equally likely. Let A be the event that a number less
than 5 is observed on the die. Event A includes four outcomes: 1, 2, 3 and 4;
that is, A = {1, 2, 3, 4}.
If any one of these four numbers is obtained, event A is said to occur. Hence,

P(A) = $\frac{\text{number of outcomes in A}}{\text{total number of outcomes}} = \frac{4}{6} = 0.67$
b) Let B be the event that a number from 3 to 6 is obtained. Event B includes four
outcomes: 3, 4, 5 and 6; that is, B = {3, 4, 5, 6}.
If any one of these four numbers is obtained, event B is said to occur. Hence,

P(B) = $\frac{\text{number of outcomes in B}}{\text{total number of outcomes}} = \frac{4}{6} = 0.67$

4.45) How many different outcomes are possible for four rolls of a die?

Solution:
Suppose we roll a die four times, where each step has 6 outcomes: 1, 2, 3, 4, 5
and 6.
Total outcomes for four rolls of a die
= 6 × 6 × 6 × 6 = $6^4$ = 1296

4.46) How many different outcomes are possible for 10 tosses of a coin?

Solution:
Suppose we toss a coin 10 times, where each step has 2 outcomes: head and tail.
Total outcomes for 10 tosses of a coin
= 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = $2^{10}$ = 1024
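The counting rule used in 4.45 and 4.46 can be confirmed by listing every outcome. The sketch below uses `itertools.product` from the Python standard library to enumerate the sequences and compares the counts with the powers; the variable names are our own.

```python
from itertools import product

# Every sequence of four die rolls and of ten coin tosses
die_sequences = list(product(range(1, 7), repeat=4))
coin_sequences = list(product("HT", repeat=10))

die_total = len(die_sequences)
coin_total = len(coin_sequences)
print(die_total, 6 ** 4)
print(coin_total, 2 ** 10)
```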

4.47) A statistical experiment has eight equally likely outcomes that are denoted by 1, 2,
3, 4, 5, 6,7, and 8. Let event A ={ 2, 5, 7 } and event B = { 2, 4, 8 }.
(a) Are events A and B mutually exclusive events?
(b) Are events A and B independent events?
(c) What are the complements of events A and B, respectively, and their
probabilities?

Solution:
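a) Events A and B have the outcome 2 in common. When the outcome 2 occurs,
A and B happen at the same time. Hence, events A and B are not
mutually exclusive.

[Figure: Venn diagram of the mutually nonexclusive events A = {2, 5, 7} and B = {2, 4, 8}, which overlap at outcome 2; outcomes 1, 3 and 6 lie outside both events.]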

b) Two events are said to be independent if the occurrence of one does not affect
the probability of the occurrence of the other. So, A and B are independent
events if:
P (A | B) = P (A) or P (B | A) = P (B)
In this case, P (A) = 3/8, while P (A | B) = P (A and B) / P (B) = (1/8) / (3/8) = 1/3,
so P (A | B) ≠ P (A).
Thus, events A and B are not independent events.

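c) The complement of A is A’ = {1, 3, 4, 6, 8} and the complement of B is
B’ = {1, 3, 5, 6, 7}.

P (A’) = 1 - P (A) = 1 - 3/8 = 5/8 = 0.625

P (B’) = 1 - P (B) = 1 - 3/8 = 5/8 = 0.625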
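The three parts of 4.47 can be checked with Python sets and exact fractions. This is a sketch under the stated condition that the eight outcomes are equally likely; the helper name `prob` is our own.

```python
from fractions import Fraction

S = set(range(1, 9))          # eight equally likely outcomes
A = {2, 5, 7}
B = {2, 4, 8}

def prob(event):
    """Classical probability: outcomes in the event over total outcomes."""
    return Fraction(len(event), len(S))

mutually_exclusive = len(A & B) == 0           # no common outcome?
p_a_given_b = prob(A & B) / prob(B)            # P(A | B)
independent = p_a_given_b == prob(A)

A_complement = S - A
B_complement = S - B
print(mutually_exclusive, p_a_given_b, prob(A), independent)
print(sorted(A_complement), prob(A_complement))
print(sorted(B_complement), prob(B_complement))
```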
4.54) The following table gives a two-way classification, based on gender and
employment status, of the civilian labor force age 16 to 24 years as of July 1999.
The numbers in the table are thousands.

          EMPLOYED    UNEMPLOYED    TOTAL
MALE      11638       1337          12975
FEMALE    10540       1157          11697
TOTAL     22178       2494          24672

(a) If one person is selected at random from these young persons, find the
probability that this person is
i) unemployed
ii) a female
iii) employed given the person is male
iv) a female given the person is unemployed
(b) Are the events ‘employed’ and ‘unemployed’ mutually exclusive? What about
the events ‘unemployed’ and ‘male’?
(c) Are the events ‘female’ and ‘unemployed’ independent? Why or why not?

Solution:
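a) i) P (unemployed) = $\frac{\text{number unemployed}}{\text{total}} = \frac{2494}{24672} = 0.1011$

ii) P (female) = $\frac{\text{number of females}}{\text{total}} = \frac{11697}{24672} = 0.4741$

iii) P (employed | male) = $\frac{\text{number of employed males}}{\text{number of males}} = \frac{11638}{12975} = 0.8970$

iv) P (female | unemployed) = $\frac{\text{number of unemployed females}}{\text{number unemployed}} = \frac{1157}{2494} = 0.4639$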

b) Mutually exclusive events are events that cannot occur together.
For the events ‘employed’ and ‘unemployed’, a person is either employed
or unemployed and cannot be both at the same time. Thus, ‘employed’
and ‘unemployed’ are mutually exclusive.

[Figure: Venn diagram of sample space S with the non-overlapping events ‘employed’ and ‘unemployed’.]

For the events ‘unemployed’ and ‘male’, a male who is selected can be
employed or unemployed. The table shows 1337 thousand persons who are both
male and unemployed, so the two events can occur at the same time. Hence, the
events ‘unemployed’ and ‘male’ are not mutually exclusive.

[Figure: Venn diagram of sample space S with the overlapping events ‘male’ and ‘unemployed’.]
c) Two events are said to be independent if the occurrence of one does not affect
the probability of the occurrence of the other. Thus the events ‘female’ and
‘unemployed’ are independent if either P (female | unemployed) = P (female)
or P (unemployed | female) = P (unemployed).

However, in this experiment, P (female | unemployed) = 1157/2494 = 0.4639
while P (female) = 11697/24672 = 0.4741, so P (female | unemployed) ≠ P (female).
Likewise, P (unemployed | female) = 1157/11697 = 0.0989 while
P (unemployed) = 2494/24672 = 0.1011. The occurrence of one event affects the
probability of the occurrence of the other. Thus, ‘female’ and
‘unemployed’ are not independent events.
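The probabilities in 4.54 can be recomputed directly from the two-way table. The sketch below stores the four cell counts (in thousands) in a dictionary and derives the marginal totals; the dictionary layout and the helper name `total` are our own.

```python
# Cell counts in thousands, keyed by (gender, status)
counts = {
    ("male", "employed"): 11638, ("male", "unemployed"): 1337,
    ("female", "employed"): 10540, ("female", "unemployed"): 1157,
}

def total(gender=None, status=None):
    """Sum the cells that match the given gender and/or status."""
    return sum(c for (g, s), c in counts.items()
               if gender in (None, g) and status in (None, s))

grand_total = total()
p_unemployed = total(status="unemployed") / grand_total
p_female = total(gender="female") / grand_total
p_employed_given_male = total("male", "employed") / total(gender="male")
p_female_given_unemployed = total("female", "unemployed") / total(status="unemployed")

# Independence would require P(female | unemployed) == P(female)
independent = abs(p_female_given_unemployed - p_female) < 1e-12

print(round(p_unemployed, 4), round(p_female, 4))
print(round(p_employed_given_male, 4), round(p_female_given_unemployed, 4))
print(independent)
```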
