
T.A. PAI MANAGEMENT INSTITUTE, MANIPAL

MANAGERIAL STATISTICS PROJECT PART-2


Submitted to: Prof. Sudhindra S
PGDM 2018-20
Section 7

Submitted by:
Group AN1
Ankur Inani-18S711
Darshika Goel-18S716
Mayur Phalak-18S726
Prithviraj Padgalwar-18S734
Thomas Kuncheria-18S758
Executive summary: Using the data on salaries of different classes of employees for each town of France, we applied relevant statistical techniques to analyze the salary information of various groups.

Introduction: We used simple random sampling to select the various samples from the population. The margin of error and interval estimates were used to find the interval in which a population parameter lies, and we determined the sample size needed for an interval estimate of a population mean. Hypothesis testing involved gathering evidence in support of a research hypothesis: we began with the alternative hypothesis, stated as the conclusion the researcher wants to support, and treated the null hypothesis as an assumption to be challenged. If the sample data showed that this assumption was untenable, we rejected the null hypothesis. We carried out hypothesis tests about a population mean using both the p-value approach and the critical value approach. We also made inferences about the difference between two population means when the standard deviations are unknown, and about a population variance using the chi-square distribution. In addition, we ran a test of independence between two categorical variables and used matched samples to make inferences about the difference between two population means.

Problem statement

The National Institute of Statistics and Economic Studies of France, INSEE (Institut National de la Statistique et des Études Économiques), collects and publishes information about the French economy, various economic indicators and economic factors about people, and carries out the national census. It has released data on the salaries of different classes of employees for each town of France. The data consists of 5136 records and covers the mean net salary of the entire population, the mean net salary of women and of men, the mean net salary for different age groups and the mean net salary for different roles in an organization. The institute wanted to analyze the data to understand the salary differences among the different sections of employees.

We took the data from the 2017 INSEE report on mean salary among different classes (by gender, age and post) across 5136 towns. Our goal is to explore inequality between men and women, between younger and older workers, and between working / social classes. We formed questions to determine whether salaries are equal across these classes.

This data is also used by Espectra, a major employment agency in France. The agency wants to hire professional employees at a lower rate with the help of this data, so it tried to find out whether salary differences among classes could be used to reduce hiring costs.
Objectives of the study

1) To study how to use hypothesis testing in real-life situations to understand data and make decisions

2) To understand sampling distributions: random and non-random sampling techniques, and the sampling distribution of the sample mean and proportion

3) To apply relevant statistical inference techniques to business data and draw fitting conclusions

4) To examine and justify the applications of various statistical hypothesis testing tools while exploring business situations

Methodology
Source of data
The data was collected by The National Institute of Statistics and Economic Studies, France

The National Institute of Statistics and Economic Studies collects, produces, analyzes and
disseminates information on the French economy and society.

We were able to obtain this data from the internet


Website: Kaggle.com

Link: https://www.kaggle.com/etiennelq/french-employment-by-town

Main Excel File

Group AN1.xlsx

1) The institute has issued reports on the salaries of different classes of employees for each town of France. It reports that in 2017 the mean salary for men is $14.84/hr, while the mean salary for women is $12.03/hr. Since sufficient historical data on the mean salaries of men and women is available for all towns, the population standard deviations are taken as known: 3.17 for men and 1.78 for women.

a. What is the probability that a sample of the mean hourly salary for men in 100 towns will provide a sample mean within $0.5 of the population mean of $14.84?
b. What is the probability that a sample of the mean hourly salary for women in 100 towns will provide a sample mean within $0.5 of the population mean of $12.03?

First, we find the z value corresponding to each end of the interval:

z = (sample mean − population mean) / (standard error of the sample mean)

Here the population standard deviation is σ = 3.17 for men.

To be within $0.5 of the population mean, the sample mean must lie between 14.34 and 15.34.

Standard error of the mean = (population standard deviation) / √n

σx̄ = 3.17 / √100 = 0.317

For the upper limit of $15.34, z = 0.5 / 0.317 = 1.577

Using the standard normal table or Excel, the cumulative probability is 0.9426.

Similarly, for the lower limit of $14.34 we get z = −1.577, and the cumulative probability is 0.0574.

The probability that the sample mean falls within $0.5 of the population mean is the difference between these two probabilities:

P = 0.9426 − 0.0574

≈ 0.885

Similarly, for the second scenario we use

z = (sample mean − population mean) / (standard error of the sample mean)

Here the population standard deviation is σ = 1.78 for women.

To be within $0.5 of the population mean, the sample mean must lie between 11.53 and 12.53.

Standard error of the mean = (population standard deviation) / √n

σx̄ = 1.78 / √100 = 0.178

For the upper limit of $12.53, z = 0.5 / 0.178 = 2.81

Using the standard normal table or Excel, the cumulative probability is 0.9975.

Similarly, for the lower limit of $11.53 we get z = −2.81, and the cumulative probability is 0.0025.

The probability that the sample mean falls within $0.5 of the population mean is the difference between these two probabilities:

P = 0.9975 − 0.0025

= 0.9950
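The two probabilities in problem 1 can be checked with a short script. This is a minimal sketch, assuming the sample mean is normally distributed around the population mean; the helper name `prob_within` is illustrative and not part of the original workbook.

```python
from statistics import NormalDist


def prob_within(margin, sigma, n):
    """P(|x_bar - mu| <= margin) when x_bar ~ Normal(mu, sigma / sqrt(n))."""
    standard_error = sigma / n ** 0.5
    z = margin / standard_error
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Area between -z and +z under the standard normal curve.
    return std_normal.cdf(z) - std_normal.cdf(-z)


# Values from problem 1: margin $0.5, 100 towns, sigma = 3.17 (men), 1.78 (women).
p_men = prob_within(0.5, 3.17, 100)
p_women = prob_within(0.5, 1.78, 100)
print(f"men: {p_men:.4f}, women: {p_women:.4f}")
```

Because the script keeps z unrounded, its results can differ in the last decimal place from values read off a printed z table.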

2) For France, the mean net salary is 13.70. A sample of 50 women showed a sample mean of 11.946. The population standard deviation is 2.93.

Formulate the hypotheses for a test to determine whether the sample data support the conclusion that the mean net salary of the female population is less than the mean of 13.70 for the total population.

Using a significance level of 0.05, what can be concluded?


Sampling distribution of x̄

1) Develop the hypotheses (lower tail test)

H0: µ ≥ 13.70
Ha: µ < 13.70

2) Specify the level of significance

α = 0.05

3) Compute the value of the test statistic

z = (x̄ − µ0) / (σ / √n)
= (11.946 − 13.70) / (2.93 / √50)
= −4.23

4) Compute the p-value and critical value

For z = −4.23
p-value ≈ 0.00001

Critical value
For a lower tail test with α = 0.05, z.05 = 1.645

5) Determine whether to reject H0

Reject H0 if z ≤ −z.05 or, equivalently, if p-value ≤ α

Since −4.23 < −1.645, we reject H0
and
since p-value < α, we reject H0

Conclusion:

There is enough statistical evidence to reject the null hypothesis. Hence, we can say that the mean net salary of women is less than the mean net salary of men and women combined.
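The five steps of this lower tail test can be sketched in a few lines. The sketch below uses the values given in the problem statement (sample mean 11.946, hypothesized mean 13.70, σ = 2.93, n = 50); the function name `lower_tail_z_test` is an illustrative assumption, not something taken from the project files.

```python
from statistics import NormalDist


def lower_tail_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sample z test of H0: mu >= mu0 against Ha: mu < mu0 (sigma known)."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    p_value = NormalDist().cdf(z)           # lower tail area to the left of z
    critical = NormalDist().inv_cdf(alpha)  # negative critical value, -z_alpha
    return z, p_value, critical, z <= critical


z, p_value, critical, reject = lower_tail_z_test(11.946, 13.70, 2.93, 50)
print(f"z = {z:.2f}, p-value = {p_value:.6f}, critical value = {critical:.3f}")
print("reject H0" if reject else "do not reject H0")
```

Both decision rules are evaluated on the same statistic, so the p-value approach and the critical value approach necessarily agree.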

3) The institute wanted to know whether the salaries earned by people are independent of gender. It also wanted to know whether the p-value and critical value methods give the same result.

For this we use the test of independence, a chi-square test that uses sample data to test for the independence of two categorical variables. The null hypothesis is that the two categorical variables are independent. We take a sample of 200 males and females and ask them whether they think salaries are dependent on gender.

H0: Salaries are independent of gender

Ha: Salaries are not independent of gender

Sample Results

Opinion    Male    Female    Total
Yes        51      54        105
No         56      39        95
Total      107     93        200

Proportion of respondents answering Yes = 105 / 200 = 0.525

Proportion of respondents answering No = 95 / 200 = 0.475

Computation of the chi-square test statistic for the test of independence between gender and salaries

fij = observed frequency, eij = expected frequency (row total × column total / sample size)

Opinion  Gender   fij   eij      fij − eij   (fij − eij)²   (fij − eij)² / eij
Yes      Male     51    56.175   −5.175      26.78          0.4767
Yes      Female   54    48.825    5.175      26.78          0.5485
No       Male     56    50.825    5.175      26.78          0.5269
No       Female   39    44.175   −5.175      26.78          0.6062
Total             200   200                                 χ² = 2.158

With r rows and c columns, the chi-square distribution has (r − 1)(c − 1) degrees of freedom, provided the expected frequency is at least 5 for each cell. Here the degrees of freedom are (2 − 1)(2 − 1) = 1.

[Bar chart: Yes and No responses by gender (Male, Female)]

We can now use the upper tail of the chi-square distribution with 1 degree of freedom and the p-value approach to determine whether the null hypothesis that salaries are independent of gender can be rejected.
From the chi-square table with 1 degree of freedom, the test statistic (χ² ≈ 2.16) falls between the values for upper tail areas of 0.90 and 0.10, so the p-value is between 0.10 and 0.90. We would reject the null hypothesis only if p-value ≤ α = 0.05. Here the p-value is greater than 0.10, so we do not reject the null hypothesis that salaries are independent of gender.

The critical value approach leads to the same conclusion. With α = 0.05 and 1 degree of freedom, the critical value for the chi-square test statistic is χ².05 = 3.841. The upper tail rejection rule becomes

Reject H0 if χ² ≥ 3.841

Since χ² ≈ 2.16 < 3.841, we do not reject H0.
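The expected frequencies and the chi-square statistic for the 2 × 2 table can be reproduced as follows. This is a minimal sketch with an illustrative function name; the closed-form p-value via `math.erfc` is valid only for 1 degree of freedom, which is the case here, and is used because the Python standard library has no general chi-square distribution.

```python
import math

# Observed counts from the sample results: rows are Yes / No, columns are Male / Female.
observed = [[51, 54], [56, 39]]


def chi_square_independence(table):
    """Chi-square statistic for a test of independence on a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


chi2 = chi_square_independence(observed)
# With 1 degree of freedom, chi-square is the square of a standard normal variable,
# so the upper tail area is P(X >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
critical = 3.841  # table value for alpha = 0.05 and 1 degree of freedom
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}, reject H0: {chi2 >= critical}")
```

The computed p-value falls inside the 0.10 to 0.90 range read from the table, so both approaches lead to not rejecting H0.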

4) The institute had a historical variance of 6.548 for the mean hourly salaries of the population. The institute then prepared a new list, and its administrators would like the variance of the mean hourly salaries in the new list to remain at the historical level. They want to evaluate the variance of the new list. Use a level of significance of 0.05 to conduct the hypothesis test. Check the result using the p-value and critical value methods.

B. Find the interval estimate of the population standard deviation at 95% confidence.

The institute proposed the following hypothesis test:

H0: σ² = 6.548

Ha: σ² ≠ 6.548

Rejection of H0 will indicate that a change in the variance has occurred and suggest that some changes are needed to bring the variance of the new list back to the historical value. A sample of 50 records from the new list is used for the analysis.

The sample of 50 mean hourly salaries gave a sample variance of s² = 2.59. The value of the chi-square test statistic is

χ² = (n − 1) s² / σ0²

= (50 − 1)(2.59) / 6.548

= 19.38

Let us now compute the p-value. Because Ha is two-sided, the p-value is twice the smaller tail area. The test statistic χ² = 19.38 lies below 49, the mean of the chi-square distribution with 49 degrees of freedom, so the smaller area is the lower tail. From the chi-square table, χ².975 = 31.55 for 49 degrees of freedom; since 19.38 < 31.55, the lower tail area is less than 0.025 and the p-value is less than 0.05. With p-value ≤ 0.05, we reject the null hypothesis.

Now let us check the result using the critical value method. For a two-tailed test with α = 0.05 and 49 degrees of freedom, the critical values are χ².975 = 31.55 and χ².025 = 70.22.

We reject H0 if χ² ≤ 31.55 or χ² ≥ 70.22. Since 19.38 < 31.55, we reject the null hypothesis: the variance of the new list differs from the historical value.
Conclusion: the p-value and critical value methods give the same result.

B.

The institute was interested in an interval estimate of the population variance based on the sample of 50 mean salaries. The sample variance was 2.69 (using the same sample as before). With a sample size of 50, we have 49 degrees of freedom. For 95% confidence (α = 0.05) we need χ².025 and χ².975.

We use the formula (n − 1) s² / χ².025 ≤ σ² ≤ (n − 1) s² / χ².975

(49)(2.69) / 70.22 ≤ σ² ≤ (49)(2.69) / 31.55

1.88 ≤ σ² ≤ 4.18

Taking the square root of these values gives the following 95% confidence interval for the population standard deviation:

1.37 ≤ σ ≤ 2.04
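The test statistic and the interval estimate in problem 4 follow from two short formulas, sketched below. The function names are illustrative assumptions, and the chi-square table values 70.22 and 31.55 (49 degrees of freedom) are taken from the text rather than computed, since the Python standard library has no chi-square quantile function. The text quotes a sample variance of 2.59 for the test and 2.69 for the interval, and the sketch uses each value where the text does.

```python
def chi_square_statistic(n, sample_variance, sigma0_squared):
    """Test statistic (n - 1) * s^2 / sigma0^2 for a test about a population variance."""
    return (n - 1) * sample_variance / sigma0_squared


def variance_interval(n, sample_variance, chi2_upper_tail, chi2_lower_tail):
    """Interval estimate for sigma^2 given the two chi-square table values."""
    numerator = (n - 1) * sample_variance
    return numerator / chi2_upper_tail, numerator / chi2_lower_tail


stat = chi_square_statistic(50, 2.59, 6.548)
var_low, var_high = variance_interval(50, 2.69, chi2_upper_tail=70.22, chi2_lower_tail=31.55)
sd_low, sd_high = var_low ** 0.5, var_high ** 0.5

print(f"chi-square statistic = {stat:.2f}")
print(f"variance interval: {var_low:.2f} to {var_high:.2f}")
print(f"standard deviation interval: {sd_low:.2f} to {sd_high:.2f}")
```

Comparing the hypothesized variance of 6.548 with the printed variance interval gives a quick cross-check on the two-tailed test at the same α.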

5) The institute takes two samples: one from the mean net salary/hr for female executives and one from the mean net salary/hr for women. It is interested in the interval estimate of the difference between the two population means with σ1 and σ2 unknown, using 0.05 as the level of significance.

Sample 1 (female executives) has size n1 = 60 and sample 2 (women) has size n2 = 50.

Sample means: x̄1 = 21.008 and x̄2 = 11.946

Sample standard deviations: s1 = 2.322 and s2 = 1.107

The interval estimate is

x̄1 − x̄2 ± tα/2 √(s1²/n1 + s2²/n2)

Degrees of freedom

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)² / (n1 − 1) + (s2²/n2)² / (n2 − 1) ]

Putting s1 = 2.322, n1 = 60, s2 = 1.107, n2 = 50 in the above formula we get df = 87.7, which we round down to 87.

Now we develop the 95% confidence interval estimate of the difference between the two population means using

t.025 = 1.988 for df = 87 and x̄1 = 21.008, x̄2 = 11.946, s1 = 2.322, n1 = 60, s2 = 1.107, n2 = 50

Substituting in the interval formula we get 9.062 ± 0.672

Conclusion

The point estimate of the difference between the two population means of hourly salary is 9.062. The margin of error is 0.672, and the 95% confidence interval estimate of the difference between the two population means is 8.390 to 9.734.
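The interval estimate with unknown standard deviations can be sketched as below. The function name `welch_interval` is illustrative, and the critical value `t_crit=1.988` is an assumed table lookup (t.025 for the rounded-down degrees of freedom returned by the function) passed in by hand, because the Python standard library does not provide a t distribution.

```python
def welch_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Point estimate, margin of error and degrees of freedom for mu1 - mu2
    when the population standard deviations are unknown and not assumed equal."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    standard_error = (v1 + v2) ** 0.5
    # Degrees of freedom from the unequal-variance (Welch) formula.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    diff = x1 - x2
    margin = t_crit * standard_error
    return diff, margin, df


# Sample 1: female executives; sample 2: women (values from problem 5).
diff, margin, df = welch_interval(21.008, 2.322, 60, 11.946, 1.107, 50, t_crit=1.988)
print(f"difference = {diff:.3f}, margin = {margin:.3f}, df = {df:.1f}")
print(f"interval: {diff - margin:.3f} to {diff + margin:.3f}")
```

Rounding the degrees of freedom down before the table lookup gives a slightly larger t value and therefore a slightly more conservative interval.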

6)Is it possible to conclude, using a .05 level of significance, that the average of the mean net
salaries of male executive is more than the average of the mean net salary of female
executive?

µ1 = population average of the mean net salaries of male executives

µ2 = population average of the mean net salaries of female executives

1) Develop the hypotheses

H0: µ1 − µ2 ≤ 0 (right tail test)
Ha: µ1 − µ2 > 0

σ1 = 3.421 (S.D. for the population of mean net salaries of male executives)

σ2 = 2.321 (S.D. for the population of mean net salaries of female executives)

Sampling distribution of x̄1 − x̄2

2) Specify the level of significance
α = 0.05

3) Compute the value of the test statistic

z = ((x̄1 − x̄2) − D0) / √(σ1²/n1 + σ2²/n2)

z = ((25.946 − 20.972) − 0) / √(3.421²/50 + 2.321²/50)
= 4.974 / 0.5846
= 8.51

4) Compute the p-value and critical value

For z = 8.51

the p-value is < 0.00001

Critical value

For a right tail test with α = 0.05, z.05 = 1.645

5) Determine whether to reject H0

Reject H0 if z ≥ z.05 or, equivalently, if p-value ≤ α

Since 8.51 > 1.645, we reject H0

and
since p-value < α, we reject H0

Conclusion

At the .05 level of significance, the sample evidence indicates that the average of mean net
salary of male executive is greater than the average of mean net salary of female executive.
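The right tail test for two means with known population standard deviations can be sketched as follows. The function name is illustrative; the inputs are the sample means (25.946 and 20.972), standard deviations (3.421 and 2.321) and sample sizes (50 each) that appear in the calculation above. Note that the variances of the two sample means add under the square root.

```python
from statistics import NormalDist


def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, d0=0.0):
    """z statistic for H0: mu1 - mu2 <= d0 with known population standard deviations."""
    standard_error = (sigma1 ** 2 / n1 + sigma2 ** 2 / n2) ** 0.5
    return ((x1 - x2) - d0) / standard_error


z = two_sample_z(25.946, 20.972, 3.421, 2.321, 50, 50)
# Upper tail area; computed as cdf(-z) to avoid the rounding loss in 1 - cdf(z).
p_value = NormalDist().cdf(-z)
critical = NormalDist().inv_cdf(0.95)  # z_0.05 for a right tail test
reject = z >= critical
print(f"z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject}")
```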

7) Espectra, the famous French employment agency, wants to hire some new employees at a lower cost, so it wants to know whether there is a disparity in salaries among towns.

It took two equal-sized samples of towns (size 50 each) and compared the mean hourly salary (for all employees). It initially assumed that the two samples have the same mean hourly salary.

Thus, the hypotheses are written as follows:

H0: µ1 − µ2 = 0

Ha: µ1 − µ2 ≠ 0

The null hypothesis will be rejected if the sample evidence shows that the two means differ significantly.
Let µd = the mean of the differences in salary between the paired towns, so that H0 is equivalent to µd = 0.

Town (sample 1)      Mean net salary   Town (sample 2)        Mean net salary   di
Saulieu              11.2              Montreal-la-clause     12.4              −1.2
Fontaine             12.9              Balan                  13.9              −1.0
Bouvigny-Boyeffles   13.8              Marboz                 12.7               1.1
Chantilly            17.8              Saint-andra-de-corcy   15.0               2.8

Only 4 of the 50 town pairs are shown here; the actual sample size is 50.

The last column shows the difference in salary between the paired towns (row-wise).

d bar = ∑ di / n, where n = sample size

Calculation:

d bar = 32.8 / 50 = 0.656

Standard deviation sd = sqrt [ ∑ (di − d bar)² / (n − 1) ]

Calculation:

sd = sqrt (438.06 / 49) = sqrt (8.94)
sd = 2.99

Now we calculate the test statistic for a hypothesis test involving matched samples

t = (d bar − µd) / (sd / sqrt n)

t = (0.656 − 0) / (2.99 / sqrt 50)
t = 0.656 / 0.4228
t = 1.55
Now we calculate the p-value for this two-tailed test. Because t = 1.55 > 0, the test statistic is in the upper tail of the t distribution. The area in the upper tail to the right of t = 1.55 is found from the t distribution table with 50 − 1 = 49 degrees of freedom.

From the t distribution table, the area in the upper tail is between 0.05 and 0.10; since this is a two-tailed test, the p-value is between 0.10 and 0.20.

This p-value is greater than α = 0.05, thus the null hypothesis H0: µ1 − µ2 = 0 is not rejected.

In addition, we can obtain an interval estimate of the difference between the two population means.

At 95% confidence, the calculation is:

d bar ± t.025 (sd / sqrt n)

= 0.656 ± 2.01 × 2.99 / sqrt(50)

= 0.656 ± 0.850

= −0.194 to 1.506
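The matched-sample calculation can be reproduced from the summary figures quoted above (∑ di = 32.8, ∑ (di − d bar)² = 438.06, n = 50). This is a minimal sketch: the function name is illustrative, and the critical value 2.01 is the table value used in the text for 49 degrees of freedom, passed in by hand because the Python standard library has no t distribution.

```python
def paired_t_summary(sum_d, sum_sq_dev, n, t_crit):
    """Matched-sample t statistic (for mu_d = 0) and confidence interval
    from the sum of differences and the sum of squared deviations."""
    d_bar = sum_d / n
    s_d = (sum_sq_dev / (n - 1)) ** 0.5
    standard_error = s_d / n ** 0.5
    t = d_bar / standard_error
    margin = t_crit * standard_error
    return d_bar, s_d, t, (d_bar - margin, d_bar + margin)


d_bar, s_d, t, interval = paired_t_summary(32.8, 438.06, 50, t_crit=2.01)
print(f"d bar = {d_bar:.3f}, sd = {s_d:.2f}, t = {t:.2f}")
print(f"95% interval: {interval[0]:.3f} to {interval[1]:.3f}")
print("reject H0" if abs(t) > 2.01 else "do not reject H0")
```

With all 50 differences available, `d_bar` and `s_d` could instead be computed directly with `statistics.mean` and `statistics.stdev`.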

8. The INSEE report states that the mean hourly salary for all persons aged 18-25 (men and women together) is $9.54/hr. We took samples of 50 towns for women aged 18-25 and for men aged 18-25 from the same population to see whether the mean hourly salary in these samples differs from the reported mean for all (men and women). The result will help in identifying pay disparity based on gender.
a. Formulate the null and alternative hypotheses that can be used to determine whether the mean hourly salary for women (18-25) differs from the population mean hourly salary for all (men and women).

b. A sample of 50 towns showed a sample mean of $9.19/hr for women (18-25). The sample standard deviation is $0.425. Compute the p-value.

c. With α = 0.05 as the level of significance, what can be concluded?

d. Formulate the null and alternative hypotheses that can be used to determine whether the mean hourly salary for men (18-25) differs from the population mean hourly salary for all (men and women).

e. Suppose a sample of 50 towns showed a sample mean of $9.86/hr for men (18-25). The sample standard deviation is $0.426. Compute the p-value.

f. With α = 0.05 as the level of significance, what can be concluded?


Solution:
a.
H0: µ (18-25 all) − µ (18-25 female) = 0
Ha: µ (18-25 all) − µ (18-25 female) ≠ 0

b.
t = (9.19 − 9.54) / (0.425 / sqrt 50)
t = −5.82
Using Excel with 49 degrees of freedom and t = −5.82, the area in the lower tail is almost zero, so the two-tailed p-value is also almost zero. If the mean salary for women were equal to the overall mean, a sample result this extreme would be almost impossible.
This indicates a disparity between the salaries of women and the overall mean.
c. Because the p-value is almost zero, it is less than α = 0.05, so the null hypothesis is rejected. The mean hourly salary for women (18-25) differs from the overall mean, which suggests that salaries are not equal for men and women.
d.
H0: µ (18-25 all) − µ (18-25 male) = 0
Ha: µ (18-25 all) − µ (18-25 male) ≠ 0
e.
t = (9.86 − 9.54) / (0.426 / sqrt 50)
t = 5.31
Using Excel with 49 degrees of freedom and t = 5.31, the area in the upper tail is almost zero, so the two-tailed p-value is also almost zero. If the mean salary for men were equal to the overall mean, a sample result this extreme would be almost impossible.
f. Because the p-value is less than α = 0.05, the null hypothesis is rejected. Women (18-25) earn less than the overall mean while men (18-25) earn more, which is why neither group's mean equals the overall average.
This indicates a disparity between the salaries of men and the overall mean.
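The two t statistics in problem 8 can be recomputed from the rounded summary values quoted in parts b and e. This is a minimal sketch with an illustrative function name; figures computed in a spreadsheet from unrounded data may differ slightly in the second decimal place. The critical value 2.01 is the two-tailed table value for 49 degrees of freedom at α = 0.05, entered by hand.

```python
def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t statistic for H0: mu = mu0 when the population standard deviation is unknown."""
    return (sample_mean - mu0) / (sample_sd / n ** 0.5)


t_women = one_sample_t(9.19, 9.54, 0.425, 50)
t_men = one_sample_t(9.86, 9.54, 0.426, 50)
t_crit = 2.01  # assumed table value: t_0.025 with 49 degrees of freedom

for label, t in (("women", t_women), ("men", t_men)):
    decision = "reject H0" if abs(t) > t_crit else "do not reject H0"
    print(f"{label}: t = {t:.2f}, {decision}")
```

The signs of the two statistics show the direction of each difference: negative for women (below the overall mean) and positive for men (above it).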
Results and analysis
We used sampling techniques to select samples from the population and to estimate the sample mean and the sampling distribution of x bar for the mean hourly salary. We computed interval estimates of the population mean with the standard deviation known. We made tentative assumptions about population parameters and tested them using hypothesis testing, with both one-tailed and two-tailed tests. The chi-square distribution gave us the hypothesis test and interval estimate for a population variance, and inference about the difference between two population means with standard deviations unknown was carried out using the t distribution. The matched samples concept gave another inference about the difference between two population means. The chi-square test was also used for the test of independence between two categorical variables.
