
Summarizing Data

XINLONG ZHANG

Data Summary
Important summary statistics for a distribution
of data can include:

Measures of the center:

Sample mean, $\bar{x}$

Sample median, M

Sample mode

Measures of variability:

Sample variance, $s^2$

Sample range, r

Notation

Say we have a sample of data. There are n data points in the sample.

We will represent the variable being measured with a letter such as x.

We use the symbol $x_i$ to represent the ith observation of the variable x, where i = 1, 2, ..., n.

So the values in our sample can be written as $x_1, x_2, \ldots, x_n$.

Notation
Example
A student's exam scores in his basic statistics course were (out of 100) 58, 37, and 75.

Variable: Let x represent exam score

We have n = 3 observations, so i = 1, 2, 3.

We use the symbol $x_i$ to represent the ith observation of the variable x:

$x_1 = 58$, $x_2 = 37$, $x_3 = 75$
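The subscript notation maps directly onto a sequence in code. The sketch below uses Python for illustration; the helper name `observation` is hypothetical and is only there to show how the 1-based subscript i relates to Python's 0-based list indexing.

```python
# Exam scores from the example: 58, 37, and 75.
x = [58, 37, 75]
n = len(x)  # number of observations in the sample


def observation(sample, i):
    """Return x_i, where i is the 1-based subscript used in the notation."""
    if not 1 <= i <= len(sample):
        raise IndexError("i must be between 1 and n")
    return sample[i - 1]  # Python lists start at index 0


for i in range(1, n + 1):
    print(f"x_{i} = {observation(x, i)}")
```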

Sample Mean

Say there are n data points in a sample (denoted by $x_1, x_2, \ldots, x_n$). The sample mean is then

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

https://www.khanacademy.org/math/probability/statisticsinferential/sampling_distribution/v/sampling-distribution-of-the-sample-mean

This quantity can be influenced by extreme observations (outliers).

Descriptive Statistics

Sample Mean

Emergency room waiting times are continually increasing. One factor identified as affecting wait time was turnaround time for basic blood analysis. Turnaround times (in minutes) for ten such tests on one particular day are

68 70 77 56 58 65 48 66 70 71
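$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{68 + 70 + \cdots + 71}{10} = \frac{649}{10} = 64.9 \text{ minutes}$$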

https://www.youtube.com/watch?v=8P5WZ6TfuZg
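As a quick check of the arithmetic, the sample mean of the ten turnaround times can be computed in a few lines. This is a minimal Python sketch; the function name `sample_mean` and the value 171 used to illustrate an outlier are assumptions made for the example, not part of the original data.

```python
# Turnaround times (in minutes) for the ten blood tests.
turnaround = [68, 70, 77, 56, 58, 65, 48, 66, 70, 71]


def sample_mean(sample):
    """Sum of the observations divided by the number of observations."""
    if not sample:
        raise ValueError("sample must contain at least one observation")
    return sum(sample) / len(sample)


mean_time = sample_mean(turnaround)
print(mean_time)  # 64.9

# Hypothetical outlier: suppose the last test had taken 171 minutes, not 71.
with_outlier = turnaround[:-1] + [171]
print(sample_mean(with_outlier))  # 74.9
```

A single extreme value shifts the mean by ten minutes here, which is the sensitivity to outliers noted for the sample mean.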

Sample Median

The sample median divides the ordered data into two parts: 50% of all the data lie at or below this value.

To find the median of a set of data (n observations):

Order the data in increasing order.

Calculate (n+1)/2. This is the location of the median in the ordered data.

If n is odd, the median is the value at position (n+1)/2.

If n is even, (n+1)/2 falls halfway between two positions, and the median is the average of the two values on either side of it (positions n/2 and n/2 + 1).

The median is a resistant measure of the center. That is, extreme observations will not necessarily influence the value of the median.
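The location rule above can be sketched in Python as follows. The function name `sample_median` is an assumption for illustration; the standard library's `statistics.median` gives the same result.

```python
def sample_median(sample):
    """Median via the (n + 1) / 2 location rule."""
    if not sample:
        raise ValueError("sample must contain at least one observation")
    ordered = sorted(sample)      # step 1: order the data
    n = len(ordered)
    location = (n + 1) / 2        # step 2: 1-based location of the median
    if n % 2 == 1:
        # n odd: the location is a whole number, take that value directly
        return ordered[int(location) - 1]
    # n even: average the values at positions n/2 and n/2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


turnaround = [68, 70, 77, 56, 58, 65, 48, 66, 70, 71]
print(sample_median(turnaround))    # 67.0 (average of 66 and 68)
print(sample_median([58, 37, 75]))  # 58 (middle of 37, 58, 75)
```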

Descriptive Statistics

Sample Median

The sample median divides the data into two parts: 50% of all the data lie at or below this value (the 50th percentile).

Consider the emergency room data again:

68 70 77 56 58 65 48 66 70 71

In order:

48 56 58 65 66 68 70 70 71 77

With n = 10, the location is (10+1)/2 = 5.5, so the median is the average of the 5th and 6th ordered values: (66 + 68)/2 = 67.

Descriptive Statistics
[Figure: dotplot of Turnaround Time (minutes), horizontal axis from 48 to 76, with the sample mean (64.9) and the sample median (67) marked.]

Graphical Analysis
Box plots
Very good for comparing two or more groups of data.

[Figure: Boxplot of Faults vs Shift, Period. Vertical axis: Faults (from -3 to 3). Horizontal axis: Period (Weekday, Weekend) within Shift (1, 2, 3).]

Organizing Data
Example
Top five primary reasons given by patients for an emergency room visit.

Reason          Frequency
Stomach Pain    6012
Chest Pain      2185
Broken Bones    4331
Headache        2876
Vomiting        1244
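One common way to organize a frequency table like this is to sort it by count and add relative frequencies. The Python sketch below does that; note that the percentages are relative to these five reasons only, since the table lists just the top five, and the variable names are chosen for illustration.

```python
# Frequencies from the emergency room table.
reasons = {
    "Stomach Pain": 6012,
    "Chest Pain": 2185,
    "Broken Bones": 4331,
    "Headache": 2876,
    "Vomiting": 1244,
}

total = sum(reasons.values())  # total visits across the five listed reasons

# Sort from most to least frequent and print each share of the total.
by_count = sorted(reasons.items(), key=lambda item: item[1], reverse=True)
for reason, count in by_count:
    print(f"{reason:<14}{count:>6}{count / total:>8.1%}")

most_common_reason = by_count[0][0]
print("Most frequent reason:", most_common_reason)
```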

Organizing Data
Example
A psychology student is looking at some data collected on 10 patients seen by a particular doctor. The data are the diagnosis of each patient:

bipolar, depression, depression, schizophrenia, hypochondriac, bipolar, obsessive-compulsive disorder, depression, anxiety, anxiety.

Question: For this set of data, which measure of the center is most appropriate (mean, median, mode) and what is it?
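Because the diagnoses are labels rather than numbers, they cannot be averaged or ordered, so only a count-based summary applies. The Python sketch below tallies the ten diagnoses with `collections.Counter`; lower-casing the labels so that "Bipolar" and "bipolar" count together is an assumption about how the data should be cleaned.

```python
from collections import Counter

# The ten diagnoses as listed in the example.
diagnoses = [
    "Bipolar", "depression", "depression", "schizophrenia",
    "hypochondriac", "bipolar", "Obsessive-compulsive disorder",
    "depression", "anxiety", "anxiety",
]

# Normalize capitalization, then count how often each diagnosis occurs.
counts = Counter(d.lower() for d in diagnoses)

# most_common(1) returns a list with the single most frequent (label, count) pair.
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)  # depression 3
```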

Descriptive Statistics

Sample Mode

The value that occurs most frequently in a set of data. There can be more than one mode.

Commonly used with categorical (qualitative) data.

Defect               Count
Misregistration      98
Peeling              34
Scratch              53
Short                3
Underetch            261
Wrong Part Number    2

Measures of Center
Mode

The value that occurs most often is the mode.

If each data value appears only once, then there is no mode.

68 70 77 56 58 65 48 66 70 71

In this case the mode is 70.

The mode is the most appropriate measure of the center for qualitative data.
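The conventions on this slide (possibly several modes, and no mode when every value appears only once) can be sketched in Python. The function name `sample_modes` is hypothetical; note that the standard library's `statistics.multimode` behaves differently in the all-unique case, returning every value rather than an empty list.

```python
from collections import Counter


def sample_modes(sample):
    """Return a list of the most frequent values, or [] if there is no mode."""
    if not sample:
        return []
    counts = Counter(sample)
    highest = max(counts.values())
    if highest == 1:
        # Every value appears only once: no mode, following the slide's convention.
        return []
    return [value for value, count in counts.items() if count == highest]


turnaround = [68, 70, 77, 56, 58, 65, 48, 66, 70, 71]
print(sample_modes(turnaround))       # [70]
print(sample_modes([58, 37, 75]))     # [] (no value repeats)
print(sample_modes([1, 1, 2, 2, 3]))  # [1, 2] (two modes)
```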

Descriptive Statistics

Put the three together

[Figure: positively skewed data]

[Figure: negatively skewed data]

[Figure: symmetric data]

For symmetric data, the mean is recommended as the measure of the center.

For skewed data, the median is recommended as the measure of the center.

For categorical data, the mode is recommended as the measure of the center.

www.zillow.com
Mean SFR: $863,699; Median: $685,000

Measures of Center
Mean, Median, and Shape of the Distribution

If the distribution is skewed (left or right), the median is a better indicator of the central value.

A distribution is skewed because there are (possibly extreme) observations far out in one tail.

The median is a resistant measure.

If the distribution is bell-shaped or symmetric, then the mean and median are (approximately) equal.

Measures of Center
Mean, Median, and Shape of the Distribution

Consider the following salaries of employees in one department of a company:

$20,000, $20,000, $22,000, $28,000, $90,000

Average salary: $36,000

Median salary: $22,000

What if the $90,000 was an error and it is really $130,000?

Average salary: $44,000; Median salary: $22,000
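The salary comparison is easy to reproduce with Python's standard `statistics` module. This is a small sketch; the variable names `salaries` and `corrected` are chosen for illustration.

```python
from statistics import mean, median

# Salaries as first recorded, and with the largest value corrected to $130,000.
salaries = [20000, 20000, 22000, 28000, 90000]
corrected = salaries[:-1] + [130000]

for label, data in [("recorded", salaries), ("corrected", corrected)]:
    print(f"{label}: mean = ${mean(data):,.0f}, median = ${median(data):,.0f}")
```

The mean moves from $36,000 to $44,000 while the median stays at $22,000, which is what it means for the median to be resistant to extreme observations.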

Graphical Displays

What to look for:

What is the overall shape of the data?

Are there any unusual observations?

Where is the center or average of the data located?

What is the spread of the data? Are the data spread out or close to the center?

50 Exam Scores.
3 | 9
4 | 0134
4 | 66
5 | 0012223
5 | 9
6 | 11233
6 | 889
7 | 223
7 | 6799
8 | 2233344
8 | 5578
9 | 0123
9 | 55688

Location of the median: (n + 1)/2 = (50 + 1)/2 = 25.5, that is, halfway between the 25th and 26th ordered scores.
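The median location for the 50 exam scores can be worked out by expanding the stem-and-leaf display back into individual scores. The Python sketch below assumes that the stems and leaves pair up in the order they are listed, and that each stem is the tens digit and each leaf the ones digit (so stem 3 with leaf 9 is a score of 39); both are assumptions about the original plot.

```python
# (stem, leaves) pairs, read in the order they appear in the display.
stem_and_leaf = [
    (3, "9"), (4, "0134"), (4, "66"), (5, "0012223"), (5, "9"),
    (6, "11233"), (6, "889"), (7, "223"), (7, "6799"),
    (8, "2233344"), (8, "5578"), (9, "0123"), (9, "55688"),
]

# Expand every leaf into a full score: 10 * stem + leaf.
scores = [10 * stem + int(leaf) for stem, leaves in stem_and_leaf for leaf in leaves]

n = len(scores)
location = (n + 1) / 2  # 1-based location of the median
ordered = sorted(scores)

# n is even, so average the values at positions n/2 and n/2 + 1.
median_score = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(n, location, median_score)  # 50 25.5 72.5
```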
