
Statistical Computing
Certificate Course
Batch -10

Organized by
Department of Statistics
University of Rajshahi

Introduction to Data

Prof. Dr. Md. Rezaul Karim
Email: mrezakarim@yahoo.com
July 05, 2014

Outline
1. Data, Information and Knowledge
2. Stages of a Research Process
3. Sampling Technique
4. Data Classification
5. Data Processing
6. Graphical Representation of Data
7. Summary Chart Concerning Analysis of Data

Prof. Dr. M. Rezaul Karim, Statistics, RU

1. Data, Information and Knowledge

Data are collected to extract information, which in turn generates knowledge. An understanding of this is important in any data collection and analysis.

The word "data" used to be considered as the plural of "datum", which means a single piece of information.

Data are typically the results of measurements; individual pieces of information.

Raw data (also known as primary data or unprocessed data) is a term for data collected from a source.

Field data refers to raw data that is collected in an uncontrolled environment or field.

Experimental data refers to data that is generated within the context of a scientific investigation by observation and recording.

Information is extracted from data through analysis.
(i) Data represents a fact or statement of an event without relation to other things. Information embodies the understanding of a relationship of some sort, possibly cause and effect (Bellinger et al. 1997).
(ii) Data are raw facts that have not been organized or cannot possibly be interpreted. Information is data that are understood. Information comes from the relationship between pieces of data (Benyon 1990).

Knowledge is the ability of individuals to understand the information and the manner in which the information is used in a specific context, as illustrated by the following:
(i) Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described and what will happen next (Bellinger et al. 1997).
(ii) Data gets transformed into information through an understanding of the relationships, and information yields knowledge through an understanding of the patterns (Bellinger et al. 1997).

Example:

The height of Mt. Everest is generally considered as "data" [its peak is 8,848 meters (29,029 ft) above sea level].

A book on Mt. Everest geological characteristics may be considered as "information", and

A report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".

The link between data, information and knowledge can be characterized through the DIKW (Data, Information, Knowledge, and Wisdom) hierarchy, a term attributed to Ackoff (1989).

According to this, the content of the human mind can be classified into five categories:

1. Data: symbols
2. Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
3. Knowledge: application of data and information; answers "how" questions
4. Understanding: appreciation of "why"
5. Wisdom: evaluated understanding.

2. Stages of a Research Process

[Figure: cycle of research stages. Problem Discovery and Definition, Research Design, Sampling, Data Gathering, Data Processing and Analysis, Conclusions and Report, then back to Discovery and Definition, and so on.]

3. Sampling Technique

Sampling
A process of selecting units from a population.
A process of selecting a sample to determine certain characteristics of a population.

[Figure: a sample drawn from a population.]

From the characteristics of samples, we can infer the characteristics of the population, if the sample is representative of the population.

Population: The complete set of people or things being studied. It could be all the citizens in a country, all farms in a region, or all children under the age of five in a country, etc.

Sample: A (representative) subset of the population that is actually studied (and from which the raw data are actually obtained).

Sampling frame: A complete list of every unit in the population of interest, called a sampling frame, is needed to select a random sample.

Sample Design: The method of sample selection.

Random (or Probability) Sampling

Simple Random Sampling:

By definition, a simple random sample refers to those cases that are selected so that each element in the population has an equal chance of being included in the sample.

A lottery draw is a good example of simple random sampling.

A sample of 5 numbers is randomly generated from a population of 48, with each number having an equal chance of being selected.

Stratified Random Sampling:

In this sampling technique the population is divided into two or more homogeneous subgroups or strata and a simple random sample is taken from each subgroup.

Suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into the four sub-groups and take samples from these.
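
The two random sampling schemes above (a lottery-style draw of 5 numbers from 48, and the farmer sampling within each breed) can be sketched in Python as follows. This is only an illustration: the function names, the cow IDs, the herd sizes and the sample size of 3 per breed are assumptions and do not come from the slides.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of fixed size from each stratum.

    strata: dict mapping stratum name -> list of units.
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

# Lottery-style draw: 5 numbers from a population of 48.
draw = simple_random_sample(range(1, 49), 5, seed=1)

# Hypothetical herd: cow IDs grouped by breed (IDs and herd sizes are made up).
herd = {
    "Ayrshire": [f"A{i}" for i in range(1, 13)],
    "Friesian": [f"F{i}" for i in range(1, 21)],
    "Galloway": [f"G{i}" for i in range(1, 9)],
    "Jersey": [f"J{i}" for i in range(1, 11)],
}
by_breed = stratified_sample(herd, n_per_stratum=3, seed=1)
```

Fixing `seed` makes a draw repeatable; leaving it as `None` gives a fresh draw on each run.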

Cluster Sampling
Cluster sampling is another form of random sampling. A cluster is any naturally occurring aggregate of the units that are to be sampled. Thus households (or homes) are clusters of people and towns are clusters of households.
Cluster samples are most often used when:
o we do not have a complete list of everyone in the population of interest but do have a complete list of the clusters in which they occur, or
o we have a complete list of everyone, but they are so widely dispersed that it would be too time consuming and expensive to send data collectors out to a simple random sample.
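
As a rough sketch of the idea, the Python snippet below selects whole households at random and then keeps every person in the chosen households. The household labels, the names and the function `cluster_sample` are hypothetical; only the list of clusters is needed to make the draw.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # draw from the cluster list
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical households (clusters) and the people living in them.
households = {
    "H1": ["Ana", "Ben"],
    "H2": ["Chitra"],
    "H3": ["Dev", "Ela", "Farid"],
    "H4": ["Gita", "Hasan"],
}
chosen_households, people = cluster_sample(households, n_clusters=2, seed=7)
```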

Systematic Sampling:
A sample drawn from a list using a random start
followed by a fixed sampling interval.
Often used in industry, where an item is selected
for testing from a production line (say, every
fifteen minutes) to ensure that machines and
equipment are working to specification.
Alternatively, the manufacturer might decide to
select every 20th item on a production line to test
for defects and quality.
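
A minimal sketch of systematic sampling, assuming the items on the production line are available as a Python list. The serial numbers and the function name are made up for illustration; the interval of 20 follows the example above.

```python
import random

def systematic_sample(items, interval, seed=None):
    """Pick a random start in the first interval, then every interval-th item."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random start: index 0 .. interval-1
    return items[start::interval]

production_line = list(range(1, 201))  # hypothetical item serial numbers 1..200
tested = systematic_sample(production_line, interval=20, seed=3)
```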

Two Stage and Multistage Cluster Sample



Nonrandom (nonprobability) Sampling

Quota sampling
A sample in which a specific number of different
types of units are selected. For example, we may
want to interview 10 teachers and decide that five
will be men and five will be women.

Judgmental sampling
In this kind of sample, selections are made based on
pre-determined criteria that, in your judgment, will
provide the data you need. For example, you may
want to interview primary school principals and
decide to interview some from rural areas as well as
some from urban areas (but no quota is established).

Snowball sampling
This type of sampling is used when we do not know who or what should be included. Typically used in interviews, we would ask the interviewees who else we should talk to. We would continue until no new suggestions are obtained.

Convenience sampling
In this type, selections are made based on the
convenience to the evaluator. Principals from local
schools may be selected because they are near where
the evaluators are located.

4. Data Classification

Data can be classified in several ways, for example:

1. According to Representation level
Qualitative (or Categorical) data
Quantitative (or numerical) data

Qualitative Data
Qualitative data take on values that are categories, characteristic names or labels or descriptions
Data can be observed but not measured
Example: Gender, Colors, smells, tastes, appearance, beauty, etc.
Data analysis includes the coding of the data

Quantitative Data
Quantitative data are numeric and represent a measurable quantity with numbers
Data which can be measured
Example: Length, height, weight, speed, time, cost, ages, etc.
Data analysis is mainly statistical

2. According to Measurement
Discrete
Continuous

Discrete Data
Countable numerical observation
o whole numbers only
o has an equal whole number interval
o obtained through counting
Example: number of occurrences, number of students, etc.

Continuous Data
Measurable observations
Decimals or fractions
Obtained through measuring
Example: Height, Weight, Bank deposits, Volume of liquid, etc.

3. According to Source
Primary data
o First-hand information
o Example: Autobiography, first-time taken financial statement, etc.

Secondary data
o Second-hand information
o Example: Weather forecasts from newspapers, data taken from published journals, books, web pages, etc.

4. According to Arrangement
Ungrouped data
o Raw data
o No specific arrangement

Grouped data
o Organized set of data
o At least 2 groups
o Arranged in any order

5. According to dependency of time
Time series data
o A sequence of data points, measured typically at successive points in time spaced at uniform time intervals
o Example: weekly share prices, daily rainfall, temperature, etc.

Cross-section or Cross-sectional data
o Data collected by observing many subjects (such as individuals, firms, countries, etc.) at the same point of time, or without regard to differences in time.
o Example: Weight and height of 100 randomly selected people.

6. According to Measuring scale
Nominal
Ordinal
Interval
Ratio

Nominal scale
Nominal scale is simply a system of assigning number symbols to events in order to label them.
Nominal scales provide convenient ways of keeping track of people, objects and events.
Data are categorical
Examples: Car brands, Gender, Marital status (as 1 for single, 2 for married, 3 for widowed or 4 for divorced)

Allowable operations: counts only; no ranking or numerical operations.
For instance, if we record marital status as 1, 2, 3, or 4 as stated above, we cannot write
4 > 2 or 3 < 4,
3 - 1 = 4 - 2,
1 + 3 = 4 or 4/2 = 2.

Descriptive Statistics used:
o mode (most often observed data category), and
o percent.

Note: averages (mean) and standard deviation are not appropriate!
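
To illustrate which summaries make sense for nominal codes, the sketch below computes the mode and the percentage in each category for the marital status coding given above. The list of responses is made up, and `nominal_summary` is a hypothetical helper, not part of any package.

```python
from collections import Counter

# Nominal codes from the text: 1 single, 2 married, 3 widowed, 4 divorced.
LABELS = {1: "single", 2: "married", 3: "widowed", 4: "divorced"}

def nominal_summary(codes):
    """Return (mode label, {label: percent}) for a list of nominal codes."""
    counts = Counter(codes)
    total = len(codes)
    mode_code, _ = counts.most_common(1)[0]
    percents = {LABELS[c]: 100 * n / total for c, n in sorted(counts.items())}
    return LABELS[mode_code], percents

responses = [2, 1, 2, 4, 2, 3, 1, 2, 1, 2]  # made-up survey responses
mode_label, percents = nominal_summary(responses)
```

The codes are used only as labels here. A mean of the codes would change if the categories were numbered differently, which is why it carries no meaning on a nominal scale.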


Ordinal scale
Data are categorical with a rank-order relationship
One value is greater or less than another, but the magnitude of the difference is unknown.
Examples: A student's rank in his class, rating scales (severity of damage on a scale of 1 to 4; quality of sound of a speaker)
Allowable operations: counts and ranking; no numerical operations
For instance, if A's position in his class is 10 and B's position is 40, it cannot be said that A's position is four times as good as that of B.

Interval scale
Data are numerical values on an equal-interval scale.
(Note: on an interval scale, there is no true zero; it does not have the capacity to measure the complete absence of a trait or characteristic)
Example: temperature in °C
Allowable operations: ranking; addition and subtraction (and therefore averaging); multiplication and division are not meaningful
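
A small numerical illustration of why ratios are not meaningful on an interval scale, using two made-up temperatures. The conversion to kelvin is an addition for illustration: kelvin has a true zero, so ratios of kelvin values are meaningful.

```python
def celsius_to_kelvin(c):
    """Convert °C to kelvin, a scale with a true zero."""
    return c + 273.15

a_c, b_c = 10.0, 20.0            # two temperatures in °C (made-up values)
difference = b_c - a_c           # meaningful on an interval scale
mean_c = (a_c + b_c) / 2         # averaging is also meaningful
naive_ratio = b_c / a_c          # 2.0, but 20 °C is not "twice as hot" as 10 °C
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)  # about 1.035
```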


Ratio scale
Data are numerical values on an equal-interval scale with a uniquely defined zero (For example, the zero point on a centimeter scale indicates the complete absence of length or height)
Examples: height, weight, distance, time to failure of an item, cost of repair, number of replacements under warranty
Allowable operations: all ordinary numerical and mathematical operations.

Scale of measurement

Note:
It is essential to understand the above differences in the nature of data and to suggest appropriate methods to store and analyze them.
Many software packages (e.g. MS Excel and R) do not automatically understand the nature of the data, so we need to explicitly define the data for those tools.

[Figure: scales of measurement ordered from least precise to most precise: Nominal, Ordinal, Interval, Ratio.]

7. According to Failure/Survival characteristic
Complete or Exact failure data
Incomplete or Censored data
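
A minimal sketch of how complete and censored observations might be stored side by side, using a fan observed at 500 days as the illustration. The class and field names are hypothetical, and the labels "right-censored" and "left-censored" are standard reliability terms that the slides themselves do not use.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    EXACT = "failed at exactly this time"
    RIGHT_CENSORED = "had not yet failed at this time"
    LEFT_CENSORED = "failed sometime before this time"

@dataclass(frozen=True)
class Observation:
    unit: str        # hypothetical unit identifier
    days: float      # recorded time in days
    status: Status   # how the recorded time should be read

records = [
    Observation("fan-1", 500, Status.EXACT),
    Observation("fan-2", 500, Status.RIGHT_CENSORED),
    Observation("fan-3", 500, Status.LEFT_CENSORED),
]

# Only exact records give a failure time directly; censored ones give a bound.
exact_times = [r.days for r in records if r.status is Status.EXACT]
n_censored = sum(r.status is not Status.EXACT for r in records)
```

Keeping the status next to the time prevents a censored record from being mistaken for an observed failure time.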


Failure and Censored data

1. Complete or Exact failure data
The value of each sample unit is observed or known.
Example: the fan failed at exactly 500 days

2. Incomplete or Censored data
All of the units in the sample may not have failed or the exact times-to-failure of all the units are not known.
Example: the fan had not yet failed at 500 days; the fan failed sometime before 500 days

Data related problems

Data problems include
Too much data (massive data sets, irrelevant data)
Corrupt and/or noisy data
Too little data (missing entries, missing variables, too few observations)
Fractured data (multiple sources, incompatible data, data obtained at different levels)

Techniques for dealing with these problems include
Data transformation (data filtering, ordering, editing, and modeling)
Interactive techniques (data visualization, elimination, selection, identification of principal components, sampling)
New information generation (time series analysis, data fusion, simulation, dimensional analysis, etc.)

5. Data Processing

What is Data Processing?

[Figure: flow from Data through Process to Information.]

o Data - the raw facts; recorded measures of certain phenomena
o Process - implies editing, coding, classification and tabulation of collected data
o Information - facts in a form suitable for taking decisions by researchers

Data Processing Overview

[Figure: sequence of steps: Validation & Editing, Coding, Classification, Tabulation, Using percentages.]

Data Processing

Step One:
Validation: Confirming the interviews/surveys occurred
Editing: The procedure that improves the quality of the data for coding. That is, the process of checking and adjusting the data
o Consistency
o Completeness
o Questions answered out of order

Step Two:
Coding: Grouping and assigning numeric codes to the question responses. (Codes also may be other character symbols)
Rules for coding:
o Categories should be exhaustive
o Categories should be mutually exclusive and independent

Step Three:
Classification: Large volumes of raw data are reduced into homogeneous groups (if we are to get meaningful relationships). Classification can be (i) according to attributes or (ii) according to class-intervals

Step Four:
Tabulation: Tabulation is the process of summarizing raw data and displaying the same in compact form (i.e., in the form of statistical tables) for further analysis.

Step Five:
Percentages: Percentages are often used in data presentation as they simplify numbers, reducing all of them to a 0 to 100 range.
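
The coding, tabulation and percentage steps above can be sketched in a few lines of Python. The codebook, the catch-all code 9 and the raw answers are all made up for illustration; a real survey would define its own codes.

```python
from collections import Counter

# Step Two, coding: a hypothetical codebook with mutually exclusive categories.
CODEBOOK = {"yes": 1, "no": 2, "don't know": 3}
OTHER = 9  # catch-all code so that the categories are exhaustive

def code_response(answer):
    """Map one raw answer to exactly one numeric code."""
    return CODEBOOK.get(answer.strip().lower(), OTHER)

def tabulate(codes):
    """Steps Four and Five: (code, count, percent) rows, ordered by code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return [(code, n, round(100 * n / total, 1)) for code, n in sorted(counts.items())]

raw = ["Yes", "no", "YES ", "don't know", "No", "maybe", "yes", "no"]  # made-up data
table = tabulate(code_response(a) for a in raw)
# table == [(1, 3, 37.5), (2, 3, 37.5), (3, 1, 12.5), (9, 1, 12.5)]
```

Because every answer falls into exactly one code, the counts add up to the number of responses and the percentages add up to 100.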

6. Graphical Representation of Data

Commonly used graphs and their uses in analysis of data.

Graphs: 3D Scatter Plot, 3D Surface Plot, Area Graph, Bar Chart, Box Plot, Contour Plot, Dot Plot, Empirical CDF, Histogram, Individual Value Plot, Interval Plot, Marginal Plot, Matrix Plot, Pie Chart, Probability Plot, Scatter Plot, Stem-and-Leaf, Time Series Plot

Objective of analysis:
o Assess relationships between pairs of variables
o Assess distributions
o Compare summaries or individual values of a variable
o Assess distributions of counts
o Plot a series of data over time
o Assess relationships among three variables

7. Summary Chart Concerning Analysis of Data

References

Ackoff, R.L. (1989). From Data to Wisdom. J. Appl. Sys. Analysis, 16:3-9.

Bellinger, G., Castro, D., Mills, A. (1997). Data, information, knowledge, and wisdom. From http://www.outsights.com/systems/dikw/dikw.htm

Benyon, D. (1990). Information and Data Modeling. Alfred Waller, Henley-on-Thames

Blischke W.R., Karim, M. R. and Murthy D.N.P. (2011). Warranty Data Collection and Analysis. Springer-Verlag, London Ltd.

Kothari, C.R. (2004). Research Methodology: Methods and Techniques, 2nd Ed., New Age International (P) Ltd.,

Thank you
Any Questions?
