

ISBN 978-80-904948-6-2


Monolingual English Version


The Principles of Probability and Statistics
(data mining approach)
Monolingual English Version
Základy pravděpodobnosti a statistiky
(data miningový přístup)
Jednojazyčná anglická verze

CURRICULUM 2013. First edition.

No part of the present publication may be reproduced or distributed in any way or in
any form without the express permission of the author and of the Publishing House
Curriculum.

The publisher and the author will appreciate any comments concerning the work. These
may be forwarded to the addresses of the publisher and the author presented below.




The grant project was supported by: MARKET PROMOTION INSTITUTE
The Company Corporation 1313 N.Market Street Wilmington, DE 19801-1151,
U.S.A.

The publisher: Publishing House CURRICULUM
Cholupická 39, CZ-142 00 Praha 4, Czech Republic
e-mail: phcurriculum@yahoo.com

The author: Assoc. Prof. RNDr. Přemysl Záškodný, CSc., Emy Destinové 17,
CZ-370 01 České Budějovice, Czech Republic
e-mail: pzaskodny@gmail.com

Affiliation of the author:
The University of South Bohemia, České Budějovice, Czech Republic
The University of Finance and Administration, Praha, Czech Republic

The reviewers:
RNDr. Ivan Havlíček, CSc.
Assoc. Prof. Ing. Vladislav Pavlát, CSc.
Mgr. Petr Procházka
Assoc. Prof. PaedDr. Jana Škrabánková, CSc.


On-line presentation: http://sites.google.com/site/csrggroup/textbook2/

ISBN 978-80-904948-6-2




CONTENTS

Introduction-5

Part 1. The Main Methods of Descriptive Statistics, Statistical Probability-11

1.1. Formulation of Statistical Investigation-11
1.2. Creation of Scale-14
1.3. Measurement-16
1.4. Elementary Statistical Processing-19
1.4.1. Table-21
1.4.2. Empirical Distribution of Frequencies-22
1.4.3. Empirical Parameters-23
1.4.4. Illustration of Calculation of Empirical Parameters-26

Part 2. The Main Methods of Mathematical Statistics, Probability Distribution-28

2.1. Assignment of Theoretical Distribution to Empirical Distribution-28
2.1.1. Interval Division of Frequencies-30
2.1.2. Theoretical Distribution-31
2.1.3. Description of Selected Theoretical Distributions-36
2.1.4. Apparatus of Non-parametric Testing-41
2.1.5. Illustration of Non-parametric Testing-43
2.2. Comparison of Empirical and Theoretical Parameters: Estimations of Theoretical
Parameters, Testing Parametric Hypotheses-46
2.2.1. Basics of Estimation Theory-48
2.2.2. Illustration of Confidence Intervals Construction-50
2.2.3. Basics of Parametric Hypotheses Testing-51
2.2.4. Illustration of Parametric Testing-54
2.3. Measurement of Statistical Dependences: Some Fundaments of Regression
and Correlation Analysis-59
2.3.1. Delimitation of Problem-60
2.3.2. Simple Linear and Quadratic Regression Analysis-62
2.3.3. Simple Linear and Quadratic Correlation Analysis-65
2.3.4. Illustration of Dependence Measurement-66

Part 3. Applications-70

3.1. Description of Statistical and Probability Base of Financial Options-70
3.1.1. Introduction
3.1.2. Financial Options
3.1.3. Statistical and Probability Base of Black-Scholes Model
3.1.4. Statistical and Probability Base of Binomial and Trinomial Model
3.1.5. Statistical and Probability Data Mining Tools: Normal, Binomial and Trinomial Distribution
3.1.6. Conclusion
3.2. Description of Statistical and Probability Base of Greeks-76
3.2.1. Introduction
3.2.2. Greeks
3.2.3. Value Function
3.2.4. Segmentation and Definitions of Greeks
3.2.5. Indications of Greeks
3.2.6. Formulas for Greeks
3.2.7. Needful Statistical and Probability Relations for Deduction of Greeks Formulas
3.2.8. Conclusion, References
3.3. Data Mining Tools in Statistics Education-85
3.3.1. Introduction

3.3.2. Data Mining
3.3.3. Data Preprocessing in Statistics Education
3.3.4. Data Processing in Statistics Education
3.3.5. Complex and Partial Tools of DMSTE: CP-DMSTE, ASM-DMSTE
3.3.6. Conclusion, References
3.3.7. Supplement of Chapter 3.3: The Principles of the Data Mining Approach
3.3.7.1. Quotations from Sources
3.3.7.2. Brief Summary
3.3.7.3. Data Mining Cycle, References

Part 4. Statistical Tables-109

CV of Author-119

Bibliography of Author-120

Global References-122
































THE PRINCIPLES
OF PROBABILITY AND STATISTICS
(DATA MINING APPROACH)


Introduction

The subject of probability and statistics is the application of descriptive statistics,
mathematical statistics and probability theory to the investigation of collective random
phenomena. To describe these applications it is first necessary to deal with descriptive
statistics, mathematical statistics and probability theory themselves. Because the extent of the
presentation of probability and statistics is to a certain degree limited (the study text is
oriented towards concrete branches of study), it will be effective above all to introduce the
main statistical methods, to illustrate them continuously by the assigned example, by a survey
of the acquired concepts and by check questions, to touch marginally on some concepts of
probability theory, and finally to approach the applications. A study structured in this way,
although accessible for both full-time and combined forms of study, must not be confused
with a continuous and coherent study of statistics and probability theory as separate
scientific disciplines.

The structure of the presentation will be introduced by an analytical-synthetic model of
the structure of statistics as a whole. This model can be used for the immediate classification
of a statistical method and for the immediate location of the preceding and follow-up
methods. The model also has a significant cognitive dimension: it shows which operations of
analysis, abstraction and synthesis have to be carried out to complete the adoption of the
relevant statistical method. The model presented in Fig. 1 contains four partial
analytical-synthetic structures. The model in Fig. 1, the legend to Fig. 1 and the description
of the component structural parts are presented only in English.

The following short part of the text, presented only in English, represents the data
mining approach to the study of the principles of statistics and of several needful concepts of
probability. The data mining approach enables one to work with integral concepts and
knowledge pieces in their system shape (see the analytical-synthetic model). The data mining
approach is explained in more detail in Part 3, Applications. The immediate structural
orientation, showing which part of statistics and its probability applications is just being
acquired in the course of the study, is not useless. It is always good to know whether the
selective statistical set (SSS) is only determined (the first partial structure, from element a-1
up to element e-1), whether the empirical picture of the set SSS is already created (the second
partial structure, from element a-2 up to element e-2), whether the probability picture of the
set SSS is already explored (the third partial structure, from element a-3 up to element e-3),
or whether the process of creation of the associative picture of the set SSS has already been
entered (the fourth partial structure, from element a-4 up to element e-4). In addition, the
study of texts in English is a needful assumption for the study of foreign literature.






















Fig. 1 Analytical-synthetic model of statistics and needful probability concepts,
formed by four partial models a-1→e-1, a-2→e-2, a-3→e-3, a-4→e-4



The four partial models chained in Fig. 1 contain the following elements:

a-1: Collective random phenomenon and reason of its investigation
  - Statistical unit, Statistical sign, Variants (values) of statistical sign, Choice of statistical units
e-1 = a-2: Selective statistical set (SSS) as a part of basic statistical set, Goals of statistical examination
  - Creating of scale, Measurement, Statistical probability, Frequencies tables (Empirical distribution), Graphical expression, Empirical parameters
e-2 = a-3: Empirical picture of selective statistical set, Necessity of probable investigation
  - Probability distributions, Choice of acceptable theoretical distribution, Quantification of theoretical parameters, Comparison of theoretical and empirical parameters, Testing non-parametric hypotheses, Point & interval estimation (e.g. confidence interval), Testing parametric hypotheses
e-3 = a-4: Empirical & probable picture of selective statistical set, Necessity of association investigation
  - Statistical dependence (causal, non-causal), Regression analysis, Correlation analysis
e-4: Empirical & probable & association picture of selective statistical set, Interpretation and conclusions as the statistical & probable dimension of investigating the collective random phenomenon
  - Applied probability and statistics (e.g. financial options and their mathematical and statistical elaboration by means of greeks calculation and option hedging models)



LEGEND to whole figure Fig.1



One Sample Analysis, Two / Multiple Sample Analysis

LEGEND to partial models of figure Fig.1



Formulation of statistical examination



Relative & Cumulative Frequencies (Empirical distribution)
Plotting functions: e.g. Plot Frequency Polygon (Graphical expression)
Average-Means (Arithmetic Mean), Variance-Standard Deviation (Determinative Deviation),
Obliqueness (Skewness), Pointedness (Kurtosis) (Empirical parameters)



Theoretical Distributions (partial survey in alphabetical order):
Bernoulli, Beta, Binomial, Chi-square, Discrete Uniform, Erlang, Exponential, F, Gamma,
Geometric, Lognormal, Negative Binomial, Normal, Poisson, Student's, Triangular,
Trinomial, Uniform, Weibull
Testing Non-parametric Hypotheses (hypothesis test for H0: accept or reject H0):
e.g. computed Wilcoxon's test, Kolmogorov-Smirnov test, Chi-square test
e.g. at alpha = 0.05
Point & Interval Estimation:
e.g. confidence interval for Mean, confidence interval for Standard Deviation
Testing Parametric Hypotheses (hypothesis test for H0: accept or reject H0):
e.g. computed u-statistic, t-statistic, F-statistic, Chi-square statistic, Cochran's test,
Bartlett's test, Hartley's test
e.g. at alpha = 0.05




Statistical dependence:
e.g. confidence interval for difference in Means (Equal variances, Unequal variances)
e.g. confidence interval for Ratio of Variances
Regression analysis:
simple / multiple, linear / non-linear
Correlation analysis:
e.g. Rank correlation coefficient, Pearson's correlation coefficient
a-1 → e-1, a-2 → e-2, a-3 → e-3, a-4 → e-4




Description of the four partial analytical-synthetic structures

An example of the applicability of the analytical-synthetic modeling presented via Fig. 1
is the description of statistics as a whole. In the framework of this description it is possible
to indicate four partial analytical-synthetic structures of the statistical dimension of the
investigated problem.

Now these four partial analytical-synthetic structures will be presented. Within this
presentation, let us compare the general model of the analytical-synthetic structure of an
investigated problem (from the investigated phenomenon to the result of the solution given by
intellectual reconstruction) with Fig. 1, "Analytical-synthetic model of statistics formed by
four partial models".

First structure: a-1 → e-1 (see Fig. 1)
From investigated phenomenon (marked a-1)
"Collective random phenomenon and reason of its investigation"
to the result of intellectual reconstruction (marked e-1)
"Selective statistical set as a part of basic statistical set"


Second structure: a-2 → e-2 (see Fig. 1)
From investigated phenomenon (marked a-2)
"Selective statistical set as a part of basic statistical set"
to the result of intellectual reconstruction (marked e-2)
"Empirical picture of selective statistical set"


Third structure: a-3 → e-3 (see Fig. 1)
From investigated phenomenon (marked a-3)
"Empirical picture of selective statistical set"
to the result of intellectual reconstruction (marked e-3)
"Probable picture of selective statistical set"


Fourth structure: a-4 → e-4 (see Fig. 1)
From investigated phenomenon (marked a-4)
"Probable picture of selective statistical set"
to the result of intellectual reconstruction (marked e-4)
"Association picture of selective statistical set"

Applied statistics: a-5 (see Fig. 1)


The structure of the explanation will reflect the model represented by Fig. 1.
Therefore, the content of the individual paragraphs can be described by means of the
structural elements a-1 up to a-5 and e-1 up to e-4. For readers interested in a deeper
understanding, the explanation is complemented both by a chapter explaining some basic
concepts of probability theory and by a survey of basic statistical tables.

The structure of explanation will be as follows:
Part 1. The main methods of descriptive statistics, Statistical probability

1.1. Formulation of statistical investigation
(from element a-1 to element e-1)
1.2. Creation of scale
(from element a-2 to element e-2)
1.3. Measurement, Probability
(from element a-2 to element e-2)
1.4. Elementary statistical processing
(from element a-2 to element e-2)

Part 2. The main methods of mathematical statistics, Probability distribution

2.1. Assignment of theoretical distribution to empirical distribution: testing non-parametric
hypotheses, theoretical probability distributions
(from element a-3 to element e-3)
2.2. Comparison of empirical and theoretical parameters: estimations of theoretical
parameters, testing parametric hypotheses
(from element a-3 to element e-3)
2.3. Measurement of statistical dependences: some fundaments of regression and
correlation analysis
(from element a-4 to element e-4)

Part 3. Applications (element a-5)

3.1. Description of statistical and probability base of financial options
3.2. Description of statistical and probability base of Greeks
3.3. Data Mining Tools in Statistics Education

Part 4. Statistical tables










Part 1. The Main Methods of Descriptive Statistics, Statistical
Probability


1.1. Formulation of Statistical Investigation


Goals:
- Collective random phenomenon and reason of its investigation
- Selective statistical set as a part of basic statistical set



Acquired concepts and knowledge pieces:
Collective random phenomenon, statistical unit, statistical sign (statistical character), values
of statistical sign, basic statistical set (basic statistical file, population), selective statistical
set (sample statistical file)




Check questions:

- What is the subject of investigation of statistics and probability theory?

- What is a collective random phenomenon?

- How is the statistical unit delimited?

- How are a statistical sign and its values delimited?

- What is the difference between the basic and the selective statistical set?

- Why is the process of random selection important?









The explanation will be illustrated by means of the assigned example.

Assigned example:

4000 enterprises have undergone tests of export ability. For preliminary information it
was necessary to determine the average export ability on a scale from 1 to 5 (1 = maximum
export ability, 5 = minimum export ability). That is why 50 tests were randomly selected;
their results are presented in Table Tab.1. Elaborate the collective random phenomenon (the
export ability of an enterprise) gradually and complexly.

x_i   n_i   n_i/n   Σ(n_i/n)   x_i·n_i   x_i²·n_i   x_i³·n_i   x_i⁴·n_i
 1     9    0.18     0.18         9         9          9          9
 2    15    0.30     0.48        30        60        120        240
 3    20    0.40     0.88        60       180        540       1620
 4     4    0.08     0.96        16        64        256       1024
 5     2    0.04     1.00        10        50        250       1250
Σ     50    1.00               125       363       1175       4143

Table Tab.1: The results of the elaboration of the 50 tests
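As an illustration of mine (not part of the original text), the eight columns of Table Tab.1, including the closing summations, can be recomputed from the absolute frequencies alone:

```python
def tab1_columns(scale, counts):
    """Rebuild the eight columns of a Tab.1-style frequencies table:
    x_i, n_i, n_i/n, cumulative n_i/n, and the products x_i^r * n_i
    for r = 1..4, plus the closing column summations."""
    n = sum(counts)
    rows, cum = [], 0.0
    for x, c in zip(scale, counts):
        cum += c / n
        rows.append((x, c, c / n, cum, x * c, x**2 * c, x**3 * c, x**4 * c))
    sums = (n, sum(r[2] for r in rows),
            sum(r[4] for r in rows), sum(r[5] for r in rows),
            sum(r[6] for r in rows), sum(r[7] for r in rows))
    return rows, sums

rows, sums = tab1_columns([1, 2, 3, 4, 5], [9, 15, 20, 4, 2])
print(sums)  # n = 50, relative frequencies summing to 1, then 125, 363, 1175, 4143
```

The four product columns exist only to feed the later calculation of empirical parameters; everything is derived from the pair (scale element, absolute frequency).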


The formulation of a statistical investigation is based on the delimitation of the following concepts:

- collective random phenomenon (CRP)

- statistical unit (SU)

- statistical sign (SS)

- values of statistical sign (VSS)

- basic statistical set and its extent (BSS)

- random selection (RS)

- selective statistical set and its extent (SSS)




A collective random phenomenon CRP (e.g. the export ability of an enterprise) is the
realization of activities or processes whose result cannot be predicted with certainty and
which take place in an extensive set of elements (e.g. enterprises). These elements have a
certain group of identical properties (e.g. the identical type of economic parameter, the
enterprise character) and a group of different properties (e.g. the different values of the
export ability or of the global economic state of an enterprise). Mathematical statistics and
probability theory deal with the qualitative and quantitative analysis of the patterns of
collective random phenomena.

The statistical unit SU is delimited by the identical properties of the investigated set
elements (e.g. the enterprises and their character).

The statistical sign SS is given by one of the different properties of the investigated set
elements (e.g. by the export ability of an enterprise).

The values of the statistical sign VSS are a way of describing the investigated statistical
sign (e.g. the description of the export ability of mining industry enterprises by the
percentage of the mined ore transported for processing within a fortnight from the
extraction).

The basic statistical set BSS (population) is given by all the statistical units; its extent is
equal to the number of all the statistical units (e.g. the extent of the investigated BSS is equal
to the total number of 4000 enterprises in the assigned example). It is usually not within the
practical possibilities of statisticians to investigate the statistical sign SS in all the statistical
units SU, and it is therefore necessary to limit the number of statistical units SU.

Random selection RS limits the number of investigated statistical units SU in such a
way that the results obtained can be transferred to the entire BSS. Various ways of random
selection exist (drawing lots, generating a table of random numerals, deliberate selection). It
is necessary to verify whether the selection obtained can be considered random.

The selective statistical set SSS is given by those statistical units which have been
selected from the basic statistical set by the process of random selection. The extent of the
SSS is equal to the number of selected statistical units (e.g. the extent of the SSS in the
assigned example is equal to the number of 50 selected enterprises). A selective statistical set
SSS is one-dimensional if only one statistical sign is investigated, and multidimensional if
more statistical signs are investigated.
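As a minimal sketch (my own illustration; the function name and the fixed seed are assumptions, not from the text), random selection without replacement can be carried out with the Python standard library:

```python
import random

def select_sss(bss_extent, sss_extent, seed=None):
    """Random selection (RS): draw the indices of the statistical units
    forming a selective statistical set (SSS) out of a basic statistical
    set (BSS), without replacement (as when drawing lots)."""
    rng = random.Random(seed)
    return rng.sample(range(bss_extent), sss_extent)

# The assigned example: 50 enterprises drawn from a BSS of 4000
units = select_sss(4000, 50, seed=2013)
print(len(units), len(set(units)))  # 50 distinct units
```

Fixing the seed makes the draw reproducible, which matters when the verification of randomness mentioned above is to be repeated.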

The formulation of the statistical investigation is implemented in the assigned example
by the delimitation of the selective statistical set of 50 enterprises. In the context of this
delimitation, all the follow-up concepts must be exactly characterized: the investigated
collective random phenomenon CRP, the definition of the statistical unit SU, the
determination of the investigated statistical sign SS, the characterization of the statistical
sign values VSS, the exact delimitation of the basic statistical set BSS and, finally, the
ensuring of the procedure of random selection RS.




1.2. Creation of Scale

Goals:
- Creation of scale (scaling)
- Choice of scale type


Acquired concepts and knowledge pieces:
Scale, classification of scales, parameters of the selected type of scale



Check questions:

- What is the creation of a scale?

- According to which facts is it possible to distinguish the types of scales?

- What are the basic types of scales?

- What is the difference between the quantitative metric scale and the absolute metric scale?



The scale creation is the suitable expression of the statistical sign values by means of
scale elements. The point is that the statistical sign values can be divided into reasonable
groups, the scale elements. The system of scale elements creates the scale. The number k of
scale elements can be calculated, for example, by Sturges' rule k = 1 + 3.3 log10(n), where n
is the extent of the selective statistical set SSS.

According to the nature of the statistical sign it is possible to distinguish, e.g., four
types of scales: qualitative (nominal), ordinal, quantitative metric and absolute metric. The
classification of scales can also be used to classify statistical signs. In some cases the
statistical sign values immediately identify the scale, and scaling isn't necessary.
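Sturges' rule from the paragraph above can be computed directly; rounding the result to the nearest integer is a common convention and is my assumption here, not a prescription of the text:

```python
import math

def sturges_k(n):
    """Number k of scale elements by Sturges' rule, k = 1 + 3.3 * log10(n)."""
    return round(1 + 3.3 * math.log10(n))

print(sturges_k(50))  # for the assigned example with n = 50, k = 7
```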

The nominal scale is a classification into categories (the scale elements are the
individual categories). For every two statistical units of the selective statistical set it is
possible to decide whether, in terms of the investigated statistical sign, they are identical or
different (such as gender or employment, if the statistical units are individual persons).


The ordinal scale enables one not only to decide on the identity or diversity of the
statistical units, but also to establish their order (e.g. the achieved degree of scholastic
education). The scale elements are the individual ranks. This scale doesn't enable one to
determine the distance between two neighbouring statistical units arranged according to it.

The quantitative metric scale already enables one to establish the distance between two
neighbouring statistical units; from this perspective it is needful to define the unit of the scale
(e.g. the percentage evaluation of export ability or of another parameter of the global
economic condition, or the temperature in degrees Celsius). The scale elements are the
individual points of the scale expressed by numerical sizes. The quantitative metric scale
expresses the values of the statistical sign without the possibility of factually interpreting the
beginning (zero point) of the scale; the choice of the scale beginning is a question of free
choice.

The absolute metric scale is a quantitative metric scale whose beginning can, in
addition, be factually interpreted: the scale zero responds to the real zero value of the
investigated statistical sign (e.g. the temperature in degrees Kelvin, the number of errors in
testing, the length of school attendance). The scale elements are the individual points of the
scale, expressed by numerical sizes, together with the absolute zero of the scale. Only the
absolute metric scale enables one to calculate ratios; the proportion of any two points of the
scale doesn't depend on the choice of the scale unit.

In the assigned example the statistical sign values, the degrees of export ability, are
given by the degrees 1, 2, ..., 5. It is evident that a way of expressing export ability had to be
produced (e.g. degree 1: 100%-80% of the mined ore exported by an enterprise of the mining
industry, degree 2: 80%-60% of the mined ore exported, ..., degree 5: 20%-0% of the mined
ore exported), so that the degrees 1, 2, ..., 5 can be identified with the scale, which is a
typical quantitative metric scale. The scale elements are the points of the scale expressed by
the numerical sizes x_1 = 1, x_2 = 2, ..., x_5 = 5. This scale should reflect an identical
distance (e.g. 20%) of export ability between any two neighbouring scale elements.









1.3. Measurement


Goals:
- Process of measurement
- Expression of measurement results



Acquired concepts and knowledge pieces:
Measurement, absolute frequency, relative frequency, cumulative frequencies



Check questions:

- What is measurement within the statistical elaboration of a collective random
phenomenon?

- What does the selection of the measurement method depend on?

- What conditions must the measurement method fulfil?

- What are the results of measurement?

- What is the statistical definition of probability?

- How are the absolute and relative frequencies defined?

- How are the cumulative frequencies defined?



Measurement is the process by which one of the k scale elements x_1, x_2, ..., x_k is
assigned to each statistical unit SU of the selective statistical set SSS (with an extent of n
statistical units). The measurement results are the findings that the scale element x_i
(i = 1, 2, ..., k) was measured n_i times. The summation of all the values n_i (i = 1, 2, ..., k),
the so-called absolute frequencies, must be equal to the extent n of the selective statistical
set SSS.


The potential results of measurement (i = 1, 2, ..., k) can be evaluated by the size of the
probability with which they appear in the course of measurement. The statistical definition of
probability works with n independently carried out measurements (the number of
measurements n corresponds to the extent of the selective statistical set SSS) and with the
discovered absolute frequencies n_i of the potential measurement results. The statistical
probability p(x_i) of the result x_i is then given by the so-called relative frequency n_i / n.
The summation of all the relative frequencies must be equal to 1.

The cumulative frequencies can also be classified as results of the measurement. The
cumulative frequency Σ(n_i / n) is the probability that the measured result will be lesser than
or equal to the result x_i. It is evident that the cumulative frequencies can be detected only
within quantitative metric or absolute metric scales. The cumulative frequencies are, for
example, of great significance in the construction of financial or economic balance sheets.
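The relative and cumulative frequencies of the two paragraphs above can be sketched as follows (an illustration of mine, not the author's code):

```python
def relative_and_cumulative(counts):
    """Statistical probabilities p(x_i) = n_i / n and the cumulative
    frequencies (probability of a result lesser than or equal to x_i)."""
    n = sum(counts)
    relative = [c / n for c in counts]
    cumulative, total = [], 0.0
    for p in relative:
        total += p
        cumulative.append(total)
    return relative, cumulative

# Absolute frequencies n_i of Tab.1:
rel, cum = relative_and_cumulative([9, 15, 20, 4, 2])
print(rel)  # relative frequencies (third column of Tab.1)
print(cum)  # cumulative frequencies (fourth column of Tab.1)
```

The last cumulative value is always 1, which mirrors the requirement that the relative frequencies sum to 1.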

Within the assigned example it is possible to discover through Table Tab.1 that the
work proceeded with a scale created by the 5 elements x_1 = 1, x_2 = 2, ..., x_5 = 5 (see the
first column of the table), whose absolute frequencies were n_1 = 9, n_2 = 15, n_3 = 20,
n_4 = 4, n_5 = 2 (see the second column of the table). The relative frequencies n_i / n are
then presented in the third column of the table, the cumulative frequencies in the fourth
column. Of the selective statistical set of fifty enterprises (n = 50), 9 enterprises were at the
maximum export ability (the probability of this degree is 0.18), 15 enterprises were at the
degree just below the highest (probability 0.30), 20 enterprises were at the middle export
ability (probability 0.40), 4 enterprises were at the degree below the middle (probability
0.08) and 2 enterprises were at the lowest degree of export ability (probability 0.04).

Within the assigned example the cumulative frequency of, e.g., the result x_3 = 3 is
given by the probability 0.88. This probability, that the degree 1, 2 or 3 will be determined
within the investigation of the export ability degree, can be determined by the summation of
the probabilities p(1) + p(2) + p(3) = 0.18 + 0.30 + 0.40 = 0.88. So the probability of
detecting at most the middle degree is significantly high.

In the case of a quantitative metric or absolute metric scale, measurement can be
considered a projection of the set of statistical units (e.g. of the selective statistical set) into
the set of real numbers.


The measurement methods depend on the expert field in which the investigated selective
statistical set SSS was defined. They will be different, e.g., in the investigation of a collective
random phenomenon in sociology (various questionnaire forms of measurement) and in the
investigation of a collective random phenomenon in economy (various ways of measuring
export ability before and after the application of an economic optimization of the enterprise).

The measurement method shall comply with the conditions of validity (whether what is
to be measured is actually measured), reliability (reproducibility of the measurements) and
objectivity (whether various evaluators will measure the statistical unit in the same way).

The measurement results of the investigated selective statistical set SSS are given by the
information on the statistical sign values, i.e. by the information on the absolute and relative
frequencies of the individual scale elements and by the information on the cumulative
frequencies.































1.4. Elementary Statistical Processing





Goals:
- Goals of investigation of descriptive statistics

- Empirical picture of selective statistical set






Acquired concepts and knowledge pieces:

Frequencies tables

Empirical distribution

Graphical expression

Plotting function: graphical expression of empirical distribution

Frequency polygon

Empirical parameters

General moments, e.g. average-means (arithmetic mean)

Central moments, e.g. variance-standard deviation (determinative deviation)

Standardized moments, e.g. obliqueness (skewness), pointedness (kurtosis)











Check questions:

- What are the main goals of the elementary statistical processing?

- How can the measurement results be arranged in a suitable way?

- How can the measurement results be graphically expressed in a suitable way?

- How can the parameters of the measurement results be expressed in a suitable way?

- What is the empirical distribution of frequencies?

- How can the empirical distribution of a one-dimensional statistical set be expressed
graphically?

- What is the frequency polygon?

- What is the significance of the graphical expression of the empirical distribution?

- How can the empirical parameters be divided according to the described feature of the
investigated statistical set?

- How can the empirical parameters be divided according to the way of calculation?

- How are the general, central and standardized moments defined?

- What are the most important parameters of location, variability, skewness and kurtosis,
and what is the statistical interpretation of these parameters?

- How is the excess quantity defined and what is its significance?




The measurement results need to be arranged, expressed graphically and expressed by
suitable empirical parameters. These assignments can be fulfilled by the elementary
statistical processing. The empirical picture of the investigated selective statistical set SSS is
the result of the elementary statistical processing. The elementary statistical processing also
completes the group of major statistical methods that can be called descriptive statistics.

The partial assignments (arrangement, graphical expression and expression by
parameters) are represented in three basic results of the elementary statistical processing:
the table, the empirical distributions (preferably in the shape of a polygon) and the empirical
parameters.


1.4.1. Table

The table represents a form of arrangement of the measurement results. The description
of the table can be followed in Table Tab.1 of the assigned illustrating example.

The table contains eight columns. The first four columns are necessary partly for
the display of the measurement results (fulfilment of the task of arrangement) and partly for
the representation of the empirical distributions (fulfilment of the task of graphical
expression). The remaining four columns have a helping significance; they can be used for
an easy and quick calculation of the empirical parameters (fulfilment of the task of
expression by parameters).


The first four columns contain:

1. column marked x_i: the scale elements
2. column marked n_i: the absolute frequencies of the scale elements
3. column marked n_i / n: the relative frequencies of the scale elements
4. column marked Σ(n_i / n): the cumulative frequencies

The following four columns contain the products needed for the calculation of the empirical
parameters:

5. column contains the products x_i·n_i
6. column contains the products x_i²·n_i
7. column contains the products x_i³·n_i
8. column contains the products x_i⁴·n_i


The table is closed by the summations of the data in the individual columns. In the first
four columns these summations have a checking significance; in the other four columns they
are needed for the calculation of the empirical parameters.
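The closing summations can be turned directly into the first empirical parameters. In this sketch (mine, using the standard textbook formulas rather than code from the text), the mean is the first general moment Σx_i·n_i / n and the variance is the second general moment minus the squared mean:

```python
def general_moment(scale, counts, r):
    """General empirical moment of order r: sum(x_i^r * n_i) / n,
    i.e. a column summation of Tab.1 divided by the extent n."""
    n = sum(counts)
    return sum(x**r * c for x, c in zip(scale, counts)) / n

scale, counts = [1, 2, 3, 4, 5], [9, 15, 20, 4, 2]
mean = general_moment(scale, counts, 1)                 # 125 / 50 = 2.5
variance = general_moment(scale, counts, 2) - mean**2   # 363 / 50 - 2.5**2
print(mean, variance)
```

The third and fourth column summations (1175 and 4143) feed the skewness and kurtosis in the same way, via the third and fourth moments.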


1.4.2. Empirical Distributions of Frequencies

The empirical distributions of frequencies can be divided into two basic types. The first
type assigns the corresponding absolute frequencies n_i or relative frequencies n_i / n to the
scale elements x_i. The second type assigns the corresponding cumulative frequencies
Σ(n_i / n) to the scale elements x_i.

The graphical expression of the empirical distribution of a one-dimensional statistical
set is connected with the use of a coordinate system in the plane. In this coordinate system
the scale elements x_i are always applied to the horizontal axis, the corresponding
frequencies to the vertical axis. The graphical expression of these functional dependences is
given by a set of points whose first coordinate is always a scale element x_i and whose
second coordinate is the corresponding frequency. By connecting the neighbouring points of
this set with line segments it is possible to obtain the broken line which is called a polygon. It
is possible to distinguish the polygon of absolute frequencies, the polygon of relative
frequencies and the polygon of cumulative frequencies.

In addition to the graphical expression of empirical distributions by polygons, a range of auxiliary graphical representations is used. Their advantage is a departure from the mathematically exact apparatus and a certain quickness of orientation. Their shortcoming is the impossibility of continuing with the deeper apparatus of mathematical statistics, above all from the point of view of the investigation of dependences in multi-dimensional statistical sets. Bar charts, bar graphs, pie charts, etc., belong to these auxiliary graphical representations. Generally, it is possible to recommend resorting uniquely to the exact graphical expression.
The significance of the graphical expression of the empirical distribution is substantial. The graphical expression enables an immediate investigation of which theoretical distribution (in terms of probability theory) is close to the empirical distribution obtained as a result of descriptive statistics. A further significance consists in the immediate evaluation of the parameters of location, variability, skewness and kurtosis of the empirical distribution and, in this way, also of the investigated statistical set.


Within the assigned example it is possible to practice, e.g., the construction of the polygons of absolute and cumulative frequencies. The polygon of absolute frequencies is represented in figure Fig.2, the polygon of cumulative frequencies in figure Fig.3.


Fig.2 Absolute frequencies polygon Fig.3 Cumulative frequencies polygon


1.4.3. Empirical Parameters

The empirical parameters briefly and simply express the nature of the investigated statistical set. The empirical parameters are mostly related to a selective statistical set; that's why they often bear the name selective parameters. As selective parameters they themselves have a statistics-probability character and for this reason they behave as a special group of statistical signs. This view will not be developed in the following explanation, but it is necessary to draw attention to it, especially from the point of view of a deeper study of statistics and probability theory.
The empirical parameters can be classified according to the feature of the investigated statistical set (investigated statistical sign):

parameters of location
parameters of variability
parameters of obliqueness (skewness)
parameters of pointedness (kurtosis)


The second classification is the classification of empirical parameters according to the way of their calculation:

- moment parameters (they work as a function of all values of the statistical sign)
- quantile parameters (they represent only certain values of the statistical sign)

The quantile parameters are closely related to the moment parameters but they are constructed in a different way. An empirical quantile is always a certain value of the statistical sign (which is expressed by a quantitative metric or absolute metric scale). That value divides the numbers of smaller and greater values of the statistical sign in a certain ratio. E.g., the quantile dividing the values of the statistical sign into two identical parts (i.e. the fifty-percent quantile) is called a median. The quantile parameters will not be investigated in more detail.

The moment parameters are divided into general moments, central moments and standardized moments. The location parameter (arithmetic mean) can be accurately characterized using the general moment of 1st order, the variability parameter (empirical variance) using the central moment of 2nd order, and the obliqueness (skewness) and pointedness (kurtosis) using the standardized moments of 3rd and 4th order.

As the standardized moments can be calculated using the central moments and the central moments using the general moments, the following procedure will be selected in the next explanation (within this procedure the investigated statistical sign will be marked by the letter x; the marks of the statistical sign values x_i, of the absolute frequencies n_i and of the selective statistical set extent n don't change):

- Presentation of common relations for general and central moments
- Expression of the needful central moments using general moments
- Expression of the needful standardized moments using central moments

a) The common relations for general and central moments

General moment of r-th order:    O_r(x) = (1/n) · Σ n_i · (x_i)^r

General moment of 1st order:     O_1(x) = x̄   (arithmetic mean)

Central moment of r-th order:    C_r(x) = (1/n) · Σ n_i · (x_i − x̄)^r

Central moment of 2nd order:     C_2(x) = S_x^2   (empirical variance)

Determinative (standard) deviation:   S_x = √C_2(x)

(The summations run over all the scale elements x_i.)

b) The expression of the needful central moments using general moments

C_2(x) = O_2(x) − [O_1(x)]^2

C_3(x) = O_3(x) − 3·O_2(x)·O_1(x) + 2·[O_1(x)]^3

C_4(x) = O_4(x) − 4·O_3(x)·O_1(x) + 6·O_2(x)·[O_1(x)]^2 − 3·[O_1(x)]^4


c) The expression of the needful standardized moments using central moments

N_3(x) = C_3(x) / [C_2(x) · √C_2(x)]

N_4(x) = C_4(x) / [C_2(x)]^2
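The moment relations under a), b) and c) can be combined into a short computational sketch (Python; the frequency data are hypothetical and chosen so the results are easy to verify by hand):

```python
def general_moment(xs, ns, r):
    """General moment O_r(x) = (1/n) * sum(n_i * x_i**r)."""
    n = sum(ns)
    return sum(ni * x ** r for x, ni in zip(xs, ns)) / n

def empirical_parameters(xs, ns):
    """Arithmetic mean, empirical variance, skewness N_3 and kurtosis N_4,
    computed through the relations ad b) and ad c) above."""
    O1, O2, O3, O4 = (general_moment(xs, ns, r) for r in (1, 2, 3, 4))
    C2 = O2 - O1 ** 2
    C3 = O3 - 3 * O2 * O1 + 2 * O1 ** 3
    C4 = O4 - 4 * O3 * O1 + 6 * O2 * O1 ** 2 - 3 * O1 ** 4
    N3 = C3 / (C2 * C2 ** 0.5)
    N4 = C4 / C2 ** 2
    return O1, C2, N3, N4

# Hypothetical symmetric sample: values 1, 2, 3 with frequencies 1, 2, 1.
mean, var, skew, kurt = empirical_parameters([1, 2, 3], [1, 2, 1])
```

For this symmetric sample the skewness comes out exactly zero, as expected.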


The procedure for the calculation of the general, central and standardized moments was realized using the steps ad a), ad b) and ad c). Since all the needful moment parameters can be determined using this procedure, it is now possible to describe the parameters of location, variability, obliqueness (skewness) and pointedness (kurtosis).

The location parameter is determined by the general moment of 1st order O_1(x) and it bears the name arithmetic mean. The position of the empirical distribution of frequencies is its location on the horizontal axis of the coordinate system.

The variability parameter is determined by the central moment of 2nd order C_2(x) and it bears the name empirical variance (the square root of the variance then bears the name standard deviation). The determinative (standard) deviation shows what information value is given to the arithmetic mean. If the determinative (standard) deviation is large, the information value of the arithmetic mean is small, and vice versa.


The obliqueness parameter (skewness) is dominantly determined using the standardized moment of 3rd order N_3(x) and it then bears the name coefficient of skewness. If the skewness coefficient is positive, then the scale elements lying to the left of the arithmetic mean have greater frequencies (a positively skewed distribution of frequencies, i.e. a greater concentration of the lower scale elements, of the smaller values of the statistical sign), and vice versa.

The pointedness parameter (kurtosis) is dominantly determined using the standardized moment of 4th order N_4(x) and it then bears the name coefficient of kurtosis. For a given variance, a greater value of the kurtosis coefficient corresponds to a more pointed distribution of frequencies. The quantity excess, defined by the relation E_x = N_4(x) − 3, is used as well. The excess compares the kurtosis of the empirical distribution with the kurtosis of the known standardized normal distribution. If the excess is positive, the empirical distribution is more pointed than this distribution.


1.4.4. Illustration of Calculation of Empirical Parameters

In the assigned example the calculation of the empirical parameters of location, variability, skewness and kurtosis will now be carried out. First the general moments of 1st to 4th order will be calculated using the 5th up to 8th columns of table Tab.1.

O_1(x) = 2.50
O_2(x) = 7.26
O_3(x) = 23.50
O_4(x) = 82.86

The next part of the procedure will consist in the calculation of the central moments of 2nd up to 4th order:

C_2(x) = 1.031   (standard deviation S_x = 1.015)
C_3(x) = 0.300
C_4(x) = 2.922

The final part of the procedure of the calculation of empirical parameters will be aimed at the determination of the standardized moments of 3rd and 4th order and of the excess:


N_3(x) = C_3(x) / [C_2(x) · √C_2(x)] = 0.28

N_4(x) = C_4(x) / [C_2(x)]^2 = 2.75

E_x = N_4(x) − 3 = −0.25

The location parameter (arithmetic mean) O_1(x) points to the placement of the empirical distribution of frequencies on the horizontal axis: the arithmetic mean of export ability is 2.5 (a lower value than the middle degree of export ability).

The determinative (standard) deviation, expressed by the square root of C_2(x), gives an indication of the information value of the arithmetic mean. This indication can be quantified in the following way: roughly 70% of the enterprises are situated in the range from export ability degree 1.5 to export ability degree 3.5 (the applicability of this information depends on whether the empirical distribution can be substituted by the theoretical normal distribution).

The positive skewness coefficient N_3(x) points to a greater concentration of the lower scale elements, i.e. of the lower degrees of export ability development. The figure Fig.2 confirms that determination by the slight asymmetry to the left of the arithmetic mean.

The relatively high value of the kurtosis coefficient and also the value of the excess point to comparability with the kurtosis of the standardized normal distribution. This finding additionally supports the conclusion of the good information value of the arithmetic mean.












Part 2. The Main Methods of Mathematical Statistics,
Probability Distribution

2.1. Assignment of Theoretical Distribution to Empirical Distribution


Goals:

Probabilistic investigation of the selective statistical set: Choice of an acceptable theoretical distribution


Probabilistic picture of the selective statistical set: Testing non-parametric hypotheses




Acquired concepts and knowledge pieces:

Theoretical distributions, partial survey in alphabetical order:

Bernoulli, Beta, Binomial, Chi-square, Discrete Uniform, Erlang, Exponential, F, Gamma, Geometric, Lognormal, Negative binomial, Normal, Poisson, Student's, Triangular, Uniform, Weibull


Testing non-parametric hypotheses

Test of the zero hypothesis H_0

Accepting or rejecting the zero hypothesis H_0

Level of statistical significance α, e.g. at α = 0.05










Check questions:

Why is it advantageous to substitute an empirical distribution by a theoretical distribution?
Describe the division of the extent of statistical sign values into a suitable number of intervals.
What is the interval division of frequencies, and what is the condition for the creation of the interval division of frequencies in the case of testing non-parametric hypotheses?
What is a random attempt and a random variable?
How are random variables divided?
How do the values of a discrete and a continuous random variable differ?
How is the theoretical distribution (the distribution of a random variable) defined?
How are theoretical distributions divided?
What is the form of the description of a discrete theoretical distribution?
What is the form of the description of a continuous theoretical distribution?
What is the difference between the probability function and the probability density?
What is the significance of the binomial distribution?
What is the significance of the normal distribution?
What is the formulation of the central limit theorem?
Present the form of the distribution function of the binomial and normal distribution.
Present the form of the probability function (probability density) of the binomial distribution (normal distribution).
On how many theoretical parameters do the binomial and normal distribution depend? Describe the theoretical parameters.
What is the standardized normal distribution?
What are the common relations for the mean value and variance of a discrete and a continuous theoretical distribution?
What is the relation between empirical and theoretical parameters?
What does the law of large numbers express?
What is the apparatus of non-parametric testing?
What do the zero and alternative hypotheses suppose in the case of non-parametric testing?
What is the essence of testing non-parametric hypotheses?
What are the theoretical distributions used for testing non-parametric hypotheses?
What is the relation of the theoretical distribution and the statistical criterion?
What is the relation of the experimental value and the critical theoretical value of the statistical criterion?
What is the critical domain of the statistical criterion?
Describe the chi-square testing technique.
What is the level of statistical significance?
What is a Type I error?







The assignment of a theoretical distribution to an empirical distribution is the expression of the content of the statistical method which bears the name testing non-parametric hypotheses. Within this statistical method it will be needful to deal with the interval division of frequencies, the concept of theoretical distribution, the apparatus of non-parametric testing and the assigned example. The significance of testing non-parametric hypotheses consists above all in the fact that it is always more advantageous to substitute an empirical distribution by a theoretical distribution: a simple mathematical apparatus is connected with the theoretical distribution and such an apparatus enables one to detect information inaccessible in another way.


2.1.1. Interval Division of Frequencies

In some cases (e.g., for the needs of non-parametric testing) it is useful to divide the extent of statistical sign values or the extent of metric scale elements into a certain number of intervals. The corresponding values of the statistical sign or the corresponding elements of the metric scale will then be included in each of the created intervals. Usually it is recommended to construct 5-20 intervals of the same length; empirical rules (working with the extent n of the selective statistical set SSS) also exist for a rough delimitation of the interval number k (e.g. Sturges' rule k = 1 + 3.3·log10(n)). It is needful to dedicate relevant attention also to the determination of the interval boundaries.
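Sturges' rule is straightforward to evaluate; a minimal sketch (Python):

```python
import math

def sturges(n):
    """Sturges' rule for the rough number of intervals: k = 1 + 3.3*log10(n)."""
    return round(1 + 3.3 * math.log10(n))

# For the extent n = 50 used in the assigned example the rule suggests 7 intervals
# (the example itself works with 5, which also lies in the recommended 5-20 range).
k = sturges(50)
```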

Within the assigned example it will be determined whether the empirical distribution in figure Fig.1 can be substituted by a normal distribution. This intention leads to the determination of the number of intervals and the interval boundaries as presented in table Tab.2.


 x_i | interval   | n_i | n_i/n | Σ(n_i/n) | x_i·n_i | x_i^2·n_i | x_i^3·n_i | x_i^4·n_i
  1  | (−∞; 1.5)  |  9  | 0.18  |   0.18   |    9    |     9     |     9     |     9
  2  | (1.5; 2.5) | 15  | 0.30  |   0.48   |   30    |    60     |   120     |   240
  3  | (2.5; 3.5) | 20  | 0.40  |   0.88   |   60    |   180     |   540     |  1620
  4  | (3.5; 4.5) |  4  | 0.08  |   0.96   |   16    |    64     |   256     |  1024
  5  | (4.5; ∞)   |  2  | 0.04  |   1.00   |   10    |    50     |   250     |  1250
  Σ  |            | 50  | 1.00  |          |  125    |   363     |  1175     |  4143

Table Tab.2: Interval division of frequencies



2.1.2. Theoretical Distribution

The concept of theoretical distribution is one of the fundamental concepts of probability theory. The collective random phenomenon CRP, which is the subject of both statistics and probability theory, is investigated in probability theory by means of the concepts of random attempt and random variable. A random attempt is a realization of activities or processes whose result it is not possible to anticipate with certainty. A random variable RV is then a variable whose value is definitely determined by the result of a random attempt.

The value of a random variable VRV is a concept which has a strong theoretical dimension. A certain analogy of this concept, whose origin can be discovered in probability theory, is the concept of the value of a statistical sign VSS, whose origin can be discovered in descriptive statistics. The concept of the value of a statistical sign VSS thus has, on the contrary, a strong empirical dimension.

The random variables RV can be divided into discrete (the values of a discrete random variable don't follow one another continuously and they will be marked x_i) and continuous (the values of a continuous random variable will be marked x and these values follow one another continuously; it isn't possible to find the nearest neighbouring value). To the values of a random variable it is possible to assign the probabilities with which they come up in the course of the random attempt. These probabilities can be defined in a classical way (the number of random attempt results favourable to a given value divided by the number of all random attempt results) or, e.g., according to Kolmogorov (by the application of measure theory).

The rule that assigns a probability to every value of the random variable, or to every interval of values, is called the law of random variable distribution, shortly the random variable distribution or also the theoretical distribution. From the point of view of the cooperation between probability theory and statistics, the concept of theoretical distribution is adequate to the statistical concept of the empirical distribution of frequencies. According to the essence of the random variable RV the theoretical distributions can be divided into discrete and continuous ones.

The distribution function F is an important form of the theoretical distribution description. In the case of a discrete random variable the distribution function F quotes the probability that the random variable RV obtains values smaller than or equal to a just chosen value x_i; this cumulative probability will be expressed by a summation of partial probabilities. In the case of a continuous random variable the distribution function F quotes the probability that the random variable RV obtains values smaller than or equal to a just selected value x, but this cumulative probability will be expressed, instead of by a summation, by an integral whose lower limit is −∞ (or the lowest possible value of the random variable) and whose upper limit corresponds with the selected value x. From the point of view of the cooperation between probability theory and statistics, the concept of distribution function is adequate to the statistical concept of the empirical distribution of cumulative frequencies.

a) Binomial distribution: the example of a discrete theoretical distribution

The characteristic of the collective random phenomenon

The n independent random attempts are carried out, the probability of the monitored random phenomenon is the same in all the random attempts and it is equal to p. The probability that this phenomenon occurs 0, 1, ..., n times is sought. According to this definition the values x_0, x_1, ..., x_n of the relevant random variable are given by the numbers 0, 1, ..., n.

Theoretical distribution, distribution function

The theoretical distribution is called the probability function in the discrete case. For the described random phenomenon the probability function is a rule which assigns the probabilities P_i for i = 0, 1, ..., n to the values x_i of the random variable. The form of the probability function is

P_i = C(n, i) · p^i · (1 − p)^(n−i),

where C(n, i) = n! / [i!·(n − i)!] is the binomial coefficient. The relevant form of the distribution function (cumulative probability) F(x_j) = F_j is given by the summation

F_j = Σ P_i ,

where the adding index i obtains the values from 0 to j.
The binomial distribution depends on two theoretical parameters p, n.

The significance of the binomial distribution

A typical example of independent random attempts is a random selection of elements from a set where the selected element is returned back, the so-called selection with return. It can be shown that, in the case where the extent of the selective set is small in comparison with the extent of the basic set, the difference between the selection with return and the selection without return is insignificant. The binomial distribution can therefore serve as a suitable criterion of whether the selective statistical set was created on the basis of random selection.
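The probability function and distribution function above can be evaluated directly (a Python sketch; `math.comb` supplies the binomial coefficient):

```python
from math import comb

def binom_pmf(i, n, p):
    """Probability function P_i = C(n, i) * p**i * (1 - p)**(n - i)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def binom_cdf(j, n, p):
    """Distribution function F_j: summation of P_i for i = 0..j."""
    return sum(binom_pmf(i, n, p) for i in range(j + 1))
```

By construction the probabilities over i = 0, ..., n add up to 1, which mirrors the empirical standardizing condition for relative frequencies.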

b) Normal distribution: the example of a continuous theoretical distribution

The characteristic of the collective random phenomenon

A continuous random variable whose values x ∈ (−∞, ∞) can have a normal distribution. The graph of the function which assigns the probabilities to these values of the random variable is given by the well-known Gauss curve in the shape of a bell. What is sought is the probability assigned to a unit interval of continuous random variable values, in the sense that this interval will contain the value x.

Theoretical distribution, distribution function

The theoretical distribution is called the probability density in the continuous case (the random variable values continuously follow one another; it is needful to assign the probabilities to unit intervals of values because the nearest neighbouring value to the value x isn't possible to find). The form of the probability density is

φ(x) = 1/(σ·√(2π)) · exp(−(x − μ)^2 / (2σ^2)).

The relevant form of the distribution function (cumulative probability) F(t) is given by the integral

F(t) = ∫ φ(x) dx ,

where the lower integral limit acquires the value −∞ and the upper limit the value t.

The normal distribution depends on two theoretical parameters μ, σ. This dependence is usually recorded N(μ, σ). The theoretical parameter μ is a theoretical analogy of the general moment of 1st order O_1(x) and so it is a theoretical analogy of the empirical arithmetic mean x̄. The theoretical parameter σ is a theoretical analogy of the square root of the central moment of 2nd order C_2(x) and so it is a theoretical analogy of the empirical standard (determinative) deviation S_x.
The normal distribution can be normalized to the values of the theoretical parameters μ = 0, σ = 1 by means of the standardized random variable

u = (x − μ) / σ .

This dependence is usually recorded N(0,1) and the so-called standardized normal distribution (see figure Fig.4) is then marked by this record. The probability density of the standardized normal distribution will be marked φ(u) due to the introduced variable u; the distribution function is often called the Laplace function and marked by the record F(u). Very detailed statistical tables are elaborated for the values of the Laplace function. The graphical representation of the probability density of the standardized normal distribution is in figure Fig.4.

Fig.4 Graphical representation of the probability density φ(u) of the standardized normal distribution (the values u are applied on the horizontal axis, the values of the probability density φ(u) on the vertical axis)

The significance of the normal distribution

The significance of the normal distribution is described by the central limit theorem. Its essence is the statement that a random variable, created as the summation of a large number of mutually independent random variables, has approximately the normal distribution under very general conditions. The exact formulation is presented by the Lyapunov theorem, a component of which is the condition enabling one to work with a normal distribution for a sufficiently big extent of the selective set. The special forms of that theorem, the Lindeberg-Lévy theorem and the de Moivre-Laplace theorem (this theorem shows that for a sufficiently big number of independent attempts the binomial distribution converges to the normal distribution), are useful, too.
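The theorem can be illustrated numerically: sums of k independent U(0,1) variables, each with mean 1/2 and variance 1/12, should behave approximately like a normal variable with mean k/2 and variance k/12. A simulation sketch (Python, hypothetical parameters):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def sum_of_uniforms(k):
    """One realization of the sum of k independent U(0,1) random variables."""
    return sum(random.random() for _ in range(k))

# For k = 12 the theoretical mean is 6 and the theoretical variance is 1.
k, trials = 12, 20000
samples = [sum_of_uniforms(k) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
```

The empirical mean and variance land close to the theoretical values 6 and 1, even though each summand is far from normal.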

c) Parameters of theoretical distributions

For the discrete theoretical distributions P_i will mark the probability function and x_i the values of the random variable RV. For the continuous theoretical distributions φ(x) will mark the probability density and x the values of the continuous random variable. The theoretical general, central and standardized moments O_j, C_j and N_j are important parameters of all the theoretical distributions. The theoretical general, central and standardized moments O_j, C_j and N_j can be expressed through the formulas

O_j = ∫ x^j · φ(x) dx ,                    O_j = Σ (x_i)^j · P_i ,

C_j = ∫ (x − O_1)^j · φ(x) dx ,            C_j = Σ (x_i − O_1)^j · P_i ,

N_j = ∫ [(x − O_1)/√C_2]^j · φ(x) dx ,     N_j = Σ [(x_i − O_1)/√C_2]^j · P_i ,

where the integrals are taken over the whole range (a, b) of the continuous random variable values and the summations run over all the values x_i.


Often the names and marks mean value (expected value) E and dispersion (variance) D are used, too. The expected value E is a location parameter which measures the level of the random variable RV. The dispersion D is a variability parameter which measures the diffusion of the random variable values. The expected value E is equal to the theoretical general moment of 1st order O_1, the dispersion D is equal to the theoretical central moment of 2nd order C_2.
The theoretical general moment of 1st order O_1 is the location parameter, the theoretical central moment of 2nd order C_2 is the variability parameter, the theoretical standardized moment of 3rd order N_3 is the skewness parameter and the theoretical standardized moment of 4th order N_4 is the kurtosis parameter.

The relation between the empirical and theoretical parameters is described by the law of large numbers. Subject to compliance with certain conditions, it can be expected that the empirical distribution and the related empirical parameters will approximate the theoretical distribution and the theoretical parameters associated with it, and the more so, the greater the extent of the selective statistical set (the larger the number of realized random attempts). The approach of the empirical parameters to the theoretical parameters does not have the character of mathematical convergence but of probability convergence.
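The law can be observed directly: the empirical mean of zero-one (alternative) trials approaches the theoretical expected value p as the number of realized attempts grows (a simulation sketch in Python, with a hypothetical p):

```python
import random

random.seed(7)  # reproducible run

def empirical_mean(p, n):
    """Mean of n zero-one (alternative) trials with success probability p."""
    return sum(random.random() < p for _ in range(n)) / n

# For a large extent the empirical mean sits very close to E = p.
deviation_large = abs(empirical_mean(0.3, 100_000) - 0.3)
```

The deviation shrinks only in the probabilistic sense: a single short run can still land far from p, which is exactly the distinction between mathematical and probability convergence noted above.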


2.1.3. Description of Selected Probability (Theoretical) Distributions


a) Discrete theoretical distribution: Alternative distribution

The alternative distribution is the discrete theoretical distribution A(p) with one theoretical parameter p of a zero-one random variable RV (the random variable has the values x_i = i = 0, 1).
The probability and distribution functions P_i and F_i, as analogies of the empirical relative and cumulative frequencies, and the theoretical moments O_j, C_j have for the alternative distribution the forms

P_i = p^i · (1 − p)^(1−i), where i = 0, 1 ;   F_i = Σ P_j , where the adding index j runs from 0 to i ;

theoretical moments O_1, C_2, C_3, C_4:

O_1 = E = p ,
C_2 = D = p·(1 − p) ,
C_3 = p·(1 − p)·(1 − 2p) ,
C_4 = p·(1 − p)·(1 − 3p + 3p^2) .




b) Discrete theoretical distribution: Binomial distribution

The binomial distribution is the discrete theoretical distribution Bi(n, p) with two theoretical parameters n, p of a random variable RV (the random variable has the values x_i = i = 0, 1, ..., n).
The probability and distribution functions P_i and F_i, as analogies of the empirical relative and cumulative frequencies, and the theoretical moments O_j, C_j have for the binomial distribution the forms

P_i = C(n, i) · p^i · (1 − p)^(n−i), where i = 0, 1, ..., n ;   F_i = Σ P_j , where the adding index j runs from 0 to i ;

theoretical moments O_1, C_2, C_3, C_4:

O_1 = E = n·p ,
C_2 = D = n·p·(1 − p) ,
C_3 = n·p·(1 − p)·(1 − 2p) ,
C_4 = 3·n^2·p^2·(1 − p)^2 + n·p·(1 − p)·(1 − 6p + 6p^2) .




c) Discrete theoretical distribution: Poisson distribution

The Poisson distribution is the discrete theoretical distribution Po(λ) with one theoretical parameter λ of a random variable RV (the random variable has the values x_i = i = 0, 1, ..., ∞).
The probability and distribution functions P_i and F_i, as analogies of the empirical relative and cumulative frequencies, and the theoretical moments O_j, C_j have for the Poisson distribution the forms

P_i = (λ^i / i!) · e^(−λ), where i = 0, 1, ..., ∞ ;   F_i = Σ P_j , where the adding index j runs from 0 to i ;

theoretical moments O_1, C_2, C_3, C_4:

O_1 = E = λ ,   C_2 = D = λ ,   C_3 = λ ,   C_4 = λ + 3·λ^2 .

The binomial distribution Bi(n, p) may be approximated by the Poisson distribution Po(λ = n·p) for n > 30 and for p → 0 (p ≤ 0.1 is sufficient).
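The closeness of this approximation can be checked pointwise; a sketch (Python) comparing the two probability functions for the hypothetical choice n = 100, p = 0.05 (λ = 5):

```python
from math import comb, exp, factorial

def binom_pmf(i, n, p):
    """Binomial probability function P_i = C(n, i) * p**i * (1 - p)**(n - i)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def poisson_pmf(i, lam):
    """Poisson probability function P_i = lam**i / i! * e**(-lam)."""
    return lam ** i / factorial(i) * exp(-lam)

# n > 30 and p <= 0.1, so Po(lam = n*p) should sit close to Bi(n, p).
n, p = 100, 0.05
worst = max(abs(binom_pmf(i, n, p) - poisson_pmf(i, n * p)) for i in range(n + 1))
```

For these parameters the largest pointwise difference between the two probability functions stays well below 0.01.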

d) Discrete theoretical distribution: Geometric distribution

The geometric distribution is the discrete theoretical distribution Ge(p) with one theoretical parameter p of a random variable RV (the random variable has the values x_i = i = 0, 1, ..., ∞).
The probabilities P_i geometrically decrease with increasing values i. Independent attempts are carried out and the probability of the observed phenomenon (i.e. the probability of success) is the same for all the attempts and equal to p. The probability of the first success occurring only in attempt i + 1 is given by the probability function P_i.

The probability and distribution functions P_i and F_i, as analogies of the empirical relative and cumulative frequencies, and the theoretical moments O_j, C_j have for the geometric distribution Ge(p) the forms

P_i = p·(1 − p)^i, where i = 0, 1, 2, ..., ∞ ;   F_i = Σ P_j , where the adding index j runs from 0 to i ;

theoretical moments O_1, C_2:

O_1 = E = (1 − p)/p ,   C_2 = D = (1 − p)/p^2 .
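The moment formulas can be verified by truncating the infinite summations (a Python sketch with a hypothetical p):

```python
def geometric_pmf(i, p):
    """Probability function P_i = p * (1 - p)**i for i = 0, 1, 2, ..."""
    return p * (1 - p) ** i

# Check O_1 = (1-p)/p and C_2 = (1-p)/p**2 by truncating the infinite sums;
# the discarded tail (1-p)**1000 is negligible for this p.
p = 0.4
terms = range(1000)
mean = sum(i * geometric_pmf(i, p) for i in terms)
var = sum(i ** 2 * geometric_pmf(i, p) for i in terms) - mean ** 2
```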




e) Discrete theoretical distribution: Hypergeometric distribution

The hypergeometric distribution is the discrete theoretical distribution HGe(N, M, n) with three theoretical parameters N, M, n of a random variable RV (the random variable has the values x_i = i = max(0, M − N + n), ..., min(M, n)).

The hypergeometric distribution, unlike the previous discrete distributions, has dependent repeated random attempts (e.g. one works with N elements, M elements of which have the observed sign, and n elements are selected from these N elements without return).

The probability function P_i, as an analogy of the empirical relative frequency, and the theoretical moments O_j, C_j have for the hypergeometric distribution HGe(N, M, n) the forms

P_i = [C(M, i) · C(N − M, n − i)] / C(N, n), where i = max(0, M − N + n), ..., min(M, n) ;

theoretical moments O_1, C_2:

O_1 = E = n·(M/N) ,   C_2 = D = n·(M/N)·(1 − M/N)·(N − n)/(N − 1) .

The forms of the theoretical parameters O_1, C_2 for N sufficiently large with respect to n correspond to the forms of the theoretical parameters O_1, C_2 of the binomial distribution Bi(n, p) with the probability p = M/N.

The hypergeometric distribution HGe(N, M, n) may be approximated by the binomial distribution Bi(n, p) for

n/N ≤ 0.05 ,   p = M/N .

The hypergeometric distribution HGe(N, M, n) may be approximated by the Poisson distribution Po(λ) for small fractions n/N, M/N and for n large:

n/N ≤ 0.05 ,   M/N ≤ 0.1 ,   n ≥ 31 ,   λ = n·(M/N) .
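The first approximation can be checked numerically (a Python sketch with hypothetical parameters N = 1000, M = 300, n = 10, so that n/N = 0.01 ≤ 0.05):

```python
from math import comb

def hypergeom_pmf(i, N, M, n):
    """P_i = C(M, i) * C(N - M, n - i) / C(N, n)."""
    return comb(M, i) * comb(N - M, n - i) / comb(N, n)

def binom_pmf(i, n, p):
    """Binomial probability function used as the approximation with p = M/N."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

# Selection without return vs. selection with return: for n/N <= 0.05
# the two probability functions should be nearly indistinguishable.
N, M, n = 1000, 300, 10
worst = max(abs(hypergeom_pmf(i, N, M, n) - binom_pmf(i, n, M / N))
            for i in range(n + 1))
```

This mirrors the remark under the binomial distribution: with a small selective set the difference between the selection with and without return is insignificant.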




f) Discrete theoretical distribution: Multinomial distribution

The s-multiple multinomial distribution is the discrete theoretical distribution s-Multi(n, p_1, ..., p_{s−1}) with s theoretical parameters n, p_1, ..., p_{s−1} (the random variables RV_1, ..., RV_s have the values marked i_1, ..., i_s = 0, 1, ..., n).

The distribution s-Multi(n, p_1, ..., p_{s−1}) is connected with the incompatible random phenomena A_1, ..., A_s which can come up in n independent attempts with the probabilities p_1, ..., p_s (the summation of the probabilities is equal to 1; the s-multiple multinomial distribution therefore has only s − 1 independent probabilities). The numbers of occurrences of the random phenomena A_i in n attempts have the binomial distributions Bi(n, p_i).
The probability function for the multinomial distribution s-Multi(n, p_1, ..., p_{s−1}) has, as an analogy of the empirical relative frequency, the form

P_{i_1, ..., i_{s−1}} = n! / [i_1! · ... · i_{s−1}! · (n − Σ i_j)!] · p_1^{i_1} · ... · p_{s−1}^{i_{s−1}} · (1 − Σ p_j)^{n − Σ i_j} ,

where the summations Σ run over j = 1, ..., s − 1.

The individual binomial distributions Bi(n, p_i) have the theoretical parameters

O_1 = E_i = n·p_i ,   C_2 = D_i = n·p_i·(1 − p_i) .

The distribution of one random variable (s = 2) is the binomial distribution Bi(n, p_i). The distribution of two random variables (s = 3) is the trinomial distribution Tr(n, p_i, p_j). The probability function P_ij for the trinomial distribution Tr(n, p_i, p_j) has the form

P_ij = n! / [i! · j! · (n − i − j)!] · p_i^i · p_j^j · (1 − p_i − p_j)^(n−i−j) .

The multinomial distribution for n → ∞, p_i → 0 (i = 1, ..., s) may be approximated, for λ_i = n·p_i (the λ_i being finite numbers), by the multi-dimensional Poisson distribution Po(λ_i).


g) Continuous theoretical distribution Normal and standardized normal distribution

The normal distribution is continuous theoretical distribution N(,) of random variable
RV (the random variable acquires the values ( ) ; xe ). The normal distribution has two
theoretical parameters , . The standardized normal distrinution is continuous theoretical
( )
( )
1 2 1 2
!
1 .
! ! !
n i j
i j
ij
n
P p p p p
i j n i j

=

40

distribution N(0,1) of random variable U (the random variable acquires the values
( ) ; ue ). For standardized normal distribution the parameters , are standardized to
values 0, 1 by the substitution of the random variable RV by new random variable U

( ) ( )
2
, 0, 1.
E x D x
x x x
u E D


o o o o o

| | | |
= = = = =
| |
\ . \ .


The probability densities (x), (u) (corresponding with relative frequency), the
distribution functions F(x), F(u) (corresponding with cumulative frequency) and standardizing
conditions (corresponding with empirical standardizing condition) have the forms

( )
( )
( )
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )
2
2
2
2 2
1 1
,
2 2
,
1, 1
x
u
t t
x e u e
F t x dx F t u du
F x dx F u du

o

o t t





= =
= =
= = = =
} }
} }


The theoretical parameters O_1, C_2 can be calculated in the form

O_1 = E(x) = \int_{-\infty}^{\infty} x\, \varphi(x)\, dx = \mu, \qquad O_1 = E(u) = \int_{-\infty}^{\infty} u\, \varphi(u)\, du = 0,

C_2 = D(x) = \int_{-\infty}^{\infty} (x - O_1)^2\, \varphi(x)\, dx = \sigma^2, \qquad C_2 = D(u) = \int_{-\infty}^{\infty} u^2\, \varphi(u)\, du = 1.
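The standardizing condition and the theoretical parameters O_1 = 0, C_2 = 1 of N(0, 1) can be verified numerically. A minimal sketch using a Riemann sum over (−8, 8) (the truncation error outside this range is negligible; the grid and step are assumptions of the sketch):

```python
from math import exp, pi, sqrt

def phi(u):
    """Probability density of the standardized normal distribution N(0, 1)."""
    return exp(-u * u / 2.0) / sqrt(2.0 * pi)

h = 0.001
grid = [-8.0 + h * k for k in range(16001)]   # points covering (-8, 8)

area = sum(phi(u) * h for u in grid)          # standardizing condition, close to 1
O1 = sum(u * phi(u) * h for u in grid)        # theoretical parameter O1 = E(u), close to 0
C2 = sum(u * u * phi(u) * h for u in grid)    # theoretical parameter C2 = D(u), close to 1
```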



h) Continuous theoretical distribution – Lognormal distribution

The lognormal distribution is the continuous theoretical distribution LN(μ, σ) of the random variable RV, which is an increasing function of the random variable Y in the form x = e^y (the random variable Y has the normal distribution N(μ, σ)). The lognormal distribution has two theoretical parameters μ, σ.

The probability density φ(x) (corresponding with the relative frequency) has the form

\varphi(x) = \frac{1}{x\, \sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \qquad \text{where } 0 < x < \infty.

The theoretical parameters O_k, O_1, C_2 can be calculated in the form

O_k = E(x^k) = \int_{0}^{\infty} x^k\, \varphi(x)\, dx = \exp\!\left(k\mu + \frac{k^2 \sigma^2}{2}\right),

O_1 = \exp\!\left(\mu + \frac{\sigma^2}{2}\right), \qquad O_2 = \exp\!\left(2\mu + 2\sigma^2\right),

C_2 = D(x) = O_2 - O_1^2 = \exp\!\left(2\mu + \sigma^2\right)\left(\exp(\sigma^2) - 1\right).
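The moment formulas can be cross-checked numerically. A sketch with assumed parameters μ = 0, σ = 0.5 (a midpoint Riemann sum over (0, 20) suffices, since the density is negligible beyond that range for these parameters):

```python
from math import exp, log, pi, sqrt

def lognorm_density(x, mu, sigma):
    """Lognormal probability density, 0 < x < infinity."""
    return exp(-(log(x) - mu) ** 2 / (2.0 * sigma ** 2)) / (x * sigma * sqrt(2.0 * pi))

mu, sigma = 0.0, 0.5                         # assumed example parameters
h = 0.0005
xs = [h * (k + 0.5) for k in range(40000)]   # midpoints covering (0, 20)

O1_num = sum(x * lognorm_density(x, mu, sigma) * h for x in xs)
O2_num = sum(x * x * lognorm_density(x, mu, sigma) * h for x in xs)
C2_num = O2_num - O1_num ** 2

O1_theory = exp(mu + sigma ** 2 / 2.0)                          # O1 = exp(mu + sigma^2/2)
C2_theory = exp(2 * mu + sigma ** 2) * (exp(sigma ** 2) - 1.0)  # C2 = D(x)
```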



2.1.4. Apparatus of Non-parametric Testing

The use of the apparatus of the zero hypothesis H_0 and the alternative hypothesis H_a is the foundation of testing non-parametric (but also parametric) hypotheses.

In the case of non-parametric hypotheses the zero hypothesis supposes that the empirical distribution can be substituted by the intended theoretical distribution (for the substitution by the normal distribution it would be a test of normality). The alternative hypothesis then supposes that this presumption is not correct. A comparison between the theoretical and empirical absolute frequencies is the essence of testing non-parametric hypotheses. The empirical absolute frequencies are calculated by means of the elementary statistical processing in relation to the empirical distribution. The theoretical absolute frequencies are then calculated through the probability function or the probability density in relation to the intended theoretical distribution.

The parametric hypotheses relate to a comparison of the empirical and theoretical parameters; the zero and alternative hypotheses play a similar role there.
For the verification of non-parametric and parametric hypotheses a special group of theoretical distributions was developed; these distributions are not intended to replace the empirical distributions but work as statistical criteria. The normal distribution is the only exception: in its standardized shape it may play the role of a statistical criterion, in its non-standardized shape it may substitute the empirical distributions.

The standardized normal distribution (u-test), the Student distribution (t-test), the Pearson χ² distribution (χ²-test, chi-square) and the Fisher-Snedecor distribution (F-test) belong among the most frequent statistical criteria. Detailed statistical tables are elaborated for all presented statistical criteria.


For the verification of the hypotheses H_0 and H_a a suitable statistical criterion must be selected. The χ²-test is used most frequently for the verification of a non-parametric hypothesis. A condition of its application is an interval division of frequencies in which each partial interval is connected with an absolute frequency equal to at least 5. If this condition is not fulfilled, it is necessary to merge neighbouring partial intervals; the same approach applies when constructing the interval division of frequencies.
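The merging step can be sketched as follows. This is a hypothetical helper, not code from the text: it merges intervals from the right until each remaining interval reaches the required frequency, and is applied here to the theoretical absolute frequencies n·p_i of the example in section 2.1.5 (a common variant checks the theoretical frequencies; with these data it merges intervals 4 and 5, exactly as Tab.4 does):

```python
def merge_small_intervals(observed, expected, minimum=5.0):
    """Merge neighbouring intervals (from the right) until every remaining
    interval has a theoretical absolute frequency of at least `minimum`.
    (If the very first interval stays below the minimum, it is left as is.)"""
    obs, exp_ = list(observed), list(expected)
    i = len(exp_) - 1
    while i > 0:
        if exp_[i] < minimum:
            exp_[i - 1] += exp_[i]
            obs[i - 1] += obs[i]
            del exp_[i], obs[i]
        i -= 1
    return obs, exp_

# Frequencies from Tab.3 in section 2.1.5: the last interval has
# n*p_5 = 1.230 < 5, so intervals 4 and 5 are merged.
obs, exp_ = merge_small_intervals([9, 15, 20, 4, 2],
                                  [8.125, 16.875, 15.875, 7.895, 1.230])
```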

After the selection of the statistical criterion (e.g., the χ²-test) it is necessary to determine the experimental value of this criterion (e.g., χ²_exp) and the critical theoretical value (e.g., χ²_teor). The so-called critical domain W of the relevant statistical criterion is then recorded by means of the critical theoretical value.

If the experimental value of the selected criterion is an element of the critical domain W, it is necessary to accept the alternative hypothesis H_a, i.e. the empirical distribution cannot be substituted by the intended theoretical distribution. In the opposite case (the experimental value is not an element of the critical domain W) the zero hypothesis H_0 can be accepted, i.e. the empirical distribution can be substituted by the intended theoretical distribution.

The determination of the significance level α is an essential element of testing non-parametric and parametric hypotheses. The significance level states the probability of an erroneous rejection of the tested hypothesis (i.e. the probability of an error of the first type). The most frequent significance levels are the values α = 0.05 and α = 0.01. E.g., for a positive test of normality (i.e. the hypothesis H_0 on the possibility to substitute the empirical distribution by the normal distribution is accepted and the hypothesis H_a is refused) the significance level α = 0.05 allows the conclusion: if the selective statistical set SSS were selected 100 times from the basic statistical set BSS, then in 95 cases it would be shown that the empirical distribution can be substituted by the normal distribution.

The proper procedure of non-parametric testing can be demonstrated by means of the solution of the assigned example.





2.1.5. Illustration of Non-parametric Testing

Within the assigned example it is now possible to follow the procedure of verifying the zero hypothesis H_0 that the empirical distribution in figure Fig.2 can be substituted by a normal distribution (see Fig.4).

In the course of testing the χ²-test will be applied. In the course of its application the letter k refers to the number of intervals of the frequency interval division, the letter r to the number of theoretical parameters of the normal distribution (i.e. r = 2). The formula ν = k − r − 1 expresses the number of degrees of freedom, which, together with a selected significance level α, enables the critical theoretical value χ²_teor = χ²_{k−r−1}(α) to be determined using statistical tables. The significance level α = 0.05 is selected.

The letter F marks the Laplace function depending on the standardized random variable u_i (u_i is the standardized value reflecting the upper limit x_i of the relevant interval of the frequency interval division). The probabilities p_i (expressed by integral calculus) are given by the differences of the Laplace function values, the products n·p_i then express the theoretical absolute frequencies, and the values n_i denote the empirical absolute frequencies (see tables Tab.1 and Tab.2).

The calculation of the standardized values u_i using the relation

u_i = \frac{x_i - O_1}{S_x}

(general moment of the 1st order O_1 = 2.5, standard deviation S_x = 1, the upper limits x_i being x_1 = 1.5, x_2 = 2.5, x_3 = 3.5, x_4 = 4.5, x_5 = ∞) leads to the values

u_1 = -1, \qquad u_2 = 0, \qquad u_3 = 1, \qquad u_4 = 2, \qquad u_5 = \infty.



The calculation of the probabilities p_i using the integral calculus and the Laplace function values F(u):

p_1 = \int_{-\infty}^{1.5} \varphi(x)\, dx = \int_{-\infty}^{-1} \varphi(u)\, du = F(-1)

p_2 = \int_{1.5}^{2.5} \varphi(x)\, dx = \int_{-1}^{0} \varphi(u)\, du = F(0) - F(-1)

p_3 = \int_{2.5}^{3.5} \varphi(x)\, dx = \int_{0}^{1} \varphi(u)\, du = F(1) - F(0)

p_4 = \int_{3.5}^{4.5} \varphi(x)\, dx = \int_{1}^{2} \varphi(u)\, du = F(2) - F(1)

p_5 = \int_{4.5}^{\infty} \varphi(x)\, dx = \int_{2}^{\infty} \varphi(u)\, du = 1 - F(2)

The application of the χ²-test in the form

\chi^2_{exp} = \sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i}, \qquad p_i = F(u_i) - F(u_{i-1}),

already enables the needful partial calculations to be realized (see table Tab.3).

i    Interval      n_i    u_i    F(u_i)    p_i       n·p_i
1    (−∞; 1.5)      9     −1     0.1625    0.1625     8.125
2    (1.5; 2.5)    15      0     0.5000    0.3375    16.875
3    (2.5; 3.5)    20      1     0.8175    0.3175    15.875
4    (3.5; 4.5)     4      2     0.9754    0.1579     7.895
5    (4.5; ∞)       2      ∞     1.0000    0.0246     1.230

Table Tab.3: The calculations of u_i, F(u_i), p_i and n·p_i


The table Tab.4 reflects the requirement that at least 5 measurement results must lie in each interval in the course of the normality test. Neighbouring intervals are merged to reach 5 and more measurement results. At the same time, the additional calculations enabling the experimental value of the statistical criterion to be established are carried out in this table.


i      n_i    n·p_i    (n_i − n·p_i)² / (n·p_i)
1       9      8.1      0.100
2      15     16.9      0.214
3      20     15.9      1.057
4+5     6      9.1      1.056
                        Σ = 2.427 = χ²_exp

Table Tab.4: The adjustment of the number of intervals, the calculation of χ²_exp

In the final part of the non-parametric testing it was needful to determine the critical theoretical value χ²_teor = χ²_ν = χ²_{k−r−1} = χ²_{4−2−1} = χ²_1 = 3.84 using the calculated number of degrees of freedom ν = k − r − 1 = 4 − 2 − 1 = 1 and the statistical tables with the significance level α = 0.05. By means of the critical theoretical value it was already possible to record the right-sided critical domain

W = \langle \chi^2_1(\alpha); \infty) = \langle 3.84; \infty).
For the experimental value of the statistical criterion χ²_exp = 2.427 (i.e. χ²_exp ∉ W) it is possible to state the conclusive verdict related to the non-parametric hypothesis test:

The experimental value χ²_exp does not belong to the critical domain, the zero hypothesis H_0 can be accepted, and the empirical distribution (empirical polygon) can be substituted by the theoretical normal distribution with the significance level α = 0.05. This conclusion is of considerable importance: in the course of deducing additional information it is possible to use not only the simple mathematical apparatus connected with the normal distribution, but in the course of parametric hypotheses testing it is also possible to apply the testing techniques which are bound to the normal distribution.
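The whole normality test above can be reproduced in a few lines. The sketch below uses the exact distribution function of N(0, 1) (via the error function); its p_i differ slightly from the rounded Laplace-function values in Tab.3, so χ²_exp comes out near 1.37 instead of 2.427, but the decision χ²_exp < 3.84 (accept H_0) is the same:

```python
from math import erf, sqrt

def F(u):
    """Distribution function of N(0, 1) (the Laplace function)."""
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

O1, Sx, n = 2.5, 1.0, 50
limits = [1.5, 2.5, 3.5, 4.5]            # upper limits x_i of the intervals
observed = [9, 15, 20, 4, 2]             # empirical absolute frequencies n_i

u = [(x - O1) / Sx for x in limits]      # standardized upper limits: -1, 0, 1, 2
cdf = [F(v) for v in u] + [1.0]
p = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, 5)]
expected = [n * pk for pk in p]          # theoretical absolute frequencies n*p_i

# Merge intervals 4 and 5 (theoretical frequency below 5), as in Tab.4.
observed = observed[:3] + [observed[3] + observed[4]]
expected = expected[:3] + [expected[3] + expected[4]]

chi2_exp = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_crit = 3.84                         # chi^2 with k - r - 1 = 1 degree of freedom
normality_accepted = chi2_exp < chi2_crit
```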














2.2. Comparison of Empirical and Theoretical Parameters – Estimations of Theoretical Parameters, Testing Parametric Hypotheses




Goals:
- Probabilistic investigation of the selective statistical set: Quantification of theoretical parameters, Comparison between theoretical and empirical parameters

- Probabilistic picture of the selective statistical set: Point and interval estimation (e.g. confidence interval), Testing parametric hypotheses





Acquired concepts and knowledge pieces:

Point estimation

Interval estimation

Confidence interval

Confidence interval for mean value

Confidence interval for standard deviation

Testing parametric hypotheses

Computed u-statistic

Computed t-statistic

Computed F-statistic

Computed chi-square statistic












Check questions:


Why do the estimations of theoretical parameters come before the comparison of theoretical
and empirical parameters?


What conditions must a good point estimation fulfil?


What are the methods of point estimations?


What are the advantages of interval estimations?


Describe the construction of confidence intervals


Which statistical criteria are used for the construction of confidence intervals?


What is the apparatus of parametric testing?


What is the difference between one-selective and two-selective testing of parametric
hypotheses?


What is the procedure for parametric testing?


Present a survey of the most frequent statistical criteria











Another of the main methods of statistics, Comparison of empirical and theoretical parameters, builds on the Assignment of a theoretical distribution to the empirical distribution. The theoretical distribution is identified and assigned by non-parametric testing, but it still contains unknown values of the theoretical parameters. Before a comparison between the empirical and theoretical parameters can be implemented, it is needful to estimate the theoretical parameters. Then it is possible to approach the comparison between the empirical and theoretical parameters with the application of the parametric testing apparatus.


2.2.1. Basics of Estimation Theory

It is necessary to estimate the theoretical parameters (e.g. the mean value E = μ and the dispersion D = σ² for the normal distribution). There are two kinds of estimations of the theoretical parameters: the point and the interval ones.

A good point estimation should fulfil the conditions of consistency, unbiasedness, efficiency and sufficiency. These conditions are only recalled here; more detailed information can be obtained in the literature dealing with estimation theory. The point estimation can be carried out by the moment method or by the method of maximum likelihood. The moment method is based on the fact that the empirical parameters are considered the estimations of the corresponding theoretical parameters. The method of maximum likelihood is mathematically considerably more demanding. The disadvantage of point estimations consists above all in the ignorance of the exactness with which the estimation was done.

The interval estimations remove the problem of unknown estimation exactness. They construct an interval providing a reasonable guarantee (sufficiently high probability) that the real value of the theoretical parameter is located inside the interval. This probability again relates to the selection of the significance level α, and the constructed interval then bears the name 100(1−α)% confidence interval (e.g., for α = 0.05 it will be the 95% confidence interval).
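The moment method mentioned above can be sketched in a few lines (the sample below is an assumed illustration, not data from the text): the empirical arithmetic mean estimates the mean value E, and the empirical variance estimates the dispersion D; dividing by n − 1 instead of n is what the unbiasedness condition suggests for the variance.

```python
from math import sqrt

data = [2.1, 2.4, 2.7, 2.3, 3.0, 2.5]      # assumed illustrative sample

n = len(data)
O1 = sum(data) / n                          # moment estimate of the mean value E
M2 = sum((x - O1) ** 2 for x in data) / n   # central moment, estimates the dispersion D
Sx = sqrt(sum((x - O1) ** 2 for x in data) / (n - 1))  # unbiased-variance version
```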

a) The construction of the confidence interval for the mean value μ of the normal distribution using the u-test (condition of the construction: the variance σ² is assigned in advance) works on the form of the statistical criterion

u = \frac{O_1 - \mu}{\sigma} \sqrt{n}.

The critical values are −u(α/2), u(α/2); the conditions for the construction of the confidence interval can be recorded in the form of the inequalities −u(α/2) < u < u(α/2). After solving these inequalities it is possible to obtain the confidence interval (the interval estimation of μ):

\mu \in \left( O_1 - u(\alpha/2)\, \frac{\sigma}{\sqrt{n}};\; O_1 + u(\alpha/2)\, \frac{\sigma}{\sqrt{n}} \right).

b) The construction of the confidence interval for the mean value μ of the normal distribution using the t-test (condition of the construction: the variance σ² is not assigned in advance) works on the form of the statistical criterion

t = \frac{O_1 - \mu}{S_x} \sqrt{n}.

The critical values are −t_{n−1}(α/2), t_{n−1}(α/2); the conditions for the construction of the confidence interval can be recorded in the form of the inequalities −t_{n−1}(α/2) < t < t_{n−1}(α/2). After solving these inequalities it is possible to obtain the confidence interval (the interval estimation of μ):

\mu \in \left( O_1 - t_{n-1}(\alpha/2)\, \frac{S_x}{\sqrt{n}};\; O_1 + t_{n-1}(\alpha/2)\, \frac{S_x}{\sqrt{n}} \right).

c) The construction of the confidence interval for the variance σ² of the normal distribution using the χ²-test (condition of the construction: the empirical variance S_x² must be calculated) works on the form of the statistical criterion

\chi^2 = \frac{(n-1)\, S_x^2}{\sigma^2}.

The critical values are χ²_{n−1}(1−α/2), χ²_{n−1}(α/2); the conditions for the construction of the confidence interval can be recorded in the form of the inequalities χ²_{n−1}(1−α/2) < χ² < χ²_{n−1}(α/2). After solving these inequalities it is possible to obtain the confidence interval (the interval estimation of σ²):

\sigma^2 \in \left( \frac{(n-1)\, S_x^2}{\chi^2_{n-1}(\alpha/2)};\; \frac{(n-1)\, S_x^2}{\chi^2_{n-1}(1-\alpha/2)} \right).


2.2.2. Illustration of Confidence Intervals Construction

a) Within the assigned example the construction of the confidence interval will be carried out for the mean value μ using the t-test. The confidence interval is given by the form

\mu \in \left( O_1 - t_{n-1}(\alpha/2)\, \frac{S_x}{\sqrt{n}};\; O_1 + t_{n-1}(\alpha/2)\, \frac{S_x}{\sqrt{n}} \right).

For the significance level α = 0.05, the extent n = 50 of the selective statistical set SSS, the standard deviation S_x = 1 (approximative value) and the arithmetic mean O_1 = 2.5, the critical values are, according to the statistical tables, equal to ±t_49(0.025) = ±1.96 (for the number of degrees of freedom n − 1 > 33 it is possible to apply the statistical table for the u-test). After substitution, the 95% confidence interval is

\mu \in (2.221;\; 2.779).

b) Within the assigned example the construction of the confidence interval will be carried out for the variance σ² using the χ²-test. The confidence interval is given by the form

\sigma^2 \in \left( \frac{(n-1)\, S_x^2}{\chi^2_{n-1}(\alpha/2)};\; \frac{(n-1)\, S_x^2}{\chi^2_{n-1}(1-\alpha/2)} \right).

For the significance level α = 0.05, the extent n = 50 of the selective statistical set SSS and the standard deviation S_x = 1 (approximative value), the critical values are, according to the statistical tables,

\chi^2_{49}(1 - \alpha/2) = \chi^2_{49}(0.975) = 30.60, \qquad \chi^2_{49}(\alpha/2) = \chi^2_{49}(0.025) = 70.22.

After substitution, the 95% confidence intervals are

\sigma^2 \in (0.705;\; 1.617), \qquad \sigma \in (0.839;\; 1.272).
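Both constructions can be reproduced directly. The critical values below are the table values quoted above; note that the quoted endpoints follow from S_x = 1.005 (the value given in section 2.2.4 b)) rather than from the rounded approximative value S_x = 1:

```python
from math import sqrt

O1, Sx, n = 2.5, 1.005, 50
t_crit  = 1.96     # t_49(0.025), taken from the statistical tables as in the text
chi2_lo = 30.60    # chi^2_49(0.975)
chi2_hi = 70.22    # chi^2_49(0.025)

# 95% confidence interval for the mean value (t-test construction).
half = t_crit * Sx / sqrt(n)
mean_ci = (O1 - half, O1 + half)

# 95% confidence interval for the variance (chi^2-test construction).
var_ci = ((n - 1) * Sx ** 2 / chi2_hi, (n - 1) * Sx ** 2 / chi2_lo)
```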

2.2.3. Basics of Parametric Hypotheses Testing

The parametric hypotheses testing again works on the apparatus of the zero hypothesis H_0 and the alternative hypothesis H_a. This apparatus is accompanied by the usual apparatus of the critical domain W. Due to the central limit theorem it is a natural assumption that the normal distribution, as the most suitable theoretical distribution, may be assigned to the empirical distribution.

The parametric testing can be divided into one-selective testing of hypotheses about the mean value or the variance (the one-selective u-test and t-test are used for the mean value and the one-selective χ²-test for the variance) and two-selective testing of hypotheses about an equality of the mean values or of the variances (the two-selective u-test and t-test are used for an equality of the mean values and the two-selective F-test for an equality of the variances).

In the case of one-selective testing the hypotheses H_0 and H_a can be written in the form

H_0: μ = μ_0 or H_0: σ = σ_0, \qquad H_a: μ ≠ μ_0 or H_a: σ ≠ σ_0.

The one-selective parametric testing works on the comparison between an empirical parameter O_1 or an empirical parameter S_x (by these symbols the results of the elementary statistical processing of the selective statistical set SSS are marked; by means of these results the relevant theoretical parameters μ, σ of the corresponding normal distribution were estimated) and some external theoretical data μ_0, σ_0, the origin of which can be various (study of literature, research reports, commercial indicators and the like). The collective denominator of these external data is the determination that they probably characterize a certain significant basic statistical set BSS. The one-selective parametric testing, from the point of view of mathematical statistics, then answers the question whether the investigated selective statistical set SSS could have been chosen from the described significant basic statistical set BSS. If the hypothesis H_0 is verified, it is possible to look at the results of the investigation of the selective statistical set SSS in the context created by the basic statistical set BSS. If the hypothesis H_a is accepted, it is not possible to work within this context.


In the case of two-selective testing the hypotheses H_0 and H_a can be written in the form

H_0: μ_1 = μ_2 or H_0: σ_1 = σ_2, \qquad H_a: μ_1 ≠ μ_2 or H_a: σ_1 ≠ σ_2.


The two-selective parametric testing works on the comparison between an empirical parameter O_1 or an empirical parameter S_x (by these symbols the results of the elementary statistical processing of the selective statistical set SSS_1 are marked; by means of these results the relevant theoretical parameters μ_1, σ_1 of the corresponding normal distribution were estimated) and some external theoretical data μ_0, σ_0, the origin of which can usually be found in the investigation results of another selective statistical set SSS_2. The two-selective parametric testing, from the point of view of mathematical statistics, then answers the question whether both selective statistical sets SSS_1 and SSS_2 have investigated an analogous problem and whether these sets can co-operate. If the hypothesis H_0 is confirmed, it is possible to consider the selective sets SSS_1 and SSS_2 to be selective sets chosen from the same basic statistical set BSS, and the endeavour to identify the set BSS is usually worthwhile. If the hypothesis H_a is accepted, it is necessary, from the point of view of mathematical statistics, to voice doubts about the compatibility of the sets SSS_1 and SSS_2.

The procedure of parametric testing is similar to the procedure of non-parametric testing. First, it is needful to formulate the zero and the alternative hypothesis and to select the significance level α. Then it is needful to select a suitable statistical criterion (u-test, t-test, χ²-test, F-test), to find its critical value and to record the corresponding critical domain W. Finally, it is necessary to calculate the experimental value of the statistical criterion and to determine whether or not it is an element of the critical domain W. If the experimental value is an element of the domain W, it is necessary to accept the alternative hypothesis H_a; in the opposite case the zero hypothesis H_0.


Survey of some one-selective statistical criteria (n is the extent of the set SSS):

a) One-selective u-test (testing the hypothesis about the mean value μ with the known variance σ²)

u_{exp} = \frac{O_1 - \mu_0}{\sigma} \sqrt{n}, \qquad W = (-\infty; -u(\alpha/2)\rangle \cup \langle u(\alpha/2); \infty).

b) One-selective t-test (testing the hypothesis about the mean value μ with the unknown variance σ²)

t_{exp} = \frac{O_1 - \mu_0}{S_x} \sqrt{n}, \qquad W = (-\infty; -t_{n-1}(\alpha/2)\rangle \cup \langle t_{n-1}(\alpha/2); \infty).

c) One-selective χ²-test (testing the hypothesis about the variance with the unknown parameters μ, σ²)

\chi^2_{exp} = \frac{(n-1)\, S_x^2}{\sigma_0^2}, \qquad W = \langle 0; \chi^2_{n-1}(1-\alpha/2)\rangle \cup \langle \chi^2_{n-1}(\alpha/2); \infty).

Survey of some two-selective statistical criteria:

a) Two-selective u-test (testing the hypothesis about the equality of the mean values with the known variances σ_1², σ_2²); n_1, n_2 are the extents of the selective statistical sets SSS_1, SSS_2

u_{exp} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}, \qquad W = (-\infty; -u(\alpha/2)\rangle \cup \langle u(\alpha/2); \infty).

b) Two-selective t-test (testing the hypothesis about the equality of the mean values with the unknown variances σ_1², σ_2²); n_1, n_2 are the extents of the selective statistical sets SSS_1, SSS_2, and S_x1, S_x2 are the empirical standard deviations of the selective statistical sets SSS_1, SSS_2

t_{exp} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(n_1-1)\, S_{x1}^2 + (n_2-1)\, S_{x2}^2}}\; \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}, \qquad W = (-\infty; -t_{n_1+n_2-2}(\alpha/2)\rangle \cup \langle t_{n_1+n_2-2}(\alpha/2); \infty).

c) Two-selective F-test (testing the hypothesis about the equality of the variances with the unknown parameters μ_1, μ_2, σ_1², σ_2²); n_1, n_2 are the extents of the selective statistical sets SSS_1, SSS_2, and S_x1, S_x2 are the empirical standard deviations of the selective statistical sets SSS_1, SSS_2

F_{exp} = \frac{S_{x1}^2}{S_{x2}^2}, \qquad W = \langle 0; F_{n_1-1,\, n_2-1}(1-\alpha/2)\rangle \cup \langle F_{n_1-1,\, n_2-1}(\alpha/2); \infty).

The remark: The larger of the squared standard deviations S_x1², S_x2² is usually put into the numerator of the statistical criterion

F_{exp} = \frac{S_{x1}^2}{S_{x2}^2}.

From this point of view the right-sided critical domain W = ⟨F_{n_1−1, n_2−1}(α); ∞) with the value α instead of the value α/2 is usually used.

d) The paired t-test (the transformation of the two-selective t-test to a one-selective t-test on the basis of the zero hypothesis H_0: μ_1 − μ_2 = Δ, where most frequently Δ = 0).


2.2.4. Illustration of Parametric Testing

a) Assigned example – testing hypotheses about the mean value

Determine whether the investigated selective statistical set SSS (O_1 = 2.5, n = 50) could be, for the significance level α = 0.05, selected from the basic statistical set BSS which is characterized by the mean value a1) μ_0 = 2.6, a2) μ_0 = 2.9.

The information about the variance is missing, so it is needful to use the one-selective t-test:

t_{exp} = \frac{O_1 - \mu_0}{S_x} \sqrt{n}, \qquad W = (-\infty; -t_{n-1}(\alpha/2)\rangle \cup \langle t_{n-1}(\alpha/2); \infty).


The formulation of the zero and alternative hypothesis: H_0: μ = μ_0, H_a: μ ≠ μ_0.

The determination of the critical values and the critical domain:

t_49(0.025) = u(0.025) = 1.96, \qquad W = (-\infty; -1.96\rangle \cup \langle 1.96; \infty).


The calculation of the experimental value of the statistical criterion for the case a1):

t_{exp} = -0.704, \qquad t_{exp} \notin W.

The result interpretation:
The experimental value t_exp does not belong to the critical domain; on the significance level α = 0.05 it is possible to accept the zero hypothesis H_0. The investigated selective statistical set could have been selected from the external set BSS. The difference μ − μ_0 is statistically unimportant for the significance level α = 0.05 (it can be noted that the value μ_0 is an element of the 95% confidence interval in the case a1)).

The calculation of the experimental value of the statistical criterion for the case a2):

t_{exp} = -2.814, \qquad t_{exp} \in W.


The result interpretation:
The experimental value t_exp is an element of the critical domain; on the significance level α = 0.05 it is possible to refuse the zero hypothesis H_0. The investigated selective statistical set SSS could not have been selected from the external set BSS. The difference μ − μ_0 is, on the significance level α = 0.05, statistically important (it can be noted that the value μ_0 is not an element of the 95% confidence interval in the case a2)).
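The calculations for both cases can be reproduced as follows. S_x = 1.005 is assumed here (as quoted in example b)); with it the statistics come out as −0.704 and −2.814, i.e. the quoted magnitudes, the sign showing that O_1 lies below μ_0:

```python
from math import sqrt

def t_statistic(O1, mu0, Sx, n):
    """Experimental value of the one-selective t-test."""
    return (O1 - mu0) / Sx * sqrt(n)

O1, Sx, n, t_crit = 2.5, 1.005, 50, 1.96

t_a1 = t_statistic(O1, 2.6, Sx, n)   # case a1), mu0 = 2.6
t_a2 = t_statistic(O1, 2.9, Sx, n)   # case a2), mu0 = 2.9

accept_H0_a1 = abs(t_a1) < t_crit    # True  -> H0 accepted
accept_H0_a2 = abs(t_a2) < t_crit    # False -> Ha accepted
```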


b) Assigned example – testing the hypothesis about the variance

Determine whether the investigated selective statistical set SSS (O_1 = 2.5, S_x = 1.005, n = 50) could be, for the significance level α = 0.05, selected from the basic statistical set BSS which is characterized by the standard deviation b1) σ_0 = 1, b2) σ_0 = 0.5.

The one-selective χ²-test will be used:

\chi^2_{exp} = \frac{(n-1)\, S_x^2}{\sigma_0^2}, \qquad W = \langle 0; \chi^2_{n-1}(1-\alpha/2)\rangle \cup \langle \chi^2_{n-1}(\alpha/2); \infty).

The formulation of the zero and alternative hypothesis: H_0: σ = σ_0, H_a: σ ≠ σ_0.

The determination of the critical values and the critical domain:

\chi^2_{49}(0.975) = 30.60, \qquad \chi^2_{49}(0.025) = 70.22, \qquad W = \langle 0; 30.60\rangle \cup \langle 70.22; \infty).


The calculation of the experimental value of the statistical criterion for the case b1):

\chi^2_{exp} = 49.49, \qquad \chi^2_{exp} \notin W.


The result interpretation:
The experimental value χ²_exp does not belong to the critical domain; on the significance level α = 0.05 it is possible to accept the zero hypothesis H_0. The investigated selective statistical set SSS could have been selected from the external set BSS. The quotient between σ and σ_0 is statistically unimportant for the significance level α = 0.05 (it can be noted that the value σ_0 is an element of the 95% confidence interval in the case b1)).


The calculation of the experimental value of the statistical criterion for the case b2):

\chi^2_{exp} = 197.96, \qquad \chi^2_{exp} \in W.


The result interpretation:
The experimental value χ²_exp belongs to the critical domain; on the significance level α = 0.05 it is not possible to accept the zero hypothesis H_0. The investigated selective statistical set SSS could not have been selected from the external set BSS. The quotient between σ and σ_0 is, on the significance level α = 0.05, statistically important (it can be noted that the value σ_0 is not an element of the 95% confidence interval in the case b2)).
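A short sketch reproducing both cases (the acceptance bounds are the table values χ²_49(0.975) and χ²_49(0.025) quoted above):

```python
def chi2_statistic(Sx, sigma0, n):
    """Experimental value of the one-selective chi-square test."""
    return (n - 1) * Sx ** 2 / sigma0 ** 2

Sx, n = 1.005, 50
W = (30.60, 70.22)                      # acceptance bounds from the statistical tables

chi2_b1 = chi2_statistic(Sx, 1.0, n)    # case b1), sigma0 = 1
chi2_b2 = chi2_statistic(Sx, 0.5, n)    # case b2), sigma0 = 0.5

accept_H0_b1 = W[0] < chi2_b1 < W[1]    # True  -> H0 accepted
accept_H0_b2 = W[0] < chi2_b2 < W[1]    # False -> falls into the critical domain
```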


c) Assigned example – testing the hypothesis about the equality of mean values

An observation of the export ability analogous to the assigned example (there the selective statistical set SSS_1 of n_1 = 50 enterprises was investigated with the result x̄_1 = 2.5) has led to the average export ability c1) x̄_2 = 2.6, c2) x̄_2 = 2.9 for n_2 = 100 enterprises (the variances were comparable, but the information about the variance size is missing, so it is needful to use the two-selective t-test). Determine whether this selective statistical set SSS_2 could be, for the significance level α = 0.05, selected from the same basic statistical set BSS as the set SSS_1.

The two-selective t-test will be used:

t_{exp} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(n_1-1)\, S_{x1}^2 + (n_2-1)\, S_{x2}^2}}\; \sqrt{\frac{n_1 n_2 (n_1 + n_2 - 2)}{n_1 + n_2}}, \qquad W = (-\infty; -t_{n_1+n_2-2}(\alpha/2)\rangle \cup \langle t_{n_1+n_2-2}(\alpha/2); \infty).


The formulation of the zero and alternative hypothesis: H_0: μ_1 = μ_2, H_a: μ_1 ≠ μ_2.

The determination of the critical values and the critical domain:

t_{148}(0.025) = 1.96, \qquad W = (-\infty; -1.96\rangle \cup \langle 1.96; \infty).


The calculation of the experimental value of the statistical criterion for the case c1):

t_{exp} = -0.574, \qquad t_{exp} \notin W.


The result interpretation:
The experimental value t_exp does not belong to the critical domain; it is possible to accept the zero hypothesis H_0 for the significance level α = 0.05. The investigated selective statistical set SSS_1 and the additional selective set SSS_2 could have been selected from one and the same external set BSS. The difference between μ_1 and μ_2 is statistically unimportant with the significance level α = 0.05.


The calculation of the experimental value of the statistical criterion for the case c2):

t_{exp} = -2.298, \qquad t_{exp} \in W.


The result interpretation:
The experimental value t_exp belongs to the critical domain; on the significance level α = 0.05 it is not possible to accept the zero hypothesis H_0. The investigated selective set SSS_1 and the additional selective set SSS_2 could not have been selected from one and the same external set BSS. The difference between μ_1 and μ_2 is statistically important with the significance level α = 0.05.
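The two-selective statistic can be reproduced as follows. The text does not quote the empirical variances here; assuming S_x1² = S_x2² = 1.01 (the value of S_x1² from example d)) reproduces the quoted values −0.574 and −2.298:

```python
from math import sqrt

def two_sample_t(x1, x2, S1sq, S2sq, n1, n2):
    """Experimental value of the two-selective (pooled) t-test."""
    pooled = sqrt((n1 - 1) * S1sq + (n2 - 1) * S2sq)
    return (x1 - x2) / pooled * sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2))

n1, n2, t_crit = 50, 100, 1.96

# Assumed empirical variances (see the lead-in above).
t_c1 = two_sample_t(2.5, 2.6, 1.01, 1.01, n1, n2)   # case c1)
t_c2 = two_sample_t(2.5, 2.9, 1.01, 1.01, n1, n2)   # case c2)
```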


d) Assigned example – testing the hypothesis about the equality of variances

An observation of the export ability analogous to the assigned example (there the selective statistical set SSS_1 of n_1 = 50 enterprises was investigated with the result S_x1² = 1.01) has led, for n_2 = 100 enterprises, to results which enabled the calculation of the variance d1) S_x2² = 1, d2) S_x2² = 1.631. Determine whether this selective statistical set SSS_2 could be, for the significance level α = 0.05, selected from the same basic statistical set BSS as the set SSS_1.

The two-selective F-test (with the right-sided critical domain W) will be used:

F_{exp} = \frac{S_{x1}^2}{S_{x2}^2}, \qquad W = \langle F_{n_1-1,\, n_2-1}(\alpha); \infty) \quad \text{for the case d1)},

F_{exp} = \frac{S_{x2}^2}{S_{x1}^2}, \qquad W = \langle F_{n_2-1,\, n_1-1}(\alpha); \infty) \quad \text{for the case d2)}.

The formulation of the zero and the right-sided alternative hypothesis:

H_0: σ_1 = σ_2, i.e. S_x1 = S_x2; H_a: σ_1 > σ_2, i.e. S_x1 > S_x2 (the case d1)),
H_0: σ_2 = σ_1, i.e. S_x2 = S_x1; H_a: σ_2 > σ_1, i.e. S_x2 > S_x1 (the case d2)).

The determination of the critical value and the right-sided critical domain:

F_{49,99}(0.05) = 1.545, \qquad W = \langle 1.545; \infty).


The calculation of the experimental value of the statistical criterion for the case d1):

F_{exp} = 1.01, \qquad F_{exp} \notin W.


The result interpretation:
The experimental value F_exp does not belong to the critical domain; it is possible to accept the zero hypothesis H_0 for the significance level α = 0.05. The investigated selective statistical set SSS_1 and the additional selective set SSS_2 could have been selected from one and the same external set BSS. The difference between S_x1² = 1.01 and S_x2² = 1 is statistically unimportant with the significance level α = 0.05.


The calculation of the experimental value of the statistical criterion for the case d2):

F_{exp} = 1.615, \qquad F_{exp} \in W.


The result interpretation:
The experimental value F_exp belongs to the critical domain; on the significance level α = 0.05 it is possible to refuse the zero hypothesis H_0. The investigated selective set SSS_1 and the additional selective set SSS_2 could not have been selected from one and the same external set BSS. The difference between S_x1² = 1.01 and S_x2² = 1.631 is statistically important with the significance level α = 0.05.
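Both cases reduce to a single division, with the larger squared standard deviation in the numerator as the remark in section 2.2.3 prescribes (the critical value is the table value quoted above):

```python
F_crit = 1.545                 # F_{49,99}(0.05), right-sided critical value

S1sq = 1.01                    # empirical variance of SSS1
# Case d1): the larger variance S_x1^2 = 1.01 goes into the numerator.
F_d1 = S1sq / 1.0
# Case d2): here S_x2^2 = 1.631 is larger, so it goes into the numerator.
F_d2 = 1.631 / S1sq

accept_H0_d1 = F_d1 < F_crit   # True  -> equal variances plausible
accept_H0_d2 = F_d2 < F_crit   # False -> Ha accepted
```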


2.3. Measurement of Statistical Dependences – Some Fundaments of Regression and Correlation Analysis



Goals:


Association investigation: Statistical dependence causal, non-causal

Association picture of selective statistical set: Regression analysis, Correlation analysis






Acquired concepts and knowledge pieces:


Simple and multiple selective statistical set


Statistical dependence


Simple and multiple regression dependence


Linear and nonlinear regression dependence


Regression analysis


Simple and multiple correlation


Correlation analysis


Pearson correlation coefficient







Check questions:


What is the difference between simple and multiple statistical set?


What is the statistical dependence?


What is the difference between simple and multiple regression and correlation analysis?


What are the basic tasks of regression analysis?


What are the basic tasks of correlation analysis?


What is the method of the least squares?


What is the normal equations system for simple linear and quadratic regression?


What is the difference between the Pearson correlation coefficient and the correlation index?






2.3.1. Delimitation of Problem

Hitherto a simple selective set SSS was investigated: only one statistical sign was explored for the statistical units of this set. The measurement of statistical dependences is connected with a multiple selective set SSS; more statistical signs will be explored simultaneously for the statistical units.

The statistical dependence between the signs x, s is given by an instruction which assigns exactly one empirical distribution of the frequencies of the statistical sign s (the values of the sign s have to show the character of a random variable) to the measured or entered values of the sign x (the values of the sign x, on the contrary, need not have the character of a random variable).

The simple (paired) regression dependence is then, in general, a one-sided dependence of the given random variable s on another variable x (not necessarily random); the point is an investigation of a two-dimensional selective statistical set SSS. The multi-dimensional (multiple) regression dependence is the dependence of the given random variable s on a larger number of other variables x, y, z, … (not necessarily random); the point is an investigation of a multiple set SSS.

The concept of correlation dependence is narrower than that of regression dependence. The simple (paired) correlation can be understood as the mutual dependence of two random variables (two statistical signs x, s) in which a change of the values of one statistical sign (either x or s) is associated with a change of the arithmetic mean deduced from the exploration of the second statistical sign (either s or x). In connection with the dependence of a larger number of random variables (statistical signs) it would be possible to define the multiple correlation analogously.

The definitions of regression and correlation dependence are different from the definitions of the functions of one or more variables, and thus from the functional dependences.

The part of mathematical statistics which deals with the study of regression and correlation dependences is called regression and correlation analysis.

The basic tasks of regression analysis consist in the detection of a suitable regression function for the expression of the observed dependence, in the point and interval estimation of the parameters and values of the theoretical regression function, and in the verification of the agreement of the regression function with the experimental data. According to the type of the appropriate theoretical regression function one can also speak about types of regression analysis, e.g. polynomial regression, exponential regression, logarithmic regression, hyperbolic regression and the like. The following explanation will be aimed at the seeking of suitable theoretical regression functions.

The basic tasks of correlation analysis consist in the measurement of correlation tightness (strength, intensity). The problems of simple linear and non-linear correlation are usually investigated, provided that the changes of the random variables x, s (statistical signs x, s) are correctly expressed by a linear or non-linear regression function. Also the investigation of multiple correlation works with the dependence description given by a regression function. The tasks of correlation analysis can then be transferred to the seeking of correlation coefficients as the basic measures of tightness of the given correlation type. In addition to the correlation coefficients associated with metric scales it is also essential to explore the coefficients of ordinal correlation, which work with ordinal scales. The following explanation will be aimed only at the use of a simple relation for the linear correlation coefficient.

On the basis of the reduction of the number of investigated statistical signs to two, the problem of regression dependences measurement can be described in a simplified form. A two-dimensional selective statistical set SSS is connected with the exploration of two statistical signs SS-x and SS-s. The metric scale with elements x_1, x_2, …, x_n is associated with the sign x (the elements of the scale were measured and the results of these measurements are given by the absolute frequencies of the individual elements); the measurement results s_1, s_2, …, s_n are then connected with the sign s (the absolute frequencies measured for the sign x are included in these results). In this way the measurement results are at disposal in the form of n ordered pairs |x_i, s_i|.

On the basis of the described simplification it is possible to use the method of least squares in measuring the dependence between the signs SS-x and SS-s (the condition is that the measurement errors of the sign SS-s, whose values show the character of a special random variable, have the zero mean value and the same, although unknown, finite variance). Let the theoretical regression function within the simple regression be generally described by an equation y = f(x). The sum of squares can then be expressed by the relation

S = Σ(s_i − y_i)²,

where y_i are the values of the function y = f(x) corresponding to the values x = x_i. The method of least squares then consists in the seeking of the regression function y = f(x) by means of the minimum value of the sum S.


2.3.2. Simple Linear and Quadratic Regression Analysis

The way of seeking the regression function will be described by means of the graphical delimitation of the problem in the figure Fig.5 Simple linear regression analysis. In this figure one works with n = 5 ordered pairs |x_i, s_i|, which characterize the statistical dependence between the statistical signs SS-x and SS-s. The scale elements x_1, x_2, …, x_5, connected with the statistical sign x, are deposited on the horizontal axis. The measurement results s_1, s_2, …, s_5 of the sign s (the absolute frequencies measured for the sign x are already included in these results) are deposited on the vertical axis. The ordered pairs |x_i, s_i| are the coordinates of five points A_1|x_1, s_1|, A_2|x_2, s_2|, A_3|x_3, s_3|, A_4|x_4, s_4|, A_5|x_5, s_5|. These 5 points graphically express the dependence between the signs SS-x and SS-s. The goal of simple linear regression analysis is to express this statistical dependence by the straight line whose analytical expression is given by the usual form y = b_0 + b_1·x for a polynomial function of the 1st order.



Fig.5 Simple linear regression analysis



The least squares method is aimed at the seeking of the minimum value of the expression S = Σ(s_i − y_i)², in which the summation index i acquires the values i = 1, 2, …, 5. For y_i one substitutes y_i = b_0 + b_1·x_i, and the minimum of the function S is sought, which is a function of the two variables b_0 and b_1, i.e. S = g(b_0, b_1).


The conditions for the seeking of the minimum are given by taking the partial derivatives of the function S with respect to both variables and by setting them equal to zero (readers interested in the exact seeking of extremes of functions of more variables may acquaint themselves with Sylvester's theorem from the area of mathematical analysis).

The conditions for the seeking of the minimum of the function S can be recorded in the form

∂S/∂b_0 = 0,   ∂S/∂b_1 = 0.

The obtained system of equations is called the system of normal equations for simple linear regression, and after carrying out the derivatives it acquires the known form

Σs_i = n·b_0 + b_1·Σx_i
Σs_i·x_i = b_0·Σx_i + b_1·Σx_i²

The summation index i generally acquires the values i = 1, 2, …, n. The values of the parameters b_0, b_1 can be obtained through the solution of the normal equations system, and then it is possible to record the straight-line equation y = b_0 + b_1·x. The predictions of values s_i corresponding to the relevant values x_i for i > 5 can then be done according to the figure Fig.5 through the obtained regression function. The predictions of the time trends or also the comparative trends would not be possible without the realization of linear regression analysis.

In an analogous way it is possible to explain the fundamentals of simple quadratic regression. In this case the investigated statistical dependence would be expressed by a polynomial function of the 2nd order, the graph of which is a parabola. The analytical expression y = f(x) of a parabola is given by the equation y = b_0 + b_1·x + b_2·x², and the method of least squares leads again to the seeking of the minimum of the function S = Σ(s_i − y_i)². This function S = h(b_0, b_1, b_2) is a function of three variables; for the discovery of the minimum, three partial derivatives are already needed, and setting them equal to zero leads to the normal equations system

∂S/∂b_0 = 0,   ∂S/∂b_1 = 0,   ∂S/∂b_2 = 0.

After carrying out the derivatives, the normal equations system for simple quadratic regression acquires the form

Σs_i = n·b_0 + b_1·Σx_i + b_2·Σx_i²
Σs_i·x_i = b_0·Σx_i + b_1·Σx_i² + b_2·Σx_i³
Σs_i·x_i² = b_0·Σx_i² + b_1·Σx_i³ + b_2·Σx_i⁴

The summation index i acquires the values i = 1, 2, …, 5 in the figure Fig.5, and in the general case the values i = 1, 2, …, n (in the case of quadratic regression the group of points A_1|x_1, s_1|, A_2|x_2, s_2|, A_3|x_3, s_3|, A_4|x_4, s_4|, A_5|x_5, s_5| should naturally map the progress of a parabola instead of a straight line). The values of the parameters b_0, b_1, b_2 can be obtained by the solution of the normal equations system, and then it is possible to record the parabola equation y = b_0 + b_1·x + b_2·x². The predictions of values s_i corresponding to the relevant values x_i for i > 5 can then be done according to the figure Fig.5 by means of the obtained regression function. The predictions of the time trends or also the comparative trends would not be possible without the realization of quadratic regression analysis.


2.3.3. Simple Linear and Quadratic Correlation Analysis

For the delimitation of the problem it is again possible to use the graphical way indicated by means of the figure Fig.5. After the realization of simple linear regression analysis (the result is indicated by the drawn straight line in Fig.5) it is possible to approach the determination of the tightness of the statistical dependence between the statistical signs SS-x and SS-s of the investigated selective statistical set SSS.

The most used measure of simple linear correlation tightness is the Pearson correlation coefficient k_xs. This coefficient is given by the relation

k_xs = S_xs / (S_x · S_s),

and it acquires values from the interval ⟨−1, +1⟩ (this conclusion can be easily deduced from the so-called Schwarz inequality). The values approaching +1 correspond with the case of positive correlation (the values of both statistical signs SS-x and SS-s increase or decrease at the same time; the figure Fig.5 is connected with this case). The values approaching −1 describe the negative correlation (while the values of one statistical sign are increasing, the values of the second sign are decreasing). The values around 0 indicate that the signs don't correlate (it is possible to express no collective trends in the increases or the decreases of the signs' values). The Pearson correlation coefficient as an empirical parameter has the character of a random variable and it can be used as a point estimation of the theoretical correlation coefficient.

In the relation for the Pearson correlation coefficient, the mixed central moment C_2(x,s) = S_xs of the 2nd order also occurs in addition to the usual standard deviations S_x and S_s (i.e. the square roots of the central moments C_2(x) and C_2(s)) connected with the investigation of the statistical signs SS-x and SS-s. The mixed central moment of the 2nd order is defined by the relation (k is the number of scale elements for both statistical signs)

S_xs = (1/n) · Σ n_i·(x_i − O_x)·(s_i − O_s),

where n_i are the absolute frequencies, O_x and O_s are the arithmetic means, and the summation index i commonly acquires the values i = 1, 2, …, k.


Apart from the Pearson correlation coefficient, other quantities are also used for the measurement of simple linear correlation tightness (e.g. the size of the smaller of the angles included by the associated regression straight lines, or the determination coefficient). The index of correlation is used for the measurement of simple quadratic correlation (the statistical dependence is expressed by a quadratic regression function). The relation for the correlation index can be used also for the investigation of other simple non-linear correlations; within this relation it is only necessary to substitute the used regression function instead of the quadratic regression function.
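The relations above can be sketched in Python. The paired data below are illustrative, and unit absolute frequencies n_i = 1 are assumed for simplicity:

```python
import math

# Paired values of the signs SS-x and SS-s (illustrative data, n_i = 1).
xs = [1, 2, 3, 4, 5]
ss = [1.5, 2.4, 3.1, 4.3, 4.7]

n = len(xs)
Ox = sum(xs) / n                      # arithmetic mean of x
Os = sum(ss) / n                      # arithmetic mean of s

# Standard deviations S_x, S_s (square roots of the central moments C_2).
Sx = math.sqrt(sum((x - Ox) ** 2 for x in xs) / n)
Ss = math.sqrt(sum((s - Os) ** 2 for s in ss) / n)

# Mixed central moment of the 2nd order S_xs.
Sxs = sum((x - Ox) * (s - Os) for x, s in zip(xs, ss)) / n

k_xs = Sxs / (Sx * Ss)                # Pearson correlation coefficient
print(f"k_xs = {k_xs:.3f}")
assert -1.0 <= k_xs <= 1.0            # Schwarz inequality bound
```

For the illustrative data the coefficient comes out close to +1, i.e. a tight positive correlation.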


2.3.4. Illustration of Dependence Measurement

a) Simple linear regression

The observation of the economical state within the assigned example (the selective statistical set SSS with the extent n = 50 enterprises was investigated; the statistical sign SS-x export ability was explored for the enterprises) was connected with the observation of the second statistical sign SS-s on the basis of the use of an analogous metric scale (the scale element 1 corresponds with the best value; the elementary statistical processing was realized). The determined values x_i (the development degrees) and s_i (the evaluation of a suitable parameter of the economical state) are presented in the table. The goal is to estimate the type of regression dependence of both statistical signs, to express it by a suitable regression function and to determine the tightness of the correlation by means of a suitable coefficient.



The sign SS-x: values x_i    1     2     3     4     5
The sign SS-s: values s_i    1.8   2.2   3.8   4.2   4.6


The estimated type of regression dependence:
The simple linear regression expressed by the regression straight line y = b_0 + b_1·x

The system of normal equations for the linear regression:

Σs_i = n·b_0 + b_1·Σx_i
Σs_i·x_i = b_0·Σx_i + b_1·Σx_i²

The system of normal equations for the concrete case:

5b_0 + 15b_1 = 16.6
15b_0 + 55b_1 = 57.4

The discovery of the regression function:

y = 1.04 + 0.76·x

The investigation of trends:
After the substitution of the sign SS-x value x_i = 6 it is possible to calculate the corresponding value s_i = 5.60 of the sign SS-s (on the basis of the greater degree of development it is possible to calculate the increased value of the relevant parameter of the economical state).

The calculation of the correlation coefficient:
- The values given by the elementary statistical processing of both statistical signs are equal to S_s = 1.166, O_s = 3.02, S_x = 1.015, O_x = 2.5
- The calculation of the mixed central moment of the 2nd order gives the value S_xs = 0.763
- The substitution into the relation for the Pearson coefficient enables one to determine the correlation tightness k_xs = S_xs / (S_x·S_s) = 0.645
- The interpretation of the result: tight positive correlation


b) Simple quadratic regression

The observation of the economical state within the assigned example (the selective statistical set SSS with the extent n = 50 enterprises was investigated; the statistical sign SS-x export ability was explored for the enterprises) was connected with the observation of the second statistical sign SS-s. This sign was described by the percentage expression in association with an analogous metric scale. The determined values x_i (the development degrees) and s_i (the percentage evaluation of a suitable parameter of the economical state) are presented in the table. The goal is to estimate the type of regression dependence of both statistical signs and to express it by a suitable regression function.

The sign SS-x: values x_i    1      2      3     4     5
The sign SS-s: values s_i    20 %   10 %   6 %   2 %   2 %


The estimated type of regression dependence:
The simple quadratic regression expressed by the regression parabola y = b_0 + b_1·x + b_2·x²

The system of normal equations for the quadratic regression:

Σs_i = n·b_0 + b_1·Σx_i + b_2·Σx_i²
Σs_i·x_i = b_0·Σx_i + b_1·Σx_i² + b_2·Σx_i³
Σs_i·x_i² = b_0·Σx_i² + b_1·Σx_i³ + b_2·Σx_i⁴


The system of normal equations for the concrete case:

x_i   x_i²   x_i³   x_i⁴   s_i   s_i·x_i   s_i·x_i²
1     1      1      1      20    20        20
2     4      8      16     10    20        40
3     9      27     81     6     18        54
4     16     64     256    2     8         32
5     25     125    625    2     10        50
Σ 15  55     225    979    40    76        196


5b_0 + 15b_1 + 55b_2 = 40
15b_0 + 55b_1 + 225b_2 = 76
55b_0 + 225b_1 + 979b_2 = 196

The discovery of the regression function:
- First, the adjustment of the relevant matrices (through the achievement of zero elements under the main diagonal) will be carried out:

5  15   55 |  40        5  15  55 |  40        5  15  55 |  40
15 55  225 |  76   →    0  10  60 | −44   →    0  10  60 | −44
55 225 979 | 196        0  60 374 | −244       0   0  14 |  20

- On the basis of the adjusted matrices it is possible to carry out the calculation of the values of the coefficients b_0, b_1, b_2:

b_2 ≈ 1.43, b_1 ≈ −12.97, b_0 = 31.2

- By substitution into the general equation of a parabola it is possible to obtain the analytical expression of the regression parabola y = 1.43x² − 12.97x + 31.2 and, after the adjustment, the form y ≈ 1.43·(x − 4.54)² + 1.76. From here the coordinates V[4.54; 1.76] of the vertex of the parabola are evident

- Now the graph of the regression parabola can already be constructed as a result of the realized simple quadratic regression analysis

The investigation of trends:
The corresponding value s_i ≈ 25.1 % of the sign SS-s can be calculated on the basis of the substitution of the sign SS-x value x_i = 0.5 (from a very high degree of export ability it is possible to calculate a high value of the relevant parameter of the economical state).

[Graph of the regression parabola over the scale elements x = 1, …, 5]
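The elimination can be checked numerically. This Python sketch solves the system assembled directly from the column sums of the table (with Σx_i⁴ = 979) by Gaussian elimination with back substitution:

```python
def solve3(A, b):
    """Solve a 3x3 linear system A*x = b by Gaussian elimination with
    back substitution (no pivoting; fine for this well-behaved system)."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(2):                             # forward elimination
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    x = [0.0, 0.0, 0.0]                              # back substitution
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

# Normal equations of the quadratic regression example.
A = [[5, 15, 55], [15, 55, 225], [55, 225, 979]]
b = [40, 76, 196]
b0, b1, b2 = solve3(A, b)
print(f"y = {b2:.2f}*x^2 + {b1:.2f}*x + {b0:.2f}")
```

The routine is a direct transcription of the matrix adjustment shown above: the forward pass produces the zero elements under the main diagonal, and the backward pass recovers b_2, b_1, b_0 in turn.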

Part 3. Applications

3.1. Description of Statistical and Probability Base of Financial Options


3.1.1. Introduction

An imperative of data mining and a need of cooperation of the human with today's computers are emphasized by D.A. Keim (Keim, 2002):

The progress made in hardware technology allows today's computer systems to store very large amounts of data. Researchers from the University of Berkeley estimate that every year 1 Exabyte (= 1 Million Terabyte) of data are generated, of which a large portion is available in digital form. This means that in the next three years more data will be generated than in all of human history before.
If the data is presented textually, the amount of data which can be displayed is in the range of one hundred data items, but this is like a drop in the ocean when dealing with data sets containing millions of data items.
For data mining to be effective, it is important to include the human in the data exploration process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today's computers.

Financial derivatives are derivative contracts in which the underlying securities are financial instruments such as stocks, bonds or an interest rate. An important constituent of financial derivatives is created by financial options. The statistical and probability base of financial options can be processed exactly.

The Black-Scholes model observes the evolution of the option's key underlying variables in continuous time. The Binomial and Trinomial models (the simplest variants of the Multinomial model) observe the evolution of the option's key underlying variables in discrete time.

The statistical and probability base of financial options is connected, above all, with the Black-Scholes model and the Multinomial model. These statistical and probability applications will be described by means of the data mining approach.





3.1.2. Financial Options
(quoted according to www.economywatch.com)

Financial options are those derivative contracts in which the underlying assets are
financial instruments such as stocks, bonds or an interest rate. The options on financial
instruments provide a buyer with the right to either buy or sell the underlying financial
instruments at a specified price on a specified future date. Although the buyer gets the rights
to buy or sell the underlying instruments, there is no obligation to exercise this option. However, the seller of the contract is under an obligation to buy or sell the underlying instruments if the option is exercised.
Two types of financial options exist, namely call options and put options. Under a call
option, the buyer of the contract gets the right to buy the financial instrument at the specified
price at a future date, whereas a put option gives the buyer the right to sell the same at the
specified price at the specified future date. The price that is paid by the buyer to the seller for
using this level of flexibility is called the premium (the fair price). The prescribed future price
is called the strike price.

The theoretical calculation of premium is connected namely with both the Black-
Scholes model (continuous statistical model based on normal distribution) and the Binomial
or Trinomial model (discrete statistical models based on binomial or trinomial distribution).
Financial options are either traded in an organized stock exchange or over-the-counter. The exchange-traded options are known as standardized options. The options exchange is responsible for this standardization. This is done by specifying the quantity of the underlying financial instrument, its price and the future date of expiration. The details of these specifications may vary from exchange to exchange; however, the broad outlines are similar.
Financial options are used either to hedge against risks, by buying contracts that will pay out if something with negative financial consequences happens, or to allow traders to magnify their profits while limiting the downside risk.
Financial options involve the risk of losing some or all of the contract price if the market moves against the expected trend, and counterparty risk, such as broker insolvency or contractors who do not fulfil their contractual obligations.





3.1.3. Statistical and Probability Base of Black-Scholes Model
(quoted according to mars.wiwi.hu-berlin.de/ebooks/html/sfe/sfenode41.html. and
Zaskodny,P., Pavlat,V., Budik,J. (2007). Financial Derivates and Their Evaluation, Prague,
Czech Republic: University of Finance and Administration)

The Black-Scholes model observes the evolution of the option's key underlying variables in continuous time. This is done by means of both the standard normal probability densities φ(d_1), φ(d_2) and the standard normal distribution functions N(d_1), N(d_2).
The variables d_1, d_2 are connected with the Spot price S, Strike price X, Risk-Free Rate r, Annual Dividend d, Time to Maturity τ, and Volatility σ.

The basic formulas for the Black-Scholes model (the Value Function Fair Price for a call option is marked C( ), the Value Function Fair Price for a put option is marked P( )):

C( ) = S·e^(−d·τ)·N(d_1) − X·e^(−r·τ)·N(d_2),   P( ) = X·e^(−r·τ)·N(−d_2) − S·e^(−d·τ)·N(−d_1)

d_1 = [ln(S/X) + (r − d + σ²/2)·τ] / (σ·√τ),   d_2 = d_1 − σ·√τ

N(d_1) = ∫_{−∞}^{d_1} φ(x) dx,   N(d_2) = ∫_{−∞}^{d_2} φ(x) dx

φ(x) = (1/√(2π))·e^(−x²/2)
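These formulas translate directly into code. The following Python sketch uses math.erf for the standard normal distribution function N; the input values are purely illustrative:

```python
import math

def bs_price(S, X, r, d, sigma, tau):
    """Black-Scholes fair prices of a European call and put
    with continuous dividend yield d and time to maturity tau."""
    N = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    d1 = (math.log(S / X) + (r - d + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    call = S * math.exp(-d * tau) * N(d1) - X * math.exp(-r * tau) * N(d2)
    put = X * math.exp(-r * tau) * N(-d2) - S * math.exp(-d * tau) * N(-d1)
    return call, put

C, P = bs_price(S=100.0, X=100.0, r=0.05, d=0.0, sigma=0.2, tau=1.0)
print(f"C = {C:.4f}, P = {P:.4f}")
# Put-call parity check: C - P = S*e^(-d*tau) - X*e^(-r*tau)
assert abs((C - P) - (100.0 - 100.0 * math.exp(-0.05))) < 1e-9
```

The parity check at the end follows directly from the two pricing formulas and is a convenient sanity test for any implementation.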


3.1.4. Statistical and Probability Base of Binomial and Trinomial Model
(quoted according to mars.wiwi.hu-berlin.de/ebooks/html/sfe/sfenode41.html. and
Zaskodny,P., Pavlat,V., Budik,J. (2007). Financial Derivates and Their Evaluation, Prague,
Czech Republic: University of Finance and Administration)

The Binomial model observes the evolution of the option's key underlying variables in discrete time. This is done by means of a binomial tree, for a number of time steps between the valuation and expiration dates (the number of time steps is marked n). Each node in the tree represents a possible price of the underlying at a given point in time.
At each step, it is assumed that the underlying instrument will move up or down by a specific factor (u or d) per step of the tree (where, by definition, u ≥ 1 and 0 < d ≤ 1). So, if S is the spot price, then in the next period the price will be either S_up = S·u or S_down = S·d.
The number of up factors is marked j, the number of down factors is n−j.
X is the Strike price and S is the Spot price of the underlying security.
Under the risk-neutrality assumption, today's fair price of a derivative is equal to the expected value of its future payoff discounted by the risk-free rate. Therefore, the expected value is calculated using the option values from the later two nodes (Option up and Option down) weighted by their respective probabilities: the "probability" p of an up move in the underlying and the "probability" (1 − p) of a down move. The expected value is then discounted at q, the risk-free growth factor corresponding with the life of the option, with p = (q − d)/(u − d).
The basic formulas for the Binomial model (the Value Function Fair Price for a call option is marked C( ), the Value Function Fair Price for a put option is marked P( )):

C( ) = (1/qⁿ) · Σ_{j=0..n} Π_j·C_j,   C_j = max(0, S_j − X)

P( ) = (1/qⁿ) · Σ_{j=0..n} Π_j·P_j,   P_j = max(0, X − S_j)

Π_j = (n over j)·p^j·(1 − p)^(n−j)

S_j = u^j·d^(n−j)·S

(n over k) = n! / (k!·(n−k)!),   m! = 1·2·…·m

p = (q − d)/(u − d),   1 − p = (u − q)/(u − d)
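The binomial sums translate into a short Python sketch. Here q is interpreted as the per-step risk-free growth factor, and the values of S, X, u, d, q, n are illustrative:

```python
from math import comb

def binomial_price(S, X, u, d, q, n):
    """Discounted expected payoff over the binomial tree:
    C = (1/q**n) * sum_j Pi_j * max(0, S_j - X), with S_j = S * u**j * d**(n-j)."""
    p = (q - d) / (u - d)                 # risk-neutral "probability" of an up move
    call = put = 0.0
    for j in range(n + 1):
        Pi_j = comb(n, j) * p**j * (1 - p)**(n - j)
        S_j = S * u**j * d**(n - j)       # terminal price after j up moves
        call += Pi_j * max(0.0, S_j - X)
        put += Pi_j * max(0.0, X - S_j)
    return call / q**n, put / q**n        # discount by q^n

C, P = binomial_price(S=100.0, X=100.0, u=1.1, d=0.9, q=1.02, n=3)
print(f"C = {C:.4f}, P = {P:.4f}")
```

As in the continuous case, the discrete prices satisfy put-call parity, here in the form C − P = S − X/qⁿ.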

The Trinomial model observes the evolution of the option's key underlying variables in discrete time. This is done by means of a trinomial tree, for a number of time steps between the valuation and expiration dates (the number of time steps is marked n). Each node in the tree represents a possible price of the underlying at a given point in time.
The fair price can be determined numerically. The Binomial model after Cox-Ross-Rubinstein can be used. In this section a less complex but numerically efficient approach based on trinomial trees will be introduced. It is related to the classical numerical procedures for solving partial differential equations, which are also used to solve the Black-Scholes differential equations.
The Trinomial model follows the procedure of the binomial model, whereby the price at each time step can change in three instead of two directions.
At each step, it is assumed that the underlying instrument will move up or down by a specific factor (e.g. two up factors u_1, u_2 and one down factor d) per step of the tree (where, by definition, u_1, u_2 ≥ 1 and 0 < d ≤ 1). So, if S is the Spot price, then in the next period the price will be either S_u1 = S·u_1, S_u2 = S·u_2 or S_d = S·d. The probabilities with which the price moves from S to S_u1, S_u2, S_d are represented as p_1, p_2, p_3 (p_1 + p_2 + p_3 = 1).
The number of u_1 factors is marked j, the number of u_2 factors is marked i, and the number of d factors is n−j−i.
The basic formulas for the Trinomial model (the Value Function Fair Price for a call option is marked C( ), the Value Function Fair Price for a put option is marked P( )):

C( ) = (1/qⁿ) · Σ_{i=0..n} Σ_{j=0..n} Π_ij·C_ij,   i + j ≤ n,   C_ij = max(0, S_ij − X)

S_ij = u_1^j·u_2^i·d^(n−i−j)·S,   i + j ≤ n

Π_ij = (n over i,j)·p_1^j·p_2^i·(1 − p_1 − p_2)^(n−i−j)

Σ_{i=0..n} Σ_{j=0..n} Π_ij = 1,   i + j ≤ n

(n over i,j) = n! / (i!·j!·(n−i−j)!)

(For a put option, analogously P_ij = max(0, X − S_ij).)
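The trinomial double sum can be transcribed analogously. The factors u_1, u_2, d and probabilities p_1, p_2 below are illustrative choices satisfying the stated constraints:

```python
from math import factorial

def trinomial_call(S, X, u1, u2, d, p1, p2, q, n):
    """Discounted expected call payoff over the trinomial tree:
    j counts u1 moves, i counts u2 moves, n-i-j counts d moves (i + j <= n)."""
    p3 = 1.0 - p1 - p2
    total = 0.0
    for i in range(n + 1):
        for j in range(n + 1 - i):        # enforce i + j <= n
            multi = factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))
            Pi_ij = multi * p1**j * p2**i * p3**(n - i - j)
            S_ij = S * u1**j * u2**i * d**(n - i - j)
            total += Pi_ij * max(0.0, S_ij - X)
    return total / q**n

C = trinomial_call(S=100.0, X=100.0, u1=1.2, u2=1.1, d=0.9,
                   p1=0.3, p2=0.3, q=1.02, n=3)
print(f"C = {C:.4f}")
```

The inner loop bound range(n + 1 - i) implements the constraint i + j ≤ n, so the weights Π_ij sum to 1 by the multinomial theorem.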



3.1.5. Statistical and Probability Data Mining Tools: Normal, Binomial and Trinomial Distribution


a) Standard normal probability density φ(x) and standard normal distribution function N(x)

φ(x) = (1/√(2π))·e^(−x²/2),   N(x) = ∫_{−∞}^{x} φ(t) dt

b) Binomial and Trinomial probability function

Π_j = (n over j)·p^j·(1 − p)^(n−j),   Π_ij = (n over i,j)·p_1^j·p_2^i·(1 − p_1 − p_2)^(n−i−j)


3.1.6. Conclusion

The statistical and probability base of financial options as a part of statistical data mining
tools is created by
- Normal distribution,
- Binomial distribution,
- Trinomial distribution.
































3.2. Description of Statistical and Probability Base of Greeks


3.2.1. Introduction

In mathematical finance, the Greeks are the quantities representing the sensitivities of derivatives, such as options, to a change in underlying parameters on which the value function of an instrument or portfolio of financial instruments is dependent. The name is used because the most common of these sensitivities are often denoted by Greek letters.
The Greeks in the Black-Scholes model are relatively easy to calculate (a desirable property of financial models) and are very useful for derivatives traders, especially those who seek to hedge their portfolios against unfavourable changes in market conditions. For this reason, those Greeks which are particularly important for hedging (Delta, Gamma and Vega) are well-defined for measuring changes in Price, Time and Volatility.

The statistical and probability base of financial options is also connected with the Greeks. These statistical applications will be described by means of the data mining approach.


3.2.2. Greeks
(quoted according to http://en.wikipedia.org/wiki/Greeks_(finance) )

The Greeks are the quantities describing the sensitivities of financial options to
a change in underlying parameters on which the fair price (the value function) of an
instrument or portfolio of financial instruments is dependent. Collectively these have also
been called the Risk Sensitivities, Risk Measures or Hedge Parameters.
The Greeks are vital tools in Risk Management. Each Greek measures the sensitivity
of the fair price (the value function) of a financial instrument or portfolio to a small change in
a given underlying parameter, so that component risks may be treated in isolation, and the
portfolio rebalanced accordingly to achieve a desired state (see for example Delta Hedging).

According to 3.2.1, the Greeks in the Black-Scholes model are relatively easy to calculate (a desirable property of financial models) and are very useful for derivatives traders, especially those who seek to hedge their portfolios against adverse changes in market conditions. For this reason, those Greeks which are particularly important for hedging (Delta, Gamma and Vega) are well-defined for measuring changes in Price, Time and Volatility.
The most common of the Greeks are the first-order derivatives Delta, Dual Delta, Vega, Theta and Rho, as well as Gamma, a second-order derivative of the fair price (value function). Although Rho is a primary input into the Black-Scholes model, the overall impact on the fair price (the value function) of an option corresponding with changes in the risk-free rate is generally insignificant, and therefore higher-order derivatives involving the risk-free interest rate are not common.
The most used second-order Greeks are Gamma, Dual Gamma, Vomma, Vanna, Charm and DvegaDtime. The most used third-order Greeks are Speed, Zomma, Color and Ultima.
The Greeks in the Binomial model observe the evolution of the option's key underlying variables in discrete time. The most used of these Greeks are Delta and Gamma; they are well-defined for Hedging Delta and Gamma.
The most common of the Greeks in the Black-Scholes and Binomial models are Delta, Vega, Theta and Gamma. The most used kinds of Option Hedging are Hedging Delta and Gamma. The remaining sensitivities (and the hedging connected with them) are common enough that they have common names, but this list is by no means exhaustive.


3.2.3. Value Function
(quoted according to Záškodný, P., Havlíček, I., Budinský, P. (2010-2011), Partial Data Mining Tools in Statistics Education in Greeks and Option Hedging. In: Tarábek, P., Záškodný, P. (2010-2011), Educational and Didactic Communication 2010, Bratislava, Slovak Republic: Didaktis, www.didaktis.sk.)

According to 3.1.2, financial options are those derivative contracts in which the underlying assets are financial instruments such as stocks, bonds or an interest rate. The options on financial instruments provide a buyer with the right to either buy or sell the underlying financial instruments at a specified price on a specified future date. Although the buyer gets the right to buy or sell the underlying instruments, there is no obligation to exercise this option. However, the seller of the contract is under an obligation to buy or sell the underlying instruments if the option is exercised.

According to 3.1.2, two types of financial options exist, namely call options and put options. Under a call option, the buyer of the contract gets the right to buy the financial instrument at the specified price at a future date, whereas a put option gives the buyer the right to sell the same at the specified price at the specified future date. The price that is paid by the buyer to the seller for this level of flexibility is called the premium (the fair price, the value function). The prescribed future price is called the strike price.

The theoretical calculation of the premium is connected mainly with the Black-
Scholes model (a continuous statistical model based on the normal distribution) and with the
Binomial and Trinomial models (discrete statistical models based on the binomial or trinomial
distribution). In this explanation, priority will be given to the Black-Scholes model.

The Black-Scholes model traces the evolution of the option's key underlying variables
in continuous time. This is done by means of both the standard normal probability densities
$\varphi(d_1)$, $\varphi(d_2)$ and the standard normal distribution functions $N(d_1)$, $N(d_2)$.

The variables $d_1$, $d_2$ are connected with the spot price $S$, the strike price $X$, the risk-free rate $r$,
the time to maturity $\tau$, the volatility $\sigma$, and the annual dividend yield $d$.

The value function $V$ (the fair price, the premium) can be expressed as a function of five
quantities, $V = f(S, X, r, \tau, \sigma)$.

The basic formulas of the Black-Scholes model (the value function / fair price for a call
option is denoted $C(\tau)$, for a put option $P(\tau)$):

$$C(\tau) = S\,e^{-d\tau} N(d_1) - X\,e^{-r\tau} N(d_2), \qquad P(\tau) = X\,e^{-r\tau} N(-d_2) - S\,e^{-d\tau} N(-d_1)$$

$$d_1 = \frac{\ln\frac{S}{X} + \left(r - d + \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}}, \qquad d_2 = d_1 - \sigma\sqrt{\tau}$$

$$\varphi(d_1) = \frac{1}{\sqrt{2\pi}}\,e^{-d_1^2/2}, \qquad \varphi(d_2) = \frac{1}{\sqrt{2\pi}}\,e^{-d_2^2/2}$$

$$N(d_1) = \int_{-\infty}^{d_1} \varphi(t)\,dt, \qquad N(d_2) = \int_{-\infty}^{d_2} \varphi(t)\,dt$$
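For a concrete reading of these formulas, the following minimal sketch (function names are ours) evaluates C and P with Python's `statistics.NormalDist` as N(·), assuming a continuous annual dividend yield d:

```python
import math
from statistics import NormalDist

N = NormalDist().cdf          # standard normal distribution function N(.)

def d1_d2(S, X, r, d, sigma, tau):
    """The variables d1, d2 of the Black-Scholes model."""
    d1 = (math.log(S / X) + (r - d + sigma**2 / 2) * tau) / (sigma * math.sqrt(tau))
    return d1, d1 - sigma * math.sqrt(tau)

def call_price(S, X, r, d, sigma, tau):
    """Value function C (fair price of a call)."""
    d1, d2 = d1_d2(S, X, r, d, sigma, tau)
    return S * math.exp(-d * tau) * N(d1) - X * math.exp(-r * tau) * N(d2)

def put_price(S, X, r, d, sigma, tau):
    """Value function P (fair price of a put)."""
    d1, d2 = d1_d2(S, X, r, d, sigma, tau)
    return X * math.exp(-r * tau) * N(-d2) - S * math.exp(-d * tau) * N(-d1)
```

For S = X = 100, r = 5 %, d = 0, sigma = 20 % and tau = 1, the call price is about 10.45, and the put-call parity C - P = S e^{-d tau} - X e^{-r tau} holds exactly.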

3.2.4. Segmentation and Definitions of Greeks

a) Greeks of first order

The speeds of value function change:

$$\Delta = \frac{\partial V}{\partial S}, \qquad \Delta_{Dual} = \frac{\partial V}{\partial X}, \qquad \nu\ (\mathrm{vega}) = \frac{\partial V}{\partial \sigma}, \qquad \Theta = \frac{\partial V}{\partial \tau}, \qquad \rho = \frac{\partial V}{\partial r}$$

b) Greeks of individual second order

The accelerations of value function change & the speeds of first order greeks change:

$$\Gamma = \frac{\partial^2 V}{\partial S^2}, \qquad \Gamma_{Dual} = \frac{\partial^2 V}{\partial X^2}, \qquad \mathrm{Vomma} = \frac{\partial^2 V}{\partial \sigma^2}, \qquad \frac{\partial^2 V}{\partial \tau^2}\ (\text{out of use}), \qquad \frac{\partial^2 V}{\partial r^2}\ (\text{out of use})$$


c) Greeks of combined second order

The speeds of first order greeks change:

$$\mathrm{Vanna} = \frac{\partial^2 V}{\partial S\,\partial \sigma}, \qquad \mathrm{Charm} = \frac{\partial^2 V}{\partial S\,\partial \tau}, \qquad \mathrm{DvegaDtime} = \frac{\partial^2 V}{\partial \sigma\,\partial \tau}$$




d) Greeks of third order

The speeds of second order greeks change:

$$\mathrm{Speed} = \frac{\partial^3 V}{\partial S^3}, \qquad \mathrm{Zomma} = \frac{\partial^3 V}{\partial S^2\,\partial \sigma}, \qquad \mathrm{Color} = \frac{\partial^3 V}{\partial S^2\,\partial \tau}, \qquad \mathrm{Ultima} = \frac{\partial^3 V}{\partial \sigma^3}$$
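These definitions can be checked numerically: the sketch below (names and parameter values are ours) prices a call with the Black-Scholes value function of 3.2.3 and approximates the first-order Greeks as central finite differences in the corresponding variable. Note that with tau as time to maturity the derivative dV/dtau comes out positive; the market convention usually quotes Theta as the calendar-time derivative, i.e. with the opposite sign.

```python
import math
from statistics import NormalDist

N = NormalDist().cdf

def call_price(S, X, r, d, sigma, tau):
    """Black-Scholes value function C from 3.2.3."""
    d1 = (math.log(S / X) + (r - d + sigma**2 / 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * math.exp(-d * tau) * N(d1) - X * math.exp(-r * tau) * N(d2)

def central_diff(f, x, h=1e-4):
    """First derivative of f at x via a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

S, X, r, d, sigma, tau = 100.0, 100.0, 0.05, 0.0, 0.2, 1.0

# each first-order Greek is the partial derivative of V in one variable
delta = central_diff(lambda s: call_price(s, X, r, d, sigma, tau), S)   # dV/dS
vega = central_diff(lambda v: call_price(S, X, r, d, v, tau), sigma)    # dV/dsigma
theta = central_diff(lambda t: call_price(S, X, r, d, sigma, t), tau)   # dV/dtau
rho = central_diff(lambda q: call_price(S, X, q, d, sigma, tau), r)     # dV/dr
```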


3.2.5. Indications of Greeks

a) Greeks of First Order

$$\Delta = \frac{\partial V}{\partial S} = \mathrm{DvalueDspot}, \qquad \Delta_{Dual} = \frac{\partial V}{\partial X} = \mathrm{DvalueDstrike}, \qquad \nu = \frac{\partial V}{\partial \sigma} = \mathrm{Vega} = \mathrm{DvalueDvol},$$
$$\Theta = \frac{\partial V}{\partial \tau} = \mathrm{DvalueDtime}, \qquad \rho = \frac{\partial V}{\partial r} = \mathrm{DvalueDrate}$$

b) Greeks of Second Order

$$\Gamma = \frac{\partial^2 V}{\partial S^2} = \frac{\partial \Delta}{\partial S} = \mathrm{DdeltaDspot}, \qquad \Gamma_{Dual} = \frac{\partial^2 V}{\partial X^2} = \frac{\partial \Delta_{Dual}}{\partial X} = \mathrm{DdualdeltaDstrike},$$
$$\mathrm{Vomma} = \frac{\partial^2 V}{\partial \sigma^2} = \frac{\partial \nu}{\partial \sigma} = \mathrm{DvegaDvol},$$
$$\mathrm{Vanna} = \frac{\partial^2 V}{\partial S\,\partial \sigma} = \frac{\partial \Delta}{\partial \sigma} = \frac{\partial \nu}{\partial S} = \mathrm{DdeltaDvol} = \mathrm{DvegaDspot},$$
$$\mathrm{Charm} = \frac{\partial^2 V}{\partial S\,\partial \tau} = \frac{\partial \Delta}{\partial \tau} = \frac{\partial \Theta}{\partial S} = \mathrm{DdeltaDtime} = \mathrm{D(theta)Dspot},$$
$$\mathrm{DvegaDtime} = \frac{\partial^2 V}{\partial \sigma\,\partial \tau} = \frac{\partial \nu}{\partial \tau} = \frac{\partial \Theta}{\partial \sigma} = \mathrm{D(theta)Dvol}$$



c) Greeks of Third Order

$$\mathrm{Speed} = \frac{\partial^3 V}{\partial S^3} = \frac{\partial \Gamma}{\partial S} = \mathrm{DgammaDspot}, \qquad \mathrm{Zomma} = \frac{\partial^3 V}{\partial S^2\,\partial \sigma} = \frac{\partial \Gamma}{\partial \sigma} = \mathrm{DgammaDvol},$$
$$\mathrm{Color} = \frac{\partial^3 V}{\partial S^2\,\partial \tau} = \frac{\partial \Gamma}{\partial \tau} = \mathrm{DgammaDtime}, \qquad \mathrm{Ultima} = \frac{\partial^3 V}{\partial \sigma^3} = \frac{\partial\,\mathrm{Vomma}}{\partial \sigma} = \mathrm{DvommaDvol}$$



3.2.6. Formulas for Greeks (CO - Call Option, PO - Put Option)

a) Formulas for Delta Greek $\Delta$
$$\Delta_{CO} = e^{-d\tau} N(d_1), \qquad \Delta_{PO} = -e^{-d\tau} N(-d_1)$$

b) Formulas for Dual Delta Greek $\Delta_{Dual}$
$$\Delta_{Dual,CO} = -e^{-r\tau} N(d_2), \qquad \Delta_{Dual,PO} = e^{-r\tau} N(-d_2)$$

c) Formulas for Vega Greek $\nu$
$$\nu_{CO,PO} = S\,e^{-d\tau}\,\varphi(d_1)\sqrt{\tau} = X\,e^{-r\tau}\,\varphi(d_2)\sqrt{\tau}$$

d) Formulas for Theta Greek $\Theta$
$$\Theta_{CO} = -\frac{S\,\varphi(d_1)\,\sigma\,e^{-d\tau}}{2\sqrt{\tau}} - rXe^{-r\tau}N(d_2), \qquad \Theta_{PO} = -\frac{S\,\varphi(d_1)\,\sigma\,e^{-d\tau}}{2\sqrt{\tau}} + rXe^{-r\tau}N(-d_2)$$

e) Formulas for Rho Greek $\rho$
$$\rho_{CO} = X\tau\,e^{-r\tau}N(d_2), \qquad \rho_{PO} = -X\tau\,e^{-r\tau}N(-d_2)$$

f) Formula for Gamma Greek $\Gamma$
$$\Gamma_{CO,PO} = e^{-d\tau}\,\frac{\varphi(d_1)}{S\sigma\sqrt{\tau}}$$

g) Formula for Dual Gamma Greek $\Gamma_{Dual}$
$$\Gamma_{Dual\,(CO,PO)} = e^{-r\tau}\,\frac{\varphi(d_2)}{X\sigma\sqrt{\tau}}$$

i) Formulas for Vomma Greek
$$\mathrm{Vomma}_{CO,PO} = S\,e^{-d\tau}\varphi(d_1)\sqrt{\tau}\;\frac{d_1 d_2}{\sigma} = \nu\,\frac{d_1 d_2}{\sigma}$$

j) Formulas for Vanna Greek
$$\mathrm{Vanna}_{CO,PO} = -e^{-d\tau}\varphi(d_1)\,\frac{d_2}{\sigma} = \frac{\nu}{S}\left(1 - \frac{d_1}{\sigma\sqrt{\tau}}\right)$$

k) Formulas for Charm Greek
$$\mathrm{Charm}_{CO} = d\,e^{-d\tau}N(d_1) - e^{-d\tau}\varphi(d_1)\,\frac{2(r-d)\tau - d_2\,\sigma\sqrt{\tau}}{2\tau\,\sigma\sqrt{\tau}}$$
$$\mathrm{Charm}_{PO} = -d\,e^{-d\tau}N(-d_1) - e^{-d\tau}\varphi(d_1)\,\frac{2(r-d)\tau - d_2\,\sigma\sqrt{\tau}}{2\tau\,\sigma\sqrt{\tau}}$$

l) Formulas for DvegaDtime Greek
$$\mathrm{DvegaDtime}_{CO,PO} = S\,e^{-d\tau}\varphi(d_1)\sqrt{\tau}\left(d + \frac{(r-d)\,d_1}{\sigma\sqrt{\tau}} - \frac{1 + d_1 d_2}{2\tau}\right) = \nu\left(d + \frac{(r-d)\,d_1}{\sigma\sqrt{\tau}} - \frac{1 + d_1 d_2}{2\tau}\right)$$

m) Formulas for Speed Greek
$$\mathrm{Speed}_{CO,PO} = -\frac{e^{-d\tau}\varphi(d_1)}{S^2\,\sigma\sqrt{\tau}}\left(\frac{d_1}{\sigma\sqrt{\tau}} + 1\right) = -\frac{\Gamma}{S}\left(\frac{d_1}{\sigma\sqrt{\tau}} + 1\right)$$

n) Formulas for Zomma Greek
$$\mathrm{Zomma}_{CO,PO} = e^{-d\tau}\,\frac{\varphi(d_1)\,(d_1 d_2 - 1)}{S\,\sigma^2\sqrt{\tau}} = \Gamma\,\frac{d_1 d_2 - 1}{\sigma}$$

o) Formulas for Color Greek
$$\mathrm{Color}_{CO,PO} = -e^{-d\tau}\,\frac{\varphi(d_1)}{2S\tau\,\sigma\sqrt{\tau}}\left(2d\tau + 1 + \frac{2(r-d)\tau - d_2\,\sigma\sqrt{\tau}}{\sigma\sqrt{\tau}}\,d_1\right) = -\frac{\Gamma}{2\tau}\left(2d\tau + 1 + \frac{2(r-d)\tau - d_2\,\sigma\sqrt{\tau}}{\sigma\sqrt{\tau}}\,d_1\right)$$

p) Formulas for Ultima Greek
$$\mathrm{Ultima}_{CO,PO} = -\frac{S\,e^{-d\tau}\varphi(d_1)\sqrt{\tau}}{\sigma^2}\left(d_1 d_2\,(1 - d_1 d_2) + d_1^2 + d_2^2\right) = -\frac{\nu}{\sigma^2}\left(d_1 d_2\,(1 - d_1 d_2) + d_1^2 + d_2^2\right)$$
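The closed-form expressions above can be cross-checked against numerical derivatives of the value function. The sketch below (names and parameter values are ours) computes the analytic Delta, Gamma and Vega of a call, formulas a), f) and c), and compares them with central finite differences of the Black-Scholes price.

```python
import math
from statistics import NormalDist

N, phi = NormalDist().cdf, NormalDist().pdf

def d1_d2(S, X, r, d, sigma, tau):
    d1 = (math.log(S / X) + (r - d + sigma**2 / 2) * tau) / (sigma * math.sqrt(tau))
    return d1, d1 - sigma * math.sqrt(tau)

def call(S, X, r, d, sigma, tau):
    d1, d2 = d1_d2(S, X, r, d, sigma, tau)
    return S * math.exp(-d * tau) * N(d1) - X * math.exp(-r * tau) * N(d2)

def greeks(S, X, r, d, sigma, tau):
    """Analytic Delta, Gamma and Vega of a call (formulas a, f, c above)."""
    d1, _ = d1_d2(S, X, r, d, sigma, tau)
    delta = math.exp(-d * tau) * N(d1)
    gamma = math.exp(-d * tau) * phi(d1) / (S * sigma * math.sqrt(tau))
    vega = S * math.exp(-d * tau) * phi(d1) * math.sqrt(tau)
    return delta, gamma, vega

S, X, r, d, sigma, tau = 100.0, 100.0, 0.05, 0.02, 0.2, 1.0
delta, gamma, vega = greeks(S, X, r, d, sigma, tau)

# finite-difference counterparts
h = 1e-3
delta_fd = (call(S + h, X, r, d, sigma, tau) - call(S - h, X, r, d, sigma, tau)) / (2 * h)
gamma_fd = (call(S + h, X, r, d, sigma, tau) - 2 * call(S, X, r, d, sigma, tau)
            + call(S - h, X, r, d, sigma, tau)) / h**2
vega_fd = (call(S, X, r, d, sigma + h, tau) - call(S, X, r, d, sigma - h, tau)) / (2 * h)
```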

3.2.7. Needful Statistical and Probability Relations for Deduction of Greeks Formulas

a) Value Function

$$C(\tau) = S\,e^{-d\tau} N(d_1) - X\,e^{-r\tau} N(d_2), \qquad P(\tau) = X\,e^{-r\tau} N(-d_2) - S\,e^{-d\tau} N(-d_1)$$

$$d_1 = \frac{\ln\frac{S}{X} + \left(r - d + \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}}, \qquad d_2 = \frac{\ln\frac{S}{X} + \left(r - d - \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}} = d_1 - \sigma\sqrt{\tau}$$

b) Standard Normal Probability Densities

$$\varphi(d_1) = \frac{1}{\sqrt{2\pi}}\,e^{-d_1^2/2}, \qquad \varphi(d_2) = \frac{1}{\sqrt{2\pi}}\,e^{-d_2^2/2}$$

$$S\,e^{-d\tau}\,\varphi(d_1) = X\,e^{-r\tau}\,\varphi(d_2), \qquad \varphi(d_2) = \frac{S}{X}\,e^{(r-d)\tau}\,\varphi(d_1)$$

c) Standard Normal Distribution Functions

$$N(d_1) = \int_{-\infty}^{d_1} \varphi(t)\,dt, \qquad N(d_2) = \int_{-\infty}^{d_2} \varphi(t)\,dt$$

$$N(d_1) + N(-d_1) = 1, \qquad N(d_2) + N(-d_2) = 1$$

$$\frac{\partial N(d_1)}{\partial d_1} = \varphi(d_1), \qquad \frac{\partial N(d_2)}{\partial d_2} = \varphi(d_2)$$
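The key identity S e^{-d tau} phi(d1) = X e^{-r tau} phi(d2) of part b), used repeatedly when deriving the Greeks formulas, can be verified numerically; a minimal sketch (function name and inputs are ours):

```python
import math
from statistics import NormalDist

phi = NormalDist().pdf   # standard normal density

def density_identity_sides(S, X, r, d, sigma, tau):
    """Return both sides of S*exp(-d*tau)*phi(d1) == X*exp(-r*tau)*phi(d2)."""
    d1 = (math.log(S / X) + (r - d + sigma**2 / 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    lhs = S * math.exp(-d * tau) * phi(d1)
    rhs = X * math.exp(-r * tau) * phi(d2)
    return lhs, rhs

lhs, rhs = density_identity_sides(105.0, 100.0, 0.03, 0.01, 0.25, 0.5)
```

The two sides agree to machine precision for any admissible inputs, which is why Vega has the two equivalent forms given in formula c) of 3.2.6.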


3.2.8. Conclusion, References

The results of the explanation:

- Description of Value Function as Fair Price
- Description of Greeks of First Order
- Description of Greeks of Second Order
- Description of Greeks of Third Order
- Names and Indications of Greeks
- Survey of Formulas for Greeks Calculation
- Survey of Needful Relations for Greeks Calculation


References

- Keim,D.A. (2002)
Information Visualization and Visual Data Mining.
IEEE Transactions on Visualization and Computer Graphics, Vol.7, No.1, January-March 2002

- Záškodný,P., Tarábek,P. (2010-2011)
Data Mining Tools in Statistics Education
In: Tarábek,P., Záškodný,P. (2010-2011), Educational and Didactic Communication 2010
Bratislava, Slovak Republic: Didaktis, ISBN 978-80-89160-78-5
www.didaktis.sk

- Záškodný,P., Havlíček,I., Budinský,P. (2010-2011)
Partial Data Mining Tools in Statistics Education in Greeks and Option Hedging
In: Tarábek,P., Záškodný,P. (2010-2011), Educational and Didactic Communication 2010
Bratislava, Slovak Republic: Didaktis, ISBN 978-80-89160-78-5
www.didaktis.sk


3.3. Data Mining Tools in Statistics Education


3.3.1. Introduction

In the introduction of chapter 3.3. quotations showing the importance of educational
data mining are presented. These quotations, i) to vi), are selected according to
C.Romero, S.Ventura (2006) (In: Tarábek,P., Záškodný,P. (2009) Educational and Didactic
Communication 2009, Bratislava, Slovak Republic: Didaktis, www.didaktis.sk,
ISBN 978-80-89160-69-3).

i) Currently there is an increasing interest in data mining and educational systems (well-known
learning content management systems, adaptive and intelligent web-based educational systems),
making educational data mining a new and growing research community

ii) After preprocessing the available data in each case, data mining techniques can be applied in
educational systems: statistics and visualization, clustering, classification and detection, association
rule mining and pattern mining, text mining

iii) Data mining oriented towards students - to show recommendations and to support use,
interaction, participation and communication by students within educational systems

iv) Data mining oriented towards educators (and academically responsible administrators) - to
show discovered knowledge and to support design, planning, building and maintenance by educators
(administrators) within educational systems

v) Data mining tools provide mining algorithms, filtering and visualization techniques. Examples
of data mining tools:
- Tool name: Mining tool, Authors: Zaïane and Luo (2001), Mining task: Association and patterns
- Tool name: Multistar, Authors: Silva and Vieira (2002), Mining task: Association and classification
- Tool name: Synergo/ColAT, Authors: Avouris et al. (2005), Mining task: Visualization

vi) Future research lines in educational data mining:
- Mining tools that further facilitate the application of data mining by educators or non-expert users
- Standardization of data and methods (preprocessing, discovering, postprocessing)
- Integration with the e-learning system
- Specific data mining techniques

The main principle of chapter 3.3.:
Data Mining in Statistics Education (DMSTE) as Problem Solving

The main goal of chapter 3.3.:
Delimitation of Complex Tool and Partial Tool of DMSTE

The procedure of chapter 3.3.:
- Data Preprocessing in Statistics Education
- Data Processing in Statistics Education
- Complex Tool of DMSTE Curricular Process (CP-DMSTE)
- Partial Tool of DMSTE Analytical Synthetic Modelling (ASM-DMSTE)
- Application of CP-DMSTE and ASM-DMSTE
- Supplement describing the principles of data mining approach

86

The results of chapter 3.3.:
1. Educational Communication of Statistics as Result of Data Preprocessing
2. Educational Communication of Statistics as Five Transformations T1-T5 of Knowledge
from Statistics to Mind of Educant
3. Curricular Process of Statistics as Result of Data Processing
4. Curricular Process of Statistics as Structuring, Algorithm Development and Formalization
of Educational Communication of Statistics
5. Curricular Process as Succession of Five Transformations T1-T5 of Curriculum Variant
Forms
6. Curriculum Variant Forms as Forms of Education Content Existence
7. Formalization of Curriculum Variant Form (Four Universal Structural Elements: Sense
and Interpretation, Set of Objectives, Conceptual Knowledge System, Factor of Following
Transformation)
8. Variant Forms of Curriculum - Conceptual Curriculum (Communicable Scientific System
of Statistics), Intended Curriculum (Educational System of Statistics), Projected
Curriculum (Instructional Project of Statistics and Its Textbook), Implemented
Curriculum-1 (Preparedness of Educator for Education), Implemented Curriculum-2
(Results of Education in Mind of Educant), Attained Curriculum (Applicable Results of
Education)
9. Curricular Process as CP-DMSTE (Structuring, Algorithm Development and
Formalization of Five Transformations Succession T1-T5)
10. Analytical Synthetic Modeling as ASM-DMSTE (Modeling Inputs and Outputs of
Transformations T1-T5)
11. Analytical Synthetic Models as Results of Problems Solving (Real or Mediated Problems)
12. Application of CP-DMSTE and ASM-DMSTE (Visualia of Conceptual Curriculum in
Area of Statistics with Concrete Basic Statistical Set, Need of Visualiae of All Curriculum
Variant Forms as Application of CP-DMSTE)

3.3.2. Data Mining (see also Supplement of chapter 3.3.)

Data Mining - an analytical synthetic way of extracting hidden and potentially useful information
from large data files (data-information-knowledge continuum, knowledge discovery)
Data Mining Techniques - the system functions of the structure of formerly hidden relations and patterns
(e.g. classification, association, clustering, prediction)
Data Mining Tool - a concrete procedure for reaching the intended system functions
Complex Tool - a resolution of a complex problem of the relevant science branch
Partial Tool - a resolution of a partial problem of the relevant science branch (e.g. analytical synthetic
modeling, needed mathematical or statistical procedures)
Result of Data Mining - a result of data mining tool application
Representation of Data Mining Result - a description of what is expressed
Visualization of Data Mining Result - optical retrieval of the data mining result
Data Mining Cycle - Data Definition, Data Gathering, Data Preprocessing, Data Processing,
Discovering Knowledge or Patterns, Representation and Visualization of Results

See P.Tarabek, P.Zaskodny, V.Pavlat, P.Prochazka, V.Novak, J.Skrabankova (2009-2010,
2009-2010abcde and quoted sources).
Quoted sources in 2009-2010abcde:
E.g. American Library Association, M.C.Borba, E.M.Villarreal, G.M.Bowen, W-M Roth, C.Brunk,
J.Kelly, R.Kohavi, Mineset, B.V.Carolan, G.Natriello, N.Delavari, M.R.Beikzadeh, S.Phon-
Amnuaisuk, U-D Ehlers, J.M.Pawlowski, U.M.Fayyad, G.Piatetsky-Shapiro, P.Smyth, J.Fox, D.Gabel,
J.K.Gilbert, O.de Jong, R.Justi, D.F.Treagust, J.H.Van Driel, M.Reiner, M.Nakhleh, W.Hämäläinen,
T.H.Laine, E.Sutinen, M.Hesse, A.H.Johnstone, M.J.Kearns, U.V.Vazirani, D.A.Keim, R.Kwan,
R.Fox, F.T.Chan, P.Tsang, Le Jun, J.Luan, J.Manak, National Research Council (NRC), R.Newburgh,
I.Nonaka, H.Takeuchi, C.J.Petroselli, E.F.Redish, D.Reisberg, C.Romero, S.Ventura, N.Rubenking,
R.E.Scherr, M.Sabella, D.A.Simovici, C.Djeraba, V.Spousta, L.Talavera, E.Gaudioso, E.R.Tufte,
J.Tuminaro, R.Vilalta, C.Giraud-Carrier, P.Brazdil, C.Soares, D.M.Wolpert.

3.3.3. Data Preprocessing in Statistics Education

Result of Data Preprocessing - Educational Communication of Statistics as
a succession of transformations of education content forms (taken over from physics education):

- The transformation T1 is transformation of scientific system of statistics to communicable
scientific system of statistics (the first form of education content existence),

- The transformation T2 is transformation of communicable scientific system of statistics to
educational system of statistics (the second form of education content existence),

- The transformation T3 is transformation of educational system of statistics to both instructional
project of statistics and preparedness of educator to education (the third and fourth forms of education
content existence),

- The transformation T4 is transformation of both instructional project of statistics and preparedness
of educator to results of education (the fifth form of education content existence),

- The transformation T5 is transformation of results of statistics education to applicable results of
statistics education (the sixth form of education content existence)

See J.Brockmeyer (1982), P.Zaskodny a kol. (2004, 2007), P.Tarabek, P.Zaskodny (2001, 2007-
2008abc, 2008-2009, 2009-2010), P.Zaskodny (2001, 2006, 2009).

3.3.4. Data Processing in Statistics Education

Result of Data Processing - Curricular Process of Statistics as a succession of transformations
of algorithmized and formalized education content forms (taken over from physics education):

i. The form of education content existence - variant form of curriculum

ii. The curriculum - education content (see Prucha, 2005)

iii. The variant forms of curriculum have got the universal structure (four structural elements -
sense and interpretation, set of objectives, conceptual knowledge system, factor of following
transformation)

iv. The variant forms of curriculum were selected on the basis of fusion of Anglo-American
curricular tradition and European didactic tradition

v. The curricular process is defined as the succession of transformations T1-T5 of curriculum
variant forms:

conceptual curriculum (output of T1, the first variant form of curriculum) - the communicable
scientific system

intended curriculum (output of T2, the second variant form of curriculum) - the educational
system of statistics

projected curriculum (output of T3, the third variant form of curriculum) - the instructional project
of statistics

implemented curriculum-1 (output of T3, the fourth variant form of curriculum) - the preparedness
of educator to education

implemented curriculum-2 (output of T4, the fifth variant form of curriculum) - the results of
education

attained curriculum (output of T5, the sixth variant form of curriculum) - applicable results of
education

See P.Prochazka, P.Zaskodny (2009-2010c).
Quoted sources in 2009-2010c:
E.g. A.V.Kelly, M.K.Smith, W.Doyle, M.Pasch, A.M.Sochor, V.V.Krajevskij, I.J.Lerner, J.McVittie,
K.Carter, G.M.Blenkin, L.Stenhouse, E.Newman, G.Ingram, F.Bobitt, R.W.Tyler, H.Taba,
C.Cornblet, S.Grundy, D.Lawton, P.Gordon, M.Certon, M.Gayle, G.J.Posner.

3.3.5. Complex and Partial Tool of DMSTE CP-DMSTE, ASM-DMSTE

Complex tool of DMSTE is given by curricular process of statistics (CP-DMSTE). CP-
DMSTE delimits the correct education content via succession of transformations T1-T5.

Partial tool of DMSTE is given by analytical synthetic modeling (ASM-DMSTE).
ASM-DMSTE describes the mediated or real problem solving within the inputs and outputs of
individual transformations T1-T5. In this paper, the description of ASM-DMSTE is realized
by means of both visualia Vis.1 and Legend to Vis.1.

Legend to Vis.1

a (Identified Complex Problem) - investigated area of reality, investigated phenomenon

B_k (Analysis) - analytical segmentation of the complex problem into partial problems

b_k (Partial Problems PP-k) - result of analysis: essential attributes and features
of the investigated phenomenon

C_k (Abstraction) - delimitation of the essences of the partial problems by abstraction, with the goal
of acquiring the partial solutions

c_k (Partial Solutions PS-k) - result of abstraction: partial concepts, partial pieces of
knowledge, various relations, etc.

D_k (Synthesis) - synthetic finding of dependences among the results of abstraction

d_k (Partial Conclusions PC-k) - result of synthesis: principle, law, dependence, continuity

E_k (Intellectual Reconstruction) - intellectual reconstruction of the investigated phenomenon /
investigated area of reality

e (Total Solution of Complex Problem a) - result of intellectual reconstruction:
analytical synthetic structure of final knowledge (conceptual knowledge system)




Vis.1 General Analytical Synthetic Model of Problem Solving



(Diagram: the identified complex problem "a" is segmented by analysis B_k into partial problems
b_1, b_2, …, b_k (PP-1 … PP-k); abstraction C_k turns these into partial solutions c_1, …, c_k
(PS-1 … PS-k); synthesis D_k yields partial conclusions d_1, …, d_k (PC-1 … PC-k); intellectual
reconstruction E_k forms the total solution e of complex problem "a" by means of PC-1, PC-2, …, PC-k.)

Application of Partial Tool ASM-DMSTE



The application of ASM-DMSTE is the visualia Vis.2 from the area of statistics education.
The visualia Vis.2 is an analytical synthetic model of statistics with a concrete basic statistical set. This
visualia constitutes a part of the statistics conceptual curriculum as a part of the communicable scientific
system of statistics (a part of the output of transformation T1).
The visualized result Vis.2 of data mining in statistics education constitutes a paramorphic
model and hypertextual representation; it represents the external conceptual knowledge systems as an
external representation of general social experience. The visualized result also represents a concrete
type of data file - the representation of statistics with a concrete basic statistical set.







Vis.2: Analytical synthetic model of statistics formed by four partial models
a-1 - e-1, a-2 - e-2, a-3 - e-3, a-4 - e-4
(a part of the conceptual curriculum of statistics, a part of the communicable scientific system
of statistics, output of transformation T1)






















(Diagram - four chained partial models:

a-1: Collective random phenomenon and the reason for its investigation
→ Statistical unit, Statistical sign, Variants (values) of statistical sign, Choice of statistical units,
Creating of scale, Measurement
e-1 = a-2: Selective statistical set (SSS) as a part of the basic statistical set; Goals of statistical examination
→ Frequencies tables (Empirical distribution), Graphical expression, Empirical parameters
e-2 = a-3: Empirical picture of the selective statistical set; Necessity of probability investigation
→ Choice of acceptable theoretical distribution, Quantification of theoretical parameters,
Comparison of theoretical and empirical parameters, Testing of non-parametric hypotheses,
Point & interval estimation (e.g. confidence interval), Testing of parametric hypotheses
e-3 = a-4: Empirical & probability picture of the selective statistical set; Necessity of association investigation
→ Statistical dependence (causal, non-causal), Regression analysis, Correlation analysis
e-4: Empirical & probability & association picture of the selective statistical set; Interpretation and
conclusions as the statistical & probability dimension of investigating the collective random phenomenon;
Applied statistics (e.g. financial options and their mathematical and statistical elaboration by means
of Greeks calculation and option hedging models).)


LEGEND to the whole visualia Vis.2

One Sample Analysis, Two / Multiple Sample Analysis

LEGEND to the partial models a-1 → e-1, a-2 → e-2, a-3 → e-3, a-4 → e-4 of visualia Vis.2

a-1 → e-1:
Formulation of statistical examination

a-2 → e-2:
Relative & cumulative frequencies (Empirical distribution)
Plotting functions: e.g. plot of frequency polygon (Graphical expression)
Average - means, variance - standard deviation, obliqueness (skewness), pointedness
(kurtosis) (Empirical parameters)

a-3 → e-3:
Theoretical distributions (partial survey in alphabetical order):
Bernoulli, Beta, Binomial, Chi-square, Discrete uniform, Erlang, Exponential, F, Gamma,
Geometric, Lognormal, Negative binomial, Normal, Poisson, Student's, Triangular,
Trinomial, Uniform, Weibull
Testing of non-parametric hypotheses (hypothesis test for H_0 - receive or reject H_0):
e.g. computed Wilcoxon's test, Kolmogorov-Smirnov test, Chi-square test, e.g. at alpha = 0.05
Point & interval estimation:
e.g. confidence interval for the mean, confidence interval for the standard deviation
Testing of parametric hypotheses (hypothesis test for H_0 - receive or reject H_0):
e.g. computed u-statistic, t-statistic, F-statistic, Chi-square statistic, Cochran's test, Bartlett's
test, Hartley's test, e.g. at alpha = 0.05

a-4 → e-4:
Statistical dependence:
e.g. confidence interval for difference in means (equal variances, unequal variances),
confidence interval for ratio of variances
Regression analysis: simple - multiple, linear - non-linear
Correlation analysis: e.g. rank correlation coefficient, Pearson correlation coefficient
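As an illustration of the a-2 → e-2 and a-3 → e-3 steps of the legend, the following sketch (the sample data and all names are ours) computes empirical parameters of a small sample and a normal-approximation confidence interval for the mean:

```python
import math
from statistics import NormalDist, mean, stdev

sample = [4.1, 5.0, 3.8, 4.6, 5.2, 4.9, 4.4, 5.1, 4.7, 4.3]

m = mean(sample)    # empirical average
s = stdev(sample)   # empirical standard deviation (sample form)
n = len(sample)

# 95 % normal-approximation confidence interval for the mean
z = NormalDist().inv_cdf(0.975)
half_width = z * s / math.sqrt(n)
ci = (m - half_width, m + half_width)
```

For small samples one would normally replace the normal quantile z with the corresponding Student's t quantile, in line with the t-statistic mentioned under the parametric tests.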

3.3.6. Conclusion, References

Modeling as a partial tool of data mining - quotation according to J.K.Gilbert (2008)
(In: Tarábek,P., Záškodný,P. (2009) Educational and Didactic Communication 2009,
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 978-80-89160-69-3):

In a nightmare world, we would perceive the world around us as being continuous and
without structure. However, our survival as a species has been possible because we have
evolved the ability to cut up that world mentally into chunks about which we can think and
hence give meaning to.
This process of chunking, a part of all cognition, is modelling, and the products of the
mental actions that have taken place are models. Science, being concerned with the provision
of explanations about the natural world, places an especial reliance on the generation and
testing of models.


References

1. Used Publications

i. Brockmeyerová,J. (1982) Introduction into Theory and Methodology of Physics Education. Prague, Czech
Republic: SPN

ii. CSRG (2009). Curriculum Studies Research Group.
České Budějovice: University of South Bohemia, Czech Republic, http://sites.google.com/site/csrggroup/

iii. Gilbert,J.K. (2008) Visualization: An Emergent Field of Practice and Enquiry. In: Visualization: Theory and Practice
in Science (Models and Modeling in Science Education). New York: Springer Science + Business Media

iv. Keim,D.A. (2002) Information Visualization and Visual Data Mining. IEEE Transactions on Visualization
and Computer Graphics, Vol.7, No.1, January-March 2002

v. Průcha,J. (2005) Moderní pedagogika (Modern Educational Science), Prague, Czech Republic: Portál


2. Used Papers, Monographs, and Books of Author (2001-2010)

i. Tarábek,P., Záškodný,P. (2001)
Structural Textbook and Its Creation.
Bratislava, Slovak Republic: Didaktis, ISBN 80-85456-76-1

ii. Záškodný,P. (2001)
Statistical Dimension of Scientific Research.
KONTAKT, 2, 5, 2001, ISSN 1212-4117

iii. Tarábek,P., Záškodný,P. (2007-2008a)
Educational and Didactic Communication 2007, Vol.1 - Theory.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 987-80-89160-56-3

iv. Tarábek,P., Záškodný,P. (2007-2008b)
Educational and Didactic Communication 2007, Vol.2 - Methods.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 987-80-89160-56-3

v. Tarábek,P., Záškodný,P. (2007-2008c)
Educational and Didactic Communication 2007, Vol.3 - Applications.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 987-80-89160-56-3

vi. Tarábek,P., Záškodný,P. (2008-2009)
Educational and Didactic Communication 2008.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 978-80-89160-62-4

vii. Tarábek,P., Záškodný,P. (2009-2010)
Educational and Didactic Communication 2009.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 978-80-89160-69-3

viii. Záškodný,P. a kol. (2004)
Základy zdravotnické statistiky.
České Budějovice, Czech Republic: South Bohemia University, ISBN 80-7040-663-1

ix. Záškodný,P. (2006)
Survey of Principles of Theoretical Physics (with Application to Radiology)
(in English). Lucerne, Switzerland; Ostrava, Czech Republic: Avenira, Algoritmus, ISBN 80-902491-9-1

x. Záškodný,P. a kol. (2007)
Základy ekonomické statistiky.
Prague, Czech Republic: Institute of Finance and Administration, ISBN 80-86754-00-6

xi. Záškodný,P. (2009)
Curricular Process of Physics (with Survey of Principles of Theoretical Physics)
(in Czech). Lucerne, Switzerland; Ostrava, Czech Republic: Avenira, Algoritmus, ISBN 978-80-902491-0-3

xii. Záškodný,P. (2009-2010)
Data Mining Tools in Science Education (in: vii.)

xiii. Záškodný,P., Pavlát,V. (2009-2010a)
Data Mining - A Brief Recherche (in: vii.)

xiv. Záškodný,P., Novák,V. (2009-2010b)
Data Mining - A Brief Summary (in: vii.)

xv. Záškodný,P., Procházka,P. (2009-2010c)
Collective Scheme of Both Educational Communication and Curricular Process (in: vii.)

xvi. Záškodný,P., Škrabánková,J. (2009-2010d)
Modelling and Visualization of Problem Solving (in: vii.)

xvii. Záškodný,P. (2009-2010e)
Representation of Results of Data Mining (in: vii.)


3.3.7. Supplement of Chapter 3.3. The Principles of Data Mining Approach

3.3.7.1. Quotations from Sources

i) Definitions of Data Mining

J.Luan (2002)

Definition of Data Mining
a) Data Mining is the process of discovering meaningful new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories and by using pattern recognition
technologies as well as statistical and mathematical techniques
b) The notion of Data Mining for higher education: Data Mining is a process of uncovering hidden
trends and patterns that lend themselves to predictive modeling, using a combination of an explicit knowledge
base, sophisticated analytical skills and academic domain knowledge

N.Rubenking (2001)

Definition of Data Mining
Data Mining is the process of automatically extracting useful information and relationships from
immense quantities of data. In its purest form, Data Mining doesn't involve looking for specific
information. Rather than starting from a question or a hypothesis, Data Mining simply finds patterns
that are already present in the data.

R.Kohavi (2000)

Definition of Data Mining as Knowledge Discovery
Data Mining (or Knowledge Discovery) is the process of identifying new patterns and insights in data

Interpretation of Data Mining
As the volume of data collected and stored in databases grows, there is a growing need to provide data
summarization, identify important patterns and trends, and act upon findings

Le Jun (2008)

Definition of Data Mining as New Technology
Data Mining is extraction of hidden predictive information from large databases. Data Mining is
a powerful new technology with great potential to help a scientific area focus on the most important
information in its data

N.Delavari, M.R.Beikzadeh, S.Phon-Amnuaisuk (2005)

Definition of Data Mining
Searched knowledge (meaningful knowledge, previously unknown and potentially useful information
discovered) is hidden among the raw educational data set and it is extractable through Data Mining

R.Kwan, R.Fox, F.T.Chan, P.Tsang (2008), Le Jun (2008)

Data, Information, Knowledge
Data, Information, Knowledge are different terms, which differentiate in means and values.
a) Data is a collection of facts and quantitative measures, which exists outside of any context from
which conclusions can be drawn.
b) Information is data that people interpret and place in a meaningful context, highlighting patterns
and causes of relationships in the data.

c) Knowledge is the understanding that humans develop as a reaction to and use of information, either
individually or as an organization.

Data-Information-Knowledge Continuum
a) Data, information and knowledge are separated but linked concepts which can form a data-
information-knowledge continuum.
b) Data becomes information when people place it in context through interpretation that seeks to
highlight patterns.
c) Knowledge can be described as a belief that is justified through discussion, experience and perhaps
action. It can be shared with others by exchanging information in appropriate contexts.

ii) Data Mining and Problem Solving

L.Talavera, E.Gaudioso (2002)

Data Mining as Analysis Problem
In this paper we propose to shape the analysis problem as a data mining problem.

J .Tuminaro, E.F.Redish (2005), E.F.Redish (2005)

Problem solving
Problem solving and the use of math in physics courses
Student Use of Math in the Context of Physics Problem Solving: A Cognitive Model

M.C.Borba, E.M.Villarreal (2005)

Problem solving
Problem solving as context
Problem solving as skill
Problem solving as art

Process of modeling, process of problem solving
The process of modeling or model building is a part of the process of problem solving

Steps of problem solving process (process of problem solving as entailing several steps):

The starting point is a real problematic situation
The first step is to create a real model, making simplifications, idealizations, establishing conditions
and assumptions, but respecting original situation
In the second step, the real model is mathematized, to get a mathematical model
The third step implies the selection of suitable mathematical methods and working within
mathematics in order to get some mathematical results
In the fourth step, these results are interpreted for and translated into the real situation

iii) Forms of Data Mining, Data Mining System, Goals of Data Mining, Scope of
Data Mining

R.Kohavi (2000)

Forms of Data Mining (Structured mining etc.)
Structured mining, Text mining, Information retrieval




W.Hämäläinen, T.H.Laine, E.Sutinen (2003)

Data Mining system, educational system
Data Mining system in educational system: the educational system should be served by Data Mining
system to monitor, intervene in, and counsel the teaching-studying-learning process

R.Kohavi (2000)

Goals of Data Mining
Data Mining serves two goals:
-Insight: Identified patterns and trends are comprehensible
-Prediction: A model is built that predicts (scores) based on input data. Prediction as classification
(discrete variable) or as regression (continuous variable)

Scope of Data Mining
The majority of research in DM has concentrated on building the best models for prediction.
A learning algorithm is given the training set and produces a model that can map new unseen data into
the prediction.

iv) Results of Data Mining, Applications of Data Minings, Interdisciplinarity of Data
Mining

R.Kohavi (2000), D.M.Wolpert (1994), M.J.Kearns, U.V.Vazirani (1994)

Some theoretical results in Data Mining
- No free lunch (if all concepts are equally likely, then learning is impossible)
- Consistency (non-parametric models converge to the target concept given enough data; parametric
models such as linear regression are known to be of limited power) - enough data = consistency
- PAC learning (probably approximately correct learning) is a concept introduced to provide
guarantees about learning
- Bias-Variance decomposition
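The bias-variance decomposition listed above can be checked numerically. The sketch below is only an illustration: it repeatedly "trains" a crude constant predictor (the sample mean) on fresh noisy data and verifies the exact identity MSE = bias^2 + variance for its predictions. All names and constants are invented for the demonstration:

```python
import random
from statistics import mean, pvariance

random.seed(0)
TRUE_VALUE = 3.0   # noiseless target we are trying to predict
N_DATASETS = 2000  # number of independent training sets

# For each training set, "train" a constant-predictor model:
# it predicts the mean of the noisy observations it saw.
predictions = []
for _ in range(N_DATASETS):
    sample = [TRUE_VALUE + random.gauss(0, 1) for _ in range(10)]
    predictions.append(mean(sample))

mse = mean((p - TRUE_VALUE) ** 2 for p in predictions)
bias_sq = (mean(predictions) - TRUE_VALUE) ** 2
variance = pvariance(predictions)

# Exact algebraic identity: MSE = bias^2 + variance
assert abs(mse - (bias_sq + variance)) < 1e-9
print(f"MSE={mse:.4f}  bias^2={bias_sq:.6f}  variance={variance:.4f}")
```

Because the three quantities are computed from the same set of predictions, the decomposition holds exactly, not only in expectation.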

U.M.Fayyad, G.Piatetsky-Shapiro, P.Smyth (1996)

Interdisciplinarity of Data Mining
Data Mining, sometimes referred to as Knowledge Discovery, is at the intersection of multiple
research areas, including machine learning, statistics, pattern recognition, databases and visualization

J.Luan (2002)

Potential applications of Data Mining
There are several ways to examine the potential applications of Data Mining
a) One is to start with the functions of the algorithms to reason what they can be utilized for
b) Another is to examine the attributes of a specific area where data are rich, but mining activities are
scarce
c) And another is to examine the different functions of a specific area to identify the needs that can
translate themselves into Data Mining projects

Notes: a) - See Curricular Process as Data Mining Algorithm
b) - See Curriculum: Theory and Practice as a scientific area in which mining activities are
scarce
c) - Some of the most likely places where data miners (educational researchers who wear
this hat) may initiate Data Mining projects are: Variant Forms of Curriculum


v) Data Mining techniques
N.Delavari, M.R.Beikzadeh, S.Phon-Amnuaisuk (2005)

Data Mining techniques
DM techniques can be used to extract unknown patterns from a set of data and discover useful
knowledge. It results in extracting greater value from the raw data set, and making use of strategic
resources efficiently and effectively.

J.Luan (2001)

Data Mining techniques as Data Mining functions
Prediction, clustering, classification, association
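Of the four functions listed, association is perhaps the easiest to make concrete. The sketch below computes the standard support and confidence measures of a rule A -> B; the transactions are invented course-enrolment data, not an example from the quoted sources:

```python
# Minimal association measure: support and confidence of a rule A -> B
# over a set of transactions (each transaction is a set of items).

def support(transactions, itemset):
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

transactions = [
    {"algebra", "statistics"},
    {"algebra", "statistics", "programming"},
    {"statistics", "programming"},
    {"algebra"},
]

# Rule: students taking algebra also take statistics
print(support(transactions, {"algebra", "statistics"}))      # -> 0.5 (2 of 4 transactions)
print(confidence(transactions, {"algebra"}, {"statistics"})) # -> 2/3
```

An association rule mining algorithm such as Apriori essentially searches for all rules whose support and confidence exceed user-chosen thresholds.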

Le Jun (2008)

Data Mining techniques - application of Data Mining tools
Application of DM tools: To solve the tasks of prediction, classification, explicit modeling and
clustering. The application can help understand learners' learning behaviors

C.Romero, S.Ventura (2006)

Data Mining techniques in educational systems
After preprocessing the available data in each case, Data Mining techniques can be applied in
educational systems: statistics and visualization, clustering, classification and outlier detection,
association rule mining and pattern mining, text mining

J.Luan (2002)

Clustering and prediction - the most striking aspects of Data Mining techniques
- The clustering aspect of Data Mining offers comprehensive characteristics analysis of investigated
area
- The predicting function estimates the likelihood for a variety of outcomes

B.V.Carolan, G.Natriello (2001)

Clustering
Data-Mining resources to identify structural attributes of the educational research community - e.g.
clustering as collaboration of physicists and biologists

D.A.Simovici, C.Djeraba (2008)

Clustering, Taxonomy of clustering
a) Clustering is the process of grouping together objects that are similar. The groups formed by
clustering are referred to as clusters.
b) Clustering can be regarded as a special type of classification, where the clusters serve as
classes of objects
c) It is a widely used data mining activity with multiple applications in a variety of scientific activities
from biology and astronomy to economics and sociology
d) Taxonomy of clustering (we follow here the taxonomy of clustering)
- Exclusive or nonexclusive: Clustering may be exclusive or nonexclusive. An exclusive clustering
technique yields clusters that are disjoint; a nonexclusive technique produces overlapping clusters.
- Intrinsic or extrinsic: Clustering may be intrinsic or extrinsic. Intrinsic - based only on
dissimilarities between the objects to be clustered. Extrinsic - information about which objects should
be clustered together and which should not is provided by an external source.
- Hierarchical or partitional: Clustering may be hierarchical or partitional. Hierarchical - in
hierarchical clustering algorithms, a sequence of partitions is constructed. Partitional - a partitional
clustering creates a partition of the set of objects whose blocks are the clusters, such that objects in
a cluster are more similar to each other than to objects that belong to different clusters
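An exclusive, partitional clustering in the taxonomy above is, for instance, classical k-means. The following minimal sketch (pure Python, invented two-dimensional data) yields disjoint clusters, so it is exclusive in the sense just defined:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Exclusive partitional clustering: every point lands in exactly one cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its single nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

points = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
clusters = kmeans(points, k=2)
print(clusters)  # two disjoint groups: one near (0, 0), one near (5, 5)
```

Because each point is assigned to exactly one centroid, the resulting blocks form a partition of the object set, matching the "partitional" definition in the taxonomy.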

vi) Data Mining tools

C.Brunk, J.Kelly, R.Kohavi (1997)

Data Mining tool
MineSet is a Data Mining tool that integrates Data Mining and visualization very tightly. Models
built can be viewed and interacted with.

C.Romero, S.Ventura (2006)

Data Mining tools
Data Mining tools provide mining algorithms, filtering and visualization techniques. The examples
of Data Mining tool:
- Tool name: Mining tool, Authors: Zaïane and Luo (2001), Mining task: Association and patterns
- Tool name: Multistar, Authors: Silva and Vieira (2002), Mining task: Association and classification
- Tool name: Synergo/ColAT, Authors: Avouris et al (2005), Mining task: Visualization

D.A.Simovici, C.Djeraba (2008)

Mathematical tools for Data Mining
a) This book was born from the experience of the authors as researchers and educators, which suggests
that many students of Data Mining are handicapped in their research by the lack of a formal,
systematic education in its mathematics. The book is intended as a reference for the working data
miner.
b) In our opinion, three areas of math are vital for DM:
- set theory, including partially ordered sets and combinatorics,
- linear algebra, with its many applications in principal component analysis and neural networks,
- and probability theory, which plays a foundational role in statistics, machine learning and DM

vii) Modeling, Model

J.K.Gilbert, M.Reiner, M.Nakhleh (2008), J.K.Gilbert (2008), J.K.Gilbert, R.Justi (2002)

Definition of Modelling, Model
We have evolved the ability to cut up that world mentally into chunks about which we can think
and hence give meaning to. This process of chunking (Data Mining - clustering),
a part of all cognition, is modelling, and the products of the mental actions that have taken place are
models

Significance of Modelling, Model
Modelling as an element in scientific methodology and models as the outcome of modelling are both
important aspects of the conduct of science and hence of science education

Categorization of models
a) Historical models (Curriculum models) - learning specific consensus (the P-N junction model of
the transistor). Curriculum models can be used to provide an acceptable explanation of
a wide range of phenomena and specific facts; that's why it is a useful way of reducing, by chunking,
the ever-growing factual load of the science curriculum
b) New qualitative models - developed by following the sequence of learning: To revise an
established model, To construct a model de novo (to reconstruct an established model)
c) New quantitative models - developed by following the sequence of learning: quantitative version
of a useable qualitative model of phenomenon
d) Progress in the scientific enquiry is indicated by the value of a particular combination of
qualitative and quantitative models in making successful predictions about its properties

C.M.Borba, E.M.Villarreal (2005)

Definition of modeling
Modeling can be understood as a pedagogical approach that emphasizes students' choice of
a problem to be investigated in the classroom. Students, therefore, play an active role in curriculum
development instead of being just the recipients of tasks designed by others.

Problem solving
- problem solving as context
- problem solving as skill
- problem solving as art

Process of modeling, process of problem solving
The process of modeling or model building is a part of the process of problem solving.

Steps of problem solving process
Process of problem solving as entailing several steps:
a) The starting point is a real problematic situation
b) The first step is to create a real model, making simplifications, idealizations, establishing
conditions and assumptions, but respecting the original situation
c) In the second step, the real model is mathematized, to get a mathematical model
d) The third step implies the selection of suitable mathematical methods and working within
mathematics in order to get some mathematical results
e) In the fourth step, these results are interpreted for and translated into the real situation

J.K.Gilbert, O.de Jong, R.Justi, D.F.Treagust, J.H.van Driel (2002)

Model as a major learning and teaching tool
Models are one of the main products of science, modelling is an element in scientific methodology,
(and) models are a major learning and teaching tool in science education

Model of Modeling Framework

1. Decide on purpose - Select source for model and Have experience - Produce mental model
2. Produce mental model - Express in mode(s) of representation
3. Express in mode(s) of representation - Conduct thought experiments
4a. Conduct thought experiments (pass) - Design and perform empirical tests
4b. Conduct thought experiments (fail) - Reject mental model (Modify mental model) and back to
Select source for model (negative result)
5a. Design and perform empirical tests (pass) - Fulfill purpose and Consider scope and limitations of
model and back to Decide on purpose (positive result)
5b. Design and perform empirical tests (fail) - Reject mental model (Modify mental model) and back
to Select source for model (negative result)




R.Justi, J.K.Gilbert (2002)

Role of chemistry textbooks in the teaching and learning of models and modelling
This role may be discussed from two main angles:
- the way that chemical models are introduced in textbooks
(note: projected curriculum, a learning model)
- and the teaching models that they present
(note: Implemented curriculum-1, a teaching model)

Teaching model, Learning model, Analogies
A teaching model is a representation produced with the specific aim of helping students to
understand some aspect of content. Assuming the abstract nature of chemical knowledge, they
(learning models) are used very frequently in chemical textbooks mainly in the form of overt
analogies, as drawings and as diagrams (specifically to the atom, chemical bonding and chemical
equilibrium)

Some future research directions
a) How can teachers' pedagogical content knowledge about models and modelling be improved?
b) The role of models and modelling in the development of chemical knowledge?
c) How can it be made evident to teachers that the introduction of a model-based teaching and learning
approach can be a way to shift the emphasis in chemical education from transmission of existing
knowledge to a more contemporary perspective in which students will really understand the
nature of chemistry and be able to deal critically with chemistry-related situations?

viii) Representation (Creativity)

J.K.Gilbert, M.Reiner, M.Nakhleh (2008), J.K.Gilbert (2008)

Levels of Representation
The Representation in Science Education is concerned with challenges that students face in
understanding the three levels at which models can be represented - macro, sub-micro,
symbolic - and the relationships between them.

A.H.Johnstone (1993), D.Gabel (1999)

Representations as distinct representational levels
a) The models produced by science are expressed in three distinct representational levels
b) The macroscopic level - this consists of what is seen in that which is studied
c) The sub-microscopic level - this consists of representations of those entities that are inferred to
underlie the macroscopic level, giving rise to the properties that it displays (molecules and ions are
used to explain the properties of pure solutions, of radiotherapy)
d) The symbolic level (this consists of any qualitative abstractions used to represent each item at the
sub-microscopic level - chemical equations, mathematical equations)

J.K.Gilbert (2008), M.Hesse (1966), G.M.Bowen, W.-M.Roth (2005)

The ontological categorization of representations
a) Two approaches to the ontological categorization of representations are put forward, one based on
the purpose which the representation is intended to serve, the other on the dimensionality -
1D,2D,3D - of the representation.
b) The purpose for which a Model is Produced
- All models are produced by the use of analogy. The target (which is the subject of the model) is
depicted by a partial comparison with a source. The classification is binary: The target and the source
are the same things (they are homomorphs - an aeroplane, a virus), or they are not (they are paramorphs
- paramorphs are used to model processes rather than objects)
c) The dimensionality of the Representation
The idea that modelling involves the progressive reduction of the experienced world to a set of
abstract signs can be set out in terms of dimensions as follows:
- Macro level - Perception of the world-as-experienced - 3D, 2D
- Sub-micro level - Gestures, concrete representations (structural representations) - 3D
- Photographs, virtual representations, diagrams, graphs, data arrays - 2D
- Symbolic level - Symbols and equations - 1D

E.R.Tufte (1983), J.K.Gilbert (2008), D.Reisberg (1997)

External and internal representations, Series of internal representations and creativity
a) Visualization is concerned with External Representation, the systematic and focused public
display of information in the form of pictures, diagrams, tables, and the like
b) Visualization is also concerned with Internal Representation, the mental production, storage and
use of an image that often (but not always) is the result of external representation
c) External and internal representations are linked in that their perception uses similar mental
processes
d) Visualization is thus concerned with the formation of an internal representation from an
external representation. An internal representation must be capable of mental use in the making of
predictions about the behaviour of a phenomenon under specific conditions
e) It is entirely possible that once a series of internal representations have been visualized, that they
are amalgamated/recombined to form a novel internal representation that is capable of external
representation - this is creativity

ix) Visualization

J.K.Gilbert, M.Reiner, M.Nakhleh (2008), J.K.Gilbert (2008)

Definition of Visualization
The making of meaning for any such representation is visualization. Visualization is central to
the production of representations of these models (curriculum models, qualitative and quantitative
models and their combinations).

J.K.Gilbert (2008)

Visualization and Internal Representation
Visualization is also concerned with Internal Representation, the mental production, storage and
use of an image that often (but not always) is the result of external representation.

R.Kohavi (2000)

Essence of Visualization - Data Summarization
As the volume of data collected and stored in databases grows, there is a growing need to provide data
summarization (e.g. through visualization), identify important patterns and trends, and act upon
findings.
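Data summarization through visualization can be illustrated even without graphics. The text histogram below is a minimal sketch of such a summary; the data and the binning rule are invented:

```python
from collections import Counter

def text_histogram(values, bin_width=1.0):
    """Summarize a list of numbers as one bar per bin - a crude visual summary."""
    bins = Counter(int(v // bin_width) for v in values)
    for b in sorted(bins):
        lo = b * bin_width
        print(f"[{lo:4.1f}, {lo + bin_width:4.1f}): {'#' * bins[b]}")

text_histogram([0.2, 0.7, 1.1, 1.5, 1.9, 2.3, 1.4])
# [ 0.0,  1.0): ##
# [ 1.0,  2.0): ####
# [ 2.0,  3.0): #
```

Seven raw numbers collapse into three bars: exactly the kind of compression that makes patterns and trends visible as data volumes grow.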

C.Brunk, J.Kelly, R.Kohavi (1997)

Serviceability of Visualization
One way to aid users in understanding the models is to visualize them.



D.A.Keim (2002)

Serviceability of Visualization
a) Information Visualization techniques may help to solve the problem
b) Data Mining will use Information Visualization technology for an improved data analysis

Application of Visualization
Application of Visualization is Visual Data Exploration

Benefits of Visual Data Exploration
- University of Berkeley - every year 1 Exabyte of data (10^18 bytes, Gigabyte = 10^9 bytes)
- Finding the valuable information hidden in them, however, is a difficult task
- With data presented textually, only a range of some one hundred data items can be displayed
(a drop in the ocean)
- The basic idea of visual data exploration is to present the data in some visual form, allowing the
human to get insight into the data, draw conclusions, and directly interact with the data (to combine
the flexibility, creativity and general knowledge of the human with the enormous storage capacity and
the computational power of today's computers)
- The visual data exploration process can be seen as a hypothesis generation process (coming up with
new hypotheses and the verification of the hypotheses can be done via visual data exploration)
- The main advantages of visual data exploration: Visual data exploration can easily deal with
inhomogeneous and noisy data, visual data exploration is intuitive and requires no understanding of
mathematical and statistical algorithms, and visual data exploration techniques are indispensable in
conjunction with automatic exploration techniques
- Visual data exploration paradigm: overview first, zoom and filter, details-on-demand

x) Metavisualization

N.R.C. (2006)

Metavisualization - spatial thinking
The associated visualization which can be called spatial thinking

J.K.Gilbert, M.Reiner, M.Nakhleh (2008), J.K.Gilbert (2008)

Metavisualization - learning from representations
It is of such importance in science and hence in science education that the acquisition of fluency in
visualization is highly desirable and may be called metavisual capability or metavisualization. A
fluent performance in visualization has been described as requiring metavisualization and involving
the ability to acquire, monitor, integrate, and extend learning from representations. Metavisualization
- learning from representations.

Criteria for Metavisualisation
Four criteria are suggested for attainment of metavisual status. The person concerned must be able to:
a) demonstrate an understanding of the convention of representation for all the modes and sub-
modes of 3D,2D,1D representations (what they can and cannot represent)
b) demonstrate a capacity to translate a given model between the modes and sub-modes in which it can
be depicted
c) demonstrate the capacity to be able to construct a representation within any mode and sub-mode of
dimensionality for a given purpose
d) demonstrate the ability to solve novel problems using a model-based approach

Developing the Skills of Metavisualization
level 1 - representation as depiction
level 2 - early symbolic skills
level 3 - syntactic use of formal representations
level 4 - semantic use of formal representations
level 5 - reflective, rhetorical use of representations

xi) Visual DM techniques

D.A.Keim (2002)

Classification of Visual Data Mining Techniques (abstraction criterion)

- Techniques such as x-y plots, line plots, and histograms, but they are limited to relatively small and
low-dimensional data sets
- Novel information visualization techniques allowing visualization of multidimensional data without
inherent 2D or 3D semantics.

D.A.Keim (2002)

Classification of Visual DM Techniques based on three criteria a), b), c)

a) The data to be visualized (one- or two-dimensional data, multidimensional data, text and
hypertext, hierarchies and graphs, algorithms and software):

Dimensionality of data set = the number of variables of the data set.
Text and hypertext = in the age of the world wide web one important data type is text and hypertext
Hierarchies and graphs = data records often have some relationship to other pieces of information,
i.e. a graph consists of a set of objects, called nodes, and connections between these objects, called edges.
Algorithms and software = the goal of visualization is to support software development by helping to
understand algorithms, e.g. by showing the flow of information in a program, and to enhance the
understanding of written code, e.g. by representing the structure of thousands of source code lines as graphs

b) The visualization techniques (Standard 2D/3D displays, Geometrically-transformed displays,
Icon-based displays, Dense pixel displays, Stacked displays - treemaps, dimensional stacking)

Geometrically-transformed displays = these techniques aim at finding interesting transformations of
multidimensional data sets. The class of geometric display techniques includes also the well-known
Parallel Coordinate Technique (PCT). The PCT maps the k-dimensional space onto the two display
dimensions by using k equidistant axes which are parallel to one of the display axes
Icon-based displays = the idea is to map the attribute values of a multidimensional data item to the
features of an icon
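The Parallel Coordinate Technique described under geometrically-transformed displays can be made concrete: each k-dimensional record becomes a polyline over k equidistant axes. The sketch below only computes the 2-D vertices of that polyline, without any actual plotting; the record and its value ranges are invented:

```python
def parallel_coordinates(record, mins, maxs, width=1.0):
    """Map a k-dimensional record (k >= 2) to polyline vertices over k equidistant axes.

    Axis i sits at x = i * width / (k - 1); the value is normalized to [0, 1]
    within the data range of that dimension and used as the y coordinate.
    """
    k = len(record)
    vertices = []
    for i, (v, lo, hi) in enumerate(zip(record, mins, maxs)):
        x = i * width / (k - 1)
        y = (v - lo) / (hi - lo)
        vertices.append((x, y))
    return vertices

# A 4-dimensional record mapped onto 4 equidistant axes:
print(parallel_coordinates((3.0, 10.0, 0.5, 7.0),
                           mins=(0.0, 0.0, 0.0, 0.0),
                           maxs=(6.0, 20.0, 1.0, 14.0)))
# -> four vertices, all at y = 0.5, on axes at x = 0, 1/3, 2/3, 1
```

Connecting the returned vertices with line segments yields the familiar parallel-coordinates polyline for the record.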

c) The interaction (IT) and distortion (DT) techniques used (interactive projection, interactive
filtering, interactive zooming, interactive distortion, interactive linking and brushing)

Interaction techniques allow the data analyst to directly interact with visualizations and dynamically
change the visualizations according to exploration objectives
Distortion techniques help in the data exploration process by providing means for focusing on details
while preserving an overview of the data
Interactive filtering, Interactive zooming - in exploring large data sets it is important to interactively
partition the data into segments and focus on interesting subsets. This can be done by a direct selection
of the desired subset (BROWSING) or by a specification of properties of the desired subset
(QUERYING).
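The two interactive-filtering routes just described - direct selection (browsing) and property specification (querying) - can be sketched as two ways of arriving at the same subset. The records and field names below are invented:

```python
records = [
    {"id": 1, "score": 42}, {"id": 2, "score": 88},
    {"id": 3, "score": 91}, {"id": 4, "score": 67},
]

# BROWSING: the user directly selects the desired subset (here, by id).
selected_ids = {2, 3}
browsed = [r for r in records if r["id"] in selected_ids]

# QUERYING: the user specifies a property of the desired subset.
queried = [r for r in records if r["score"] >= 80]

print(browsed == queried)  # -> True: both routes isolate records 2 and 3
```

In an interactive tool the selection would come from mouse input and the query from a form, but the underlying subset operations are exactly these.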




xii) Educational Data Mining

C.Romero, S.Ventura (2006)

Educational Data Mining
a) Currently there is an increasing interest in data mining and educational systems (well-known
learning content management systems, adaptive and intelligent web-based educational systems),
making educational data mining a new growing research community
b) After preprocessing the available data in each case, data mining techniques can be applied in
educational systems: statistics and visualization, clustering, classification and outlier detection,
association rule mining and pattern mining, text mining
c) Data Mining oriented towards students - to show recommendations and to support use, interaction,
participation and communication by students within educational systems
d) Data Mining oriented towards educators (and academic responsibles - administrators) - to show
discovered knowledge and to support design, planning, building and maintenance by educators
(administrators) within educational systems
e) Data Mining tools provide mining algorithms, filtering and visualization techniques. The examples
of Data Mining tool:
- Tool name: Mining tool, Authors: Zaïane and Luo (2001), Mining task: Association and patterns
- Tool name: Multistar, Authors: Silva and Vieira (2002), Mining task: Association and classification
- Tool name: Synergo/ColAT, Authors: Avouris et al (2005), Mining task: Visualization
f) Future research lines in educational Data Mining
- Mining tools that better facilitate the application of data mining by educators or non-expert users
- Standardization of data and methods (preprocessing, discovering, postprocessing)
- Integration with the e-learning system
- Specific data mining techniques

W.Hämäläinen, T.H.Laine, E.Sutinen (2003)

Data Mining system, educational system
Data Mining system in educational system: the educational system should be served by Data Mining
system to monitor, intervene in, and counsel the teaching-studying-learning process

R.E.Scherr, M.Sabella, E.F.Redish (2007)

Curriculum development
Conceptual knowledge is only one aspect of good knowledge structure: how and when knowledge is
activated and used are also important.

Representation of knowledge structure
The nodes represent knowledge. The lines represent relations between different nodes.

R.Newburgh (2008)

Linear and lateral (structural) thought process (in physics)
Why do we lose physics students?
a) There is a wide spectrum in thought process. Of the two major types one is linear (i.e. sequential)
and the other lateral (i.e. seeking horizontal connections).
b) Those who developed physics - from Galileo to Newton to Einstein to Heisenberg - were almost
exclusively linear thinkers. The paradigm for linear thought is Euclidean thinking, Euclidean logic
(many physicists chose physics for their career as a result of their exposure to geometry - a
consequence of this is that textbooks are usually written in a Euclidean format). The sense of
discovery is lost. Many students do not recognize that the Euclidean format is not a valid description
of how we do physics. Their way of approaching problems is different but just as valid. Too many
physics teachers refuse to recognize the limitations of this approach (thereby causing would-be
students who do not think in a Euclidean fashion to leave).
c) The format of our textbooks is Euclidean. Newton's laws, Hamilton-Jacobi theory, and
Maxwell's equations are often presented as quasi-axioms in advanced texts. The laboratories become
fixed exercises in which the student must confirm some principle already established. He knows the
answer before he does the experiment.
d) Now I yield to no one in my admiration for Euclid. He has been an inspiration to many of us. We
understand his genius but also see his limitations. Unfortunately there are many who do not follow
his way of thinking.
e) By presenting alternate approaches to students (specifically uses of lateral thinking), false starts
that must be corrected, and lessons that are discoveries not memorization, we can retain more
students in physics.
f) We should remember that lateral thinking is essential to the formation of analogies, an activity
that one cannot describe as Euclidean. Doing science without analogies seems to me an impossibility.

J.K.Gilbert, O.de Jong, R.Justi, D.F.Treagust, J.H.van Driel (2002), R.Justi, J.K.Gilbert (2002)

Model as a major learning and teaching tool
Models are one of the main products of science, modelling is an element in scientific methodology,
(and) models are a major learning and teaching tool in science education.

Role of chemistry textbooks in the teaching and learning of models and modelling
This role may be discussed from two main angles:
- the way that chemical models are introduced in textbooks
- and the teaching models that they present.

Teaching model, Learning model, Analogies
A teaching model is a representation produced with the specific aim of helping students to
understand some aspect of content. Assuming the abstract nature of chemical knowledge, they
(learning models) are used very frequently in chemical textbooks mainly in the form of overt
analogies, as drawings and as diagrams (specifically to the atom, chemical bonding and chemical
equilibrium)

Some future research directions
a) How can teachers' pedagogical content knowledge about models and modelling be improved?
b) The role of models and modelling in the development of chemical knowledge?
c) How can it be made evident to teachers that the introduction of a model-based teaching and learning
approach can be a way to shift the emphasis in chemical education from transmission of existing
knowledge to a more contemporary perspective in which students will really understand the
nature of chemistry and be able to deal critically with chemistry-related situations?

J.K.Gilbert, O.de Jong, R.Justi, D.F.Treagust, J.H.van Driel (2002), J.H.van Driel (2002)

Curriculum for Chemical Education
a) The central question concerns the design of curricula for chemical education (note: curricular
process) which make chemistry interesting and relevant for various groups of learners (professional
chemists, general educational purposes - it is useful for all citizens in the future)
b) In recent decades, curricula have been changed, on the one hand for general educational
purposes, this has led to context-based approaches to teaching chemistry, on the other hand for
professional chemists specific chemistry courses have been developed in the context of vocational
training, aimed at developing the specific chemical competencies that are needed for various
professions.
c) Finally, chemistry is nowadays also presented in informal ways, for instance, in science centres and
through chemistry shows.


U.-D.Ehlers, J.M.Pawlowski (2006)

Quality and Standardization in E-learning
- Quality development: Methods and approaches
Methods, models, concepts and approaches for the development, management and assurance of quality
in e-learning are introduced
- E-learning standards
The main goal of e-learning standards is to provide solutions to enable and ensure interoperability and
stability of systems, components and objects.

R.Kwan, R.Fox, F.T.Chan, P.Tsang (2008), Le Jun (2008)

Knowledge management, Data Mining
We set up a few objectives and value propositions of the initiative, which was set up to improve teaching
and learning, to enhance the quality of the curriculum, and to extend learning support. We apply Data
Mining tools to discover behavioral characteristics. A few strategies for knowledge management in the
curriculum development in distance education will be discussed.

Le Jun (2008), I.Nonaka, H.Takeuchi (1995), I.Nonaka, H.Takeuchi (2005)

Types of knowledge, Interaction of types
Many knowledge management experts agree that there are two general types of knowledge:

a) Tacit knowledge is linked to personal perspective: intuition, emotion, belief, experience and value. It
is intangible, not easy to articulate, and difficult to share with others.
b) Explicit knowledge has a tangible dimension that can be more easily captured, codified and
communicated

Based on I.Nonaka, H.Takeuchi, these two types of knowledge can interact when
knowledge conversion occurs:
- socialization: from tacit to tacit
- externalization: from tacit to explicit
- combination: from explicit to explicit
- internalization: from explicit to tacit

Le Jun (2008), I.Nonaka, H.Takeuchi (2005)

Research methods for knowledge management

a) Data Mining techniques
b) Web text mining is discovering knowledge from non-structured text (text representation,
feature extraction, text categorization, text clustering, text summarization, semantic analysis, and
information extraction)
c) Learning theory
Learning theories are classified into four paradigms: behavioral theory, cognitive theory,
constructive theory, social learning theory.
We emphasize: Learning is a continuous process that is indistinguishable from ongoing work practice
- by discovering the problems, recognizing their types, and by solving problems in routine work and
learning. Learners can continuously refine their cognitive, information, social and learning
competencies.
d) Knowledge management
Knowledge sharing and application of the SECI model (see I.Nonaka, H.Takeuchi)
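The feature-extraction step mentioned above under web text mining can be illustrated with the simplest term-frequency representation of a text. The tokenization rule and the example sentence below are invented:

```python
from collections import Counter
import re

def term_frequencies(text):
    """Extract a bag-of-words feature vector from non-structured text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

tf = term_frequencies("Learning is a continuous process; learning refines competencies.")
print(tf["learning"])  # -> 0.25 (2 of 8 tokens)
```

The resulting vector of relative term frequencies is the kind of numerical representation on which categorization, clustering, and summarization of text can then operate.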




xiii) Metadata Mining Process

R.Vilalta, C.Giraud-Carrier, P.Brazdil, C.Soares (2004)

Meta-learning Support Data Mining
Current data mining tools are characterized by a plethora of algorithms but a lack of guidelines to
select the right method according to the nature of the problem under analysis. Producing such
guidelines is a primary goal of the field of meta-learning; the research objective is to understand the
interaction between the mechanism of learning and the concrete contexts in which that mechanism is
applicable. The field of meta-learning has seen continuous growth in the past years with interesting
new developments in the construction of practical model-selection assistants, task-adaptive learners,
and a solid conceptual framework. In this paper, we give an overview of different techniques
necessary to build meta-learning systems. We begin by describing an idealized meta-learning
architecture comprising a variety of relevant component techniques. We then look at how each
technique has been studied and implemented by previous research. In addition, we show how
metalearning has already been identified as an important component in real-world applications.

J.Fox (2007)

Definition of Metadata Mining process
Since metadata is just another type of data, applying data mining to metadata is technically
straightforward. XML - eXtensible Markup Language

American Library Association (1999)

Definition of Metadata
a) Since for most people the difference between data and information is merely a philosophical
one of no relevance in practical use, simple definitions are:
Metadata is information about data.
Metadata is information about information.
Metadata is data that contains information about other data.
b) There are more sophisticated definitions, such as:
Metadata is structured, encoded data that describe characteristics of information-bearing
entities to aid in the identification, discovery, assessment, and management of the described
entities.
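Since metadata is just another type of data, a metadata record can be mined like any other data table. A minimal sketch (the XML field names and values below are illustrative assumptions modelled on Dublin-Core-style records, not a real catalogue):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A few illustrative metadata records; field names and values are
# hypothetical examples chosen only to demonstrate the idea.
records_xml = """
<catalogue>
  <record><creator>Smith</creator><subject>statistics</subject></record>
  <record><creator>Jones</creator><subject>statistics</subject></record>
  <record><creator>Smith</creator><subject>data mining</subject></record>
</catalogue>
"""

root = ET.fromstring(records_xml)
# Treat the metadata exactly like data: count frequencies of one field.
subjects = Counter(r.findtext("subject") for r in root.iter("record"))
print(subjects.most_common())   # → [('statistics', 2), ('data mining', 1)]
```

The frequency count itself is an elementary data mining step; any of the techniques discussed above (classification, clustering, ...) could be applied to such records in the same way.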

3.3.7.2. Brief Summary

Data Mining – an analytical-synthetic way of extracting hidden and potentially useful information
from large data files (continuum data–information–knowledge, knowledge discovery)

Data Mining Techniques – system functions of the structure of formerly hidden relations and patterns
(e.g. classification, association, clustering, prediction)

Data Mining Tool – a concrete procedure for reaching the intended system functions
Complex Tool – a resolution of a complex problem of the relevant science branch
Partial Tool – a resolution of a partial problem of the relevant science branch
Result of Data Mining – a result of the data mining tool application
Representation of Data Mining Result – a description of what is being expressed
Visualization of Data Mining Result – an optical retrieval of the data mining result
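One of the listed techniques, clustering, can be made concrete with a short sketch. The data, the choice of one-dimensional k-means with k = 2, and the simple initialisation are illustrative assumptions only:

```python
# Minimal 1-D k-means clustering (illustrative data, k = 2).
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centres = [data[0], data[3]]          # simple initialisation: two seeds

for _ in range(10):                   # a few refinement iterations
    clusters = [[], []]
    for x in data:                    # assignment step: nearest centre
        i = min((abs(x - c), i) for i, c in enumerate(centres))[1]
        clusters[i].append(x)
    centres = [sum(c) / len(c) for c in clusters]   # update step: means

print(sorted(round(c, 2) for c in centres))   # → [1.0, 8.07]
```

The formerly hidden pattern (two groups of observations) is exactly the kind of structure the clustering technique is meant to reveal.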



108

3.3.7.3. Data Mining Cycle, References

i) Quotations from Sources

U.M.Fayyad, G.Piatetsky-Shapiro, P.Smyth (1996)

Cycle of Data Mining
Data Mining can be viewed as a cycle that consists of several steps:
- Identify a problem where analyzing data can provide value
- Collect the data
- Preprocess the data – obtain a clean, mineable table
- Build a model that summarizes patterns of interest in a particular representational form
- Interpret/Evaluate the model
- Deploy the results – incorporate the model into another system for further action.
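The cycle above can be sketched as a skeleton pipeline. The function bodies below are placeholder assumptions (toy data, a trivial threshold model), not a real system; only the shape of the steps follows the cycle:

```python
def collect_data():
    # Step 2: collect the data – here a tiny hard-coded sample (score, passed).
    return [("12", 0), ("35", 1), (None, 0), ("28", 1)]

def preprocess(raw):
    # Step 3: obtain a clean, mineable table (drop missing, convert types).
    return [(int(s), y) for s, y in raw if s is not None]

def build_model(table):
    # Step 4: summarise a pattern – here a threshold classifier on the score.
    threshold = sum(x for x, _ in table) / len(table)
    return lambda x: 1 if x > threshold else 0

def evaluate(model, table):
    # Step 5: interpret/evaluate – accuracy on the same table.
    return sum(model(x) == y for x, y in table) / len(table)

table = preprocess(collect_data())
model = build_model(table)
print(evaluate(model, table))   # → 1.0 on this toy sample
```

Step 6 (deployment) would wrap `model` into another system; it is omitted here because it depends entirely on the target application.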

J.Luan (2002)

Steps for Data Mining preparation (algorithm, building, visualization)
a) Investigate the possibility of overlaying Data Mining algorithms directly on a data warehouse
b) Select a solid querying tool to build Data Mining files. These files closely resemble
multidimensional cubes
c) Data Visualization and Validation. This means both examining frequency counts as well as
generating scatter plots, histograms, and other graphics, including clustering models
d) Mine your data

Le Jun (2008)

Main processes of Data Mining
- The main processes include data definition, data gathering, preprocessing, data processing and
discovering knowledge or patterns (Data Mining techniques can be implemented rapidly on existing
software and hardware)
- Application of Data Mining tools: to solve the tasks of prediction, classification, explicit modeling
and clustering. The application can help understand learners' learning behaviors.

ii) Brief Summary of Data Mining Cycle

- Data Definition, Data Gathering
- Data Preprocessing, Data Processing
- Data Mining Techniques and Data Mining Tools,
- Discovering Knowledge or Patterns,
- Representation and Visualization of Data Mining Results,
- Application.

References

i. Tarábek,P., Záškodný,P. (2009-2010)
Educational and Didactic Communication 2009.
Bratislava, Slovak Republic: Didaktis, www.didaktis.sk, ISBN 978-80-89160-69-3

ii. Záškodný,P., Pavlát,V. (2009-2010a)
Data Mining – A Brief Recherche (in: i.)

iii. Záškodný,P., Novák,V. (2009-2010b)
Data Mining – A Brief Summary (in: i.)
109

Part 4. STATISTICAL TABLES

Table I.: Values of distribution function of standardized normal distribution


u F(u) u F(u) u F(u) u F(u)

0,00 0,500 00 0,35 0,636 83 0,70 0,758 04 1,05 0,853 14
0,01 0,503 99 0,36 0,640 58 0,71 0,761 15 1,06 0,855 43
0,02 0,507 98 0,37 0,644 31 0,72 0,764 24 1,07 0,857 69
0,03 0,511 97 0,38 0,648 03 0,73 0,767 30 1,08 0,859 93
0,04 0,515 95 0,39 0,651 73 0,74 0,770 35 1,09 0,862 14

0,05 0,519 94 0,40 0,655 42 0,75 0,773 37 1,10 0,864 33
0,06 0,523 92 0,41 0,659 10 0,76 0,776 37 1,11 0,866 50
0,07 0,527 90 0,42 0,662 76 0,77 0,779 35 1,12 0,868 64
0,08 0,531 88 0,43 0,666 40 0,78 0,782 30 1,13 0,870 76
0,09 0,535 86 0,44 0,670 03 0,79 0,785 24 1,14 0,872 86

0,10 0,539 83 0,45 0,673 64 0,80 0,788 14 1,15 0,874 93
0,11 0,543 80 0,46 0,677 24 0,81 0,791 03 1,16 0,876 98
0,12 0,547 76 0,47 0,680 82 0,82 0,793 89 1,17 0,879 00
0,13 0,551 72 0,48 0,684 39 0,83 0,796 73 1,18 0,881 00
0,14 0,555 67 0,49 0,687 93 0,84 0,799 55 1,19 0,882 98

0,15 0,559 62 0,50 0,691 46 0,85 0,802 34 1,20 0,884 93
0,16 0,563 56 0,51 0,694 97 0,86 0,805 11 1,21 0,886 86
0,17 0,567 49 0,52 0,698 47 0,87 0,807 85 1,22 0,888 77
0,18 0,571 42 0,53 0,701 94 0,88 0,810 57 1,23 0,890 65
0,19 0,575 35 0,54 0,705 40 0,89 0,813 27 1,24 0,892 51

0,20 0,579 26 0,55 0,708 84 0,90 0,815 94 1,25 0,894 35
0,21 0,583 17 0,56 0,712 26 0,91 0,818 59 1,26 0,896 17
0,22 0,587 06 0,57 0,715 66 0,92 0,821 21 1,27 0,897 96
0,23 0,590 95 0,58 0,719 04 0,93 0,823 81 1,28 0,899 73
0,24 0,594 83 0,59 0,722 40 0,94 0,826 39 1,29 0,901 47

0,25 0,598 71 0,60 0,725 75 0,95 0,828 94 1,30 0,903 20
0,26 0,602 57 0,61 0,729 07 0,96 0,831 47 1,31 0,904 90
0,27 0,606 42 0,62 0,732 37 0,97 0,833 98 1,32 0,906 58
0,28 0,610 26 0,63 0,735 65 0,98 0,836 46 1,33 0,908 24
0,29 0,614 09 0,64 0,738 91 0,99 0,838 91 1,34 0,909 88

0,30 0,617 91 0,65 0,742 15 1,00 0,841 34 1,35 0,911 49
0,31 0,621 72 0,66 0,745 37 1,01 0,843 75 1,36 0,913 09
0,32 0,625 52 0,67 0,748 57 1,02 0,846 14 1,37 0,914 66
0,33 0,629 30 0,68 0,751 75 1,03 0,848 50 1,38 0,916 21
0,34 0,633 07 0,69 0,754 90 1,04 0,850 83 1,39 0,917 74

110

u F(u) u F(u) u F(u) u F(u)

1,40 0,919 24 1,85 0,967 84 2,30 0,989 28 3,00 0,998 65
1,41 0,920 73 1,86 0,968 56 2,31 0,989 56 3,02 0,998 74
1,42 0,922 20 1,87 0,969 26 2,32 0,989 83 3,04 0,998 82
1,43 0,923 64 1,88 0,969 95 2,33 0,990 10 3,06 0,998 89
1,44 0,925 07 1,89 0,970 62 2,34 0,990 36 3,08 0,998 97

1,45 0,926 47 1,90 0,971 28 2,35 0,990 61 3,10 0,999 03
1,46 0,927 86 1,91 0,971 93 2,36 0,990 86 3,12 0,999 10
1,47 0,929 22 1,92 0,972 57 2,37 0,991 11 3,14 0,999 16
1,48 0,930 56 1,93 0,973 20 2,38 0,991 34 3,16 0,999 21
1,49 0,931 89 1,94 0,973 81 2,39 0,991 58 3,18 0,999 26

1,50 0,933 19 1,95 0,974 41 2,40 0,991 80 3,20 0,999 31
1,51 0,934 48 1,96 0,975 00 2,41 0,992 02 3,22 0,999 36
1,52 0,935 74 1,97 0,975 58 2,42 0,992 24 3,24 0,999 40
1,53 0,936 99 1,98 0,976 15 2,43 0,992 45 3,26 0,999 44
1,54 0,938 22 1,99 0,976 70 2,44 0,992 66 3,28 0,999 48

1,55 0,939 43 2,00 0,977 25 2,45 0,992 86 3,30 0,999 52
1,56 0,940 62 2,01 0,977 78 2,46 0,993 05 3,32 0,999 55
1,57 0,941 79 2,02 0,978 31 2,47 0,993 24 3,34 0,999 58
1,58 0,942 95 2,03 0,978 82 2,48 0,993 43 3,36 0,999 61
1,59 0,944 08 2,04 0,979 32 2,49 0,993 61 3,38 0,999 64

1,60 0,945 20 2,05 0,979 82 2,50 0,993 79 3,40 0,999 66
1,61 0,946 30 2,06 0,980 30 2,52 0,994 13 3,42 0,999 69
1,62 0,947 38 2,07 0,980 77 2,54 0,994 46 3,44 0,999 71
1,63 0,948 45 2,08 0,981 24 2,56 0,994 77 3,46 0,999 73
1,64 0,949 50 2,09 0,981 69 2,58 0,995 06 3,48 0,999 75

1,65 0,950 53 2,10 0,982 14 2,60 0,995 34 3,50 0,999 77
1,66 0,951 54 2,11 0,982 57 2,62 0,995 60 3,55 0,999 81
1,67 0,952 54 2,12 0,983 00 2,64 0,995 85 3,60 0,999 84
1,68 0,953 52 2,13 0,983 41 2,66 0,996 09 3,65 0,999 87
1,69 0,954 49 2,14 0,983 82 2,68 0,996 32 3,70 0,999 89

1,70 0,955 43 2,15 0,984 22 2,70 0,996 53 3,75 0,999 91
1,71 0,956 37 2,16 0,984 61 2,72 0,996 74 3,80 0,999 93
1,72 0,957 28 2,17 0,985 00 2,74 0,996 93 3,85 0,999 94
1,73 0,958 18 2,18 0,985 37 2,76 0,997 11 3,90 0,999 95
1,74 0,959 07 2,19 0,985 74 2,78 0,997 28 3,95 0,999 96

1,75 0,959 94 2,20 0,986 10 2,80 0,997 44 4,00 0,999 97
1,76 0,960 80 2,21 0,986 45 2,82 0,997 60 4,05 0,999 97
1,77 0,961 64 2,22 0,986 79 2,84 0,997 74 4,10 0,999 98
1,78 0,962 46 2,23 0,987 13 2,86 0,997 88 4,15 0,999 98
1,79 0,963 27 2,24 0,987 45 2,88 0,998 01 4,20 0,999 99

111

u F(u) u F(u) u F(u) u F(u)

1,80 0,964 07 2,25 0,987 78 2,90 0,998 13 4,25 0,999 99
1,81 0,964 85 2,26 0,988 09 2,92 0,998 25 4,30 0,999 99
1,82 0,965 62 2,27 0,988 40 2,94 0,998 36 4,35 0,999 99
1,83 0,966 38 2,28 0,988 70 2,96 0,998 46 4,40 0,999 99
1,84 0,967 12 2,29 0,988 99 2,98 0,998 56 4,45 1,000 00
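The tabulated values F(u) can be reproduced directly, since the distribution function of the standardized normal distribution is expressible through the error function. A minimal sketch using only Python's standard library:

```python
from math import erf, sqrt

def Phi(u):
    """Distribution function of the standardized normal distribution."""
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

# Spot-check a few entries of Table I (values rounded to five decimals):
for u, table_value in [(0.00, 0.50000), (1.00, 0.84134), (1.96, 0.97500)]:
    assert abs(Phi(u) - table_value) < 5e-6

print(round(Phi(1.0), 5))   # → 0.84134
```

The same function also covers the symmetric half of the table not printed here, via F(−u) = 1 − F(u).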

112

Table II.: Critical values of u-test


α 0,20 0,10 0,05 0,025 0,01 0,005
u(α) 0,842 1,282 1,645 1,960 2,326 2,576
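These critical values are the quantiles u(α) satisfying Φ(u(α)) = 1 − α, where Φ is the distribution function of Table I. A minimal sketch recovering them by bisection (standard library only):

```python
from math import erf, sqrt

def Phi(u):
    # Distribution function of the standardized normal distribution.
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def u_crit(alpha):
    """Solve Phi(u) = 1 - alpha for u by bisection on [0, 10]."""
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if Phi(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(u_crit(0.05), 3))    # → 1.645 (cf. Table II)
print(round(u_crit(0.025), 3))   # → 1.96
```

Sixty bisection steps shrink the bracket far below the three decimals printed in the table, so the interval endpoints agree to the displayed precision.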
113

Table III.: Critical values of t-test



ν \ α 0,05 0,025 0,01 0,005
1 6,31 12,71 31,82 63,66
2 2,92 4,30 6,96 9,92
3 2,35 3,18 4,54 5,84
4 2,13 2,78 3,75 4,60
5 2,02 2,57 3,36 4,03

6 1,94 2,45 3,14 3,71
7 1,90 2,36 3,00 3,50
8 1,86 2,31 2,90 3,38
9 1,83 2,26 2,82 3,25
10 1,81 2,23 2,76 3,17

11 1,80 2,20 2,72 3,11
12 1,78 2,18 2,68 3,06
13 1,77 2,16 2,65 3,01
14 1,76 2,14 2,62 2,98
15 1,75 2,13 2,60 2,95

16 1,75 2,12 2,58 2,92
17 1,74 2,11 2,57 2,90
18 1,73 2,10 2,55 2,88
19 1,73 2,09 2,54 2,86
20 1,72 2,09 2,53 2,84

21 1,72 2,08 2,52 2,83
22 1,72 2,07 2,51 2,82
23 1,71 2,07 2,50 2,81
24 1,71 2,06 2,49 2,80
25 1,71 2,06 2,48 2,79

26 1,71 2,06 2,48 2,78
27 1,70 2,05 2,47 2,77
28 1,70 2,05 2,47 2,76
29 1,70 2,04 2,46 2,76
30 1,70 2,04 2,46 2,75

31 1,70 2,04 2,45 2,75
32 1,69 2,03 2,45 2,74
33 1,69 2,03 2,45 2,74
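The quantiles of Student's distribution have no closed form, but any row of Table III can be checked numerically from the density with ν degrees of freedom, using Simpson integration for the distribution function and bisection for its inverse. A verification sketch, not a production routine:

```python
from math import gamma, sqrt, pi

def t_cdf(x, nu, n=1000):
    """CDF of Student's t: 1/2 plus Simpson integral of the density on (0, x)."""
    c = gamma((nu + 1) / 2.0) / (sqrt(nu * pi) * gamma(nu / 2.0))
    f = lambda t: c * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)
    h = x / n
    s = f(0.0) + f(x)
    for i in range(1, n):                 # Simpson's composite rule
        s += (4 if i % 2 else 2) * f(i * h)
    return 0.5 + s * h / 3.0

def t_crit(alpha, nu):
    """Critical value: solve t_cdf(t, nu) = 1 - alpha by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, nu) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(t_crit(0.05, 10), 2))    # → 1.81 (cf. Table III, row ν = 10)
print(round(t_crit(0.025, 10), 2))   # → 2.23
```

For large ν the values approach the normal quantiles of Table II, which is why the rows change slowly near the bottom of the table.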
114

Table IV.: Critical values of χ²-test



ν \ α 0,995 0,975 0,05 0,025 0,01 0,005
1 0,00 0,00 3,84 5,02 6,63 7,88
2 0,01 0,05 5,99 7,38 9,21 10,6
3 0,07 0,22 7,81 9,35 11,34 12,84
4 0,21 0,48 9,49 11,14 13,28 14,86
5 0,41 0,83 11,07 12,83 15,09 16,75

6 0,68 1,24 12,59 14,45 16,81 18,55
7 0,99 1,69 14,07 16,01 18,48 20,28
8 1,34 2,18 15,51 17,52 20,09 21,95
9 1,73 2,70 16,92 19,02 21,67 23,59
10 2,16 3,25 18,31 20,48 23,21 25,19

11 2,60 3,82 19,68 21,92 24,72 26,76
12 3,07 4,40 21,03 23,34 26,22 28,30
13 3,57 5,01 22,36 24,74 27,69 29,82
14 4,07 5,63 23,68 26,12 29,14 31,32
15 4,60 6,26 25,00 27,49 30,58 32,80

16 5,14 6,91 26,30 28,85 32,00 34,27
17 5,70 7,56 27,59 30,19 33,41 35,72
18 6,26 8,23 28,87 31,53 34,81 37,16
19 6,84 8,91 30,14 32,85 36,19 38,58
20 7,43 9,59 31,41 34,17 37,57 40,00

21 8,03 10,28 32,67 35,46 38,93 41,40
22 8,64 10,98 33,92 36,76 40,29 42,80
23 9,26 11,69 35,17 38,08 41,64 44,18
24 9,89 12,40 36,42 39,36 42,98 45,56
25 10,52 13,12 37,65 40,65 44,31 46,93

30 13,79 16,79 43,77 46,98 50,89 53,67
35 17,19 20,57 49,80 53,20 57,34 60,27
40 20,71 24,43 55,76 59,34 63,69 66,70
45 24,31 28,37 61,66 65,41 69,96 73,17
50 27,99 32,36 67,50 71,42 76,15 79,49

60 35,53 40,48 79,08 83,30 88,38 91,95
70 43,28 48,76 90,58 95,02 100,43 104,21
80 51,17 57,15 101,88 106,63 112,33 116,32
90 59,20 65,65 113,15 118,14 124,12 128,30
100 67,33 74,22 124,34 129,56 135,81 140,17
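The distribution function of χ² with ν degrees of freedom is the regularised lower incomplete gamma function P(ν/2, x/2). A standard-library sketch using its series expansion, with bisection for the critical values (a numerical check on the table, not a reference implementation):

```python
from math import exp, log, lgamma

def chi2_cdf(x, nu, terms=500):
    """CDF of chi-square via the series
    P(a, y) = y^a e^{-y} / Gamma(a) * sum_{n>=0} y^n / (a (a+1) ... (a+n)),
    with a = nu/2 and y = x/2."""
    a, y = nu / 2.0, x / 2.0
    if y <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= y / (a + n)
        total += term
    return exp(a * log(y) - y - lgamma(a)) * total

def chi2_crit(alpha, nu):
    """Critical value: solve chi2_cdf(x, nu) = 1 - alpha by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, nu) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(chi2_crit(0.05, 10), 2))    # → 18.31 (cf. Table IV, row ν = 10)
print(round(chi2_crit(0.995, 10), 2))   # → 2.16
```

Note that with the convention of Table IV the columns α = 0,995 and α = 0,975 yield the small, lower-tail quantiles (CDF values 0,005 and 0,025), as the second printed value illustrates.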
115



Table V.: Critical values of F-test for α = 0,05




ν₂ \ ν₁ 1 2 3 4 5 6 7 8 9 10 20 40 60 120
1 161 200 216 225 230 234 237 239 241 242 248 251 252 253
2 18,5 19,0 19,2 19,2 19,3 19,3 19,4 19,4 19,4 19,4 19,4 19,5 19,5 19,5
3 10,1 9,55 9,28 9,12 9,01 8,94 8,89 8,85 8,81 8,79 8,66 8,59 8,57 8,55
4 7,71 6,94 6,59 6,39 6,26 6,16 6,09 6,04 6,00 5,96 5,80 5,72 5,69 5,66
5 6,61 5,79 5,41 5,19 5,05 4,95 4,88 4,82 4,77 4,74 4,56 4,46 4,43 4,40

6 5,99 5,14 4,76 4,53 4,39 4,28 4,21 4,15 4,10 4,06 3,87 3,77 3,74 3,70
7 5,59 4,74 4,35 4,12 3,97 3,87 3,79 3,73 3,68 3,64 3,44 3,34 3,30 3,27
8 5,32 4,46 4,07 3,84 3,69 3,58 3,50 3,44 3,39 3,35 3,15 3,04 3,01 2,97
9 5,12 4,26 3,86 3,63 3,48 3,37 3,29 3,23 3,18 3,14 2,94 2,83 2,79 2,75
10 4,96 4,10 3,71 3,48 3,33 3,22 3,14 3,07 3,02 2,98 2,77 2,66 2,62 2,58

11 4,84 3,98 3,59 3,36 3,20 3,09 3,01 2,95 2,90 2,85 2,65 2,53 2,49 2,45
12 4,75 3,89 3,49 3,26 3,11 3,00 2,91 2,85 2,80 2,75 2,54 2,43 2,38 2,34
13 4,67 3,81 3,41 3,18 3,03 2,92 2,83 2,77 2,71 2,67 2,46 2,34 2,30 2,25
14 4,60 3,74 3,34 3,11 2,96 2,85 2,76 2,70 2,65 2,60 2,39 2,27 2,22 2,18
15 4,54 3,68 3,29 3,06 2,90 2,79 2,71 2,64 2,59 2,54 2,33 2,20 2,16 2,11
116




Table V.: Critical values of F-test for α = 0,05




ν₂ \ ν₁ 1 2 3 4 5 6 7 8 9 10 20 40 60 120
16 4,49 3,63 3,24 3,01 2,85 2,74 2,66 2,59 2,54 2,49 2,28 2,15 2,11 2,06
17 4,45 3,59 3,20 2,96 2,81 2,70 2,61 2,55 2,49 2,45 2,23 2,10 2,06 2,01
18 4,41 3,55 3,16 2,93 2,77 2,66 2,58 2,51 2,46 2,41 2,19 2,06 2,02 1,97
19 4,38 3,52 3,13 2,90 2,74 2,63 2,54 2,48 2,42 2,38 2,16 2,03 1,98 1,93
20 4,35 3,49 3,10 2,87 2,71 2,60 2,51 2,45 2,39 2,35 2,12 1,99 1,95 1,90

21 4,32 3,47 3,07 2,84 2,68 2,57 2,49 2,42 2,37 2,32 2,10 1,96 1,92 1,87
22 4,30 3,44 3,05 2,82 2,66 2,55 2,46 2,40 2,34 2,30 2,07 1,94 1,89 1,84
23 4,28 3,42 3,03 2,80 2,64 2,53 2,44 2,37 2,32 2,27 2,05 1,91 1,86 1,81
24 4,26 3,40 3,01 2,78 2,62 2,51 2,42 2,36 2,30 2,25 2,03 1,89 1,84 1,79
25 4,24 3,39 2,99 2,76 2,60 2,49 2,40 2,34 2,28 2,24 2,01 1,87 1,82 1,77

26 4,23 3,37 2,98 2,74 2,59 2,47 2,39 2,32 2,27 2,22 1,99 1,85 1,80 1,75
27 4,21 3,35 2,96 2,73 2,57 2,46 2,37 2,31 2,25 2,20 1,97 1,84 1,79 1,73
28 4,20 3,34 2,95 2,71 2,56 2,45 2,36 2,29 2,24 2,19 1,96 1,82 1,77 1,71
29 4,18 3,33 2,93 2,70 2,55 2,43 2,35 2,28 2,22 2,18 1,94 1,81 1,75 1,70
30 4,17 3,32 2,92 2,69 2,53 2,42 2,33 2,27 2,21 2,16 1,93 1,79 1,74 1,68

40 4,08 3,23 2,84 2,61 2,45 2,34 2,25 2,18 2,12 2,08 1,84 1,69 1,64 1,58
60 4,00 3,15 2,76 2,53 2,37 2,25 2,17 2,10 2,04 1,99 1,75 1,59 1,53 1,47
120 3,92 3,07 2,68 2,45 2,29 2,17 2,09 2,02 1,96 1,91 1,66 1,50 1,43 1,35
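The first column of Table V (ν₁ = 1) can be cross-checked against Table III, because F(0,05; 1, ν) equals the square of the t critical value t(0,025; ν). A small consistency check, with the values copied from the two tables above:

```python
# Cross-check: the nu1 = 1 column of Table V equals the square of the
# alpha = 0,025 column of Table III (values copied from the tables).
t_values = {10: 2.23, 20: 2.09, 30: 2.04}    # Table III, alpha = 0,025
f_values = {10: 4.96, 20: 4.35, 30: 4.17}    # Table V, first column

for nu, t in t_values.items():
    # Agreement only up to the two decimals printed in the tables.
    assert abs(t * t - f_values[nu]) < 0.05

print("Tables III and V are mutually consistent")
```

The identity holds because a squared t-distributed variable with ν degrees of freedom follows the F-distribution with (1, ν) degrees of freedom.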
117



Table VI.: Critical values of F-test for α = 0,01




ν₂ \ ν₁ 1 2 3 4 5 6 7 8 9 10 20 40 60 120
1 4050 5000 5400 5620 5760 5860 5930 5980 6020 6060 6210 6290 6310 6340
2 98,5 99,0 99,2 99,2 99,3 99,3 99,4 99,4 99,4 99,4 99,4 99,5 99,5 99,5
3 34,1 30,8 29,5 28,7 28,2 27,9 27,7 27,5 27,3 27,2 26,7 26,4 26,3 26,2
4 21,2 18 16,7 16 15,5 15,2 15 14,8 14,7 14,5 14 13,7 13,7 13,6
5 16,3 13,3 12,1 11,4 11 10,7 10,5 10,3 10,2 10,1 9,55 9,2 9,2 9,11

6 13,7 10,9 9,78 9,15 8,75 8,47 8,26 8,1 7,98 7,87 7,4 7,14 7,06 6,97
7 12,2 9,55 8,45 7,85 7,46 7,19 6,99 6,84 6,72 6,62 6,16 5,91 5,82 5,74
8 11,3 8,65 7,59 7,01 6,63 6,37 6,18 6,03 5,91 5,81 5,36 5,12 5,03 4,95
9 10,6 8,02 6,99 6,42 6,06 5,8 5,61 5,47 5,35 5,26 4,81 4,57 4,48 4,4
10 10 7,56 6,55 5,99 5,64 5,39 5,2 5,06 4,94 4,85 4,41 4,17 4,08 4

11 9,65 7,21 6,22 5,67 5,32 5,07 4,89 4,74 4,63 4,54 4,1 3,86 3,78 3,69
12 9,33 6,93 5,95 5,41 5,06 4,82 4,64 4,5 4,39 4,3 3,86 3,62 3,54 3,45
13 9,07 6,7 5,74 5,21 4,86 4,62 4,44 4,3 4,19 4,1 3,66 3,43 3,34 3,25
14 8,86 6,51 5,56 5,04 4,69 4,46 4,28 4,14 4,03 3,94 3,51 3,27 3,18 3,09
15 8,68 6,36 5,42 4,89 4,56 4,32 4,14 4,00 3,89 3,80 3,37 3,13 3,05 2,96
118



Table VI.: Critical values of F-test for α = 0,01




ν₂ \ ν₁ 1 2 3 4 5 6 7 8 9 10 20 40 60 120
16 8,53 6,23 5,29 4,77 4,44 4,2 4,03 3,89 3,78 3,69 3,26 3,02 2,93 2,84
17 8,40 6,11 5,18 4,67 4,34 4,10 3,93 3,79 3,68 3,59 3,16 2,92 2,83 2,75
18 8,29 6,01 5,09 4,58 4,25 4,01 3,84 3,71 3,6 3,51 3,08 2,84 2,75 2,66
19 8,18 5,93 5,01 4,5 4,17 3,94 3,77 3,63 3,52 3,43 3 2,76 2,67 2,58
20 8,1 5,85 4,94 4,43 4,1 3,87 3,7 3,56 3,46 3,37 2,94 2,69 2,61 2,52

21 8,02 5,78 4,87 4,37 4,04 3,81 3,64 3,51 3,4 3,31 2,88 2,64 2,55 2,46
22 7,95 5,72 4,82 4,31 3,99 3,76 3,59 3,45 3,35 3,26 2,83 2,58 2,5 2,4
23 7,88 5,66 4,76 4,26 3,94 3,71 3,54 3,41 3,3 3,21 2,78 2,54 2,45 2,35
24 7,82 5,61 4,72 4,22 3,9 3,67 3,5 3,36 3,26 3,17 2,74 2,49 2,4 2,31
25 7,77 5,57 4,68 4,18 3,85 3,63 3,46 3,32 3,22 3,13 2,7 2,45 2,36 2,27

26 7,72 5,53 4,64 4,14 3,82 3,59 3,42 3,29 3,18 3,09 2,66 2,42 2,33 2,23
27 7,68 5,49 4,6 4,11 3,78 3,56 3,39 3,26 3,15 3,06 2,63 2,38 2,29 2,2
28 7,64 5,45 4,57 4,07 3,75 3,53 3,36 3,23 3,12 3,03 2,60 2,35 2,26 2,17
29 7,6 5,42 4,54 4,04 3,73 3,5 3,33 3,2 3,09 3 2,57 2,33 2,23 2,14
30 7,56 5,39 4,51 4,02 3,7 3,47 3,3 3,17 3,07 2,98 2,55 2,3 2,21 2,11

40 7,31 5,18 4,31 3,83 3,51 3,29 3,12 2,99 2,89 2,8 2,37 2,11 2,02 1,92
60 7,08 4,98 4,13 3,65 3,34 3,12 2,95 2,82 2,72 2,63 2,2 1,94 1,84 1,73
120 6,85 4,79 3,95 3,48 3,17 2,96 2,79 2,66 2,56 2,47 2,03 1,76 1,66 1,53

119


CV of Author

Assoc. Prof. RNDr. Přemysl Záškodný, CSc.

Assoc. Prof. RNDr. Přemysl Záškodný, CSc. graduated from the Faculty of Mathematics and
Physics of Charles University, obtained the CSc. degree in physics education, and was appointed
docent (associate professor) of physics education. As a university teacher, he is affiliated with
the University of South Bohemia in České Budějovice and with the University of Finance and
Administration in Prague.

He is active in scientific work in cooperation with the International Institute of
Informatics and Systemics in the U.S.A. and with the Curriculum Studies Research Group in
Slovakia. In his scientific work, aimed at science and statistics education, he deals with
structuring and modelling physics and statistics knowledge and systems of knowledge, and
also with data mining and the curricular process.

In addition to support from his faculty and university, the projects granted to the
author by the Avenira Foundation in Switzerland and by the University of Finance and
Administration in the Czech Republic have brought a considerable contribution to the results
achieved.

The conception of his latest books Survey of Principles of Theoretical Physics,
Curricular Process in Physics, Fundaments of Statistics (with co-authors), and From
Financial Derivatives to Option Hedging (with a co-author), and of the monographs Educational
and Didactic Communication 2008, 2009, 2010 and 2011, is based on the scientific work of the
author. Some of his further published works are quoted in the bibliography.

Assoc. Prof. RNDr. Přemysl Záškodný, CSc. is active as general chair of the international
e-conferences OEDM-SERM 2011 and OEDM-SERM 2012 (Optimization, Education and
Data Mining in Science, Engineering and Risk Management).

















120


Bibliography of Author

i) The monographs

Tarabek,P., Zaskodny,P.: Analytical-Synthetic Modelling of Cognitive Structures (volume 1:
New structural methods and their application).
Educational Publisher Didaktis Ltd., Bratislava, London 2001

Tarabek,P., Zaskodny,P.: Analytical-Synthetic Modelling of Cognitive Structures (volume 2:
Didactic communication and educational sciences).
Educational Publisher Didaktis Ltd., Bratislava, New York 2002

Tarabek,P., Zaskodny,P.: Structure, Formation and Design of Textbook (volume 1:
Theoretical basis).
Educational Publisher Didaktis Ltd., Bratislava, London 2003

Tarabek,P., Zaskodny,P.: Structure, Formation and Design of Textbook (volume 2: Theory
and practice).
Educational Publisher Didaktis Ltd., Bratislava, London 2004

Tarabek,P., Zaskodny,P.: Modern Science and Textbook Creation (volume 1: Projection of
scientific systems).
Educational Publisher Didaktis Ltd., Bratislava, Frankfurt a.M. 2005
Tarabek,P., Zaskodny,P.: Modern Science and Textbook Creation (volume 2: Modern
tendencies in textbook creation).
Educational Publisher Didaktis Ltd., Bratislava, Frankfurt a.M. 2006

Tarabek,P., Zaskodny,P.: Educational and Didactic Communication 2007
Educational Publisher Didaktis Ltd., Bratislava, Frankfurt a.M. 2008

Tarabek,P., Zaskodny,P.: Educational and Didactic Communication 2008
Educational Publisher Didaktis Ltd., Bratislava, Frankfurt a.M. 2009

Tarabek,P., Zaskodny,P.: Educational and Didactic Communication 2009
Educational Publisher Didaktis Ltd., Bratislava, 2010

Tarabek,P., Zaskodny,P.: Educational and Didactic Communication 2010
Educational Publisher Didaktis Ltd., Bratislava, 2011

Tarabek,P., Zaskodny,P.: Educational and Didactic Communication 2011
Educational Publisher Didaktis Ltd., Bratislava, 2012


ii) The books

Pavlát,V., Záškodný,P. et al.: Capital Market, The first edition, 2003

Záškodný,P.: Survey of Principles of Theoretical Physics (with Application to Radiology)
(in Czech). Didaktis, Bratislava, Slovak Republic 2005

121


Záškodný,P.: Survey of Principles of Theoretical Physics (with Application to Radiology) (in
English). Avenira, Switzerland, Algoritmus, Ostrava, Czech Republic 2006

Pavlát,V., Záškodný,P. et al.: Capital Market, The second edition, 2006

Záškodný,P.: Curricular Process in Physics (in Czech). Avenira, Switzerland, Algoritmus,
Ostrava, Czech Republic 2009

Záškodný,P. et al.: Fundaments of Statistics (in Czech). Curriculum, Czech Republic 2011

Pavlát,V., Záškodný,P.: From Financial Derivatives to Option Hedging. Curriculum, Czech
Republic 2012


iii) The textbooks

Záškodný,P.: Theoretical Mechanics in Examples I (in Czech). PF, Ostrava, Czech
Republic 1984

Záškodný,P., Sklenák,L.: Theoretical Mechanics in Examples II (in Czech). PF, Ostrava,
Czech Republic 1986

Záškodný,P. et al.: Principles of Economical Statistics (in Czech). VSFS, Praha, Czech
Republic 2004

Budinský,P., Záškodný,P.: Financial and Investment Mathematics. VSFS, Prague 2004

Záškodný,P. et al.: Principles of Health Statistics (in Czech). JU, České Budějovice, Czech
Republic 2005

Kozlovská,D., Skalický,Z., Záškodný,P.: Introduction to Practicum from Radiological
Physics. JCU, České Budějovice, Czech Republic, 2007

Záškodný,P., Pavlát,V., Budík,J.: Financial Derivatives and Their Evaluation. Prague,
University of Finance and Administration, 2009


iv) The papers

Approximately 100 papers










122



Global References

Dalgaard,P. (2008). Introductory Statistics with R. Second Edition. New York, USA:
Springer. (In English)
ISBN-13: 978-038779-053-4

Field,A. (2009). Discovering Statistics Using SPSS. Third Edition. London, New Delhi,
Singapore: SAGE. (In English)
ISBN-13: 978-184787-907-3

Jorion,P. (2007). Financial Risk Manager. Handbook. Hoboken, New Jersey, USA:
Wiley&Sons. (In English)
ISBN 978-0-470-12630-1

Matloff,N. (2011). The Art of R Programming: A Tour of Statistical Software Design. USA: No
Starch Press. (In English)
ISBN-13: 978-159327-384-2

Pavlát,V., Záškodný,P. (2012). From Financial Derivatives to Option Hedging. Prague, Czech
Republic: Curriculum. (In Czech)
ISBN 978-80-904948-3-1

Tarábek,P., Záškodný,P. (2011). Data Mining Tools in Statistics Education. In:
Educational&Didactic Communication 2010. Bratislava, Slovakia: Didaktis. (In English)
ISBN 978-80-89160-78-5

Záškodný,P. et al. (2007). Principles of Economical Statistics. Prague, Czech Republic:
Eupress. (Partly in English)
ISBN 80-86754-00-6
