
History

The word statistik comes from the Italian word statista (meaning statesman). It was first used by Gottfried Achenwall (1719-1772), a professor at Marburg and Göttingen. Dr. E.A.W. Zimmermann introduced the word statistics to England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland 1791-1799. Long before the eighteenth century, however, people had been recording and using data.

Statistics
Statistics is, first and foremost, a collection of tools used for converting raw data into information to help decision makers in their work. Some consider statistics to be a mathematical body of science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.
Descriptive statistics: used for summarising or describing a collection of data.
Inferential statistics: patterns in the data are modelled in a way that accounts for randomness and uncertainty in the observations, and are then used for drawing inferences about the process or population being studied.

POPULATION & SAMPLE


The total number of occurrences of a particular thing (e.g. mineral species, fossil type, rock type) present in a defined area is called a Population. Sampling is a scientific, selective process applied to a large mass or group (a population, as defined by the investigator) in order to reduce its bulk for interpretation purposes. This is achieved by identifying a component part (a sample) which reflects the characteristics of the parent population within acceptable limits of accuracy, precision, and cost effectiveness.

A best estimate of population parameters can be made from statistics of samples.


It is fundamental in sampling that samples are representative, at all times, of the population. If they are not, the results are incorrect. Correct sampling of such material has to ensure that all constituent units of the population have a uniform probability of being selected to form the sample. This is the concept of random sampling. (It is often very difficult in geology to take a random sample.) Even a random sample may not be a good representative of the population. Variation within samples may make it difficult to interpret any effect of differences in location.
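The idea of giving every unit the same chance of selection can be sketched in a few lines of Python. This is only an illustration: the site identifiers, the sample size and the seed are hypothetical, and the helper name `simple_random_sample` is not from the original text.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)  # the seed only makes this sketch reproducible
    return rng.sample(list(population), n)

# Hypothetical population: 200 sample-site identifiers
sites = [f"site_{i:03d}" for i in range(200)]
chosen = simple_random_sample(sites, 20, seed=1)
print(len(chosen), "sites drawn without replacement")
```

In field practice the hard part is building the list of all units in the first place, which is why a truly random sample is often difficult to obtain in geology.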

Results of analyzed samples plotted as a frequency curve are a pictorial representation of their distribution.

Distributions have characteristics such as midpoints and other measures which indicate the spread of the values and their symmetry. These are parameters if they describe a population and statistics if they refer to samples.

Descriptive measures
Four types of characteristics describe a data set pertaining to some numerical variable or phenomenon of interest: location, dispersion, relative standing, and shape.

In any analysis and/or interpretation of numerical data, a variety of descriptive measures representing the properties of location, variation, relative standing and shape may be used to extract and summarize the salient features of the data set.

Measures of location (or measures of central tendency)


Mean
Advantages of the mean: It is a measure that can be calculated and is unique. It is useful for performing statistical procedures such as comparing the means from several data sets.
Disadvantages of the mean: It is affected by extreme values that are not representative of the rest of the data.

Median
Advantage of the median over the mean: Extreme values in a data set do not affect the median as strongly as they do the mean.

Mode
The mode is the value that occurs most frequently in a data set. A data set may have several modes; in this case it is called a multimodal distribution.
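The three measures of location, and the sensitivity of the mean to an extreme value, can be illustrated with Python's standard `statistics` module. The numbers below are made up for the illustration and are not taken from any data set in this document.

```python
import statistics

values = [58.1, 59.5, 59.5, 61.2, 61.2, 62.9, 63.7]  # illustrative values only

mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)  # every value tied for most frequent

# Add one extreme value and see how far each measure moves
with_outlier = values + [120.0]
mean_shift = statistics.mean(with_outlier) - mean
median_shift = statistics.median(with_outlier) - median

print("mean:", round(mean, 3), "median:", median, "modes:", modes)
print("shift in mean:", round(mean_shift, 3), "shift in median:", median_shift)
```

Here the data set has two modes, and the single extreme value moves the mean by several units while leaving the median where it was.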

Measures of data variation (spread)


Range = max − min

Interquartile range: The interquartile range measures the spread of the middle 50 percent of an ordered data set (unaffected by outliers).

Median Absolute Deviation (MAD): The MAD is the median of the absolute deviation values from the median. That is, the deviations of the data values from the median are computed, then absolute (positive) values for these deviations are obtained, and the median of these positive values is taken.

Variance: The sample variance is an approximate average of the squared deviations from the sample mean. The average is approximate because we divide not by the sample size but by the sample size minus 1.

Standard Deviation: The sample standard deviation is the positive square root of the sample variance. It has the same unit as the variable. If the data are widely scattered about the mean, the variance and the standard deviation will be large.
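The measures of spread listed above can be computed as in the following sketch. Two choices here are assumptions rather than statements from the text: the quartiles use the default method of `statistics.quantiles` (statistical packages differ in how they interpolate quartiles), and the MAD is computed as the median of the absolute deviations from the median, which is the usual reading of the name. The data values are illustrative.

```python
import statistics

def spread_measures(data):
    """Return range, IQR, MAD, sample variance and sample standard deviation."""
    x = sorted(data)
    med = statistics.median(x)
    q1, _, q3 = statistics.quantiles(x, n=4)  # default quartile method (an assumption)
    abs_dev = [abs(v - med) for v in x]       # absolute deviations from the median
    return {
        "range": x[-1] - x[0],
        "iqr": q3 - q1,
        "mad": statistics.median(abs_dev),
        "variance": statistics.variance(x),   # divides by n - 1
        "stdev": statistics.stdev(x),         # positive square root of the variance
    }

values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]  # illustrative values only
m = spread_measures(values)
for name, value in m.items():
    print(name, round(value, 4))
```

For these seven values the squared deviations from the mean (6) sum to 60, so the sample variance is 60 / 6 = 10 rather than 60 / 7.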

Measures of relative standing


(The relative position of a particular observation in a data set.)

Percentiles are numerical values that divide an ordered data set into 100 groups of values with at most 1% of the data values in each group.

There are 99 percentiles in a data set.

The percentile corresponding to a given data value, say x, in a set is obtained by using the following formula:
Percentile = [(number of values below x + 0.5) / (number of values in the data set)] × 100%
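A minimal sketch of the percentile formula follows, assuming the 0.5 is added to the count of values below x before dividing by the number of values. The function name `percentile_of` and the data are hypothetical.

```python
def percentile_of(x, data):
    """Percentile of x: (values below x + 0.5) / n, expressed in percent."""
    below = sum(1 for v in data if v < x)
    return 100.0 * (below + 0.5) / len(data)

data = [12, 15, 17, 20, 22, 25, 28, 30, 33, 40]  # illustrative values only
p = percentile_of(22, data)
print(p)  # four values lie below 22, so (4 + 0.5) / 10 gives 45.0
```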

Shape
The fourth important numerical characteristic of a data set is its shape. In describing a numerical data set it is not only necessary to summarize the data by presenting appropriate measures of central tendency, dispersion and relative standing, it is also necessary to consider the shape of the data, that is, the manner in which the data are distributed. There are two measures of the shape of a data set: skewness and kurtosis. The direction of the skewness depends upon the location of the extreme values. Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution (normal distribution). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.
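Skewness and kurtosis can be sketched with simple moment-based formulas. The version below uses the plain (population) moments and reports kurtosis relative to the normal distribution by subtracting 3; statistical packages often apply small-sample corrections, so their values may differ slightly. The two small data sets are made up for the illustration.

```python
def skewness_kurtosis(data):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in data) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0             # 0 for a normal distribution
    return skew, excess_kurt

symmetric = [1, 2, 3, 4, 5]        # evenly spread, no tail
right_tailed = [1, 2, 3, 4, 20]    # one extreme high value

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_right, kurt_right = skewness_kurtosis(right_tailed)
print("symmetric:", skew_sym, kurt_sym)
print("right-tailed:", skew_right, kurt_right)
```

The symmetric set has zero skewness and negative kurtosis (relatively flat), while the extreme high value in the second set produces positive skewness.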

The application of classical statistics fundamentally assumes that data consist of independent (random) samples and have a normal distribution.
There are several uncontrollable factors that influence the values and variations of element contents in Earth materials that are sampled. Such factors include not only geogenic (e.g., metal scavenging by Fe-Mn oxides, lithology, etc.) and anthropogenic (i.e., man-induced) processes but also sampling and analytical procedures. Uni-element geochemical data sets invariably contain more than one population, each of which represents a unique process. Geogenic processes are spatially dependent on one another and explain the highest proportion of variations in uni-element contents in geochemical samples. Therefore geochemical data are not spatially independent. Thus, many uni-element geochemical data sets do not follow a normal distribution model.

Classical statistics should be applied cautiously in characterising empirical density distributions and mapping spatial distributions of uni-element geochemical data sets that do not follow a normal distribution model.

Robust Statistics
Robust statistics seeks to provide methods that emulate popular statistical methods, but which are not unduly affected by outliers or other small departures from model assumptions. In statistics, classical estimation methods rely heavily on assumptions which are often not met in practice.

The median is a robust measure of central tendency, while the mean is not. The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not.

EXPLORATORY DATA ANALYSIS


EDA is not a method but a philosophy of or an approach to robust data analysis (Tukey, 1977).

EDA consists of a collection of descriptive statistical and, mostly, graphical tools intended to:
(a) gain maximum insight into a data set,
(b) discover data structure,
(c) define significant variables in the data,
(d) determine outliers and anomalies,
(e) suggest and test hypotheses,
(f) develop prudent models, and
(g) identify best possible treatment and interpretation of data.

Classical statistical data analysis and probabilistic data analysis are confirmatory approaches to data analysis (being based on prior assumptions of data distribution models), whilst EDA, as its name indicates, is an exploratory approach to data analysis.

The goal of EDA is to recognise potentially explicable data patterns (Good, 1983). This is done through application of resistant and robust descriptive statistical and graphical tools that are qualitatively distinct from the classical statistical tools.

From a statistical point of view, a statistic is resistant and robust (a) if it is only slightly affected either by a small number of gross errors or by a high number of small errors (resistance) and (b) if it is only slightly affected by data outliers (robustness).
The descriptive statistical and graphical tools employed in EDA are based on the data itself but not on a data distribution model (e.g., normal distribution), yet they provide resistant definitions of univariate data statistics and outliers.

The emphasis in EDA is interaction between human cognition and computation in the form of statistical graphics that allow a user to perceive the behaviour and structure of the data. Among the several types of EDA graphical tools, the density trace, jittered one-dimensional scatterplot and boxplot are most commonly used in uni-element geochemical data analysis.

[Figure: SiO2. Panels: scatterplot, normal probability plot (percentage scale), density trace.]

Summary statistics for SiO2

Count                 177
Average               61.2825
Median                61.15
Geometric mean        61.2133
Variance              8.41586
Standard deviation    2.90101
MAD                   1.8
Minimum               47.61
Maximum               69.83
Range                 22.22
Lower quartile        59.5
Upper quartile        62.95
Interquartile range   3.45

Percentiles
1.0%     55.99
5.0%     57.02
10.0%    57.82
25.0%    59.5
50.0%    61.15
75.0%    62.95
90.0%    65.03
95.0%    66.46
99.0%    67.75

[Figure: Ni. Panels: scatterplot, normal probability plot (percentage scale), density trace.]

These data are from a single batch of samples submitted for chemical analysis. Clearly there is a shift in calibration of the analytical instrument.

BRUSHING used in place of correlation matrix / dendrogram (cluster analysis)


K-Rb-Ba Association for Potassium Feldspar
The rectangle depicts the 182 stream sediment composites (circles) on the toposheet. The 3D scatter plot (below left) shows the pattern of association between K (z axis), Rb (y axis) and Ba (x axis). The blue spheres mark approximately 50% of those composites which have higher values of K, Rb and Ba. These same composites are shown as orange circles on the 2D scatter plots (above left) and on the depiction of composites on the toposheet (above).

[Figure panels: three-dimensional correlation; two-dimensional correlation]

Fe-Mg-Ti-V association for mafic igneous rocks

The rectangles (above right) depict the 182 stream sediment composites (circles) on the toposheet. On the left are scatter plots. The composite samples having higher values of both elements in each scatter plot are shown in orange.

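Box Plot

Boxplot features, from the highest to the lowest data values:

Maximum
Upper outer fence (UOF)
Upper inner fence (UIF): X (UIF) = X (UH) + 1.5(IQR)
Upper whisker (UW)
Upper hinge (UH) (Q3)
Median (Q2)
Lower hinge (LH) (Q1)
Lower whisker (LW)
Lower inner fence (LIF)
Lower outer fence (LOF)
Minimum

The IQR is the distance between the lower and upper hinges. Outliers are mild if they lie between the inner and outer fences and far if they lie beyond the outer fences.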

Classification of uni-element geochemical data


Since geochemical data are spatially dependent, they do not follow a normal distribution model.
If a geochemical data set contains more than one population and does not follow a normal distribution model, then estimation of threshold as the mean ± 2SDEV can lead to spurious models of geochemical anomalies. Based on a boxplot, a uni-element geochemical data set can usually be divided into five robust classes: (1) minimum–LW; (2) LW–LH; (3) LH–UH; (4) UH–UW; and (5) UW–maximum. The UIF is usually considered the threshold separating background values and anomalies.
Data values in the UH–UW class (at most 25% of a data set) can be considered high background.
Data values in the LH–UH class (at most 50% of a data set) are background.
Data values in the LW–LH class (at most 25% of a data set) are low background.
Data values in the minimum–LW class are extremely low background.
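The boxplot-based classification can be sketched as follows. Several details are assumptions, not statements from the text: the hinges are approximated by quartiles from `statistics.quantiles` with the "inclusive" method, the lower inner fence mirrors the upper one (LH minus 1.5 times the IQR), and each whisker ends at the most extreme data value inside the inner fences. The function names and the data are hypothetical.

```python
import statistics

def boxplot_features(data):
    """Return hinges, inner fences and whiskers for a list of numbers."""
    x = sorted(data)
    # Hinges approximated by quartiles (the method choice is an assumption)
    lh, med, uh = statistics.quantiles(x, n=4, method="inclusive")
    iqr = uh - lh
    lif = lh - 1.5 * iqr                 # lower inner fence (assumed mirror of the UIF)
    uif = uh + 1.5 * iqr                 # upper inner fence: UH + 1.5(IQR)
    lw = min(v for v in x if v >= lif)   # lowest value inside the inner fences
    uw = max(v for v in x if v <= uif)   # highest value inside the inner fences
    return {"min": x[0], "LW": lw, "LH": lh, "median": med, "UH": uh,
            "UW": uw, "max": x[-1], "IQR": iqr, "LIF": lif, "UIF": uif}

def classify(value, f):
    """Assign a data value to one of the five boxplot classes."""
    if value < f["LW"]:
        return "minimum-LW: extremely low background"
    if value < f["LH"]:
        return "LW-LH: low background"
    if value <= f["UH"]:
        return "LH-UH: background"
    if value <= f["UW"]:
        return "UH-UW: high background"
    return "UW-maximum: above the UIF threshold"

# Illustrative values only, with one very low and one very high value
values = [2.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 60.0]
f = boxplot_features(values)
classes = {v: classify(v, f) for v in values}
for v, c in classes.items():
    print(v, "->", c)
```

Because the upper whisker ends at the largest data value that does not exceed the UIF, any data value in the UW-maximum class is also above the UIF threshold.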

[Figure: box-and-whisker plots of SiO2 %, Al2O3, Fe2O3 %, MgO %, CaO % and Na2O %, one panel per oxide, with one box per lithology group.]

Boxplots showing the relative differences in chemical composition of sediments derived from the different lithology groups

Geochemical statistics of oxides and elements according to lithology groups

SiO2

Lithology groups, in column order 1 to 8: Amphibolite NE Gneiss SW Granite NW Granite Schist W Migmatite Gneiss SE Dolomite N Amphibolite SE

Statistic     1        2        3        4        5        6        7        8
Count         7        10       8        10       10       12       11       13
Avg           61.75    61.20    61.41    59.33    59.50    61.01    61.58    59.75
Median        61.75    61.155   61.69    58.94    59.515   61.36    61.66    60.13
Geom. mean    61.74    61.19    61.38    59.31    59.48    60.98    61.54    59.72
MAD           1.04     0.63     1.39     1.14     1.23     1.6      1.5      1.96
Min           60.67    59.73    58.13    57.04    56.65    57.91    57.18    57.02
Max           63.09    63.63    63.66    62.42    62.6     63.4     63.82    62.98
Range         2.42     3.9      5.53     5.38     5.95     5.49     6.64     5.96
LH            60.68    60.44    60.18    57.82    57.95    59.19    60.16    58.17
UH            62.7     61.8     62.9     60.1     60.4     62.7     63.7     60.5
IQR           2.03     1.34     2.69     2.27     2.45     3.51     3.52     2.31
LIF           57.64    58.43    56.14    54.42    54.28    53.93    54.88    54.71
UIF           63.73    62.45    64.21    61.23    61.63    64.46    65.44    61.64
LW            60.67    59.73    58.13    57.04    56.65    57.91    57.18    57.02
UW            63.09    63.63    63.66    62.42    62.6     63.4     63.82    62.98

Class             1              2              3              4              5              6              7              8
Low Background    60.67-60.68    59.73-60.44    58.13-60.175   57.04-57.82    56.65-57.95    57.91-59.19    57.18-60.16    57.02-58.17
Background        60.68-62.71    60.44-61.78    60.175-62.865  57.82-60.09    57.95-60.4     59.19-62.7     60.16-63.68    58.17-60.48
High Background   62.71-63.09    61.78-63.63    62.865-63.66   60.09-62.42    60.4-62.6      62.7-63.4      63.68-63.82    60.48-62.98

MAD: MEDIAN ABSOLUTE DEVIATION
LH: LOWER HINGE
UH: UPPER HINGE
IQR: INTER QUARTILE RANGE
LIF: LOWER INNER FENCE
UIF: UPPER INNER FENCE
LW: LOWER WHISKER
UW: UPPER WHISKER
