The word statistik comes from the Italian word statista (meaning statesman). It was first used by Gottfried Achenwall (1719-1772), a professor at Marburg and Göttingen. Dr. E.A.W. Zimmermann introduced the word statistics to England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland 1791-1799. Long before the eighteenth century, however, people had been recording and using data.
Statistics
Statistics is, first and foremost, a collection of tools for converting raw data into information that helps decision makers in their work. Some consider statistics to be a mathematical body of science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.
Descriptive statistics: used for summarising or describing a collection of data.
Inferential statistics: patterns in the data are modelled in a way that accounts for randomness and uncertainty in the observations, and are then used for drawing inferences about the process or population being studied.
Results of analyzed samples plotted as a frequency curve are a pictorial representation of their distribution.
Distributions have characteristics such as midpoints and other measures which indicate the spread of the values and their symmetry. These are parameters if they describe a population and statistics if they refer to samples.
Descriptive measures
Four types of characteristics describe a data set pertaining to some numerical variable or phenomenon of interest: location, variation, relative standing and shape.
In any analysis or interpretation of numerical data, descriptive measures representing these four properties may be used to extract and summarize the salient features of the data set.
Median
Advantage of the median over the mean: extreme values in a data set do not affect the median as strongly as they do the mean.
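The effect can be illustrated with a small Python sketch. The values below are made up for illustration and are not taken from the data discussed in this document.

```python
from statistics import mean, median

# Hypothetical values; 120.0 plays the role of an extreme value.
values = [58.0, 59.0, 60.0, 61.0, 62.0]
with_extreme = values + [120.0]

print(mean(values), median(values))              # 60.0 60.0
print(mean(with_extreme), median(with_extreme))  # 70.0 60.5
```

One extreme value moves the mean by 10 units, while the median moves by only 0.5.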
Mode
A data set may have several modes. In this case its distribution is called multimodal.
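As a minimal sketch, Python's standard library can list every mode of a data set. The values are hypothetical.

```python
from statistics import multimode

# Hypothetical data in which 5 and 9 both occur twice.
values = [3, 5, 5, 7, 9, 9, 11]

modes = multimode(values)   # all most-frequent values, in order of first appearance
print(modes)                # [5, 9]
print(len(modes) > 1)       # True: more than one mode, so the data are multimodal
```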
Standard Deviation: The sample standard deviation is the positive square root of the sample variance, so it has the same unit as the variable. If the data are widely scattered about the mean, the variance and the standard deviation will be large.
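A short Python sketch of this relationship follows. The text gives no formula for the sample variance, so the usual n - 1 divisor is assumed here. The two data sets are hypothetical.

```python
from math import sqrt

def sample_variance(values):
    """Sample variance with the n - 1 divisor (an assumption; the text gives no formula)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def sample_std(values):
    """Positive square root of the sample variance; same unit as the data."""
    return sqrt(sample_variance(values))

tight = [59.0, 60.0, 61.0]   # values close to the mean
wide = [40.0, 60.0, 80.0]    # same mean, widely scattered

print(sample_variance(tight), sample_std(tight))  # 1.0 1.0
print(sample_variance(wide), sample_std(wide))    # 400.0 20.0
```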
Percentiles are numerical values that divide an ordered data set into 100 groups of values with at most 1% of the data values in each group.
The percentile corresponding to a given data value, say x, in a data set is obtained by using the following formula:
Percentile = [(number of values below x + 0.5) / (number of values in the data set)] × 100%
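The formula can be sketched in Python as follows. It is read as (number of values below x + 0.5) divided by the number of values in the data set, times 100. The function name and the data are hypothetical.

```python
def percentile_of(values, x):
    """Percentile of x: (count of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return 100.0 * (below + 0.5) / len(values)

# Hypothetical data set of 10 values.
scores = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]

# Six values lie below 12, so the result is (6 + 0.5) / 10 * 100.
print(percentile_of(scores, 12))  # 65.0
```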
Shape
The fourth important numerical characteristic of a data set is its shape. In describing a numerical data set it is not only necessary to summarize the data by presenting appropriate measures of central tendency, dispersion and relative standing; it is also necessary to consider the shape of the data, that is, the manner in which the data are distributed. There are two measures of the shape of a data set: skewness and kurtosis. The direction of the skewness depends upon the location of the extreme values: the distribution is positively (right) skewed when the extreme values lie at the high end and negatively (left) skewed when they lie at the low end. Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution (normal distribution). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.
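The text does not give formulas for the two shape measures. The sketch below assumes the common moment-based definitions: skewness is m3 / m2^1.5, and kurtosis is m4 / m2^2 minus 3, so that a normal distribution scores zero. Statistical packages often apply small-sample corrections, so their results may differ slightly. The data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment with divisor n."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** k for v in values) / n

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (assumed definition)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis relative to the normal: m4 / m2^2 - 3 (assumed definition)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]      # evenly spread, no tail
right_tailed = [1, 2, 3, 4, 20]  # one extreme value at the high end

print(skewness(symmetric))         # 0.0: symmetric
print(skewness(right_tailed) > 0)  # True: extreme value at the high end
print(excess_kurtosis(symmetric))  # about -1.3: flatter than the normal
```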
The application of classical statistics fundamentally assumes that data consist of independent (random) samples and have a normal distribution.
There are several uncontrollable factors that influence the values and variations of element contents in Earth materials that are sampled. Such factors include not only geogenic (e.g., metal scavenging by Fe-Mn oxides, lithology, etc.) and anthropogenic (i.e., man-induced) processes but also sampling and analytical procedures. Uni-element geochemical data sets invariably contain more than one population, each of which represents a unique process. Geogenic processes are spatially dependent on one another and explain the highest proportion of variations in uni-element contents in geochemical samples. Therefore geochemical data are not spatially independent, and many uni-element geochemical data sets do not follow a normal distribution model.
Classical statistics should be applied cautiously in characterising empirical density distributions and mapping spatial distributions of uni-element geochemical data sets that do not follow a normal distribution model.
Robust Statistics
Robust statistics seeks to provide methods that emulate popular statistical methods, but which are not unduly affected by outliers or other small departures from model assumptions. In statistics, classical estimation methods rely heavily on assumptions which are often not met in practice.
The median is a robust measure of central tendency, while the mean is not. The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not.
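A small Python comparison illustrates the difference. The data are hypothetical, with one value replaced by a gross error. The quartiles come from the default method of `statistics.quantiles`, and other quartile conventions can give slightly different interquartile ranges.

```python
from statistics import median, quantiles, stdev

def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])

def iqr(values):
    """Interquartile range using the default quantiles() method."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def value_range(values):
    return max(values) - min(values)

clean = [56, 57, 58, 59, 60, 61, 62, 63, 64]
dirty = clean[:-1] + [640]   # hypothetical gross error: 64 recorded as 640

for name, stat in [("median", median), ("MAD", mad), ("IQR", iqr),
                   ("std dev", stdev), ("range", value_range)]:
    print(f"{name:8s} clean={stat(clean):8.2f}  with outlier={stat(dirty):8.2f}")
```

In this example the median, MAD and interquartile range are identical for both data sets, whereas the standard deviation and the range change by well over an order of magnitude.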
Exploratory data analysis (EDA) consists of a collection of descriptive statistical and, mostly, graphical tools intended to:
(a) gain maximum insight into a data set, (b) discover data structure, (c) define significant variables in the data, (d) determine outliers and anomalies, (e) suggest and test hypotheses, (f) develop prudent models, and (g) identify best possible treatment and interpretation of data. Classical statistical data analysis and probabilistic data analysis are confirmatory approaches to data analysis (being based on prior assumptions of data distribution models), whilst EDA, as its name indicates, is an exploratory approach to data analysis.
The goal of EDA is to recognise potentially explicable data patterns (Good, 1983).
This is done through application of resistant and robust descriptive statistical and graphical tools that are qualitatively distinct from the classical statistical tools.
From a statistical point of view, a statistic is resistant and robust (a) if it is only slightly affected either by a small number of gross errors or by a high number of small errors (resistance) and (b) if it is only slightly affected by data outliers (robustness).
The descriptive statistical and graphical tools employed in EDA are based on the data itself but not on a data distribution model (e.g., normal distribution), yet they provide resistant definitions of univariate data statistics and outliers.
The emphasis in EDA is interaction between human cognition and computation in the form of statistical graphics that allow a user to perceive the behaviour and structure of the data. Among the several types of EDA graphical tools, the density trace, jittered one-dimensional scatterplot and boxplot are most commonly used in uni-element geochemical data analysis.
[Figure: scatterplot, normal probability plot (percentage scale) and density trace of SiO2]
Summary statistics for SiO2:
Count                  177
Average                61.2825
Median                 61.15
Geometric mean         61.2133
Variance               8.41586
Standard deviation     2.90101
MAD                    1.8
Minimum                47.61
Maximum                69.83
Range                  22.22
Lower quartile         59.5
Upper quartile         62.95
Interquartile range    3.45
Percentiles:
1.0%     55.99
5.0%     57.02
10.0%    57.82
25.0%    59.5
50.0%    61.15
75.0%    62.95
90.0%    65.03
95.0%    66.46
99.0%    67.75
[Figure: scatterplot, normal probability plot (percentage scale) and density trace of Ni]
These data are from a single batch of samples submitted for chemical analysis. Clearly there is a shift in calibration of the analytical instrument.
The rectangles (above right) depict the 182 stream sediment composites (circles) on the toposheet. On the left are scatter plots. The composite samples having higher values of both elements in each scatter plot are shown in orange.
Box Plot
Boxplot features
[Figure: annotated boxplot of data values. Labelled features: maximum; upper outer fence (UOF); outliers; median (Q2); interquartile range (IQR); lower whisker (LW); lower inner fence (LIF); mild outliers; lower outer fence (LOF); far outliers; minimum.]
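The boxplot features named above can be computed as in the following sketch. It assumes Tukey's usual convention, which this document does not state explicitly. Hinges are the medians of the lower and upper halves of the ordered data. Inner fences lie 1.5 × IQR beyond the hinges and outer fences 3 × IQR beyond them. Whiskers end at the most extreme data values inside the inner fences. The function names and the data are hypothetical.

```python
from statistics import median

def hinges(sorted_values):
    """Lower and upper hinge: medians of the lower and upper halves
    (the overall median belongs to both halves when n is odd)."""
    n = len(sorted_values)
    half = (n + 1) // 2
    return median(sorted_values[:half]), median(sorted_values[n - half:])

def boxplot_summary(values, inner=1.5, outer=3.0):
    """Boxplot features under the assumed Tukey convention."""
    data = sorted(values)
    lh, uh = hinges(data)
    iqr = uh - lh
    lif, uif = lh - inner * iqr, uh + inner * iqr   # inner fences
    lof, uof = lh - outer * iqr, uh + outer * iqr   # outer fences
    inside = [v for v in data if lif <= v <= uif]
    return {
        "median": median(data), "LH": lh, "UH": uh, "IQR": iqr,
        "LIF": lif, "UIF": uif, "LOF": lof, "UOF": uof,
        "LW": inside[0], "UW": inside[-1],          # whisker ends
        "mild": [v for v in data if lof <= v < lif or uif < v <= uof],
        "far": [v for v in data if v < lof or v > uof],
    }

# Hypothetical data with one mild and one far outlier at the high end.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40])
print(summary["LH"], summary["UH"], summary["IQR"])  # 3.5 8.5 5.0
print(summary["UIF"], summary["UOF"])                # 16.0 23.5
print(summary["LW"], summary["UW"])                  # 1 9
print(summary["mild"], summary["far"])               # [20] [40]
```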
[Figure: box-and-whisker plots of SiO2 %, Al2O3, Fe2O3 %, MgO %, CaO % and Na2O % by lithology group. Groups: Amphibolite NE Gneiss SW Granite NW Granite Schist W Migmatite Gneiss SE Dolomite N Amphibolite SE]
Boxplots showing the relative differences in chemical composition of sediments derived from the different lithology groups.
SiO2 by lithology group
Lithology groups (columns, in order): Amphibolite NE Gneiss SW Granite NW Granite Schist W Migmatite Gneiss SE Dolomite N Amphibolite SE

Statistic     1       2       3       4       5       6       7       8
Count         7       10      8       10      10      12      11      13
Avg           61.75   61.20   61.41   59.33   59.50   61.01   61.58   59.75
Median        61.75   61.155  61.69   58.94   59.515  61.36   61.66   60.13
Geom. mean    61.74   61.19   61.38   59.31   59.48   60.98   61.54   59.72
MAD           1.04    0.63    1.39    1.14    1.23    1.6     1.5     1.96
Min           60.67   59.73   58.13   57.04   56.65   57.91   57.18   57.02
Max           63.09   63.63   63.66   62.42   62.6    63.4    63.82   62.98
Range         2.42    3.9     5.53    5.38    5.95    5.49    6.64    5.96
LH            60.68   60.44   60.18   57.82   57.95   59.19   60.16   58.17
UH            62.7    61.8    62.9    60.1    60.4    62.7    63.7    60.5
IQR           2.03    1.34    2.69    2.27    2.45    3.51    3.52    2.31
LIF           57.64   58.43   56.14   54.42   54.28   53.93   54.88   54.71
UIF           63.73   62.45   64.21   61.23   61.63   64.46   65.44   61.64
LW            60.67   59.73   58.13   57.04   56.65   57.91   57.18   57.02
UW            63.09   63.63   63.66   62.42   62.6    63.4    63.82   62.98

Class             1             2             3               4             5             6             7             8
Low Background    60.67-60.68   59.73-60.44   58.13-60.175    57.04-57.82   56.65-57.95   57.91-59.19   57.18-60.16   57.02-58.17
Background        60.68-62.71   60.44-61.78   60.175-62.865   57.82-60.09   57.95-60.4    59.19-62.7    60.16-63.68   58.17-60.48
High Background   62.71-63.09   61.78-63.63   62.865-63.66    60.09-62.42   60.4-62.6     62.7-63.4     63.68-63.82   60.48-62.98

MAD: median absolute deviation; LH: lower hinge; UH: upper hinge; IQR: interquartile range; LIF/UIF: lower/upper inner fence; LW/UW: lower/upper whisker
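In the SiO2 table, the low background class runs from the lower whisker (LW) to the lower hinge (LH), the background class from LH to the upper hinge (UH), and the high background class from UH to the upper whisker (UW). A minimal Python sketch of that classification follows. The function name, the treatment of values that fall exactly on a boundary, and the labels for values beyond the whiskers are assumptions made for illustration.

```python
def classify(value, lw, lh, uh, uw):
    """Assign a value to a background class from boxplot boundaries.

    Boundary handling and the two 'outlier' labels are assumptions;
    the table only gives the three background ranges.
    """
    if value < lw:
        return "low outlier"
    if value < lh:
        return "low background"
    if value <= uh:
        return "background"
    if value <= uw:
        return "high background"
    return "high outlier"

# Boundaries of the first lithology group in the SiO2 table:
# low background 60.67-60.68, background 60.68-62.71, high background 62.71-63.09.
bounds = dict(lw=60.67, lh=60.68, uh=62.71, uw=63.09)

print(classify(61.75, **bounds))  # background
print(classify(63.00, **bounds))  # high background
print(classify(64.50, **bounds))  # high outlier
```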