Data-Driven Modelling: PART 1
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl
Notion of data-driven modelling (DDM)
Sources of technologies for DDM: machine learning, data mining, soft computing
Introduction to methods: decision, regression and model trees, Bayesian approaches, neural networks, fuzzy systems, chaotic systems
Demonstration of applications:
reservoir optimization
rainfall-runoff modelling
real-time control of water levels in polder areas
surge water level prediction in the North Sea
interpretation of aerial photos

Measuring campaigns using automatic computerised equipment: a lot of data became available
Important breakthroughs in computational intelligence and machine learning methods
Penetration of computer sciences into civil engineering (e.g., hydroinformatics, geo-informatics, etc.)
[Diagram: the real world is observed and modelled through "fact engines" (physically-based models and data-driven models) and "judgement engines" (decision support systems for management), linked by communications.]
D.P. Solomatine. Data-driven modelling (part 1). 4
Modelling
A model is:
a simplified description of reality
an encapsulation of knowledge about a particular physical or social process in electronic form
use the results of modelling for making decisions (change the future)
Classification of models
specific - general
model estimation - first-principles models
numerical - analytical
stochastic - deterministic
microscopic - macroscopic
discrete - continuous
qualitative - quantitative
The independent variable X (input) and the dependent variable Y (output). Linear regression Y = a1 X + a2 roughly describes the observed relationship. The parameters a1 and a2 are unknown; they are found by feeding the model with data and solving an optimization problem (training). The model then predicts the output for new inputs without actual knowledge of what drives Y.
[Figure: measured data, a table of instances (rows) with attribute values x1 ... xK and an output y per instance, is used to build the data-driven model; for a new input value X the model produces the predicted output Y, which is compared with the actual (observed) output Y.]
A DDM learns the target function Y = f(X) describing how the real system behaves. Learning = the process of minimizing the difference between the observed data and the model output. X and Y may be non-numeric. After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
"Data-driven" model is defined as a model connecting the system state variables (input, internal and output) without much knowledge about the "physical" behaviour of the system
examples: regression model linking input and output
P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume
Precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t), runoff Q(t)
[Figure: catchment with rainfall Rt, upstream flow Qtup and flow Qt.]
Inputs: lagged rainfalls Rt, Rt-1, ..., Rt-L and runoffs (flows) Qt, ...
Output to predict: Qt+T
Questions:
how to find the appropriate lags? (the lags embody the physical properties of the catchment)
how to build the non-linear regression function F?
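Assembling such lagged inputs is mechanical once the lags are chosen. A minimal sketch (the function name and the lag/lead-time choices are illustrative, not from the lecture):

```python
import numpy as np

def make_lagged_inputs(R, Q, n_lags=3, lead=1):
    """Build rows [Rt, Rt-1, ..., Rt-n_lags, Qt] with target Qt+lead."""
    X, y = [], []
    for t in range(n_lags, len(Q) - lead):
        X.append([R[t - k] for k in range(n_lags + 1)] + [Q[t]])
        y.append(Q[t + lead])
    return np.array(X), np.array(y)
```

A non-linear regression function F (a model tree, a neural network, etc.) is then trained on the resulting pairs (X, y).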
1. Select a clearly defined problem that the model will help to resolve.
2. Specify the required solution of the problem.
3. Define how the delivered solution is going to be used in practice.
4. Learn the problem, collect the domain knowledge, understand it.
5. Let the problem drive the modelling, including the tool selection, data preparation, etc. That is, take the best tool for the job, not just a job you can do with the available tool.
6. Clearly define the assumptions (do not just assume, but discuss them with the domain knowledge experts).
...
7. Refine the model iteratively (try different things until the model seems as good as it is going to get).
8. Make the model as simple as possible, but no simpler. This is also formulated as:
KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
the Minimum Description Length principle: the best model is the one with the shortest description
the Occam's Razor principle, formulated by William of Occam in 1320 as: shave off all the unneeded philosophy from the explanation
9. Identify instability in the model (critical areas where small changes in inputs lead to large changes in output).
10. Identify uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).
Statistics
Machine learning
Soft computing (fuzzy systems)
Computational intelligence
Artificial neural networks
Data mining
Non-linear dynamics (chaos theory)
A DDM tries to learn the target function Y = f(X) describing how the real system behaves. Learning = the process of minimizing the difference between the observed data and the model output. X and Y may be non-numeric. After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
minimizing the error during model calibration (cross-validation)
minimizing the error during model operation (or on the unseen test set)
Ideally, we should aim at minimizing the cross-validation error, since this gives hope that the error on the test set will also be small. In practice, the training process uses the training set, and the cross-validation set is used to periodically check the model error and to stop training.
Consider a model being progressively made more accurate (and complex) during training:
the Green (linear) model is simple, but it is not accurate enough
the Blue model is the most accurate during training, but is it the best?
the Red model is less accurate than the Blue one, but it captures the trend in the data; it will generalise well to a new input (e.g. rainfall) value
Question: which model is better during training: green, red or blue? How do we determine when to stop improving the model?
[Figure: actual (e.g., flow) output values Y plotted against input X, with the three candidate models and the predicted output for a new input value.]
Training: iterative refinement of the model.
with each iteration the model parameters are changed to reduce the error on the training set
the error on the training set gets lower and lower
the error on the test set gets lower, then starts to increase
this may lead to overfitting and a high error on the cross-validation set
the moment the cross-validation error starts to increase is the moment to stop training
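This stopping rule can be sketched with model complexity standing in for training iterations (the synthetic data and the polynomial-degree "complexity knob" below are illustrative assumptions, not the lecture's own experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 60)

x_tr, y_tr = x[:40], y[:40]   # training set: used to fit the model
x_cv, y_cv = x[40:], y[40:]   # cross-validation set: used only to decide when to stop

cv_errors = []
for degree in range(1, 13):   # progressively more complex models
    coef = np.polyfit(x_tr, y_tr, degree)
    cv_errors.append(np.mean((np.polyval(coef, x_cv) - y_cv) ** 2))

best_degree = 1 + int(np.argmin(cv_errors))   # stop where the CV error is lowest
```

The training error keeps falling as the degree grows, but the cross-validation error eventually rises again; the minimum marks the moment to stop.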
Data
[Table: measured data as instances (rows), each described by attribute values x1 ... xK and an associated output value y.]
Class (category, label)
Ordinal (order)
Numeric (real-valued)
etc. (considered later)
(Time) series data: numerical data whose values have an index variable associated with them.
Classification
on the basis of classified examples, a way of classifying unseen examples is to be found
Association
association between features (which combinations of values are most frequent) is to be identified
Clustering
groups of objects (examples) that are "close" are to be identified
Numeric prediction
the outcome is not a class but a numeric (real) value; this is often called regression
Hypotheses
there are various possible functions Y = f(X) (hypotheses) relating input and output
machine learning is a search through this hypothesis space for the one that fits the observed data and the prior knowledge of the learner
Concepts: the thing to be learned on the basis of available data. For example:
children learn how to read and write, and what is sweet and salty
conditions that lead to a flood
combinations of particular algae indicating poor water quality
Concepts
Concept - the thing to be learned on the basis of available data. For example:
children learn how to read and write, and what is sweet and salty
conditions that lead to a flood
combinations of particular algae indicating poor water quality
Often a concept is a Boolean-valued function (Yes/No).
Concept learning = inferring (building) a Boolean-valued function from training examples of its input and output.
[Figure: C = the target (real) concept of the "+" class; examples of class "+" lie inside C, examples of class "-" outside.]
C' = the concept of class "+" induced (learned) from the data: a hypothesis. Here it is fully consistent with the data (all +, no -).
Concepts as sets: U is the set of all objects; a concept C is a subset of U (C ⊆ U).
Learning C: for every X in U, to recognize whether X belongs to C or not.
[Figure: the induced concept C' overlaps the true concept C; examples in C' - C are false "+", examples in C - C' are false "-".]
Errors (incorrect classifications): (C - C') and (C' - C)
Accuracy of the induced concept C' = the proportion of correct classifications: |U - (C - C') - (C' - C)| / |U|
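The set-based accuracy definition translates directly into code. A toy sketch (the universe and the two concepts are invented for illustration):

```python
# Accuracy = |U - (C - C') - (C' - C)| / |U|, with concepts as Python sets.
U = set(range(20))                    # universe: all objects
C = {x for x in U if x < 10}          # true concept: the "+" examples
C_prime = {x for x in U if x < 12}    # induced concept C', slightly too wide

false_neg = C - C_prime               # "+" examples missed by C'
false_pos = C_prime - C               # objects wrongly classified as "+"
accuracy = len(U - false_neg - false_pos) / len(U)
```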
Instances (examples)
Instances = examples of input data. Instances that can be stored in a simple rectangular table (only these will be mainly considered):
individual unrelated customers described by a set of attributes records of rainfall, runoff, water level taken every hour
Instances that cannot be stored in a table, but require more complex structures: instances of pairs that are sisters, taken from a family tree
related tables in complex databases describing staff, their ownership, involvement in projects, borrowing of computers, etc.
Ordinal - categories that can be ordered (ranked):
temperature expressed as cool, mild, hot
water level expressed as low, medium, high
Interval - ordered and expressed in fixed, equal units. Examples:
dates (which cannot, however, be multiplied)
temperature expressed in degrees Celsius
prepare the data - this may include complex procedures of restoring missing data, data transformation, etc.;
survey the data - understand the nature of the data, get insight into the problem this data describes; this includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information of inputs to outputs, etc. (this step is often merged with the previous one);
build the model
training data set - raw data presented in the form necessary to train the DDM;
cross-validation data set - needed to detect overtraining;
testing (or validation) data set - needed to validate (test) the model's predictive performance;
algorithms and software to perform pre-processing (e.g., normalization);
algorithms and software to perform post-processing (e.g., denormalization).
Finding relationships between attributes (e.g., correlation, average mutual information - AMI)
Discretizing numeric attributes into {low, medium, high}
Data reduction (principal component analysis - PCA)
What to do with outliers? How to reconstruct missing values?
An estimator is a device (algorithm) used to make a justifiable guess about the value of some particular variable, that is, to produce an estimate.
An unbiased estimator is a method of guessing that does not change important characteristics of the data set when the estimates are included with the existing values.
Example: data set 1 2 3 x 5
Estimators:
2.750, if the mean is to be unbiased;
4.659, if the standard deviation is to be unbiased;
4.000, if the step-wise change in the variable value (the trend) is to be unbiased (that is, linear interpolation is used: xi = (xi+1 + xi-1) / 2)
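The first and third estimates are easy to verify in code (the standard-deviation case requires solving a small equation and is omitted here):

```python
import statistics

known = [1, 2, 3, 5]          # the data set 1 2 3 x 5, with x missing

# An estimate that leaves the mean unchanged: the mean of the known values.
x_mean = statistics.mean(known)

# An estimate that preserves the local trend: linear interpolation
# between the neighbours x3 = 3 and x5 = 5.
x_interp = (3 + 5) / 2
```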
Examples:
in a harbour, sedimentation is measured once every two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
in a catchment, rainfall data was collected manually at three gauging stations once a day for 20 years; 3 years ago measurements also started at 4 new automatic stations, with hourly frequency
Solutions:
filling in missing data
introducing an artificial resolution equal to the maximum resolution for all variables
General form:
x'i = a xi + b
to keep data positive: x'i = xi - min(x1...xn) + SmallConst
squashing data into the range [0, 1]:
x'i = (xi - min(x1...xn)) / (max(x1...xn) - min(x1...xn))
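A sketch of the [0, 1] squashing (the function name is illustrative):

```python
import numpy as np

def minmax_scale(x):
    """Squash data into [0, 1]: x' = (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```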
Non-linear transformations
Logistic function:
L(x) = 1 / (1 + e^(-x))
[Figure: S-shaped logistic curve mapping any input value to an output value between 0 and 1.]
Softmax function (linear scaling step):
x'i = (xi - E(x)) / (λ σx / 2π)
where E(x) is the mean value of variable x; σx is the standard deviation of x; λ is the size of the linear response region, measured in standard deviations (for example, ±1σ on either side of the central point of the distribution covers 68% of the total range of x, ±2σ covers 95.5%, ±3σ covers 99.7%); π ≈ 3.14
First step: x'i is obtained from the original value xi using a user-selected parameter (the equation itself was lost in extraction).
The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:
x''i = (x'i - E(x')) / σx'
where x'i is the value after the first transform, x''i is the (final) standardized value, E(x') is the mean value of variable x', and σx' is the standard deviation of x'.
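The standardizing step, sketched (the function name is illustrative):

```python
import numpy as np

def standardize(x):
    """Balance a distribution: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```

The result has zero mean and unit standard deviation.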
[Figure: transformed discharge plotted against time [hrs].]
Transforming the distributions can be dangerous: such transformations may change the nature of the data and the relationships between variables:
a) original data: two clusters (two samples) are visible
b) normalized data: the clusters cannot be identified
Smooth data?
simple and weighted moving averages
Savitzky-Golay filter: builds a local polynomial regression (of degree k) on a series of values
other filters (Gaussian, Fourier, etc.)
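The simplest of these smoothers, sketched (the window length is an arbitrary choice):

```python
import numpy as np

def moving_average(x, window=3):
    """Simple (unweighted) moving-average smoother."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")
```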
The Fourier transform can be used to smooth data: keep only the low-frequency harmonics of the original signal (time series).
Correlation coefficient R:
R = Σi (xi - x̄)(yi - ȳ) / sqrt( Σi (xi - x̄)² · Σi (yi - ȳ)² )
Average mutual information (AMI). It represents the measure of information that can be learned about one set of data given knowledge of another set of data:
I(X;Y) = Σ_{x∈X} Σ_{y∈Y} P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]
where P(x,y) is the joint probability of realisation x of X and y of Y, and P(x) and P(y) are the individual probabilities of these realisations. If X is completely independent of Y, then the AMI I(X;Y) is zero.
AMI can be used to identify the optimal time lag for a data-driven rainfall-runoff model.
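A histogram-based (plug-in) estimate of the AMI formula above, sketched; the bin count and the names are arbitrary choices:

```python
import numpy as np

def average_mutual_information(x, y, bins=8):
    """Estimate I(X;Y) = sum P(x,y) log2( P(x,y) / (P(x)P(y)) ) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)    # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal P(y)
    nz = pxy > 0                           # convention: 0 * log 0 = 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Computed for Qt+1 against Rt-L over a range of lags L, the lag with the maximum AMI would be selected.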
Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L for which the relatedness is strongest? AMI can be used to identify this optimal time lag.
[Figure: AMI between Qt+1 and the past lagged rainfalls Rt-L plotted against the lag L, with a zoomed-in hydrograph of Q [m3/s] and R [mm]; the lag with the maximum AMI is the optimal lag.]
Introducing classification:
main ideas
[Figure: two scatter plots in the input space X: (a) linearly separable examples; (b) a more difficult example, where a linear function would misclassify several examples, so a non-linear function (or a transformation of the space) is needed.]
Decision tree: example with 2 numeric input variables, 2 output classes: 0 and 1
[Figure: examples scattered in the (X1, X2) plane, partitioned into class-0 and class-1 regions, and the corresponding decision tree with conditions such as x2 > 2, x1 > 2.5, x1 < 4, x2 < 3.5 and x2 < 1 at the nodes and class labels at the leaves.]
Several ways to represent knowledge about data set (1) Classification rules
Classification rules predict the classification of examples, in terms of whether to play or not. E.g.:
if (Outlook=sunny) and (Humidity=high) then Play=No
if (Outlook=rainy) and (Windy=strong) then Play=No
if (Outlook=overcast) then Play=Yes
if (Humidity=normal) then Play=Yes
if (none of the above) then Play=Yes
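Read in order, the five rules above become a chain of checks. A sketch (the function and value names are illustrative):

```python
def play(outlook, humidity, windy):
    """Apply the five classification rules above, in order."""
    if outlook == "sunny" and humidity == "high":
        return "No"
    if outlook == "rainy" and windy == "strong":
        return "No"
    if outlook == "overcast":
        return "Yes"
    if humidity == "normal":
        return "Yes"
    return "Yes"   # none of the above
```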
Several ways to represent knowledge about data set (2) Decision trees
Decision trees - treelike structure representing classification rules
Several ways to represent knowledge about data set (3) Association rules
Association rules - they associate different attribute values. E.g.:
if (Temperature=cool) then Humidity=normal
if (Humidity=normal) and (Windy=weak) then Play=Yes
if (Outlook=sunny) and (Play=No) then Humidity=high
if (Windy=false) and (Play=No) then (Outlook=sunny) and (Humidity=high)
In total there are around 60 such rules that are 100% correct. These rules can predict any of the attributes, not just the Play attribute.
Classification:
decision trees and ID3 and C4.5 algorithms
Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no
Objective of classification
How to construct such tree?: algorithm ID3 (Quinlan 86), extended later to C4.5 and C5
classification error for this verification data set of 3 examples is 1/3 = 33%.
Model in operation: given a new instance (e.g. Outlook = rainy, Temp = cool, ...), the tree predicts Play = ?
1. A := the 'best' decision (split) attribute for the next node ('best' = giving the maximum information gain)
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
But a question remains: how to identify the best split attribute, i.e. how to compute the information gain? Let's consider the Iterative Dichotomizer 3 (ID3) algorithm.
E(S) = - Σ_{i=1..c} pi log2 pi
the logarithm is base 2 because entropy is the expected encoding length measured in bits
the maximum value of the entropy is log2 c; for example, if c = 8 and p1 = ... = p8 = 0.125, then E(S) = 3 (3 bits are needed to send a message about the class number)
if all examples belong to one class, E = 0 - this is what we are aiming at
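The entropy formula in code (a direct transcription; the 0 log 0 = 0 convention is handled by skipping zero probabilities):

```python
import math

def entropy(probs):
    """E(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```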
Entropy and Information Gain: the essence of the Iterative Dichotomizer 3 (ID3) algorithm for building decision trees
S = (14 instances: 9+, 5-)
Entropy E([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
we want to reduce the total entropy by splitting the set S into subsets with lower entropy (i.e. with a higher share of examples of the same class)
lower entropy = information gain
if the split is made on the basis of attribute A: let Values(A) be the set of all possible values for attribute A
e.g.: Values(Humidity) = {normal, high}; S_Humidity=normal = {7 examples}, |S_Humidity=normal| = 7
then the information gain (the expected reduction in entropy caused by knowing the value of attribute A) is:
Gain(S, A) = E(S) - Σ_{v∈Values(A)} (|Sv| / |S|) E(Sv)
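Gain(S, A) in code, checked against the weather example (for Humidity the classic value is about 0.151; the dictionary-based data layout is an assumption):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(rows, attribute, target):
    """Gain(S, A) = E(S) - sum over values v of A of |Sv|/|S| * E(Sv)."""
    def class_entropy(subset):
        labels = [r[target] for r in subset]
        return entropy([labels.count(c) / len(labels) for c in set(labels)])

    gain = class_entropy(rows)
    for v in {r[attribute] for r in rows}:
        sv = [r for r in rows if r[attribute] == v]
        gain -= len(sv) / len(rows) * class_entropy(sv)
    return gain
```

ID3 computes this gain for every candidate attribute and splits on the one with the largest value.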
[Example of splitting on a numeric attribute: sorted values 72 (Yes), 80 (Yes), 90 (No); a candidate split point lies between 80 and 90, where the class changes.]
Frequent situation: accuracy during learning increases, but on the test examples drops
Example of overfitting:
consider adding a new, 15th, instance [sunny, hot, normal, strong, NO]
this example is noisy: it could simply be wrong
the new tree, built by the ID3 algorithm in the Weka software, has to take this (wrong) example into account (note that a check on Temperature is added):
outlook = sunny
| temperature = hot: no
| temperature = mild
| | humidity = high: no
| | humidity = normal: yes
| temperature = cool: yes
outlook = overcast: yes
outlook = rainy
| windy = strong: no
| windy = weak: yes
The original tree built on 14 examples:
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = strong: no (2.0)
| windy = weak: yes (3.0)
The target function is discrete-valued (i.e. it is a class). Possibly noisy training data and missing values are allowed.
Classification rules
Several rules are often connected with the OR operator (ORed)
Rules can be read off a decision tree, but such rules are far more complex than necessary; well-constructed rules are often more compact than trees
Problems: should rules be interpreted in order (as a decision list) or individually? What to do if different rules lead to different conclusions for the same instance? etc.
Case Study Woudse: using decision trees to replicate pumping strategy for Woudse water system (Delfland, NL)
Woudse Case Study: water levels and pumping (full data set)
[Figure: water level (WL) and pump discharge over ~2000 observations, full data set.]
[Figure: zoomed fragment (around observation 1150) of the water level (WL) and pump discharge series.]
Input: water level(t) and pump discharge(t-1); output: pump discharge(t)
The pumping station has two pumps, each with a capacity of 0.133 m3/s, so the possible total pump discharge is 0, 0.133 or 0.266 m3/s
Pump discharge is therefore described as a category variable, expressed as 0, 1 or 2
If the water level goes up, pump(s) should be switched on to reduce the water level; the target water level is -4.6 m
At each time step we have to determine the pump discharge (0, 1 or 2) based on the two inputs: water level(t) and pump discharge(t-1)
Woudse case study: resulting decision tree solving the classification problem (trained on 5000 instances)
Pumpt = f(WLt, Pumpt-1), Pump = {0, 1, 2}
PumpT-1 = 0
| WLt <= -4.577: 0 (1084.0/1.0)
| WLt > -4.577
| | WLt <= -4.57: 0 (417.0/111.0)
| | WLt > -4.57
| | | WLt <= -4.551: 1 (125.0)
| | | WLt > -4.551: 2 (12.0)
PumpT-1 = 1
| WLt <= -4.595
| | WLt <= -4.601: 0 (129.0)
| | WLt > -4.601: 1 (189.0/64.0)
| WLt > -4.595
| | WLt <= -4.55: 1 (680.0/3.0)
| | WLt > -4.55: 2 (41.0)
PumpT-1 = 2
| WLt <= -4.593
| | WLt <= -4.6: 0 (37.0)
| | WLt > -4.6: 2 (39.0/16.0)
| WLt > -4.593: 2 (2247.0/1.0)
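In operation the induced tree is just nested comparisons on the two inputs. A sketch with the thresholds copied from the tree (the function name is illustrative; the instance counts in parentheses are dropped, and the two adjacent 0-leaves of the first branch are merged):

```python
def pump_setting(wl_t, pump_prev):
    """Number of pumps (0, 1 or 2) from water level WLt and the previous setting."""
    if pump_prev == 0:
        if wl_t <= -4.57:
            return 0
        return 1 if wl_t <= -4.551 else 2
    if pump_prev == 1:
        if wl_t <= -4.595:
            return 0 if wl_t <= -4.601 else 1
        return 1 if wl_t <= -4.55 else 2
    # pump_prev == 2
    if wl_t <= -4.593:
        return 0 if wl_t <= -4.6 else 2
    return 2
```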
Woudse case study: verification result (full data set, 3759 instances)
[Figure: verification on the full data set: water level and the pump discharge classified by the tree.]
Woudse case study: verification result (fragment with the 100% correct classification)
[Figure: fragment (observations ~950-1000) in which the classification is 100% correct: water level (m) and pump discharge.]
Woudse case study: verification result (fragment with some errors present)
[Figure: fragment (observations ~600-650) with some classification errors: water level (m) and pump discharge.]
Classification: conclusions
there is a wide choice of methods
classification methods are mainly applied in pattern recognition problems
engineering numerical problems can sometimes be posed as classification problems; using classification methods (decision trees) often leads to simpler models and requires less accurate data
Linear regression
Y = a1 X + a2
[Figure: measured points (x(t), y(t)) scattered around the fitted regression line; for a new input value x(v) the line gives the predicted output.]
Given measured (training) data: T vectors {x(t), y(t)}, t = 1..T. The unknown a1 and a2 are found by solving an optimization problem:
E = Σ_{t=1..T} ( y(t) - (a1 x(t) + a2) )² → min
Then, for V new vectors {x(v)}, v = 1..V, the equation can approximately reproduce the corresponding function values {y(v)}, v = 1..V.
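The optimization above has a closed-form least-squares solution; numpy's polyfit solves it. A sketch on synthetic, noise-free data (all numbers are illustrative):

```python
import numpy as np

# Fit Y = a1*X + a2 by minimizing E = sum_t ( y(t) - (a1*x(t) + a2) )^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0             # synthetic data generated with a1 = 2, a2 = 1

a1, a2 = np.polyfit(x, y, 1)  # least-squares estimates of the two parameters
y_new = a1 * 5.0 + a2         # prediction for a new input value x(v) = 5
```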
The input space X1×X2 is split into regions; a separate regression model can be built for each of the regions.
Tree structure where the nodes are splitting conditions and the leaves are (local) regression models.
[Figure: the (X1, X2) plane partitioned into regions served by Models 1-6, and the corresponding model tree with conditions such as x1 > 2.5, x1 < 4, x2 < 3.5 and x2 < 1 at the nodes and Models 1-6 at the leaves.]
How to select an attribute for a split in regression trees and M5 model trees:
regression trees: the same idea as in decision trees (information gain)
main idea: choose the attribute that splits the portion T of the training data that reaches a particular node into subsets T1, T2, ...
use the standard deviation sd(T) of the output values in T as the measure of error at that node (in decision trees, the entropy E(T) was used)
the split should result in subsets Ti with low standard deviation sd(Ti)
so the model trees splitting criterion is SDR (standard deviation reduction), which has to be maximized:
SDR = sd(T) - Σi (|Ti| / |T|) sd(Ti)
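The SDR criterion in code (a direct transcription; pstdev is the population standard deviation):

```python
import statistics

def sdr(parent, subsets):
    """SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)) for a candidate split of T."""
    n = len(parent)
    return statistics.pstdev(parent) - sum(
        len(s) / n * statistics.pstdev(s) for s in subsets
    )
```

A perfect split of [1, 1, 1, 9, 9, 9] into its two constant halves removes all spread, so the SDR equals sd(T) itself; a split that changes nothing gives SDR = 0.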
smoothing: a smoothing process is used to compensate for the sharp discontinuities between adjacent linear models
pruning (size reduction): needed when a large tree overfits the data; a subtree is replaced by one linear model
Trees and rules: the influence of soil habitat on Collembola apterigota (an insect)
Influence of soil habitat features on the abundance of Collembola apterigota (Kampichler, Dzeroski): regression and model trees
inputs: field type, microbial respiration, microbial biomass, soil moisture, alkalinity (pH), carbon, nitrogen, median particle size
outputs: total number of collembolan individuals (abundance), total number of collembolan species (biodiversity), number of individuals of Folsomia quadrioculata (a particular type of Collembola)
methods compared: linear regression (highest error), regression and model trees, neural networks (least error)
Model structure:
Variables considered:
daily discharges (QX, QC)
daily rainfall at 17 stations
daily evaporation at 3 stations
Data (1976-1996): training 1976-89 (14 years); cross-validation & testing 1990-96
[Figure: observed discharge (m3/s), June-August 1996, compared with the FS-M5 and FS-ANN model predictions.]
M5 model trees and ANNs in rainfall-runoff modelling: predicting flow three hours ahead (Sieve catchment)
Inputs: REt, REt-1, REt-2, REt-3, Qt, Qt-1 (rainfall for 3 past hours, runoff for 2)
ANN verification: RMSE = 11.353, NRMSE = 0.234, COE = 0.9452
MT verification: RMSE = 12.548, NRMSE = 0.258, COE = 0.9331
The model: [figure of the resulting model and a hydrograph fragment (hours 100-180).]
Transparency of trees: model trees are easy to understand (even by managers)
An M5 model tree is a mixture of locally accurate models
Pruning (reducing size) allows:
to prevent overfitting
to generate a family of models of various accuracy and complexity
End of part 1