Data-Driven Modelling: PART 1
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl
Notion of data-driven modelling (DDM)
Sources of technologies for DDM: machine learning, data mining, soft computing
Introduction to methods: decision, regression and model trees, Bayesian approaches, neural networks, fuzzy systems, chaotic systems
Demonstration of applications:
reservoir optimization
rainfall-runoff modelling
real-time control of water levels in polder areas
surge water level prediction in the North Sea
interpretation of aerial photos

Measuring campaigns using automatic computerised equipment: a lot of data became available
Important breakthroughs in computational intelligence and machine learning methods
Penetration of computer sciences into civil engineering (e.g., hydroinformatics, geo-informatics, etc.)
[Diagram: the real world is observed and modelled through "fact engines" (physically-based models and data-driven models) and "judgement engines" (decision support systems for management), linked by communications.]
D.P. Solomatine. Data-driven modelling (part 1). 4
Modelling
A model is:
a simplified description of reality
an encapsulation of knowledge about a particular physical or social process in electronic form
use the results of modelling for making decisions (change the future)
Classification of models
specific - general
model estimation - first-principles models
numerical - analytical
stochastic - deterministic
microscopic - macroscopic
discrete - continuous
qualitative - quantitative
The independent variable X (input) and the dependent variable Y (output). Linear regression Y = a1 X + a2 roughly describes the observed relationship. The parameters a1 and a2 are unknown; they are found by feeding the model with data and solving an optimization problem (training). The model then predicts the output for new inputs without actual knowledge of what drives Y.
[Figure: measured data, a table of instances (rows) with attribute values x1 ... xK and an output y per instance, is used to build the data-driven model; for a new input value X the model produces the predicted output Y, which is compared with the actual (observed) output Y.]
A DDM learns the target function Y = f(X) describing how the real system behaves. Learning = the process of minimizing the difference between the observed data and the model output. X and Y may be non-numeric. After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
"Data-driven" model is defined as a model connecting the system state variables (input, internal and output) without much knowledge about the "physical" behaviour of the system
examples: regression model linking input and output
P = precipitation, E = evapotranspiration, Q = runoff, SP = snow pack, SM = soil moisture, UZ = upper groundwater zone, LZ = lower groundwater zone, lakes = lake volume
Precipitation P(t), moving average of precipitation MAP3(t-2), evapotranspiration E(t), runoff Q(t)
[Figure: catchment with rainfall Rt, upstream flow Qtup and flow Qt.]
Inputs: lagged rainfalls Rt, Rt-1, ..., Rt-L and runoffs (flows) Qt, ...
Output to predict: Qt+T
Questions:
how to find the appropriate lags? (the lags embody the physical properties of the catchment)
how to build the non-linear regression function F?
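Assembling such lagged inputs is mechanical once the lags are chosen. A minimal sketch (the function name and the lag/lead-time choices are illustrative, not from the lecture):

```python
import numpy as np

def make_lagged_inputs(R, Q, n_lags=3, lead=1):
    """Build rows [Rt, Rt-1, ..., Rt-n_lags, Qt] with target Qt+lead."""
    X, y = [], []
    for t in range(n_lags, len(Q) - lead):
        X.append([R[t - k] for k in range(n_lags + 1)] + [Q[t]])
        y.append(Q[t + lead])
    return np.array(X), np.array(y)
```

A non-linear regression function F (a model tree, a neural network, etc.) is then trained on the resulting pairs (X, y).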
1. Select a clearly defined problem that the model will help to resolve.
2. Specify the required solution of the problem.
3. Define how the delivered solution is going to be used in practice.
4. Learn the problem, collect the domain knowledge, understand it.
5. Let the problem drive the modelling, including the tool selection, data preparation, etc. That is, take the best tool for the job, not just a job you can do with the available tool.
6. Clearly define the assumptions (do not just assume, but discuss them with the domain knowledge experts).
...
7. Refine the model iteratively (try different things until the model seems as good as it is going to get).
8. Make the model as simple as possible, but no simpler. This is also formulated as:
KISS ("Keep It Sufficiently Simple", or "Keep It Simple, Stupid")
the Minimum Description Length principle: the best model is the one with the shortest description
the Occam's Razor principle, formulated by William of Occam in 1320 as: shave off all the unneeded philosophy from the explanation
9. Identify instability in the model (critical areas where small changes in inputs lead to large changes in output).
10. Identify uncertainty in the model (critical areas and ranges in the data where the model produces low-confidence predictions).
Statistics
Machine learning
Soft computing (fuzzy systems)
Computational intelligence
Artificial neural networks
Data mining
Non-linear dynamics (chaos theory)
A DDM tries to learn the target function Y = f(X) describing how the real system behaves. Learning = the process of minimizing the difference between the observed data and the model output. X and Y may be non-numeric. After learning, when fed with new inputs, the DDM can generate output close to what the real system would generate.
minimizing the error during model calibration (cross-validation)
minimizing the error during model operation (or on the unseen test set)
Ideally, we should aim at minimizing the cross-validation error, since this gives hope that the error on the test set will also be small. In practice, the training process uses the training set, and the cross-validation set is used to periodically check the model error and to stop training.
Consider a model being progressively made more accurate (and complex) during training:
the Green (linear) model is simple, but it is not accurate enough
the Blue model is the most accurate during training, but is it the best?
the Red model is less accurate than the Blue one, but it captures the trend in the data; it will generalise well to a new input (e.g. rainfall) value
Question: which model is better during training: green, red or blue? How do we determine when to stop improving the model?
[Figure: actual (e.g., flow) output values Y plotted against input X, with the three candidate models and the predicted output for a new input value.]
Training: iterative refinement of the model.
with each iteration the model parameters are changed to reduce the error on the training set
the error on the training set gets lower and lower
the error on the test set gets lower, then starts to increase
this may lead to overfitting and a high error on the cross-validation set
the moment the cross-validation error starts to increase is the moment to stop training
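This stopping rule can be sketched with model complexity standing in for training iterations (the synthetic data and the polynomial-degree "complexity knob" below are illustrative assumptions, not the lecture's own experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 60)

x_tr, y_tr = x[:40], y[:40]   # training set: used to fit the model
x_cv, y_cv = x[40:], y[40:]   # cross-validation set: used only to decide when to stop

cv_errors = []
for degree in range(1, 13):   # progressively more complex models
    coef = np.polyfit(x_tr, y_tr, degree)
    cv_errors.append(np.mean((np.polyval(coef, x_cv) - y_cv) ** 2))

best_degree = 1 + int(np.argmin(cv_errors))   # stop where the CV error is lowest
```

The training error keeps falling as the degree grows, but the cross-validation error eventually rises again; the minimum marks the moment to stop.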
Data
[Table: measured data as instances (rows), each described by attribute values x1 ... xK and an associated output value y.]
Class (category, label)
Ordinal (order)
Numeric (real-valued)
etc. (considered later)
(Time) series data: numerical data whose values have an index variable associated with them.
Classification
on the basis of classified examples, a way of classifying unseen examples is to be found
Association
association between features (which combinations of values are most frequent) is to be identified
Clustering
groups of objects (examples) that are "close" are to be identified
Numeric prediction
the outcome is not a class but a numeric (real) value; this is often called regression
Hypotheses
there are various possible functions Y = f(X) (hypotheses) relating input and output
machine learning is a search through this hypothesis space for the one that fits the observed data and the prior knowledge of the learner
Concepts: the thing to be learned on the basis of available data. For example:
children learn how to read and write, and what is sweet and salty
conditions that lead to a flood
combinations of particular algae indicating poor water quality
Concepts
Concept - the thing to be learned on the basis of available data. For example:
children learn how to read and write, and what is sweet and salty
conditions that lead to a flood
combinations of particular algae indicating poor water quality
Often a concept is a Boolean-valued function (Yes/No).
Concept learning = inferring (building) a Boolean-valued function from training examples of its input and output.
[Figure: C = the target (real) concept of the "+" class; examples of class "+" lie inside C, examples of class "-" outside.]
C' = the concept of class "+" induced (learned) from the data: a hypothesis. Here it is fully consistent with the data (all +, no -).
Concepts as sets: U is the set of all objects; a concept C is a subset of U (C ⊆ U).
Learning C: for every X in U, to recognize whether X belongs to C or not.
[Figure: the induced concept C' overlaps the true concept C; examples in C' - C are false "+", examples in C - C' are false "-".]
Errors (incorrect classifications): (C - C') and (C' - C)
Accuracy of the induced concept C' = the proportion of correct classifications: |U - (C - C') - (C' - C)| / |U|
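The set-based accuracy definition translates directly into code. A toy sketch (the universe and the two concepts are invented for illustration):

```python
# Accuracy = |U - (C - C') - (C' - C)| / |U|, with concepts as Python sets.
U = set(range(20))                    # universe: all objects
C = {x for x in U if x < 10}          # true concept: the "+" examples
C_prime = {x for x in U if x < 12}    # induced concept C', slightly too wide

false_neg = C - C_prime               # "+" examples missed by C'
false_pos = C_prime - C               # objects wrongly classified as "+"
accuracy = len(U - false_neg - false_pos) / len(U)
```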
Instances (examples)
Instances = examples of input data. Instances that can be stored in a simple rectangular table (only these will be mainly considered):
individual unrelated customers described by a set of attributes records of rainfall, runoff, water level taken every hour
Instances that cannot be stored in a table, but require more complex structures: instances of pairs that are sisters, taken from a family tree
related tables in complex databases describing staff, their ownership, involvement in projects, borrowing of computers, etc.
Ordinal - categories that can be ordered (ranked):
temperature expressed as cool, mild, hot
water level expressed as low, medium, high
Interval - ordered and expressed in fixed, equal units. Examples:
dates (which cannot, however, be multiplied)
temperature expressed in degrees Celsius
prepare the data - this may include complex procedures of restoring missing data, data transformation, etc.;
survey the data - understand the nature of the data, get insight into the problem this data describes; this includes identification and analysis of variability, sparsity, peaks and valleys, entropy, mutual information of inputs to outputs, etc. (this step is often merged with the previous one);
build the model
training data set - raw data presented in the form necessary to train the DDM;
cross-validation data set - needed to detect overtraining;
testing (or validation) data set - needed to validate (test) the model's predictive performance;
algorithms and software to perform pre-processing (e.g., normalization);
algorithms and software to perform post-processing (e.g., denormalization).
Finding relationships between attributes (e.g., correlation, average mutual information - AMI)
Discretizing numeric attributes into {low, medium, high}
Data reduction (principal component analysis - PCA)
What to do with outliers? How to reconstruct missing values?
An estimator is a device (algorithm) used to make a justifiable guess about the value of some particular variable, that is, to produce an estimate.
An unbiased estimator is a method of guessing that does not change important characteristics of the data set when the estimates are included with the existing values.
Example: data set 1 2 3 x 5
Estimators:
2.750, if the mean is to be unbiased;
4.659, if the standard deviation is to be unbiased;
4.000, if the step-wise change in the variable value (the trend) is to be unbiased (that is, linear interpolation is used: xi = (xi+1 + xi-1) / 2)
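The first and third estimates are easy to verify in code (the standard-deviation case requires solving a small equation and is omitted here):

```python
import statistics

known = [1, 2, 3, 5]          # the data set 1 2 3 x 5, with x missing

# An estimate that leaves the mean unchanged: the mean of the known values.
x_mean = statistics.mean(known)

# An estimate that preserves the local trend: linear interpolation
# between the neighbours x3 = 3 and x5 = 5.
x_interp = (3 + 5) / 2
```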
Examples:
in a harbour, sedimentation is measured once every two weeks at one location and once a month at two other locations, and never at other (maybe important) locations
in a catchment, rainfall data was collected manually at three gauging stations once a day for 20 years; 3 years ago measurements also started at 4 new automatic stations, with hourly frequency
Solutions:
filling in missing data
introducing an artificial resolution equal to the maximum resolution for all variables
General form:
x'i = a xi + b
to keep data positive: x'i = xi - min(x1...xn) + SmallConst
squashing data into the range [0, 1]:
x'i = (xi - min(x1...xn)) / (max(x1...xn) - min(x1...xn))
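A sketch of the [0, 1] squashing (the function name is illustrative):

```python
import numpy as np

def minmax_scale(x):
    """Squash data into [0, 1]: x' = (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```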
Non-linear transformations
Logistic function:
L(x) = 1 / (1 + e^(-x))
[Figure: S-shaped logistic curve mapping any input value to an output value between 0 and 1.]
Softmax function (linear scaling step):
x'i = (xi - E(x)) / (λ σx / 2π)
where E(x) is the mean value of variable x; σx is the standard deviation of x; λ is the size of the linear response region, measured in standard deviations (for example, ±1σ on either side of the central point of the distribution covers 68% of the total range of x, ±2σ covers 95.5%, ±3σ covers 99.7%); π ≈ 3.14
First step: x'i is obtained from the original value xi using a user-selected parameter (the equation itself was lost in extraction).
The second step balances the distribution by subtracting the mean and dividing the result by the standard deviation:
x''i = (x'i - E(x')) / σx'
where x'i is the value after the first transform, x''i is the (final) standardized value, E(x') is the mean value of variable x', and σx' is the standard deviation of x'.
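The standardizing step, sketched (the function name is illustrative):

```python
import numpy as np

def standardize(x):
    """Balance a distribution: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```

The result has zero mean and unit standard deviation.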
[Figure: transformed discharge plotted against time [hrs].]
Transforming the distributions can be dangerous: such transformations may change the nature of the data and the relationships between variables:
a) original data: two clusters (two samples) are visible
b) normalized data: the clusters cannot be identified
Smooth data?
simple and weighted moving averages
Savitzky-Golay filter: builds a local polynomial regression (of degree k) on a series of values
other filters (Gaussian, Fourier, etc.)
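The simplest of these smoothers, sketched (the window length is an arbitrary choice):

```python
import numpy as np

def moving_average(x, window=3):
    """Simple (unweighted) moving-average smoother."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")
```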
The Fourier transform can be used to smooth data: keep only the low-frequency harmonics of the original signal (time series).
Correlation coefficient R:
R = Σi (xi - x̄)(yi - ȳ) / sqrt( Σi (xi - x̄)² · Σi (yi - ȳ)² )
Average mutual information (AMI). It represents the measure of information that can be learned about one set of data given knowledge of another set of data:
I(X;Y) = Σ_{x∈X} Σ_{y∈Y} P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]
where P(x,y) is the joint probability of realisation x of X and y of Y, and P(x) and P(y) are the individual probabilities of these realisations. If X is completely independent of Y, then the AMI I(X;Y) is zero.
AMI can be used to identify the optimal time lag for a data-driven rainfall-runoff model.
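A histogram-based (plug-in) estimate of the AMI formula above, sketched; the bin count and the names are arbitrary choices:

```python
import numpy as np

def average_mutual_information(x, y, bins=8):
    """Estimate I(X;Y) = sum P(x,y) log2( P(x,y) / (P(x)P(y)) ) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)    # marginal P(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal P(y)
    nz = pxy > 0                           # convention: 0 * log 0 = 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Computed for Qt+1 against Rt-L over a range of lags L, the lag with the maximum AMI would be selected.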
Consider the future discharge Qt+1 and past rainfalls Rt-L. What is the lag L for which the relatedness is strongest? AMI can be used to identify this optimal time lag.
[Figure: AMI between Qt+1 and the past lagged rainfalls Rt-L plotted against the lag L, with a zoomed-in hydrograph of Q [m3/s] and R [mm]; the lag with the maximum AMI is the optimal lag.]
Introducing classification:
main ideas
[Figure: two scatter plots in the input space X: (a) linearly separable examples; (b) a more difficult example, where a linear function would misclassify several examples, so a non-linear function (or a transformation of the space) is needed.]
Decision tree: example with 2 numeric input variables, 2 output classes: 0 and 1
[Figure: examples scattered in the (X1, X2) plane, partitioned into class-0 and class-1 regions, and the corresponding decision tree with conditions such as x2 > 2, x1 > 2.5, x1 < 4, x2 < 3.5 and x2 < 1 at the nodes and class labels at the leaves.]
Several ways to represent knowledge about data set (1) Classification rules
Classification rules predict the classification of examples, in terms of whether to play or not. E.g.:
if (Outlook=sunny) and (Humidity=high) then Play=No
if (Outlook=rainy) and (Windy=strong) then Play=No
if (Outlook=overcast) then Play=Yes
if (Humidity=normal) then Play=Yes
if (none of the above) then Play=Yes
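Read in order, the five rules above become a chain of checks. A sketch (the function and value names are illustrative):

```python
def play(outlook, humidity, windy):
    """Apply the five classification rules above, in order."""
    if outlook == "sunny" and humidity == "high":
        return "No"
    if outlook == "rainy" and windy == "strong":
        return "No"
    if outlook == "overcast":
        return "Yes"
    if humidity == "normal":
        return "Yes"
    return "Yes"   # none of the above
```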
Several ways to represent knowledge about data set (2) Decision trees
Decision trees - treelike structure representing classification rules
Several ways to represent knowledge about data set (3) Association rules
Association rules - they associate different attribute values. E.g.:
if (Temperature=cool) then Humidity=normal
if (Humidity=normal) and (Windy=weak) then Play=Yes
if (Outlook=sunny) and (Play=No) then Humidity=high
if (Windy=false) and (Play=No) then (Outlook=sunny) and (Humidity=high)
In total there are around 60 such rules that are 100% correct. These rules can predict any of the attributes, not just the Play attribute.
Classification:
decision trees and ID3 and C4.5 algorithms
Outlook   Temp  Humid.  Wind    Play?
sunny     hot   high    weak    no
sunny     hot   high    strong  no
overcast  hot   high    weak    yes
rainy     mild  high    weak    yes
rainy     cool  normal  weak    yes
rainy     cool  normal  strong  no
overcast  cool  normal  strong  yes
sunny     mild  high    weak    no
sunny     cool  normal  weak    yes
rainy     mild  normal  weak    yes
sunny     mild  normal  strong  yes
overcast  mild  high    strong  yes
overcast  hot   normal  weak    yes
rainy     mild  high    strong  no
Objective of classification
How to construct such tree?: algorithm ID3 (Quinlan 86), extended later to C4.5 and C5
classification error for this verification data set of 3 examples is 1/3 = 33%.
Model in operation: given a new instance (e.g. Outlook = rainy, Temp = cool, ...), the tree predicts Play = ?
1. A := the 'best' decision (split) attribute for the next node ('best' = giving the maximum information gain)
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
But a question remains: how to identify the best split attribute, i.e. how to compute the information gain? Let's consider the Iterative Dichotomizer 3 (ID3) algorithm.
E(S) = - Σ_{i=1..c} pi log2 pi
the logarithm is base 2 because entropy is the expected encoding length measured in bits
the maximum value of the entropy is log2 c; for example, if c = 8 and p1 = ... = p8 = 0.125, then E(S) = 3 (3 bits are needed to send a message about the class number)
if all examples belong to one class, E = 0 - this is what we are aiming at
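The entropy formula in code (a direct transcription; the 0 log 0 = 0 convention is handled by skipping zero probabilities):

```python
import math

def entropy(probs):
    """E(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```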
Entropy and Information Gain: the essence of the Iterative Dichotomizer 3 (ID3) algorithm for building decision trees
S = (14 instances: 9+, 5-)
Entropy E([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
we want to reduce the total entropy by splitting the set S into subsets with lower entropy (i.e. with a higher share of examples of the same class)
lower entropy = information gain
if the split is made on the basis of attribute A: let Values(A) be the set of all possible values for attribute A
e.g.: Values(Humidity) = {normal, high}; S_Humidity=normal = {7 examples}, |S_Humidity=normal| = 7
then the information gain (the expected reduction in entropy caused by knowing the value of attribute A) is:
Gain(S, A) = E(S) - Σ_{v∈Values(A)} (|Sv| / |S|) E(Sv)
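Gain(S, A) in code, checked against the weather example (for Humidity the classic value is about 0.151; the dictionary-based data layout is an assumption):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(rows, attribute, target):
    """Gain(S, A) = E(S) - sum over values v of A of |Sv|/|S| * E(Sv)."""
    def class_entropy(subset):
        labels = [r[target] for r in subset]
        return entropy([labels.count(c) / len(labels) for c in set(labels)])

    gain = class_entropy(rows)
    for v in {r[attribute] for r in rows}:
        sv = [r for r in rows if r[attribute] == v]
        gain -= len(sv) / len(rows) * class_entropy(sv)
    return gain
```

ID3 computes this gain for every candidate attribute and splits on the one with the largest value.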
[Example of splitting on a numeric attribute: sorted values 72 (Yes), 80 (Yes), 90 (No); a candidate split point lies between 80 and 90, where the class changes.]
Frequent situation: accuracy during learning increases, but on the test examples drops
Example of overfitting:
consider adding a new, 15th, instance [sunny, hot, normal, strong, NO]
this example is noisy: it could simply be wrong
the new tree, built by the ID3 algorithm in the Weka software, has to take this (wrong) example into account (note that a check on Temperature is added):
outlook = sunny
| temperature = hot: no
| temperature = mild
| | humidity = high: no
| | humidity = normal: yes
| temperature = cool: yes
outlook = overcast: yes
outlook = rainy
| windy = strong: no
| windy = weak: yes
The original tree built on 14 examples:
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = strong: no (2.0)
| windy = weak: yes (3.0)
The target function is discrete-valued (i.e. it is a class). Possibly noisy training data and missing values are allowed.
Classification rules
Several rules are often connected with the OR operator (ORed)
Rules can be read off a decision tree, but such rules are far more complex than necessary; well-constructed rules are often more compact than trees
Problems: should rules be interpreted in order (as a decision list) or individually? What to do if different rules lead to different conclusions for the same instance? etc.
Case Study Woudse: using decision trees to replicate pumping strategy for Woudse water system (Delfland, NL)
Woudse Case Study: water levels and pumping (full data set)
[Figure: water level (WL) and pump discharge over ~2000 observations, full data set.]
[Figure: zoomed fragment (around observation 1150) of the water level (WL) and pump discharge series.]
Input: water level(t) and pump discharge(t-1); output: pump discharge(t)
The pumping station has two pumps, each with a capacity of 0.133 m3/s, so the possible total pump discharge is 0, 0.133 or 0.266 m3/s
Pump discharge is therefore described as a category variable, expressed as 0, 1 or 2
If the water level goes up, pump(s) should be switched on to reduce the water level; the target water level is -4.6 m
At each time step we have to determine the pump discharge (0, 1 or 2) based on the two inputs: water level(t) and pump discharge(t-1)
Woudse case study: resulting decision tree solving the classification problem (trained on 5000 instances)
Pumpt = f(WLt, Pumpt-1), Pump = {0, 1, 2}
PumpT-1 = 0
| WLt <= -4.577: 0 (1084.0/1.0)
| WLt > -4.577
| | WLt <= -4.57: 0 (417.0/111.0)
| | WLt > -4.57
| | | WLt <= -4.551: 1 (125.0)
| | | WLt > -4.551: 2 (12.0)
PumpT-1 = 1
| WLt <= -4.595
| | WLt <= -4.601: 0 (129.0)
| | WLt > -4.601: 1 (189.0/64.0)
| WLt > -4.595
| | WLt <= -4.55: 1 (680.0/3.0)
| | WLt > -4.55: 2 (41.0)
PumpT-1 = 2
| WLt <= -4.593
| | WLt <= -4.6: 0 (37.0)
| | WLt > -4.6: 2 (39.0/16.0)
| WLt > -4.593: 2 (2247.0/1.0)
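In operation the induced tree is just nested comparisons on the two inputs. A sketch with the thresholds copied from the tree (the function name is illustrative; the instance counts in parentheses are dropped, and the two adjacent 0-leaves of the first branch are merged):

```python
def pump_setting(wl_t, pump_prev):
    """Number of pumps (0, 1 or 2) from water level WLt and the previous setting."""
    if pump_prev == 0:
        if wl_t <= -4.57:
            return 0
        return 1 if wl_t <= -4.551 else 2
    if pump_prev == 1:
        if wl_t <= -4.595:
            return 0 if wl_t <= -4.601 else 1
        return 1 if wl_t <= -4.55 else 2
    # pump_prev == 2
    if wl_t <= -4.593:
        return 0 if wl_t <= -4.6 else 2
    return 2
```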
Woudse case study: verification result (full data set, 3759 instances)
[Figure: verification on the full data set: water level and the pump discharge classified by the tree.]
Woudse case study: verification result (fragment with the 100% correct classification)
[Figure: fragment (observations ~950-1000) in which the classification is 100% correct: water level (m) and pump discharge.]
Woudse case study: verification result (fragment with some errors present)
[Figure: fragment (observations ~600-650) with some classification errors: water level (m) and pump discharge.]
Classification: conclusions
there is a wide choice of methods
classification methods are mainly applied in pattern recognition problems
engineering numerical problems can sometimes be posed as classification problems; using classification methods (decision trees) often leads to simpler models and requires less accurate data
Linear regression
Y = a1 X + a2
[Figure: measured points (x(t), y(t)) scattered around the fitted regression line; for a new input value x(v) the line gives the predicted output.]
Given measured (training) data: T vectors {x(t), y(t)}, t = 1..T. The unknown a1 and a2 are found by solving an optimization problem:
E = Σ_{t=1..T} ( y(t) - (a1 x(t) + a2) )² → min
Then, for V new vectors {x(v)}, v = 1..V, the equation can approximately reproduce the corresponding function values {y(v)}, v = 1..V.
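The optimization above has a closed-form least-squares solution; numpy's polyfit solves it. A sketch on synthetic, noise-free data (all numbers are illustrative):

```python
import numpy as np

# Fit Y = a1*X + a2 by minimizing E = sum_t ( y(t) - (a1*x(t) + a2) )^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0             # synthetic data generated with a1 = 2, a2 = 1

a1, a2 = np.polyfit(x, y, 1)  # least-squares estimates of the two parameters
y_new = a1 * 5.0 + a2         # prediction for a new input value x(v) = 5
```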
The input space X1×X2 is split into regions; a separate regression model can be built for each of the regions.
Tree structure where the nodes are splitting conditions and the leaves are (local) regression models.
[Figure: the (X1, X2) plane partitioned into regions served by Models 1-6, and the corresponding model tree with conditions such as x1 > 2.5, x1 < 4, x2 < 3.5 and x2 < 1 at the nodes and Models 1-6 at the leaves.]
How to select an attribute for a split in regression trees and M5 model trees:
regression trees: the same idea as in decision trees (information gain)
main idea: choose the attribute that splits the portion T of the training data that reaches a particular node into subsets T1, T2, ...
use the standard deviation sd(T) of the output values in T as the measure of error at that node (in decision trees, the entropy E(T) was used)
the split should result in subsets Ti with low standard deviation sd(Ti)
so the model trees splitting criterion is SDR (standard deviation reduction), which has to be maximized:
SDR = sd(T) - Σi (|Ti| / |T|) sd(Ti)
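The SDR criterion in code (a direct transcription; pstdev is the population standard deviation):

```python
import statistics

def sdr(parent, subsets):
    """SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)) for a candidate split of T."""
    n = len(parent)
    return statistics.pstdev(parent) - sum(
        len(s) / n * statistics.pstdev(s) for s in subsets
    )
```

A perfect split of [1, 1, 1, 9, 9, 9] into its two constant halves removes all spread, so the SDR equals sd(T) itself; a split that changes nothing gives SDR = 0.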
smoothing: a smoothing process is used to compensate for the sharp discontinuities between adjacent linear models
pruning (size reduction): needed when a large tree overfits the data; a subtree is replaced by one linear model
Trees and rules: the influence of soil habitat on Collembola apterigota (an insect)
Influence of soil habitat features on the abundance of Collembola apterigota (Kampichler, Dzeroski): regression and model trees
inputs: field type, microbial respiration, microbial biomass, soil moisture, alkalinity (pH), carbon, nitrogen, median particle size
outputs: total number of collembolan individuals (abundance), total number of collembolan species (biodiversity), number of individuals of Folsomia quadrioculata (a particular type of Collembola)
methods compared: linear regression (highest error), regression and model trees, neural networks (least error)
Model structure:
Variables considered:
daily discharges (QX, QC)
daily rainfall at 17 stations
daily evaporation at 3 stations
Data (1976-1996): training 1976-89 (14 years); cross-validation & testing 1990-96
[Figure: observed discharge (m3/s), June-August 1996, compared with the FS-M5 and FS-ANN model predictions.]
M5 model trees and ANNs in rainfall-runoff modelling: predicting flow three hours ahead (Sieve catchment)
Inputs: REt, REt-1, REt-2, REt-3, Qt, Qt-1 (rainfall for 3 past hours, runoff for 2)
ANN verification: RMSE = 11.353, NRMSE = 0.234, COE = 0.9452
MT verification: RMSE = 12.548, NRMSE = 0.258, COE = 0.9331
The model: [figure of the resulting model and a hydrograph fragment (hours 100-180).]
Transparency of trees: model trees are easy to understand (even by managers)
An M5 model tree is a mixture of locally accurate models
Pruning (reducing size) allows:
to prevent overfitting
to generate a family of models of various accuracy and complexity
End of part 1