
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

A framework for the verification of air quality forecasting models
using self-organizing feature maps
Xavier Guilbeault
Departement de genie electrique et de
genie informatique
Universite de Sherbrooke
2500 boul. de l'Universite
Sherbrooke, QC, Canada, J1K 2R1
E-Mail: Xavier.Guilbeault@USherbrooke.ca

Stephane Gaudreault
Air Quality Modelling Applications
Canadian Meteorological Centre
Meteorological Service of Canada
2121, route Transcanadienne
Dorval, QC, Canada, H9P 1J3
E-Mail: Stephane.Gaudreault@ec.gc.ca

Louis-Philippe Crevier,
Hugo Landry
Sylvain Menard
Air Quality Modelling Applications
Canadian Meteorological Centre
Meteorological Service of Canada
2121, route Transcanadienne
Dorval, QC, Canada, H9P 1J3

Joel Martin
Departement d'informatique
Universite de Sherbrooke
2500 boul. de l'Universite
Sherbrooke, QC, Canada, J1K 2R1
Abstract - A fundamental problem in the development of an air
quality forecast system is the implementation of an
evaluation protocol. Traditionally, statistics are
computed to compare the model output to the
observations. These methods are limited in that they
generally cannot easily identify the nature of an
error (such as location and timing errors). In this
paper, we describe CALUMeT (Canadian pollution
monitoring tool), an experimental framework that
attempts to address these limitations. This framework
uses self-organizing feature maps to classify the
feature vectors extracted from the regions of interest.
It encompasses both a formalism and a software tool
that is under active development. More specifically,
the framework allows the specification and manipulation
of invariants associated with topological elements of
an air quality forecast.

I. INTRODUCTION

Air quality forecasting is the process of predicting the
concentrations of a number of pollutants in the atmosphere,
both in space and in time. The production and the transport
of tropospheric ozone, particulate matter of less than 10 or
2.5 micrometers in diameter and other chemical pollutants
are predicted by sophisticated computer models developed by
the Meteorological Service of Canada (MSC), such as
CHRONOS [7] and AURAMS. These forecasts, combined with
public awareness programs in the community, allow Canadians
to make more informed choices to protect their health and
reduce emissions at the personal and community level.
The validation of air quality models is a challenging task
because the spatial and temporal distributions of pollution
are highly discontinuous. Traditionally, statistical methods
have been used to compare observations at precise locations
with the values computed at the corresponding positions by
the forecast model. These validations are currently made
with statistical analyses that generally cannot account for
location and timing errors. Widely used scores, like the
root mean square error and the correlation coefficient, are
sensitive to discontinuities, noise and outliers. Hence,
these methods can report a large error when the values do
not match point by point, even though the observed
phenomenon was correctly forecast with only a minor spatial
translation of the event.

Previous studies have shown the importance of discriminating
between the different sources of forecast error. Brown et
al. [2] and Bullock et al. [3] developed an alternative
"object-based" approach in which forecast and observed
precipitation events are modelled as basic geometrical
shapes, such as a band-aid, convex hull and ellipse
approximation. Comparisons of the attributes of the forecast
and observed shapes (such as the centroid location, axis
orientation, eccentricity and axis magnitude) were then used
to detect different types of error and to characterize them.
While this approach gives interesting results for the
verification of precipitation forecasts, a more complex
matching rule set may be necessary to fully account for the
unique characteristics of atmospheric pollution.
An alternative approach for the verification of air quality
forecasts is introduced here. CALUMeT is an experimental
framework that computes regions of interest (ROI) from both
the forecast field and the observation map and uses a
self-organizing feature map (SOFM) to classify their feature
vectors. These features are extracted from the different
sections isolated by segmentation from the input maps.
Geometrical and physical invariants are used to develop a
realistic and discriminative model.
0-7803-9048-2/05/$20.00 ©2005 IEEE
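The sensitivity of point-by-point scores to a small spatial translation, noted in the introduction, can be illustrated with a synthetic example. The sketch below is not part of CALUMeT: the one-dimensional Gaussian "plume", its width and the 6-point displacement are assumptions chosen only for illustration. It compares a forecast that has the right shape in the wrong place with a forecast that misses the event entirely.

```python
import math


def rmse(a, b):
    """Root mean square error between two equally sized sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson_r(a, b):
    """Pearson correlation coefficient between two sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def plume(center, size=40, width=2.0, peak=80.0):
    """Synthetic 1-D pollution event: a Gaussian bump on a clean background."""
    return [peak * math.exp(-((i - center) ** 2) / (2 * width ** 2))
            for i in range(size)]


observed = plume(center=15)
displaced = plume(center=21)      # correct shape and intensity, shifted by 6 points
missed = [0.0] * len(observed)    # forecast that contains no event at all

print("RMSE, displaced event:", round(rmse(observed, displaced), 2))
print("RMSE, missed event:   ", round(rmse(observed, missed), 2))
print("r, displaced event:   ", round(pearson_r(observed, displaced), 3))
```

In this toy case the displaced forecast is penalized twice (once where the event was observed, once where it was forecast), so its RMSE is larger than that of the forecast with no event, and its correlation with the observation is close to zero.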


The evaluation of the forecast is made using the classified
data and the disparities found. CALUMeT implements a
human-like evaluation by perceiving the shapes and physical
properties of the ROI. The key idea of this framework is to
get as close as possible to the capabilities of an
experienced chemistry meteorologist to verify the validity
of forecast models.
As an example, the fields from figures 1 and 2 are analysed
by CALUMeT. The first field is a forecast of ozone
concentration over North America on 1st March 2005 at 06 Z.
The objective analysis [9] presented in figure 2 is obtained
by using sophisticated mathematical methods to strike a
judicious balance between the trial (forecast) fields and
the acquired observational data.

Figure 1 - Ozone concentration forecast over North America, 1st March 2005

Figure 2 - Objective analysis of ozone concentration over North America on 1st March 2005

At first, a correlation diagram is computed, comparing each
point of the grids one by one. This analysis gives different
statistics that provide valuable information to the user,
such as the squared correlation coefficient and the root
mean square error. Figure 3 shows the resulting correlation
diagram. The squared correlation coefficient is 0.3706,
which is rather poor and indicates some major differences
between the two fields. Furthermore, the root mean square
error is quite high, indicating a deviation between the
forecast events and the objective analysis. This diagram
also indicates that a high number of points of high
concentration were missed by the forecast. Many other
statistics can be used, such as the variance and the mean
error, but although we know that the forecast is not
representative of the objective analysis, the nature of the
differences between the two fields is difficult to retrieve.
Correcting the model therefore proves to be a challenging
task.

Figure 3 - Correlation diagram of the objective analysis and the forecast (O3)

Figure 6 represents the trained SOFM with the positions of
both ROI in the map (the frontiers between the nodes of the
map are darker as the Euclidean distance between the two
nodes gets higher).

Figure 4 - Isolated ROI from the forecast ozone concentration.

Figure 5 - Isolated ROI from the objective analysis concentration

Figure 6 - Self-Organizing Feature Map with the node position assigned to each ROI (1 - Forecast ROI, 2 - Objective analysis ROI).


It can be seen that both ROI have been assigned to very
close nodes in the map and are therefore a match candidate.
Once CALUMeT has identified the two regions as a valid
match, they are displayed to the user, who can use this
information to eventually improve the air quality model.
Another interesting ROI was isolated in the objective
analysis field. It is shown in figure 7 and is relevant to
the analysis because it is a very large region that has no
acceptable match. This information helps the user identify
one of the missing regions that were suggested by the
correlation diagram (figure 3).

Figure 7 - An unmatched ROI from the objective analysis.

As shown, the framework described does not intend to replace
the statistical approach, but to reinforce the diagnostics
and the verification process.

II. ALGORITHM DESCRIPTION

A. Description

CALUMeT pattern recognition focuses on features that are
mostly invariant to geometrical transformations like
translation, change of scale and rotation. Self-organizing
maps (SOM) are used because they can effectively deal with
very high-dimensional feature vectors and offer a scalable
solution if new dimensions are added. Furthermore, the SOM
model is an unsupervised neural network, which gives the
ability to classify vectors never seen before, a highly
probable event in the identification of shapes such as
pollution clouds.
Figure 8 shows the architecture of the framework and the
different steps of the verification process.

Figure 8 - Execution flowchart of the CALUMeT framework

At first, the segmentation is done by applying a threshold
to isolate the different ROIs. The value of the threshold is
determined empirically by the user according to their needs.
However, a mechanism to isolate different threshold layers
of ROI has been implemented, because some ROI may be present
in the two input data fields but with different intensities.
A human expert may be able to identify the corresponding
regions, even with their varying physical densities. CALUMeT
tries to do the same by classifying ROI at different
intensities. This approach makes it possible to match shapes
isolated by different threshold values in the event that a
shape cannot be matched with a sufficient score on its own
layer (score computation is introduced later in this paper).
Then, the produced binary images are smoothed to remove as
much noise as possible while preserving the main
characteristics of the isolated ROI. To do so, a
morphological opening operation is applied with a cross
structuring element 3 pixels wide.
Once the images are filtered, each ROI is isolated from the
others and added to an internal list for further processing.
Feature vectors are then computed for all these ROI. The
feature vectors contain physical and geometrical invariants
that are used in the classification stage, uniquely
identifying each ROI processed. The geometrical features
help differentiate the shapes of the ROI, while the physical
features are based on the concentrations of pollutants and
help evaluate ROI with regard to their density and weight
distribution.
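The segmentation, smoothing and ROI isolation steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the CALUMeT implementation: the function names are hypothetical, grid points outside the field are treated as background during erosion, and ROI are separated with 4-connectivity, which the text does not specify.

```python
from collections import deque

# Cross structuring element, 3 pixels wide: the centre and its 4 neighbours.
CROSS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))


def threshold(field, level):
    """Binary mask of the grid points whose concentration is >= level."""
    return [[1 if v >= level else 0 for v in row] for row in field]


def _is_on(mask, r, c):
    return 0 <= r < len(mask) and 0 <= c < len(mask[0]) and mask[r][c] == 1


def erode(mask):
    """Keep a pixel only if the whole cross centred on it lies in the mask."""
    return [[1 if all(_is_on(mask, r + dr, c + dc) for dr, dc in CROSS) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if the cross centred on it touches the mask."""
    return [[1 if any(_is_on(mask, r + dr, c + dc) for dr, dc in CROSS) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def opening(mask):
    """Morphological opening: erosion followed by dilation."""
    return dilate(erode(mask))


def label_regions(mask):
    """Split a mask into ROI (lists of (row, col)) using 4-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, region = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if _is_on(mask, ny, nx) and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions


def extract_rois(field, levels):
    """ROI found at each threshold level (one 'layer' per level)."""
    return {level: label_regions(opening(threshold(field, level)))
            for level in levels}


field = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 50, 60, 50, 0, 0, 0],
    [0, 60, 90, 60, 0, 70, 0],
    [0, 50, 60, 50, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
rois = extract_rois(field, levels=[40, 80])
print({level: [len(roi) for roi in found] for level, found in rois.items()})
```

In this small field the isolated point at (2, 5) is removed by the opening, while the 3 x 3 block survives as a single ROI at the lower threshold layer.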


the physical density range in "buckets" of a fixed step


value and to compute the number of points (pixels) that
fall into each bucket. Each ROI can also be separated
into a number of sub-regions, each with its own list of
buckets, which gives a higher degree of detail to the
ROI. However, these subdivisions are very sensitive to
rotations and should not be used in all cases.
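A minimal sketch of the density histogram, with fixed-step buckets and an optional split into sub-regions, is given below. The function names, the bucket limits and the choice of four quadrants of the bounding box as sub-regions are assumptions made for illustration; the text does not specify how the sub-regions are defined.

```python
import math


def density_histogram(values, v_min, v_max, step):
    """Count how many values fall into each bucket of width `step`."""
    n_buckets = int(math.ceil((v_max - v_min) / step))
    counts = [0] * n_buckets
    for v in values:
        if v_min <= v <= v_max:
            # The upper bound v_max is folded into the last bucket.
            index = min(int((v - v_min) // step), n_buckets - 1)
            counts[index] += 1
    return counts


def quadrant_histograms(points, v_min, v_max, step):
    """One histogram per quadrant of the ROI bounding box.

    `points` is a list of (row, col, value) tuples. Quadrants are ordered
    top-left, top-right, bottom-left, bottom-right.
    """
    rows = [r for r, _, _ in points]
    cols = [c for _, c, _ in points]
    row_mid = (min(rows) + max(rows)) / 2
    col_mid = (min(cols) + max(cols)) / 2
    quadrants = [[], [], [], []]
    for r, c, v in points:
        quadrants[2 * (r > row_mid) + (c > col_mid)].append(v)
    return [density_histogram(q, v_min, v_max, step) for q in quadrants]


concentrations = [42, 47, 55, 61, 68, 79, 80]
whole_roi = density_histogram(concentrations, v_min=40, v_max=80, step=10)

corner_points = [(0, 0, 42), (0, 3, 55), (3, 0, 61), (3, 3, 79)]
per_quadrant = quadrant_histograms(corner_points, v_min=40, v_max=80, step=10)
print(whole_roi, per_quadrant)
```

The quadrant split makes the sensitivity to rotation visible: rotating the ROI by 90 degrees permutes the quadrant histograms, whereas the histogram of the whole ROI is unchanged.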

B. Geometrical features
1) Moments
The normalized central moment [8] is invariant to
translation and scale change. It is equivalent to the
ordinary moment, except that the centroid of the ROI is
taken as the origin. Hu invariants are also computed
since they provide a way to identify an ROI despite
scale change, rotation or translation.
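As an illustration, the normalized central moments and the first two Hu invariants of a binary ROI can be computed as sketched below. The formulas are the standard ones; the function names and the small L-shaped test region are hypothetical, and only two of the seven Hu invariants are shown.

```python
def central_moment(pixels, p, q):
    """Central moment mu_pq of a binary ROI given as (x, y) pixel coordinates."""
    n = len(pixels)
    x_c = sum(x for x, _ in pixels) / n
    y_c = sum(y for _, y in pixels) / n
    return sum((x - x_c) ** p * (y - y_c) ** q for x, y in pixels)


def normalized_central_moment(pixels, p, q):
    """eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2), where mu_00 is the area."""
    return central_moment(pixels, p, q) / len(pixels) ** (1 + (p + q) / 2)


def hu_first_two(pixels):
    """First two Hu invariants, built from the second-order moments."""
    eta20 = normalized_central_moment(pixels, 2, 0)
    eta02 = normalized_central_moment(pixels, 0, 2)
    eta11 = normalized_central_moment(pixels, 1, 1)
    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return phi1, phi2


shape = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (0, 2)]   # small L-shaped ROI
translated = [(x + 10, y - 4) for x, y in shape]
rotated = [(-y, x) for x, y in shape]                       # 90 degree rotation

print(hu_first_two(shape))
print(hu_first_two(translated))
print(hu_first_two(rotated))
```

The three printed pairs agree up to rounding. On a pixel grid the invariance to scale holds only approximately, because a rescaled ROI is re-sampled on the same grid.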

2) Simple features (area, perimeter, geometrical centroid)
Moreover, simple geometrical features such as area,
perimeter and position of the centroid are also used to
classify the data and give a precise evaluation when the
final step of analysing the classification results is made.

3) Normalized compacity factor
The compacity factor [1] is a feature that is invariant to
rotation, translation and scale changes. It quantifies the
compactness of a shape as a simple metric. A feature more
robust to noise is called the normalized compacity factor
and is defined by

$$FC_N = \frac{FC - FC_{min}}{FC_{max} - FC_{min}} \qquad (1)$$

where

$$FC_{min} = n - 1 \qquad (2)$$

$$FC_{max} = \frac{4n - 4\sqrt{n}}{2} \qquad (3)$$

n is the number of model grid points in the ROI and FC is
equal to the contact perimeter.

4) Ellipses
Basic geometrical shapes like the ellipse approximation
[2] [3] are computed because they are invariant to
translation and rotation. The ratio of the two radii of
each ellipse is also invariant to scale change.
Additionally, the ellipse is a useful tool during match
evaluation and comparison, where the difference in axis
orientation can be used.

2) Center of Mass
Another physical invariant is the center of mass, which is
computed from the pollutant concentration level in the ROI.
The coordinates are computed from the following relation:

$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}} \qquad (4)$$

where m_{10} and m_{01} are the first-order moments and
m_{00} is the zeroth-order moment of the ROI.

D. Classification
Once the feature extraction step is done, the vectors are
classified using a SOFM. The main advantage of such a
classifier is its ability to classify data in an
unsupervised manner. The problem at hand does not provide
sufficient training data to train an efficient multilayer
perceptron neural network. Additionally, this unsupervised
learning method gives the system enough flexibility to
classify new shapes never seen before. Our SOFM
implementation is inspired by Kohonen's [5].
The distance function used to evaluate the winning node is
the Euclidean distance, although other functions (like the
Manhattan distance) might be used. The Euclidean distance
is defined by

$$d_E = \sqrt{\sum_{i=1}^{N} (V_i - W_i)^2} \qquad (5)$$

where N is the number of features, V is the input feature
vector and W is the weight vector of a given node.
The edges of the map array ought to be rectangular rather
than square [5] so that the map can stabilize in the
learning process. The size of the map is chosen based on
the number of ROI to classify, with a ratio of nodes/ROI
close to 1. The use of such a small map gives the solution
a lot of precision since similar ROI will be close to each
other in the map [4]. To ensure the representation of rare
cases on the map, the ROI from different threshold layers
of the data fields are added at the training stage.
Each feature vector is pre-processed before it is used to
train the SOM. This pre-processing involves a
standardization of the values followed by a scaling based
on the importance of each feature. The standardization
follows the rule:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (6)$$
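The pre-processing and the selection of the winning node can be sketched as follows, assuming the min-max standardization of equation (6), a per-feature scaling factor and the Euclidean distance of equation (5). The function names, the example feature values and the choice of which column receives a scaling of 5 are hypothetical.

```python
import math


def standardize(vectors):
    """Min-max rescale each feature (column) to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(vec, lows, highs)]
            for vec in vectors]


def apply_scaling(vectors, scale):
    """Multiply each feature by its importance factor."""
    return [[x * s for x, s in zip(vec, scale)] for vec in vectors]


def euclidean(v, w):
    """Euclidean distance between an input vector and a weight vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def winning_node(v, node_weights):
    """Index of the node whose weight vector is closest to v."""
    return min(range(len(node_weights)),
               key=lambda i: euclidean(v, node_weights[i]))


# Three ROI described by three made-up features: area, a Hu-like value, perimeter.
features = [[120.0, 0.20, 3.0], [80.0, 0.35, 9.0], [100.0, 0.50, 6.0]]
standardized = standardize(features)
scaled = apply_scaling(standardized, scale=[1.0, 5.0, 1.0])

nodes = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 5.0, 0.5]]
print([winning_node(vec, nodes) for vec in scaled])
```

Scaling a feature after standardization stretches its axis, so differences along that feature weigh more in the Euclidean distance used to pick the winning node.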

5) Pattern spectrum
The pattern spectrum [6] is a shape-size descriptor that
summarizes important shape characteristics. A succession
of morphological operations with a structuring element
increasing in size gives a unique set of values for a
given shape. For each morphological opening made on a
ROI, the area that is removed is stored in a vector.
After a given number of operations (customisable by the
user), this pattern spectrum is added to the ROI feature
vector.
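A possible way to compute such a pattern spectrum is sketched below. It assumes that the structuring element of size k is obtained by repeating k erosions and then k dilations with the unit cross (which amounts to an opening with a diamond of radius k), and that each entry stores the area lost between two successive sizes. The function names and these choices are assumptions; the text does not state which structuring element is used for this feature.

```python
CROSS = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))


def erode(pixels):
    """Keep the pixels whose whole cross neighbourhood belongs to the shape."""
    return {(r, c) for r, c in pixels
            if all((r + dr, c + dc) in pixels for dr, dc in CROSS)}


def dilate(pixels):
    """Add the cross neighbourhood of every pixel of the shape."""
    return {(r + dr, c + dc) for r, c in pixels for dr, dc in CROSS}


def opening(pixels, size):
    """Opening with a diamond of radius `size`, built from the unit cross."""
    result = set(pixels)
    for _ in range(size):
        result = erode(result)
    for _ in range(size):
        result = dilate(result)
    return result


def pattern_spectrum(pixels, n_steps):
    """Area removed by each opening of increasing size, for sizes 1..n_steps."""
    spectrum = []
    previous_area = len(pixels)
    for size in range(1, n_steps + 1):
        area = len(opening(pixels, size))
        spectrum.append(previous_area - area)
        previous_area = area
    return spectrum


square = {(r, c) for r in range(5) for c in range(5)}   # 5 x 5 block of pixels
print(pattern_spectrum(square, n_steps=3))
```

For the 5 x 5 block, the first opening removes the four corners, the second leaves a diamond of 13 pixels, and the third removes everything that is left.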
C. Physical features
1) Density histogram
To evaluate the physical properties of the ROI, a
density histogram is computed. This feature helps to
evaluate similarly dense regions that would have
important shape differences. The principle is to segment

In the standardization rule (6), x is the value of a given
feature.
The scaling to use is determined empirically through
experimentation. Based on the results of the
classification, the weights used by CALUMeT may be subject
to change.


As of now, a scaling of 5 is applied to the Hu moments and
the center of mass while all other values keep their
standard values.
The training is done by randomly choosing different input
vectors from the data set (so-called bootstrap learning).
Also, the number of iterations in training must be much
larger than the number of input vectors available. We use a
default ratio of 1000 iterations for each input vector.
For each input vector, a winning node is selected based on
the Euclidean distance between each node's weight vector
and the input vector. The learning process is based on the
neighbourhood of the winning node, so that each node in
this region is given the ability to learn from the input.
The initial distribution of the map nodes' weights is
assigned randomly. The learning procedure is defined by

$$W_i(t+1) = W_i(t) + h_{ci}(t)\,[V(t) - W_i(t)] \qquad (7)$$

where t is an integer representing time, h_{ci}(t) is the
neighbourhood kernel, W_i is the weight vector of node i
and V is the input vector.
The neighbourhood kernel used is defined in terms of the
function

$$h_{ci}(t) = \alpha(t) \cdot \exp\left(-\frac{d^2}{2\sigma^2(t)}\right) \qquad (8)$$

In a SOFM, distance between nodes implies difference, but
nearness does not necessarily imply resemblance. Such a
score is therefore a valuable tool for finding the best
possible match by differentiating the results obtained.
Furthermore, the score gives the user a tool to evaluate
the probability of a matched shape. This metric is computed
by using the following inputs:
- distance between nodes on the SOFM
- physical distance between the ROIs
- Euclidean distance between the ROI feature vectors
- relative and fixed area ratios of the ROIs.


Each input is given a weight depending on its importance,
and the overall metric results in a percentage.
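The paper does not give the formula of the match score, so the sketch below only illustrates the general idea of a weighted combination of the listed inputs expressed as a percentage. Every numeric choice here is an assumption: the linear `closeness` mapping, the distance limits, the weights, and the use of a single min/max area ratio in place of the relative and fixed area ratios.

```python
def closeness(distance, limit):
    """Map a distance to [0, 1]: 1 at zero distance, 0 at `limit` or beyond."""
    return max(0.0, 1.0 - distance / limit)


def match_score(node_dist, physical_dist, feature_dist, area_a, area_b,
                weights=(0.3, 0.3, 0.3, 0.1), limits=(5.0, 100.0, 2.0)):
    """Weighted match score in percent (hypothetical weights and limits)."""
    terms = (
        closeness(node_dist, limits[0]),             # distance between SOFM nodes
        closeness(physical_dist, limits[1]),         # physical distance between ROI
        closeness(feature_dist, limits[2]),          # distance between feature vectors
        min(area_a, area_b) / max(area_a, area_b),   # area ratio, 1 when equal
    )
    total = sum(w * t for w, t in zip(weights, terms))
    return 100.0 * total / sum(weights)


good = match_score(node_dist=1, physical_dist=10, feature_dist=0.2,
                   area_a=50, area_b=45)
poor = match_score(node_dist=3, physical_dist=60, feature_dist=1.0,
                   area_a=50, area_b=20)
print(round(good, 1), round(poor, 1))
```

With such a score, candidate matches can be sorted from best to worst and a minimum percentage can be used to reject a candidate that is merely the closest one available.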


Since there is no precise rule on the distribution of
similar shapes apart from the fact that their vectors are
grouped closely in the SOFM, the method used by CALUMeT is
to add the most probable matches to the shape and then
compute their respective scores. The "matches" are sorted
by their score so that the best matching shape is at the
top of the list. Different methods have been implemented in
the application, each having a set of advantages and
disadvantages. The methods are:
1) SOM Radius
This method consists of searching for the nearest neighbour
of the node to which a shape is matched. The radius of
search is increased with each iteration of the search,
starting with the node itself, until a match is found.
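The SOM radius search can be sketched as an expanding ring search on the map grid. The sketch assumes a rectangular grid addressed by (row, col), rings defined by the Chebyshev distance to the starting node, and a dictionary (here called `candidates_by_node`, a hypothetical name) that maps each node to the ROI of the other field assigned to it.

```python
def som_radius_search(start, candidates_by_node, rows, cols):
    """Return (radius, candidates) for the first ring that holds candidates.

    The search starts at the node itself (radius 0) and grows the radius by
    one at each iteration. Returns (None, []) if the map holds no candidate.
    """
    for radius in range(max(rows, cols)):
        found = []
        for r in range(start[0] - radius, start[0] + radius + 1):
            for c in range(start[1] - radius, start[1] + radius + 1):
                on_ring = max(abs(r - start[0]), abs(c - start[1])) == radius
                if on_ring and 0 <= r < rows and 0 <= c < cols:
                    found.extend(candidates_by_node.get((r, c), []))
        if found:
            return radius, found
    return None, []


# Observation ROI already classified on a 5 x 8 map (made-up positions).
observed = {(0, 0): ["obs_A"], (2, 5): ["obs_B"], (3, 4): ["obs_C"]}
print(som_radius_search((2, 3), observed, rows=5, cols=8))
```

Because the search always returns the closest occupied node, it yields a candidate even for a ROI that has no true equivalent, which is why a score is needed to accept or reject the result.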

In the neighbourhood kernel (8), d is the distance to the
winning node, σ(t) is the width of the kernel, which
decreases in time, and α(t) is the learning rate of the
neighbourhood nodes. This type of neighbourhood is called
Gaussian, meaning that a node is adapted more strongly the
closer it is to the winning node (following a Gaussian
distribution). It is also possible to use a "bubble"
neighbourhood, where all nodes in the radius are adapted
equally, but this technique results in less stable maps.
The learning rate can be a function of time or can be
specified by the user. The technique used by CALUMeT is to
undergo two training stages with a different learning rate
for each stage, starting with a larger rate and finishing
with a smaller one to obtain a stable map. The default
values used are 0.1 and 0.02.
The input vectors used in the training stage are extracted
from the ROI of the forecast data field. Once the SOFM is
sufficiently stabilized, the feature vectors of the ROIs
from the observation data field are given as inputs to be
classified. Based on the class each vector is assigned to,
the similarities of the objects in the same class are
computed to give a pertinent evaluation.
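The training procedure can be sketched as below: random (bootstrap) selection of input vectors, a Gaussian neighbourhood kernel, and two stages with learning rates 0.1 and 0.02. Several details are assumptions and not taken from the paper: the rectangular grid coordinates, a learning rate held constant within each stage, a kernel width that shrinks linearly from a starting value to a floor of 0.3, and the starting widths themselves.

```python
import math
import random


def best_node(weights, v):
    """Grid position of the node whose weight vector is closest to v."""
    return min(weights,
               key=lambda node: sum((a - b) ** 2 for a, b in zip(weights[node], v)))


def train_som(data, rows, cols, stages=((0.1, 2.0), (0.02, 1.0)),
              iters_per_vector=1000, seed=0):
    """Train a rows x cols map; `stages` holds (learning rate, initial width)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    n_iter = iters_per_vector * len(data)
    for alpha, sigma_start in stages:
        for t in range(n_iter):
            v = rng.choice(data)                     # bootstrap sampling
            win = best_node(weights, v)
            sigma = max(sigma_start * (1.0 - t / n_iter), 0.3)
            for node, w in weights.items():
                d2 = (node[0] - win[0]) ** 2 + (node[1] - win[1]) ** 2
                h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian kernel
                for k in range(dim):
                    w[k] += h * (v[k] - w[k])        # update rule of equation (7)
    return weights


# Made-up, already standardized feature vectors from the forecast field.
forecast_vectors = [[0.10, 0.12], [0.08, 0.15], [0.12, 0.09],
                    [0.90, 0.88], [0.92, 0.85], [0.87, 0.91]]
som = train_som(forecast_vectors, rows=2, cols=3)

# Observation vectors are classified on the trained map without further training.
for observation in ([0.11, 0.10], [0.89, 0.90]):
    print(observation, "->", best_node(som, observation))
```

The map here has as many nodes as training vectors and is rectangular rather than square, in line with the sizing guidelines given in the Classification section.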

2) SOM Radius - physical distance
This method is derived from the SOM radius method. It is
basically the same algorithm, but each candidate ROI is
tested with regard to the physical distance separating it
from the current ROI. The test distance is six times the
largest dimension (width or height) of the current ROI. If
the evaluated ROI passes the test, it is kept in the list.
Otherwise, the search continues. This method, while
suppressing a number of false positives that the "SOM
radius" method produces, may miss some regions having a
large translation vector. However, in the field of
meteorology, some distances are too large to be acceptable.
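The physical distance test can be sketched as follows. The limit of six times the largest dimension of the current ROI comes from the text; measuring the separation between the centroids of the two ROI, in grid units, is an assumption, as are the function names.

```python
import math


def centroid(pixels):
    """Geometrical centroid of a ROI given as (row, col) pixels."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def largest_dimension(pixels):
    """Larger of the width and the height of the ROI bounding box."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return max(max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)


def passes_distance_test(current, candidate, factor=6):
    """True if the candidate lies within factor x largest dimension."""
    limit = factor * largest_dimension(current)
    (r1, c1), (r2, c2) = centroid(current), centroid(candidate)
    return math.hypot(r1 - r2, c1 - c2) <= limit


def filter_candidates(current, candidates):
    """Keep only the candidates that pass the physical distance test."""
    return [roi for roi in candidates if passes_distance_test(current, roi)]


current = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]   # 3 wide, 2 high
near = [(r, c + 10) for r, c in current]
far = [(r, c + 40) for r, c in current]
print(len(filter_candidates(current, [near, far])))
```

Here the current ROI is 3 grid points wide, so the limit is 18 grid points: the candidate displaced by 10 points is kept and the one displaced by 40 points is discarded.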

3) Score-Layer
This method starts the same way as SOM radius. Each ROI
evaluated is then added to a list and its score is
computed. However, at first only the shapes in the same
threshold layer are analyzed. If no match is sufficiently
satisfying (a score of 80 % or more), the search continues
through the other layers until a match is found or the
layer limit is reached.
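A minimal sketch of the layer-by-layer part of this search is given below. It assumes that threshold layers are numbered with integers, that layers are visited by increasing distance from the layer of the current ROI, and that the scoring function is supplied by the caller; the names `score_layer_search`, `candidates_by_layer` and `layer_limit` are hypothetical. Only the 80 % acceptance level comes from the text.

```python
def score_layer_search(own_layer, candidates_by_layer, score_fn,
                       min_score=80.0, layer_limit=2):
    """Return (candidate, score, layer offset) of the first acceptable match.

    The ROI's own layer is examined first (offset 0), then the layers one
    step away, and so on up to `layer_limit`. Returns (None, None, None)
    if no candidate reaches `min_score`.
    """
    for offset in range(layer_limit + 1):
        layers = [own_layer] if offset == 0 else [own_layer - offset,
                                                  own_layer + offset]
        scored = [(score_fn(candidate), candidate)
                  for layer in layers
                  for candidate in candidates_by_layer.get(layer, [])]
        if scored:
            best_score, best = max(scored, key=lambda item: item[0])
            if best_score >= min_score:
                return best, best_score, offset
    return None, None, None


# Made-up candidates and scores, for illustration only.
candidates = {0: ["a"], 1: ["b", "c"]}
scores = {"a": 62.0, "b": 85.0, "c": 91.0}
print(score_layer_search(0, candidates, scores.get))
```

In this example the only candidate of the ROI's own layer scores below 80 %, so the search moves to the neighbouring layer and returns its best candidate.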
It is interesting to note that the inclusion of the shapes
of different threshold layers in the training of the map
helps with the clustering of similar shapes. It effectively
translates into a bigger set of input vectors and therefore
gives a broader range of ROI representing their respective
category (since a small difference in the
E. Match score evaluation
The goal of using a score to grade a match is to be able to
decide which ROIs are most likely to be the same, and
therefore to select the correct shape if multiple matches
are seen and to reject an invalid match if it is the only
one. Indeed, even if a ROI has no equivalent, a match will
be assigned to it, since the search is for the closest ROI
assigned in the SOFM. It is therefore vital for CALUMeT to
be able to detect such cases.


threshold value does not change the nature of the ROI
drastically). However, it is hard to quantify the gains from
such an approach.
F. Analysis
The final analysis presented to the user includes a summary
of all the matched ROI with the differences of their
respective features, as well as a visual representation of
the matched objects on their respective maps. The score is
displayed as well, and the ratio of the areas is computed to
evaluate how the shapes overlap. These ratios are the
percentage of shape A inside shape B and the percentage of
shape A outside shape B (and vice versa).
Additionally, unmatched objects can also be visualized with
a diagnosis explaining why these objects have not been
matched.

III. CONCLUSION AND FUTURE WORK

This paper introduced CALUMeT, a framework for the
verification of air quality forecasts. The "object-based"
approach used by CALUMeT offers very encouraging results,
which are promising for the applicability of the presented
solution.
By isolating each ROI, extracting their features and
classifying them with a SOFM, CALUMeT provides preliminary
results that show a high rate of success in ROI matching at
a low computation cost. Working with small maps (with 30 to
80 nodes) does not require high computation power while
offering an effective classification method.
Furthermore, the unsupervised nature of the SOFM gives much
flexibility to the CALUMeT framework, making it a valuable
complement to the statistical verification process.
However, CALUMeT is very sensitive to changes in its
parameters. Therefore, a more complete evaluation of the
impact of these parameters ought to be done. Techniques to
determine the optimal parameters have to be developed
because they will greatly improve the stability and the
precision of CALUMeT.
Based on the experimental results of the present algorithm,
the features used will be re-evaluated to ensure they do not
degrade the classification process. The modification of the
scaling of components may also be necessary to give better
results. However, no simple rule for the optimization of
scaling is available and such modifications must be
determined heuristically. Furthermore, the use of
meta-heuristics, like genetic algorithms or tabu search, to
determine the optimal parameters and the best invariants
will be explored.
Also, new features will be added depending on the results
they provide in the classification of shapes and images. The
CALUMeT framework is developed to be extendable so that such
additions can be easily made.
The extension of ROI from 2D to 3D maps is also a logical
step in the development of CALUMeT. By using meteorological
and air quality fields at different altitudes, it will be
possible to construct a unique vector describing a ROI in
three dimensions.
Once completed, the CALUMeT framework will help to evaluate
air quality models and will make it possible to evaluate the
applicability of the present technique to many fields of the
meteorological world, such as precipitation forecast and
cloud modelling.

ACKNOWLEDGMENTS
The authors would like to thank the Air Quality Modelling
and Application Division of Environment Canada, more
precisely Veronique Bouchet, Sophie Cousineau, Michel Jean,
Richard Moffet, Jacinthe Racine, Alain Robichaud and Mourad
Sassi as well as Jean-Philippe Gauthier-Bilodeau and Serge
Trudel for their help and support in this project.

REFERENCES

[1] Bribiesca, E., Measuring 2D shape compactness using
the contact perimeter, Computer Math. Applications,
33(11):1-9, 1997
[2] Brown, B. G., Davis, C. A. & Bullock, R.,
Verification methods for objects and fields: Standard
methods, issues and enhanced approaches, 2002
[3] Bullock, R., Brown, B. G., Davis, C. A., Chapman,
M., Manning, K. W., Morss, R., An Object-Oriented
Approach to the Verification of Quantitative
Precipitation Forecasts: Part I and II, 2004
[4] Deboeck, G., Data Mining with self-organizing maps,
PCAI magazine, Vol. 13.5, 1999
[5] Kohonen, T., Hynninen, J., Kangas, J., Laaksonen,
J., SOM_PAK: The Self-Organizing Map Program
Package, Helsinki University of Technology, 1996
[6] Maragos, P., Pattern Spectrum and Multiscale Shape
Representation, IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 11, No. 7, July 1989
[7] Pudykiewicz, J., A. Kallaur, and P.K.
Smolarkiewicz, Semi-Lagrangian modelling of
tropospheric ozone, Tellus, 49B, 231-258, 1997
[8] Sossa Azuela, J. H., Notes for the lecture Invariant
Object Recognition, Instituto Politecnico Nacional,
Centro de Investigacion en Computacion, 2001
[9] Robichaud, A. and Menard, R., Verification and
application of an objective analysis scheme for surface
ozone, Third Canadian Workshop on Air Quality, 2004
