
Analytica Chimica Acta 671 (2010) 27–35


Gasoline classification using near infrared (NIR) spectroscopy data: Comparison of multivariate techniques

Roman M. Balabin a,*, Ravilya Z. Safieva b, Ekaterina I. Lomakina c

a Department of Chemistry and Applied Biosciences, ETH Zurich, 8093 Zurich, Switzerland
b Gubkin Russian State University of Oil and Gas, 119991 Moscow, Russia
c Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, 119992 Moscow, Russia

Article history: Received 8 April 2010; Received in revised form 6 May 2010; Accepted 7 May 2010

Keywords: Discriminant analysis (LDA, QDA, RDA); Petroleum (crude oil); Biofuel (biodiesel, bioethanol, ethanol–gasoline fuel); Soft independent modeling of class analogy; K-Nearest neighbor method; Support vector machine; Probabilistic neural network; Near infrared spectroscopy

Abstract

Near infrared (NIR) spectroscopy is a non-destructive (vibrational spectroscopy based) measurement technique for many multicomponent chemical systems, including products of petroleum (crude oil) refining and petrochemicals, food products (tea, fruits, e.g., apples, milk, wine, spirits, meat, bread, cheese, etc.), pharmaceuticals (drugs, tablets, bioreactor monitoring, etc.), and combustion products. In this paper we have compared the abilities of nine different multivariate classification methods: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), regularized discriminant analysis (RDA), soft independent modeling of class analogy (SIMCA), partial least squares (PLS) classification, K-nearest neighbor (KNN), support vector machines (SVM), probabilistic neural network (PNN), and multilayer perceptron (ANN-MLP) – for gasoline classification. Three sets of near infrared (NIR) spectra (450, 415, and 345 spectra) were used for classification of gasolines into 3, 6, and 3 classes, respectively, according to their source (refinery or process) and type. The 14,000–8000 cm−1 NIR spectral region was chosen. In all cases NIR spectroscopy was found to be effective for gasoline classification purposes, when compared with nuclear magnetic resonance (NMR) spectroscopy or gas chromatography (GC). The KNN, SVM, and PNN classification techniques were found to be among the most effective ones. The artificial neural network (ANN-MLP) approach based on principal component analysis (PCA), which was believed to be efficient, has shown much worse results. We hope that the results obtained in this study will help both further chemometric (multivariate data analysis) investigations and investigations in the sphere of applied vibrational (infrared/IR, near-IR, and Raman) spectroscopy of sophisticated multicomponent systems.

© 2010 Published by Elsevier B.V.

1. Introduction

Near infrared (NIR) spectroscopy is a non-destructive measurement technique for many chemical compounds, including products of petroleum refining and petrochemicals, food products (tea, fruits, milk, meat, etc.), pharmaceuticals (drugs, tablets, bioreactor monitoring, etc.), combustion products, and many others [1–4]. It has proved its efficiency for laboratory and industrial applications [1,2]. Many current studies are focused on NIR sensor design for on-line [3] or portable operation [4].

One of the possible applications of near infrared spectroscopy is in the petroleum industry [5–12], because many products of petroleum refining and petrochemicals consist of hydrocarbons, whose content can be estimated by the NIR method. Near infrared spectroscopy (Tables 1 and 2) is useful for overcoming many limitations, especially in a complicated real process, where on-line measuring is important to monitor the quality of products (e.g., gasoline production) [13].

Multivariate statistical techniques have boosted the use of NIR instruments [13,14]. Only chemometric methods are able to process the enormous amounts of sophisticated experimental data provided by the NIR technique. Many calibration methods have been used for gasoline quality prediction [5–12,15]. Linear techniques, "quasi-nonlinear" techniques, and "truly" nonlinear techniques have all been used for the prediction of gasoline properties and quality coefficients [5–12].

Although gasoline (or other products of petroleum refining) classification is also a practically important task, not many papers are devoted to this problem [11,16–18]. In this paper we have tried to evaluate the efficiency of different classification methods for gasoline classification by source and type.

Gasoline identification by source (refinery) is important for both quality control and detection of gasoline adulteration. It is based on differences in the crude oil that result in chemical differences between gasolines from different refineries.

* Corresponding author. Tel.: +41 44 632 4783. E-mail address: balabin@org.chem.ethz.ch (R.M. Balabin).


Gasoline type is needed for the classification of gasolines by quality (and price). The process of gasoline production defines the chemical composition of a gasoline fraction. This difference is an important factor for gasoline mixing.

In this paper three gasoline sample sets were used to compare the prediction capabilities of linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), regularized discriminant analysis (RDA), soft independent modeling of class analogy (SIMCA), partial least squares (PLS) classification, K-nearest neighbor (KNN), support vector machines (SVM), probabilistic neural network (PNN), and multilayer perceptron (MLP). All classifications are based on gasoline near infrared (NIR) spectroscopy data.

Table 1
Gasoline sample sets.

                          | Set A                              | Set B                                                                                  | Set C
Classification by         | Source (refinery^a)                | Source (process^b)                                                                     | Type
Number of samples         | 150                                | 117                                                                                    | 115
Number of classes         | 3                                  | 6                                                                                      | 3
Classes                   | Refinery 1; Refinery 2; Refinery 3 | Straight-run; reformate; catalysate; isomerizate; hydrocracking gasoline; mixture^c    | Normal^d; Regular; Premium
Distribution of samples^e | 50; 50; 50                         | 30; 12; 15; 13; 12; 35                                                                 | 55; 45; 15

^a Refinery 1, Orsknefteorgsintez (Russia); Refinery 2, Kirishinefteorgsintez (Russia); Refinery 3, Astrakhan'gazprom (Russia).
^b For details one can consult Refs. [46,47].
^c All gasoline fractions (straight-run, reformate, catalysate, isomerizate, hydrocracking) are presented in mixtures.
^d Normal (Russia) – regular gasoline with octane number of 80.
^e Distribution of samples is given according to the classes above.

Table 2
NIR experimental parameters.

                                    | Set A       | Set B       | Set C
Spectral range (cm−1)               | 14,000–8000 | 14,000–8000 | 13,500–8500
Sample volume (cm3)                 | 100         | 100         | 100
Optical path (cm)                   | 10          | 10          | 10
Spectral resolution (cm−1)          | 16          | 8           | 8
Number of scans                     | 64          | 64          | 72
Cell material                       | Quartz      | Quartz      | Quartz
Time of one measurement (min)       | 2–3         | 3–4         | 2–3
Number of measurements (per sample) | 3           | 3–5         | 3

No spectra preprocessing was used.

2. Brief descriptions of the classifiers

In this section we give some basic remarks about the methods used. References to more detailed papers are also provided.

2.1. Linear discriminant analysis (LDA)

Among the many possible techniques for data classification, linear discriminant analysis (LDA) is one of the most commonly used. LDA is used to find the linear combination of features which best separates two or more classes of objects or events. The resulting combinations may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. This method maximizes the ratio of between-class variance to within-class variance in any particular data set, thereby guaranteeing maximal separability. Details about LDA can be found in Refs. [19–21].

2.2. Quadratic discriminant analysis (QDA)

Quadratic discriminant analysis (QDA) is closely related to LDA (see above). Unlike LDA, however, in QDA there is no assumption that the covariance of each of the classes is identical. When the assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test.

In QDA each of the covariance matrices is estimated separately, which requires a larger sample than that used in LDA to reach the same level of reliability in the estimations and hence in the predictions. QDA details can be found in Refs. [19–21].

2.3. Regularized discriminant analysis (RDA)

Regularized discriminant analysis (RDA) can be called a "compromise" between LDA and QDA. RDA introduces a double type of bias in the estimation of the covariance matrices of each class with the aim of stabilizing the prediction in the case of a deficient estimation of their elements. This is a very frequent problem encountered with real data because the estimations of the matrices are very sensitive to the presence of different data. For a detailed analysis one can consult Refs. [21,22].

2.4. Soft independent modeling of class analogy (SIMCA)

Soft independent modeling of class analogy (SIMCA) is a method for supervised data classification that requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term "soft" refers to the fact that the classifier can identify samples as belonging to multiple classes, not necessarily producing a classification of samples into non-overlapping classes.

In order to build the classification models, the samples belonging to each class are analyzed using principal component analysis (PCA). For a given class, the resulting model then describes a line, plane or hyperplane. For each modeled class, the mean orthogonal distance of training data samples from the line, plane or hyperplane (calculated as the residual standard deviation) is used to determine a critical distance for classification.

New observations are projected into each PC model and the residual distances calculated. An observation is assigned to the model class when its residual distance from the model is below the statistical limit for the class. The observation may be found to belong to multiple classes, and a measure of goodness of the model can be found from the number of cases where observations are classified into multiple classes. Details about SIMCA model building and usage can be found in Ref. [23].

2.5. Partial least squares (PLS) regression and classification

Partial least squares (PLS) regression is an extension of the multiple linear regression model. The PLS method has found widespread use for multivariate analysis of spectral data. The main goal of PLS usage is calibration model building, but this technique can also be applied for classification purposes. Some information about the idea of PLS classification can be found in Section 3.6.3 and Refs. [24–27].
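As a brief illustration of the PLS classification idea outlined above (and detailed in Section 3.6.3), the sketch below encodes each class as a binary indicator vector, fits an ordinary PLS regression to the indicator matrix, and assigns new samples to the class with the highest predicted score. This is only an outline under stated assumptions: the original models were built with the MATLAB PLS Toolbox (Section 3.3), whereas the sketch uses Python with scikit-learn, and the names X, y and n_lv are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fit_pls_classifier(X, y, n_lv=14):
    """Fit a PLS model to a binary (one-hot) encoding of the class labels."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # N x n_classes indicator matrix
    pls = PLSRegression(n_components=n_lv)
    pls.fit(X, Y)
    return pls, classes

def predict_pls_classifier(pls, classes, X_new):
    """Assign each new spectrum to the class with the largest predicted score.

    Section 3.6.3 additionally applies a delta threshold to the normalized
    j-score instead of a plain argmax; the argmax is used here for brevity.
    """
    scores = pls.predict(X_new)              # continuous scores, one column per class
    return classes[np.argmax(scores, axis=1)]
```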

Table 3
Building parameters for K-nearest neighbor (KNN) classification model.

Weighting   | Formula^a                                              | Parameters     | Number of neighbors (K)
None^b      | –                                                      | –              | K = 1–50
Weighting 1 | w(x_i) = exp(−D(x_i, p))                               | –              | K = 1–35
Weighting 2 | w(x_i) = D(x_i, p)^(−q)                                | q = 1–5        | K = 1–35
Weighting 3 | w(x_i) = 1/(D(x_i, p)^2 + ε)                           | ε = 0.001–10   | K = 1–40
Weighting 4 | w(x_i) = (D(x_i, p) + 1)^(−1)                          | –              | K = 1–35
Weighting 5 | w(x_i) = 1 if D(x_i, p) ≤ d_0; w(x_i) = 0 if D(x_i, p) > d_0 | d_0 = 0.001–10 | K = 1–40

^a Two types of distance D(x_i, p) between vectors x_i and p were used: D(x_i, p) = [Σ_{l=1}^{L} (x_{i,l} − p_l)^2]^{1/2} and D(x_i, p) = Σ_{l=1}^{L} |x_{i,l} − p_l|.
^b "Voting" KNN.
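To make the weighting schemes of Table 3 concrete (they are used by the K-nearest neighbor method described in Section 2.6 below), here is a minimal NumPy sketch of weighted KNN prediction for a single spectrum. It is an illustration rather than the MATLAB implementation used in this work; Weighting 3 of Table 3 is taken as the example weight function, and setting all weights to 1 recovers "voting" KNN.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, K=5, eps=0.1):
    """Weighted KNN prediction for one spectrum x (cf. Table 3, Weighting 3)."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances D(x_i, p)
    nearest = np.argsort(d)[:K]               # indices of the K closest training samples
    w = 1.0 / (d[nearest] ** 2 + eps)         # w(x_i) = 1 / (D^2 + eps)
    scores = {}
    for idx, wi in zip(nearest, w):           # accumulate weights per class label
        scores[y_train[idx]] = scores.get(y_train[idx], 0.0) + wi
    return max(scores, key=scores.get)        # class with the largest total weight
```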

2.6. K-Nearest neighbor (KNN)

The K-nearest neighbor (KNN) method was first introduced by Fix and Hodges [28].

In this method a distance (Euclidean or other) is assigned between all points in a data set. The K closest data points (K being the number of neighbors) are then found by analyzing (sorting) the distance matrix and are examined to determine which class label is the most common among the set ("voted" KNN). The most common class label is then assigned to the data point being analyzed.

Other variants of KNN ("weighted" KNN) are also possible. In these methods different neighbors have a different "weight", i.e., ability to influence the class of the sample, according to their distances from the sample. Some remarks about the possible types of weighting functions are given in Table 3.

For a KNN classifier, it is necessary to have a training set which is not too small, and a good discriminating distance. KNN performs well in multi-class simultaneous problem solving. There exists an optimal choice of the parameter K which gives the best performance of the classifier; this value of K is often approximately close to √N [31,32] (see also Table 3). For a detailed analysis one can consult Refs. [29–32].

2.7. Support vector machines (SVM)

Support vector machines were developed by Vapnik [33]. Support vector machines (SVM) are based on some "beautifully simple ideas" [31] and provide a clear intuition of what learning from examples is all about. SVM also show high performance in practical applications [31,34–37].

In very simple terms, an SVM corresponds to a linear method in a very high dimensional feature space that is nonlinearly related to the input space. Even though we think of it as a linear algorithm in a high dimensional feature space, in practice it does not involve any computations in that high dimensional space. By the use of kernels (see Table 4), all necessary computations are performed directly in the input space [38–40].

Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data, and the distance between the two parallel hyperplanes is maximized. An assumption is made that the larger the margin or distance between these parallel hyperplanes, the better the generalization error of the classifier will be. Details about the SVM classifier can be found in Refs. [31,34–40].

2.8. Probabilistic neural network (PNN)

The probabilistic neural network (PNN) was developed by Specht [41,42]. The probabilistic neural network provides a general solution to pattern classification problems by following an approach developed in statistics, called Bayesian classifiers. PNN uses a supervised training set to develop distribution functions within a pattern layer. These functions, in the recall mode, are used to estimate the likelihood of an input feature vector being part of a learned category, or class. The learned patterns can also be combined, or weighted, with the a priori probability, also called the relative frequency, of each category to determine the most likely class for a given input vector. For a detailed analysis one can consult Refs. [41,42].

2.9. Multilayer perceptron (MLP)

The ideology of multilayer perceptron (MLP) usage for the classification problem is similar to PLS classification (see Section 3.6.7 and Ref. [24]). The main dissimilarity between MLP and PLS is the ability of MLP to solve nonlinear problems [43,44]. For a detailed analysis one can consult Refs. [24,44].

3. Experimental

3.1. Sample sets

Three gasoline sample sets (Sets A–C) were used in our work to evaluate the efficiency of different classification methods. Information about them can be found in Table 1. All gasolines are unleaded Russian gasolines without additives; they were provided by refineries. Table 1 also provides information about the gasoline classes and the distribution of samples over them. Property prediction (using MLP) for some of these gasolines was already provided in Refs. [5,8].

Table 4
Building parameters for support vector machine (SVM) classification model.

Kernel     | Formula                                   | Kernel parameters | Cost parameter C
Linear     | K(x_i, x_j) = x_i·x_j + 1                 | –                 | C = 0.001–100
Polynomial | K(x_i, x_j) = (x_i·x_j + 1)^p             | p = 2–8           | C = 0.001–100
Gaussian   | K(x_i, x_j) = exp(−‖x_i − x_j‖^2 / σ^2)   | σ = 0.01–10       | C = 0.005–50
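The kernels and cost parameter of Table 4 (Section 2.7) can be explored, for example, with a grid search over PCA-compressed spectra. The sketch below assumes Python with scikit-learn rather than the MATLAB SVM toolbox used in this work; note that scikit-learn writes the RBF kernel as exp(−γ‖x_i − x_j‖²), so γ plays the role of 1/σ², and the listed grid values are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Pipeline: PCA compression of the NIR spectra followed by a kernel SVM.
pipe = Pipeline([("pca", PCA(n_components=20)), ("svm", SVC())])

# Kernel and cost-parameter grid loosely mirroring Table 4 (illustrative values).
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.001, 0.1, 1, 10, 100]},
    {"svm__kernel": ["poly"], "svm__degree": [2, 3, 4], "svm__C": [0.001, 0.1, 1, 10, 100]},
    {"svm__kernel": ["rbf"], "svm__gamma": [0.01, 0.1, 1, 10], "svm__C": [0.005, 0.5, 5, 50]},
]

search = GridSearchCV(pipe, param_grid, cv=5)  # 5-fold cross-validation (Section 3.4)
# search.fit(X, y)  # X: spectra (or PC scores), y: class labels
```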

Fig. 1. Example of gasoline near infrared (NIR) spectra. Maximal transmittance is 5 a.u.

3.2. Apparatus and experimental parameters

The NIR spectra were acquired with a Near-IR FT spectrometer InfraLUM FT-10 (LUMEX, Russia) fitted with a special sampler for liquids (see Table 2 in Ref. [5]). The spectra were acquired at room temperature (20–23 °C). Experimental parameters (for all data sets) are shown in Table 2.

The InfraLUM FT-10 (like most NIR spectrometers) has no thermostatic control. In order to compensate for this shortcoming, the background spectrum was taken before and after the measurement of each spectrum; the averaged background spectrum was then subtracted from the sample spectrum. This allowed an analytical signal with satisfactory accuracy and precision to be obtained (Fig. 1). The instrument calibration (wavelength and transmittance values) was performed using four pure hydrocarbons: toluene, hexane, benzene, and isooctane. It was repeated at least once per day to ensure the experimental setup stability and data accuracy (reproducibility).

3.3. Software and computing

MATLAB 7.4 was used as the standard software for the realization of the classification methods. The following toolboxes were used:

• MATLAB Statistics Toolbox.
• MATLAB Support Vector Machine Toolbox.
• MATLAB Neural Network Toolbox.
• N-way Toolbox for MATLAB.
• PLS Toolbox Version 4.0.

Standard programs of these toolboxes were modified and extended by Balabin (see also Refs. [5,6,8,9]).

Fig. 2. Optimization of partial least squares (PLS) classification model: dependence of classification error (see Eq. (1)) on the δ-parameter (see Section 3.6.3). Example parameters: Set A; LV = 14. Using different sample set divisions, the accuracy of the δ-parameter can be estimated as ±0.1 (95% confidence interval).

3.4. Model efficiency estimation

To characterize the prediction ability (efficiency) of the created classification model, the error of cross-validation (E) was used:

E = N_wrong / N_0 = (N_0 − N_right) / N_0    (1)

where N_wrong and N_right refer to the number of wrongly and rightly classified samples, and N_0 is the total number of samples in the validation set.

Five-fold cross-validation was used to evaluate the model efficiency. It was checked that the validation set contained samples of all gasoline classes. Other variants of the cross-validation procedure (10-fold, 7-fold, etc.) were checked and found to produce almost identical results.
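A small sketch of how the cross-validation error of Eq. (1) can be computed is given below. It assumes Python with scikit-learn and NumPy arrays (the original computations were performed in MATLAB), uses stratified folds so that every validation set contains all gasoline classes, and accepts any classifier object `model` with fit and predict methods.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_error(model, X, y, n_folds=5):
    """Cross-validation error E = N_wrong / N_0 (Eq. (1)), pooled over stratified folds."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    n_wrong = 0
    for train_idx, val_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[val_idx])
        n_wrong += np.sum(y_pred != y[val_idx])   # wrongly classified validation samples
    return n_wrong / len(y)                       # N_wrong / N_0
```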

3.5. Spectra reduction

In order to create an effective and robust classification model, the spectral data (which have up to 780 independent variables) should be reduced. Two common ways of data reduction, spectra averaging and principal component analysis (PCA), were used by us to achieve this goal. The results are reported for the most effective data reduction technique.

Fig. 3. Optimization of partial least squares (PLS) classification model: dependence of classification error (see Eq. (1)) on the number of latent variables (LV). Example parameters: Set A; δ = 0.7. Since the LV is an integer number, the expected accuracy is ±1.

3.6. Optimization of classifiers

To compare different classification models, their best results should be obtained. The result of model usage depends on the model parameters (see below). We have used wide ranges of the models' parameters to achieve the best results.

3.6.1. Discriminant analysis (LDA, QDA, RDA)

The optimal number of principal components (PC; see Section 3.5) was found for each method of discriminant analysis: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and regularized discriminant analysis (RDA). The principal component ranges were 1–50, 1–50, and 1–40 for LDA, QDA, and RDA, respectively. The parameter α ∈ [0;1] that controls the RDA model complexity was also optimized [21,22].
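The reduction and optimization strategy of Sections 3.5 and 3.6.1 (compress the spectra by PCA, scan the number of retained principal components, and keep the setting with the lowest cross-validation error) can be sketched as follows. This is again an illustrative Python/scikit-learn outline rather than the MATLAB procedure actually used; cv_error is the helper sketched after Eq. (1), and LDA is taken as the example classifier.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

def scan_n_components(X, y, pc_range=range(1, 51)):
    """Scan the number of retained PCs for an LDA classifier (cf. Section 3.6.1)."""
    errors = {}
    for n_pc in pc_range:
        model = Pipeline([("pca", PCA(n_components=n_pc)),
                          ("lda", LinearDiscriminantAnalysis())])
        errors[n_pc] = cv_error(model, X, y)   # Eq. (1) via 5-fold cross-validation
    best = min(errors, key=errors.get)         # number of PCs with the lowest error
    return best, errors
```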

3.6.2. Soft independent modeling of class analogy (SIMCA)

The optimal number of principal components (PC) or the optimal level of spectra averaging was found for the soft independent modeling of class analogy (SIMCA) method. The principal component range was from 1 to 30; spectra averaging up to 4 times was applied.

3.6.3. Partial least squares (PLS)

For PLS classification the class (one of N) of each sample was coded by N binary strings. In other words, class j was represented as the vector (0, …, 0, 1, 0, …, 0), where the single 1 stands in position j, preceded by j − 1 zeros and followed by N − j zeros.

The N predicted scores (by the ordinary PLS method) were used to predict the class of each sample. If the j-score (normalized) was greater than or equal to the parameter δ, the predicted class of the sample was j. The optimal value of the parameter δ (δ = 0.3–0.9) and the optimal number of latent variables (LV = 1–25) were found for each data set (Figs. 2 and 3).

3.6.4. K-Nearest neighbor (KNN)

Different types of the K-nearest neighbor (KNN) method (voted and weighted by different weighting functions) were used to get the best result (Table 3).

3.6.5. Support vector machines (SVM)

Different types of kernel functions (and the corresponding parameters) were checked for the optimization of the SVM classification model (Table 4). The cost parameter C [31,34–36] was also optimized (Table 4).

3.6.6. Probabilistic neural network (PNN)

The optimal PNN spread value (the distance an input vector must be from a neuron's weight vector for the response to be 0.5 [45]) was found in the range of 10^−3 to 10^−7 (Fig. 4). The number of input neurons for PNN was equal to the number of PC (1–50) or the size of the averaged spectra (152–780).

Fig. 4. Optimization of probabilistic neural network (PNN) classification model: dependence of classification error (see Eq. (1)) on spread value (see Section 3.6.6). Example parameters: Set B; PC = 39. Using different sample set divisions, the accuracy of the spread value can be estimated as ±0.013 (95% confidence interval).

3.6.7. Multilayer perceptron (MLP)

Artificial neural network (ANN) analysis of the spectral data was also performed with a multilayer perceptron (MLP). Similar to the PLS analysis, each class was coded by N binary strings (see Section 3.6.3). A three-layer MLP (structure: PC–NN–N) with input, hidden, and output layers was used. Each weight of the bond that connects hidden and input units or output and hidden units was trained by backpropagation. The prediction of the class of individual samples was performed using the N predicted output scores (see also PLS). Training parameters for the multilayer perceptron are shown in Table 5. The optimal value of the parameter δ (δ = 0.3–0.9) was also found for each data set.

4. Results and discussion

4.1. Gasoline classification by refinery (Set A)

Table 6 presents the results of Set A classification. The error of classification (according to Eq. (1)) varies greatly, from 0% for the SVM and PNN methods to 11–14% for LDA, QDA, RDA, and SIMCA. The average errors of classification for the different refineries are close: 7, 9, and 7% for Refineries 1–3, respectively.

The methods can be separated into the following groups (according to the error of classification): highly effective methods (KNN, SVM, and PNN), methods of medium effectiveness (MLP and PLS), and methods of low effectiveness (SIMCA and the discriminant analysis methods). For these three groups the average errors are 1 ± 2, 9 ± 1, and 13 ± 1%, respectively.

Table 5
Training parameters of multilayer perceptron (MLP).

                                                | Set A                     | Set B                     | Set C
Algorithm                                       | Constant learning rate    | Constant learning rate    | Constant learning rate
Minimised error function                        | Mean squared error (MSE)  | Mean squared error (MSE)  | Mean squared error (MSE)
Learning                                        | Supervised                | Supervised                | Supervised
Initialisation method                           | Random                    | Random                    | Random
Transfer function, input layer                  | –^a                       | –                         | –
Transfer function, hidden layer                 | Hyperbolic tangent (tanh) | Hyperbolic tangent (tanh) | Hyperbolic tangent (tanh)
Transfer function, output layer                 | Hyperbolic tangent (tanh) | Hyperbolic tangent (tanh) | Hyperbolic tangent (tanh)
Learning rate                                   | 0.001                     | 0.002                     | 0.02
Maximal number of epochs^b                      | 100,000                   | 100,000                   | 15,000
Early stopping                                  | 5-fold cross-validation   | 5-fold cross-validation   | 5-fold cross-validation
Number of training iterations                   | 250                       | 250                       | 200
Number of input neurons (principal components)  | 1–25                      | 1–25                      | 1–20
Number of hidden neurons                        | 1–25                      | 1–25                      | 1–20

^a No transfer function is used.
^b Was never reached during the training process.
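A rough scikit-learn analogue of the MLP configuration of Section 3.6.7 and Table 5 is sketched below (the original networks were built with the MATLAB Neural Network Toolbox). It reproduces the main choices (tanh hidden layer, constant learning rate, backpropagation-style training, early stopping) but not every detail; for example, MLPClassifier uses a softmax output layer rather than a tanh output layer with a δ threshold, and the default parameter values shown are illustrative.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

def make_mlp(n_pc=20, n_hidden=15, learning_rate=0.001):
    """PCA-compressed spectra fed to a three-layer perceptron (cf. Table 5)."""
    mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,),
                        activation="tanh",             # hidden-layer transfer function
                        solver="sgd",                  # gradient-descent backpropagation
                        learning_rate="constant",
                        learning_rate_init=learning_rate,
                        early_stopping=True,           # hold-out-based early stopping
                        max_iter=100000)
    return Pipeline([("pca", PCA(n_components=n_pc)), ("mlp", mlp)])
```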

Table 6
Classification of gasolines in sample Set A: gasoline source – refinery.

      | Refinery 1  | Refinery 2  | Refinery 3  | Total
LDA   | 45/50 (10%) | 40/50 (20%) | 46/50 (8%)  | 131/150 (13%)
QDA   | 45/50 (10%) | 41/50 (18%) | 44/50 (12%) | 130/150 (13%)
RDA   | 46/50 (8%)  | 44/50 (12%) | 44/50 (12%) | 134/150 (11%)
SIMCA | 44/50 (12%) | 41/50 (18%) | 44/50 (12%) | 129/150 (14%)
N-PLS | 46/50 (8%)  | 46/50 (8%)  | 44/50 (12%) | 136/150 (9%)
MLP   | 44/50 (12%) | 46/50 (8%)  | 48/50 (4%)  | 138/150 (8%)
KNN   | 48/50 (4%)  | 50/50 (0%)  | 48/50 (4%)  | 146/150 (3%)
SVM   | 50/50 (0%)  | 50/50 (0%)  | 50/50 (0%)  | 150/150 (0%)
PNN   | 50/50 (0%)  | 50/50 (0%)  | 50/50 (0%)  | 150/150 (0%)

Values are given as N_right/N_0 (E), where E is the error of 5-fold cross-validation (Eq. (1)), N_right is the number of rightly classified samples, and N_0 is the total number of samples in the validation set. Methods: LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; RDA, regularized discriminant analysis; SIMCA, soft independent modeling of class analogy; N-PLS, multilinear partial least squares regression; KNN, K-nearest neighbor; SVM, support vector machines; PNN, probabilistic neural network; MLP, multilayer perceptron.

One should note that with the highly effective methods of classification (group one), gasoline producers (refineries) can easily be distinguished by NIR spectra.

4.2. Gasoline classification by process (Set B)

The results of Set B classification are shown in Table 7. The error of classification varies from 1% for the PNN method to 35% for LDA. The average errors of classification for the different processes differ considerably (from 8 to 31%). They are: 20, 30, 25, 29, 31, and 8% for straight-run gasoline, reformate, catalysate, isomerizate, hydrocracking gasoline, and their mixture, respectively. One can note that the classification error for gasoline mixtures is much lower.

By the effectiveness of Set B classification, the methods can be separated into three groups (the same as in Section 4.1): highly effective methods (KNN, SVM, and PNN), methods of medium effectiveness (MLP and PLS), and methods of low effectiveness (SIMCA and the discriminant analysis methods). For these three groups the average errors are 4 ± 3, 19 ± 1, and 33 ± 2%, respectively.

The average classification error for Set B (20%) is much higher than for Set A (8%), so the classification of gasoline by process is a much more sophisticated task. The greater number of classes (6 instead of 3) and the lower "samples/classes" ratio (20 instead of 50) should also be taken into account.

Confusion matrices for Set B classification by MLP, KNN, and PNN are presented in Tables 8–10. From these matrices one can conclude that the main problem of effective classification model building is connected with confusion of pure gasoline fractions and their mixtures. This is an obvious problem because gasoline mixtures have different compositions and can sometimes consist of only one or two fractions (e.g., reformate + isomerizate).

One should note that with the highly effective methods of classification model building (SVM and PNN), gasoline fractions (produced by different processes) can easily be distinguished.

4.3. Gasoline classification by type (Set C)

The effectiveness of the different classification models for Set C is shown in Table 11. The range of classification errors for Set C is from 0% (for PNN) to 13% (for RDA). The average errors of classification for the different gasoline types are 6, 6, and 16% for Normal, Regular, and Premium, respectively. The larger Premium classification error can be explained by the lower number of samples, 15 (in comparison with 55 and 45 for Normal and Regular).

The classification models can also be grouped by the effectiveness of Set C classification. KNN, SVM, and PNN form the first (the most effective) group with a classification error of 1 ± 1%; the other models (LDA, QDA, RDA, SIMCA, PLS, and MLP) form another group with a classification error of 11 ± 2%.

The average classification error for Set C (8%) is equal to the error for Set A and is much lower than the error for Set B (20%), so the classification of gasoline by type is an easy task (first of all for the KNN, SVM, and PNN methods).

With the highly effective methods of classification model building (KNN, SVM, and PNN), gasolines of different types can easily be discriminated by NIR spectroscopy.

4.4. Final remarks

The results presented above (Sections 4.1–4.3) show that the nine multivariate methods used in this study can be separated into three groups according to their classification accuracy. The first group ("highly accurate methods") includes the KNN, SVM, and PNN methods; the second group ("methods of medium effectiveness") consists of the MLP and PLS models; the third group ("methods of low effectiveness") includes SIMCA and the discriminant analysis methods (LDA, RDA, and QDA). Only the last data set (Set C) does not give any difference between the last two groups.

Table 7
Classification of gasolines in sample Set B: gasoline source – process.

      | Straight-run | Reformate  | Catalysate  | Isomerizate | Hydrocracking gasoline | Mixture     | Total
LDA   | 20/30 (33%)  | 6/12 (50%) | 8/15 (47%)  | 7/13 (46%)  | 5/12 (58%)             | 30/35 (14%) | 76/117 (35%)
QDA   | 20/30 (33%)  | 6/12 (50%) | 8/15 (47%)  | 6/13 (54%)  | 5/12 (58%)             | 32/35 (9%)  | 77/117 (34%)
RDA   | 20/30 (33%)  | 8/12 (33%) | 9/15 (40%)  | 7/13 (46%)  | 5/12 (58%)             | 31/35 (11%) | 80/117 (32%)
SIMCA | 21/30 (30%)  | 8/12 (33%) | 9/15 (40%)  | 8/13 (38%)  | 6/12 (50%)             | 30/35 (14%) | 82/117 (30%)
N-PLS | 24/30 (20%)  | 8/12 (33%) | 12/15 (20%) | 8/13 (38%)  | 10/12 (17%)            | 32/35 (9%)  | 94/117 (20%)
MLP   | 26/30 (13%)  | 9/12 (25%) | 10/15 (33%) | 10/13 (23%) | 10/12 (17%)            | 31/35 (11%) | 96/117 (18%)
KNN   | 28/30 (7%)   | 9/12 (25%) | 15/15 (0%)  | 12/13 (8%)  | 10/12 (17%)            | 35/35 (0%)  | 109/117 (7%)
SVM   | 28/30 (7%)   | 11/12 (8%) | 15/15 (0%)  | 12/13 (8%)  | 11/12 (8%)             | 35/35 (0%)  | 112/117 (4%)
PNN   | 30/30 (0%)   | 11/12 (8%) | 15/15 (0%)  | 13/13 (0%)  | 12/12 (0%)             | 35/35 (0%)  | 116/117 (1%)

Values are given as N_right/N_0 (E); see comments for Table 6.



Table 8
Multilayer perceptron (MLP) classification: confusion matrix for Set B.

                       | Number | Straight-run | Reformate | Catalysate | Isomerizate | Hydrocracking gasoline | Mixture
Straight-run           | 30     | 26           | 0         | 0          | 0           | 0                      | 4
Reformate              | 12     | 0            | 9         | 0          | 1           | 0                      | 2
Catalysate             | 15     | 0            | 2         | 10         | 1           | 0                      | 2
Isomerizate            | 13     | 0            | 0         | 0          | 10          | 0                      | 3
Hydrocracking gasoline | 12     | 0            | 0         | 0          | 0           | 10                     | 2
Mixture                | 35     | 1            | 1         | 0          | 0           | 2                      | 31

Rows, real class; columns, predicted class.

Table 9
K-Nearest neighbor (KNN) classification: confusion matrix for Set B.

                       | Number | Straight-run | Reformate | Catalysate | Isomerizate | Hydrocracking gasoline | Mixture
Straight-run           | 30     | 28           | 0         | 0          | 0           | 0                      | 2
Reformate              | 12     | 0            | 9         | 1          | 0           | 0                      | 2
Catalysate             | 15     | 0            | 0         | 15         | 0           | 0                      | 0
Isomerizate            | 13     | 0            | 0         | 0          | 12          | 0                      | 1
Hydrocracking gasoline | 12     | 0            | 0         | 0          | 0           | 10                     | 2
Mixture                | 35     | 0            | 0         | 0          | 0           | 0                      | 35

Rows, real class; columns, predicted class.

Table 10
Probabilistic neural network (PNN) classification: confusion matrix for Set B.

                       | Number | Straight-run | Reformate | Catalysate | Isomerizate | Hydrocracking gasoline | Mixture
Straight-run           | 30     | 30           | 0         | 0          | 0           | 0                      | 0
Reformate              | 12     | 0            | 11        | 0          | 0           | 0                      | 1
Catalysate             | 15     | 0            | 0         | 15         | 0           | 0                      | 0
Isomerizate            | 13     | 0            | 0         | 0          | 13          | 0                      | 0
Hydrocracking gasoline | 12     | 0            | 0         | 0          | 0           | 12                     | 0
Mixture                | 35     | 0            | 0         | 0          | 0           | 0                      | 35

Rows, real class; columns, predicted class.
Columns, real class; rows, predicted class.

Table 11
Classification of gasolines in sample Set C: gasoline type.

      | Normal      | Regular     | Premium     | Total
LDA   | 50/55 (9%)  | 40/45 (11%) | 11/15 (27%) | 101/115 (12%)
QDA   | 51/55 (7%)  | 40/45 (11%) | 10/15 (33%) | 101/115 (12%)
RDA   | 48/55 (13%) | 41/45 (9%)  | 11/15 (27%) | 100/115 (13%)
SIMCA | 51/55 (7%)  | 40/45 (11%) | 12/15 (20%) | 103/115 (10%)
N-PLS | 50/55 (9%)  | 42/45 (7%)  | 12/15 (20%) | 107/115 (10%)
MLP   | 50/55 (9%)  | 42/45 (7%)  | 13/15 (13%) | 110/115 (9%)
KNN   | 55/55 (0%)  | 45/45 (0%)  | 14/15 (7%)  | 114/115 (1%)
SVM   | 55/55 (0%)  | 44/45 (2%)  | 15/15 (0%)  | 114/115 (1%)
PNN   | 55/55 (0%)  | 45/45 (0%)  | 15/15 (0%)  | 115/115 (0%)

Values are given as N_right/N_0 (E); see comments for Table 6.

It means that gasoline classification according to its type can be effectively done only by rather sophisticated classification algorithms: the KNN, SVM, and PNN methods.

With respect to their effectiveness, the classification methods can be arranged in the following order: PNN ≥ SVM ≥ KNN ≫ ANN-MLP ≥ PLS > SIMCA ≥ RDA ≈ LDA ≈ QDA.

It should be noted that, apart from accuracy, a calibration method is also characterized by its simplicity for the investigator (comprehensibility of the main algorithms, availability and price of software, etc.) and by the volume of required calculations (computer capacity needed for its realization, time and price of model creation, etc.) [5,8]. With respect to these parameters, the above-stated methods can be grouped into two well-separated groups (see also Ref. [5]):

(i) according to the computation time needed: resource demanding (PNN, SVM, and ANN-MLP) and resource non-demanding (KNN, PLS, SIMCA, and discriminant analysis methods);
(ii) ease of use: sophisticated (PNN, SVM, MLP, and PLS) vs. simple (KNN, SIMCA, RDA, LDA, and QDA).

ANN training is more than 10^3 times more time-consuming than the MLR or PLS model building approach [5,8]. These data should also be considered when making a decision on the advantages of one or another method for processing of spectral and reference data. One should note that these expenses are of no importance for the end-consumer (e.g., a refinery laboratory): no real difference in the calculation time of the created model is observed. The same can, in principle, be said about "ease of use": if the models are just applied by the end-user, no difference is observed. The obstacles can be met during model building, where an understanding of the mathematical principles on which this or that model is based is needed [5].

5. Conclusions

The results of the different classification model applications are shown in Fig. 5. We have examined different approaches to classify gasolines according to their source (refinery and process) and type.

Fig. 5. Results of gasoline classification by different methods. LDA, linear discriminant analysis; QDA, quadratic discriminant analysis; RDA, regularized discriminant analysis; SIMCA, soft independent modeling of class analogy; PLS, partial least squares regression; KNN, K-nearest neighbor; SVM, support vector machines; PNN, probabilistic neural network; MLP, multilayer perceptron. The PNN error of classification for Sets A and C is equal to zero. The SVM classification error for Set A is equal to zero.

(1) Three groups of classification methods were identified: KNN, SVM, and PNN; MLP and PLS; SIMCA and the discriminant analysis methods (LDA, RDA, and QDA). Each group has a different classification accuracy (see Section 4.4 for details).
(2) The probabilistic neural network (PNN) method is the most effective way of gasoline classification using near infrared (NIR) data. This method can be recommended for industrial gasoline analysis.
(3) K-nearest neighbor (KNN) and support vector machine (SVM) classifiers have also shown adequate results. One should note that since the KNN technique is much easier to understand and compute, it can also be recommended for gasoline NIR spectra analysis.

The choice of classification model depends on the researcher's knowledge: KNN can be recommended for people who are not advanced in mathematics, and PNN can be used by advanced researchers. It should be noted that KNN software can easily be written by a person with basic programming skills, while the PNN method is in need of special software (e.g., the MATLAB Neural Network Toolbox) or advanced programming skills.

Alternatively, for an advanced researcher, the choice of a method depends strongly on the data structure and the aim of the study. Therefore, the advanced researcher should have a clear strategy of how to explore and/or model the data, and should know when and why a given method can be applied. The data presented above can help in this strategy choice.

We hope that the results obtained by us will help both further chemometric investigations [48–50] and investigations in the sphere of vibrational (infrared, near infrared, and Raman) spectroscopy [51–59] of multicomponent systems. The results presented herein can help in the rapid and accurate classification (or analysis) of biofuels (e.g., bioalcohols/alcohol fuel, ethanol–gasoline fuel, cellulosic ethanol, bioethers, algae fuel), products of petroleum refining (liquid petroleum gas, gasoline, naphtha, kerosene/jet aircraft fuels, diesel fuel, (marine) fuel oils, lubricating and industrial oils, paraffin wax, asphalt and tar, and petroleum coke), and petrochemicals (olefins and their precursors, aromatic hydrocarbons: e.g., benzene or mixed xylenes). The use of near infrared spectroscopy in other fields of analytical (and/or quantum [49,60–62]) chemistry, such as pharmaceutical (drug) quality control, food quality control (e.g., green/black tea), and active pharmaceutical ingredient (API)/pharmakon analysis of tablets, can be enhanced by the application of modern methods of multivariate data analysis, including artificial neural networks as well as other machine learning methods (data mining, pattern recognition, and adaptive control) within the framework of Bayesian statistics.

References

[1] B. Osborne, T. Fearn, Near Infrared Spectroscopy in Food Analysis, Wiley, New York, 1986.
[2] F. Chauchard, R. Cogdill, S. Roussel, J.M. Roger, V. Bellon-Maurel, Chemom. Intell. Lab. Syst. 7 (2004) 141.
[3] P. Fayolle, J.-M. Roger, V. Steinmetz, L. Dusserre-Bresson, V. Bellon, Sensoral 98 Colloque international sur les capteurs de la qualite des produits agro-alimentaires, in: ENSAM-Cemagref-INRA, Montpellier, France, 1998, p. 533.
[4] T. Hyvarinen, E. Herrala, J. Malinen, P. Niemla, NIR analysers can be miniature, rugged and handheld, in: K. Hildrum, T. Isaksson, T. Naes, A. Tandberg (Eds.), Near Infrared Spectroscopy, Bridging the Gap Between Data Analysis and NIR Applications, Ellis Horwood, London, 1992, p. 1.
[5] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Chemom. Intell. Lab. Syst. 88 (2007) 183.
[6] R.M. Balabin, R.Z. Safieva, J. Near Infrared Spectrosc. 15 (2007) 343.
[7] R.M. Balabin, R.Z. Syunyaev, S.A. Karpov, Fuel 86 (2007) 323.
[8] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Chemom. Intell. Lab. Syst. 93 (2008) 58.
[9] R.M. Balabin, R.Z. Syunyaev, S.A. Karpov, Energy Fuels 21 (2007) 2460.
[10] N. Pasadakis, V. Gaganis, C. Foteinopoulos, Fuel Process. Technol. 87 (2006) 505.
[11] K. Brudzewski, A. Kesik, K. Kołodziejczyk, U. Zborowska, J. Ulaczyk, Fuel 85 (2006) 553.
[12] K. Brudzewski, S. Osowski, T. Markiewicz, J. Ulaczyk, Sens. Actuators B: Chem. 113 (2006) 135.
[13] J. Yoon, B. Lee, C. Han, Chemom. Intell. Lab. Syst. 64 (2002) 1.
[14] D.A. Burns, E.W. Ciurczak, Handbook of Near-Infrared Analysis, CRC Press, 2001.
[15] D. Donald, D. Coomans, Y. Everingham, D. Cozzolino, M. Gishen, T. Hancock, Chemom. Intell. Lab. Syst. 82 (2006) 122.
[16] Q.-K. Zhang, L.-K. Dai, Control Instrum. Chem. Ind. 32 (2005) 53.
[17] T. Otto, R. Saupe, R. Bruch, U. Fritzsch, V. Stock, T. Gessner, N. Afanasyeva, Proc. SPIE – Int. Soc. Opt. Eng. 4491 (2001) 234.
[18] M. Kim, Y.-H. Lee, C. Han, Comput. Chem. Eng. 24 (2000) 513.
[19] A.I. Belousov, S.A. Verzakov, J. von Frese, Chemom. Intell. Lab. Syst. 64 (2002) 15.
[20] D. Michie, D.J. Spiegelhalter, C.C. Taylor (Eds.), Machine Learning, Neural and Statistical Classification, http://www.amsta.leeds.ac.uk/~charles/statlog/.
[21] M.S. Sánchez, L.A. Sarabia, Chemom. Intell. Lab. Syst. 28 (1995) 287.
[22] J.H. Friedman, J. Am. Stat. Assoc. 84 (1989) 165.
[23] S. Wold, M. Sjostrom, SIMCA: a method for analyzing chemical data in terms of similarity and analogy, in: B.R. Kowalski (Ed.), Chemometrics Theory and Application, American Chemical Society Symposium Series 52, American Chemical Society, Washington, DC, 1977, pp. 243–282.
[24] Y. Tominaga, Chemom. Intell. Lab. Syst. 49 (1999) 105.
[25] S. Wold, A. Ruhe, H. Wold, W.J. Dunn III, SIAM J. Sci. Stat. Comput. (1984) 735.
[26] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1.
[27] W.G. Glen, W.J. Dunn III, D.R. Scott, Tetrahedron Comput. Methodol. 2 (1989) 349.
[28] E. Fix, J.L. Hodges, Int. Stat. Rev. 57 (1989) 238.
[29] B.V. Dasarathy (Ed.), Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, 1991.
[30] Shakhnarovich, Darrell, Indyk (Eds.), Nearest-Neighbor Methods in Learning and Vision, The MIT Press, 2005.
[31] S.R. Amendolia, G. Cossu, M.L. Ganadu, B. Golosio, G.L. Masala, G.M. Mura, Chemom. Intell. Lab. Syst. 69 (2003) 13.

[32] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, Boston, MA, 1990.
[33] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[34] T. Joachims, Text categorization with support vector machines: learning with many relevant features, in: Proceedings of the 10th European Conference on Machine Learning (ECML), Springer-Verlag, Berlin, Heidelberg, New York, 1998.
[35] S. Dumais, et al., Proceedings of the Conference on Information and Knowledge Management, 1998.
[36] M.A. Hearst, B. Scholkopf, S. Dumais, E. Osuna, J. Platt, IEEE Intell. Syst. 13 (1998) 18.
[37] Y. Yao, P. Frasconi, M. Pontil, Proceedings of the 3rd International Conference of Audio- and Video-Based Person Authentication AVBPA, 2001, p. 253.
[38] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods), Cambridge Univ. Press, Cambridge, UK, 2000.
[39] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[40] M. Pontil, A. Verri, Properties of support vector machines, Neural Comput. 10 (1998) 955.
[41] D.F. Specht, IEEE International Conference on Neural Networks, 1988, p. 525, doi:10.1109/ICNN.1988.23887.
[42] D. Specht, Neural Networks 3 (1990) 109.
[43] M.H. Hassoun, Fundamentals of Artificial Neural Networks, 1st ed., MIT Press, Cambridge, MA, USA, 1995.
[44] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, NY, 1994.
[45] H. Demuth, M. Beale, M. Hagan, Matlab Neural Network Toolbox 5, User's Guide, www.mathworks.com/access/helpdesk/help/pdf_doc/nnet/nnet.pdf.
[46] J.G. Speight, B. Ozum (Eds.), Petroleum Refining Processes, 1st ed., CRC, 2001.
[47] R.A. Meyers, Handbook of Petroleum Refining Processes, 3rd ed., McGraw-Hill Professional, 2003.
[48] R.M. Balabin, R.Z. Safieva, Fuel 87 (2008) 2745.
[49] R.M. Balabin, E.I. Lomakina, J. Chem. Phys. 131 (2009) 074104.
[50] R.M. Balabin, R.Z. Syunyaev, J. Colloid Interface Sci. 318 (2008) 167.
[51] M. Blanco, J. Coello, H. Iturriaga, S. Maspoch, J. Pagès, Chemom. Intell. Lab. Syst. 50 (2000) 75.
[52] R.M. Balabin, J. Phys. Chem. A 113 (2009) 4910.
[53] R.M. Balabin, J. Phys. Chem. A 113 (2009) 1012.
[54] R.M. Balabin, J. Phys. Chem. Lett. 1 (2010) 20.
[55] R.Z. Syunyaev, R.M. Balabin, I.S. Akhatov, J.O. Safieva, Energy Fuels 23 (2009) 1230.
[56] R.M. Balabin, J. Disper. Sci. Technol. 29 (2008) 457.
[57] R.Z. Syunyaev, R.M. Balabin, J. Disper. Sci. Technol. 28 (2007) 419.
[58] Y. Roggo, P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond, N. Jent, J. Pharm. Biomed. Anal. 44 (2007) 683.
[59] R.M. Balabin, Phys. Chem. Chem. Phys. 12 (2010), in press, doi:10.1039/b924029b.
[60] R.M. Balabin, Chem. Phys. 352 (2008) 267.
[61] R.M. Balabin, J. Chem. Phys. 129 (2008) 164101.
[62] R.M. Balabin, J. Phys. Chem. A 114 (2010) 3698.
