Jasper Busschers
August 17, 2018
The goal of this bachelor thesis is to identify the changes that occur in fluids when they are kept under certain conditions. Identifying these visual changes in samples and keeping logs of their state is a time-consuming procedure performed by many chemistry labs, for example at Procter & Gamble (P&G), who requested this research and also provided the data. Their interest is to improve the quality assurance of their products by performing experiments on different samples. Another goal is to predict early how samples will change, in order to reduce the duration of the experiments.
Classification problems like the one discussed above are often solved via supervised learning. These methods rely on vast amounts of labelled data, which were not available in our dataset. That is why this thesis covers 2 alternative approaches that can be used to solve the classification problem. In the first approach, we try to quantify the amount of change that occurred by making an artificial neural network (ANN) learn a similarity score. This score can then be used to identify different changes.
The other approach uses Generative Adversarial Networks [1] to directly classify samples with the correct label. This architecture consists of 2 different networks: the generator and the discriminator. The generator is trained to produce new samples that appear realistic; this training is done in an unsupervised way, since no labels are required. The discriminator is then trained to classify the correct change and to predict whether its input is a real image or one generated by the generator.
This architecture is interesting both for classifying and for predicting the state of a sample: the discriminator can be trained as a good classifier, while the generator can be trained to predict the future state of a sample. More on this in chapter 5.
Contents

1 Introduction
  1.1 Problem definition
  1.2 Classifying the correct change
    1.2.1 Quantifying the changes
    1.2.2 Classifying using a Generative approach
  1.3 Making earlier predictions

2 Dataset
  2.1 Sequence representation
  2.2 Absolute difference
  2.3 Failure types
  2.4 Manually created labels
  2.5 Evaluating the difficulties

5 Generative approach
  5.1 Generative adversarial networks
  5.2 Improvements
    5.2.1 DCGAN
    5.2.2 Cycle GAN
  5.3 Semi-supervised learning with GANs

7 Improving predictions
  7.1 Predicting future appearance
    7.1.1 Using the first and middle image
    7.1.2 Using the absolute difference
  7.2 Results

Bibliography
Chapter 1
Introduction
In many industries it is very important to test how the quality of certain products degrades over time. Such tests can be a time-consuming process, since they need to be performed over a long period of time. After such an experiment, a log must also be made for every sample, stating which change occurred.
In this thesis, we are especially interested in the changes that can occur in fluids kept under certain conditions inside glass vessels. The goal is to reduce the labour time needed to perform such experiments. This can be done by automatically classifying the correct change for a sample, which reduces the time needed to create logs. Another way is to learn to predict the future appearance of a sample from its earlier appearances, which would reduce the amount of time experiments need to last.
To solve these problems, a dataset of fluid samples inside transparent glass vessels was used. The dataset was provided by P&G and contains approximately 1000 sequences of images showing samples of fluids during their experiments. Chapter 2 covers this dataset in more detail; there we discuss what kind of data representation best captures the change in a sequence.
Chapter 3 gives an overview of traditional deep learning approaches and discusses how these can be used to identify changes in fluids, for example autoencoders, which we use in chapter 4 to quantify the amount of change that occurred. For this, a neural network is trained to encode stable samples as similarly as possible and samples showing a change as differently as possible.
In chapter 5 an overview is given of generative approaches for solving the classification problem. These methods can be trained in a semi-supervised way, requiring far less labelled data. The results of this approach are presented in chapter 6, where we use a Generative Adversarial Network (GAN) to classify the changes in sequences of fluids. That chapter focuses purely on classification accuracy and does not cover the quality of the generated data.
In chapter 7 we present a GAN architecture specialised in producing high-quality generated images. This network is trained to predict the final appearance of a sample by looking at some earlier images.
1.2.1 Quantifying the changes
In chapter 4 an attempt is made to solve the problem by training an ANN to quantify the amount of change between 2 images of the same sample. The Siamese Network is an architecture that was presented for this goal [2]. This network trains an encoder to encode similar samples as similarly as possible, while dissimilar samples should be encoded very differently. In our case we call 2 images of the same sample similar when its state remained stable.
Chapter 2
Dataset
The dataset used to study changes in fluids was provided by P&G and was constructed by placing samples inside a chamber at a certain temperature. Every hour a picture is taken of the samples, creating a sequence of images that shows the changes occurring in the sample. The dataset contains 1036 usable sequences; every sequence contains somewhere between 200 and 1500 images, depending on how long the experiment was conducted. A sample can be classified as either stable or as one of 4 failure states: sedimentation, phase splitting, creaming and color change. These are discussed in more detail in section 2.3.
The initial dataset provided only 20 labeled sequences, which would certainly not be sufficient for any fully supervised machine learning method. Arguably, even unsupervised methods would still require more labels in order to create a validation set that represents the dataset well. This is why the first stage of this thesis revolved around finding a good representation for the data and creating new labels, to be able to compare the different approaches.
For this reason, we have chosen to represent a sequence by its first and last image.
The figure above shows an example of every failure type: on the left, the two input images x1 and x2 are shown; on the right, the result after performing Diff(x1, x2). The result of Diff(x1, x2) gives us some insight into the classification problem: stable, phase splitting and color change are easily identifiable, while sedimentation and creaming show only a very slight difference.
Above you can see the representation of every failure type after relabelling the entire dataset. While stable samples and color changes are very well represented in the dataset, sedimentation, splitting and creaming are all heavily underrepresented. Such a dataset is called an unbalanced dataset; it can cause a network to optimise only for the most frequent labels and to ignore those that occur rarely. The next section covers approaches for dealing with unbalanced datasets.
While we cannot make these failures easier to detect, we can partly solve the problem of underrepresentation. One way of solving this problem is by oversampling the underrepresented classes or undersampling those that are overrepresented. In our case, oversampling is much preferred, since we would lose a lot of training data by undersampling.
Another way to deal with unbalanced datasets is to give each class a weight that says how much the neural network should care about samples of that class. To compute the weight for each class, we first compute the average weight Aw: the number of samples Ns divided by the number of classifications k.

Aw = Ns / k

Then for every classification, we compute its weight by dividing Aw by the number of samples in that classification, giving us the following weights:
classification   weight
color change     0.77
stable           0.32
sedimentation    5.19
splitting        3.92
creaming         7.41
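As a small sketch, this weighting scheme can be computed as follows. The per-class counts below are hypothetical (chosen only to roughly reproduce the table); the real counts come from the relabelled dataset described earlier in this chapter.

```python
# Hypothetical per-class sample counts, for illustration only.
counts = {
    "color change": 1350,
    "stable": 3250,
    "sedimentation": 200,
    "splitting": 265,
    "creaming": 140,
}

Ns = sum(counts.values())   # total number of samples
k = len(counts)             # number of classifications
Aw = Ns / k                 # average weight, Aw = Ns / k

# Per-class weight: Aw divided by the number of samples in that class.
weights = {label: round(Aw / n, 2) for label, n in counts.items()}
print(weights)
```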
Chapter 3
In the previous chapter, we evaluated the dataset and chose a representation for the data. We also covered the difficulties of this dataset and discussed how it can be used in machine learning applications.
In this chapter, we give the necessary overview of neural networks used for image processing and discuss how these can be used to solve the classification problem defined in section 1.1.
Deep neural networks have recently shown far better performance on image processing than handcrafted approaches, because they do not rely on a predefined feature representation. Instead, they pass an input through a network of weights to produce an output. The weights are then updated for every sample so that the network fits the dataset best. This allows the network to learn more complex feature representations of the data.
In 1998 LeCun [4] showed the first implementation of a multi-layer neural network performing such gradient-based learning on text recognition. Many of the techniques used there later became key technologies for modelling convolutional neural networks. In 2012 Alex Krizhevsky et al. presented convolutional neural networks (CNNs) for image processing [5]. This work showed the first example of a convolutional neural network outperforming every traditional neural network on image classification problems. Section 3.1 describes convolutional neural networks in more detail.
Going into the mathematical details or implementations of all these networks would go beyond the scope of this thesis, as they are implemented by specialised machine learning libraries (PyTorch, TensorFlow, Caffe, ...). Instead, this chapter gives a general explanation of the building blocks of a convolutional neural network and of the models proposed to solve different kinds of learning problems.
3.1 Convolutional neural network
In traditional neural networks, image processing was usually achieved by having a set of neurons as input layer, each taking one pixel of the image as input [6]. Afterwards, this input is processed by fully connected hidden layers and combined into a result. The limitations of such a network are that it does not scale well to larger images and that neurons in the same layer work independently from one another without sharing connections. Convolutional neural networks work around this problem by using 3-dimensional layers (width, height and depth) [7].
Using this architecture, neurons can be implemented as filters, so-called convolutions. Several of these convolutions together form a convolutional layer, which produces a 3-dimensional output whose depth is defined by the number of filters applied in that layer, and whose size further depends on the size of the convolution filters and the stride (overlap). More on this in the next section.
In most cases, the goal of a convolutional layer is to expand the input into an output with a higher depth dimension in order to extract more advanced features, though reducing the dimensions is also possible. A convolutional layer is most commonly followed by either a pooling layer or an activation layer. Pooling layers are generally used to reduce the dimensions of the result generated by the convolutional layer; one of the most commonly used pooling techniques is max pooling. More on this topic in section 4.1.2.
The purpose of the activation layer is to add non-linearity by applying an activation function to the weighted result of each filter in the previous layer. The activation function should be a non-linear function (Sigmoid, Tanh, ReLU, ...). These are discussed in section 4.1.3.
process in convolutional neural networks, where we are looking for the weights
that represent a concept.
In neural networks, convolutions are defined by their filter size and a stride. The filter slides both horizontally and vertically over the image to generate an activation map stating how strongly the filter is triggered at each point. The stride of the filter defines the steps it skips when moving horizontally or vertically over the image. The size of the resulting activation map is defined as follows:

D = ((W − F + 2P) / S) + 1

where D is the dimension of the resulting activation map, W the input dimension (width or height), F the filter size and S the stride. The dimension of the output can be manipulated by adding zero padding around the borders of the input; P states the amount of padding added around the border. It is common practice to make each filter the same size within a convolutional layer. A convolutional layer returns the results of all its filters as a 3-dimensional matrix, where the depth is defined by the number of filters used in that layer and the initial depth of the input.
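As a quick sanity check, this formula can be written directly in Python (a small helper added here for illustration, not part of the original work):

```python
def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """D = ((W - F + 2P) / S) + 1 for a single spatial dimension."""
    return (w - f + 2 * p) // s + 1

# A 64-pixel input with a 3x3 filter, stride 1 and padding 1 keeps its size:
assert conv_output_size(64, f=3, s=1, p=1) == 64
# Without padding the activation map shrinks to 62x62:
assert conv_output_size(64, f=3, s=1, p=0) == 62
```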
For a pooling layer, the output dimensions are given by:

W2 = ((W1 − F) / S) + 1
H2 = ((H1 − F) / S) + 1
D2 = D1

where F is the filter size and S the stride of the pooling layer. With a filter size of 2 and a stride of 2 it is easy to see that the original width and height are cut in half. The pooling layer works on the full depth of the input, so the output depth stays the same as the input depth.
Max-Pooling  Max-pooling is one of the most commonly used pooling techniques and uses a 2x2 filter with stride 2. It keeps the maximum of the 4 values the filter covers and returns the result in a new matrix with half the height and width of its input, keeping only the highest values.
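A minimal PyTorch illustration of this behaviour (the tensor shape here is arbitrary):

```python
import torch
import torch.nn as nn

# A 2x2 max-pooling filter with stride 2 halves width and height
# while leaving the depth (number of channels) untouched.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.randn(1, 16, 10, 10)   # (batch, depth, height, width)
print(pool(x).shape)             # torch.Size([1, 16, 5, 5])
```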
increase the number of parameters, and thus the memory usage, within the network. For this reason, these layers have fallen out of interest in more recent works [9, 10].
Chapter 4
Implementation of the traditional approach
4.1 Network architecture
The architecture of the Siamese Network used in this chapter consists of 2 parts. The first part is the encoder network of an autoencoder; it consists of 3 convolutional and 3 max-pooling layers, which first expand the input to a higher depth and then downscale it to a small encoding of the image. The second part of the network uses 3 fully connected layers to flatten the output and reduce the encoding to a vector of size 5. This allows us to easily compute the distance between 2 encodings using the Euclidean distance.
layer       size  stride  padding  output size
conv1       3     3       1        16x10x10
max pool1   2     2       0        32x5x5
conv2       3     2       1        48x3x3
max pool2   2     1       0        48x2x2
conv3       3     1       1        48x2x2
max pool3   2     2       0        48x1x1
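A minimal PyTorch sketch of an encoder in this spirit is given below. The exact channel counts are hard to recover from the table above, so the numbers used here (and the assumed 3x64x64 input) are illustrative only:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small convolutional encoder producing a 5-dimensional embedding.
    Channel counts and the assumed 3x64x64 input are illustrative."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=3, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(), nn.Dropout2d(0.2),
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(16, 48, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(48), nn.ReLU(), nn.Dropout2d(0.2),
            nn.MaxPool2d(2, stride=1),
            nn.Conv2d(48, 48, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2, stride=2),
        )
        # Three fully connected layers reduce the encoding to a size-5 vector.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 5),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))
```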
The network has been kept intentionally small, to reduce the chance of overfitting and because deeper networks require more data to be trained correctly. Dropout and batch normalisation were also applied between the convolutional layers to reduce overfitting. The network is not trained to produce an output classification but rather to differentiate between 2 classifications. For this reason a special loss function was introduced: the contrastive loss [11],

L(x1, x2, Y) = (1 − Y) · 1/2 · Dw^2 + Y · 1/2 · max(0, m − Dw)^2

where Dw is the Euclidean distance between the two encodings, Y is 0 for a similar pair and 1 for a dissimilar one, and m is a margin from which the distance should contribute to the loss. In our network the chosen margin was 1.
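In PyTorch this loss can be translated directly; a sketch following the formulation above, with y = 1 for a dissimilar pair:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e1: torch.Tensor, e2: torch.Tensor,
                     y: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Contrastive loss [11]: pull similar pairs (y=0) together,
    push dissimilar pairs (y=1) at least `margin` apart."""
    dw = F.pairwise_distance(e1, e2)   # Euclidean distance per pair
    loss = ((1 - y) * 0.5 * dw.pow(2)
            + y * 0.5 * torch.clamp(margin - dw, min=0).pow(2))
    return loss.mean()
```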
4.2 Training
The Siamese Network is trained by passing the first and the last image of a sequence separately through the network. Then the Euclidean distance between the 2 output encodings is computed. The network is trained to produce similar encodings if the sequence remained stable, and very different encodings otherwise.
Note that no comparisons were made between different failure sequences, in order to keep the initial classification problem simple.
The network was trained using the triple dataset of 6500 samples from chapter 2. This dataset was split into a training set of 6100 and a testing set of 400.
4.3 Results
Since a Siamese Network does not output a classification but rather a distance between 2 encodings, no accuracy score can be given without choosing a threshold from which we consider 2 samples distinct enough. Here we chose to display the validation set of 400 in a scatter plot to illustrate the problems with this approach.
encode those differently. This makes the learning problem incredibly difficult to
optimize.
That’s why the next chapter will cover another approach that uses generative
methods to identify changes.
Chapter 5
Generative approach
In chapter 2 we discussed the difficulties of the dataset, one of the biggest being the underrepresentation of certain failures. Chapter 3 discussed the key concepts of deep learning and the building blocks of convolutional neural networks. In the last chapter, we attempted to use these basic building blocks to quantify the amount of change, and discovered that our approach was not well suited for this purpose.
In this chapter, we discuss a different kind of CNN architecture that uses a generative approach. This architecture can be trained in a semi-supervised way, allowing us to use not only the small set of labeled data but also the entire collection of 1.2 million images for unsupervised training.
In such generative approaches, the network is asked to produce new samples in order to better learn the representation of the dataset. This chapter limits itself to generative adversarial networks (GANs), while chapter 6 covers the actual implementation and results. Other popular models such as Restricted Boltzmann Machines (RBMs) and Variational Auto-Encoders (VAEs) exist, but have been shown to capture far less detailed representations than GANs [12].
GANs were introduced in 2014 by Ian Goodfellow et al. [1] as an implicit probabilistic model that learns a representation of the data. It achieves this by training 2 different networks: the discriminator and the generator. The discriminator is trained to identify whether its input is a real sample or an artificially produced one. The generator takes a vector of noise as input and is trained to produce samples that cause the discriminator to classify them as real. These 2 networks play a min-max game, as their objective is to make the other network fail: when the generator starts making the discriminator fail, the discriminator will start improving, and the other way around. The advantage of this method is that the loss does not converge too quickly.
In 2016 Alec Radford et al. [13] presented an improved version of the original GAN architecture called Deep Convolutional Generative Adversarial Networks (DCGAN), giving new guidelines for better designing and training GANs. These will be further discussed in section 5.2.1.
Although GANs can learn very detailed feature representations of their data, they can be very difficult to train. One of the biggest difficulties lies in the balance of power between the generator and the discriminator: an imbalance can cause the network to never converge to a minimum. Vanishing gradients can also occur when the discriminator becomes too good; this happens when the loss of the discriminator falls to 0 and no more optimisation can happen.
One of the most mentioned problems in training GANs is mode collapse, a direct consequence of the way the adversarial loss is defined. It happens when many input noise vectors are mapped to the same output and the generator is no longer improving.
Researchers are currently striving towards a fundamental model of GANs that is not vulnerable to these problems [12]. In the meantime, many hacks have been proposed to stabilise the training of GANs while better methods are still being researched [14, 15, 16]. Many of these hacks revolve around adding noise to both networks, either on the input or via dropout layers.
Ever since the framework for generative adversarial training was presented, many new architectures have been proposed for different purposes. One of the most obvious extensions, presented in 2014, is conditional GANs [17]. In this architecture, a label is added to the input of both the generator and the discriminator in order to specify the kind of samples that need to be generated. This also allows the discriminator to be trained conditionally, judging whether samples look real or fake conditioned on some label.
In 2017 Jun-Yan Zhu et al. [18] presented a new architecture called Cycle GAN. It makes major modifications to the original architecture: whereas regular GANs learn a mapping from input noise to data, Cycle GAN learns a one-to-one mapping from input images to output images, enforced by an additional loss. A more detailed description of these and other architectures can be found in section 5.2.2.
The discriminator is trained to output the correct label, giving a result near zero on fake examples and a probability near 1 for real samples. When given real data, the discriminator is trained to maximise

LossD(x) = log(D(x))

while for a fake example it is expected to minimise the output probability, i.e. to maximise log(1 − D(G(z))). The generator, on the other hand, is trained to maximise the probability D produces when given a fake sample, which corresponds to minimising

LossG(z) = log(1 − D(G(z)))
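A sketch of these objectives in PyTorch is shown below; `d` and `g` are hypothetical modules standing in for the discriminator (outputting a probability in (0, 1)) and the generator:

```python
import torch

def gan_losses(d, g, real: torch.Tensor, z: torch.Tensor):
    """Original GAN objectives, written as losses to minimise."""
    eps = 1e-8                 # numerical stability inside the logarithms
    fake = g(z)
    # Discriminator: maximise log D(x) + log(1 - D(G(z))),
    # expressed here as a negated loss to minimise.
    loss_d = -(torch.log(d(real) + eps)
               + torch.log(1 - d(fake.detach()) + eps)).mean()
    # Generator: minimise log(1 - D(G(z))).
    loss_g = torch.log(1 - d(fake) + eps).mean()
    return loss_d, loss_g
```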
In theory, these 2 networks play a min-max game until they settle on a Nash equilibrium. In reality, however, each model updates its weights independently of the other, which does not allow any cooperation. Salimans et al. discussed this problem in 2016, showing that updating gradients in a concurrent way cannot guarantee convergence to a minimum [3]. Section 5.2 covers potential solutions proposed for these convergence problems.
A trained GAN can be used for 2 different purposes. If the goal is just to produce new samples, it is sufficient to keep only the trained generator; if the goal is to create a classifier, it is sufficient to keep only the trained discriminator.
In chapter 6 the goal is to create a good discriminator to classify the changes in sequences.
5.2 Improvements
While the architecture discussed in the previous section showed the huge potential of GANs, adoption went slowly, as little was known about how to design such networks. This section covers different improvements and guidelines that have been proposed for better designing GANs.
5.2.1 DCGAN
In 2016 Alec Radford et al. [13] presented the first GAN that used only convolutions and was specialised for image processing. Multiple design choices were explained that later became guidelines for designing deeper convolutional GANs: their network eliminated all fully connected and pooling layers and replaced them with convolutional layers. They also showed that the best stability was achieved when using LeakyReLU across all layers of the discriminator, and claimed that following their guidelines results in a stable architecture.
In this approach, 2 different datasets are used for training: one labelled (supervised), while the other contains no labels.
The unlabeled dataset is used to train the generator to produce a wide variety of new samples. The biggest change lies in the discriminator, which is given an extra output via fully connected layers. This output assigns one of k + 1 possible classifications to the sample, where k is the number of classifications in the problem; the extra classification marks the sample as generated. This allows the network to improve on the classification problem by using the generated samples. Both a supervised and an unsupervised loss can then be computed over these 2 outputs and added together. Using this technique, they were able to achieve state-of-the-art results in semi-supervised learning.
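A compressed sketch of this k + 1 construction is given below; the helper is an assumption of how the two losses can be combined, not the exact implementation from the paper:

```python
import torch
import torch.nn.functional as F

K = 5  # number of real classifications; index K is the "generated" class

def discriminator_loss(logits_real_labeled: torch.Tensor,
                       labels: torch.Tensor,
                       logits_real_unlabeled: torch.Tensor,
                       logits_fake: torch.Tensor) -> torch.Tensor:
    """Semi-supervised loss over k+1 logits: supervised cross-entropy on
    labeled data plus an unsupervised real-vs-fake term."""
    # Supervised part: classify labeled real samples into the k real classes.
    sup = F.cross_entropy(logits_real_labeled[:, :K], labels)
    # Unsupervised part: real samples should avoid the "generated" class,
    # generated samples should land in it.
    p_fake_real = F.softmax(logits_real_unlabeled, dim=1)[:, K]
    p_fake_gen = F.softmax(logits_fake, dim=1)[:, K]
    unsup = -(torch.log(1 - p_fake_real + 1e-8).mean()
              + torch.log(p_fake_gen + 1e-8).mean())
    return sup + unsup
```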
They also presented new guidelines that proved to reduce the instability problems when training GANs.
Chapter 6
Implementation of the generative approach
In the last chapter, we gave a general description of GANs and discussed different approaches for solving the initial instability problems GANs suffer from. We also discussed how GANs can be used in a semi-supervised way to solve classification problems.
In this chapter, we discuss the implementation of a semi-supervised approach to the original classification problem, where the goal is to classify sequences of images containing fluids. We will not evaluate the quality of the generated results here, as that is discussed in chapter 7. Instead, this chapter presents the classification results after performing cross-validation.
The labeled dataset used in this chapter is the dataset of 1036 labeled samples discussed in chapter 2. The unsupervised set is constructed by taking 4 comparisons per sequence for 1000 sequences, making it a total of 4000 samples.
6.1 Implementation
In the original GAN formulation, the generator learns a mapping from a noise vector z to some data. Since samples in our dataset consist of 2 images and a label, we can choose between 2 different representations for the generator to produce.
In the first representation, the generator produces both images as output, given a noise vector z. This allows us to better judge the quality of the generator's output, but forces us to use far more parameters, which makes the network more difficult to train. We found in our experiments that the generator took significantly longer to learn the relationship between the two outputs; this caused the discriminator to learn much faster and a vanishing gradient to occur.
Another option is to have the generator produce the pre-processed difference between the two images. By doing this, we make the network focus only on the important features, but it also makes the network very vulnerable to mode collapse, which indeed happened in our experiments.
To find a middle way between these 2 approaches, some concepts from Cycle GAN were adopted. In Cycle GAN the goal is no longer to learn a mapping from noise to data, but to create a mapping from images of one set to images of another set. We adopt this by using the first image of a sequence as set A and the last image as set B, and let our generator learn a mapping from an input image from set A to a corresponding image from set B. The generator produces an output fake1 that is given to the discriminator along with the original input image. The discriminator then judges whether the absolute difference between the input and generated images looks like a real change.
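Sketched in PyTorch, the forward pass of this scheme looks roughly as follows (`g` and `d` are hypothetical modules standing in for the generator and discriminator described below):

```python
import torch

def adversarial_step(g, d, first_img: torch.Tensor, real_last: torch.Tensor):
    """The discriminator never judges raw images, only the absolute
    difference between the first image and a (real or generated) last image."""
    fake_last = g(first_img)                    # fake1 in the text
    diff_fake = (first_img - fake_last).abs()   # generated "change"
    diff_real = (first_img - real_last).abs()   # real "change"
    return d(diff_real), d(diff_fake)
```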
The generator in this network was replaced by the one used in Cycle GAN [21]: it starts by downsampling the input via convolutional layers, passes the result through 9 residual units and finally upsamples it again. The discriminator was also inspired by Cycle GAN, but has been modified to output a classification label, and its size has been reduced, since this showed roughly the same results. It consists of 5 convolutional layers, each halving the width and height of its input, followed by 2 fully connected layers that produce the output probability and the classification label.
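A sketch of such a reduced discriminator with two output heads; the channel counts and the single-channel 64x64 difference input are assumptions:

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """5 stride-2 conv layers (each halves H and W), then two heads:
    a real/fake probability and a classification label."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        chans = [1, 32, 64, 128, 256, 512]   # assumed channel progression
        layers = []
        for c_in, c_out in zip(chans, chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
        self.features = nn.Sequential(*layers, nn.Flatten())
        self.fc = nn.Sequential(nn.LazyLinear(256), nn.LeakyReLU(0.2))
        self.real_fake = nn.Linear(256, 1)            # probability head
        self.classifier = nn.Linear(256, num_classes) # label head

    def forward(self, x: torch.Tensor):
        h = self.fc(self.features(x))
        return torch.sigmoid(self.real_fake(h)), self.classifier(h)
```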
6.2 Training
The discriminator is trained on both a supervised and an unsupervised dataset, to correctly distinguish real samples from generated ones and to correctly predict the corresponding label. For the supervised dataset, we used the newly labeled dataset discussed in chapter 2, which provides 1036 labeled sequences. The generator is trained unsupervised only, using the loss computed from the output of the discriminator.
The images are first cropped from the center to 64x64 in order to reduce the size of the network and to include only changes that occurred inside the fluid. The cropped image is given to the generator, which generates the last image for that sequence. Then the absolute difference is taken between the real input and the generated output; the resulting image shows the relationship between the 2 images and is given to the discriminator to distinguish real relationships in the dataset from generated ones. Doing this indirectly forces the generated image to remain similar enough to the input image while showing only realistic changes. Both networks were trained with a constant learning rate of 0.002 and optimised using the Adam optimizer, as suggested by Salimans et al. [3].
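Sketched in PyTorch, the preprocessing and optimiser setup could look like this (the placeholder networks only stand in for the models of section 6.1):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Center-crop to 64x64, as described above.
preprocess = transforms.Compose([
    transforms.CenterCrop(64),
    transforms.ToTensor(),
])

# Placeholders standing in for the Cycle-GAN-style generator and the
# modified discriminator of section 6.1.
generator, discriminator = nn.Linear(1, 1), nn.Linear(1, 1)

# Both networks use a constant learning rate of 0.002 with Adam.
opt_g = torch.optim.Adam(generator.parameters(), lr=0.002)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.002)
```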
6.3 Results
The accuracy of this network was computed using cross-validation, splitting the dataset into 11 subsets: 10 subsets of size 100 and 1 subset of size 36. These were trained separately, starting from randomly initialised weights, for 20 epochs each.
In section 2.5 we presented weights that can be used to deal with unbalanced datasets; these weights are used in the supervised loss to balance the different classifications.
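In PyTorch, plugging the class weights from section 2.5 into the supervised loss is a one-liner (a sketch; the class ordering is assumed):

```python
import torch
import torch.nn as nn

# Order: color change, stable, sedimentation, splitting, creaming (section 2.5).
class_weights = torch.tensor([0.77, 0.32, 5.19, 3.92, 7.41])
supervised_loss = nn.CrossEntropyLoss(weight=class_weights)
```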
The results can be seen below in the confusion matrix, which displays which categories show the most errors.
confusion matrix  color change  stable  sedimentation  splitting  creaming
color change      89.6%         7.8%    1.4%           0%         1.1%
stable            7.3%          90.3%   0.7%           1.7%       0%
sedimentation     20%           30%     50%            0%         0%
splitting         9.5%          7.5%    0%             83%        0%
creaming          0%            34.5%   0%             0%         65.5%
We can see that stable and color change are classified very accurately, and that most of the remaining mistakes are confusions between these two classifications. This can be partially blamed on the manual labelling, as color changes are not always spotted.
Sedimentation and creaming, on the other hand, show a low accuracy. This can mostly be blamed on the fact that these changes are the hardest to spot of all the failures, but also on the very few samples in the training set showing these failures.
This can be seen in the confusion matrix: samples labeled as creaming are only misclassified as stable, because the two appear very similar. The same goes for sedimentation, which is misclassified as either stable or color change.
classification  accuracy (first-last)  accuracy (first-middle)
color change    89.6%                  62.86%
stable          90.3%                  83%
sedimentation   50%                    27.5%
splitting       83%                    81.13%
creaming        65.5%                  17.24%
We see that the accuracy is lower for every classification when the network is trained to predict changes earlier. Splitting and stable remain quite accurate, since they are also visible halfway through the experiment, but every classification that is less visible in earlier stages shows a significantly lower accuracy.
In the next chapter we discuss an alternative way to make earlier predictions: generating the final image of an experiment.
Chapter 7
Improving predictions
In the last chapter, we tried to predict the final state of a sample by looking only at the first and last picture of the sequence. We discussed the accuracy of this approach and tried to make earlier predictions by using the first and middle image of every sequence for training and testing. This method still relies on the assumption that all samples are labeled correctly, which cannot be guaranteed: the labels only reflect the change between the first and the last image, and the change does not necessarily have to be visible in earlier images. The labels were also created manually, without much knowledge in the field of chemistry.
In this chapter we discuss another approach to making earlier predictions, one that does not rely on labeled data. Instead of trying to classify the change early, we predict how the sample will look at a later time. In section 5.2.2 we evaluated Cycle GAN, an architecture used to create a mapping between 2 sets of images; a prediction can also be seen as a mapping from one or more previous images to one in the future.
The detailed results of Cycle GAN triggered a lot of research into new architectures incorporating this idea. Pix2Pix is one of the architectures presented to improve on the original Cycle GAN [22]. It uses conditional GANs and makes the generator produce not only realistic-looking samples but also bad-looking samples, to better train the discriminator. The result is also not judged by the discriminator alone: an L1 loss against the true output is included as well. This caused the network to produce more detailed results than the original Cycle GAN did. Section 7.2 covers the results of using the Pix2Pix architecture to predict the final image of a sequence.
7.1 Predicting future appearance
Predicting the changes in fluids is a challenging task, because fluids that look similar do not necessarily react in the same way. It is clear that if we want to predict the future appearance of a sample, we need to provide more information than just the first image.
In this section, we compare the predictions obtained when training the Pix2Pix architecture to predict the final appearance of a sample. First, we show the predictions when the generator is given only the first image of a sequence; this result merely serves as a baseline to compare the other approaches against.
In the next section, we will discuss the results of this approach and determine whether it indeed produced better predictions.
7.2 Results
This section compares the prediction results of every approach discussed in this chapter. These results were computed by training an unmodified version of the Pix2Pix architecture for 25 epochs for each approach [23]. All network configurations remained the same in all experiments; only the input size of the generator was increased when 2 images were used instead of 1.
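A natural way to enlarge the generator's input for the two-image experiments is to stack the inputs along the channel dimension; a sketch, not necessarily the exact mechanism used:

```python
import torch

img_f = torch.randn(1, 3, 256, 256)   # first image of the sequence
img_m = torch.randn(1, 3, 256, 256)   # middle image of the sequence

# Second experiment: first + middle image -> a 6-channel generator input.
x_second = torch.cat([img_f, img_m], dim=1)

# Third experiment: first image + absolute difference -> also 6 channels.
x_third = torch.cat([img_f, (img_f - img_m).abs()], dim=1)
```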
In the first experiment, we trained the generator to predict the last image of a sequence when given only the first image.
In the second experiment, we gave the generator both the first image and the image found in the middle of the sequence.
Finally, in the last experiment, we used the first image and the difference between the first and the middle image. The figure below compares the predictions of these approaches on 3 different inputs from the test set.
On the left we see the true data taken from the dataset: Imgf is the first image of the sequence and Imgm the one found in the middle. The last image of the sequence is taken as ground truth.
On the right, we see the predictions of these 3 approaches on the given sequences, made on samples from the test set. We first evaluate the method that uses only the first image to predict. As expected, this approach was not able to predict much change.
The second approach uses Imgf and Imgm to make its predictions. We can see that the predictions stay relatively close to Imgm, which shows that the network is not actually learning the relationship between the first image and the last image; instead, it only uses Imgm to make its predictions.
In section 7.1.2 we discussed another approach that no longer uses Imgm itself, but the difference between Imgf and Imgm. The results of this approach are displayed on the far right; these predictions come far closer to the true outcome.
This chapter does not contain a distance metric measured against the ground truth; it is merely meant to give insight into the problem of predicting the future appearance. We showed that it is very difficult, if not impossible, to predict the future appearance using only a single image. We also showed that far more accurate results can be achieved by using the absolute difference as a second input, because it captures the relationship between the first and the last image far better than an entire image does.
Chapter 8
8.1 Conclusion
In this thesis, we tried to reduce the time needed for experiments conducted by chemical and manufacturing companies. This can be achieved both by automatically identifying changes as they occur and by learning to predict changes earlier.
In chapter 2 we discussed the different classifications and went into more detail about the dataset used to study the changes in fluids. We saw that the sequences classified as sedimentation or creaming were heavily underrepresented in the dataset. These failures are also very difficult to identify, since they show only very small visible changes. This could be remedied by obtaining more, and more detailed, images of the samples.
First we tried to create a system that could quantify the amount of change between 2 images of the same sample. For this, we used a Siamese Network trained to encode the first and last image of a sequence. In theory, the network should encode samples that remained stable as very similar, while any change should cause the encodings to be very different.
This proved to be a very difficult problem, since changes in the camera angle can also cause a sample to be encoded differently. It also relied on creaming and sedimentation being encoded completely differently from stable samples; these changes are not very visual, and only a few samples showing them were available, which caused the network to ignore these changes.
For this reason, we presented a different, generative approach to identify these failures. Generative approaches are able to learn more detailed feature representations of the data by producing new realistic samples; we presented GANs as the architecture for this approach.
In chapter 6 we trained such a GAN in a semi-supervised way to classify the correct change for a given sequence. This approach offered better results on all classifications, though the accuracy for creaming and sedimentation remained quite low. This accuracy could be improved by using more detailed images or by increasing the number of samples to learn from. We also saw that when we tried to predict changes earlier, the classification accuracy dropped considerably, because these predictions still rely heavily on the labeling.
In chapter 7 we tried another approach to making earlier predictions: using a GAN to predict the final image of a sequence. We compared 3 different approaches and found that the best results are obtained with the absolute-difference pre-processing discussed in that chapter, because it captures the relationship between the first and the last image best.
Bibliography
[2] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-
shot image recognition,” 2015.
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
[8] D. Hubel and T. N. Wiesel, Brain and Visual Perception. Oxford University Press, 2005.
[9] C. Szegedy, W. Liu, et al., “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
[10] K. He, X. Zhang, et al., “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[11] “One shot learning with siamese networks in pytorch.” https://hackernoon.com/one-shot-learning-with-siamese-networks-in-pytorch-8ddaab10340e. Accessed: 2018-03-11.
[12] M. Arjovsky and L. Bottou, “Towards principled methods for training gener-
ative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.
[14] M. Arjovsky and L. Bottou, “Towards principled methods for training gener-
ative adversarial networks,” 2017.
[22] Q. Chen and V. Koltun, “Photographic image synthesis with cascaded re-
finement networks,” in IEEE International Conference on Computer Vision
(ICCV), 2017.