
Generative Adversarial Nets: Optimizations and Functioning

Shiv Santosh J
Computer Science & Engineering
PES University
Bengaluru, India
<shivsj9755@gmail.com>

Harshavardhan Biradar
Computer Science & Engineering
PES University
Bengaluru, India
<harshabiradar84@gmail.com>

Abstract—Deep learning methods have had striking success over the past few years in applications that map high-dimensional, rich sensory input to labels. But deep generative models have had less impact due to several obstacles. Generative Adversarial Networks (GANs) are a deep learning architecture that overcomes most of the obstacles faced by the generative models invented so far. In this paper we discuss how GANs turn out to be good generative models, and how modifications within them and combinations with other models bring about exceptional generative models.

Keywords—GAN, VAE, Deep learning, Generative Models, Similarity measures

I. INTRODUCTION

GANs are generative models whose goal is to capture a data distribution. A GAN has two distinct entities that work simultaneously but pursue conflicting goals. The two entities can be thought of as G (generator) and D (discriminator). The target of G is to capture the data distribution, while D yields the probability that a sample came from the training data rather than from a generation of G. When the models G and D are both multilayer perceptrons, the entire training can be done using backpropagation.

Motivation: Deep learning models have shown tremendous success in applications that require discriminative models. These applications usually map high-dimensional, rich sensory input to a class label. The primary reasons behind this striking success are the backpropagation and dropout algorithms, using piecewise linear units which have a particularly well-behaved gradient. But when it comes to generative models, deep learning has had a relatively smaller impact. The obstacles faced are the difficulty of approximating the many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and the difficulty of leveraging the benefits of piecewise linear units in the generative context. A new model arose proposing the adversarial nets framework.

[2] This architecture demands the availability of a large amount of data for training. Data is a very essential resource, and one does not always have access to the amount of data demanded by some generative models. One solution for such a scenario is to apply a divide-and-conquer approach to the problem. In [2], two separate GAN models are implemented which accomplish different individual tasks. Doing so reduces the data demand compared to learning the same thing with a single GAN model.

[3] Applications that generate values for a particular data sample, rather than from a data distribution in general, need some mechanism that preserves, or computes on, details specific to that sample. For example, [3], which performs face aging with a GAN, demands that the features of each person be preserved in the aged output of the model. To carry out this task, the authors implemented an "Identity-Preserving" optimization of the GAN's latent vector.

[4] approaches a similar goal by replacing the generator part of a vanilla GAN model with a variational autoencoder, since a VAE better captures similarities in the data space. The authors also replaced element-wise errors with feature-wise errors. Along with exceptional visual fidelity in the generated output, high-level abstract features were achieved using negligible modifications.

[5] employs a text-to-image GAN model that generates an image described by the text passed to it.

Abbreviations and Acronyms

GAN: Generative Adversarial Networks
cGAN: conditional Generative Adversarial Networks
VAE: Variational AutoEncoders
MLE: Maximum Likelihood Estimation
DCGAN: Deep Convolutional Generative Adversarial Nets
G, D in equations: generator network and discriminator network respectively

II. ADVERSARIAL NETWORKS

The implementation of a GAN model is most unambiguous and clear-cut when both the discriminator and the generator are multilayer perceptrons.

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]

The objective is to learn the generator's distribution p_g over data x. A prior is defined on input noise variables p_z(z), and then a mapping to the data space as G(z; θ_g), where θ_g are the parameters of G. D(x; θ_d) outputs a single scalar: the probability that an input x came from the data rather than from p_g.
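The value function above can be illustrated numerically. The following is a minimal sketch, not from the paper: the logistic discriminator and affine generator are illustrative assumptions, used only to evaluate the two expectation terms as sample means on toy one-dimensional data.

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x, theta_d):
    """Toy discriminator: logistic regression, outputs P(x is real)."""
    return 1.0 / (1.0 + np.exp(-(theta_d[0] * x + theta_d[1])))

def G(z, theta_g):
    """Toy generator: affine map from noise z to data space."""
    return theta_g[0] * z + theta_g[1]

def value_fn(x_real, z_noise, theta_d, theta_g):
    """V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], estimated by sample means."""
    term_real = np.mean(np.log(D(x_real, theta_d)))
    term_fake = np.mean(np.log(1.0 - D(G(z_noise, theta_g), theta_d)))
    return term_real + term_fake

x_real = rng.normal(loc=4.0, scale=0.5, size=1000)   # samples from p_data
z_noise = rng.normal(size=1000)                       # samples from the prior p_z
v = value_fn(x_real, z_noise, theta_d=(1.0, -2.0), theta_g=(0.5, 0.0))
print(v)  # both log terms are negative, so V < 0
```

D seeks parameters that push V up (real samples toward probability 1, fakes toward 0), while G seeks to pull the second term down, which is the conflict the minimax formulation encodes.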
This explains the equation: D is trained to maximize the probability of assigning the correct label, while G is trained to minimize log(1 - D(G(z))).

Optimizing D to completion in the inner loop is not computationally feasible. To overcome this, the implemented algorithm performs a predefined number of steps k of optimizing D for every single optimization step of G.

III. APPROACH METHODS

[2] Two GAN models work together to overcome the difficulty of low data availability. The goal of the model as a whole is to emulate, with two GANs and less data, the results of a single GAN model trained with large data availability. In the model incorporated in [2], GAN1 generates an image containing only object contours, and GAN2 paints the black-and-white image in order to produce the final image. Accordingly, the output of GAN1 is the input of GAN2. The approach rests on the observation that shape, in the authors' opinion, is the most important element for conveying the identity of an object among the visual features used in content-based image processing, as opposed to color, texture or size.

Let x be the contour image produced by GAN1, z random noise, and y the output of GAN2. GAN2 aims to map the random noise z together with x to y. (x, y) pairs are used to train the discriminator. Expecting the discriminator to assign a positive label to true pairs and a negative label to generated pairs (G2(z_i, x_i), x_i), we get the discriminator loss function, averaged over n samples:

L_D = -(1/2n) ( Σ_{i=1..n} log D2(y_i, x_i) + Σ_{i=1..n} log(1 - D2(G2(z_i, x_i), x_i)) )

The loss function of G2 is to maximize the probability assigned by the discriminator to generated samples:

L_G = -(1/n) Σ_{i=1..n} log D2(G2(x_i, z_i), x_i)

As a whole, the objective function of the incorporated GAN model can be expressed as

min_{G2} max_{D2} V_cGAN(D2, G2) = E_{x,y~p(x,y)}[log D2(y, x)] + E_{x~p(x), z~p(z)}[log(1 - D2(G2(x, z), x))]

[3] Prototyping and modelling are the two approaches implemented by the face aging technologies available today. The prototyping approach estimates average faces within predefined age groups; these capture aging patterns which are in turn applied to input faces. This method is simple and fast, but its drawback is that it does not retain input-specific details in the output it produces. The modelling approach employs parametric models which simulate the aging mechanisms of the muscles, skin and skull of a particular individual. But to achieve this, the model needs to train on images of people across a wide range of ages, which is very costly.

The approach incorporated in [3] has two separate entities: (1) an Age-cGAN to generate quality synthetic images within age categories, and (2) a novel latent vector optimization that enables reconstruction of images from an input face while preserving the person's identity.

At a high level, the model is a two-step process. (1) Given an input image x and age y_0, find an optimal latent vector z* which allows us to generate a reconstructed face x̃ = G(z*, y_0) as close as possible to the initial one. (2) Given the target age y_target, generate the resulting face image x_target = G(z*, y_target) by simply switching the age at the input of the generator.

When there is a demand for generation with certain attributes (or conditions), cGANs (conditional GANs) are used. As a function of the parameters of G and D, θ_G and θ_D, the training of a cGAN can be expressed as an optimization of the function V(θ_G, θ_D):

min_{θ_G} max_{θ_D} V(θ_G, θ_D) = E_{x,y~p_data}[log D(x, y)] + E_{z~p_z(z), y~p_y}[log(1 - D(G(z|y), y))]

The problem with a straightforward pixelwise Euclidean distance between the original image x and the reconstructed image x̃ is two-fold. Firstly, it increases the blurriness of reconstructions; secondly, it focuses on unnecessary details such as background, facial hair, etc., which are not relevant for identity preservation. So the idea is to minimize the difference between FR(x) and FR(x̃), where FR is assumed to be an ideal face recognition network that recognizes a person's identity.
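The identity-preserving reconstruction step can be sketched as a small optimization over the latent vector z. This is an illustrative toy, not the paper's implementation: the linear generator and linear "FR" feature extractor below are assumptions standing in for the trained networks of [3]; what it shows is the structure of minimizing ||FR(G(z, y_0)) - FR(x)||^2 by gradient descent on z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (assumptions): in [3] these are a trained face generator
# and a face-recognition network, both held fixed during this step.
A = 0.5 * rng.normal(size=(8, 4))   # "generator" weights: latent (4) -> image (8)
F = 0.5 * rng.normal(size=(3, 8))   # "FR" embedding: image (8) -> identity features (3)

def G(z, y0):
    """Toy conditional generator; the age condition y0 simply shifts the output."""
    return A @ z + y0

def FR(img):
    """Toy identity-feature extractor."""
    return F @ img

def identity_loss(z, x, y0):
    """|| FR(G(z, y0)) - FR(x) ||^2 -- the quantity minimized over z."""
    diff = FR(G(z, y0)) - FR(x)
    return float(diff @ diff)

def optimize_z(x, y0, steps=300):
    """Gradient descent on z. For this linear toy the loss is
    ||M z + F (y0 - x)||^2 with M = F A, whose gradient is 2 M^T (M z + F (y0 - x))."""
    z = np.zeros(4)
    M = F @ A
    lr = 0.5 / np.linalg.norm(M, ord=2) ** 2   # safe step size for this quadratic
    for _ in range(steps):
        z -= lr * 2.0 * M.T @ (M @ z + F @ (y0 - x))
    return z

x = rng.normal(size=8)        # "input face"
y0 = 0.1 * np.ones(8)         # toy age condition
z_star = optimize_z(x, y0)
print(identity_loss(np.zeros(4), x, y0), "->", identity_loss(z_star, x, y0))
```

Once z* is found, step (2) of [3] amounts to re-running the generator with the same z* but a different age condition.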
[4] For image similarity, element-wise comparison is not an ideal method, because it differs from how a human would interpret images. A small translation of an image results in a huge pixel-wise error, whereas it is nearly negligible to the human eye. This observation led to the conclusion that element-wise error calculation is not a good similarity measure. To overcome this obstacle, [4] incorporates a method where no hand-engineered similarity measure is used; instead, a function is learnt. This is done by combining a VAE and a GAN model: the generator of the GAN and the decoder of the VAE are combined into one entity.

The advantage of a GAN's discriminator is that it implicitly learns a rich similarity metric for images, so as to discriminate them from generated ones. This characteristic of the GAN model was exploited to transfer the properties of images learned by the discriminator into a more abstract reconstruction error for the VAE.

The combined model is trained with a triple criterion:

ℒ = ℒ_prior + ℒ_llike^{Dis_l} + ℒ_GAN, where

ℒ_prior = D_KL(q(z|x) || p(z)),

ℒ_llike^{Dis_l} = -E_{q(z|x)}[log p(Dis_l(x)|z)],

ℒ_GAN = log(Dis(x)) + log(1 - Dis(Gen(z))) + log(1 - Dis(Dec(Enc(x)))),

with z ~ Enc(x) = q(z|x) and x̃ ~ Dec(z) = p(x|z).

Specifically, since element-wise reconstruction errors are not adequate for images, the VAE reconstruction error term was replaced with a reconstruction error expressed in the GAN discriminator. To achieve this, let Dis_l(x) denote the hidden representation of the l-th layer of the discriminator.

[5] For the contextual loss, the implementation is as follows:

ι_cont = -(1/m) Σ_i log(D(G(z^(i) | h^(i)))),

where h is the text encoding, z is the noise input, and superscript (i) denotes the i-th example.

For the perceptual loss, three perceptual loss functions are imposed, each aiming to enforce perceptual similarity between real and generated images.

(1) Pixel reconstruction loss. Adjusting the image to minimize pixel-wise losses is a simple approach to encourage visual similarity between images. The pixel reconstruction loss calculates the mean squared error between a real image and the corresponding synthetic image, encouraging the pixels of the two images to match. Unlike image super-resolution, however, text-to-image synthesis involves a one-to-many mapping between two different kinds of data, so high-level image features are found to be more appropriate.

(2) Activation reconstruction loss. Instead of promoting a pixel-wise match between synthetic and real images, high-level feature representations of the images can be encouraged to be similar. Activation outputs derived from high layers capture image content and overall structure, such as object shapes, that may be useful for classifying objects. By minimizing the differences in the activation outputs, the generated image is encouraged to be classified similarly to the real image, thereby containing objects of the same class as those in the real image.

(3) Texture reconstruction loss. Although image content and overall structure are well captured in the activation outputs, style-related features such as texture and recurring patterns may not be. To capture whether the generated image and the real image use combinations of a nearly identical set of supporting filters, the Gram matrices of the activation outputs are compared.
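The three perceptual losses above can be sketched in a few lines of numpy. This is an illustrative toy, not the implementation of [5]: the `activations` function below is an assumption, a tiny hand-rolled convolution standing in for a layer of a pretrained network such as VGG.

```python
import numpy as np

def pixel_loss(real, fake):
    """(1) Pixel reconstruction: mean squared error between the two images."""
    return float(np.mean((real - fake) ** 2))

def activations(img, filters):
    """Toy feature extractor standing in for a pretrained conv layer:
    one feature map per filter via valid 2-D cross-correlation."""
    h, w = img.shape
    fh, fw = filters.shape[1:]
    maps = np.empty((filters.shape[0], h - fh + 1, w - fw + 1))
    for k, f in enumerate(filters):
        for i in range(maps.shape[1]):
            for j in range(maps.shape[2]):
                maps[k, i, j] = np.sum(img[i:i + fh, j:j + fw] * f)
    return maps

def activation_loss(real, fake, filters):
    """(2) Activation reconstruction: MSE between high-level feature maps."""
    return float(np.mean((activations(real, filters) - activations(fake, filters)) ** 2))

def gram(maps):
    """Gram matrix of feature maps: channel-by-channel inner products."""
    c = maps.shape[0]
    flat = maps.reshape(c, -1)
    return flat @ flat.T / flat.shape[1]

def texture_loss(real, fake, filters):
    """(3) Texture reconstruction: MSE between Gram matrices of the activations."""
    g_r = gram(activations(real, filters))
    g_f = gram(activations(fake, filters))
    return float(np.mean((g_r - g_f) ** 2))

rng = np.random.default_rng(2)
real = rng.normal(size=(8, 8))
fake = real + 0.1 * rng.normal(size=(8, 8))   # slightly perturbed "generated" image
filters = rng.normal(size=(3, 3, 3))          # three toy 3x3 filters
print(pixel_loss(real, fake), activation_loss(real, fake, filters), texture_loss(real, fake, filters))
```

Note how the Gram matrix discards spatial position entirely: only which filters co-activate matters, which is why it captures texture rather than structure.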

[5] With a text-to-image conversion model, one has to handle both contextual and perceptual losses. To ensure good cross-modal translation, the authors adopted a contextual loss term in the generator following the conditional GAN framework. To generate realistic images, they additionally introduce perceptual loss terms for the generator, corresponding to the pixel, feature activation, and texture reconstruction losses. Thus their approach is to regularize the original minimax optimization for the GAN with both contextual and perceptual loss terms. At a high level, a DCGAN is built and trained with contextual and perceptual loss terms by conditioning on the input text.

IV. RESULTS

[2]
Starting from the first row, these are the results of the Two-Stage GAN, a straightforward GAN architecture, and real images. Based on the idea that shape information plays a more important role than color and texture for general object identification, an adversarial approach is presented in order to strengthen the contribution of shape information. It has thus been demonstrated that the proposed approach can generate better bird images than a typical GAN model, with or without shape information, on a small-sized dataset, and the results also show that the quality of the generated images is comparable to real ones.

[3] This shows the effectiveness of synthetic aging of human faces based on the Age Conditional Generative Adversarial Network (Age-cGAN). The method is composed of two steps: (1) input face reconstruction, requiring the solution of an optimization problem in order to find an optimal latent approximation z*, and (2) face aging itself, performed by a simple change of the condition y at the input of the generator. The cornerstone of this method is the novel "Identity-Preserving" latent vector optimization approach, which allows preserving the original person's identity in the reconstruction. This approach is universal, meaning that it can be used to preserve identity not only for face aging but also for other face alterations (e.g. adding a beard, sunglasses, etc.).

[4] The problems with element-wise distance metrics are well known in the literature, and many attempts have been made at going beyond pixels, typically using hand-engineered measures. Much in the spirit of deep learning, the authors argue that the similarity measure is yet another component which can be replaced by a learned model capable of capturing high-level structure relevant to the data distribution. The main contribution of this work was an unsupervised scheme for learning and applying such a distance measure. This was the first attempt at unsupervised learning of encoder-decoder models as well as a similarity measure. Their results show that the visual fidelity of the method employed is competitive with that of GANs, which in that regard are considered state-of-the-art.

[5] Generated bird images with three different perceptual reconstruction losses: pixel (GAN-INT-CLS-Pixel), activation (GAN-INT-CLS-VGG), and texture (GAN-INT-CLS-Gram). In [5], GAN-based text-to-image synthesis methods that use both contextual and perceptual losses have been described. The contextual loss in existing GAN literature focuses on the semantic relatedness between text and image, whereas the proposed perceptual loss focuses on object-specific structure.

V. CONCLUSION

GANs come with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of the generator's data distribution, and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, much as the negative chains of a Boltzmann machine must be kept up to date between learning steps). The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model.

The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.
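The D-G synchronization requirement above is what the alternating scheme of Section II enforces: k discriminator updates per generator update. The following is a toy numpy sketch on one-dimensional data; the logistic discriminator, affine generator, and finite-difference gradients are all illustrative assumptions, not the setup of any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(3)

def D(x, d):
    """Toy discriminator: sigmoid(d0 * x + d1)."""
    return 1.0 / (1.0 + np.exp(-(d[0] * x + d[1])))

def G(z, g):
    """Toy generator: affine map of the noise."""
    return g[0] * z + g[1]

def d_loss(d, g, x, z):
    """D ascends V, i.e. descends -E[log D(x)] - E[log(1 - D(G(z)))]."""
    p_real = np.clip(D(x, d), 1e-7, 1 - 1e-7)          # clip for numerical safety
    p_fake = np.clip(D(G(z, g), d), 1e-7, 1 - 1e-7)
    return -np.mean(np.log(p_real)) - np.mean(np.log(1 - p_fake))

def g_loss(d, g, z):
    """G descends E[log(1 - D(G(z)))], the minimax form from Section II."""
    p_fake = np.clip(D(G(z, g), d), 1e-7, 1 - 1e-7)
    return np.mean(np.log(1 - p_fake))

def num_grad(f, p, eps=1e-5):
    """Central finite-difference gradient, to keep the toy dependency-free."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        grad[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return grad

x_real = rng.normal(loc=3.0, scale=1.0, size=512)      # real data centred at 3.0
d = np.array([0.1, 0.0])
g = np.array([1.0, 0.0])
k, lr = 5, 0.05
for _ in range(200):
    for _ in range(k):                                  # k discriminator steps ...
        z = rng.normal(size=512)
        d -= lr * num_grad(lambda p: d_loss(p, g, x_real, z), d)
    z = rng.normal(size=512)                            # ... per single generator step
    g -= lr * num_grad(lambda p: g_loss(d, p, z), g)
print(g)  # g[1] drifts upward, toward the real-data region
```

Progress is slow here because once D confidently rejects fakes, log(1 - D(G(z))) yields vanishing gradients; this is why [1] suggests having G maximize log D(G(z)) instead in practice.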

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proceedings of NIPS, 2014.

[2] Q. Huang, P. Jackson, M. D. Plumbley, and W. Wang, "Synthesis of images by two-stage generative adversarial networks," in Proc. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 15–20 Apr. 2018.

[3] G. Antipov, M. Baccouche, and J.-L. Dugelay, "Face aging with conditional generative adversarial networks," arXiv:1702.01983v2 [cs.CV], 30 May 2017.

[4] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv:1512.09300v2 [cs.LG], 10 Feb. 2016.

[5] M. Cha, Y. Gwon, and H. T. Kung, "Adversarial nets with perceptual losses for text-to-image synthesis," arXiv:1708.09321v1 [cs.CV], 30 Aug. 2017; accepted to the 2017 IEEE International Workshop on Machine Learning for Signal Processing.
