Abstract—Deep learning methods have had striking success over the past few years in applications that map high-dimensional, rich sensory input to labels. Deep generative models, however, have had less impact, owing to several obstacles. Generative Adversarial Networks (GANs) are a deep learning architecture that overcomes most of the obstacles faced by the generative models invented so far. In this paper we discuss how GANs turn out to be good generative models, and how modifications within them and combinations with other models bring about exceptional generative models.

Keywords— GAN, VAE, Deep learning, Generative Models, Similarity measures

I. INTRODUCTION

GANs are generative models whose goal is to capture a data distribution. A GAN has two distinct entities that work simultaneously but pursue conflicting goals. The two entities can be thought of as G (the generator) and D (the discriminator). The target of G is to capture the data distribution, while D yields the probability that a sample came from the training data rather than from G. When both G and D are multilayer perceptrons, the entire model can be trained using backpropagation.

Motivation: Deep learning models have shown tremendous success in applications that require discriminative models, which usually map high-dimensional, rich sensory input to a class label. The primary reasons behind this striking success are the backpropagation and dropout algorithms, used with piecewise linear units, which have a particularly well-behaved gradient. When it comes to generative models, however, deep learning has had a relatively smaller impact. The obstacles are the difficulty of approximating the many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and the difficulty of leveraging the benefits of piecewise linear units in the generative context. A new model, the adversarial nets framework, arose to overcome these obstacles.

[2] This architecture demands the availability of a large amount of data for training. Data is a very essential resource, and one does not always have access to the amount of data demanded by some generative models. One solution to this scenario is a divide-and-conquer approach to the problem. In [2], two separate GAN models are implemented that accomplish different individual tasks; doing so reduces the data demand relative to learning the same thing with a single GAN model.

[3] Some applications generate values for a particular data sample rather than from a data distribution in general; these require an implementation that preserves, or computes on, details specific to that sample. For example, [3] works on face aging using a GAN, and the application demands that the features of each person be preserved in the aged output of the model. To carry out this task, they implement an "Identity-Preserving" optimization of the GAN's latent vector.

[4] tackles the same kind of task by replacing the generator part of a vanilla GAN model with a variational autoencoder, since a VAE better captures similarities in the data space. They also replace element-wise errors with feature-wise errors. Along with exceptional visual fidelity in the generated output, high-level abstract features are achieved with negligible arithmetic modifications.

[5] employs a text-to-image GAN model that generates an image described by the text passed to it.

Abbreviations and Acronyms
GAN: Generative Adversarial Networks
cGAN: conditional Generative Adversarial Networks
VAE: Variational AutoEncoders
MLE: Maximum Likelihood Estimation
DCGAN: Deep Convolutional Generative Adversarial Nets
G, D in equations: generator network and discriminator network, respectively

II. ADVERSARIAL NETWORKS

Implementation of the GAN model is most unambiguous and clear-cut when both the discriminator and the generator are multilayer perceptrons. Training corresponds to the two-player minimax game

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

The objective is to learn the generator's distribution p_g over data x. A prior is defined on input noise variables, p_z(z), and G(z; \theta_g) maps noise to the data space, where \theta_g are the parameters of G. D(x; \theta_d) outputs a single scalar: the probability that an input x came from the data rather than from p_g; hence the form of the equation.
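The minimax value function and the alternating optimization it is trained with can be sketched in a few lines. This is a minimal, framework-free illustration, not the paper's implementation; the names value_fn, train, d_step, and g_step are ours.

```python
import math

def value_fn(d_real, d_fake):
    """V(D, G): mean log D(x) over real samples plus mean log(1 - D(G(z)))
    over generated ones; d_real / d_fake are discriminator outputs in (0, 1)."""
    return (sum(math.log(p) for p in d_real) / len(d_real)
            + sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))

def train(iterations, k, d_step, g_step):
    """Alternating schedule: k discriminator updates per generator update.
    d_step / g_step stand in for one gradient step on theta_d / theta_g."""
    for _ in range(iterations):
        for _ in range(k):
            d_step()  # ascend V in theta_d
        g_step()      # descend V in theta_g
```

At the equilibrium where D outputs 0.5 everywhere, V(D, G) equals -log 4, which value_fn reproduces for constant 0.5 inputs.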
D is trained to maximize the probability of assigning the correct label, and G to minimize \log(1 - D(G(z))). Optimizing D to completion in the inner loop is not computationally feasible. To overcome this, the algorithm implemented uses a predefined number of steps k: D is optimized for k steps for every single optimization step of G.

[3] Prototyping and modelling are the two approaches implemented by the face-aging technologies available today. The prototyping approach estimates an average face within each predefined age group; the aging patterns captured this way are then applied to input faces. This method is simple and fast, but its drawback is that it does not retain input-specific details in the output it produces.
In the modelling approach, parametric models simulate the aging mechanisms of the muscles, skin and skull of a particular individual. To achieve this, however, the model needs to train on images of people across a wide range of ages, which is a very costly task.

III. APPROACH METHODS

[2] Two GAN models work together to overcome the difficulty of low data availability. The goal of the model as a whole is to emulate, with two GAN models and less data, the results of a single GAN model trained with large data availability. In the model incorporated in [2], GAN1 generates an image containing only object contours, and GAN2 paints the black-and-white image in order to produce the final image. Accordingly, the output of GAN1 is the input of GAN2. The approach is based on two factors. The first factor is shape, which in their opinion is the most important of the visual features used in content-based image processing to convey the identity of an object, in contrast to color, texture or size.

Consider x as the contour image generated by GAN1, z as random noise, and y as the output of GAN2: GAN2 aims to map the noise z and the conditioning image x to y. (x, y) pairs are used to train the discriminator, which is expected to assign a positive label to true pairs and a negative label to generated pairs (G_2(z_i, x_i), x_i). The discriminator loss function, averaged over n samples, is

L_D = -\frac{1}{2n} \left( \sum_{i=1}^{n} \log D_2(y_i, x_i) + \sum_{i=1}^{n} \log(1 - D_2(G_2(z_i, x_i), x_i)) \right)

The loss function of G_2 is to maximize the probability assigned by the discriminator to generated samples:

L_G = -\frac{1}{n} \sum_{i=1}^{n} \log D_2(G_2(x_i, z_i), x_i)

As a whole, the objective function of the incorporated GAN model can be expressed as

\min_{G_2} \max_{D_2} V_{cGAN}(D_2, G_2) = \mathbb{E}_{x,y \sim p(x,y)}[\log D_2(y, x)] + \mathbb{E}_{x \sim p(x),\, z \sim p(z)}[\log(1 - D_2(G_2(x, z), x))]

The approach incorporated in [3] has two separate entities: (1) an Age-cGAN that generates quality synthetic images within age categories, and (2) a novel latent-vector optimization that enables the reconstruction of images from an input face while preserving the person's identity.

At a high level, the model is a two-step process. (1) Given an input image x and age y_0, find an optimal latent vector z^* that allows us to generate a reconstructed face \tilde{x} = G(z^*, y_0) as close as possible to the initial one. (2) Given the target age y_{target}, generate the resulting face image x_{target} = G(z^*, y_{target}) by simply switching the age at the input of the generator.

When generation with certain attributes (or conditions) is demanded, cGANs (conditional GANs) are used. As a function of the parameters of G and D, \theta_G and \theta_D, the training of a cGAN can be expressed as optimization of the function V(\theta_G, \theta_D):

\min_{\theta_G} \max_{\theta_D} V(\theta_G, \theta_D) = \mathbb{E}_{x,y \sim p_{data}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z),\, y \sim p_y}[\log(1 - D(G(z, y), y))]

The problems with a straightforward pixelwise Euclidean distance between the original image x and the reconstructed image \tilde{x} are two-fold. Firstly, it increases the blurriness of the reconstructions; secondly, it focuses on unnecessary details such as background and facial hair, which are not relevant for identity preservation. So the idea is to minimize the difference between FR(x) and FR(\tilde{x}), where FR is assumed to be an ideal face recognition network that recognizes a person's identity.
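The two losses above reduce to simple averages of log-probabilities once the discriminator outputs are available. The following sketch assumes plain Python lists of probabilities; the function names are ours, not taken from [2].

```python
import math

def discriminator_loss(d_real, d_fake):
    """L_D = -(1/2n) (sum_i log D2(y_i, x_i) + sum_i log(1 - D2(G2(z_i, x_i), x_i)));
    d_real are D2's outputs on true pairs, d_fake its outputs on generated pairs."""
    n = len(d_real)
    return -(sum(math.log(p) for p in d_real)
             + sum(math.log(1.0 - p) for p in d_fake)) / (2 * n)

def generator_loss(d_fake):
    """L_G = -(1/n) sum_i log D2(G2(x_i, z_i), x_i)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

Minimizing generator_loss pushes the discriminator's score on generated pairs toward 1, which matches the stated goal for G_2.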
[4] For image similarity, element-wise comparison is not an ideal method, because it differs from how a human would interpret images. If an image undergoes a small translation, the result is a huge pixel-wise error, whereas the change is nearly negligible to the human eye. This observation about human perception led to the conclusion that element-wise error is not a good similarity measure. To overcome this obstacle, [4] incorporates a method in which no hand-engineered similarity measure is used; instead, a similarity function is learnt. This is done by combining a VAE and a GAN model: the generator of the GAN and the decoder of the VAE are combined into one entity.

The advantage of the GAN's discriminator is that it implicitly learns a rich similarity metric for images in order to discriminate them from generated ones. This characteristic of the GAN model is exploited to transfer the properties of images learned by the discriminator into a more abstract reconstruction error for the VAE.

The combined model is trained with a triple criterion

\mathcal{L} = \mathcal{L}_{prior} + \mathcal{L}_{llike}^{Dis_l} + \mathcal{L}_{GAN}, where

\mathcal{L}_{prior} = D_{KL}(q(z|x) \,\|\, p(z)),

\mathcal{L}_{llike}^{Dis_l} = -\mathbb{E}_{q(z|x)}[\log p(Dis_l(x)|z)],

\mathcal{L}_{GAN} = \log Dis(x) + \log(1 - Dis(Gen(z))) + \log(1 - Dis(Dec(Enc(x)))),

with z \sim Enc(x) = q(z|x) and \tilde{x} \sim Dec(z) = p(x|z).

Specifically, since element-wise reconstruction errors are not adequate for images, the VAE reconstruction error term is replaced with a reconstruction error expressed in the GAN discriminator. To achieve this, Dis_l(x) denotes the hidden representation of the l-th layer of the discriminator.

[5] generates images conditioned on the input text. The contextual loss is implemented as

\ell_{cont} = -\frac{1}{m} \sum_i \log(D(G(z^{(i)}|h^{(i)}))),

where h is the text encoding, z is the noise input, and the superscript (i) denotes the i-th example.

For the perceptual loss, three perceptual loss functions are imposed, each aiming to enforce perceptual similarity between real and generated images.

(1) Pixel reconstruction loss. Adjusting the image to minimize pixel-wise losses is a simple approach to encouraging visual similarity between images. The pixel reconstruction loss calculates the mean squared error between a real image and the corresponding synthetic image, encouraging the pixels of the two images to match. Unlike image super-resolution, however, text-to-image synthesis involves a one-to-many mapping between two different kinds of data; thus high-level image features are found to be more appropriate.

(2) Activation reconstruction loss. Instead of promoting a pixel-wise match between synthetic and real images, the high-level feature representations of the images can be encouraged to be similar. Activation outputs derived from high layers capture image content and overall structure, such as object shapes, that may be useful for classifying objects. By minimizing the differences in the activation outputs, the generated image is encouraged to be classified similarly to the real image, thereby containing objects of the same class as those in the real image.

(3) Texture reconstruction loss. Although image content and overall structure are well captured in the activation outputs, style-related features such as texture and recurring patterns may not be. In order to capture whether the generated image and the real image use combinations of a nearly identical set of supporting filters, the Gram matrices of the activation outputs are compared.
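A Gram-matrix comparison of this kind can be sketched as follows. The names gram_matrix and texture_loss are illustrative, and the unnormalized Gram convention used here is one of several found in style-loss work, not necessarily the one used in [5].

```python
def gram_matrix(features):
    """Gram matrix G[i][j] = <F_i, F_j>: inner products between feature maps.
    `features` is a list of C feature maps, each flattened to the same length."""
    c = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(c)] for i in range(c)]

def texture_loss(real_features, fake_features):
    """Mean squared difference between the Gram matrices of the real and
    generated activations (smaller means more similar texture statistics)."""
    gr = gram_matrix(real_features)
    gf = gram_matrix(fake_features)
    c = len(gr)
    return sum((gr[i][j] - gf[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)
```

Because the Gram matrix discards spatial positions and keeps only filter co-occurrence statistics, two images can score identically on this loss while differing in layout, which is exactly why it complements the content-oriented activation loss.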
IV. RESULTS
[2]
REFERENCES