
Introduction

In fine art, especially painting, humans have mastered the skill of creating unique visual
experiences through composing a complex interplay between the content and style of an
image. Thus far the algorithmic basis of this process is unknown and there exists no artificial
system with similar capabilities. However, in other key areas of visual perception, such as
object and face recognition, near-human performance was recently demonstrated by a class of
biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a
Deep Neural Network that creates artistic images of high perceptual quality. The system uses
neural representations to separate and recombine content and style of arbitrary images,
providing a neural algorithm for the creation of artistic images.

The class of Deep Neural Networks that is most powerful in image processing tasks is
called Convolutional Neural Networks. Convolutional Neural Networks consist of layers of
small computational units that process visual information hierarchically in a feed-forward
manner. Each layer of units can be understood as a collection of image filters, each of which
extracts a certain feature from the input image. Thus, the output of a given layer consists of
so-called feature maps: differently filtered versions of the input image.

When Convolutional Neural Networks are trained on object recognition, they develop a
representation of the image that makes object information increasingly explicit along the processing
hierarchy. Therefore, along the processing hierarchy of the network, the input image is transformed
into representations that increasingly care about the actual content of the image compared to its
detailed pixel values. We can directly visualise the information each layer contains about the input
image by reconstructing the image only from the feature maps in that layer. Higher layers in the
network capture the high-level content in terms of objects and their arrangement in the input image
but do not constrain the exact pixel values of the reconstruction. In contrast, reconstructions from the
lower layers simply reproduce the exact pixel values of the original image. We therefore refer to the
feature responses in higher layers of the network as the content representation.

To obtain a representation of the style of an input image, we use a feature space originally designed to
capture texture information. This feature space is built on top of the filter responses in each layer of
the network. It consists of the correlations between the different filter responses over the spatial extent
of the feature maps. By including the feature correlations of multiple layers, we obtain a stationary,
multi-scale representation of the input image, which captures its texture information but not the global
arrangement.

Fig -1(a) Examples from the Gatys et al. paper

Again, we can visualise the information captured by these style feature spaces built on different layers
of the network by constructing an image that matches the style representation of a given input image.

Indeed reconstructions from the style features produce texturised versions of the input image that
capture its general appearance in terms of colour and localised structures. Moreover, the size and
complexity of local image structures from the input image increases along the hierarchy, a result that
can be explained by the increasing receptive field sizes and feature complexity. We refer to this multi-
scale representation as style representation. The key finding of this paper is that the representations of
content and style in the Convolutional Neural Network are separable. That is, we can manipulate both
representations independently to produce new, perceptually meaningful images. To demonstrate this
finding, we generate images that mix the content and style representation from two different source
images. The images are synthesised by finding an image that simultaneously matches the content
representation of the photograph and the style representation of the respective piece of art (see
Methods for details). While the global arrangement of the original photograph is preserved, the
colours and local structures that compose the global scenery are provided by the artwork. Effectively,
this renders the photograph in the style of the artwork, such that the appearance of the synthesised
image resembles the work of art, even though it shows the same content as the photograph.

When matching the style representations up to higher layers in the network, local image structures
are matched on an increasingly large scale, leading to a smoother and more continuous visual
experience. Thus, the visually most appealing images are usually created by matching the style
representation up to the highest layers in the network. Of course, image content and style cannot be
completely disentangled. When synthesising an image that combines the content of one image with
the style of another, there usually does not exist an image that perfectly matches both constraints at
the same time. However, the loss function we minimise during image synthesis contains two terms for
content and style respectively, that are well separated (see Methods). We can therefore smoothly
regulate the emphasis on either reconstructing the content or the style. A strong emphasis on style
will result in images that match the appearance of the artwork, effectively giving a texturised version
of it, but hardly show any of the photograph’s content. When placing strong emphasis on content,
one can clearly identify the photograph, but the style of the painting is not as well-matched (Fig 3,
last column). For a specific pair of source images one can adjust the trade-off between content and
style to create visually appealing images. Here we present an artificial neural system that achieves
separation of image content from style, thus allowing us to recast the content of one image in the style of
any other image. We demonstrate this by creating new, artistic images that combine the style of
several well-known paintings with the content of an arbitrarily chosen photograph. In particular, we
derive the neural representations for the content and style of an image from the feature responses of
high performing Deep Neural Networks trained on object recognition. To our knowledge this is the
first demonstration of image features separating content from style in whole natural images.

Software Requirement Specification

Architecture Diagram

Fig -3(a) Architecture of Algorithm

Fig -3(b) Architecture of VGG-19

Fig -3(b)-In this project we have used a pre-trained convolutional neural network, VGG-19, which has
been trained on millions of images to classify them into 1,000 object classes. We have not
used the last 3 layers of the network, since those layers are fully connected layers used for
classification purposes, which is not required in this project.
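As an illustration, the following minimal sketch (assuming a PyTorch/torchvision implementation; the variable names are ours) loads the pre-trained VGG-19 and keeps only its convolutional feature extractor:

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# .features keeps only the convolutional/pooling part of VGG-19 and
# discards the fully connected classification layers mentioned above.
vgg = models.vgg19(pretrained=True).features.to(device).eval()

# The network weights stay frozen; only the generated image is optimised.
for param in vgg.parameters():
    param.requires_grad_(False)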

Fig -3(c) DFD of neural style transfer

Methodology

In this method, we do not use a neural network in a true sense. That is, we aren’t training a
network to do anything. We are simply taking advantage of backpropagation to minimize two
defined loss values. The tensor which we backpropagate into is the stylized image we wish to
achieve — which we call the pastiche from here on out. We also have as inputs the artwork
whose style we want to transfer, known as the style image, and the picture that we want to
transfer the style onto, known as the content image.

The pastiche is initialized to be random noise. It, along with the content and style images, is
then passed through several layers of a network that is pretrained on image classification. We
use the outputs of various intermediate layers to compute two types of losses: style loss and
content loss — that is, how close is the pastiche to the style image in style, and how close is
the pastiche to the content image in content. Those losses are then minimized by directly
changing our pastiche image. By the end of a few iterations, the pastiche image now has the
style of the style image and the content of the content image — or, said differently, it is a
stylized version of the original content image.
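A minimal sketch of this setup, again assuming a PyTorch implementation and the frozen VGG-19 feature extractor from the sketch above (load_image is a hypothetical helper that returns a normalised image tensor):

import torch

content_img = load_image("content.jpg").to(device)  # picture whose content we keep
style_img = load_image("style.jpg").to(device)      # artwork whose style we transfer

# The pastiche starts as random noise and is the only tensor we optimise.
pastiche = torch.randn_like(content_img).requires_grad_(True)

# The optimiser updates the pastiche pixels directly, not any network weights.
optimizer = torch.optim.Adam([pastiche], lr=0.05)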

Losses
Before we dive into the math and intuition behind the losses, let’s address a concern you may
have. You may be wondering why we use the outputs of intermediate layers of a pretrained
image classification network to compute our style and content losses. This is because, for a
network to be able to do image classification, it has to understand the image. So, between
taking the image as input and outputting its guess at what it is, it’s doing transformations to
turn the image pixels into an internal understanding of the content of the image.

We can interpret these internal understandings as intermediate semantic representations of


the initial image and use those representations to “compare” the content of two images. As an
example: if we pass two images of cats through an image classification network, even if the
initial images look very different, after being passed through many internal layers, their
representations will be very close in raw value. This is the content loss — pass both the
pastiche image and the content image through some layers of an image classification network
and find the Euclidean distance between the intermediate representations of those images.
Here’s the equation for content loss:

Content Loss Equation
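In the notation of Gatys et al., with \vec{p} the content image, \vec{x} the pastiche, and F^l, P^l their feature maps in layer l, the content loss for one layer reads

L_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}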

The summation notation makes the concept look harder than it really is. Basically,
we make a list of layers at which we want to compute content loss. We pass the
content and pastiche images through the network until a particular layer in the list,
take the output of that layer, square the difference between each corresponding value
in the output, and sum them all up. We do this for every layer in the list, and sum
those up. One thing to note, though: we multiply each of the representations by some
value alpha (called the content weight) before finding their differences and
squaring it, whereas the original equation calls for the value to be multiplied after
squaring it. I found, in practice, the former to work much better than the latter, as it
produces appealing stylizations much more quickly.
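A minimal sketch of this content loss in PyTorch (the features at the chosen layers are assumed to have been collected into dictionaries by a hypothetical helper; the layer name and alpha value are illustrative):

content_layers = ["conv4_2"]  # layers at which content is compared
alpha = 1.0                   # content weight

def content_loss(pastiche_feats, content_feats):
    loss = 0.0
    for layer in content_layers:
        # Weight, subtract, square and sum the corresponding activations.
        diff = alpha * pastiche_feats[layer] - alpha * content_feats[layer]
        loss = loss + (diff ** 2).sum()
    return loss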

The style loss is very similar, except instead of comparing the raw outputs of the style
and pastiche images at various layers, we compare the Gram matrices of the
outputs. A Gram matrix results from multiplying a matrix with the transpose of
itself:

Gram Matrix Equation
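Concretely, if F^l is the matrix whose rows are the vectorised feature maps of layer l, the Gram matrix from Gatys et al. is

G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

i.e. the inner product between the responses of filters i and j over all spatial positions k.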

Because every column is multiplied with every row in the matrix, we can think of the
spatial information that was contained in the original representations as having been
“distributed”. The Gram matrix instead contains non-localized information about the
image, such as texture, shapes, and weights — style!
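A minimal sketch of this computation for a single feature tensor of shape (channels, height, width):

def gram_matrix(feat):
    c, h, w = feat.size()
    flat = feat.view(c, h * w)  # one row per filter, flattened over space
    return flat @ flat.t()      # correlations between filter responses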

Now that we have defined the Gram matrix as having information about style, we can
find the Euclidean distance between the Gram matrices of the intermediate
representations of the pastiche and style image to find how similar they are in style:

Style Loss Equation
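In the notation of Gatys et al., with G^l and A^l the Gram matrices of the pastiche and the style image in layer l, N_l the number of filters and M_l the feature-map size, the style loss is

E_{l} = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}, \qquad L_{style} = \sum_{l} w_{l} E_{l}

where w_l are the per-layer weighting factors.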

Similar to the content loss computation, we find the Euclidean distances between each
corresponding pair of values in the Gram matrices computed at each layer in a
predefined list of layers, multiplied by some value beta (known as the style weight).
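A minimal sketch of the style loss, reusing the gram_matrix helper above (the layer names and beta value are illustrative):

style_layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
beta = 1e3  # style weight

def style_loss(pastiche_feats, style_feats):
    loss = 0.0
    for layer in style_layers:
        g_p = gram_matrix(pastiche_feats[layer])
        g_s = gram_matrix(style_feats[layer])
        loss = loss + beta * ((g_p - g_s) ** 2).sum()
    return loss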

We have the content loss — which contains information on how close the pastiche is
in content to the content image — and the style loss — which contains information on
how close the pastiche is in style to the style image. We can now add them together
to get the total loss. We then backpropagate through the network to reduce this loss
by getting a gradient on the pastiche image and iteratively changing it to make it look
more and more like a stylized content image. This is all described in more rigorous
detail in the original paper on the topic by Gatys et al.

Total Loss
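With the content and style weights alpha and beta, the total loss from Gatys et al. takes the form

L_{total} = \alpha \, L_{content} + \beta \, L_{style}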

Total loss is the sum of the content loss and the style loss derived above. In the total loss
we can also control the ratio of content and style in the generated image by setting the
hyperparameters alpha and beta. To get the best result, our aim is to minimize the total loss,
changing the pixel values of the generated image in the process.
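Putting the pieces together, a minimal sketch of the optimisation loop (extract_features is a hypothetical helper that runs an image through the frozen VGG-19 and returns the activations of the chosen layers; the number of iterations is illustrative):

content_feats = extract_features(vgg, content_img)
style_feats = extract_features(vgg, style_img)

for step in range(300):
    optimizer.zero_grad()
    pastiche_feats = extract_features(vgg, pastiche)
    loss = content_loss(pastiche_feats, content_feats) + style_loss(pastiche_feats, style_feats)
    loss.backward()    # gradients flow only into the pastiche image
    optimizer.step()   # update the generated image's pixel values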

Screenshots

Conclusion and Future Scope

