Ashish Sharma
14MCB1025
Guide: Dr. Geetha S.
CONTENTS:
Problem Statement.
Architecture Design.
Literature Survey.
Proposed Algorithm.
Implementation Techniques.
Modules and their current status.
Snapshot.
Expected Results.
Milestone.
References.
ARCHITECTURE DESIGN:
PROPOSED ALGORITHM
MULTIMODAL LEARNING
We present a model that generates natural language descriptions
of images and their regions. Our approach leverages datasets of
images and their sentence descriptions to learn about the
inter-modal correspondences between language and visual data.
Our alignment model is based on a novel combination of
Convolutional Neural Networks over image regions, bidirectional
Recurrent Neural Networks over sentences, and a structured
objective that aligns the two modalities through a multimodal
embedding. We then describe a Multimodal Recurrent Neural
Network architecture that uses the inferred alignments to learn
to generate novel descriptions of image regions. We demonstrate
that our alignment model produces state-of-the-art results in
retrieval experiments on the Flickr30K and MSCOCO datasets.
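The structured objective above can be sketched in code. The following is a minimal NumPy sketch, not the actual implementation: the region and word embeddings are random stand-ins for real CNN and bidirectional RNN features, and the scoring scheme (each word matched to its best region, summed over words, trained with a max-margin ranking loss in both retrieval directions) is one common way to realize such an alignment objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_regions(n_regions, dim):
    """Stand-in for CNN features of image regions, projected into the joint space."""
    return rng.standard_normal((n_regions, dim))

def embed_words(n_words, dim):
    """Stand-in for bidirectional RNN states of sentence words in the joint space."""
    return rng.standard_normal((n_words, dim))

def image_sentence_score(regions, words):
    """Score an image-sentence pair: align each word to its best-matching
    region (dot-product similarity) and sum over words."""
    sim = regions @ words.T          # region-word similarity matrix
    return sim.max(axis=0).sum()     # best region per word, summed

def ranking_loss(images, sentences, margin=1.0):
    """Max-margin structured objective: the true pair (k, k) should outscore
    every mismatched pair by at least `margin`, in both directions."""
    n = len(images)
    S = np.array([[image_sentence_score(images[i], sentences[j])
                   for j in range(n)] for i in range(n)])
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            loss += max(0.0, margin + S[k, l] - S[k, k])  # rank sentences given image k
            loss += max(0.0, margin + S[l, k] - S[k, k])  # rank images given sentence k
    return loss

images = [embed_regions(5, 16) for _ in range(3)]
sentences = [embed_words(8, 16) for _ in range(3)]
print(ranking_loss(images, sentences))
```

In a real system this loss would be minimized with gradients flowing back into the CNN and RNN projections, pulling matching image-sentence pairs together in the multimodal embedding.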
PROPOSED ALGORITHM
Convolutional Networks express a single differentiable function
from raw image pixel values to class probabilities.
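This "single differentiable function" view can be made concrete with a toy forward pass. Below is a minimal NumPy sketch under assumed shapes (an 8x8 single-channel image, 4 conv filters, 3 classes): convolution, ReLU, global average pooling, a linear layer, and softmax composed into one map from pixels to class probabilities.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation of a single-channel image x (H, W)
    with a bank of filters w of shape (n_filters, kh, kw)."""
    nf, kh, kw = w.shape
    H, W = x.shape
    out = np.empty((nf, H - kh + 1, W - kw + 1))
    for f in range(nf):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[f, i, j] = np.sum(x[i:i + kh, j:j + kw] * w[f])
    return out

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def tiny_cnn(x, w_conv, w_fc):
    """Pixels -> class probabilities as one composition of differentiable ops:
    convolution -> ReLU -> global average pool -> linear -> softmax."""
    h = np.maximum(conv2d(x, w_conv), 0.0)   # conv + ReLU
    pooled = h.mean(axis=(1, 2))             # global average pooling per filter
    return softmax(w_fc @ pooled)            # linear classifier + softmax

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # toy "image"
w_conv = rng.standard_normal((4, 3, 3))  # 4 conv filters of size 3x3
w_fc = rng.standard_normal((3, 4))       # 3 output classes
probs = tiny_cnn(x, w_conv, w_fc)
print(probs)
```

Because every step is differentiable, the whole pipeline can be trained end-to-end by backpropagating a classification loss through all parameters at once.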
IMPLEMENTATION TOOL:
MODULES:

MODULE DESCRIPTION                CURRENT STATUS
1.                                Completed
2. Collection of dataset.         Completed
3.                                Completed
4.                                Completed
5. Creation of visualizations.    Pending
EXPECTED RESULTS:
Decompose images and sentences into fragments and infer their
inter-modal alignment using a ranking objective. Our model aligns
contiguous segments of sentences, which are more meaningful,
interpretable, and not fixed in length. On the image side, we
build on Convolutional Neural Network (CNN) models for image
classification and object detection. On the sentence side, our
work takes advantage of pretrained word vectors to obtain
low-dimensional representations of words. Finally, Recurrent
Neural Networks have previously been used in language modeling.
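The sentence side described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the "pretrained" word vectors are random stand-ins, and the bidirectional RNN simply runs a tanh recurrence forward and backward and concatenates the two states, so each word's representation carries context from both sides of the sentence.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(xs, W_x, W_h):
    """Simple tanh RNN over a sequence of word vectors; returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.array(states)

def birnn(word_vecs, W_x, W_h):
    """Bidirectional RNN: run the recurrence left-to-right and right-to-left,
    then concatenate, giving each word a context-aware representation."""
    fwd = rnn_pass(word_vecs, W_x, W_h)
    bwd = rnn_pass(word_vecs[::-1], W_x, W_h)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

d_word, d_hid, n_words = 16, 8, 6
word_vecs = rng.standard_normal((n_words, d_word))  # stand-in for pretrained vectors
W_x = rng.standard_normal((d_hid, d_word)) * 0.1
W_h = rng.standard_normal((d_hid, d_hid)) * 0.1
states = birnn(word_vecs, W_x, W_h)
print(states.shape)   # one (2 * d_hid)-dimensional state per word
```

These per-word states are what would be projected into the multimodal embedding space and matched against image-region features.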
MILESTONE:
WORK TASK                         PLANNED START   PLANNED COMPLETE
1.                                SEPTEMBER       SEPTEMBER
2. Collection of reviews.         OCTOBER         OCTOBER
3.                                OCTOBER         NOVEMBER
4. Design and Implementation.     DECEMBER        JANUARY
5. Creation of visualizations.    MARCH           MARCH
REFERENCES:
A. Krizhevsky. Learning multiple layers of features from tiny images. Master's
thesis, Department of Computer Science, University of Toronto, 2009.
A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010.
www.imagenet.org/challenges, 2010.
A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-based
image retrieval. In ESANN, 2011.
Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and
applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, pages 253-256. IEEE, 2010.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proceedings
of the 26th Annual International Conference on Machine Learning, pages 609-616.
ACM, 2009.
THANK YOU!!