
ABSTRACT

The goal of lip reading is to decode and analyze a speaker's lip movements for a spoken word or phrase. Variations in speaking speed and intensity, and the fact that distinct characters can produce the same lip sequence, make lip reading challenging. In this paper we present a lip-reading model for audio-less video data with variable-length frame sequences. First, we extract the lip region from each face image in the video sequence and concatenate the regions to form a single image. Next, we design a twelve-layer Convolutional Neural Network (CNN) with two batch-normalization layers to train the model and extract visual features end-to-end. Batch normalization helps reduce internal and external variance across attributes such as the speaker's accent, lighting and image-frame quality, speaking pace, and speaking posture. We validate the performance of our model on the standard audio-less video dataset MIRACL-VC1 and compare it with an existing model that uses 16 or more CNN layers. The proposed lip-reading model attains a training accuracy of 96% and a validation accuracy of 52.9%.
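As an illustration of the preprocessing step described above, the following is a minimal Python sketch (not the authors' exact code) that crops the lip region from each frame using dlib's 68-point facial landmarks, whose mouth points are indices 48-67, and concatenates the crops horizontally into a single image. The landmark-model path, crop size, and fallback behaviour are assumptions.

    # Hypothetical preprocessing sketch: crop the lip region per frame and
    # concatenate the crops into one image. Paths and sizes are assumptions.
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_crop(frame, size=(64, 64)):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return cv2.resize(frame, size)  # fallback: keep the whole frame
        shape = predictor(gray, faces[0])
        # dlib landmark indices 48-67 cover the mouth region
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                       dtype=np.int32)
        x, y, w, h = cv2.boundingRect(pts)
        return cv2.resize(frame[y:y + h, x:x + w], size)

    def sequence_to_image(frames, size=(64, 64)):
        # Concatenate per-frame lip crops horizontally into a single image.
        return np.hstack([lip_crop(f, size) for f in frames])

A twelve-layer CNN with two batch-normalization layers, as described above, could be sketched in Keras as follows. The layer widths, kernel sizes, pooling, dropout, the assumed 64x640 input (e.g., ten 64x64 lip crops concatenated), and the ten-class softmax output are illustrative assumptions, not the authors' reported configuration.

    # Hypothetical sketch of a twelve-layer CNN with two batch-normalization
    # layers; all sizes below are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_lip_reading_cnn(input_shape=(64, 640, 3), num_classes=10):
        model = models.Sequential([
            tf.keras.Input(shape=input_shape),
            # Block 1
            layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
            layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),   # first batch-normalization layer
            layers.MaxPooling2D((2, 2)),
            # Block 2
            layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
            layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
            layers.BatchNormalization(),   # second batch-normalization layer
            layers.MaxPooling2D((2, 2)),
            # Block 3
            layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
            layers.MaxPooling2D((2, 2)),
            # Classifier
            layers.Flatten(),
            layers.Dense(256, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",  # integer labels assumed
                      metrics=["accuracy"])
        return model

    model = build_lip_reading_cnn()
    model.summary()

Counting the five convolutional, two batch-normalization, three pooling and two dense layers gives twelve layers in this sketch, though layer-counting conventions vary.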
