
Text Extraction for Educational Video

Student Name: Wong Lih Fong
Matric No: PC103285
Supervisor Name: Dr. Mohd Yazid Idris
Semester: 3

Educational Video

Objective: to educate and train viewers.
Three main categories: demonstration video, narrative video and lecture video.
The exponential growth of educational video makes it difficult to access and search for videos (Stephan and Christoph, 2006).
Solution: categorize videos by:
UGC tagging
Speech in video
Text in video


Text in Educational Video


Darin and Diane (2008), Huang (2011), Li et al. (2010) and Zhang et al. (2010) showed that text features can produce more accurate results than other features.
Two types of text:
Caption Text
Scene Text


Basic Architecture


Video Frame Extraction


Most research is based on static images.
Most researchers do not mention the method they used to extract video frames (Shivakumara et al., 2011; Sharma et al., 2012; Wei and Lin, 2012).
Frame redundancy degrades performance, especially in the video domain.


Video Frame Extraction


Reference | Method | Weakness
(Huang, 2012) | Check motion vectors on 30 consecutive frames. | Does not work if the text is not static within 30 frames.
(Bai et al., 2012) | Extract one frame out of every three frames. | Frame redundancy remains.
(Liu and Wang, 2012) | Extract three video frames per second. | Frame redundancy remains.
(Yang et al., 2011) | Extract one frame per second. Check for differences in text lines. | Applied to slide-based lecture video only.
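
The fixed-rate sampling schemes in the table can be illustrated with a short sketch. It only computes which frame indices would be kept; the function names and the 30 frames-per-second clip are assumptions made for illustration and do not come from the cited papers.

```python
from typing import List


def sample_frame_indices(total_frames: int, video_fps: float,
                         samples_per_second: float) -> List[int]:
    """Indices of the frames kept when sampling at a fixed rate."""
    if total_frames < 0 or video_fps <= 0 or samples_per_second <= 0:
        raise ValueError("frame count must be >= 0 and rates must be > 0")
    # Keep one frame every `step` frames; the step is never below one frame.
    step = max(1, round(video_fps / samples_per_second))
    return list(range(0, total_frames, step))


def every_nth_frame(total_frames: int, n: int) -> List[int]:
    """Indices kept when extracting one frame out of every n frames."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return list(range(0, total_frames, n))


# Hypothetical 3-second clip at 30 frames per second (90 frames).
print(sample_frame_indices(90, 30, 1))       # one frame per second
print(len(sample_frame_indices(90, 30, 3)))  # three frames per second
print(len(every_nth_frame(90, 3)))           # one frame out of every three
```

None of these schemes looks at the frame content, so a slide that stays on screen for a minute is still sampled many times. This is the redundancy listed in the Weakness column.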


Text Detection
Usually performed together with text localization.
Usually ignored by researchers.
Current methods are too complicated, which creates performance issues in the video domain.


Text Detection
Reference | Method | Weakness
(Jung et al., 2009) | Use SVM to classify text pixels and background pixels. | Complicated algorithm and time consuming.
(Li et al., 2011; Pan et al., 2011) | DWT and image pyramid scale information. | Complicated algorithm and time consuming.


Text Localization

Divided into three types:
Connected Component
Texture Based
Edge Based
Localization fails on text that has a similar color to the background.


Text Localization
Reference | Method | Weakness
(Yi and Tian, 2011) | CC based on gradient features and color features. | Did not work on complex images.
(Carnicer et al., 2011) | Automated hysteresis threshold by averaging edge results. | Time consuming, not suitable for the video domain.
(Liu and Wang, 2012) | Contour-based edge detector; localize using spatial domain and inner-frame information. | Misclassified some background pixels with a very similar color to the text pixels.


Text Extraction
Aims to produce a binary text image that distinguishes text pixels from background pixels.
Noise and missing strokes make the binary text image imperfect.


Text Extraction
Reference | Method | Weakness
(Kim and Kim, 2009) | Invert text color for dark text. | Performance issues. Fails when text and background colors are about the same.
(Haneda and Bouman, 2011) | Blockwise segmentation and global segmentation. | Fails on backgrounds with non-uniform colors.
(Liu and Wang, 2012) | Find the most frequent color in the text region. Filter with temporal homogeneity of color across frames. | Fails if the gaps between characters are too small or too big.


Problem Statement

Main Research Question:
How to produce an approach that extracts text from educational video and converts it into a binary text image quickly and accurately?

Sub-questions:
How to localize text that has a similar color to the background?
How to convert a video scene into a binary image that contains only text?
How to reduce frame redundancy and improve the text detection rate in order to extract text from video in real time?

Research Objectives

The main objective of the research is:
To produce an approach that extracts text from educational video and converts it into a binary text image quickly and accurately.

The improvement could be achieved through these objectives:
To propose a new method for locating text with a similar color to the background in images by applying image gridding and a multi-threshold approach.
To combine the morphological dilation approach with the stroke width transform approach to separate text pixels from the image background and then convert them into a binary text image.
To propose a new and fast algorithm for text detection in video based on its gradient graph.
To propose a new method to remove redundant frames by analyzing the content differences of I-frames in compressed video.

Scopes
Targets text with strokes bigger than 7 pixels.
Limited to video encoded with the H.264/MPEG-4 codec.
Text recognition is performed by open source OCR software.


Research Methodology

Input: Educational Video

Frame Extraction
  I-frames extraction
  Identify scene change parameters: absolute change, normal change, side change, image differences
  Filtering scheme
  Output: Filtered Frames

Text Detection
  Identify sharp changes of intensity that exceed the height and width thresholds
  Output: Images containing text

Text Localization
  Convert to grayscale images
  Convert to gradient images by 3x3 Sobel operator
  Divide images into the smallest possible regions
  Separate text regions by determining the Otsu threshold
  Convert to edge images by Canny edge detector
  Output: Text Images

Text Extraction
  Morphological dilation expansion on edge images
  Output: Text-only edge images
  Complete broken edges and incomplete strokes by Stroke Width Transform
  Output: Binary text images
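
The gradient conversion step of the localization stage can be sketched as follows. This is a minimal pure-Python illustration, assuming a grayscale image stored as a list of rows. The function name `sobel_magnitude` and the use of |gx| + |gy| as the gradient magnitude are assumptions, not details taken from the slides.

```python
from typing import List

Gray = List[List[int]]

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img: Gray) -> Gray:
    """Approximate gradient magnitude |gx| + |gy|; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            out[y][x] = abs(gx) + abs(gy)
    return out


# A vertical edge (dark left half, bright right half) gives a strong response.
edge = [[0, 0, 255, 255] for _ in range(4)]
print(sobel_magnitude(edge)[1])
```

A flat region produces zeros everywhere, so only pixels near an intensity change, such as the boundary of a character stroke, survive into the gradient image.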


Proposed Solution for Frame Extraction

In the compressed video domain, similar frames are grouped in a GOP (group of pictures).

[Figure: decoder sequence of a compressed video. Video Sequence Layer: data header (layer ID, etc.), GOP-1, GOP-2, ..., GOP-N. Group of Picture Layer: data header (video time, etc.), Picture-1, Picture-2, ..., Picture-N.]
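
As a rough illustration of working at the GOP level, the sketch below picks out the I-frames from a sequence of picture types. It assumes the picture types are already available as a string of the letters I, P and B and that every GOP starts with an I-frame. Both the representation and the function names are hypothetical; reading the types from a real H.264 stream is outside the scope of this sketch.

```python
from typing import List


def i_frame_indices(picture_types: str) -> List[int]:
    """Positions of the I-frames in a sequence of picture types."""
    return [i for i, t in enumerate(picture_types) if t == "I"]


def split_into_gops(picture_types: str) -> List[str]:
    """Split the sequence so that every group starts at an I-frame."""
    gops: List[str] = []
    for t in picture_types:
        if t == "I" or not gops:
            gops.append(t)      # an I-frame opens a new group
        else:
            gops[-1] += t       # P and B pictures join the current group
    return gops


sequence = "IBBPBBPIBBP"  # hypothetical picture types of 11 frames
print(i_frame_indices(sequence))   # [0, 7]
print(split_into_gops(sequence))   # ['IBBPBBP', 'IBBP']
```

Keeping one I-frame per GOP reduces 11 frames to 2 in this example, before any content comparison is applied.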

Proposed Solution for Frame Extraction

Compute four parameters for detection:
Absolute change (magnitude change)
Normal change (vector change)
Side change (difference of changes)
Image differences (total changes)
Accept frames with:


> , > , >
< , < , < , >

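The slides do not define the four parameters or their thresholds, so the following is only a generic sketch of filtering redundant frames by content difference. It keeps a frame when its mean absolute pixel difference from the last kept frame exceeds a threshold. The difference measure, the threshold value and all names are assumptions and are not the proposed parameters.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixel values of one frame


def mean_abs_difference(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def filter_redundant_frames(frames: Sequence[Frame],
                            min_difference: float) -> List[int]:
    """Indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    for index, frame in enumerate(frames):
        if not kept or mean_abs_difference(frames[kept[-1]], frame) > min_difference:
            kept.append(index)
    return kept


# Four tiny hypothetical frames: frames 1 and 3 are near-duplicates.
frames = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [100, 100, 100, 100],
    [100, 100, 100, 101],
]
print(filter_redundant_frames(frames, min_difference=10))  # [0, 2]
```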

Proposed Solution for Text Detection

To propose a simple algorithm (low computational cost) with promising accuracy.
Utilize the value of the change of intensity.
[Figure: grid of pixel intensity values and the changes between neighbouring values, for example the row 25 23 19 30 15 12.]

Proposed Solution for Text Detection

[Figure: plots of intensity against image width for two scan lines, Line 1 and Line 2.]
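
The idea of using the change of intensity along a scan line can be sketched as follows. The code counts how many neighbouring pixel pairs differ sharply and flags the line as a text candidate when there are enough of them. The two threshold values and the function names are assumptions chosen for illustration. The slides do not give the actual thresholds.

```python
from typing import List, Sequence


def intensity_changes(row: Sequence[int]) -> List[int]:
    """Absolute intensity change between each pair of neighbouring pixels."""
    return [abs(b - a) for a, b in zip(row, row[1:])]


def count_sharp_changes(row: Sequence[int], min_change: int) -> int:
    """Number of neighbouring pairs whose change reaches `min_change`."""
    return sum(1 for d in intensity_changes(row) if d >= min_change)


def row_has_text(row: Sequence[int], min_change: int = 10,
                 min_count: int = 2) -> bool:
    """Flag a scan line as a text candidate (thresholds are assumed)."""
    return count_sharp_changes(row, min_change) >= min_count


row = [25, 23, 19, 30, 15, 12]      # example row of intensity values
print(intensity_changes(row))       # [2, 4, 11, 15, 3]
print(row_has_text(row))            # True
print(row_has_text([25, 25, 24, 24]))  # False
```

Each pixel is visited once, which keeps the cost low compared with the classifier-based detectors in the earlier table.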


Proposed Solution for Text Localization

Edge-based solution: Canny edge detector
Threshold determination

[Figure: edge images produced with threshold values that are too low and too high.]


Proposed Solution for Text Localization

Do not stick to one threshold for each image: use a multi-threshold method.
[Figure: grid of numeric pixel values illustrating the multi-threshold method.]

Proposed Solution for Text Localization

Division into the smallest possible regions (8x8) can produce a precise position.

[Figure: normal localization output compared with the proposed localization output.]
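
Image gridding combined with a per-region threshold can be sketched as below. Otsu's method selects the threshold that maximizes the between-class variance of the values in one region. Applying it to each 8x8 block separately gives every block its own threshold. This is a minimal pure-Python sketch under the assumption of 8-bit values stored as a list of rows. The function names and the small test image are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Gray = List[List[int]]


def otsu_threshold(values: Sequence[int], levels: int = 256) -> int:
    """Threshold t maximizing the between-class variance (Otsu).

    Values <= t form one class and values > t the other.
    Returns 0 when all values are identical (no split is possible).
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0    # number of values <= t
    sum0 = 0  # sum of the values <= t
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def block_thresholds(img: Gray, block: int = 8) -> Dict[Tuple[int, int], int]:
    """One Otsu threshold per block, keyed by the block's top-left corner."""
    h, w = len(img), len(img[0])
    result: Dict[Tuple[int, int], int] = {}
    for by in range(0, h, block):
        for bx in range(0, w, block):
            values = [img[y][x]
                      for y in range(by, min(by + block, h))
                      for x in range(bx, min(bx + block, w))]
            result[(by, bx)] = otsu_threshold(values)
    return result


# Tiny example with 2x2 blocks: a dark region and a bright region.
image = [[10, 10, 100, 110],
         [12, 200, 100, 240]]
print(block_thresholds(image, block=2))
```

In the example the dark block and the bright block receive different thresholds, which a single global threshold cannot provide. A block with identical values returns 0 and would need separate handling before binarization.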


Proposed Solution for Text Extraction

A hybrid of two algorithms:
Morphological Dilation (fast, low accuracy)
SWT (slow, high accuracy) (Epshtein et al., 2010)

[Figure: filling in the strokes of the text during extraction, before and after.]


Proposed Solution for Text Extraction

[Figure: processing steps from a text-only edge image to a binary text image.]
Text-only edge image
Morphological dilation in the horizontal and vertical directions until edges from different directions meet
Apply SWT on the outer and inner pixels
Remove the original dilation
Apply morphological dilation again on the generated stroke (yellow part)
Binary text image
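
A single step of morphological dilation in the horizontal and vertical directions can be sketched as follows. It uses a cross-shaped neighbourhood on a binary image stored as a list of rows. The function name and the image representation are assumptions, and the stopping rule (repeat until edges from different directions meet) and the SWT step are not implemented here.

```python
from typing import List

Binary = List[List[int]]


def dilate_cross(img: Binary) -> Binary:
    """One dilation step: each set pixel also sets its 4 direct neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out


# Two edge pixels separated by a one-pixel gap meet after one step.
print(dilate_cross([[1, 0, 1]]))  # [[1, 1, 1]]
```

Repeating the step grows the edges further. A full implementation would stop once the opposite edges of a stroke touch.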

Proposed Solution for Text Extraction

Broken edge

[Figure: processing steps from a broken edge image to a binary text image.]
Broken edge image
Morphological dilation in the horizontal and vertical directions until edges from different directions meet
Apply SWT on the outer and inner pixels
Remove the original dilation
Apply morphological dilation again on the generated stroke (yellow part)
Binary text image

Significance of Research

There are four contributions in the research:
Enhancement of the localization algorithm: reduces missed detections of text pixels and false detections of non-text pixels. At the same time, it improves the performance of the algorithm and the precision of the text location in the images.
Enhancement of the edge detection algorithm: improves the detection rate of weak edges in the edge image, and eliminates the parameters of the edge detector by optimizing the threshold based on the input images.
Enhancement of the text extraction algorithm: improves the performance of separating text pixels from non-text pixels in the images and converting them into binary text images. At the same time, it increases the recognition accuracy of the OCR software.
Enhancement of the frame filtering and text detection algorithm: improves the rate of detecting the presence of text throughout the whole video and further reduces the false

Conclusion
The research targets the extraction of text information from educational video.
The enhanced algorithms aim to improve the performance and accuracy of text extraction in video.


Thank You


References

Bai, B., Yin, F. and Liu, C.L. (2012), A Fast Stroke-Based Method for Text Detection in Video, 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 69-73.
Epshtein, B., Ofek, E. and Wexler, Y. (2010), Detecting text in natural scenes with stroke width transform, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2963-2970.
Haneda, E. and Bouman, C.A. (2011), Text Segmentation for MRC Document Compression, IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1611-1626.
Huang, X.D. (2011), A novel video text extraction approach based on Log-Gabor filters, 4th International Congress on Image and Signal Processing (CISP), vol. 1, pp. 474-478.
Huang, X.D. (2012), Automatic Video Text Detection and Localization Based on Coarseness Texture, Fifth International Conference on Intelligent Computation Technology and Automation (ICICTA), pp. 398-401.
Li, M.H., Bai, M., Wang, C.H. and Xiao, B.H. (2010), Conditional random field for text segmentation from images with complex background, Pattern Recognition Letters, vol. 31, no. 14, pp. 2295-2308.
Liu, X.Q. and Wang, W.Q. (2012), Robustly Extracting Captions in Videos Based on Stroke-Like Edges and Spatio-Temporal Analysis, IEEE Transactions on Multimedia, vol. 14, no. 2, pp. 482-489.
Pan, Y.F., Hou, X.W. and Liu, C.L. (2011), A Hybrid Approach to Detect and Localize Texts in Natural Scene Images, IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 800-813.
Sharma, N., Shivakumara, P., Pal, U., Blumenstein, M. and Tan, C.L. (2012), A New Method for Arbitrarily-Oriented Text Detection in Video, 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 74-78.
Wei, Y.C. and Lin, C.H. (2012), A robust video text detection approach using SVM, Expert Systems with Applications, vol. 39, no. 12, pp. 10832-10840.
Yang, H.J., Siebert, M., Luhne, P., Sack, H. and Meinel, C. (2011), Automatic Lecture Video Indexing Using Video OCR Technology, IEEE International Symposium on Multimedia (ISM).