Video Watermarking Technology For Semantic Search

Text Extraction for Educational Video
Student Name: Wong Lih Fong
Matric No: PC103285
Supervisor Name: Dr. Mohd Yazid Idris
Semester: 3
Educational Video
Objective: to educate and train viewers.
Three main categories: demonstration video, narrative video and lecture video.
The exponential growth of educational video makes videos hard to access and to search (Stephan and Christoph, 2006).
Solution: categorize videos by:
UGC tagging
Speech in video
Text in video
Text in Educational Video

Darin and Diane (2008), Huang (2011), Li et al. (2010) and Zhang et al. (2010) showed that text features can produce more accurate results than the other features.
Two types of text:
Caption Text
Scene Text
Basic Architecture
Video Frame Extraction

Most research is based on static images.
Most researchers do not mention the method they used to extract video frames (Shivakumara et al., 2011; Sharma et al., 2012; Wei and Lin, 2012).
Frame redundancy affects performance, especially in the video domain.
Video Frame Extraction
Reference | Method | Weakness
(Huang, 2012) | Check motion vectors on 30 consecutive frames. | Does not work if the text is not static within 30 frames.
(Bai et al., 2012) | Extract one frame out of every three frames. | Retains redundant frames.
(Liu and Wang, 2012) | Extract three video frames per second. | Retains redundant frames.
(Yang et al., 2011) | Extract one frame per second. Check for differences in text lines. | Applied to slide-based lecture video only.
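To make the redundancy of the fixed-rate schemes in the table concrete, the short calculation below counts how many frames each scheme keeps. The clip length and frame rate are hypothetical values chosen for illustration; they do not come from the cited papers.

```python
def sampled_frame_count(duration_s, fps, *, every_nth=None, per_second=None):
    """Number of frames kept by a fixed-rate sampling scheme."""
    total = int(duration_s * fps)
    if every_nth is not None:
        # Keep frames 0, n, 2n, ... (ceiling division)
        return (total + every_nth - 1) // every_nth
    return int(duration_s * per_second)


# Hypothetical one-minute clip at 30 frames per second
duration_s, fps = 60, 30
counts = {
    "one frame out of every three": sampled_frame_count(duration_s, fps, every_nth=3),
    "three frames per second": sampled_frame_count(duration_s, fps, per_second=3),
    "one frame per second": sampled_frame_count(duration_s, fps, per_second=1),
}
for name, kept in counts.items():
    print(f"{name}: {kept} of {duration_s * fps} frames")
```

If a slide stays on screen for the whole minute, every kept frame after the first carries the same text, so even the sparsest scheme still processes dozens of redundant frames.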
Text Detection
Usually combined with text localization.
Usually ignored by researchers.
Current methods are too complicated, which causes performance problems in the video domain.
Text Detection
Reference | Method | Weakness
(Jung et al., 2009) | Use an SVM to classify text pixels and background pixels. | Complicated algorithm and time consuming.
(Li et al., 2011; Pan et al., 2011) | DWT and image pyramid scale information. | Complicated algorithm and time consuming.
Text Localization
Divided into three types:

Connected Component
Texture Based
Edge Based
Localization fails on text that has a similar color to the background.
Text Localization
Reference | Method | Weakness
(Yi and Tian, 2011) | CC based on gradient features and color features. | Does not work on complex images.
(Carnicer et al., 2011) | Automated hysteresis threshold by averaging edge results. | Time consuming, not suitable for the video domain.
(Liu and Wang, 2012) | Contour-based edge detector; localizes using spatial domain and inner-frame information. | Misclassifies some background pixels whose color is very similar to that of the text pixels.
Text Extraction
Aims to produce a binary text image that distinguishes text pixels from background pixels.
Noise and missing strokes make the binary text image imperfect.
Text Extraction
Reference | Method | Weakness
(Kim and Kim, 2009) | Invert the text color for dark text. | Performance issues. Fails when the text and background colors are about the same.
(Haneda and Bouman, 2011) | Blockwise segmentation and global segmentation. | Fails on backgrounds with non-uniform colors.
(Liu and Wang, 2012) | Find the most frequent color in the text region. Filter with the temporal homogeneity of color across frames. | Fails if the gaps between characters are too small or too big.
Problem Statement
Main Research Question:
How can text be extracted from educational video and converted into a binary text image quickly and accurately?
Sub-questions:
How to localize text that has a similar color to the background?
How to convert a video scene into a binary image that contains only text?
How to reduce frame redundancy and improve the text detection rate in order to extract text from video in real time?
Research Objectives
The main objective of the research is:
To produce an approach that extracts the text from educational video and converts it into a binary text image quickly and accurately.
The improvement could be achieved through these objectives:
To propose a new method for locating text whose color is similar to the background by applying image gridding and a multi-threshold approach.
To combine morphological dilation with the stroke width transform to separate the text pixels from the image background and then convert them into a binary text image.
To propose a new and fast algorithm for text detection in video based on its gradient graph.
To propose a new method to remove redundant frames by analyzing the content differences between I-frames in compressed video.
Scopes
Targets text with strokes larger than 7 pixels.
Limited to video encoded with the H.264/MPEG-4 codec.
Text recognition is performed by open-source OCR software.
Research Methodology
Frame Extraction
Input: educational video
I-frame extraction
Identify scene change parameters: absolute change, normal change, side change, image differences
Filtering scheme
Output: filtered frames
Text Detection
Identify sharp changes of intensity that exceed the height and width thresholds
Output: images containing text
Text Localization
Convert to grayscale images
Convert to gradient images with a 3x3 Sobel operator
Divide images into the smallest possible regions
Separate text regions by determining the Otsu threshold
Convert to edge images with the Canny edge detector
Output: text images
Text Extraction
Morphological dilation on the edge images
Output: text-only edge images
Complete broken edges and incomplete strokes with the Stroke Width Transform
Output: binary text images
Proposed Solution for Frame Extraction
In the compressed video domain, similar frames are grouped in a GOP (group of pictures).
Video sequence layer: decoder sequence data header (layer ID etc.), followed by GOP-1, GOP-2, ..., GOP-N
Group of picture layer: data header (video time etc.), followed by Picture-1, Picture-2, ..., Picture-N
Proposed Solution for Frame Extraction
Compute four parameters for detection:
Absolute change (magnitude change)
Normal change (vector change)
Side change (difference of changes)
Image differences (total changes)
Accept frames with:

> , > , >
< , < , < , >
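The slides do not define the four parameters or their thresholds, so the sketch below illustrates only the general idea of filtering redundant frames with one measure, the image difference. The mean absolute grey-level difference, the `min_difference` value and the synthetic frames are assumptions made for illustration, not the proposed filtering scheme.

```python
import numpy as np


def mean_abs_difference(frame_a, frame_b):
    """Mean absolute grey-level difference between two equally sized frames."""
    a = frame_a.astype(np.int32)
    b = frame_b.astype(np.int32)
    return float(np.abs(a - b).mean())


def filter_redundant(frames, min_difference=10.0):
    """Return indices of frames that differ enough from the last kept frame.

    min_difference is a hypothetical threshold, not a value from the slides.
    """
    kept = []
    for index, frame in enumerate(frames):
        if not kept or mean_abs_difference(frames[kept[-1]], frame) > min_difference:
            kept.append(index)
    return kept


# Synthetic stand-ins for decoded I-frames: two different "slides", repeated
rng = np.random.default_rng(0)
slide_a = rng.integers(0, 256, size=(48, 64), dtype=np.uint8)
slide_b = rng.integers(0, 256, size=(48, 64), dtype=np.uint8)
frames = [slide_a, slide_a.copy(), slide_b, slide_b.copy(), slide_a]

kept_indices = filter_redundant(frames)
print(kept_indices)
```

Comparing each candidate with the last kept frame, rather than with its immediate predecessor, stops slow gradual changes from slipping through one small step at a time.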
Proposed Solution for Text Detection
To propose a simple algorithm (low computational cost) with promising accuracy.
Use the magnitude of the change in intensity between pixels.
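A minimal sketch of detection by change of intensity, assuming that text appears as a band of consecutive rows that each contain many sharp horizontal intensity changes. The function name `has_text` and the three thresholds are hypothetical; the slides give no concrete values.

```python
import numpy as np


def has_text(gray, sharp=40, min_changes=6, min_rows=5):
    """Guess whether a grayscale frame contains a text line.

    sharp:       minimum intensity jump between horizontal neighbours
    min_changes: sharp jumps needed for a row to count as "busy" (width test)
    min_rows:    consecutive busy rows needed to accept a band (height test)
    All three values are illustrative assumptions.
    """
    diffs = np.abs(np.diff(gray.astype(np.int32), axis=1))
    sharp_per_row = (diffs > sharp).sum(axis=1)
    busy = sharp_per_row >= min_changes

    run = best = 0
    for row_is_busy in busy:
        run = run + 1 if row_is_busy else 0
        best = max(best, run)
    return best >= min_rows


# Synthetic frames: a flat background, and the same background with
# thin dark vertical strokes in rows 10 to 19
blank = np.full((40, 80), 200, dtype=np.uint8)
texty = blank.copy()
texty[10:20, 10:70:4] = 20

print(has_text(blank), has_text(texty))
```

The test needs only one pass of subtractions and comparisons per frame, which fits the stated goal of low computational cost.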
Proposed Solution for Text Detection
[Figure: sample grid of pixel intensity values, and plots of intensity against image width for Line 1 and Line 2.]

Proposed Solution for Text Localization
Edge-based solution: Canny edge detector
Threshold determination:
Threshold values too low
Threshold values too high

Proposed Solution for Text Localization
Do not stick to one threshold for each image: use a multi-threshold method.
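The sketch below shows why one threshold per image can fail and how one threshold per region helps. It assumes Otsu's method is used for each region, as the methodology lists an Otsu threshold, and it uses a synthetic image with two 8x8 blocks; the helper names are hypothetical.

```python
import numpy as np


def otsu_threshold(values):
    """Otsu threshold of 8-bit values: the level t that maximizes the
    between-class variance of the classes (<= t) and (> t)."""
    hist = np.bincount(values.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    weight_bg = np.cumsum(hist)
    weight_fg = hist.sum() - weight_bg
    cum_sum = np.cumsum(hist * levels)

    valid = (weight_bg > 0) & (weight_fg > 0)
    mean_bg = np.zeros(256)
    mean_fg = np.zeros(256)
    mean_bg[valid] = cum_sum[valid] / weight_bg[valid]
    mean_fg[valid] = (cum_sum[-1] - cum_sum[valid]) / weight_fg[valid]

    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    return int(np.argmax(between))


def block_thresholds(gray, block=8):
    """One Otsu threshold per block-by-block region."""
    rows, cols = gray.shape[0] // block, gray.shape[1] // block
    thresholds = np.zeros((rows, cols), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            tile = gray[r * block:(r + 1) * block, c * block:(c + 1) * block]
            thresholds[r, c] = otsu_threshold(tile)
    return thresholds


# Left block: dark background (40) with slightly brighter text (60).
# Right block: bright background (200) with brighter text (230).
gray = np.empty((8, 16), dtype=np.uint8)
gray[:, :8] = 40
gray[:, 8:] = 200
gray[2:6, 2:6] = 60
gray[2:6, 10:14] = 230

global_t = otsu_threshold(gray)
local_t = block_thresholds(gray)
print(global_t, local_t.tolist())
```

With the single global threshold the whole left block falls below the threshold and the whole right block falls above it, so neither text patch is separated from its background. The per-block thresholds separate both.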

Proposed Solution for Text Localization
Dividing the image into the smallest possible regions (8x8) can produce precise positions.
[Figure: normal localization output compared with the proposed localization output.]
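As a rough illustration of gridding, the sketch below computes a 3x3 Sobel gradient magnitude and flags the 8x8 regions whose mean gradient is high. The flagging rule and the `min_mean_gradient` value are assumptions for illustration; the slides do not state how a region is accepted.

```python
import numpy as np


def sobel_magnitude(gray):
    """Gradient magnitude from the 3x3 Sobel operator (edge-replicated border)."""
    g = np.pad(gray.astype(np.float64), 1, mode="edge")
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return np.hypot(gx, gy)


def text_blocks(gray, block=8, min_mean_gradient=20.0):
    """Boolean grid marking block-by-block regions with a high mean gradient.

    min_mean_gradient is a hypothetical acceptance threshold.
    """
    grad = sobel_magnitude(gray)
    rows, cols = gray.shape[0] // block, gray.shape[1] // block
    tiles = grad[:rows * block, :cols * block].reshape(rows, block, cols, block)
    return tiles.mean(axis=(1, 3)) > min_mean_gradient


# Flat 32x32 image with two thin bright strokes inside grid cell (1, 2)
gray = np.full((32, 32), 100, dtype=np.uint8)
gray[10:14, 18:22:2] = 200

flags = text_blocks(gray)
print(np.argwhere(flags).tolist())
```

Each flagged cell maps directly to pixel coordinates (row index times 8, column index times 8), so the position of the text is known to within one 8x8 region.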

Proposed Solution for Text Extraction
Hybrid of two algorithms:
Morphological dilation (fast, low accuracy)
SWT (slow, high accuracy) (Epshtein et al., 2010)
Extraction of text: fill in the strokes of the text.
[Figure: text before and after extraction.]

Proposed Solution for Text Extraction
Input: text-only edge image
Apply morphological dilation horizontally and vertically until edges from different directions meet
Apply SWT on the outer and inner pixels
Apply morphological dilation again on the generated stroke (yellow part)
Remove the original dilation
Output: binary text image
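A minimal sketch of the dilation step only, assuming a 3x3 cross-shaped structuring element so that edges grow horizontally and vertically. It shows the two edges of one stroke meeting after two dilation steps. The SWT stage and the removal of the original dilation are not implemented here.

```python
import numpy as np


def dilate_cross(mask):
    """One step of binary dilation with a 3x3 cross
    (each pixel spreads to its horizontal and vertical neighbours)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    return (padded[1:-1, 1:-1]
            | padded[:-2, 1:-1] | padded[2:, 1:-1]
            | padded[1:-1, :-2] | padded[1:-1, 2:])


def dilate(mask, steps):
    """Apply dilate_cross the given number of times."""
    for _ in range(steps):
        mask = dilate_cross(mask)
    return mask


# Edge image of one vertical stroke: left edge at column 4, right edge at
# column 9, with a 4-pixel gap (columns 5 to 8) between them
edges = np.zeros((7, 14), dtype=bool)
edges[1:6, 4] = True
edges[1:6, 9] = True

after_one = dilate(edges, 1)
after_two = dilate(edges, 2)
print(after_one[3, 4:10].all(), after_two[3, 4:10].all())
```

The dilation also thickens the stroke outwards, which is consistent with the later step in the slides that removes the original dilation once the stroke has been generated.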

Proposed Solution for Text Extraction: broken edge
Input: broken edge image
Apply morphological dilation horizontally and vertically until edges from different directions meet
Apply SWT on the outer and inner pixels
Apply morphological dilation again on the generated stroke (yellow part)
Remove the original dilation
Output: binary text image
Significance of Research
There are four contributions in the research:
Enhancement of the localization algorithm: reduces missed detection of text pixels and false detection of non-text pixels, while improving the performance of the algorithm and the precision of the text location in the images.
Enhancement of the edge detection algorithm: improves the detection rate for weak edges in the edge image, and eliminates the parameters of the edge detector by optimizing the threshold based on the input images.
Enhancement of the text extraction algorithm: improves the separation of text pixels from non-text pixels and their conversion into binary text images, while increasing the recognition rate of the OCR software.
Enhancement of the frame filtering and text detection algorithm: improves the rate of detecting the presence of text throughout the whole video and further reduces the false
Conclusion
The research targets the extraction of text information from educational video.
The enhanced algorithms aim to improve the performance and accuracy of text extraction in video.
Thank You
References
Bai, B., Yin, F. and Liu, C.L. (2012), A Fast Stroke-Based Method for Text Detection in Video, 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 69-73.
Epshtein, B., Ofek, E. and Wexler, Y. (2010), Detecting text in natural scenes with stroke width transform, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2963-2970.
Haneda, E. and Bouman, C.A. (2011), Text Segmentation for MRC Document Compression, IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1611-1626.
Huang, X.D. (2011), A novel video text extraction approach based on Log-Gabor filters, 4th International Congress on Image and Signal Processing (CISP), vol. 1, pp. 474-478.
Huang, X.D. (2012), Automatic Video Text Detection and Localization Based on Coarseness Texture, Fifth International Conference on Intelligent Computation Technology and Automation (ICICTA), pp. 398-401.
Li, M.H., Bai, M., Wang, C.H. and Xiao, B.H. (2010), Conditional random field for text segmentation from images with complex background, Pattern Recognition Letters, vol. 31, issue 14, pp. 2295-2308.
Liu, X.Q. and Wang, W.Q. (2012), Robustly Extracting Captions in Videos Based on Stroke-Like Edges and Spatio-Temporal Analysis, IEEE Transactions on Multimedia, vol. 14, no. 2, pp. 482-489.
Pan, Y.F., Hou, X.W. and Liu, C.L. (2011), A Hybrid Approach to Detect and Localize Texts in Natural Scene Images, IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 800-813.
Sharma, N., Shivakumara, P., Pal, U., Blumenstein, M. and Tan, C.L. (2012), A New Method for Arbitrarily-Oriented Text Detection in Video, 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 74-78.
Wei, Y.C. and Lin, C.H. (2012), A robust video text detection approach using SVM, Expert Systems with Applications, vol. 39, issue 12, pp. 10832-10840.
Yang, H.J., Siebert, M., Luhne, P., Sack, H. and Meinel, C. (2011), Automatic Lecture Video Indexing Using Video OCR Technology, IEEE International Symposium on Multimedia (ISM),
