
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 9, SEPTEMBER 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Text Localization Challenges and Solutions


S. Tehsin, A. Masood, S. Kausar, Y. Javed and F. Arif

Abstract: Multimedia data has increased rapidly in recent years. Text present in multimedia carries important information about the image or video content, and this information can be exploited by image and video retrieval systems. A few attributes associated with text are commonly used for text extraction. However, state-of-the-art research in the field shows that these text extraction methods face many problems that hinder their robustness and reliability. This paper identifies such problems and presents possible solutions. It can serve as a guide for novice researchers in the field to text extraction problems and their solutions.

Index Terms: Content-based Image Retrieval, Caption Text Extraction, Scene Text, Multimedia Processing

1 INTRODUCTION
In recent years there has been a rapid increase in multimedia libraries, which raises the need for efficient retrieval, indexing and browsing of multimedia information. Several approaches have been introduced in the literature to retrieve image and video data. These techniques are based on color, texture, shape, relations between objects, etc. For text-based queries, text embedded in images and videos can be a very good option for retrieval. Visual text appearing in multimedia data often conveys information such as news headlines, movie titles, place names, product brands, match scores, and the date and time of an event. All this information is vital for understanding and retrieving images and videos.

A lot of work has been done in the field of text extraction from multimedia data, but most of it is application specific, and there is still a need for more work on designing domain-independent systems. This is because text extraction faces many challenges arising from variation in font, size, color, alignment, orientation, illumination and background. These variations make text extraction very difficult.

In the literature, text embedded in images and videos is classified into two groups: caption text and scene text. Caption text is laid over the image or video at a later stage, e.g. the score of a match or the name of a speaker. It is also known as artificial text or superimposed text. Scene text is an actual part of the scene, e.g. the name of a product during a commercial break, street signs, name plates and text appearing on t-shirts. Scene text is also referred to as graphics text. Scene text is very difficult to extract because of blurriness, noise, illumination change, and variation in background and camera parameters in scene images and videos.

The text extraction and recognition process comprises five steps: text detection, text localization, text tracking, segmentation/binarization, and character recognition. The aim of text detection and text localization is to create a bounding box around the text appearing in an image or video frame. The architecture of the text extraction process is presented in Figure 1.

[Figure 1 flow: Video/Image -> Text Detection -> Text Localization -> Text Tracking -> Text Segmentation -> Text Recognition -> Text Object]

Figure 1: Architecture of text extraction and recognition process
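To make the goal of the localization step concrete, the following minimal Python sketch computes bounding boxes around connected foreground regions of a small binary image. It is an illustration only and not a method taken from the surveyed papers; the function name `bounding_boxes`, the choice of 4-connectivity and the toy image are assumptions made for this example.

```python
from collections import deque

def bounding_boxes(binary):
    """Return (top, left, bottom, right) boxes of 4-connected foreground regions.

    `binary` is a list of rows containing 0 (background) or 1 (foreground).
    Coordinates are inclusive. Boxes are reported in raster-scan order of
    the first pixel found in each region.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill of one connected component.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy image: two separate blobs standing in for two characters.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(bounding_boxes(image))  # [(1, 1, 3, 2), (1, 5, 3, 5)]
```

In a full system, such per-component boxes would typically be filtered and merged into word or line boxes before tracking and recognition.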

The first author is with the Department of Computer Software Engineering, MCS, NUST, Pakistan. The second author is with the Department of Information Security, MCS, NUST, Pakistan. The third author is with the CEME, NUST, Pakistan. The fourth author is with the CEME, NUST, Pakistan. The fifth author is with the Department of Computer Software Engineering, MCS, NUST, Pakistan.

1.1 State-of-the-art in text location


A variety of text extraction approaches have been proposed in past years. According to the text features utilized, these methods can be mainly categorized into two types: region based and texture based [[1]]. Texture-based methods use the fact that text in images has distinctive textural properties that distinguish it from the background. These techniques mostly use Gabor filters, wavelets, FFT, spatial variance, etc., and further use machine learning techniques such as SVM, MLP and AdaBoost [[2],[3],[4],[5],[6]]. The region-based approach exploits different region properties to extract text objects. It makes use of the fact that there is sufficient difference between the text color and its immediate background. Color features, edge features, and connected component methods are often used in this approach [[7],[8],[9],[10],[11]].

A variety of work has been done on caption text extraction, but most of it is application specific and tested on non-standard datasets. Shuicai Shi et al. [[12]] proposed an approach for text detection, localization and extraction in video frames using block change rate and element image division in Lab color space. This approach extracts multilingual text from videos. Jiangbo Xu et al. [[13]] proposed text extraction in the DCT compressed domain. Candidate text blocks are detected in terms of DCT texture energy, and an adaptive temporal constraint method exploits the temporal occurrence of text in a sequence of frames. Results are verified on MPEG video sequences. The method of Shi Jianyong [[14]] merges the edge maps of the binarized and intensity images to detect text in video frames. This method saves time, especially for monochromatic images. Xin Zhang [[15]] proposed a text extraction method for images and video frames carrying multilingual text, in which color and edge features are combined to extract the text. A two-stage video-text location method was proposed by Wang Zhiming [[16]]. In the first stage, an unsupervised paradigm based on wavelets is used to obtain candidate text regions. In the second stage, text boundaries are marked by traversing line with its aptitude spectrum.

Some application-specific text detection approaches are also reported in the literature. T. S. Mahmood [[17]] proposed a method for cardiac echo videos for decision support systems. Cheolkon Jung and Joongkyu Kim [[18]] and Sunitha Abburu [[19]] reported text detection for golf and cricket matches, respectively. Li Meng [[20]] proposed an algorithm for TV commercial detection.

Region-based methods make use of low-level features, so they are fast and perform well on simple backgrounds, but they are sensitive to noise. Texture-based methods give better results on complex backgrounds but exhibit high computational complexity in texture classification, which results in longer processing time.

The rest of the paper is organized as follows. Section 2 highlights the common potential problems in the text extraction process, as well as the difficulties specific to caption and scene text. Section 3 presents the solutions for text extraction problems. Section 4 provides concluding remarks.

2 CHALLENGES IN TEXT EXTRACTION

Text extraction from images and videos is a very challenging task, because text appearing in images may have different colors, font sizes and types, complex backgrounds and orientations. Scene text and caption text appear similar in images, but their extraction problems are distinctly different. Caption text is usually of low resolution and low contrast. Scene text images, on the other hand, are high-resolution camera-captured images and possess high illumination variability. Therefore, their text extraction methodologies should focus on different issues and come up with different solutions. Some problems are common to scene and caption text extraction, while others are specific to one category. We discuss the common text extraction problems first, then the individual ones.

2.1 Common text extraction problems

2.1.1 Color

Text appearing in images can have a variety of colors. Text can appear in lighter colors on a darker background and vice versa. Color is the most vital information about the text that can be used for its extraction from the image, but inter-image and intra-image variation in text color leads to significant problems. This variation is a hurdle in devising a robust mechanism for text extraction on the basis of color information.

2.1.2 Typeface

The font of the text may vary a lot in multimedia data. Variation of font changes the stroke width and aspect ratio of characters. Stroke width and aspect ratio are part of many feature sets for differentiating text and non-text objects [[22], [23]]. Stroke width is the thickness of the stroke of a letter. Some typefaces are monoline, which means that all strokes and stems are of equal thickness. Other typefaces consist of varying strokes, which creates visual contrast. Here, the term contrast indicates the variance of stroke width within a character. In most Gothic typefaces, no contrast can be found, while other Gothics have a light contrast. Roman typefaces, on the other hand, often have distinctively contrasting stroke widths. A width-to-height ratio of 3:5 (0.6) is recommended for most applications, but Soar [[21]] concluded that a ratio of 7.5:10 (0.75) may result in even better visualization. Many typeface families have several fonts which vary in width only; Univers is one example. Different fonts appearing in images therefore have different aspect ratios and stroke widths. This diversity of font parameters leads to a lot of difficulty in text extraction.

2.1.3 Size

The size of text varies a lot across text images. Many design decisions depend upon an assumption about text size. Small variation in font size is catered for in most systems, but most text extraction systems fail when the range of sizes becomes large. In many text extraction methods, morphological operators are used for segmentation, edge detection and combination of characters to form words. The sizes of the structuring elements of morphological operators are usually adjusted according to the target text size. This sort of hard coding limits the applicability of a text extraction system to images with text of varying size.
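The dependence on a hard-coded structuring element can be illustrated with a one-dimensional sketch. The Python code below dilates a single binary scanline with a horizontal structuring element and counts the resulting runs. A width tuned to small text merges its characters into one word, while the same width leaves larger text, whose inter-character gaps are wider, split into separate characters. The helper names, the scanlines and the widths are hypothetical values chosen for illustration, not parameters from any cited method.

```python
def dilate_row(row, width):
    """Dilate a binary scanline with a centered horizontal element.

    `width` should be odd. An output pixel is 1 if any input pixel within
    width // 2 positions of it is 1.
    """
    half = width // 2
    n = len(row)
    return [
        1 if any(row[j] for j in range(max(0, i - half), min(n, i + half + 1))) else 0
        for i in range(n)
    ]

def count_runs(row):
    """Count maximal runs of consecutive 1s in a binary scanline."""
    runs, previous = 0, 0
    for value in row:
        if value == 1 and previous == 0:
            runs += 1
        previous = value
    return runs

# Three characters of small text: strokes 2 px wide, gaps 2 px wide.
small_text = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
# The same word at three times the size: strokes and gaps 6 px wide.
large_text = [0] + ([1] * 6 + [0] * 6) * 2 + [1] * 6 + [0]

fixed_width = 3  # structuring element tuned to the small text
print(count_runs(dilate_row(small_text, fixed_width)))  # 1: characters merge into a word
print(count_runs(dilate_row(large_text, fixed_width)))  # 3: characters stay separate

scaled_width = 7  # element scaled with the text size
print(count_runs(dilate_row(large_text, scaled_width)))  # 1: merged again
```

The sketch suggests why a system with one fixed element size handles only a narrow range of text sizes: the element has to grow with the gaps it is expected to bridge.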

2.1.4 Complex Background

This is the most difficult problem in the text extraction process. Due to a complex background, segmentation of text objects becomes difficult, which results in false classification of objects. It is the main cause of low detection rates in the text extraction process.

2.2 Scene text extraction problems

Some problems are specifically related to the scene text extraction process.

2.2.1 Illumination Change

Illumination change is the biggest challenge in the scene text extraction process, as most thresholding and segmentation techniques fail or give poor results. The ICDAR 2005 Text Locating Competition results also concluded that variations in illumination, such as reflections from light sources, cause significant problems. In real images, uneven lighting, shadowing, reflections onto objects and inter-reflections between objects may decrease analysis performance. Examples of illumination change are presented in Figure 2.

Figure 2. Examples of scene text with illumination change

2.2.2 Multi oriented text

Due to changes in camera position, text in scene images may have varying orientation, as shown in Figure 3. A change in orientation can affect aspect ratio measures. It can also change the edge orientations. For example, if text is horizontally aligned in the given image, vertical edges alone are sufficient for further processing, but if its orientation changes, edges in multiple orientations must be processed.

Figure 3. Examples of multi oriented scene text

2.3 Caption text extraction problems

Caption text is graphically or artificially overlaid on the image. It has distinct features and problems. The following problems are dominant in caption text extraction.

2.3.1 Low contrast

Caption text images are usually low-contrast images, made for efficient online transmission, and suffer from compression artifacts. The low contrast of caption text can play a significant role in misclassification.

2.3.2 Animations

Animation is a very common feature of caption text in videos. Few techniques in the literature deal with tracking animated text in videos [[25], [26]]. Current text tracking methods are mainly applied to text with rigid, simple motion, such as stationary captions or regularly scrolling credits, and are not appropriate for video content with complex motion such as complex scrolling, rotation and scale change. Animated text in videos can move in arbitrary directions, and prediction of its motion is difficult. A complex background makes tracking even more difficult.

3 PROPOSED SOLUTIONS

Proposed solutions to the above-mentioned problems are summarized in Table 1. These solutions can be very helpful for novice researchers in the field.


Table 1. Solutions of text extraction problems

Category | Problem | Solution
Scene and Caption Text | Color | Contrast of the text with its background can be exploited for the extraction process. This feature can be used in most region-based techniques.
Scene and Caption Text | Typeface | Basic text features like edge density can be used for text extraction. Different fonts can be explicitly trained for classification.
Scene and Caption Text | Size | Multi-resolution techniques [[24]] can be used to detect a large variety of text sizes, but they increase the computational time.
Scene and Caption Text | Complex Background | Ensemble segmentation techniques can be used for better and more robust segmentation.
Scene Text | Illumination change | Illumination change can be catered for by dividing the image into smaller windows, applying thresholding and segmentation locally to each window, and finally combining all the results.
Scene Text | Multi oriented text | Rotation-invariant features, such as SIFT descriptors, can be used for text extraction. Another solution is to process the same image multiple times with different rotations applied.
Caption Text | Low Resolution | A contrast enhancement technique can be applied to the image prior to the text extraction process.
Caption Text | Animations | Motion prediction and object tracking techniques can be used to track animated text in videos.
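The windowed thresholding idea listed in Table 1 for illumination change can be sketched in a few lines of Python. In the sketch below, a single global threshold (the image mean) loses dark text in the brightly lit half of a toy image, while thresholding each window against its own mean recovers it. The function names, the use of the mean as threshold, the window size and the toy image are all assumptions made for illustration; a practical system would likely use a more robust local threshold.

```python
def threshold_global(image):
    """Mark pixels darker than the global mean as text (1)."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if v < mean else 0 for v in row] for row in image]

def threshold_local(image, win_h, win_w):
    """Threshold each win_h x win_w window against its own mean.

    Windows tile the image without overlap; windows at the right and
    bottom borders may be smaller. All results go into one combined mask.
    """
    rows, cols = len(image), len(image[0])
    mask = [[0] * cols for _ in range(rows)]
    for top in range(0, rows, win_h):
        for left in range(0, cols, win_w):
            ys = range(top, min(top + win_h, rows))
            xs = range(left, min(left + win_w, cols))
            values = [image[y][x] for y in ys for x in xs]
            mean = sum(values) / len(values)
            for y in ys:
                for x in xs:
                    mask[y][x] = 1 if image[y][x] < mean else 0
    return mask

# Toy image: left half in shadow (background 40, text 10),
# right half brightly lit (background 200, text 120).
row = [40, 10, 40, 40, 200, 120, 200, 200]
image = [row[:], row[:]]

print(threshold_global(image)[0])       # [1, 1, 1, 1, 0, 0, 0, 0]
print(threshold_local(image, 2, 4)[0])  # [0, 1, 0, 0, 0, 1, 0, 0]
```

The global mask labels the whole shadowed half as text and misses the text in the bright half, whereas the local mask marks only the two text columns. A window that contains no text but is not perfectly uniform would still produce some foreground pixels, so an additional per-window contrast check is usually needed.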

4 CONCLUSION

Recent studies in the field of computer vision and pattern recognition show a great amount of interest in content retrieval from images and videos. The semantic information provided by an image can be useful for content-based image retrieval, as well as for indexing and classification purposes. Since text can be embedded in an image or video in different font styles, sizes, orientations and colors, and against a complex background, extracting the candidate text region becomes a challenging problem. The main challenge is to design a system versatile enough to handle all the variability encountered in daily life. This includes variable targets with complex layout, varying illumination, several character fonts and sizes, and variability in imaging conditions with complex animations. Variations in font style, size, orientation, alignment and background complexity make text segmentation a challenging task in text extraction. This paper is an effort to provide single-point access to potential problems in the text extraction process, along with their potential solutions.

REFERENCES

[1] Lienhart, R., 2003. Video OCR: A Survey and Practitioner's Guide. Intel Corporation, Microprocessor Research Labs, Santa Clara, California, 155-184.
[2] Chuang, L., Ding, X., Wu, Y., 2006. An algorithm for text location in images based on histogram features and AdaBoost. J. Image Graph. 11(3), 325-331.
[3] Kim, K.I., Jung, K., Kim, J.H., 2003. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. Pattern Anal. Mach. Learn. 26, 1631-1639.
[4] R. Lienhart and A. Wernicke, 2002. Localizing and Segmenting Text in Images and Videos. IEEE Transactions on Circuits and Systems for Video Technology 12(4), 256-268.
[5] J. Gllavata, E. Qeli and B. Freisleben, 2006. Detecting Text in Videos Using Fuzzy Clustering Ensembles. Proceedings of the Eighth IEEE International Symposium on Multimedia, 283-290.
[6] D. Chen, J.M. Odobez and H. Bourlard, 2004. Text detection and recognition in images and video frames. Pattern Recognition, 595-608.
[7] K.I. Kim, K. Jung and J.H. Kim, 2003. Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence 25(12), 1631-1638.
[8] M.R. Lyu, J. Song, M. Cai, 2005. A Comprehensive method for multilingual video text detection, localization, and extraction. IEEE Transactions on Circuits and Systems for Video Technology 15, 243-255.
[9] Hua, X.-S., Chen, X.-R., Wenyin, L., Zhang, H.-J., 2001. Automatic Location of Text in Video Frames. Proceedings of the 2001 ACM Workshops on Multimedia: Multimedia Information Retrieval, 24-27.
[10] Liu, C., Wang, C., Dai, R., 2005. Text detection in images based on unsupervised classification of edge-based features. Proceedings of the 8th International Conference on Document Analysis and Recognition. 2, 607-612.
[11] Sun, H., Zhao, N., Xu, X., 2006. Extraction of text under complex background using wavelet transform and support vector machine. Proceedings of 2006 International Conference on Mechatronics and Automation. 2, 1493-1497.
[12] Shuicai Shi, Tao Cheng, Shibin Xiao, Xueqiang Lv, 2009. A Smart Approach for Text Detection, Localization and Extraction in Video Frames. International Conference on Information Technology and Computer Science, 158-161.
[13] Jiangbo Xu, Xiuhua Jiang, Yuxia Wang, 2009. Caption Text Extraction Using DCT Feature in MPEG Compressed Video. World Congress on Computer Science and Information Engineering.
[14] Shi Jianyong, Luo Xiling, Zhang Jun. An Edge-based Approach for Video Text Extraction. 2009 International Conference on Computer Technology and Development, 431-434.
[15] Xin Zhang, Fuchun Sun, Lei Gu, 2010. A Combined Algorithm for Video Text Extraction. Seventh International Conference on Fuzzy Systems and Knowledge Discovery, 2294-2298.
[16] Wang Zhiming, Xiao Yu, 2010. An approach for video-text extraction based on text traversing line and stroke connectivity. International Conference on Biomedical Engineering and Computer Science, 1-3.
[17] Tanveer Syeda-Mahmood, David Beymer, Arnon Amir, 2009. Disease-Specific Extraction of Text from Cardiac Echo Videos for Decision Support. 10th International Conference on Document Analysis and Recognition, 1290-1294.
[18] Cheolkon Jung and Joongkyu Kim, 2009. Player Information Extraction for Semantic Annotation in Golf Videos. IEEE Transactions on Broadcasting 55(1), 79-83.
[19] Sunitha Abburu, 2010. Multi level semantic extraction for cricket video by text processing. International Journal of Engineering Science and Technology 2(10), 5377-5384.
[20] Li Meng, Yong Cai, Min Wang, and Yuanxing Li, 2009. TV Commercial Detection Based on Shot Change and Text Extraction. 2nd International Congress on Image and Signal Processing, 1-5.
[21] Soar, R.S., 1955. Height-width proportion and stroke width in numeral visibility. Journal of Applied Psychology 39, 43-46.
[22] Epshtein, B., Ofek, E., Wexler, Y., 2010. Detecting Text in Natural Scenes with Stroke Width Transform. In: CVPR.
[23] Michael R. Lyu, Jiqiang Song, and Min Cai, 2005. A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction. IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 2, February 2005.
[24] Rodrigo Minetto, Nicolas Thome, Matthieu Cord, Jonathan Fabrizio, Beatriz Marcotegui, 2010. SnooperText: A Multiresolution System for Text Detection in Complex Visual Scenes. International Conference on Image Processing (ICIP), Sep 2010.
[25] H. P. Li, D. Doermann and O. Kia, 2000. Automatic Text Detection and tracking in digital Video. IEEE Transactions on IP, vol. 9, no. 1, pp. 147-156.
[26] Weihua Huang, Palaiahnakote Shivakumara and Chew Lim Tan, 2008. Detecting Moving Text in Video Using Temporal Information. IEEE.

2012 Journal of Computing Press, NY, USA, ISSN 2151-9617
