
Scene Text Detection and Recognition: The Deep Learning Era

Shangbang Long, Xin He, Cong Yao

(S. Long is with Carnegie Mellon University and worked as an intern at MEGVII (Face++) Inc., Beijing, China; e-mail: shang-[email protected]. X. He ([email protected]) and C. Yao ([email protected]) are with MEGVII (Face++) Inc., Beijing, China. Corresponding author: Cong Yao. Manuscript received Sep. 2nd, 2019.)

Abstract—With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology and performance. This survey is aimed at summarizing and analyzing the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect this review paper to serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.

Index Terms—Scene Text, Detection, Recognition, Deep Learning, Survey

1 INTRODUCTION

Undoubtedly, text is among the most brilliant and influential creations of humankind. As the written form of human languages, text makes it feasible to reliably and effectively spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.

On the one hand, text, as a vital tool for communication and collaboration, has been playing a more important role than ever in modern society; on the other hand, the rich, precise high-level semantics embodied in text can be beneficial for understanding the world around us. For example, text information can be used in a wide range of real-world applications, such as image search [127], [144], instant translation [25], [113], robot navigation [23], [90], [91], [128], and industrial automation [18], [42], [51]. Therefore, automatic text reading from natural environments (a schematic diagram is depicted in Fig. 1), a.k.a. scene text detection and recognition [184] or PhotoOCR [10], has become an increasingly popular and important research topic in computer vision.

Fig. 1: Schematic diagram of scene text detection and recognition. The image sample is from Total-Text [17].

However, despite years of research, a series of grand challenges may still be encountered when detecting and recognizing text in the wild. The difficulties mainly stem from three aspects [184]:

• Diversity and Variability of Text in Natural Scenes: Distinct from scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.

• Complexity and Interference of Backgrounds: Backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may potentially lead to confusion and mistakes.

• Imperfect Imaging Conditions: In uncontrolled circumstances, the quality of text images and videos cannot be guaranteed. That is, in poor imaging conditions, text instances may have low resolution and severe distortion due to inappropriate shooting distance or angle, be blurred because the camera is out of focus or shaking, be noised on account of low light levels, or be corrupted by highlights or shadows.

These difficulties ran through the years before deep learning showed its potential in computer vision as well as in other fields. As deep learning came to prominence after AlexNet [72] won the ILSVRC2012 [126] contest, researchers turned to deep neural networks for automatic feature learning and started with more in-depth studies. The community is now working on ever more challenging targets. The progress made in recent years can be summarized as follows:

• Incorporation of Deep Learning: Nearly all recent methods are built upon deep learning models. Most importantly,
deep learning frees researchers from the exhausting work of repeatedly designing and testing hand-crafted features, which gives rise to a blossom of works that push the envelope further. To be specific, the use of deep learning substantially simplifies the overall pipeline. Besides, these algorithms provide significant improvements over previous ones on standard benchmarks. Gradient-based training routines also facilitate end-to-end trainable methods.

• Target-Oriented Algorithms and Datasets: Researchers are now turning to more specific aspects and targets. Against difficulties in real-world scenarios, newly published datasets are collected with unique and representative characteristics. For example, there are datasets that feature long text, blurred text, and curved text respectively. Driven by these datasets, almost all algorithms published in recent years are designed to tackle specific challenges. For instance, some are proposed to detect oriented text, while others aim at blurred and unfocused scene images. These ideas are also combined to make more general-purpose methods.

• Advances in Auxiliary Technologies: Apart from new datasets and models devoted to the main task, auxiliary technologies that do not solve the task directly also find their places in this field, such as synthetic data and bootstrapping.

In this survey, we present an overview of recent development in deep-learning based text detection and recognition from still scene images. We review methods from different perspectives, and list the up-to-date datasets. We also analyze the status quo and future research trends.

There have already been several excellent review papers [146], [166], [172], [184], which also organize and analyze works related to text detection and recognition. However, these papers were published before deep learning came to prominence in this field. Therefore, they mainly focus on more traditional and feature-based methods. We refer readers to these papers as well for a more comprehensive view and knowledge of the history. This article will mainly concentrate on text information extraction from still images, rather than videos. For scene text detection and recognition in videos, please also refer to [65], [172].

The remaining parts of this paper are arranged as follows: In Section 2, we briefly review the methods before the deep learning era. In Section 3, we list and summarize algorithms based on deep learning in a hierarchical order. In Section 4, we take a look at the datasets and evaluation protocols. Finally, we present potential applications and our own opinions on the current status and future trends.

2 METHODS BEFORE THE DEEP LEARNING ERA

2.1 Overview

In this section, we take a brief retrospective glance at algorithms before the deep learning era. More detailed and comprehensive coverage of these works can be found in [146], [166], [172], [184]. For text detection and recognition, the attention has been on the design of features.

In this period of time, most text detection methods adopt either Connected Components Analysis (CCA) [26], [57], [63], [110], [145], [168], [171] or Sliding Window (SW) based classification [19], [74], [152], [154]. CCA based methods first extract candidate components through a variety of ways (e.g., color clustering or extreme region extraction), and then filter out non-text components using manually designed rules or classifiers automatically trained on hand-crafted features (see Fig. 2). In sliding window classification methods, windows of varying sizes slide over the input image, where each window is classified as text segments/regions or not. Those classified as positive are further grouped into text regions with morphological operations [74], Conditional Random Field (CRF) [152] and other alternative graph based methods [19], [154].

Fig. 2: Illustration of traditional methods with hand-crafted features: (1) Maximally Stable Extremal Regions (MSER) [110], assuming chromatic consistency within each character; (2) Stroke Width Transform (SWT) [26], assuming consistent stroke width within each character.
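As a concrete illustration of the classical CCA pipeline, the sketch below extracts MSER regions as character candidates and filters them with a hand-designed rule. This is a minimal sketch: the input path and thresholds are hypothetical, and the size/aspect-ratio rule is a crude stand-in for the manually designed rules or trained classifiers mentioned above.

```python
import cv2

# Extract MSER components as character candidates (classical CCA step).
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)

candidates = []
for (x, y, w, h) in bboxes:
    aspect = w / float(h)
    # Toy hand-crafted filter: keep character-sized, character-shaped blobs.
    if 10 < w * h < 5000 and 0.1 < aspect < 10:
        candidates.append((x, y, w, h))
# Surviving components would then be grouped into text lines/regions.
```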
For text recognition, one branch adopted feature-based methods. Shi et al. [136] and Yao et al. [165] propose character-segment based recognition algorithms. Rodriguez et al. [121], [122], Gordo et al. [38] and Almazan et al. [4] utilize label embedding to directly perform matching between strings and images. Strokes [12] and character key-points [116] are also detected as features for classification. Another branch decomposed the recognition process into a series of sub-problems. Various methods have been proposed to tackle these sub-problems, including text binarization [75], [105], [149], [179], text line segmentation [167], character segmentation [112], [125], [137], single character recognition [14], [130] and word correction [67], [106], [148], [156], [176].

There have been efforts devoted to integrated (i.e. end-to-end, as we call it today) systems as well [109], [152]. In Wang et al. [152], characters are considered as a special case of object detection, detected by a nearest neighbor classifier trained on HOG features [21] and then grouped into words through a Pictorial Structure (PS) based model [28]. Neumann and Matas [109] proposed a decision-delay approach that keeps multiple segmentations of each character until the last stage, when the context of each character is known. They detect character segmentations using extremal regions and decode recognition results through a dynamic programming algorithm.
Fig. 3: Overview of recent progresses and dominant trends: detection (simplified pipelines, different prediction units, and specific targets such as long, multi-oriented, and irregularly shaped text, and speed-up), recognition (CTC-based and attention-based methods), end-to-end systems (separately trained multi-step vs. end-to-end trainable), and auxiliary technologies (synthetic data, bootstrapping, deblurring, and others).


Fig. 4: Typical pipelines of scene text detection and recognition. (a) [60] and (b) [164] are representative multi-step methods. (c) and (d) are simplified pipelines: (c) [180] only contains a detection branch, and is therefore used together with a separate recognition model; (d) [49], [92] jointly train a detection model and a recognition model.
In summary, text detection and recognition methods before the deep learning era mainly extract low-level or mid-level hand-crafted image features, which entails demanding and repetitive pre-processing and post-processing steps. Constrained by the limited representation ability of hand-crafted features and the complexity of pipelines, those methods can hardly handle intricate circumstances, e.g. blurred images in the ICDAR2015 dataset [68].

3 METHODOLOGY IN THE DEEP LEARNING ERA

As implied by the title of this section, we would like to address recent advances as changes in methodology instead of merely new methods. Our conclusion is grounded in the observations explained in the following paragraphs.

Methods in recent years are characterized by two distinctions: (1) most methods utilize deep-learning based models; (2) most researchers approach the problem from a diversity of perspectives. Methods driven by deep learning enjoy the advantage that automatic feature learning saves us from designing and testing the large number of potential hand-crafted features. At the same time, researchers from different viewpoints are enriching and promoting the community into more in-depth work, aiming at different targets, e.g. faster and simpler pipelines [180], text of varying aspect ratios [131], and synthetic data [41]. As we can also see further in this section, the incorporation of deep learning has totally changed the way researchers approach the task, and has enlarged the scope of research by far. This is the most significant change compared to the former epoch.

In a nutshell, recent years have witnessed a blossoming expansion of research into subdivisible trends. We summarize these changes and trends in Fig. 3, and we follow this diagram in our survey.

In this section, we classify existing methods into a hierarchical taxonomy and introduce them in a top-down style. First, we divide them into four kinds of systems: (1) text detection, which detects and localizes text in natural images; (2) recognition systems, which transcribe and convert the content of detected text regions into linguistic symbols; (3) end-to-end systems, which perform both text detection and recognition in one single pipeline; (4) auxiliary methods, which aim to support the main task of text detection and recognition, e.g. synthetic data generation and image deblurring. Under each category, we review recent methods from different perspectives.

3.1 Detection

We acknowledge that scene text detection can be taxonomically subsumed under general object detection, which is dichotomized into one-staged and two-staged methods. However, the detection of scene text has a different set of characteristics and challenges that require unique methodologies and solutions. Thus, it would be more suitable to classify these algorithms based on their characteristics instead of the aforementioned dichotomy in general object detection. Nevertheless, we encourage readers to refer to recent surveys on object detection methods [43], [85].

There are three main trends in the field of text detection, and we introduce them in the following sub-sections one by one. They are: (1) pipeline simplification; (2) changes in prediction units; (3) specified targets.

3.1.1 Pipeline Simplification

One of the most important trends is the simplification of the pipeline, as shown in Fig. 4. Most methods before the era of deep learning, and some early methods that use deep learning, have multi-step pipelines. More recent methods have largely simplified and much shorter pipelines, which is key to reducing error propagation and simplifying the training process. In the last few years, separately trained two-staged methods have been surpassed by jointly trained ones. The main components of these methods are end-to-end differentiable modules, which is an outstanding property.

Multi-step methods: Early deep-learning based methods [164], [178] (code: https://github.com/stupidZZ/FCN_Text), [45] cast the task of text detection into a multi-step process. In [164], a convolutional neural network is
used to predict whether each pixel in the input image (1) belongs to characters, (2) is inside the text region, and (3) the text orientation around the pixel. Connected positive responses are considered as a detection of a character or text region. For characters belonging to the same text region, Delaunay triangulation [66] is applied, after which a graph partition algorithm groups characters into text lines based on the predicted orientation attribute.

Similarly, [178] first predicts a segmentation map indicating text line regions. For each text line region, MSER [111] is applied to extract character candidates. Character candidates reveal information on the scale and orientation of the underlying text line. Finally, a minimum bounding box is extracted as the final text line candidate.

In [45], the detection process also consists of several steps. First, text blocks are extracted. Then the model crops and only focuses on the extracted text blocks to extract the text center line (TCL), which is defined as a shrunk version of the original text line. Each text center line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel into ones that belong to the same text instance as the given TCL, and ones that do not.

Simplified pipeline: More recent methods follow a 2-step pipeline, consisting of an end-to-end trainable neural network model and a post-processing step that is usually much simpler than previous ones [48]^2, [64], [79], [80]^3, [93], [131]^4, [94]^5, [102]^6, [78], [83]^7, [20], [177]. These methods mainly draw inspiration from techniques in general object detection [29], [33], [34], [46], [86], [119], and modify the region proposal and bounding box regression modules to localize text instances directly. We briefly introduce some representative works here.

2. Code: https://github.com/BestSonny/SSTD
3. Code: https://github.com/MhLiao/TextBoxes
4. Code: https://github.com/bgshih/seglink
5. Code: https://github.com/Yuliang-Liu/Curve-Text-Detector
6. Code: https://github.com/mjq11302010044/RRPN
7. Code: https://github.com/MhLiao/RRD
8. Code: https://github.com/zxytim/EAST

Fig. 5: High-level illustration of existing anchor/ROI-pooling based methods: (a) similar to YOLO [117], predicting at each anchor position; representative methods include rotating default boxes [93]. (b) Variants of SSD [86], including TextBoxes [80], predicting at feature maps of different sizes. (c) Direct regression of bounding boxes [180], also predicting at each anchor position. (d) Region proposal based methods, including rotating RoIs [102] and RoIs of varying aspect ratios [64].

As shown in Fig. 5 (b), TextBoxes [80] adapts SSD to fit the varying orientations and aspect ratios of text by defining default boxes as quadrilaterals with different specs.

A variant of the standard anchor-based default box prediction method is EAST [180]^8. In the standard SSD network, there are several feature maps of different sizes, on which default boxes of different receptive fields are detected. In EAST, all feature maps are integrated together by gradual upsampling, or a U-Net [124] structure to be specific. The size of the final feature map is 1/4 of the original input image, with C channels. Under the assumption that each pixel only belongs to one text line, each pixel on the final feature map, i.e. the 1 × 1 × C feature tensor, is used to regress the rectangular or quadrilateral bounding box of the underlying text line. Specifically, the existence of text, i.e. text/non-text, and geometries, e.g. orientation and size for rectangles, and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its efficiency. Since EAST is most famous for its speed, we re-introduce EAST in later parts, with emphasis on its efficiency.
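To make the per-pixel prediction scheme concrete, the following is a minimal decoding sketch, assuming a score map plus a 5-channel geometry map (four edge distances and an angle) at 1/4 resolution. The channel layout, threshold, and center-based box recovery are illustrative simplifications; the real EAST decoder rotates the edge distances by the predicted angle and merges candidates with locality-aware NMS.

```python
import numpy as np

def decode_east_rbox(score_map, geo_map, score_thresh=0.8, scale=4):
    """Decode per-pixel EAST-style predictions into box candidates.

    score_map: (H/4, W/4) text/non-text probabilities.
    geo_map:   (H/4, W/4, 5) distances to the top/right/bottom/left box
               edges plus a rotation angle (channel layout assumed here).
    """
    candidates = []
    ys, xs = np.where(score_map > score_thresh)
    for y, x in zip(ys, xs):
        d_top, d_right, d_bottom, d_left, theta = geo_map[y, x]
        # Map the feature-map location back to input-image coordinates.
        px, py = x * scale, y * scale
        w, h = d_left + d_right, d_top + d_bottom
        # Simplification: axis-aligned recovery; the actual decoder rotates
        # the four distances by theta around the pixel position.
        candidates.append((px - d_left, py - d_top, w, h, theta,
                           float(score_map[y, x])))
    # Candidates belonging to the same instance are then merged, e.g. with
    # (locality-aware) non-maximum suppression.
    return candidates
```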
Other methods adapt the two-staged object detection framework of R-CNN [33], [34], [119], where the second stage corrects the localization results based on features obtained by Region of Interest (ROI) pooling. Rotation Region Proposal Networks [102] generate rotating region proposals in order to fit text of arbitrary orientations, instead of axis-aligned rectangles. Similarly, R2CNN [64] uses region proposals of different sizes. In FEN [177], a weighted sum of region proposals with different sizes is used, and the final prediction is made by leveraging the textness scores for poolings of 4 different sizes.

The aforementioned methods simplify the overall pipeline and improve efficiency greatly. However, the performance is still limited when faced with irregularly shaped text and long text. Therefore, deep-learning based multi-staged methods have been re-introduced. These methods, as discussed below, use a neural network to predict local attributes, and a post-processing step to re-construct text instances. Compared with early multi-staged methods, they rely more on neural networks and have shorter pipelines.

3.1.2 Decomposing into Sub-Text

A main distinction between text detection and general object detection is that text is homogeneous as a whole and shows locality, while general objects are not. By homogeneity and locality, we refer to the property that any part of a text instance is still text. Humans do not have to see the whole text instance to know it belongs to some text.

Such a property lays a cornerstone for a new branch of text detection methods that only predict sub-text components and then assemble them into a text instance. In this part, we take the perspective of the granularity of text detection. There are two main levels of prediction granularity: text instance level and sub-text level.

Text instance level methods, as mentioned in the last section, follow the standard routine of general object detection. A region-proposal network produces initial guesses for
the localization of possible text instances. Optionally, some methods then use a refinement stage to filter false positives and also correct the localization.

Contrarily, sub-text level detection methods [101], [22]^9, [45], [159], [164], [48]^10, [44], [131], [178], [142]^11, [150], [183] only predict parts that are combined to make up a text instance. Such sub-text mainly includes pixel-level and component-level predictions.

9. Code: https://github.com/ZJULearning/pixel_link
10. Code: https://github.com/BestSonny/SSTD
11. Code: https://github.com/tianzhi0549/CTPN
12. Code: https://github.com/zlmzju/itn

In pixel-level methods [22], [45], [48], [159], [164], [178], an end-to-end fully convolutional neural network learns to generate a dense prediction map indicating whether each pixel in the original image belongs to any text instance or not. Post-processing methods then group pixels together depending on which pixels belong to the same text instance. Since text can appear in clusters, which makes predicted pixels connected to each other, the core of pixel-level methods is to separate text instances from each other. PixelLink [22] learns to predict whether two adjacent pixels belong to the same text instance by adding link prediction to each pixel. The border learning method [159] casts each pixel into three categories: text, border, and background, assuming that borders can well separate text instances. In Holistic [164], pixel-prediction maps include both the text-block level and the character-center level. Since the centers of characters do not overlap, the separation is done easily.

In this part we only intend to introduce the concept of prediction units. We come back to details regarding the separation of text instances in the section on Specific Targets.

Component-level methods [44], [101], [131], [142], [150], [183] usually predict at a medium granularity. A component refers to a local region of a text instance, sometimes containing one or more characters.

As shown in Fig. 6 (a), SegLink [131] modifies SSD [86]. Instead of default boxes that represent whole objects, SegLink defines default boxes as having only one aspect ratio. Each default box represents a text segment. Besides, links between default boxes are predicted to indicate whether the linked segments belong to the same text instance.

The corner localization method [101] proposes to detect the four corners of each text instance. Since each text instance only has 4 corners, the prediction results and their relative positions can indicate which corners should be grouped into the same text instance.

SegLink [131] and corner localization [101] are proposed specially for long and multi-oriented text. We only introduce the idea here and discuss more details in the section on Specific Targets, regarding how they are realized.

In [150], pixels are clustered according to their color consistency and edge information. The fused image segments are called superpixels. These superpixels are further used to extract characters and predict text instances.

Another branch of component-level methods is the Connectionist Text Proposal Network (CTPN) [142], [158], [183]. CTPN models inherit the idea of anchoring and recurrent neural networks for sequence labeling. They usually consist of a CNN-based image classification network, e.g. VGG, and stack an RNN on top of it. Each position in the final feature map represents features in the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Geometries are also predicted.

Generally speaking, sub-text level methods are more robust to the size, aspect ratio, and shape of different text instances. However, the efficiency of the post-processing step may depend on the actual implementation, and is slow in some cases. The lack of a refinement step may also harm the performance.

3.1.3 Specific Targets

Another characteristic of current text detection systems is that most of them are designed for special purposes, attempting to tackle specific difficulties in detecting scene text. We broadly classify them into the following aspects.

3.1.3.1 Long Text: Unlike general objects, text usually comes in varying aspect ratios. Text has a much larger width-height ratio, and thus the general object detection framework would fail. Several methods have been proposed [64], [101], [131] to detect long text.

R2CNN [64] gives an intuitive solution, where ROI poolings with different sizes are used. Following the framework of Faster R-CNN [119], three ROI-poolings with varying pooling sizes, 7 × 7, 3 × 11, and 11 × 3, are performed for each box generated by the region-proposal network, and the pooled features are concatenated for the textness score.

Another branch learns to detect local sub-text components which are independent from the whole text [22], [101], [131]. SegLink [131] proposes to detect components, i.e. square areas that are text, and how these components are linked to each other. PixelLink [22] predicts which pixels belong to any text and whether adjacent pixels belong to the same text instance. Corner localization [101] detects text corners. All these methods learn to detect local components and then group them together to make the final detections.

Zhang et al. [175] propose to perform the ROI and localization branches recursively, to revise the predicted position of the text instance. It is a good way to include features at the boundaries of bounding boxes, which localizes text better than an RPN network.

While methods on the text instance level may fail due to limited receptive fields, sub-text methods may suffer from the lack of end-to-end optimization. Therefore, the challenge of long text still remains unsolved.

3.1.3.2 Multi-Oriented Text: Another distinction from general object detection is that text detection is rotation-sensitive and skewed text is common in the real world, while using traditional axis-aligned prediction boxes would incorporate noisy background that affects the performance of the following text recognition module. Several methods have been proposed to adapt to this [64], [80], [83], [93], [102], [131], [180], [151]^12.

Extending from general anchor-based methods, rotating default boxes [80], [93] are used, with predicted rotation offsets. Similarly, rotating region proposals [102] are generated with 6 different orientations. Regression-based methods [64], [131], [180] predict the rotation and positions of vertexes, which are insensitive to orientation. Further,
in Liao et al. [83], rotating filters [181] are incorporated to model orientation-invariance explicitly. The peripheral weights of 3 × 3 filters rotate around the center weight, to capture features that are sensitive to rotation.

While the aforementioned methods may entail additional post-processing, Wang et al. [151] propose to use a parametrized Instance Transformation Network (ITN) that learns to predict an appropriate affine transformation to perform on the last feature layer extracted by the base network, to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

3.1.3.3 Text of Irregular Shapes: Apart from varying aspect ratios, another distinction is that text can have a diversity of shapes, e.g. curved text. Curved text poses a new challenge, since a regular rectangular bounding box would incorporate a large proportion of background and even other text instances, making it difficult for recognition.

Extending from quadrilateral bounding boxes, it is natural to use bounding 'boxes' with more than 4 vertexes. Bounding polygons [94] with as many as 14 vertexes are proposed, followed by a Bi-LSTM [53] layer to refine the coordinates of the predicted vertexes. In their framework, however, axis-aligned rectangles are extracted as intermediate results in the first step, and the bounding polygons are predicted upon them.

Similarly, Lyu et al. [100] modify the Mask R-CNN [46] framework, so that for each region of interest—in the form of axis-aligned rectangles—character masks are predicted separately for each type of alphabet. These predicted characters are then aligned together to form a polygon as the detection result. Notably, they propose their method as an end-to-end system. We refer to it again in the following part.

Fig. 6: Illustration of representative bottom-up methods: (a) SegLink [131]: with SSD as the base network, predict word segments at each anchor position, and connections between adjacent anchors. (b) PixelLink [22]: for each pixel, predict text/non-text classification and whether it belongs to the same text as adjacent pixels or not. (c) Corner Localization [101]: predict the four corners of each text instance and group those belonging to the same instance. (d) TextSnake [97]: predict text/non-text and local geometries, which are used to reconstruct the text instance.

Fig. 7: (a)-(c): Representing text as horizontal rectangles, oriented rectangles, and quadrilaterals. (d): The sliding-disk representation proposed in TextSnake [97].

Viewing the problem from a different perspective, Long et al. [97] argue that text can be represented as a series of sliding round disks along the text center line (TCL), which is in accord with the running direction of the text instance, as shown in Fig. 7. With this novel representation, they present a new model, TextSnake, as shown in Fig. 6 (d), that learns to predict local attributes, including TCL/non-TCL, text-region/non-text-region, radius, and orientation. The intersection of TCL pixels and text region pixels gives the final prediction of the pixel-level TCL. Local geometries are then used to extract the TCL in the form of an ordered point list, as demonstrated in Fig. 6 (d). With the TCL and radii, the text line is reconstructed. TextSnake achieves state-of-the-art performance on several curved text datasets as well as more widely used ones, e.g. ICDAR2015 [68] and MSRA-TD500 [145]. Notably, Long et al. propose a cross-validation test across different datasets, where models are only fine-tuned on datasets with straight text instances and tested on the curved datasets. On all existing curved datasets, TextSnake achieves improvements of up to 20% over other baselines in F1-score.
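The sliding-disk representation can be made concrete with a short sketch: given an ordered list of TCL points with per-point radii (the local attributes the model predicts), the text region is recovered as the union of disks. The point format and the toy instance below are assumptions for illustration only.

```python
import numpy as np

def reconstruct_text_region(tcl_points, radii, height, width):
    """Rasterize the union of disks along a predicted text center line."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=bool)
    for (cx, cy), r in zip(tcl_points, radii):
        mask |= (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    return mask

# e.g. a gently curved instance described by five disks:
mask = reconstruct_text_region(
    tcl_points=[(20, 30), (35, 28), (50, 27), (65, 28), (80, 30)],
    radii=[8, 9, 9, 9, 8], height=60, width=100)
```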
A simple substitute for bounding box regression is polygon regression. Wang et al. [155] propose to use an RNN attached to the features encoded by an RPN-based two-staged object detector, to predict bounding polygons of variable length. The method requires no post-processing or complex intermediate steps, and achieves a much faster speed of 10.0 FPS on Total-Text.

Similar to SegLink, Baek et al. [6] propose to learn the character centers and the links between them. Both components and links are predicted in the form of heat maps. However, this method requires iterative weak supervision, as real-world datasets are rarely equipped with character-level labels. It is likely that such iterative methods are Knowledge Distillation [30], [52] (KD) themselves, and therefore it is unclear whether the improvement comes from the method itself or from the iterative KD process.

Despite the improvements achieved by these methods, few methods except TextSnake have considered the problem of generalization ability. As curved text mainly appears on street billboards, curved text datasets in other domains may be less available. Therefore, generalization ability will be important under this circumstance.

3.1.3.4 Speedup: Current text detection methods place more emphasis on speed and efficiency, which is necessary for applications on mobile devices.

The first work to gain significant speedup is EAST [180], which makes several modifications to previous frameworks. Instead of VGG [139], EAST uses PVANet [71] as its base network, which strikes a good balance between efficiency and accuracy in the ImageNet competition. Besides, it simplifies the whole pipeline into a prediction network and a non-maximum suppression step. The prediction network is a U-shaped [124] fully convolutional network that maps an input image I ∈ R^{H×W×C} to a feature map F ∈ R^{(H/4)×(W/4)×K}, where each position f = F_{i,j,:} ∈ R^K is the feature vector that describes the predicted text instance, i.e. the location of the vertexes or edges, the orientation, and the offsets of the center, for the text instance corresponding to that feature position (i, j). Feature vectors that correspond
to the same text instance are merged with non-maximum suppression. EAST achieves state-of-the-art speed at 16.8 FPS as well as leading performance on most datasets.

3.1.3.5 Instance Segmentation: Recent years have witnessed methods with dense predictions, i.e. pixel-level predictions [22], [45], [114], [159]. These methods generate a prediction map classifying each pixel as text or non-text. However, as text instances may come near each other, pixels of different text instances may be adjacent in the prediction map. Therefore, separating pixels becomes important.

A pixel-level text center line is proposed in [45], since center lines are far from each other. These text lines can be easily separated as they are not adjacent. To produce the prediction for a text instance, a binary map of the text center line of a text instance is attached to the original input image and fed into a classification network. A saliency mask is generated to indicate the detected text. However, this method consists of non-differentiable steps. The text-line generation step and the final prediction step cannot be trained end-to-end, and error propagates.

Another way to separate different text instances is to use the concept of border learning [114], [159], [161], where each pixel is classified into one of three classes: text, non-text, and text border. The text border then separates text pixels that belong to different instances. Similarly, in the work of Xue et al. [161], text is considered to be enclosed by 4 segments, i.e. a pair of long-side borders (abdomen and back) and a pair of short-side borders (head and tail). The method of Xue et al. is also the first to use DenseNet [56] as its base network, which provides a consistent 2−4% performance boost in F1-score over ResNet [47] on all datasets it is evaluated on.
Following SegLink, PixelLink [22] learns to link pixels belonging to the same text instance. Text pixels are then separated into groups for different instances efficiently via a disjoint-set algorithm. Likewise, Liu et al. [96] propose to predict the composition of adjacent pixels with Markov Clustering [147], instead of neural networks. The Markov Clustering algorithm is applied to the saliency map of the input image generated by neural networks, which indicates whether each pixel belongs to any text instance or not. The clustering results then give the segmented text instances.
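The disjoint-set grouping step can be sketched as follows: text pixels are nodes, each predicted positive link between two adjacent text pixels unions their sets, and every final set is one instance. The input format (pixel indices and link pairs) is an assumption; both would come from the network's text and link maps.

```python
# A minimal union-find sketch for link-based instance grouping.
class DSU:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def group_instances(num_text_pixels, links):
    """links: predicted positive (pixel, neighbor) index pairs."""
    dsu = DSU(num_text_pixels)
    for a, b in links:
        dsu.union(a, b)
    groups = {}
    for p in range(num_text_pixels):
        groups.setdefault(dsu.find(p), []).append(p)
    return list(groups.values())  # one pixel group per text instance
```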
Tian et al. [143] propose to add a loss term that maximizes the Euclidean distances between pixel embedding vectors that belong to different text instances, and minimizes those belonging to the same instance, to better separate adjacent texts.

Segmentation based methods have proven successful. However, the lack of end-to-end training may limit their performance. It remains a challenge how to implement end-to-end optimization for these methods.
3.1.3.6 Retrieving Designated Text: Different from the classical setting of scene text detection, sometimes we want to retrieve a certain text instance given a description. Rong et al. [123] present a multi-encoder framework to retrieve text as designated. Specifically, text is retrieved as required by a natural language query. The multi-encoder framework includes a Dense Text Localization Network (DTLN) and a Context Reasoning Text Retrieval (CRTR) module. DTLN uses an LSTM to decode the features of an FCN network into a sequence of text instances. CRTR encodes the query and the features of the scene text image to rank the candidate text regions generated by DTLN. As far as we are concerned, this is the first work that retrieves text according to a query.

3.1.3.7 Against Complex Background: An attention mechanism is introduced to silence the complex background [48]. The stem network is similar to that of the standard SSD framework predicting word boxes, except that it applies inception blocks on its cascading feature maps, obtaining what is called the Aggregated Inception Feature (AIF). An additional text attention module is added, which is again based on inception blocks. The attention is applied on all AIF, suppressing the influence of background noise.

3.2 Recognition

In this section, we introduce methods that tackle text recognition. The input of these methods are cropped text instance images which contain one word or one text line.

In traditional text recognition methods [10], [137], the task is divided into 3 steps: image pre-processing, character segmentation and character recognition. Character segmentation is considered the most challenging part due to the complex background and irregular arrangement of scene text, and it largely constrained the performance of the whole recognition system. Two major techniques are adopted to avoid segmentation of characters, namely Connectionist Temporal Classification [39] and the attention mechanism. We introduce recognition methods in the literature based on the main technique they employ. Mainstream frameworks are illustrated in Fig. 8.

3.2.1 CTC-based Methods

CTC computes the conditional probability P(L|Y), where Y = y_1, ..., y_T represents the per-frame predictions of the RNN and L is the label sequence, so that the network can be trained using only sequence-level labels as supervision. The first application of CTC in the OCR domain can be traced to the handwriting recognition system of Graves et al. [40]. Now this technique is widely adopted in scene text recognition [140], [88], [132]^13, [31], [169].

Shi et al. [132] propose a model, CRNN, that stacks an RNN on top of a CNN to recognize scene text images. As illustrated in Fig. 8 (a), CRNN consists of three parts: (1) convolutional layers, which extract a feature sequence from the input image; (2) recurrent layers, which predict a label distribution for each frame; (3) a transcription (CTC) layer, which translates the per-frame predictions into the final label sequence.

Instead of an RNN, Gao et al. [31] adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation. The overall difference from other frameworks is illustrated in Fig. 8 (b).

Yin et al. [169] simultaneously detect and recognize characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled with text transcripts.

13. Code: https://github.com/bgshih/crnn
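The CNN + RNN + CTC recipe can be sketched in a few lines of PyTorch. This is a minimal toy model in the spirit of CRNN, not the authors' exact architecture: the layer sizes, the 37-class alphabet, and the dummy inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Conv layers squeeze the image into a per-column feature sequence,
    a BiLSTM adds context, and a linear head predicts per-frame character
    distributions trained with CTC (so only word-level labels are needed)."""
    def __init__(self, num_classes, height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.rnn = nn.LSTM(128 * (height // 4), 256,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # classes include the CTC blank

    def forward(self, x):            # x: (B, 1, H, W)
        f = self.cnn(x)              # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one frame per column
        f, _ = self.rnn(f)
        return self.fc(f)            # (B, T, num_classes)

model = TinyCRNN(num_classes=37)                # 26 letters + 10 digits + blank
logits = model(torch.randn(2, 1, 32, 128))      # two cropped word images
log_probs = logits.log_softmax(-1).permute(1, 0, 2)  # CTC expects (T, B, C)
targets = torch.randint(1, 37, (2, 5))          # dummy label sequences
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0), dtype=torch.long),
    target_lengths=torch.full((2,), 5, dtype=torch.long),
)
```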
Fig. 8: Frameworks of text recognition models. The basic methodology is to first resize the cropped image to a fixed height, then extract features and feed them to an RNN that produces a character prediction for each column. As the number of columns of the features is not necessarily equal to the length of the word, the CTC technique [39] is proposed as a post-processing stage. (a) RNN stacked with CNN [132]; (b) sequence prediction with FCN [31]; (c) attention-based models [16], [32], [73], [95], [133], [162], allowing decoding of text of varying lengths; (d) Cheng et al. [15] propose to apply supervision to the attention module; (e) to improve the misalignment problem in previous methods with fixed-length attention decoding, Edit Probability [8] is proposed to reorder the predicted sequential distribution.
3.2.2 Attention-based Methods

The attention mechanism was first presented in [7] to improve the performance of neural machine translation systems, and has flourished in many machine learning application domains, including scene text recognition.

Lee et al. [73] present recursive recurrent neural networks with attention modeling for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features, and then decodes them into output characters via recurrent neural networks with implicitly learned character-level language statistics. The attention-based mechanism performs soft feature selection for better image feature usage.

Cheng et al. [15] observe the attention drift problem in existing attention-based methods and propose to impose localization supervision on the attention scores to attenuate it, as shown in Fig. 8 (d).
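To make the attention-decoding scheme concrete, the following is a minimal single decoding step in the spirit of the methods above, not any specific paper's model: the decoder scores all encoder frames, pools them into a glimpse, and predicts the next character. Decoding stops at an end-of-sequence symbol, which is how such models handle text of varying lengths. All dimensions and the additive-style scoring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnStep(nn.Module):
    """One attention-based decoding step (sketch)."""
    def __init__(self, enc_dim, hid_dim, num_classes):
        super().__init__()
        self.score = nn.Linear(enc_dim + hid_dim, 1)  # additive-style scoring
        self.rnn = nn.GRUCell(enc_dim, hid_dim)
        self.out = nn.Linear(hid_dim, num_classes)

    def forward(self, enc, h):          # enc: (B, T, E) frames, h: (B, H) state
        q = h.unsqueeze(1).expand(-1, enc.size(1), -1)
        e = self.score(torch.cat([enc, q], dim=-1))
        alpha = e.softmax(dim=1)        # attention weights over encoder frames
        glimpse = (alpha * enc).sum(1)  # (B, E) attended feature
        h = self.rnn(glimpse, h)
        return self.out(h), h, alpha    # per-step character logits
```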
Jaderberg et al. [58], [59] perform word recognition by classifying the image into a pre-defined vocabulary, under the framework of image classification. The model is trained on synthetic images and achieves state-of-the-art performance on some benchmarks containing English words only. But the application of this method is quite limited, as it cannot be applied to recognize long sequences such as phone numbers.

In [95], Liu et al. propose an efficient attention-based encoder-decoder model, in which the encoder part is trained under binary constraints to reduce computation.

Among the attention-based methods, some works have made efforts to accurately recognize irregular (perspectively distorted or curved) text. Shi et al. [133], [134] propose a text recognition system which combines a Spatial Transformer Network (STN) [61] and an attention-based sequence recognition network. The STN predicts a Thin-Plate-Spline transformation which rectifies the input irregular text image into a more canonical form.

Yang et al. [162] introduce an auxiliary dense character detection task to encourage the learning of visual representations that are favorable to the text patterns. They adopt an alignment loss to regularize the estimated attention at each time step. Further, they use a coordinate map as a second input to enforce spatial awareness.

In [16], Cheng et al. argue that encoding a text image as a 1-D sequence of features, as implemented in most methods, is not sufficient. They encode an input image into four feature sequences of four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is applied to combine the four feature sequences.

Liu et al. [87] present a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, resulting in improved efficiency, and can handle different types of distortion that are hard to model with a single global transformation.

3.2.3 Other Efforts

In [8], Bai et al. propose an edit probability (EP) metric to handle the misalignment between the ground-truth string and the attention's output sequence of probability distributions, as shown in Fig. 8 (e). Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximal likelihood loss, EP tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing or superfluous characters.

Liao et al. [82] cast the task of recognition as semantic segmentation, and treat each character type as one class. The method is insensitive to shapes and is thus effective on irregular text, but the lack of end-to-end training and sequence learning makes it prone to single-character errors, especially when the image quality is low. They are also the first to evaluate the robustness of their recognition method by padding and transforming test images. Also note that 2-dimensional attention [160] can also be a solution to such curved text, which has been verified in [77].

Despite the progress we have seen so far, the evaluation of recognition methods falls behind the times. As most detection methods can detect oriented and irregular text, and some even rectify it, the recognition of such text may seem redundant. On the other hand, the robustness of recognition when text is cropped with slightly different bounding boxes is seldom verified. Such robustness may be more important in real-world scenarios.
3.3 End-to-End System

In the past, text detection and recognition were usually cast as two independent sub-problems that are combined to perform text reading from images. Recently, many end-to-end text detection and recognition systems (also known as text spotting systems) have been proposed, profiting greatly from the idea of designing differentiable computation graphs. Efforts to build such systems have gained considerable momentum as a new trend.

Fig. 9: Illustration of mainstream end-to-end frameworks. The basic idea is to concatenate the two branches. (a): In SEE [9], the detection results are represented as grid matrices; image regions are cropped and transformed before being fed into the recognition branch. (b): In contrast to (a), some methods crop from the feature maps and feed them to the recognition branch [13], [49], [76], [92]. (c): While (a) and (b) utilize CTC-based and attention-based recognition branches, it is also possible to retrieve each character as a generic object and compose the text [100].

While earlier works [152], [154] first detect single characters in the input image, recent systems usually detect and recognize text at the word or line level. Some of these systems first generate text proposals using a text detection model and then recognize them with another text recognition model [41], [60], [80]. Jaderberg et al. [60] use a combination of Edge Box proposals [185] and a trained aggregate channel features detector [24] to generate candidate word bounding boxes. Proposal boxes are filtered and rectified before being sent to their recognition model proposed in [59]. In [80], Liao et al. combine an SSD [86] based text detector and CRNN [132] to spot text in images. Lyu et al. [100] propose a modification of Mask R-CNN that is adapted to produce shape-free recognition of scene text, as shown in Fig. 9 (c). For each region of interest, character maps are produced, indicating the existence and location of single characters. A post-processing step that links these characters together gives the final results.

One major drawback of these two-step methods is that the propagation of error between the detection and recognition models leads to less satisfactory performance. Recently, more end-to-end trainable networks have been proposed to tackle this problem [9]^14, [13]^15, [49], [76], [92].

14. Code: https://github.com/Bartzi/see
15. Code: https://github.com/MichalBusta/DeepTextSpotter

Bartz et al. [9] present a solution which utilizes an STN [61] to circularly attend to each word in the input image, and then recognize them separately. The unified network is trained in a weakly-supervised manner, i.e. no word bounding box labels are used. Li et al. [76] substitute the object classification module in Faster R-CNN [119] with an encoder-decoder based text recognition model to make up their text spotting system. Liu et al. [92], Busta et al. [13] and He et al. [49] develop unified text detection and recognition systems with very similar overall architectures, which consist of a detection branch and a recognition branch. Liu et al. [92] and Busta et al. [13] adopt EAST [180] and YOLOv2 [118] as their detection branches respectively, and have a similar text recognition branch in which text proposals are mapped into fixed-height tensors by bilinear sampling and then transcribed into strings by a CTC-based recognition module. He et al. [49] also adopt EAST [180] to generate text proposals, and introduce character spatial information as explicit supervision in their attention-based recognition branch.
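The bilinear-sampling crop that connects the two branches can be sketched with grid sampling. This is a minimal sketch for an axis-aligned proposal, with illustrative box format and output sizes; the jointly trained systems above apply variants of this idea (and remain differentiable with respect to the feature maps).

```python
import torch
import torch.nn.functional as F

def roi_bilinear_crop(features, box, out_h=8, out_w=32):
    """Warp a text proposal from a feature map to a fixed-height tensor.

    features: (1, C, H, W); box: (x1, y1, x2, y2) in [-1, 1] normalized
    coordinates (assumed format for this sketch).
    """
    x1, y1, x2, y2 = box
    ys = torch.linspace(y1, y2, out_h)
    xs = torch.linspace(x1, x2, out_w)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)  # (1, h, w, 2)
    return F.grid_sample(features, grid, align_corners=True)   # (1, C, h, w)

crop = roi_bilinear_crop(torch.randn(1, 64, 32, 100), (-0.5, -0.2, 0.7, 0.2))
```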
3.4 Auxiliary Techniques

Recent advances are not limited to detection and recognition models that aim to solve the tasks directly. We should also give credit to auxiliary techniques that have played an important role. In this part, we briefly introduce these promising aspects: synthetic data, bootstrapping, text deblurring, and the incorporation of context information.

3.4.1 Synthetic Data

Most deep learning models are data-thirsty; their performance is guaranteed only when enough data is available. Therefore, artificial data generation has been a popular research topic, e.g. Generative Adversarial Nets (GAN) [37]. In the field of text detection and recognition, this problem is more urgent, since most human-labeled datasets are small, usually containing merely around 1K−2K data instances. Fortunately, there have been works [41], [59], [81], [174] that generate data of relatively high quality, which have been widely used to pre-train models for better performance.

Jaderberg et al. [59] propose to generate synthetic data for text recognition. Their method blends text with randomly cropped natural images from human-labeled datasets after rendering of font, border/shadow, color, and distortion. The results show that training merely on these synthetic data can achieve state-of-the-art performance, and that synthetic data can act as an augmentative data source for all datasets.

SynthText [41] (code: https://github.com/ankush-me/SynthText) first proposes to embed text in natural scene images for the training of text detection, while most
previous works only print text on a cropped region, and such synthetic data are only for text recognition. Printing text on whole natural images poses new challenges, as it needs to maintain semantic coherence. To produce more realistic data, SynthText makes use of depth prediction [84] and semantic segmentation [5]. Semantic segmentation groups pixels together into semantic clusters, and each text instance is printed on one semantic surface, not overlapping multiple ones. A dense depth map is further used to determine the orientation and distortion of the text instance. Models trained only on SynthText achieve state-of-the-art performance on many text detection datasets. SynthText is also used in other works [131], [180] for initial pre-training.

Further, Zhan et al. [174] equip text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or wall instead of someone's face. Text rendering in their work is adapted to the image so that the text fits the artistic style and does not stand out awkwardly.

Liao et al. [81] propose to use the famous open-source game engine, Unreal Engine 4 (UE4), together with UnrealCV [115], to synthesize images. Text is rendered with the scene together, and thus can achieve different lighting conditions, weather, and natural occlusions.

While the training of recognizers has largely shifted to synthetic data, it remains a challenge how to synthesize images that help in training strong detectors.
area(Bchars ) λ2
s=w· + (1 − w) · (1 − ) (1)
area(Bword ) λ1

WordSup [55] first initializes the character detector by training 5K warm-up iterations on a synthetic dataset, as shown in Fig. 10 (b). For each image, WordSup generates character candidates, which are then filtered with word boxes. For the characters in each word box, the following score is computed to select the most probable character list:

s = w · area(B_chars)/area(B_word) + (1 − w) · (1 − λ2/λ1)    (1)

where B_chars is the union of the selected character boxes; B_word is the enclosing word bounding box; λ1 and λ2 are the first and second largest eigenvalues of a covariance matrix C, computed from the coordinates of the centers of the selected character boxes; and w is a weight scalar. Intuitively, the first term measures how completely the selected characters cover the word box, while the second term measures whether the selected characters are located on a straight line, which is a main characteristic of word instances in most datasets.
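As an illustration of Eq. (1), the sketch below (ours; it simplifies area(B_chars) to the sum of character-box areas, i.e. it ignores overlaps between boxes) scores one candidate character list:

import numpy as np

def wordsup_score(char_boxes, word_box, w=0.5):
    # char_boxes: (N, 4) selected character boxes (x1, y1, x2, y2), N >= 2.
    # word_box: (4,) enclosing word bounding box.
    areas = (char_boxes[:, 2] - char_boxes[:, 0]) * (char_boxes[:, 3] - char_boxes[:, 1])
    word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    coverage = areas.sum() / word_area  # first term of Eq. (1)
    centers = np.stack([(char_boxes[:, 0] + char_boxes[:, 2]) / 2,
                        (char_boxes[:, 1] + char_boxes[:, 3]) / 2], axis=1)
    lam = np.linalg.eigvalsh(np.cov(centers, rowvar=False))  # ascending order
    straightness = 1.0 - lam[0] / max(lam[1], 1e-12)         # 1 - lambda2/lambda1
    return w * coverage + (1.0 - w) * straightness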
WeText [141] starts with a small dataset annotated at character level. It follows two paradigms of bootstrapping: semi-supervised learning and weakly-supervised learning. In the semi-supervised setting, detected character candidates are filtered with a high threshold value. In the weakly-supervised setting, ground-truth word boxes are used to mask out false positives outside them. New instances detected in either way are added to the initial small dataset, and the model is re-trained.

3.4.3 Text Deblurring

By nature, text detection and recognition are more sensitive to blurring than general object detection. Some methods [54]17, [70] have been proposed for text deblurring.

17. Code: http://www.fit.vutbr.cz/∼ihradis/CNN-Deblur/
Hradis et al. [54] propose an FCN-based deblurring method. The core FCN maps the blurred input image to a deblurred output image. They collect a dataset of well-taken images of documents, and process them with kernels designed to mimic hand-shake and de-focus.

Khare et al. [70] propose a quite different framework. Given a blurred image g, it aims to alternately optimize the original image f and the kernel k by minimizing the following energy value:

E = ∫ (k(x, y) ∗ f(x, y) − g(x, y))² dx dy + λ ∫ w R(k(x, y)) dx dy    (2)

where λ is the regularization weight, and the operator R is the Gaussian-weighted (w) L1 norm. The optimization is done by alternately optimizing over the kernel k and the original image f.
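For concreteness, a discrete version of Eq. (2) can be evaluated as below (our sketch; the alternating minimization of [70] would then descend on f and k in turn, keeping the other fixed):

import numpy as np
from scipy.signal import fftconvolve

def deblur_energy(f, k, g, lam, weight):
    # f: latent sharp image; k: blur kernel; g: observed blurred image.
    # weight: Gaussian weight map of the same shape as k.
    data_term = np.sum((fftconvolve(f, k, mode="same") - g) ** 2)
    prior_term = lam * np.sum(weight * np.abs(k))  # Gaussian-weighted L1 on k
    return data_term + prior_term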
3.4.4 Context Information

Another way to make more accurate predictions is to take the context information into account. Intuitively, we know that text only appears on certain surfaces, e.g. billboards, books, etc. Text is less likely to appear on the face of a human or an animal. Following this idea, Zhu et al. [182] propose to incorporate the semantic segmentation result as part of the input. The additional feature filters out false positives where the patterns merely look like text.

4 BENCHMARK DATASETS AND EVALUATION PROTOCOLS

As cutting-edge algorithms achieve better performance on existing datasets, researchers are able to tackle more challenging aspects of the problems. New datasets aimed at different real-world challenges have been and are being crafted, further benefiting the development of detection and recognition methods.

In this section, we list and briefly introduce the existing datasets and the corresponding evaluation protocols. We also identify current state-of-the-art approaches on the widely used datasets when applicable.

4.1 Benchmark Datasets

We collect existing datasets and summarize their statistics in Tab. 1. Then we discuss their characteristics in the following parts. We also select some representative image samples from some of the datasets, which are demonstrated in Fig. 11. Links to these datasets are also collected in our Github repository mentioned in the abstract, for readers' convenience.

Fig. 11: Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD500, ICDAR2013, ICDAR2015, ICDAR2017 MLT, ICDAR2017 RCTW, and Total-Text.

4.1.1 Datasets with both detection and recognition tasks

The ICDAR datasets: The ICDAR Robust Reading Competition [68], [69], [98], [99], [129], [135] was started in 2003 and is held every two years with different topics. It has brought about a series of scene text datasets that have shaped the research community. Among the horizontal text sections, ICDAR 2013 was modified from and replaced ICDAR 2003/2005/2011 as the evaluation benchmark in later works. ICDAR 2013 is characterized by large and horizontal text. State-of-the-art results on ICDAR 2013 are shown in Tab. 2 for detection and Tab. 8 for recognition.

The ICDAR 2015 incidental text channel introduced a new challenge. The images are taken by Google Glasses without taking care of the image quality. A large proportion of text in the images is very small, blurred, occluded, and multi-oriented. State-of-the-art results are shown in Tab. 3 for detection and Tab. 8 for recognition.

The ICDAR 2017 Competition on Reading Chinese Text in the Wild proposed a Chinese text dataset. It is comprised of 12,263 images, making it the largest dataset at that time and the first large Chinese dataset.

The Chinese Text in the Wild (CTW) dataset [173] contains 32,285 high-resolution street view images, annotated at the character level, including the underlying character type, bounding box, and detailed attributes such as whether wordart is used. The dataset is the largest one to date, and the only one that contains such detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g. English.

Total-Text has a large proportion of curved text, while previous datasets contain only few such instances. These images are mainly taken from street billboards, and annotated as polygons with a variable number of vertices. State-of-the-art results for Total-Text are shown in Tab. 4 for detection and recognition.
TABLE 1: Existing datasets: * indicates datasets that are the most widely used across recent publications. Newly published
ones representing real-world challenges are marked in bold. EN stands for English and CN stands for Chinese.
Dataset (Year) | Image Num (train/test) | Text Num (train/test) | Orientation | Language | Characteristics | Detection Task | Recognition Task
ICDAR03 (2003) | 258/251 | 1110/1156 | Horizontal | EN | - | X | X
*ICDAR13 Scene Text (2013) | 229/233 | 848/1095 | Horizontal | EN | Character stroke annotations | X | X
*ICDAR15 Incidental Text (2015) | 1000/500 | -/- | Multi-Oriented | EN | Blur, Small | X | X
ICDAR RCTW (2017) | 8034/4229 | -/- | Multi-Oriented | CN | - | X | X
Total-Text (2017) | 1255/300 | -/- | Curved | EN, CN | Polygon label | X | X
SVT (2010) | 100/250 | 257/647 | Horizontal | EN | - | X | X
*CUTE (2014) | -/80 | -/- | Curved | EN | - | X | X
CTW (2017) | 25K/6K | 812K/205K | Multi-Oriented | CN | Fine-grained annotation | X | X
CASIA-10K (2018) | 7K/3K | -/- | Multi-Oriented | CN | - | X | X
*MSRA-TD500 (2012) | 300/200 | 1068/651 | Multi-Oriented | EN, CN | Long text | X | -
HUST-TR400 (2014) | 400/- | -/- | Multi-Oriented | EN, CN | Long text | X | -
ICDAR17MLT (2017) | 9000/9000 | -/- | Multi-Oriented | 9 languages | - | X | -
CTW1500 (2017) | 1000/500 | -/- | Curved | EN | - | X | -
*IIIT 5K-Word (2012) | 2000/3000 | 2000/3000 | Horizontal | - | - | - | X
SVTP (2013) | -/639 | -/639 | Multi-Oriented | EN | Perspective text | - | X
SVHN (2010) | 73257/26032 | 73257/26032 | Horizontal | - | House number digits | - | X
TABLE 2: Detection on ICDAR2013 based on DetEval. ∗ means multi-scale.

Method | Precision | Recall | F-1 | FPS
Zhang et al. [178] 88 78 83 -
SynthText [41] 92.0 75.5 83.0 -
Holistic [164] 88.88 80.22 84.33 -
PixelLink [22] 86.4 83.6 84.5 -
CTPN [142] 93 83 88 7.1
He et al.∗ [45] 93 79 85 -
SegLink [131] 87.7 83.0 85.3 20.6
He et al.∗ [50] 92 80 86 1.1
TextBox++ [80] 89 83 86 1.37
EAST [180] 92.64 82.67 87.37 -
SSTD [48] 89 86 88 7.69
Lyu et al. [101] 93.3 79.4 85.8 10.4
Liu et al. [96] 88.2 87.2 87.7 -
He et al.∗ [49] 88 87 88 -
Xue et al. [161] 91.5 87.1 89.2 -
WordSup∗ [55] 93.34 87.53 90.34 -
Lyu et al.∗ [100] 94.1 88.1 91.0 4.6
FEN [177] 93.7 90.0 92.3 1.11
Baek et al. [6] 97.4 93.1 95.2 -

TABLE 3: Detection on ICDAR2015. ∗ means multi-scale.

Method | Precision | Recall | F-1 | FPS
Zhang et al. [178] 71 43.0 54 -
CTPN [142] 74 52 61 7.1
Holistic [164] 72.26 58.69 64.77 -
He et al.∗ [45] 76 54 63 -
SegLink [131] 73.1 76.8 75.0 -
SSTD [48] 80 73 77 -
EAST [180] 83.57 73.47 78.20 13.2
He et al.∗ [50] 82 80 81 -
R2CNN [64] 85.62 79.68 82.54 0.44
Liu et al. [96] 72 80 76 -
WordSup∗ [55] 79.33 77.03 78.16 -
Wang et al. [151] 85.7 74.1 79.5 -
Lyu et al. [101] 94.1 70.7 80.7 3.6
TextSnake [97] 84.9 80.4 82.6 1.1
He et al.∗ [49] 84 83 83 -
Lyu et al.∗ [100] 85.8 81.2 83.4 4.8
PixelLink [22] 85.5 82.0 83.7 3.0
Baek et al. [6] 89.8 84.3 86.9 8.6
Zhang et al.∗ [175] 87.8 87.6 87.7 -
Wang et al.∗ [155] 89.2 86.0 87.6 -
Tian et al.∗ [143] 85.1 84.5 84.8 -

The Street View Text (SVT) dataset [152], [153] is a collection of street view images, and is now mainly used in evaluating recognition algorithms.

CUTE [120] focuses on curved text. It only contains 80 images and is currently only used in recognition.

4.1.2 Datasets with only detection task

MSRA-TD500 [145] represents long and multi-oriented text with much larger aspect ratios than other datasets. Later, HUST-TR400 [163] was collected in the same way as MSRA-TD500 to serve as additional training data.

ICDAR2017-MLT [107] contains 18K images with scripts of 9 languages, 2K for each. It features the largest number of languages to date. However, researchers have paid little attention to multi-language detection and recognition.

CASIA-10K is a newly published Chinese scene text dataset. As Chinese characters are not segmented by spaces, line-level annotations are provided.

TABLE 4: Detection on Total-Text.

Method | Detection (P / R / F) | Word Spotting (None / Full)
Lyu et al.∗ [100] 69.0 55.0 61.3 52.9 71.8
TextSnake [97] 82.7 74.5 78.4 - -
Baek et al. [6] 87.6 79.9 83.6 - -
Zhang et al. [175] 87.6 79.3 83.3 - -
Wang et al. [155] 80.9 76.2 78.5 - -

TABLE 5: Detection on CTW1500.

Method | Precision | Recall | F-1 | FPS
CTD+TLOC [94] 77.4 69.8 73.4 -
TextSnake [97] 67.9 85.3 75.6 -
Baek et al. [6] 86.0 81.1 83.5 8.6
Zhang et al. [175] 85.7 76.5 80.8 -
Wang et al. [155] 80.1 80.2 80.1 -
Tian et al. [143] 82.7 77.8 80.1 -
TABLE 6: Detection on MSRA-TD500.

Method | Precision | Recall | F-1 | FPS
Kang et al. [66] 71 62 66 -
Zhang et al. [178] 83 67 74 -
Holistic [164] 76.51 75.31 75.91 -
He et al. [50] 77 70 74 -
EAST [180] 87.28 67.43 76.08 13.2
Wu et al. [159] 77 78 77 -
SegLink [131] 86 70 77 8.9
PixelLink [22] 83.0 73.2 77.8 -
TextSnake [97] 83.2 73.9 78.3 1.1
Xue et al. [161] 83.0 77.4 80.1 -
Wang et al. [151] 90.3 72.3 80.3 -
Lyu et al. [101] 87.6 76.2 81.5 5.7
Liu et al. [96] 88 79 83 -
Baek et al. [6] 88.2 78.2 82.9 -
Wang et al. [155] 85.2 82.1 83.6 -
Tian et al. [143] 84.2 81.7 82.9 -

TABLE 7: Characteristics of the three vocabulary lists used in ICDAR 2013/2015. S stands for Strongly Contextualised, W for Weakly Contextualised, and G for Generic.

Vocab List | Description
S | a per-image list of 100 words: all words in the image + selected distractors
W | all words in the entire test set
G | a 90k-word generic vocabulary
SCUT-CTW1500 (CTW1500) is another dataset which features curved text. Annotations in CTW1500 are polygons with 14 evenly placed vertices. Performances on CTW1500 are shown in Tab. 5 for detection.

4.1.3 Datasets with only recognition task

IIIT 5K-Word [106] is the largest recognition dataset, containing both digital and natural scene images. Its variance in font, color, size and other noise makes it the most challenging one to date.

SVT-Perspective (SVTP) is proposed in [116] for evaluating the performance of recognizing perspective text. Images in SVTP are picked from the side-view images in Google Street View. Many of them are heavily distorted by the non-frontal view angle.

The Street View House Numbers (SVHN) dataset [108] contains cropped images of house numbers in natural scenes. The images are collected from Google Street View images. This dataset is usually used in digit recognition.
4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.

As metrics for performance comparison of different algorithms, we usually refer to their precision, recall and F1-score. To compute these performance indicators, the list of predicted text instances should be matched to the ground truth labels in the first place. Precision, denoted as P, is calculated as the proportion of predicted text instances that can be matched to ground truth labels. Recall, denoted as R, is the proportion of ground truth labels that have correspondents in the predicted list. The F1-score is then computed by F1 = 2 · P · R / (P + R), taking both precision and recall into account. Note that the matching between the predicted instances and ground truth ones comes first.
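The following sketch shows this computation under a PASCAL-style IoU criterion (a simplified greedy one-to-one matching of our own, not the official evaluation scripts):

def iou(a, b):
    # a, b: axis-aligned boxes (x1, y1, x2, y2); returns S_I / S_U.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_prf(preds, gts, thresh=0.5):
    matched, tp = set(), 0
    for p in preds:
        # Greedily match each prediction to its best unmatched ground truth.
        candidates = [(iou(p, g), j) for j, g in enumerate(gts) if j not in matched]
        if candidates:
            score, j = max(candidates)
            if score >= thresh:
                matched.add(j)
                tp += 1
    P = tp / len(preds) if preds else 0.0
    R = tp / len(gts) if gts else 0.0
    F1 = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F1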
4.2.1 Text Detection

There are mainly two different protocols for text detection, the IoU-based PASCAL Eval and the overlap-based DetEval. They differ in the criterion for matching predicted text instances with ground truth ones. In the following part, we use these notations: S_GT is the area of the ground truth bounding box, S_P is the area of the predicted bounding box, S_I is the area of the intersection of the predicted and ground truth bounding boxes, and S_U is the area of their union.

• DetEval: DetEval imposes constraints on both precision, i.e. S_I/S_P, and recall, i.e. S_I/S_GT. Only when both are larger than their respective thresholds are the boxes matched together.

• PASCAL [27]: The basic idea is that, if the intersection-over-union value, i.e. S_I/S_U, is larger than a designated threshold, the predicted and ground truth boxes are matched together.

Most datasets follow either of the two evaluation protocols, but with small modifications. We only discuss those that are different from the two protocols mentioned above.

4.2.1.1 ICDAR2003/2005: The match score m is calculated in a way similar to IoU. It is defined as the ratio of the area of intersection over that of the minimum rectangular bounding box containing both.

4.2.1.2 ICDAR2011/2013: One major drawback of the evaluation protocol of ICDAR2003/2005 is that it only considers one-to-one matches. It does not consider one-to-many, many-to-many, and many-to-one matchings, which underestimates the actual performance. Therefore, ICDAR2011/2013 follows the method proposed by Wolf et al. [157], where one-to-one matching is assigned a score of 1 and the other two types are punished with a constant score less than 1, usually set to 0.8.

4.2.1.3 MSRA-TD500: Yao et al. [145] propose a new evaluation protocol for rotated bounding boxes, where both the predicted and ground truth bounding boxes are rotated horizontally around their centers. They are matched only when the standard IoU score is higher than the threshold and the rotation difference of the original bounding boxes is less than a pre-defined value (in practice π/4).

4.2.2 Text Recognition and End-to-End System

Text recognition is a task where a cropped image containing exactly one text instance is given, and we need to extract the text content from the image in a form that a computer program can understand directly, e.g. the string type in C++ or the str type in Python. There is no need for matching in this task. The predicted text string is compared to the ground truth directly. The performance evaluation is either at character-level recognition rate (i.e. how many characters are recognized) or at word level (whether the predicted word is 100% correct). ICDAR also introduces an edit-distance based performance evaluation. Note that in end-to-end evaluation, matching is first performed in a similar way to that of text detection. State-of-the-art recognition performance on the most widely used datasets is summarized in Tab. 8.

The evaluation for end-to-end systems is a combination of both detection and recognition. Given the output to be evaluated, i.e. text locations and recognized content, predicted text instances are first matched with ground truth instances, followed by comparison of the text content.
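For the recognition side, word-level accuracy and an edit-distance based score can be computed as follows (our sketch of the commonly used metrics, not ICDAR's official code):

def edit_distance(a, b):
    # Levenshtein distance with a single rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def recognition_metrics(preds, gts):
    word_acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    # Total normalized edit distance; lower is better.
    ned = sum(edit_distance(p, g) / max(len(g), 1) for p, g in zip(preds, gts))
    return word_acc, ned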
TABLE 8: State-of-the-art recognition performance across a number of datasets. “50”, “1k”, “Full” are lexicons. “0” means no
lexicon. “90k” and “ST” are the Synth90k and the SynthText datasets, respectively. “ST+ ” means including character-level
annotations. “Private” means private training data.
Methods | ConvNet, Data | IIIT5k 50 | IIIT5k 1k | IIIT5k 0 | SVT 50 | SVT 0 | IC03 50 | IC03 Full | IC03 0 | IC13 0 | IC15 0 | SVTP 0 | CUTE 0
Yao et al. [165] - 80.2 69.3 - 75.9 - 88.5 80.3 - - - - -
Jaderberg et al. [62] - - - - 86.1 - 96.2 91.5 - - - - -
Su et al. [140] - - - - 83.0 - 92.0 82.0 - - - - -
Gordo [38] - 93.3 86.6 - 91.8 - - - - - - - -
Jaderberg et al. [60] VGG, 90k 97.1 92.7 - 95.4 80.7 98.7 98.6 93.1 90.8 - - -
Shi et al. [132] VGG, 90k 97.8 95.0 81.2 97.5 82.7 98.7 98.0 91.9 89.6 - - -
Shi et al. [133] VGG, 90k 96.2 93.8 81.9 95.5 81.9 98.3 96.2 90.1 88.6 - 71.8 59.2
Lee et al. [73] VGG, 90k 96.8 94.4 78.4 96.3 80.7 97.9 97.0 88.7 90.0 - - -
Yang et al. [162] VGG, Private 97.8 96.1 - 95.2 - 97.7 - - - - 75.8 69.3
Cheng et al. [15] ResNet, 90k+ST+ 99.3 97.5 87.4 97.1 85.9 99.2 97.3 94.2 93.3 70.6 - -
Shi et al. [134] ResNet, 90k+ST 99.6 98.8 93.4 99.2 93.6 98.8 98.0 94.5 91.8 76.1 78.5 79.5
Liao et al. [82] ResNet, ST+ +Private 99.8 98.8 91.9 98.8 86.4 - - - 91.5 - - 79.9
Li et al. [77] ResNet, 90k+ST+Private - - 91.5 - 84.5 - - - 91.0 69.2 76.4 83.3
The most widely used datasets for end-to-end systems are ICDAR2013 [69] and ICDAR2015 [68]. The evaluation over these two datasets is carried out under two different settings [2], the Word Spotting setting and the End-to-End setting. Under Word Spotting, the performance evaluation only focuses on the text instances from the scene image that appear in a predesignated vocabulary, while other text instances are ignored. On the contrary, all text instances that appear in the scene image are included under End-to-End. Three different vocabulary lists are provided for candidate transcriptions: Strongly Contextualised, Weakly Contextualised, and Generic. The three kinds of lists are summarized in Tab. 7. Note that under End-to-End, these vocabularies can still serve as a reference. State-of-the-art performances are summarized in Tab. 9.
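Under a vocabulary-constrained setting, raw transcriptions are usually snapped to the closest lexicon entry; a minimal sketch (ours), reusing the edit_distance helper defined earlier:

def constrained_transcription(raw_pred, vocabulary):
    # Return the vocabulary word closest to the raw recognition output.
    return min(vocabulary, key=lambda word: edit_distance(raw_pred, word))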
e.g. geo-location, current traffic condition, navigation, and
5 A PPLICATION etc.. There have been several works on text detection and
The detection and recognition of text—the visual and phys- recognition for autonomous vehicle [103], [104]. The largest
ical carrier of human civilization—allow the connection dataset so far, CTW [173], also places extra emphasis on
between vision and the understanding of its content further. traffic signs. Another example is instant translation, where
Apart from the applications we have mentioned at the OCR is combined with a translation model. This is extremely
beginning of this paper, there have been numerous specific helpful and time-saving as people travel or read documents
application scenarios across various industries and in our written in foreign languages. Google’s Translate applica-
daily lives. In this part, we list and analyze the most out- tion21 can perform such instant translation. A similar ap-
standing ones that have, or are to have, significant impact, plication is instant text-to-speech softwares equipped with
improving our productivity and life quality. OCR, which can help those with visual disability and those
Automatic Data Entry Apart from an electronic archive of who are illiterate [3].
existing documents, OCR can also improve our productivity Intelligent Content Analysis OCR also allows the industry
in the form of automatic data entry. Some industries involve to perform more intelligent analysis, mainly for platforms
time-consuming data type-in, e.g. express orders written like video-sharing websites and e-commerce. Text can be
by customers in the delivery industry, and hand-written extracted from images and subtitles as well as real-time
information sheets in the financial and insurance industries. commentary subtitles (a kind of floating comments added
Applying OCR techniques can accelerate the data entry by users, e.g. those in Bilibili22 and Niconico23 ). On the one
process as well as protect customer privacy. Some compa- hand, such extracted text can be used in automatic content
nies have already been using these technologies, e.g. SF- tagging and recommendation system. They can also be used
Express18 . Another potential application is note taking, such to perform user sentiment analysis, e.g. which part of the
as NEBO19 , a note-taking software on tablets like iPad that
performs instant transcription as users write down notes. 20. https://www.faceplusplus.com/face-based-identification/
21. https://translate.google.com/
18. Official website: http://www.sf-express.com/cn/sc/ 22. https://www.bilibili.com
19. Official website: https://www.myscript.com/nebo/ 23. www.nicovideo.jp/
TABLE 9: Performance of End-to-End and Word Spotting on ICDAR2015 and ICDAR2013. ∗ means multi-scale.

Method | Word Spotting (S / W / G) | End-to-End (S / W / G) | FPS
ICDAR2015
Deep2Text-MO [60], [170], [171] 17.58 17.58 17.58 16.77 16.77 16.77 -
TextProposals+DictNet [36], [59] 56.0 52.3 49.7 53.3 49.6 47.2 0.2
HUST MCLAB [131], [132] 70.6 - - 67.9 - - -
Deep Text Spotter [13] 58.0 53.0 51.0 54.0 51.0 47.0 9.0
FOTS∗ [92] 87.01 82.39 67.97 83.55 79.11 65.33 -
He et al. [49] 85 80 65 82 77 63 -
Mask TextSpotter [100] 79.3 74.5 64.2 79.3 73.0 62.4 2.6
ICDAR2013
Textboxes [80] 93.9 92.0 85.9 91.6 89.7 83.9 -
Deep text spotter [13] 92 89 81 89 86 77 9
Li et al. [76] 94.2 92.4 88.2 91.1 89.8 84.6 1.1
FOTS∗ [92] 95.94 93.90 87.76 91.99 90.11 84.77 11.2
He et al. [49] 93 92 87 91 89 86 -
Mask TextSpotter [100] 92.5 92.0 88.2 92.2 91.1 86.5 4.8
Intelligent Content Analysis OCR also allows the industry to perform more intelligent analysis, mainly for platforms like video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as from real-time commentary subtitles (a kind of floating comment added by users, e.g. those on Bilibili22 and Niconico23). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems. It can also be used to perform user sentiment analysis, e.g. to find which part of the video attracts the users most. On the other hand, website administrators can impose supervision and filtering of inappropriate and illegal content, such as terrorist advocacy.

6 CONCLUSION AND DISCUSSION

6.1 Status Quo

Algorithms: The past several years have witnessed the significant development of algorithms for text detection and recognition, mainly due to the deep learning boom. Deep learning models have replaced the manual search for and design of patterns and features. With the improved capability of models, research attention has been drawn to challenges such as oriented and curved text detection, and considerable progress has been achieved.

Applications: Apart from efforts towards a general solution to all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcards, ID cards, and driver's licenses. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc. and MEGVII Inc.. Recent development of fast and efficient methods [119], [180] has also allowed the deployment of large-scale systems [11]. Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.

6.2 Challenges and Future Trends

We look at the present through a rear-view mirror. We march backwards into the future [89]. We list and discuss the remaining challenges, and analyze what would be the next valuable research directions in the field of scene text detection and recognition.

Languages: There are more than 1000 languages in the world [1]. However, most current algorithms and datasets have primarily focused on English text. While English has a rather small alphabet, other languages such as Chinese and Japanese have a much larger one, with tens of thousands of symbols. RNN-based recognizers may suffer from such enlarged symbol sets. Moreover, some languages have much more complex appearances, and they are therefore more sensitive to conditions such as image quality. Researchers should first verify how well current algorithms can generalize to text of other languages, and further to mixed text. Unified detection and recognition systems for multiple types of languages are of important academic value and application prospect. A feasible solution might be to explore compositional representations that can capture the common patterns of text instances across different languages, and to train the detection and recognition models with text examples of different languages generated by text synthesizing engines.

Robustness of Models: Although current text recognizers have proven able to generalize well to different scene text datasets even when trained only on synthetic data, recent work [82] shows that robustness against flawed detection is not a negligible problem. Actually, such instability in prediction has also been observed for text detection models. The reason behind this kind of phenomenon is still unclear. One conjecture is that the robustness of models is related to the internal operating mechanism of deep neural networks.

Generalization: Few detection algorithms except for TextSnake [97] have considered the problem of generalization ability across datasets, i.e. training on one dataset and testing on another. Generalization ability is important as some application scenarios require adaptability to varying environments. For example, instant translation and OCR in autonomous vehicles should perform stably under different situations: zoomed-in images with large text instances, far and small words, blurred words, different languages and shapes. It remains unverified whether simply pooling all existing datasets together is enough, especially when the target domain is totally unknown.

Synthetic Data: While training recognizers on synthetic datasets has become routine and the results are excellent, detectors still rely heavily on real datasets. It remains a challenge to synthesize diverse and realistic images to train detectors. The potential benefits of synthetic data, such as generalization ability, are not yet fully explored. Synthesis using 3D engines and models can simulate different conditions such as lighting and occlusion, and is thus a very promising approach.
Evaluation: Existing evaluation metrics for detection stem from those for general object detection. Matching based on IoU score or pixel-level precision and recall ignores the fact that missing parts and superfluous backgrounds may hurt the performance of the subsequent recognition procedure. For each text instance, pixel-level precision and recall are good metrics. However, their scores are assigned to 1.0 once the instances are matched to ground truth, and thus are not reflected in the final dataset-level score. An off-the-shelf alternative is to simply sum up the instance-level scores under DetEval instead of first assigning them to 1.0.

Efficiency: Another shortcoming of deep learning based methods lies in their efficiency. Most of the current systems cannot run in real time when deployed on computers without GPUs or on mobile devices. Apart from model compression and lightweight models that have proven effective in other tasks, it is also valuable to study how to build custom speedup mechanisms for text-related tasks.

Bigger and Better Data: The sizes of most existing datasets are much smaller than those of datasets for other tasks (∼1K vs. >>10K). It would be worthwhile to study whether the improvements gained by current algorithms can scale up, or whether they are just accidental results of better regularization. Besides, most datasets are only labelled with bounding boxes and text. Detailed annotation of different attributes [173], such as wordart and occlusion, may guide researchers with pertinence. Finally, datasets characterized by real-world challenges, such as densely located text on products, are also important in advancing research progress.
REFERENCES

[1] How many languages are there in the world? https://www.ethnologue.com/guides/how-many-languages. Accessed: 2019-09-02.
[2] Icdar 2015 robust reading competition. http://rrc.cvc.uab.es/files/Robust Reading 2015 v02.pdf. Accessed: 2018-07-30.
[3] Screen reader. https://en.wikipedia.org/wiki/Screen reader#cite note-Braille display-2. Accessed: 2018-08-09.
[4] Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Valveny. Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence, 36(12):2552–2566, 2014.
[5] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2011.
[6] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9365–9374, 2019.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR 2015, 2014.
[8] Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Edit probability for scene text recognition. In CVPR 2018, 2018.
[9] Christian Bartz, Haojin Yang, and Christoph Meinel. See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404, 2017.
[10] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
[11] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.
[12] Michal Busta, Lukas Neumann, and Jiri Matas. Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1206–1214, 2015.
[13] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proc. ICCV, 2017.
[14] Xilin Chen, Jie Yang, Jing Zhang, and Alex Waibel. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on image processing, 13(1):87–99, 2004.
[15] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5086–5094. IEEE, 2017.
[16] Zhanzhan Cheng, Xuyang Liu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Arbitrarily-oriented text recognition. CVPR2018, 2017.
[17] Chee Kheng Ch'ng and Chee Seng Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
[18] MM Aftab Chowdhury and Kaushik Deb. Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 2013.
[19] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J Wu, and Andrew Y Ng. Text detection and character recognition in scene images with unsupervised feature learning. In 2011 International Conference on Document Analysis and Recognition (ICDAR), pages 440–445. IEEE, 2011.
[20] Yuchen Dai, Zheng Huang, Yuting Gao, and Kai Chen. Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272, 2017.
[21] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
[22] Deng Dan, Liu Haifeng, Li Xuelong, and Cai Deng. Pixellink: Detecting scene text via instance segmentation. In Proceedings of AAAI, 2018, 2018.
[23] Guilherme N DeSouza and Avinash C Kak. Vision for mobile robot navigation: A survey. IEEE transactions on pattern analysis and machine intelligence, 24(2):237–267, 2002.
[24] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
[25] Yuval Dvorin and Uzi Ezra Havosha. Method and device for instant translation, June 4 2009. US Patent App. 11/998,931.
[26] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. Detecting text in natural scenes with stroke width transform. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2963–2970. IEEE, 2010.
[27] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
[28] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International journal of computer vision, 61(1):55–79, 2005.
[29] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[30] Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
[31] Yunze Gao, Yingying Chen, Jinqiao Wang, and Hanqing Lu. Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303, 2017.
[32] Suman K Ghosh, Ernest Valveny, and Andrew D Bagdanov. Visual attention models for scene text recognition. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 943–948. IEEE, 2017.
[33] Ross Girshick. Fast r-cnn. In The IEEE International Conference on Computer Vision (ICCV), 2015.
[34] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
[35] Lluis Gomez and Dimosthenis Karatzas. Object proposals for text extraction in the wild. In 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015.
[36] Lluís Gómez and Dimosthenis Karatzas. Textproposals: a text-specific selective search algorithm for word spotting in the wild. Pattern Recognition, 70:60–74, 2017.
[37] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[38] Albert Gordo. Supervised mid-level features for word image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 2956–2964, 2015.
[39] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
[40] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in neural information processing systems, pages 577–584, 2008.
[41] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2315–2324, 2016.
[42] Young Kug Ham, Min Seok Kang, Hong Kyu Chung, Rae-Hong Park, and Gwi Tae Park. Recognition of raised characters for automatic classification of rubber tires. Optical Engineering, 34(1):102–110, 1995.
[43] Junwei Han, Dingwen Zhang, Gong Cheng, Nian Liu, and Dong Xu. Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine, 35(1):84–100, 2018.
[44] Dafang He, Xiao Yang, Wenyi Huang, Zihan Zhou, Daniel Kifer, and C Lee Giles. Aggregating local context for accurate scene text detection. In Asian Conference on Computer Vision, pages 280–296. Springer, 2016.
[45] Dafang He, Xiao Yang, Chen Liang, Zihan Zhou, Alexander G Ororbia, Daniel Kifer, and C Lee Giles. Multi-scale fcn with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 474–483. IEEE, 2017.
[46] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
[47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016.
[48] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. Single shot text detector with regional attention. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[49] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5020–5029, 2018.
[50] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Deep direct regression for multi-oriented scene text detection. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[51] Zhiwei He, Jilin Liu, Hongqing Ma, and Peihong Li. A new automatic extraction method of container identity codes. IEEE Transactions on intelligent transportation systems, 6(1):72–78, 2005.
[52] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[53] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[54] Michal Hradiš, Jan Kotera, Pavel Zemcík, and Filip Šroubek. Convolutional neural networks for direct text deblurring. In Proceedings of BMVC, volume 10, 2015.
[55] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. Wordsup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[56] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[57] Weilin Huang, Zhe Lin, Jianchao Yang, and Jue Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1241–1248, 2013.
[58] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep structured output learning for unconstrained text recognition. ICLR2015, 2014.
[59] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
[60] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016.
[61] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
[62] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528. Springer, 2014.
[63] Anil K Jain and Bin Yu. Automatic text location in images and video frames. Pattern recognition, 31(12):2055–2076, 1998.
[64] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2cnn: rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
[65] Keechul Jung, Kwang In Kim, and Anil K Jain. Text information extraction in images and video: a survey. Pattern recognition, 37(5):977–997, 2004.
[66] Le Kang, Yi Li, and David Doermann. Orientation robust text line detection in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4034–4041, 2014.
[67] Dimosthenis Karatzas and Apostolos Antonacopoulos. Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 2, pages 634–637. IEEE, 2004.
[68] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160. IEEE, 2015.
[69] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere de las Heras. Icdar 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1484–1493. IEEE, 2013.
[70] Vijeta Khare, Palaiahnakote Shivakumara, Paramesran Raveendran, and Michael Blumenstein. A blind deconvolution model for scene text detection and recognition in video. Pattern Recognition, 54:128–148, 2016.
[71] Kye-Hyeon Kim, Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, and Minje Park. PVANET: deep but lightweight neural networks for real-time object detection. arXiv:1608.08021, 2016.
[72] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[73] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016.
[74] Jung-Jin Lee, Pyoung-Hean Lee, Seong-Whan Lee, Alan Yuille, and Christof Koch. Adaboost for text detection in natural scene. In 2011 International Conference on Document Analysis and Recognition (ICDAR), pages 429–434. IEEE, 2011.
[75] Seonghun Lee and Jin Hyung Kim. Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11):823–840, 2013.
[76] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), 2017.
[77] Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. Show, attend and read: A simple and strong baseline for irregular text recognition. AAAI, 2019.
[78] Rong Li, MengYi En, JianQiang Li, and HaiBin Zhang. Weakly supervised text attention network for generating text proposals in scene images. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 324–330. IEEE, 2017.
[79] Minghui Liao, Baoguang Shi, and Xiang Bai. Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676–3690, 2018.
[80] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In AAAI, pages 4161–4167, 2017.
[81] Minghui Liao, Boyu Song, Minghang He, Shangbang Long, Cong Yao, and Xiang Bai. Synthtext3d: Synthesizing scene text images from 3d virtual worlds. arXiv preprint arXiv:1907.06007, 2019.
[82] Minghui Liao, Jian Zhang, Zhaoyi Wan, Fengming Xie, Jiajun Liang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Scene text recognition from two-dimensional perspective. AAAI, 2019.
[83] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5909–5918, 2018.
[84] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5162–5170, 2015.
[85] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165, 2018.
[86] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
[87] Wei Liu, Chaofeng Chen, and KKY Wong. Char-net: A character-aware neural network for distorted scene text recognition. In AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA, 2018.
[88] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. Star-net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016.
[89] Xiang Liu. Old book of tang. Beijing: Zhonghua Book Company, 1975.
[90] Xiaoqing Liu and Jagath Samarabandu. An edge-based text region extraction algorithm for indoor mobile robot navigation. In Mechatronics and Automation, 2005 IEEE International Conference, volume 2, pages 701–706. IEEE, 2005.
[91] Xiaoqing Liu and Jagath K Samarabandu. A simple and fast text localization algorithm for indoor mobile robot navigation. In Image Processing: Algorithms and Systems IV, volume 5672, pages 139–151. International Society for Optics and Photonics, 2005.
[92] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. Fots: Fast oriented text spotting with a unified network. CVPR2018, 2018.
[93] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. 2017.
[94] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, and Sheng Zhang. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.
[95] Zichuan Liu, Yixing Li, Fengbo Ren, Hao Yu, and Wangling Goh. Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network. AAAI, 2018.
[96] Zichuan Liu, Guosheng Lin, Sheng Yang, Jiashi Feng, Weisi Lin, and Wang Ling Goh. Learning markov clustering networks for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6936–6944, 2018.
[97] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[98] Simon M Lucas. Icdar 2005 text locating competition results. In International Conference on Document Analysis and Recognition, 2005., pages 80–84. IEEE, 2005.
[99] Simon M Lucas, Alex Panaretos, Luis Sosa, Anthony Tang, Shirley Wong, and Robert Young. Icdar 2003 robust reading competitions. In null, page 682. IEEE, 2003.
[100] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[101] Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. Multi-oriented scene text detection via corner localization and region segmentation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[102] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. In IEEE Transactions on Multimedia, 2018, 2017.
[103] Abdelhamid Mammeri, Azzedine Boukerche, et al. Mser-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE Symposium on Computers and Communication (ISCC), pages 1218–1223. IEEE, 2016.
[104] Abdelhamid Mammeri, El-Hebri Khiari, and Azzedine Boukerche. Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th Vehicular Technology Conference (VTC Fall), pages 1–5. IEEE, 2014.
[105] Anand Mishra, Karteek Alahari, and CV Jawahar. An mrf model for binarization of natural scene text. In ICDAR-International Conference on Document Analysis and Recognition. IEEE, 2011.
[106] Anand Mishra, Karteek Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In BMVC-British Machine Vision Conference. BMVA, 2012.
[107] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.
[108] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
[109] Luka Neumann and Jiri Matas. On combining multiple segmentations in scene text recognition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pages 523–527. IEEE, 2013.
[110] Lukas Neumann and Jiri Matas. A method for text localization and recognition in real-world images. In Asian Conference on Computer Vision, pages 770–783. Springer, 2010.
[111] Lukáš Neumann and Jiří Matas. Real-time scene text localization and recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3538–3545. IEEE, 2012.
[112] Shigueo Nomura, Keiji Yamanaka, Osamu Katai, Hiroshi Kawakami, and Takayuki Shiose. A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11):1961–1975, 2005.
[113] Christopher Parkinson, Jeffrey J Jacobsen, David Bruce Ferguson, and Stephen A Pombo. Instant translation system, November 29 2016. US Patent 9,507,772.
[114] Andrei Polzounov, Artsiom Ablavatski, Sergio Escalera, Shijian Lu, and Jianfei Cai. Wordfence: Text detection in natural images with border awareness. ICIP/ICPR, 2017.
[115] Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on Multimedia, pages 1221–1224. ACM, 2017.
[116] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 569–576, 2013.
[117] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[118] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
[119] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[120] Anhar Risnumawan, Palaiahankote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18):8027–8048, 2014.
[121] Jose A Rodriguez-Serrano, Albert Gordo, and Florent Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3):193–207, 2015.
[121] Jose A Rodriguez-Serrano, Albert Gordo, and Florent Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3):193–207, 2015.
[122] Jose A Rodriguez-Serrano, Florent Perronnin, and France Meylan. Label embedding for text recognition. In Proceedings of the British Machine Vision Conference (BMVC). Citeseer, 2013.
[123] Xuejian Rong, Chucai Yi, and Yingli Tian. Unambiguous text localization and retrieval for cluttered scenes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3287. IEEE, 2017.
[124] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[125] Partha Pratim Roy, Umapada Pal, Josep Llados, and Mathieu Delalandre. Multi-oriented and multi-sized touching character segmentation using dynamic programming. In 10th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2009.
[126] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[127] Georg Schroth, Sebastian Hilsenbeck, Robert Huitl, Florian Schweiger, and Eckehard Steinbach. Exploiting text-related features for content-based image retrieval. In 2011 IEEE International Symposium on Multimedia, pages 77–84. IEEE, 2011.
[128] Ruth Schulz, Ben Talbot, Obadiah Lam, Feras Dayoub, Peter Corke, Ben Upcroft, and Gordon Wyeth. Robot navigation using human cues: A robot navigation system for symbolic goal-directed exploration. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1100–1105. IEEE, 2015.
[129] Asif Shahab, Faisal Shafait, and Andreas Dengel. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In 2011 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2011.
[130] Karthik Sheshadri and Santosh Kumar Divvala. Exemplar driven character recognition in the wild. In Proceedings of the British Machine Vision Conference (BMVC), pages 1–10, 2012.
[131] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[132] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
[133] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016.
[134] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2035–2048, 2019.
[135] Baoguang Shi, Cong Yao, Minghui Liao, Mingkun Yang, Pei Xu, Linyan Cui, Serge Belongie, Shijian Lu, and Xiang Bai. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1429–1434. IEEE, 2017.
[136] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, and Zhong Zhang. Scene text recognition using part-based tree-structured character detection. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2961–2968. IEEE, 2013.
[137] Palaiahnakote Shivakumara, Souvik Bhowmick, Bolan Su, Chew Lim Tan, and Umapada Pal. A new gradient based character segmentation method for video text recognition. In 2011 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2011.
[138] Robin Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.
[139] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[140] Bolan Su and Shijian Lu. Accurate scene text recognition based on recurrent neural network. In Asian Conference on Computer Vision (ACCV), pages 35–48. Springer, 2014.
[141] Shangxuan Tian, Shijian Lu, and Chongshou Li. WeText: Scene text detection under weak supervision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[142] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In Proceedings of European Conference on Computer Vision (ECCV), pages 56–72. Springer, 2016.
[143] Zhuotao Tian, Michelle Shu, Pengyuan Lyu, Ruiyu Li, Chao Zhou, Xiaoyong Shen, and Jiaya Jia. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4234–4243, 2019.
[144] Sam S Tsai, Huizhong Chen, David Chen, Georg Schroth, Radek Grzeszczuk, and Bernd Girod. Mobile visual search on printed documents using text and low bit-rate features. In 18th IEEE International Conference on Image Processing (ICIP), pages 2601–2604. IEEE, 2011.
[145] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai, and Cong Yao. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1083–1090. IEEE, 2012.
[146] Seiichi Uchida. Text localization and recognition in images and video. In Handbook of Document Image Processing and Recognition, pages 843–883. Springer, 2014.
[147] Stijn Marinus van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, 2000.
[148] Steffen Wachenfeld, H-U Klein, and Xiaoyi Jiang. Recognition of screen-rendered text. In 18th International Conference on Pattern Recognition (ICPR), volume 2, pages 1086–1089. IEEE, 2006.
[149] Toru Wakahara and Kohei Kita. Binarization of color character strings in scene images using k-means clustering and support vector machines. In 2011 International Conference on Document Analysis and Recognition (ICDAR), pages 274–278. IEEE, 2011.
[150] Cong Wang, Fei Yin, and Cheng-Lin Liu. Scene text detection with novel superpixel based character candidate extraction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 929–934. IEEE, 2017.
[151] Fangfang Wang, Liming Zhao, Xi Li, Xinchao Wang, and Dacheng Tao. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1381–1389, 2018.
[152] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In 2011 IEEE International Conference on Computer Vision (ICCV), pages 1457–1464. IEEE, 2011.
[153] Kai Wang and Serge Belongie. Word spotting in the wild. In Proceedings of European Conference on Computer Vision (ECCV), pages 591–604. Springer, 2010.
[154] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In 2012 21st International Conference on Pattern Recognition (ICPR), pages 3304–3308. IEEE, 2012.
[155] Xiaobing Wang, Yingying Jiang, Zhenbo Luo, Cheng-Lin Liu, Hyunsoo Choi, and Sungjin Kim. Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6449–6458, 2019.
[156] Jerod Weinman, Erik Learned-Miller, and Allen Hanson. Fast lexicon-based scene text recognition with sparse belief propagation. In Ninth International Conference on Document Analysis and Recognition (ICDAR), pages 979–983. IEEE, 2007.
[157] Christian Wolf and Jean-Michel Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280–296, 2006.
[158] Dao Wu, Rui Wang, Pengwen Dai, Yueying Zhang, and Xiaochun Cao. Deep strip-based network with cascade learning for scene text localization. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1. IEEE, 2017.
[159] Yue Wu and Prem Natarajan. Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5000–5009, 2017.
[160] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), pages 2048–2057, 2015.
[161] Chuhui Xue, Shijian Lu, and Fangneng Zhan. Accurate scene text detection through border semantics awareness and bootstrapping. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[162] Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C Lee Giles. Learning to read irregular text with attention mechanisms. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 3280–3286, 2017.
[163] Cong Yao, Xiang Bai, and Wenyu Liu. A unified framework for multi-oriented text detection and recognition. IEEE Transactions on Image Processing, 23(11):4737–4749, 2014.
[164] Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
[165] Cong Yao, Xiang Bai, Baoguang Shi, and Wenyu Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014.
[166] Qixiang Ye and David Doermann. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1480–1500, 2015.
[167] Qixiang Ye, Wen Gao, Weiqiang Wang, and Wei Zeng. A robust text detection algorithm in images and video frames. IEEE ICICS-PCM, pages 802–806, 2003.
[168] Chucai Yi and YingLi Tian. Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9):2594–2605, 2011.
[169] Fei Yin, Yi-Chao Wu, Xu-Yao Zhang, and Cheng-Lin Liu. Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727, 2017.
[170] Xu-Cheng Yin, Wei-Yi Pei, Jun Zhang, and Hong-Wei Hao. Multi-orientation scene text detection with adaptive clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 2015.
[171] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.
[172] Xu-Cheng Yin, Ze-Yu Zuo, Shu Tian, and Cheng-Lin Liu. Text detection, tracking and recognition in video: A comprehensive survey. IEEE Transactions on Image Processing, 25(6), 2016.
[173] Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li, and Shi-Min Hu. Chinese text in the wild. arXiv preprint arXiv:1803.00085, 2018.
[174] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[175] Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, and Xinghao Ding. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[176] Dong-Qing Zhang and Shih-Fu Chang. A Bayesian framework for fusing multiple word knowledge models in videotext recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2003.
[177] Sheng Zhang, Yuliang Liu, Lianwen Jin, and Canjie Luo. Feature enhancement network: A refined scene text detector. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[178] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[179] Zhiwei Zhou, Linlin Li, and Chew Lim Tan. Edge based binarization for video text images. In 2010 20th International Conference on Pattern Recognition (ICPR), pages 133–136. IEEE, 2010.
[180] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[181] Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[182] Anna Zhu, Renwu Gao, and Seiichi Uchida. Could scene context be beneficial for scene text detection? Pattern Recognition, 58, 2016.
[183] Xiangyu Zhu, Yingying Jiang, Shuli Yang, Xiaobing Wang, Wei Li, Pei Fu, Hua Wang, and Zhenbo Luo. Deep residual text detection network for scene text. In 14th International Conference on Document Analysis and Recognition (ICDAR), 2017.
[184] Yingying Zhu, Cong Yao, and Xiang Bai. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1):19–36, 2016.
[185] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of European Conference on Computer Vision (ECCV), pages 391–405. Springer, 2014.

Shangbang Long is a master student at Carnegie Mellon University, majoring in machine learning. He was also an intern at MEGVII (Face++) Inc., Beijing, China. His research mainly focuses on computer vision and natural language processing.

Xin He received the BS degree in automation from Wuhan University (WHU), China, in 2013 and the MS degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, China, in 2016. His research interests include image understanding and sequence modeling, especially the area of text detection and recognition.

Cong Yao is currently with MEGVII Inc., Beijing, China. He received the B.S. and Ph.D. degrees in electronics and information engineering from the Huazhong University of Science and Technology, China, in 2008 and 2014, respectively. He was a research intern at Microsoft Research Asia, Beijing, China, from 2011 to 2012. He was a Visiting Research Scholar with Temple University, Philadelphia, PA, USA, in 2013. His research focuses on computer vision, in particular, text detection and recognition in natural images.