
OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold

Mohamed Yousef (Intuition Machines, Inc., [email protected])
Tom E. Bishop (Intuition Machines, Inc., [email protected])

arXiv:2006.07491v1 [cs.CV] 12 Jun 2020

Abstract

Text recognition is a major computer vision task with a big set of associated challenges. One of those traditional challenges is the coupled nature of text recognition and segmentation. This problem has been progressively solved over the past decades, going from segmentation-based recognition to segmentation-free approaches, which proved more accurate and much cheaper to annotate data for. We take a step from segmentation-free single line recognition towards segmentation-free multi-line / full page recognition. We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single line text recognizer, to convert it into a multi-line version by providing the model with enough spatial capacity to be able to properly collapse a 2D input signal into 1D without losing information. Such modified networks can be trained using exactly their same simple original procedure, and using only unsegmented image and text pairs. We carry out a set of interpretability experiments that show that our trained models learn an accurate implicit line segmentation. We achieve state-of-the-art character error rate on both the IAM & ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature. On IAM we even surpass single line methods that use accurate localization information during training. Our code is available online at https://github.com/IntuitionMachines/OrigamiNet.

Requirement            | [4] | [3] | [30] | [7, 33, 19] | Ours
Full-page image        |  ✓  |  ✓  |  ✓   |      ✓      |  ✓
Full-page text GT      |  ✓  |  ✓  |  ✓   |      ✓      |  ✓
Seg. line images       |  ✗  |  ✗  |  ✗   |      ✓      |  ✗
Seg. transcription     |  ✗  |  ✗  |  ✗   |      ✓      |  ✗
Pre-train on seg. data |  ✓  |  ✓  |  ✓   |      ✗      |  ✗
Special curriculum     |  ✓  |  ✓  |  ✗   |      ✗      |  ✗
# Iterations / image   | 500 | 10  |  10  |     10      |  1

Table 1: Comparison of the data required to train a full page recognizer between various prior works and our proposed method. We can see that our method is the only one that truly works at page level without requiring any segmented data at any stage. # Iterations / image is the average number of iterations required to transcribe a full paragraph image from the IAM dataset; we can note that while all other methods require multiple iterations per image (to recognize each segmented character or line), our method performs only one pass over the input full paragraph image.

1. Introduction

The ubiquity of text has made the automation of the processing of its various visual forms an ever-increasing necessity. Over the years, one of the main driving themes for error rate reduction in text recognition systems has been reducing explicit segmentation proposals in favor of increasing full sequence recognition. In full sequence models, the recognition system learns to simultaneously segment / align and recognize / classify an image representing a sequence of observations (i.e. characters). This trend progressed from the first systems that tried to segment each character alone and then classify the character's image [6], to segmentation-free approaches that tried to recognize all the characters in a word without requiring / performing any explicit segmentation [21]. Today, state-of-the-art text recognition systems work on a whole input line image without requiring any prior explicit character / word segmentation [35, 18]. This removes the requirement of providing character localization annotations as part of the ground-truth transcription. Also, the recognition accuracy then relies only on automatic line segmentation, a much easier process than automatic character segmentation.

However, line segmentation is still an error-prone process and can cause great deterioration in the performance of today's text recognition systems. This is especially true for documents with hard-to-segment text-lines, such as handwritten documents [10, 24], with warped lines, uneven interline spacing, touching lines, and torn pages.
The main previous works that tried to address the problem of weakly supervised multi-line recognition were [3, 4, 30]. Besides these, other methods that work on full page recognition require the localization ground-truth of text lines during training. A detailed comparison between the training data required by our proposed method and other methods in the literature is presented in Table 1.

In this work, we present a simple and novel neural network sub-module, termed OrigamiNet, that can be added to any existing convolutional neural network (CNN) text-line recognizer to convert it into a full page recognizer. It can transcribe full text pages in a weakly supervised manner without being given any localization ground-truth (either visual in the images or textual in the transcriptions) during training, and without performing any explicit segmentation. In contrast to previous work, this is done very efficiently using feed-forward connections only (no recurrent connections), essentially in a single network forward pass.

Our main intuition in this work is, instead of the traditional two-step framework that first segments and then recognizes the extracted segments, to propose a novel integrated approach for learning to simultaneously and implicitly segment and recognize. This works by learning a representation transformation that transforms the input into a representation where both segmentation and recognition are trivial.

We implicitly unfold an input multi-line image into a single line image (i.e. from a 2D arrangement of characters to 1D), where all lines in the original image are stitched together into one long line, so no text-line segmentation is actually needed. Both segmentation and recognition are done in the same single step (a single network forward pass) instead of being carried out iteratively (on each line); thus all computations are shared between recognition and implicit segmentation, and the whole process is much faster.

The main ingredients for achieving this are: using the idea of a spatial bottleneck followed by up-sampling, used widely in pixel-wise prediction tasks (e.g. [16, 23]); and using the CTC loss function [11], which strongly induces / encourages a linear 1D target. We construct a simple neural network sub-module that applies these ideas, and demonstrate both its effectiveness and generality by attaching it to a number of state-of-the-art text recognition neural network architectures. We show that it can successfully convert them from single-line into multi-line text recognizers with exactly the same training procedure (i.e. without resorting to complex and fragile training recipes, like a special training curriculum or special pre-training strategies).

On the challenging ICDAR 2017 HTR [24] full page benchmark we achieve state-of-the-art Character Error Rate (CER) without any localization data. On full paragraphs of the IAM [17] dataset, we achieve state-of-the-art CER, surpassing models that work on carefully pre-segmented text-lines, without using any localization information during training or testing.

To summarize, we address the problem of weakly supervised full-page text recognition. In particular, we make the following contributions:
• We conceptually propose a new approach for weakly-supervised simultaneous object segmentation and recognition, and apply it to text.
• We propose a simple and generic neural network sub-module that can be added to any CNN-based text line recognizer to convert it into a multi-line recognizer that uses the same simple training procedure.
• We carry out an extensive set of experiments on a number of state-of-the-art text recognizers that demonstrate our claims. The resultant architectures achieve state-of-the-art performance on the ICDAR2017 HTR and full-paragraph IAM datasets.

2. Related Work

There is not much prior work in the literature on full page recognition. Segmentation-free multi-line recognition has mainly been considered in [3, 4]. The idea of both is to use selective attention to focus on a specific part of the input image, either characters in [4] or lines in [3]. These works have two major drawbacks. First, both are difficult to train, and need to pre-train their encoder sub-network on single-line images before training on multi-line versions, which defeats the objective of the task. Second, though [3] is much faster than [4], both are very slow compared to current methods that work on segmented text lines.

Besides these two segmentation-free methods, other methods that work on full page recognition require the localization ground-truth of text lines for either all [5, 7, 19] or part [33] of the training data, to train either a separate network or a sub-module (of a large, multi-task network) for text-line localization. All these methods also require line breaks to be annotated on the provided textual ground-truth transcriptions (i.e. text lines must be segmented both visually in the image and textually in the transcription). [30] presented the idea of adapting [33] in a weakly supervised manner, without requiring line breaks in the transcription, by casting the alignment between the predicted line transcriptions and the ground truth as a combinatorial optimization problem and solving it greedily. However, [30] still requires the same pre-training as [33] and performs worse.

3. Methodology

Figure 1 presents the core idea of our proposed OrigamiNet module and how it can be attached to any fully convolutional text recognizer. Both the before and after versions are shown for easy comparison.
The Connectionist Temporal Classification (CTC) loss function allows the training of neural text recognizers on unsegmented inputs by considering all possible alignments between two 1D sequences. The sequence of predictions produced by the network is denoted P, and the sequence of labels associated with the input image is denoted L, where |L| < |P|. The strict requirement of having P as a 1D sequence introduces a problem, given that the original input signal (the image I) is a 2D signal. This problem has typically been dealt with by unfolding the 2D signal into 1D, using a simple reduction operation (e.g. summation) along one of the dimensions (usually the vertical one), giving:

P_i = \sum_{j=1}^{H} F(I)_{i,j}    (1)

where F is a learned 2D representation transformation. This is the paradigm shown in Fig. 1a. As noted in [3, 4], this simple, blind collapse from 2D to 1D gives equal importance / contribution (and therefore gradients) to all the rows of the 2D input feature-map F(I), and thus prevents the recognition of any 2D arrangement of characters in the input image. If two characters cover the same columns, only one can possibly be recognized after the collapse operation.
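To make the paradigm concrete, the following is a minimal PyTorch sketch of this traditional collapse (Eq. 1 / Fig. 1a). The generic `backbone`, `feat_channels`, and the loss wiring are illustrative assumptions, not the exact layers of any specific recognizer:

```python
import torch
import torch.nn as nn

class SingleLineRecognizer(nn.Module):
    """Eq. 1 paradigm (sketch): 2D features are blindly collapsed to 1D along
    the vertical axis, then the collapsed sequence is trained with CTC."""
    def __init__(self, backbone: nn.Module, feat_channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                         # any fully convolutional encoder
        self.classifier = nn.Conv2d(feat_channels, num_classes, kernel_size=1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, images, targets, target_lengths):
        feats = self.backbone(images)                    # (B, C, H', W')
        logits = self.classifier(feats)                  # (B, num_classes, H', W')
        logits = logits.sum(dim=2)                       # P_i = sum_j F(I)_{i,j} -> (B, num_classes, W')
        log_probs = logits.permute(2, 0, 1).log_softmax(-1)   # (T=W', B, num_classes), as CTCLoss expects
        input_lengths = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)
        return self.ctc(log_probs, targets, input_lengths, target_lengths)
```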
To tackle this problem, i.e. satisfy the 1D input requirement of CTC without sacrificing the ability to recognize 2D arrangements of characters, we propose the idea of learning the proper 2D→1D unfolding through a CNN, motivated by the success of CNNs in pixel-wise prediction and image-to-image translation tasks.

The main idea of our work (presented in Fig. 1b) is augmenting the traditional paradigm with a series of up-scaling operations that transform the input feature-map into the shape of a single line that is long enough to hold all the lines (2D character arrangements) from the input image. Up-scaling operations are followed by convolutional computational blocks as our learned resize operations (as done by many researchers, e.g. [8]). The changed direction of up-scaling encourages each line of the input image to be mapped into a distinct part of the output vertical dimension.

After such changes, we proceed with the traditional paradigm as-is: we perform the simple sum reduction (Eq. 1) along the vertical dimension w of the resulting line (which is perpendicular to the original input multi-line image's vertical dimension). The model is trained with CTC.
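The sketch below illustrates this learned unfolding in PyTorch, loosely patterned on the decoder rows of Table 2 (interpolate to a height of L1 while shrinking the width, a convolutional block, interpolate to L2, a 1×1 classifier, and pooling over the short side w). The block contents, default sizes, and interface are illustrative assumptions; Table 2 and the released code define the exact layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrigamiDecoder(nn.Module):
    """Learned 2D->1D unfolding (Fig. 1b, sketch): up-scale vertically in two
    stages while shrinking horizontally, then pool away the short side w."""
    def __init__(self, channels: int, num_classes: int,
                 L1: int = 450, L2: int = 1100, w: int = 8):
        super().__init__()
        self.L1, self.L2, self.w = L1, L2, w
        self.block1 = nn.Sequential(                     # plays the role of conv7_x in Table 2
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.block2 = nn.Conv2d(channels, num_classes, 1)  # plays the role of conv8 (1x1, C)

    def forward(self, feats):                            # feats: (B, C, h, w') from the encoder
        x = F.interpolate(feats, size=(self.L1, feats.size(3) // 2),
                          mode='bilinear', align_corners=False)   # first unfolding stage
        x = self.block1(x)
        x = F.interpolate(x, size=(self.L2, self.w),
                          mode='bilinear', align_corners=False)   # second stage: one tall, thin "line"
        x = self.block2(x)                               # (B, num_classes, L2, w)
        x = x.mean(dim=3)                                # average-pool the short side w (cf. Eq. 1)
        return x.permute(2, 0, 1).log_softmax(-1)        # (T=L2, B, num_classes), ready for CTC
```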
Moreover, we argue that the main bottleneck preventing all previous works from learning proper 2D→1D mappings directly, as we do, is spatial constraints (i.e. not overall capacity or architectural constraints). Providing enough spatial capacity to the model allows it to easily learn such transformations (even for simple, limited-capacity models, as we will show in the experiments section). Given the spatial capacity and the strong linear prior induced by CTC, the model is able to learn a strong 2D→1D unfolding function with the same simple training procedure used for training single line recognizers, and without any special pre-training or curriculum applied to any sub-module of the network (both of which are used exclusively in the literature).

One natural question here is how to choose the final line length L2 (see the definition in Fig. 1b). To gather space for the whole paragraph / page, L2 must be at least as long as the largest number of characters in any transcription in the training set. Longer still is better, given that (i) CTC needs to insert blanks to separate repeated labels; and (ii) characters vary greatly in spatial extent, and mapping each to multiple target frames in the final vector is an easier task than transforming it to exactly one frame.

4. Experiments

We carry out an extensive set of experiments to answer the following set of questions:
• Does the module actually work as expected?
• Is it tied to a specific CNN architecture?
• Is it tied to a specific model capacity?
• How does the final spatial size affect model performance?

4.1. Implementation Details

All experiments use an initial learning rate of 0.01, exponentially decayed to 0.001 over 9 × 10^4 batches. We implement in PyTorch [20], with the Adam [15] optimizer.
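A minimal sketch of that schedule, assuming a per-batch exponential decay from 0.01 down to 0.001 over 9 × 10^4 updates; the model is assumed to return the CTC loss directly (as in the earlier sketch), and the loop details are illustrative:

```python
import torch

def train(model, train_loader, total_steps: int = 90_000):
    """Adam with a per-batch exponential learning-rate decay from 0.01 to 0.001
    over 9e4 batches (a sketch of the schedule described in Sec. 4.1)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    # gamma chosen so that 0.01 * gamma**total_steps == 0.001
    gamma = (0.001 / 0.01) ** (1.0 / total_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

    for step, (images, targets, target_lengths) in enumerate(train_loader):
        loss = model(images, targets, target_lengths)   # forward pass returns the CTC loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step < total_steps:                          # hold the rate at 0.001 afterwards
            scheduler.step()
```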
4.2. Datasets

IAM [17] (modern English) is a famous offline handwriting benchmark dataset. It is composed of 1,539 scanned text pages handwritten by 657 different writers, corresponding to English texts extracted from the LOB corpus [14]. IAM has 747 documents (6,482 lines) in the training set, 116 documents (976 lines) in the validation set, and 336 documents (2,915 lines) in the test set.

The ICDAR2017 full page HTR competition [24] consists of two training sets. The first contains 50 fully annotated images with line-level localization and transcription ground-truth. The second set contains 10,000 images with only transcriptions (with annotated line breaks). Most of the dataset was taken from the Alfred Escher Letter Collection (AEC), which is written in German but also has pages in French and Italian. In all our experiments on this dataset, we do not make any use of either the 50-page training set or the annotated line-breaks of the 10,000-page training set.
Figure 1: Converting a fully-convolutional single line recognizer into a multi-line recognizer using our OrigamiNet module.
(a) A generic four-stage fully convolutional single line recognizer; the input is a single line image, and training is done using the CTC loss function. The backbone CNN can be any of the ones presented in Table 2. The input gets progressively down-sampled, then converted into 1D by average pooling along the vertical dimension right before the loss calculation. (Figures created via PlotNeuralNet [13])
(b) Here we convert the fully convolutional single-line recognizer into an OrigamiNet multi-line recognizer; comparing the two figures shows that the main change introduced is up-scaling vertically in two stages and, at the same time, down-scaling horizontally. We obtain a feature-map that is tall and narrow (the shape of one very long vertical line, of length L2). After that we proceed exactly as above: average pooling over the short dimension w (of the new line, not the original image), then using the CTC loss function to drive the training process.

4.3. CNN Backbones

To emphasize the generality of our proposed module, we evaluate it on a number of popular CNN architectures that have achieved strong performance in the text recognition literature. Inspired by the benchmark work [2], we evaluate VGG and ResNet-26 (the specific variants explored in [2]), as well as deeper and much more expressive variants (ResNet-66 and ResNet-74). We also evaluate a newly proposed gated, fully convolutional architecture for text recognition [35], named Gated Text Recognizer (GTR). The detailed structure of the CNN backbones we evaluate our proposed model on is presented in Table 2. More details on the basic building blocks of these architectures can be found in their respective papers: VGG [25], ResNet [12], and GTR [35].

4.4. Final Length, L2

For IAM, the final length should be at least 625, since the longest paragraph in the training set contains 624 characters. We have two questions here: what value can balance running time and recognition accuracy? And how does the relation between L1 and L2 affect the final CER?

Table 3 presents some experiments on this. First, we can see that, generally, even a very simple model like VGG can successfully learn to recognise multiple lines (at a relatively bad CER of ≈30%) in various configurations, yet the deeper ResNet-26 achieves a much better performance on the task, reaching 7.2%. Second, it is evident that wider generally gives better performance (though with diminishing returns), more so for VGG than for ResNet-26. We see that for reasonable values (>800) the network is fairly robust to the choice of L2. We can also note that L1 and L2 should be relatively close to each other.
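The lower bound on L2 can be read directly off the training transcriptions; a small sketch (it also counts the blanks that CTC must insert between repeated characters, which is one reason why longer is better; `train_transcriptions` is assumed to be the list of full-paragraph ground-truth strings):

```python
def min_ctc_frames(text: str) -> int:
    """Minimum number of output frames CTC needs for one transcription:
    one frame per character plus a blank between every repeated pair."""
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return len(text) + repeats

def minimum_L2(train_transcriptions) -> int:
    """A lower bound on the final line length L2 over the whole training set
    (for IAM paragraphs the longest transcription has 624 characters)."""
    return max(min_ctc_frames(t) for t in train_transcriptions)
```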
part | layer name | output size | ResNet-26 | ResNet-66 | ResNet-74 | VGG | GTR-8 | GTR-12
– | Input | H × W | (input image)
Encoder | ln1 | H × W | static layer normalization (all models)
Encoder | conv1 | H × W | 7×7, 64 (ResNet variants) | 13×13, 16 (GTR variants)
Encoder | conv2_x | H/2 × W/2 | [3×3, 64; 3×3, 64] ×1 | ×1 | ×1 | [3×3, 64] ×1 + 2×2 max pool, stride 2 | [GateBlock(512)] ×1 | [GateBlock(512)] ×1
Encoder | conv3_x | H/4 × W/4 | [3×3, 128; 3×3, 128] ×2 | ×2 | ×6 | [3×3, 128] ×1 + 2×2 max pool, stride 2 | [GateBlock(512)] ×1 | [GateBlock(512)] ×1
Encoder | conv4_x | H/8 × W/8 | [3×3, 256; 3×3, 256] ×5 | ×25 | ×25 | [3×3, 256; 3×3, 256] ×1 + 2×2 max pool, stride 2 | [GateBlock(512)] ×1 | [GateBlock(512)] ×2
Encoder | conv5_x | H/8 × W/16 | [3×3, 512; 3×3, 512] ×3 | ×3 | ×3 | [3×3, 512; 3×3, 512] ×1 + 2×2 max pool, stride 1×2 | [GateBlock(1024)] ×1 | [GateBlock(1024)] ×3
Encoder | conv6_x | H/8 × W/16 | [3×3, 512; 3×3, 512] ×1 | ×1 | ×1 | [3×3, 512; 3×3, 512] ×1 | [GateBlock(1024)] ×3 | [GateBlock(1024)] ×4
Decoder | interpolate | L1 × W/32 | interpolate bilinearly to L1 × W/32 (all models)
Decoder | conv7_x | L1 × W/32 | [3×3, 512; 3×3, 512] ×3 | ×3 | ×3 | [3×3, 512; 3×3, 512] ×1 | [GateBlock(512)] ×1 | [GateBlock(512)] ×1
Decoder | interpolate | L2 × W/64 | interpolate bilinearly to L2 × W/64 (all models)
Decoder | conv8 | L2 × w | 1×1, C (all models)
Decoder | pool | L2 | average pool over the short dimension w (all models)
Decoder | ln2 | L2 | static layer normalization (all models)
Decoder | output | 1 | CTC
# Parameters ×10^6 | | | 38.2 | 61.9 | 63.05 | 10.6 | 9.9 | 16.4

Table 2: Architectural details of our evaluated CNN backbones (Encoder part), and how our module (Decoder part) is attached to them. The table tries to abstract the architectures to their most common details. Although there are subtle differences in the components of the basic building block (in brackets []) of every architecture, the overall organization of the network, and how our module fits, is the same.

4.5. Final Width

Does the final shape need to have the largest possible aspect ratio? How does the final width, w (the shorter output dimension), affect the learning system? Table 4 presents experiments using VGG and ResNet-26 in this regard. It is clear that a large value like 62 deteriorates training significantly for ResNet-26, but small and medium values (<31) are comparable in performance. On the other hand, a model with a limited receptive field and complexity like VGG can generally make good use of the added width.

4.6. End-to-end Layer Normalization

The idea of using parameter-less layer normalization as the first and last layer of a model was proposed in [35], and was shown to increase performance and facilitate optimization. The same idea was very effective for our module, as initially some deep models that converged for single line recognition completely diverged here. This is most probably due to the large number of time-steps CTC works on in our case. As can be seen in Table 5, end-to-end layer normalization can bring significant increases in accuracy for models that already worked well; more importantly, it makes it possible to train very deep models that were constantly diverging before, leading to state-of-the-art performance on the task.
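A sketch of what parameter-less ("static") layer normalization as the first and last layer can look like in PyTorch; the exact placement and normalized dimensions in the released models are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticLayerNorm(nn.Module):
    """Parameter-less ("static") layer normalization: each sample is normalized
    over all of its non-batch dimensions, with no learned scale or shift."""
    def forward(self, x):
        return F.layer_norm(x, x.shape[1:])

# ln1 would be applied to the raw input image, ln2 to the final 1D prediction map
ln1, ln2 = StaticLayerNorm(), StaticLayerNorm()
```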
4.7. Hard-to-segment text-lines

Due to the way IAM was collected [17], its lines are generally easy to segment. To study how our model would handle harder cases, we carried out two separate experiments, artificially modifying IAM to produce new variants with hard-to-segment lines. First, interline spacing is massively reduced via seam carving [1], resizing pages to 50% of their height and creating heavily touching text lines, Fig. 4(b). GTR-12 achieved 6.5% CER on this dataset. Second, random projective transforms (rotating / resizing lines) and random elastic transforms (like [32] but at the page level) are applied to each paragraph, creating wave-like, non-straight lines, Fig. 4(c). GTR-12 achieved 6.2% CER on this dataset.
Final length (L2)          |   700 |   800 |   950 |  1100 |  1500
First stage length L1 = 450
VGG                        | 43.14 | 34.32 | 34.55 | 34.55 | 30.34
ResNet-26                  | 8.121 | 7.675 | 7.602 | 7.238 | 7.449
First stage length L1 = 225
VGG                        |  37.5 |  39.6 |  37.5 | 36.46 | 34.75

Table 3: The IAM test set CER of VGG and ResNet-26 for various values of L1 and L2.

Final width |    62 |    31 |    15 |     8 |     3
VGG         | 25.98 | 17.41 |  37.4 | 34.55 | 24.21
ResNet-26   |  19.9 | 9.128 |  8.64 | 7.238 |  8.34

Table 4: The IAM test set CER of VGG and ResNet-26 for various final widths. Here L1 = 450 and L2 = 1100.

LN  |   VGG | ResNet-26 | ResNet-66 | ResNet-74 | GTR-8
w/o | 51.37 |     10.03 |     8.925 |      76.9 |  72.4
w   | 34.55 |     7.238 |     6.373 |     6.128 | 5.639

Table 5: The IAM test set CER for various models, with and without layer normalization.

4.8. Comparison to state-of-the-art

For all the previous experiments, IAM paragraph images were scaled down to 500 × 500 pixels before training, and although we were already achieving state-of-the-art results, we wanted to explore whether we could break even with single line recognizers. As shown in Table 6, by increasing the image / model sizes, we were for the first time able to exceed the performance of state-of-the-art single line recognizers using a segmentation-free full page recognizer that trains without any visual or textual localization ground-truth. Note that we do not include in the comparison methods that use additional data, either in the form of training images as in [34, 9] or language modeling as in [31].

For the ICDAR2017 HTR dataset we follow [30] and report CER on the validation set proposed in [33] (the last 1,000 pages of the 10,000-image training set), as the evaluation server does not provide CER or other character-based metrics. Results are in Table 7. Note that both [33, 30] report results using CER normalized by GT length (nCER in the table). We used the author-released pre-trained models from [33] to compute their results without a language model. It is very evident that our method achieves far superior performance using weaker training signals.
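For reference, CER here is the character-level edit distance between prediction and ground truth; the sketch below shows one common way to compute it (total edit distance over total ground-truth length). The exact per-set normalization conventions used for nCER in [33, 30] are not reproduced here:

```python
def edit_distance(pred: str, gt: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, g in enumerate(gt, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (p != g)))  # substitution
        prev = curr
    return prev[-1]

def cer(predictions, ground_truths) -> float:
    """Total edit distance over total ground-truth length, in percent."""
    dist = sum(edit_distance(p, g) for p, g in zip(predictions, ground_truths))
    total = sum(len(g) for g in ground_truths)
    return 100.0 * dist / total
```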
4.9. Model Interpretability

Here we consider an important question: what does the model actually learn? We can see that the model works well in practice, and we have a hypothesis of what it might be doing, but it would be very interesting if we could have a peek at how our model is able to make its predictions.

To gain an understanding of what parts of the input bias the model towards a specific prediction, we utilize the framework of Path-Integrated Gradients [29] ensembled using SmoothGrad [26]. Note that, unlike typical classification tasks, we predict L2 labels per image. Of those, we discard blanks and repeated consecutive labels (in CTC, representing continuation of the same state); we found their attribution maps to be global and uninformative for these purposes.

For integrated gradients (IG), we change the baseline to an empty white image to designate no-signal, rather than an empty black one (which would be an all-signal image in our case), as our data is black text over a white background. Using white baselines produced much sharper attribution maps than black ones, showing how sensitive IG is to the choice of baseline (studied further in [28]). We used 50 steps to approximate the integral in our tests.

Standard SmoothGrad produces attribution maps that are very noisy (see [27]), but the SmoothGrad-Squared variant often suppresses most of the signal (a direct consequence of squaring fractions). After analysing the results of both, we suggest the root cause of SmoothGrad's problems is averaging positive and negative signals together. The squaring in SmoothGrad-Squared solves this problem, but at the cost of suppressing some important parts of the signal. So we propose SmoothGrad-Abs, which simply averages the absolute value of the attribution maps. SmoothGrad-Abs strikes a good balance between SmoothGrad and SmoothGrad-Squared. For our experiments, we used 5 noisy images.

Fig. 2 shows the attribution maps of a single random character from each line of the input image (computed from the attribution of the corresponding output neuron in the 1D prediction map fed to CTC). We see that the model does indeed implicitly learn good character-level localization from the input 2D image to the output 1D prediction map.

Fig. 3 provides a holistic view that gathers all the maps into one image. We took the one-character attribution map from the previous step, applied Otsu thresholding to it (to keep only the most important parts), then added a marker at the position of the center of mass of the resulting binary image. The marker is colored according to the transcription text line it belongs to. As can be seen, the result represents a very good implicit line segmentation of the original input.
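A sketch of the attribution procedure described above (integrated gradients with 50 steps and a white baseline, ensembled as SmoothGrad-Abs over 5 noisy copies). The `model` is assumed to map a (1, 1, H, W) image to a (T, num_classes) prediction map, and the noise scale `sigma` is an illustrative assumption:

```python
import torch

def integrated_gradients(model, image, t, c, steps=50):
    """Integrated gradients for one output neuron (time-step t, class c),
    using an all-white baseline (the "no-signal" image for black-on-white text)."""
    baseline = torch.ones_like(image)
    total = torch.zeros_like(image)
    for k in range(1, steps + 1):
        x = (baseline + (k / steps) * (image - baseline)).requires_grad_(True)
        score = model(x)[t, c]          # scalar output neuron being attributed
        score.backward()
        total = total + x.grad
    return (image - baseline) * total / steps

def smoothgrad_abs(model, image, t, c, n_samples=5, sigma=0.1):
    """SmoothGrad-Abs: average the *absolute* attribution maps of a few noisy copies."""
    maps = [integrated_gradients(model, image + sigma * torch.randn_like(image), t, c)
            for _ in range(n_samples)]
    return torch.stack(maps).abs().mean(dim=0)
```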
Method | Input Scale | Test CER (%) | Remarks
Single-line methods:
[22] | 128 × W | 5.8 | CNN+BLSTM+CTC
[18] | 64 × W | 5.24 | Seq2Seq (CNN+BLSTM encoder)
[35] | 32 × W | 4.9 | CNN+CTC
Multi-line methods:
[4] | 150 dpi | 16.2 | Requires pre-training the encoder (MDLSTM) on segmented text lines
[3] | 150 dpi | 10.1 | Requires pre-training the encoder (MDLSTM) on segmented text lines
[3] | 300 dpi | 7.9 | Requires pre-training the encoder (MDLSTM) on segmented text lines
[5] | 150 dpi | 15.6 | Requires fully segmented training data
[7] | – | 8.5 | Requires fully segmented training data
[33] | – | 6.4 | Requires full line-break annotation and partial visual localization
ResNet-74 OrigamiNet | 500 × 500 | 6.1 | –
GTR-8 OrigamiNet | 500 × 500 | 5.6 | –
GTR-8 OrigamiNet | 750 × 750 | 5.5 | –
GTR-12 OrigamiNet | 750 × 750 | 4.7 | –

Table 6: Comparison with the state-of-the-art on the IAM paragraph images (best result: GTR-12 OrigamiNet, 4.7).

Figure 2: Results of the interpretability experiment. For each of these 8 images (from left-right, top-down) we show the attribution heat-map for a single character output (for each line in the image), overlaid over a faint version of the original input image. The randomly chosen character is highlighted in green in the transcription below the image.

4.10. Limitations

We also trained our network on a variant of IAM with horizontally flipped images and line-level flipped ground-truth transcriptions, where it managed to achieve nearly the same CER. This verifies that the proposed method is robust and can learn the reading order from the data.

While the proposed method works well on paragraphs or full pages of text, learning the flow of multiple columns is not addressed directly. However, given that region / paragraph segmentation is trivial compared to text line segmentation, we think this is not a serious practical limitation.

Figure 3: The first and third columns represent two input images. The second and fourth columns are the corresponding color-coded scatter plots where, for each character, the position of the center of mass of the attribution map associated with that character is marked. Character markers belonging to the same line are given the same color. We can see that the model learns a very good implicit segmentation of the input image into lines without any localization signal.

Figure 4: Synthetic distortions applied to the IAM dataset to study how our model handles hard-to-segment text-lines. (a) Original paragraph image. (b) Compact, touching text-lines. (c) Rotated and wavy text-lines.

Method | CER | nCER | linebreaks | Pre-train
SFR [30] | 8.18 | 8.68 | ✓ | 50 fully annotated pages
SFR-align [33] | – | 11.05 | ✗ | 50 fully annotated pages
GTR-12 OrigamiNet | 6.80 | 5.87 | ✗ | –

Table 7: Comparison on ICDAR2017 HTR (best result: GTR-12 OrigamiNet). nCER is CER normalized by GT length; "linebreaks" indicates their presence in, or removal from, the GT.

5. Conclusion

In this paper we tackled the problem of multi-line / full page text recognition without any visual or textual localization ground-truth provided to the model during training. We proposed a simple neural network sub-module, OrigamiNet, that can be added to any existing fully convolutional single-line recognizer to convert it into a multi-line recognizer, by providing the model with enough spatial capacity to be able to properly unfold 2D input signals into 1D without losing information.

We conducted an extensive set of experiments on the IAM handwriting dataset to show the applicability and generality of our proposed module. We achieve state-of-the-art CER on the ICDAR2017 HTR and IAM datasets, surpassing models that explicitly made use of line segmentation information during training. We then concluded with a set of interpretability experiments to investigate what the model actually learns, and demonstrated its implicit ability to localize characters on each line.
References

[1] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH 2007 Papers, pages 10–es, 2007.
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. arXiv preprint arXiv:1904.01906, 2019.
[3] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In Advances in Neural Information Processing Systems, pages 838–846, 2016.
[4] T. Bluche, J. Louradour, and R. Messina. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1050–1055. IEEE, 2017.
[5] M. Carbonell, J. Mas Romeu, M. Villegas, A. Fornés, and J. Lladós. End-to-end handwritten text detection and transcription in full pages. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2019.
[6] R. G. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):690–706, 1996.
[7] J. Chung and T. Delteil. A computationally efficient pipeline approach to full page offline handwritten text recognition. arXiv preprint arXiv:1910.00663, 2019.
[8] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
[9] K. Dutta, P. Krishnan, M. Mathew, and C. Jawahar. Improving CNN-RNN hybrid networks for handwriting recognition. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 80–85. IEEE, 2018.
[10] B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J. A. Sánchez, A. H. Toselli, and E. Vidal. Ground-truth production in the Transcriptorium project. In 2014 11th IAPR International Workshop on Document Analysis Systems, pages 237–241. IEEE, 2014.
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] H. Iqbal. HarisIqbal88/PlotNeuralNet v1.0.0, Dec. 2018.
[14] S. Johansson. The LOB corpus of British English texts: Presentation and comments. 1980.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Dec. 2014.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[17] U.-V. Marti and H. Bunke. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
[18] J. Michael, R. Labahn, T. Grüning, and J. Zöllner. Evaluating sequence-to-sequence models for handwritten text recognition. arXiv preprint arXiv:1903.07377, 2019.
[19] B. Moysset, C. Kermorvant, and C. Wolf. Learning to detect, localize and recognize many text objects in document images from few examples. International Journal on Document Analysis and Recognition (IJDAR), 21(3):161–175, 2018.
[20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[21] T. Plötz and G. A. Fink. Markov models for offline handwriting recognition: A survey. International Journal on Document Analysis and Recognition (IJDAR), 12(4):269, 2009.
[22] J. Puigcerver. Are multidimensional recurrent layers really necessary for handwritten text recognition? In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 67–72. IEEE, 2017.
[23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[24] J. A. Sanchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal. ICDAR2017 competition on handwritten text recognition on the READ dataset. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1383–1388. IEEE, 2017.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[27] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller. Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274:141–145, 2016.
[28] P. Sturmfels, S. Lundberg, and S.-I. Lee. Visualizing the impact of feature attribution baselines. Distill, 2020. https://distill.pub/2020/attribution-baselines.
[29] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3319–3328. JMLR.org, 2017.
[30] C. Tensmeyer and C. Wigington. Training full-page handwritten text recognition models without annotated line breaks. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1–8. IEEE, 2019.
[31] P. Voigtlaender, P. Doetsch, and H. Ney. Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 228–233. IEEE, 2016.
[32] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen. Data augmentation for recognition of handwritten words and lines using a CNN-LSTM network. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 639–645. IEEE, 2017.
[33] C. Wigington, C. Tensmeyer, B. Davis, W. Barrett, B. Price, and S. Cohen. Start, follow, read: End-to-end full-page handwriting recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 367–383, 2018.
[34] S. Xiao, L. Peng, R. Yan, and S. Wang. Deep network with pixel-level rectification and robust training for handwriting recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 9–16. IEEE, 2019.
[35] M. Yousef, K. F. Hussain, and U. S. Mohammed. Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. arXiv preprint arXiv:1812.11894, 2018.
