OrigamiNet: Advanced Text Recognition
The main previous works that tried to address the problem of weakly supervised multi-line recognition were [3, 4, 30]. Besides these methods, other methods that work on full page recognition require the localization ground-truth of text lines during training. A detailed comparison of the training data required by our proposed method vs. other methods in the literature is presented in Table 1.
In this work, we present a simple and novel neural network sub-module, termed OrigamiNet, that can be added to any existing convolutional neural network (CNN) text-line recognizer to convert it into a full page recognizer. It can transcribe full text pages in a weakly supervised manner without being given any localization ground-truth (either visual in the images or textual in the transcriptions) during training, and without performing any explicit segmentation. In contrast to previous work, this is done very efficiently using feed-forward connections only (no recurrent connections), essentially in a single network forward pass.
Our main intuition in this work is, instead of the traditional two-step framework that first segments and then recognizes the extracted segments, to propose a novel integrated approach for learning to simultaneously, implicitly segment and recognize. This works by learning a representation transformation that maps the input into a representation where both segmentation and recognition are trivial.
We implicitly unfold an input multi-line image into a single-line image (i.e. from a 2D arrangement of characters to a 1D one), where all lines in the original image are stitched together into one long line, so no text-line segmentation is actually needed. Both segmentation and recognition are done in the same single step (a single network forward pass) instead of being carried out iteratively (on each line); thus all computations are shared between recognition and implicit segmentation, and the whole process is much faster.
The main ingredients to achieving this are: using the idea of a spatial bottleneck followed by up-sampling, used widely in pixel-wise prediction tasks (e.g. [16, 23]); and using the CTC loss function [11], which strongly induces / encourages a linear 1D target. We construct a simple neural network sub-module that applies these ideas, and demonstrate both its effectiveness and generality by attaching it to a number of state-of-the-art text recognition neural network architectures. We show that it can successfully convert them from single-line into multi-line text recognizers with exactly the same training procedure (i.e. without resorting to complex and fragile training recipes, like a special training curriculum or special pre-training strategies).
On the challenging ICDAR 2017 HTR [24] full page benchmark we achieve state-of-the-art Character Error Rate (CER) without any localization data. On full paragraphs of the IAM [17] dataset, we achieve state-of-the-art CER, surpassing models that work on carefully pre-segmented text lines, without using any localization information during training or testing.

To summarize, we address the problem of weakly supervised full-page text recognition. In particular, we make the following contributions:
• We conceptually propose a new approach for weakly-supervised simultaneous object segmentation and recognition, and apply it to text.
• We propose a simple and generic neural network sub-module that can be added to any CNN-based text-line recognizer to convert it into a multi-line recognizer that uses the same simple training procedure.
• We carry out an extensive set of experiments on a number of state-of-the-art text recognizers that demonstrate our claims. The resulting architectures achieve state-of-the-art performance on the ICDAR2017 HTR and full-paragraph IAM datasets.

2. Related Work

There is not much prior work in the literature on full page recognition. Segmentation-free multi-line recognition has mainly been considered in [3, 4]. The idea of both is to use selective attention to focus on a specific part of the input image, either characters in [4] or lines in [3]. These works have two major drawbacks. First, both are difficult to train and need to pre-train their encoder sub-network on single-line images before training on multi-line versions, which defeats the objective of the task. Second, though [3] is much faster than [4], both are very slow compared to current methods that work on segmented text lines.

Besides these two segmentation-free methods, other methods that work on full page recognition require the localization ground-truth of text lines for all [5, 7, 19] or part [33] of the training data, in order to train either a separate network or a sub-module (of a large, multi-task network) for text-line localization. Also, all these methods require line breaks to be annotated on all the provided textual ground-truth transcriptions (i.e. text lines must be segmented both visually in the image and textually in the transcription). [30] presented the idea of adapting [33] in a weakly supervised manner, without requiring line breaks in the transcription, by casting the alignment between the predicted line transcriptions and the ground truth as a combinatorial optimization problem and solving it greedily. However, [30] still requires the same pre-training as [33] and performs worse.

3. Methodology

Figure 1 presents the core idea of our proposed OrigamiNet module, and how it can be attached to any fully convolutional text recognizer. Both the before and after versions are shown for easy comparison.
The Connectionist Temporal Classification (CTC) loss function allows the training of neural text recognizers on unsegmented inputs by considering all possible alignments between two 1D sequences. The sequence of predictions produced by the network is denoted P, and the sequence of labels associated with the input image is denoted L, where |L| < |P|. The strict requirement of having P be a 1D sequence introduces a problem, given that the original input signal (the image I) is a 2D signal. This problem has typically been dealt with by unfolding the 2D signal into 1D, using a simple reduction operation (e.g. summation) along one of the dimensions (usually the vertical one), giving:

P_i = \sum_{j=1}^{H} F(I)_{i,j}    (1)

where F is a learned 2D representation transformation. This is the paradigm shown in Fig. 1a. As noted in [3, 4], this simple, blind collapse from 2D to 1D gives equal importance / contribution (and therefore gradients) to all the rows of the 2D input feature-map F(I), and thus prevents the recognition of any 2D arrangement of characters in the input image. If two characters cover the same columns, only one of them can possibly be recognized after the collapse operation.
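Below is a minimal PyTorch sketch of this collapse-then-CTC paradigm (Fig. 1a). The backbone, channel count, and class count are illustrative placeholders rather than the exact networks of Table 2; the sum over the vertical axis is Eq. 1.

```python
import torch
import torch.nn as nn

class CollapseRecognizer(nn.Module):
    """Fig. 1a paradigm: CNN features -> vertical collapse (Eq. 1) -> per-column classes."""
    def __init__(self, backbone: nn.Module, feat_channels: int, n_classes: int):
        super().__init__()
        self.backbone = backbone                     # any fully convolutional encoder
        self.classifier = nn.Linear(feat_channels, n_classes)

    def forward(self, images):                       # images: (N, 1, H, W)
        f = self.backbone(images)                    # (N, C, H', W')
        f = f.sum(dim=2)                             # Eq. 1: collapse the vertical axis -> (N, C, W')
        f = f.permute(2, 0, 1)                       # (T = W', N, C), the layout CTC expects
        return self.classifier(f).log_softmax(-1)    # per-column log-probabilities

# Training uses the CTC loss on the unsegmented label sequence:
# ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

With this head, two characters stacked vertically over the same columns compete for the same output frame, which is exactly the limitation discussed above.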
To tackle this problem, i.e. to satisfy the 1D input requirement of CTC without sacrificing the ability to recognize 2D arrangements of characters, we propose the idea of learning the proper 2D→1D unfolding through a CNN, motivated by the success of CNNs in pixel-wise prediction and image-to-image translation tasks.
The main idea of our work (presented in Fig. 1b) is to augment the traditional paradigm with a series of up-scaling operations that transform the input feature-map into the shape of a single line that is long enough to hold all the lines (2D character arrangements) from the input image. Up-scaling operations are followed by convolutional computational blocks that act as learned resize operations (as done by many researchers, e.g. [8]). The changed direction of up-scaling encourages each line of the input image to be mapped into a distinct part of the output vertical dimension. After these changes, we proceed with the traditional paradigm as-is, performing the simple sum reduction (Eq. 1) along the short dimension w of the resulting line (which is perpendicular to the original input multi-line image's vertical dimension). The model is trained with CTC.
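A minimal sketch of this unfolding head is given below, assuming a feature map from any fully convolutional encoder. The stage shapes (l1, l2, w), block designs, and channel counts are illustrative assumptions; the actual configurations follow Table 2 and Fig. 1b.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrigamiDecoder(nn.Module):
    """Sketch of the 2D->1D unfolding: up-scale vertically toward L2 while
    down-scaling horizontally, then pool the short axis and classify for CTC."""
    def __init__(self, channels: int, n_classes: int,
                 l1: int = 450, l2: int = 1100, w: int = 40):
        super().__init__()
        self.l1, self.l2, self.w = l1, l2, w
        self.block1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, feats):                                    # (N, C, H', W') from the encoder
        x = F.interpolate(feats, size=(self.l1, feats.shape[-1] // 2),
                          mode='bilinear', align_corners=False)  # stage 1: taller, narrower
        x = self.block1(x)                                       # learned resize
        x = F.interpolate(x, size=(self.l2, self.w),
                          mode='bilinear', align_corners=False)  # stage 2: one tall, narrow "line"
        x = self.block2(x)
        x = x.mean(dim=3)                                        # pool the short axis w -> (N, C, L2)
        x = x.permute(2, 0, 1)                                   # (T = L2, N, C) for CTC
        return self.classifier(x).log_softmax(-1)
```

Because every character on the page now gets its own stretch of the L2-long axis, the same CTC loss used for single-line recognizers can drive the whole training.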
Moreover, we argue that the main bottleneck preventing all previous works from learning proper 2D→1D mappings directly, as we do, is spatial constraints (i.e. not overall capacity or architectural constraints). Providing enough spatial capacity allows the model to easily learn such transformations (even for simple, limited-capacity models, as we will show in the experiments section). Given the spatial capacity and the strong linear prior induced by CTC, the model is able to learn a strong 2D→1D unfolding function with the same simple training procedure used for training single-line recognizers, and without any special pre-training or curriculum applied to any sub-module of the network (both of which are used exclusively in the literature).

One natural question here is how to choose the final line length L2 (see its definition in Fig. 1b). To gather space for the whole paragraph / page, L2 must be at least as long as the largest number of characters in any transcription in the training set. Longer still is better, given that (i) CTC needs to insert blanks to separate repeated labels; and (ii) characters vary greatly in spatial extent, and mapping each to multiple target frames in the final vector is an easier task than transforming it to exactly one frame.
4. Experiments

We carry out an extensive set of experiments to answer the following set of questions:
• Does the module actually work as expected?
• Is it tied to a specific CNN architecture?
• Is it tied to a specific model capacity?
• How does the final spatial size affect model performance?
4.1. Implementation Details

All experiments use an initial learning rate of 0.01, exponentially decayed to 0.001 over 9 × 10^4 batches. We implement in PyTorch [20], with the Adam [15] optimizer.
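For concreteness, a minimal sketch of one scheduler that matches the stated endpoints is shown below; the exact decay formula and batching details are assumptions, and only the 0.01 → 0.001 range over 9 × 10^4 batches comes from the text.

```python
import torch
import torch.nn as nn

# Stand-in parameters; in practice these come from a Table 2 backbone with the
# OrigamiNet decoder attached.
model = nn.Conv2d(1, 16, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Exponential decay from 0.01 to 0.001 over 9 x 10^4 batches, stepped once per batch.
decay_steps = 90_000
gamma = (0.001 / 0.01) ** (1.0 / decay_steps)   # per-batch multiplicative factor
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

# Inside the training loop, after each optimizer.step():
#     scheduler.step()
# After decay_steps batches the learning rate is ~0.001 and can be held there.
```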
(a) A generic four-stage fully convolutional single-line recognizer; the input is a single line image, and training is done using the CTC loss function. The backbone CNN can be any of the ones presented in Table 2. The input gets progressively down-sampled, then converted into 1D by average pooling along the vertical dimension right before the loss calculation. (Figures created via PlotNeuralNet [13])
(b) Here we convert the fully convolutional single-line recognizer into an OrigamiNet multi-line recognizer; comparing the two figures shows that the main change introduced is up-scaling vertically in two stages while, at the same time, down-scaling horizontally. We obtain a feature-map that is tall and narrow (the shape of one very long vertical line, of length L2). After that we proceed exactly as above, average pooling over the short dimension w (of the new line, not the original image) and then using the CTC loss function to drive the training process.

Figure 1: Converting a fully-convolutional single line recognizer into a multi-line recognizer using our OrigamiNet module.
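The conversion in Fig. 1 amounts to swapping the recognizer's head; a minimal composition sketch, reusing the illustrative encoder and OrigamiDecoder classes from the earlier sketches (stand-ins, not the actual Table 2 networks), could look as follows.

```python
import torch.nn as nn

class MultiLineRecognizer(nn.Module):
    """Any fully convolutional single-line encoder followed by the unfolding head."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder     # e.g. a VGG / ResNet / GTR feature extractor
        self.decoder = decoder     # e.g. the OrigamiDecoder sketch above

    def forward(self, pages):                      # full page / paragraph images
        return self.decoder(self.encoder(pages))   # (T = L2, N, n_classes) log-probs for CTC
```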
4.2. Datasets

IAM [17] (modern English) is a famous offline handwriting benchmark dataset. It is composed of 1539 scanned text pages handwritten by 657 different writers, corresponding to English texts extracted from the LOB corpus [14]. IAM has 747 documents (6,482 lines) in the training set, 116 documents (976 lines) in the validation set, and 336 documents (2,915 lines) in the test set.

The ICDAR2017 full page HTR competition [24] consists of two training sets. The first contains 50 fully annotated images with line-level localization and transcription ground-truth. The second set contains 10,000 images with only transcriptions (with annotated line breaks). Most of the dataset was taken from the Alfred Escher Letter Collection (AEC), which is written in German, but it also has pages in French and Italian. In all our experiments on this dataset, we don't make any use of either the 50-page training set or the annotated line breaks in the 10,000-page training set.

4.3. CNN Backbones

To emphasize the generality of our proposed module, we evaluate it on a number of popular CNN architectures that have achieved strong performance in the text recognition literature. Inspired by the benchmark work of [2], we evaluate VGG and ResNet-26 (the specific variants explored in [2]), as well as deeper and much more expressive variants (ResNet-66 and ResNet-74). We also evaluate a newly proposed gated, fully convolutional architecture for text recognition [35], named Gated Text Recognizer (GTR). The detailed structure of the CNN backbones we evaluate our proposed model on is presented in Table 2. More details on the basic building blocks of these architectures can be found in their respective papers: VGG [25], ResNet [12], and GTR [35].
4.4. Final Length, L2

For IAM, the final length should be at least 625, since the longest paragraph in the training set contains 624 characters. We have two questions here: what value best balances running time and recognition accuracy, and how does the relation between L1 and L2 affect the final CER?

Table 3 presents some experiments on this. First, we can see that, generally, even a very simple model like VGG can successfully learn to recognise multiple lines (at a relatively poor CER of 30%) in various configurations, yet the deeper ResNet-26 achieves much better performance on the task, reaching 7.2%. Second, it is evident that wider is generally better (but with diminishing returns), which is more evident for VGG than for ResNet-26. We see that for reasonable values (>800) the network is fairly robust to the choice of L2. We can also note that L1 and L2 should be relatively close to each other.
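A small sketch of the lower-bound rule for L2 is given below (longest training transcription plus a margin); the helper name and margin handling are hypothetical, and only the 624-character / 625 figure comes from the text.

```python
def minimum_final_length(transcriptions, margin: int = 1) -> int:
    """Lower bound for the final line length L2: the longest training transcription
    (in characters) plus a small margin, since CTC also needs room for blanks
    between repeated labels."""
    longest = max(len(t) for t in transcriptions)
    return longest + margin

# For IAM paragraphs the longest transcription has 624 characters, giving a bound
# of 625; the experiments above suggest choosing a comfortably larger L2 (> 800).
print(minimum_final_length(["a" * 624, "a shorter paragraph"]))   # -> 625
```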
Table 2: Architectural details of our evaluated CNN backbones (Encoder part), and how our module (Decoder part) is attached to them (columns: part, layer name, output size, ResNet-26, ResNet-66, ResNet-74, VGG, GTR-8, GTR-12). The table abstracts the architectures to their most common details. Although there are subtle differences in the components of the basic building block (in brackets []) of each architecture, the overall organization of the network, and how our module fits, is the same.
Table 6: Comparison with the state-of-the-art on the IAM paragraph images, best result is highlighted.
Figure 2: Results of the interpretability experiment. For each of these 8 images (from left-right, top-down) we show the attribution heat-map for a single character output (for each line in the image) overlaid over a faint version of the original input image. The randomly chosen character is highlighted in green in the transcription below the image.
… nearly the same CER. This verifies that the proposed method is robust and can learn the reading order from data. While the proposed method works well on paragraphs or full pages of text, learning the flow of multiple columns is not addressed directly. However, given that region / paragraph segmentation is trivial compared to text-line segmentation, we think this is not a serious practical limitation.

Figure 3: The first and third columns represent two input images. The second and fourth columns are the corresponding color-coded scatter plots where, for each character, the position of the center of mass of the attribution map associated with that character is marked. Character markers belonging to the same line are given the same color. We can see that the model learns a very good implicit segmentation of the input image into lines without any localization signal.

Figure 4: Synthetic distortions applied to the IAM dataset to study how our model handles hard-to-segment text lines. (a) original paragraph image; (b) touching text-lines; (c) rotated and wavy text-lines.
Method | CER | nCER | linebreaks | Pre-train
SFR [30] | 8.18 | 8.68 | ✓ | 50 fully annotated pgs
SFR-align [33] | – | 11.05 | ✗ | 50 fully annotated pgs
GTR-12 OrigamiNet | 6.80 | 5.87 | ✗ | –

Table 7: Comparison on ICDAR2017 HTR; the best result is highlighted. nCER is CER normalized by GT length; linebreaks indicates their presence in or removal from the GT.
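For reference, character error rate is conventionally computed as the edit distance between the predicted and ground-truth transcriptions divided by the ground-truth length; a self-contained sketch is below. The helper names are ours, and the exact nCER normalization used in Table 7 is only characterized by its caption, not re-derived here.

```python
def edit_distance(pred: str, gt: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gt, start=1):
            curr.append(min(prev[j] + 1,              # delete a character from pred
                            curr[j - 1] + 1,          # insert a character into pred
                            prev[j - 1] + (p != g)))  # substitute if characters differ
        prev = curr
    return prev[-1]

def cer(pred: str, gt: str) -> float:
    """Character error rate: edits needed to turn `pred` into `gt`, per GT character."""
    return edit_distance(pred, gt) / max(len(gt), 1)

print(cer("he1lo world", "hello world"))   # -> 0.0909... (1 substitution / 11 characters)
```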
5. Conclusion

In this paper we tackled the problem of multi-line / full page text recognition without any visual or textual localization ground-truth provided to the model during training. We proposed a simple neural network sub-module, OrigamiNet, that can be added to any existing fully convolutional single-line recognizer to convert it into a multi-line recognizer, by providing the model with enough spatial capacity to properly unfold 2D input signals into 1D without losing information.

We conducted an extensive set of experiments on the IAM handwriting dataset to show the applicability and generality of our proposed module. We achieve state-of-the-art CER on the ICDAR2017 HTR and IAM datasets, surpassing models that explicitly made use of line segmentation information during training. We then concluded with a set of interpretability experiments to investigate what the model actually learns, and demonstrated its implicit ability to localize characters on each line.
References

[1] S. Avidan and A. Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH 2007 papers, pages 10–es, 2007.
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. arXiv preprint arXiv:1904.01906, 2019.
[3] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In Advances in Neural Information Processing Systems, pages 838–846, 2016.
[4] T. Bluche, J. Louradour, and R. Messina. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1050–1055. IEEE, 2017.
[5] M. Carbonell, J. Mas Romeu, M. Villegas, A. Fornés, and J. Lladós. End-to-end handwritten text detection and transcription in full pages. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 07 2019.
[6] R. G. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):690–706, 1996.
[7] J. Chung and T. Delteil. A computationally efficient pipeline approach to full page offline handwritten text recognition. arXiv preprint arXiv:1910.00663, 2019.
[8] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
[9] K. Dutta, P. Krishnan, M. Mathew, and C. Jawahar. Improving CNN-RNN hybrid networks for handwriting recognition. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 80–85. IEEE, 2018.
[10] B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J. A. Sánchez, A. H. Toselli, and E. Vidal. Ground-truth production in the tranScriptorium project. In 2014 11th IAPR International Workshop on Document Analysis Systems, pages 237–241. IEEE, 2014.
[11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] H. Iqbal. Harisiqbal88/plotneuralnet v1.0.0, Dec. 2018.
[14] S. Johansson. The LOB corpus of British English texts: Presentation and comments. 1980.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. Dec. 2014.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[17] U.-V. Marti and H. Bunke. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
[18] J. Michael, R. Labahn, T. Grüning, and J. Zöllner. Evaluating sequence-to-sequence models for handwritten text recognition. arXiv preprint arXiv:1903.07377, 2019.
[19] B. Moysset, C. Kermorvant, and C. Wolf. Learning to detect, localize and recognize many text objects in document images from few examples. International Journal on Document Analysis and Recognition (IJDAR), 21(3):161–175, 2018.
[20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[21] T. Plötz and G. A. Fink. Markov models for offline handwriting recognition: A survey. International Journal on Document Analysis and Recognition (IJDAR), 12(4):269, 2009.
[22] J. Puigcerver. Are multidimensional recurrent layers really necessary for handwritten text recognition? In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 67–72. IEEE, 2017.
[23] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[24] J. A. Sanchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal. ICDAR2017 competition on handwritten text recognition on the READ dataset. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1383–1388. IEEE, 2017.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[27] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller. Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274:141–145, 2016.
[28] P. Sturmfels, S. Lundberg, and S.-I. Lee. Visualizing the impact of feature attribution baselines. Distill, 2020. https://distill.pub/2020/attribution-baselines.
[29] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org, 2017.
[30] C. Tensmeyer and C. Wigington. Training full-page handwritten text recognition models without annotated line breaks. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1–8. IEEE, 2019.
[31] P. Voigtlaender, P. Doetsch, and H. Ney. Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 228–233. IEEE, 2016.
[32] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen. Data augmentation for recognition of handwritten words and lines using a CNN-LSTM network. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 639–645. IEEE, 2017.
[33] C. Wigington, C. Tensmeyer, B. Davis, W. Barrett, B. Price, and S. Cohen. Start, follow, read: End-to-end full-page handwriting recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 367–383, 2018.
[34] S. Xiao, L. Peng, R. Yan, and S. Wang. Deep network with pixel-level rectification and robust training for handwriting recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 9–16. IEEE, 2019.
[35] M. Yousef, K. F. Hussain, and U. S. Mohammed. Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. arXiv preprint arXiv:1812.11894, 2018.