FR-DeTR End-To-End Flowchart Recognition With Precision and Robustness

The document presents FR-DETR, an end-to-end multi-task network for flowchart recognition that integrates symbol detection and edge detection to improve precision and robustness. It addresses the challenges of traditional methods by utilizing a shared multi-scale Transformer structure and introduces a new dataset with over 1000 machine-generated flowchart images for training. Experimental results demonstrate that FR-DETR significantly outperforms existing methods in recognition accuracy.

Received May 4, 2022, accepted June 6, 2022, date of publication June 14, 2022, date of current version June 21, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3183068

FR-DETR: End-to-End Flowchart Recognition With Precision and Robustness
LIANSHAN SUN, HANCHAO DU, AND TAO HOU
Shaanxi Joint Laboratory of Artificial Intelligence, School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology,
Xi’an 710021, China
Corresponding author: Lianshan Sun ([email protected])
This work was supported by the Scientific Research Program through the Education Department of Shaanxi Provincial Government under
Grant 17JK0087.

ABSTRACT Traditional flowchart recognition methods have difficulty detecting newly added symbols and distinguishing targets from complex backgrounds such as line textures. Existing deep-learning-based object detectors and line segment detectors are promising for recognizing targets and distinguishing them from textured backgrounds. However, using two separate detectors incurs unnecessary training and inference costs. Moreover, the insufficient volume and diversity of currently available datasets limit the effectiveness of model training. To address these issues, this paper proposes an end-to-end multi-task network, FR-DETR (Flowchart Recognition DEtection TRansformer), and a new dataset for precise and robust flowchart recognition. FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure that performs symbol detection and edge detection using shared feature maps and separate prediction heads in a coarse-to-fine refinement process. The coarse stage analyzes low-resolution features and suggests candidate regions that contain potential targets; the fine stage then produces accurate predictions from high-resolution features. Meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation. It contains more than 1000 machine-generated flowchart images, 25K+ symbol instances in nine categories, and 20K+ line segments. The experiments show that FR-DETR achieves an overall precision and recall of 94.0% and 93.1% on the proposed dataset, and 98.7% and 98.1% on the CLEF-IP dataset, respectively, outperforming prior methods in all cases.

INDEX TERMS Flowchart recognition, multi-task learning, multi-scale Transformer, object detection, line
segment detection.

I. INTRODUCTION
Flowchart recognition is an essential sub-task in research on document analysis and recognition [1]. The critical problem of flowchart recognition is to recognize and refine the structural semantics of flowcharts. There are two study areas for flowchart recognition: handwritten flowchart recognition and machine-generated flowchart recognition. In the past few years, most researchers have focused on analyzing handwritten flowcharts, while machine-generated ones have received little attention. However, understanding the structural semantics of machine-generated flowchart images is crucial for many structural-semantic-based tasks, such as patent retrieval, automatic code generation, and task-oriented dialogue systems [2]–[6].

Existing methods [2], [3], [8]–[10], [23] for recognizing machine-generated flowcharts mainly extract the entire structure by analyzing the connected components in images and then identify specific structures using manually chosen features, which causes them to fail on broken edges and dotted lines. Traditionally, these methods focus on flowcharts in binary images or images with simple backgrounds. However, as information technologies advance and data volumes increase, flowchart images are becoming more diverse and complex to meet various aesthetic requirements. In detail, flowchart structures now have more varied and colorful symbols and more complex backgrounds, leading to problems such as decreased recognition accuracy as well as incompatibility and inflexibility of manually chosen features. Meanwhile, traditional detection methods normally aim to perform universal detection, which may produce redundant results and leave models unable to separate targets from interference (e.g., line-shaped texture backgrounds).

The associate editor coordinating the review of this manuscript and approving it for publication was Yonghong Peng.
Dataset available online: https://github.com/harolddu/frdetr_dataset

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.


64292 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 10, 2022
L. Sun et al.: FR-DETR: End-to-End Flowchart Recognition With Precision and Robustness

Different from traditional methods, deep-learning-based computer vision technologies are capable of focusing on desired targets, which can be summarized into symbols and edges in flowcharts. Deep-learning-based object detectors are currently widely used for detecting objects within images. Although these approaches perform well in symbol detection, their bounding box representation makes it difficult to recognize line segments that are short or nearly parallel to the axes. Some deep-learning-based line segment detectors that treat edges as endpoint pairs can be used to achieve good line segment detection performance. Consequently, the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection. As shown in Fig. 1, symbols and arrows are indicated by bounding boxes and edges are indicated by line segments.

FIGURE 1. Representation of proposed flowchart recognition method: bounding boxes localize each symbol and arrow, line segments denote connecting edges, regardless of texture backgrounds.

Many methods manage these two tasks separately. For example, some works [11]–[14] handle object detection, while some others [15]–[19] deal with line segment detection. Although these methods can achieve promising performance, the sequential processing of these tasks results in high training and inference costs. Some multi-task models [20]–[22], [25], [26] achieve competitive performance by combining different tasks into a model with one shared encoder and separate decoders. In flowchart recognition, information processed by different tasks is inherently correlated: symbols are connected by edges, and edges connect symbols. We believe a multi-task model is more suitable for flowchart recognition because (1) it enables information to be shared between the two tasks of object detection and line segment detection, which can simplify the network structure; and (2) it shortens the training and analysis process by managing the two tasks simultaneously.

Based on these works, this paper proposes an end-to-end multi-task network architecture named FR-DETR, which is, to the best of our knowledge, the first deep learning system for machine-generated flowchart recognition. By fusing DETR (DEtection TRansformer) [14] and LETR (LinE segment TRansformers) [19], FR-DETR has a CNN backbone and a multi-scale Transformer structure to perform fine-grained symbol detection and edge detection, respectively. In addition, to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed. Compared with the only publicly accessible machine-generated flowchart dataset, CLEF-IP (Cross-Language Evaluation Forum Intellectual Property), which has around 60 images, the new dataset has more than 1000 machine-generated flowchart images, 25K+ symbol instances in nine categories, and 20K+ line segments.

The main contributions of this paper are:
1) It demonstrates that machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images.
2) It proposes an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models. The model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks.
3) The proposed method achieves a recognition accuracy that outperforms prior machine-generated flowchart recognition methods.
4) A new machine-generated flowchart dataset is established to address data shortages for deep learning model training.

II. RELATED WORK
A. FLOWCHART RECOGNITION
Most prior studies of machine-generated flowchart recognition took place after CLEF-IP 2011, which was held by the Information Retrieval Facility (IRF) and released a public machine-generated flowchart dataset. Rusiñol et al. [7], [8] modeled flowcharts as a structure layer and a text layer, and then performed recognition after layer separation. The structure extracted by connected component analysis was used to distinguish symbols with a geometric moments descriptor and a blurred shape model. Mörzinger et al. [9] separated text and diagram before studying the visual features of a structure at the pixel level. Thean et al. [10] suggested a symbol recognition method based on text-graphic segmentation, shape-squeezing, and symmetric features. Zhang [23] proposed a corner-based structural model (CBSM) based on the analysis of different corners and symbol shapes. The CBSM recognizes symbols by defining corner classification and corner-based spatial constraints for each kind of graphic shape. These methods mostly rely on the results of connected component analysis and different manually chosen features.

Over the last decade, research on handwritten flowchart recognition has received increasing attention. Carton et al. [27] fused structural and statistical information to compute grammatical descriptions for each type of symbol. Lemaitre et al. [28] analyzed flowchart structure based on the description and modification of segment (DMOS) and structural information. Bresler et al. [29] solved a max-sum model optimization task to obtain the best symbol


description. Schäfer et al. [30] proposed Arrow R-CNN to detect symbols and connecting edges by adding a keypoint head to Faster R-CNN. The keypoint head is designed to detect the arrow and tail belonging to a connecting edge as two keypoints. Arrow R-CNN was the first deep-learning-based flowchart recognition approach.

B. TRANSFORMER ARCHITECTURE
In recent years, Transformer [32] has become the standard backbone for many natural language processing (NLP) models and has achieved remarkable success. Its self-attention and cross-attention mechanisms can build strong relations among sequence-format inputs. Recently, Transformer has also attracted growing attention in the field of computer vision [34]. Carion et al. [14] introduced DETR, a Transformer-based end-to-end object detection network that removes hand-designed anchors and non-maximum suppression (NMS). It detects objects using interactions between a fixed number of queries and encoded image features. Following the basic query concept of DETR, Xu et al. [19] proposed LETR, a Transformer-based line segment detector. Unlike prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regresses the endpoints of a line segment and achieved state-of-the-art performance on the relevant line segment detection datasets.
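To make the query mechanism above concrete, the following is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention between a fixed set of queries and a flattened feature sequence. It is an illustration only, not the DETR or LETR implementation: the function names, the toy vectors, and the omission of learned projections and multiple heads are all assumptions made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, features):
    """Single-head scaled dot-product attention (toy version).

    queries:  N vectors of dimension d (the fixed set of target queries)
    features: HW vectors of dimension d (flattened encoded image features)
    The features act as both keys and values and no learned projections
    are applied; both are simplifying assumptions of this sketch.
    Returns one d-dimensional output per query plus the attention weights.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in features]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, features))
               for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy data: three feature positions, two queries, d = 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]
outputs, weights = cross_attention(queries, features)
```

Each query ends up as a weighted mixture of the feature positions it is most similar to, which is the sense in which a query "attends" to a region of the image.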

C. MULTI-TASK APPROACHES
Traditionally, task-oriented networks have been independently designed and trained for each task. However, just as humans solve multiple tasks concurrently, many real-world problems are also multi-modal. This motivates researchers to study generalized methods that can accomplish all desired tasks with a single model. Deep-learning-based MTL (multi-task learning) models aim to improve network generalization and the capability of jointly learning shared information. Compared with single-task models, multi-task networks have advantages such as a reasonable reduction in model size and an increase in inference speed by sharing inherent parts of the network structure [24]. Vandenhende et al. [25] considered the importance of task interactions while distilling information during multi-task learning. Hu et al. [26] jointly learned multiple tasks across different domains with a unified Transformer. Wu et al. [21] introduced a multi-task network that can jointly handle object detection, drivable area segmentation, and lane detection in autonomous driving.

FIGURE 2. Because of the complex structure of machine-generated flowcharts, errors occur when applying Arrow R-CNN. Bounding boxes indicate entire connecting edges, and the red and blue points represent the start and tail of the edge they belong to. (a) shows the detection errors of box and keypoint when edges have no arrow. (b) shows the detection errors when edges have a nested layout.

III. PROPOSED APPROACH
A. MOTIVATION
Although Arrow R-CNN [30] achieves remarkable results in handwritten flowchart recognition, it is not fully applicable to recognizing machine-generated flowcharts. Every connecting edge detected by Arrow R-CNN must contain an arrow as a keypoint. However, several connecting edges within a machine-generated flowchart may share one arrow, which causes failures in detecting edges and keypoints when applying Arrow R-CNN. As shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur. Based on the structural analysis of connecting edges, a line segment representation can better handle the cases in which Arrow R-CNN fails. Line segments are not appropriately described with bounding boxes because of their highly variable aspect ratio and the limited choices of anchors. Many works [16]–[18] follow the procedure of first producing junction proposals and then converting them into line segments, which causes the performance of these methods to rely heavily on junction detection. However, in a flowchart, junctions appear on both connecting edges and symbols, which brings numerous redundant


FIGURE 3. FR-DETR architecture. A convolution network is used as backbone to produce two feature maps with different resolutions from an input
flowchart. The coarse encoder-decoder structure first encodes the smaller feature map and then generates the mid-queries with the interactions between
encoded features and the init-queries. The fine encoder-decoder structure encodes the feature map with higher resolution and outputs final-queries
based on the interaction between the corresponding encoded feature and mid-queries. In the end, the final-queries are fed into feed-forward networks
to make the final predictions.
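The two-resolution design in Fig. 3 can be made tangible with a small shape calculation. The sketch below assumes the backbone strides and channel counts stated elsewhere in the paper (1/32 with 2048 channels for the coarse stage, 1/16 with 1024 channels for the fine stage, both projected to d = 256). The 600 × 800 input size, the helper names, and the ceiling rounding (a common consequence of padded stride-2 layers) are assumptions for illustration.

```python
import math

def feature_map_shape(height, width, stride):
    """Spatial size of a backbone feature map at a given stride.

    Ceiling division is assumed here, which is what padded
    stride-2 layers typically produce.
    """
    return math.ceil(height / stride), math.ceil(width / stride)

def stage_summary(height, width, stride, channels, d_model=256):
    """Shapes seen by one encoder-decoder stage for a given input size."""
    h, w = feature_map_shape(height, width, stride)
    tokens = h * w  # sequence length after flattening H x W
    return {
        "backbone_shape": (channels, h, w),       # f_0 in R^{C x H x W}
        "encoder_input_shape": (d_model, tokens),  # z in R^{d x HW}
        "attention_pairs": tokens * tokens,        # self-attention grows quadratically
    }

# Hypothetical 600 x 800 input image.
coarse = stage_summary(600, 800, stride=32, channels=2048)
fine = stage_summary(600, 800, stride=16, channels=1024)
```

For this input the fine stage sees four times as many tokens as the coarse stage, so its self-attention compares sixteen times as many token pairs, which is one way to see why running a coarse pass first is attractive.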

instances when using junction-based approaches. In addition, because convolution has a fixed receptive field, CNN-based models have limitations in building long-range relations for targets like long lines. Recently, the Transformer-based line segment detector LETR [19] introduced a model that directly regresses the endpoint coordinates of each line segment. The attention mechanism introduced by Transformer is well suited to distinguishing wanted line segments from unwanted ones.

Inspired by the aforementioned methods, and with the aim of improving flowchart recognition accuracy and reducing the costs of using isolated models, this paper modifies LETR into a multi-task model that jointly accomplishes the two detection tasks. The model selected to perform symbol detection is DETR, for two reasons: (1) DETR has a Transformer-based structure similar to that of LETR, which maximizes structure sharing between the two models and the overall cost reduction. (2) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features.

B. THE FR-DETR MODEL
The overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure as depicted in Fig. 3. The proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding and target prediction.

1) FEATURE EXTRACTION
FR-DETR uses a CNN backbone to extract a feature map $f_0 \in \mathbb{R}^{C \times H \times W}$ from an input image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$. The channel dimension of the feature map $f_0$ is then reduced from $C$ to a smaller dimension $d$ by a 1 × 1 convolution to obtain a new feature map $f \in \mathbb{R}^{d \times H \times W}$. To meet the encoder's requirement, which expects input in sequence format, the feature map $f$ is flattened to create another feature map $z \in \mathbb{R}^{d \times HW}$.

2) FEATURE ENCODING
The Transformer encoder is a stack of multiple encoding layers. Each layer consists of a multi-head self-attention module and a feed-forward network (FFN). The encoding layers receive processed features from their predecessor layer and deliver the output features to the corresponding FFN after learning the pairwise relations between the input and output. In general, the flattened feature map $z \in \mathbb{R}^{d \times HW}$ is encoded into a new feature map $z' \in \mathbb{R}^{d \times HW}$. The positional encoding of $f$ is added so that the flattened feature map $z$ does not lose its spatial relations.

3) FEATURE DECODING
Following the standard architecture of Transformer, the decoder transforms $N$ embeddings each for symbols and line segments using multi-head self-attention and cross-attention modules. Like the positional encoding of the encoder, the input embeddings are learnable positional information that is added to the input of each layer; they are called target queries. Each decoding layer receives $z' \in \mathbb{R}^{d \times HW}$ from the last encoding layer and two types of target queries, $b \in \mathbb{R}^{d \times N}$ and $l \in \mathbb{R}^{d \times N}$, namely symbol queries and line queries, from its predecessor decoding layer. Both types of queries are first processed by the self-attention module, and then, each entity


in the queries is assigned to different regions of positional encoding by the cross-attention module. The output of the decoder is then used to predict the final results using an FFN.

4) TARGET PREDICTION
The final results of symbols and line segments are predicted by an FFN. Specifically, the coordinates of symbols and line segments are computed by a multi-layer perceptron (MLP) with three layers, and the confidence of the predicted targets is produced by a linear projection layer.

C. COARSE-TO-FINE STRATEGY
The detection of small objects, such as arrows in a flowchart, empirically needs high resolution feature maps to achieve better results. However, directly processing high resolution image features with Transformer incurs a high computation cost. In contrast to object detection, which mainly focuses on local and neighborhood regions, line segment detection needs to consider both fine-grained local features and global information. According to previous works [11], [12], an efficient way to tackle the problem is a sequential two-stage structure, whose first component produces suggested regions for the second component to perform exact detection. Following this idea, FR-DETR performs both desired tasks in a refinement process using a coarse-to-fine strategy. This strategy enables FR-DETR to learn from multi-scale image features and produce precise predictions. In general, the model first analyzes the global information to locate possible targets coarsely and then uses the location information to examine local features and perform fine-grained recognition.

In the coarse stage, FR-DETR studies a low resolution feature map to identify potential regions containing symbols and line segments. The low resolution feature map sent into the coarse encoder is the output of the ResNet [33] C5 layer, and its size is $\frac{1}{32}$ of the original image resolution. After the encoding process, the encoded features and init-target queries are passed into each decoding layer's cross-attention module and self-attention module. The predictions produced by the coarse stage are considered as potential target regions and received by the fine decoding layers as mid-target queries. The coarse stage is important for improving the accuracy of the fine stage and can reduce the computation cost compared with directly processing high resolution features.

In the fine stage, based on the suggested potential regions, FR-DETR makes detailed predictions using a feature map with $\frac{1}{16}$ of the original image resolution, which is the output of the ResNet C4 layer and is twice the size, in each spatial dimension, of the feature map used in the coarse stage. In general, fine decoding is similar to that in the coarse stage. The main difference is that it processes more detailed information and focuses on the suggested regions to conclude predictions, making the fine stage crucial for accomplishing precise fine-grained detection.

While performing prediction, each detection task has one prediction head shared by the coarse and fine stages to make predictions precisely and progressively.

D. TARGET PREDICTION LOSS
In the encoding-decoding system, the multi-scale encoder-decoder gradually refines the $N$ initial queries for symbols and the $N$ initial queries for line segments into the same number of final queries. In the prediction process, each type of final query is fed into its corresponding FFN, which consists of a classifier and a regressor, to predict the category confidence $p$ and the coordinates of every target. If a final query belongs to symbol detection, the coordinate prediction is a bounding box $b = (c_x, c_y, w, h)$, which denotes the center point, width and height of the box. Otherwise, a line segment prediction $l$ is represented by two endpoints $(p_1, p_2)$, where $p_1 = (x_1, y_1)$, $p_2 = (x_2, y_2)$.

The set of predictions $\hat{t}$ has $N$ targets, and the set of ground truth $t$ has $M$ elements, where normally $N > M$. To assign a bipartite matching between the predictions and ground truth, $t$ is assumed to be a set that is padded with no object ($\varnothing$) to meet the size of $\hat{t}$. In this case, an optimization for the bipartite matching is used to find the permutation with the lowest cost:

$$L_{match}(t, \hat{t}) = \sum_{i=1}^{N} \mathbb{I}_{\{c_i \neq \varnothing\}} \left[ \lambda_1 L_{target}(t_i, \hat{t}_{\hat{\sigma}_i}) - \lambda_2 p_{\hat{\sigma}_i} \right] \tag{1}$$

$$\hat{\sigma} = \arg\min_{\sigma} L_{match}(t, \hat{t}) \tag{2}$$

where $\hat{\sigma}$ is the indices of the matched predictions in set $\hat{t}$. $L_{match}$ computes the total pair-wise cost of matching the ground truth $t_i$ and prediction $\hat{t}_{\hat{\sigma}_i}$. $\mathbb{I}$ is an indicator function used to drop $\varnothing$ when computing $L_{match}$, and $c_i$ represents the class of $t_i$. $L_{target}$ denotes the matching cost functions for symbol detection $L_{symbol}$ and line segment detection $L_{line}$.

$$L_{symbol}(b_i, \hat{b}_{\hat{\sigma}_i}) = \lambda_{L1} \lVert b_i - \hat{b}_{\hat{\sigma}_i} \rVert_1 + \lambda_{IoU} L_{IoU} \tag{3}$$

$$L_{line}(l_i, \hat{l}_{\hat{\sigma}_i}) = \lambda_{L1} \lVert l_i - \hat{l}_{\hat{\sigma}_i} \rVert_1 \tag{4}$$

where $L_{symbol}$ consists of an L1 loss and an IoU loss. The L1 loss computes the distance between the ground truth and the symbol prediction, and $L_{IoU}$ uses the generalized IoU loss to measure the similarity between bounding boxes. Likewise, $L_{line}$ represents the L1 distance between the corresponding endpoint pairs of the ground truth and predicted line segments.

For the two detection tasks, each task loss must also evaluate the results of classification. By adding a cross-entropy loss term, each task loss can be represented as:

$$\hat{L}_{symbol}(b, \hat{b}) = \sum_{i=1}^{N} \lambda_{cls} L_{cls}(b_i, \hat{b}_{\hat{\sigma}_i}) + L_{symbol}(b_i, \hat{b}_{\hat{\sigma}_i}) \tag{5}$$

$$\hat{L}_{line}(l, \hat{l}) = \sum_{i=1}^{N} \lambda_{cls} L_{cls}(l_i, \hat{l}_{\hat{\sigma}_i}) + L_{line}(l_i, \hat{l}_{\hat{\sigma}_i}) \tag{6}$$

where $\hat{L}_{symbol}$ and $\hat{L}_{line}$ are each task's loss term, and $L_{cls}$ denotes the cross-entropy loss term for classification. The total loss of FR-DETR is formulated as:

$$L_{total} = \hat{L}_{symbol} + \hat{L}_{line} \tag{7}$$
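The matching step in Eqs. (1), (2) and (4) can be illustrated for line segments with a small brute-force search. This is a sketch under stated assumptions, not the authors' code: the weights `lam1` and `lam2`, the toy segments, and the function names are hypothetical, and exhaustive enumeration is used only because the example is tiny. A practical implementation would use a polynomial-time assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations

def l1_distance(a, b):
    """L1 distance between two coordinate tuples of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(gt, pred, confidence, lam1=5.0, lam2=1.0):
    """Pairwise cost lam1 * L1(gt, pred) - lam2 * p, in the spirit of Eq. (1).

    lam1 and lam2 are placeholder weights chosen for this example.
    """
    return lam1 * l1_distance(gt, pred) - lam2 * confidence

def best_assignment(gts, preds, confs):
    """Assign each ground truth to a distinct prediction at minimum total cost.

    Predictions left unassigned correspond to the 'no object' padding,
    which contributes nothing to the cost (the indicator in Eq. (1)).
    Returns (sigma, cost) where sigma[i] is the prediction index for gts[i].
    """
    best, best_cost = None, float("inf")
    for sigma in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(g, preds[j], confs[j])
                   for g, j in zip(gts, sigma))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best, best_cost

# Line segments as (x1, y1, x2, y2) in normalized coordinates (toy values).
gts = [(0.1, 0.1, 0.9, 0.1), (0.5, 0.2, 0.5, 0.8)]
preds = [(0.5, 0.21, 0.5, 0.79), (0.0, 0.0, 0.0, 0.0), (0.1, 0.12, 0.9, 0.1)]
confs = [0.9, 0.1, 0.8]
sigma, cost = best_assignment(gts, preds, confs)
```

Here the first ground-truth segment is matched to the third prediction and the second to the first prediction, while the degenerate second prediction is left unmatched and would be supervised as "no object".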


IV. EXPERIMENT AND RESULTS
A. DATASET
(1) CLEF-IP dataset. The original CLEF-IP dataset released in 2011 contains machine-generated flowchart images and other structural diagrams. After removing non-flowcharts, approximately 60 images remain, most of which are provided by the European Patent Office (EPO) for the patent retrieval study. These flowcharts have a simple white background, and their structures are drawn in black. The dataset has three main symbols: rectangle for processing action, diamond for decision, and oval for terminator.

TABLE 1. Statistics of the new dataset.

(2) The proposed dataset. To enrich the symbol categories and structural complexity, public flowchart images are collected from the Internet by using image search engines, such as Google Image, Bing Image and Baidu Image. After filtering low quality images and removing duplicates, a dataset containing more than 1,000 images is constructed. The new dataset includes 25K+ symbol instances and 20K+ line segments. Statistical details are shown in Table 1. Moreover, the backgrounds of the images are not all purely white. For example, some samples have colorful backgrounds with line textures.

The FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images. Data augmentation methods, including random resize, random flip, and random crop, are applied in all experiments. Usually, applying augmentations [31], such as copy-pasting, is an effective and efficient way to improve the detection results of small objects. Because the arrow class, the only category of small objects in the dataset, already has the largest number of instances, specific augmentation for this class is unnecessary. The input images are resized to 600 pixels on the longer side during the training process and at least 600 pixels on the shorter side during the evaluation process.

Algorithm 1 Task-by-Task Training Strategy
Input:
    Target network $\mathcal{N}$ with parameters:
        $\Theta = \{\theta^{c}_{line}, \theta^{c}_{symbol}, \theta^{f}_{line}, \theta^{f}_{symbol}, \theta_{backbone}\}$; // c = coarse stage, f = fine stage
    Training data set: $S$;
    Preset train epochs: $\Phi = \{\varphi_{warmup}, \varphi_{end}, \varphi_{c}, \varphi_{f}\}$
Output: Well-trained network: $\mathcal{N}$
for $i$ in $\{c, f\}$ do
    $\Theta \leftarrow \{\theta^{i}_{line}, \theta^{i}_{symbol}\}$; // Initialization
    if $i = c$ then
        $\Theta \leftarrow \Theta \cup \{\theta_{backbone}\}$; // Train backbone in the coarse stage
    end if
    Train($\mathcal{N}$, $S$) for $\varphi_{warmup}$ epochs;
    $\Theta \leftarrow \Theta \setminus \{\theta^{i}_{symbol}\}$; // Freeze $\theta^{i}_{symbol}$
    Train($\mathcal{N}$, $S$) for $\varphi_{i}$ epochs;
    $\Theta \leftarrow \Theta \cup \{\theta^{i}_{symbol}\} \setminus \{\theta^{i}_{line}\}$; // Freeze $\theta^{i}_{line}$ and activate $\theta^{i}_{symbol}$
    Train($\mathcal{N}$, $S$) for $\varphi_{i}$ epochs;
    $\Theta \leftarrow \Theta \cup \{\theta^{i}_{line}\}$; // Activate all parameters
    Train($\mathcal{N}$, $S$) for $\varphi_{end}$ epochs; // Stage $i$ finished
end for
return Trained Network $\mathcal{N}$;

B. IMPLEMENTATION
1) NETWORK ARCHITECTURE
ResNet-50 and ResNet-101 are used as backbones to generate feature maps from the input image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$. The coarse encoder and fine encoder take the output from the C5 layer and C4 layer, respectively. The output from the C5 layer is a low resolution feature map $f_0 \in \mathbb{R}^{C \times H \times W}$, where $C = 2048$, $H = \frac{H_0}{32}$, $W = \frac{W_0}{32}$. The output from the C4 layer is a feature map with higher resolution and $C = 1024$, $H = \frac{H_0}{16}$, $W = \frac{W_0}{16}$. Before being fed into the Transformer encoder, the feature map is compressed from $C$ channels to 256 by a 1 × 1 convolution. Then, with a fixed sine/cosine positional encoding, the new feature map is sent into the Transformer encoder. To process multi-scale features, the network sets up two independent encoder-decoder structures. Similar to the standard Transformer structure, each encoder has six encoding layers and each decoder has six decoding layers, all with eight attention heads. The two tasks share one encoder-decoder structure at each stage, and use separate prediction heads to produce the final prediction. The shared structure can efficiently reduce the number of network parameters and the cost of learning and inference.

2) TRAINING STRATEGY
A single GeForce RTX 3090 GPU is used to train and test the model in all experiments. The backbone weights are initialized using the pre-trained model provided by DETR. To accomplish coarse-to-fine detection, the network first


TABLE 2. Detection results of symbol and line segment/edge. (Edge is for arrow R-CNN only.)

TABLE 3. Comparison of parameter volume and inference time among different models using ResNet50 as backbone.

FIGURE 4. Detection results of FR-DETR on the new dataset. (a) shows the results of recognizing flowcharts with complex structures, dense targets and
broken edges. (b) shows the recognition results of flowcharts with texture backgrounds.

trains only the coarse stage for 500 epochs. Subsequently, all parameters of the coarse encoder-decoder are frozen, and the fine stage is trained for 300 epochs after being initialized with the coarse weights. Different training paradigms, such as an end-to-end strategy and a task-by-task strategy, are applied to train the model. End-to-end training simply trains the entire network at once, so symbol detection and line segment detection are trained jointly. Task-by-task training allows the model to focus on one task at a time. In detail, the entire model is first trained for 50 warm-up epochs to


FIGURE 5. Visualization of FR-DETR’s coarse-to-fine decoding process. The first column shows the input images. The (b) to (d) columns show the
decoding results for symbols, line segments and both at the end of the coarse stage. Likewise, the (e) to (g) columns show the decoding results at the
end of the fine stage. The last column shows detection results.

reach rough convergence. Then, with all parameters of the other branch frozen, each task is trained for 300 epochs in
the coarse stage and 200 epochs in the fine stage. Finally, all parameters are unfrozen and trained together for another
100 epochs. The specific training steps are presented in Algorithm 1. In the end-to-end training strategy, the learning
rate is initially set to 0.0001 and multiplied by a factor of 0.1 every 200 epochs in the coarse stage and every
120 epochs in the fine stage. In the task-by-task training strategy, learning rates are shared between tasks only during
the warm-up and joint training stages. AdamW is used as the optimizer, and the weight decay is set to $10^{-4}$.

C. EVALUATION METRIC
Following pioneering flowchart recognition works, the evaluation metric selected in this paper is the F1-Score, which is
a standard metric in flowchart recognition. The F1-Score is the harmonic mean of Precision and Recall, where Precision
measures the percentage of predictions that are correct and Recall measures the percentage of targets that are
correctly detected.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$

where TP (true positive) denotes correct positive predictions of positive targets, FP (false positive) denotes incorrect
positive predictions of negative targets, and FN (false negative) denotes incorrect negative predictions of positive
targets.

D. RESULTS AND COMPARISONS
Table 2 reports the detection results for symbols and line segments on the new dataset in terms of the evaluation
metric. The overall performance of FR-DETR is better than that of Arrow R-CNN. Table 2 also shows that all two-stage
models achieve better results than the single-stage model, which indicates that the two-stage structure can effectively
improve detection accuracy. The effectiveness of multiple decoders is also tested. In this case, the network structure
still has a shared backbone and encoders, but the decoding results sent into each prediction head are produced by their
respective decoders. This structure allows each decoding module to focus on its specific task. The results show that the
multi-decoder does improve the model performance. Compared with the single-task models DETR and LETR, FR-DETR shows a
slight gap in results, as the multi-task setting makes it difficult to find a globally optimal solution for all tasks.
As shown in Table 3, using ResNet50 as the backbone, FR-DETR has an inference time similar to that of either DETR or
LETR. However, because a single model handles both tasks, FR-DETR effectively reduces the total inference time, with
only a slight increase in the number of parameters. Overall, FR-DETR significantly reduces time consumption at the cost
of a small decrease in accuracy.

TABLE 4. Comparison of different training strategies.

The performance of the different training strategies is compared in Table 4. The two strategies achieve almost the same
results, with task-by-task training slightly better than the end-to-end strategy.

FR-DETR is robust and can handle complex flowchart images. Unlike the CLEF-IP dataset, the new dataset provides
flowcharts with more complex structures and backgrounds, enabling FR-DETR to address challenging problems such as broken
edges and misleading line textures. As shown in Fig. 4, the images on the left have a dense layout of symbols and broken
connecting edges, while the images on the right are flowcharts with texture backgrounds. The detection results show that
FR-DETR can accurately localize each symbol and line segment regardless of structure complexity, backgrounds and broken
edges.
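The Precision, Recall and F1 definitions in Equations (8) to (10) can be illustrated with a short Python sketch. It is not the authors' evaluation code: the function name `precision_recall_f1` and the example counts are hypothetical, and the rule for deciding whether a predicted symbol or line segment counts as a true positive is assumed to have been applied beforehand, since it is not specified in this section.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw detection counts.

    Empty denominators are mapped to 0.0 so the function never divides by zero;
    that convention is an assumption, not something stated in the paper.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (8)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # Eq. (9)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # Eq. (10)
    return precision, recall, f1


# Hypothetical counts chosen only for illustration (not results from the paper).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a detector cannot obtain a high score by trading recall away for precision or the reverse.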

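The step-decay learning rate described for the end-to-end strategy (initial rate 0.0001, multiplied by 0.1 every 200 epochs in the coarse stage and every 120 epochs in the fine stage) can be sketched as follows. This is an illustration under stated assumptions, not the authors' training script: the names `step_decay_lr`, `STAGES` and `lr_change_points` are hypothetical, the stage lengths of 500 and 300 epochs are taken from the training description, and restarting the fine stage from the initial rate is an assumption that the text does not confirm.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=200):
    """Learning rate at `epoch` (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# (stage name, total epochs, decay interval in epochs)
# Assumption: each stage starts again from base_lr.
STAGES = [("coarse", 500, 200), ("fine", 300, 120)]


def lr_change_points(stages=STAGES):
    """List (stage, epoch_in_stage, lr) at every epoch where the rate changes."""
    points = []
    for name, total_epochs, step_size in stages:
        for epoch in range(0, total_epochs, step_size):
            points.append((name, epoch, step_decay_lr(epoch, step_size=step_size)))
    return points


for stage, epoch, lr in lr_change_points():
    print(f"{stage:6s} epoch {epoch:3d}: lr = {lr:.0e}")
```

In a PyTorch setup, the same schedule would typically be expressed with `torch.optim.AdamW` (with `weight_decay=1e-4`) and `torch.optim.lr_scheduler.StepLR`; whether the original implementation does so is not shown in this section.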

TABLE 5. Comparison of FR-DETR and other methods on the CLEF-IP dataset.

The demonstration of FR-DETR’s coarse-to-fine decoding process is shown in Fig. 5. The results of the coarse stage are
produced by the coarse decoder decoding features from ResNet’s C5 layer. Although the target information is well
captured, the low resolution of these features limits the model from making precise predictions. The outputs of the
fine stage are generated by the fine decoder decoding high-resolution features from ResNet’s C4 layer with target
queries from the coarse stage. The overlay of attention heatmaps shows more detailed relations in the image space, which
is the key to the detector performance.

The overall recognition results on the CLEF-IP dataset, as shown in Table 5, are improved to 98.7% precision and 98.1%
recall by FR-DETR, which indicates that the proposed method outperforms the prior approaches.

V. CONCLUSION
This paper presents a new method for machine-generated flowchart recognition, which accomplishes the task by detecting
symbols and connecting edges using deep-learning-based object detectors and line segment detectors. An end-to-end
multi-task model named FR-DETR that contains a multi-scale Transformer structure is introduced to reduce the high cost
caused by using separate models. Its well-performing joint detection of symbols and line segments significantly
simplifies the flowchart recognition task. A new machine-generated flowchart dataset is also constructed for practical
model training and evaluation.

Although FR-DETR outperforms other flowchart recognition methods, it is not lightweight enough to meet the requirements
of real-time processing and mobile application development. In addition, recognizing flowcharts and understanding the
corresponding structural semantics with an end-to-end model is still a challenge.

LIANSHAN SUN received the Ph.D. degree in software engineering from Peking University, in 2009. He is an Associate
Professor with the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and
Technology. His main research interests include blockchain, information security, and artificial intelligence.

HANCHAO DU received the graduate degree in computer and its application from Xidian University, in 2015. He is currently
pursuing the master’s degree with the School of Electronic Information and Artificial Intelligence, Shaanxi University
of Science and Technology. His main research interests include image recognition and artificial intelligence.

TAO HOU is a Lecturer with the School of Electronic Information and Artificial Intelligence, Shaanxi University of
Science and Technology. His main research interests include software engineering and artificial intelligence.
