FR-DETR: End-to-End Flowchart Recognition With Precision and Robustness
Digital Object Identifier 10.1109/ACCESS.2022.3183068
ABSTRACT Traditional flowchart recognition methods have difficulties in detecting newly added symbols and distinguishing targets from complex backgrounds such as line textures. Existing deep-learning-based object detectors and line segment detectors are promising for recognizing targets and distinguishing them from texture backgrounds. However, using two separate detectors inevitably causes unnecessary training and inference costs. Moreover, the insufficient volume and diversity of the currently available datasets limit the effectiveness of model training. To address these issues, this paper proposes an end-to-end multi-task network, FR-DETR (Flowchart Recognition DEtection TRansformer), and a new dataset for precise and robust flowchart recognition. FR-DETR comprises a CNN backbone and a shared multi-scale Transformer structure that performs symbol detection and edge detection using shared feature maps and respective prediction heads in a coarse-to-fine refinement process. The coarse stage analyzes low-resolution features and suggests candidate regions containing potential targets; the fine stage then produces accurate predictions using high-resolution features. Meanwhile, a new dataset is constructed to provide more symbol types and complex backgrounds for network training and evaluation. It contains more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments. The experiments show that FR-DETR achieves an overall precision and recall of 94.0% and 93.1% on the proposed dataset, and 98.7% and 98.1% on the CLEF-IP dataset, respectively, all of which outperform the prior methods.
INDEX TERMS Flowchart recognition, multi-task learning, multi-scale Transformer, object detection, line
segment detection.
Different from traditional methods, deep-learning-based computer vision technologies are capable of focusing on the desired targets, which in flowcharts can be summarized as symbols and edges. Deep-learning-based object detectors are currently widely used for detecting objects within images. Although these approaches perform well in symbol detection, their bounding-box representation makes it difficult to recognize line segments that are short or nearly parallel to the axes. Some deep-learning-based line segment detectors that treat edges as endpoint pairs can instead achieve good line segment detection performance. Consequently, the flowchart recognition task can be divided into recognizing symbols using object detection and detecting edges using line segment detection. As shown in Fig. 1, symbols and arrows are indicated by bounding boxes, and edges are indicated by line segments.

In addition, to satisfy the requirements of data volume and complexity for model training, a new flowchart dataset is also constructed. Compared with CLEF-IP (Cross-Language Evaluation Forum Intellectual Property), the only publicly accessible machine-generated flowchart dataset, which has around 60 images, the new dataset has more than 1000 machine-generated flowchart images, 25K+ symbol instances with nine categories, and 20K+ line segments.

The main contributions of this paper are:
1) It demonstrates that machine-generated flowchart recognition can be accomplished using deep-learning-based object detection and line segment detection to handle the increasing symbol diversity and background complexity of flowchart images.
2) It proposes an end-to-end multi-task learning network that combines object detection and line segment detection to reduce the costs caused by separate models. The model jointly detects symbols and edges in flowcharts and employs a multi-scale Transformer structure to improve the recognition accuracy of both tasks.
3) The proposed method achieves a recognition accuracy that outperforms the prior machine-generated flowchart recognition methods.
4) A new machine-generated flowchart dataset is established to address the data shortage for deep learning model training.
B. TRANSFORMER ARCHITECTURE
In recent years, the Transformer [32] has become the standard backbone for many natural language processing (NLP) models and has achieved remarkable success. Its self-attention and cross-attention mechanisms can build strong relations among sequence-format inputs. Recently, the Transformer has also attracted increasing attention in the field of computer vision [34]. Carion et al. [14] introduced a Transformer-based end-to-end object detection network, DETR, that removes pre-designed anchors and non-maximum suppression (NMS). It detects objects through interactions between a fixed number of queries and the encoded image features. Following the basic query concept of DETR, Xu et al. [19] proposed a Transformer-based line segment detector, LETR. Unlike the prior line segment detection approaches consisting of heuristics-guided steps, the LETR detector directly regresses the endpoints of each line segment and achieves state-of-the-art performance on the relevant line segment detection datasets.
C. MULTI-TASK APPROACHES
Traditionally, task-oriented networks have been independently designed and trained for each task. However, just as humans solve multiple tasks concurrently, many real-world problems are also multi-modal. This motivates researchers to study generalized methods that can accomplish all desired tasks with a single model. Deep-learning-based MTL (multi-task learning) models aim to improve network generalization and the capability of jointly learning shared information.
FIGURE 2. Due to the complex structure of machine-generated flowcharts, errors occur when applying Arrow R-CNN. Bounding boxes indicate entire connecting edges, and the red and blue points represent the start and the tail of the edge they belong to. (a) shows the box and keypoint detection errors when edges have no arrow. (b) shows the detection errors when edges have a nested layout.

Compared with single-task models, multi-task networks have advantages such as a reasonable reduction in model size and an increase in inference speed achieved by sharing inherent parts of the network structure [24]. Vandenhende et al. [25] considered the importance of task interactions while distilling information during multi-task learning. Hu et al. [26] jointly learned multiple tasks across different domains with a unified Transformer. Wu et al. [21] introduced a multi-task network that can jointly handle object detection, drivable-area segmentation, and lane detection in autonomous driving.

III. PROPOSED APPROACH
A. MOTIVATION
Although Arrow R-CNN [30] achieves remarkable results in handwritten flowchart recognition, it is not fully applicable to recognizing machine-generated flowcharts. Every connecting edge detected by Arrow R-CNN must contain an arrow as a keypoint. However, several connecting edges within a machine-generated flowchart may share one arrow, which causes failures in detecting edges and keypoints when applying Arrow R-CNN. As shown in Fig. 2, although the representation of connecting edges is adjusted to an entire connection between two symbols, errors in detecting edges and keypoints still occur. Based on the structural analysis of connecting edges, a line segment representation can better handle the dilemma in which Arrow R-CNN is stuck. Line segments are not appropriately described by bounding boxes because of their highly variable aspect ratios and the limited choices of anchors. Many works [16]–[18] follow the procedure of first producing junction proposals and then converting them into line segments, which causes the performance of these methods to rely heavily on junction detection. However, in a flowchart, junctions appear on both connecting edges and symbols, which brings numerous redundant instances when using junction-based approaches.
FIGURE 3. FR-DETR architecture. A convolutional network is used as the backbone to produce two feature maps with different resolutions from an input flowchart. The coarse encoder-decoder structure first encodes the smaller feature map and then generates the mid-queries through interactions between the encoded features and the init-queries. The fine encoder-decoder structure encodes the higher-resolution feature map and outputs the final-queries based on the interaction between the corresponding encoded features and the mid-queries. In the end, the final-queries are fed into feed-forward networks to make the final predictions.
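To make the data flow summarized in the Fig. 3 caption concrete, the following minimal sketch traces the coarse-to-fine query hand-off (init-queries → mid-queries → final-queries). It is an illustration under assumptions, not the released implementation: the stages are reduced to single PyTorch layers, and all variable names and shapes are ours.

```python
# Illustrative coarse-to-fine query hand-off (names and single-layer stages
# are simplifications; the paper uses six layers per encoder/decoder).
import torch
import torch.nn as nn

d, N = 256, 100
coarse_enc = nn.TransformerEncoderLayer(d_model=d, nhead=8)
coarse_dec = nn.TransformerDecoderLayer(d_model=d, nhead=8)
fine_enc = nn.TransformerEncoderLayer(d_model=d, nhead=8)
fine_dec = nn.TransformerDecoderLayer(d_model=d, nhead=8)
head = nn.Linear(d, 4)                     # feed-forward prediction head

feat_small = torch.randn(15 * 20, 1, d)    # low-resolution features, flattened
feat_large = torch.randn(30 * 40, 1, d)    # high-resolution features, flattened
init_q = torch.zeros(N, 1, d)              # init-queries

mid_q = coarse_dec(init_q, coarse_enc(feat_small))  # coarse stage -> mid-queries
final_q = fine_dec(mid_q, fine_enc(feat_large))     # fine stage -> final-queries
pred = head(final_q).sigmoid()                      # final predictions
```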
In addition, because convolution has a fixed receptive field, CNN-based models have limitations in building long-range relations for targets like long lines. Recently, the Transformer-based line segment detector LETR [19] suggested a model that directly regresses the endpoint coordinates of each line segment. The attention mechanism introduced by the Transformer perfectly meets the need to distinguish wanted line segments from unwanted ones.

Inspired by the aforementioned methods, and with the aim of improving flowchart recognition accuracy while reducing the costs of using isolated models, this paper modifies LETR into a multi-task model that jointly accomplishes the two detection tasks. The model selected to perform symbol detection is DETR, for the following reasons: (1) DETR has a Transformer-based structure similar to that of LETR, which maximizes structure sharing between the two models and the overall cost reduction. (2) Other CNN-based object detection models and LETR have few shareable parts, which unavoidably results in insufficient structure sharing and difficulties in jointly analyzing features.

B. THE FR-DETR MODEL
The overall architecture of FR-DETR is designed based on a multi-scale Transformer encoder-decoder structure, as depicted in Fig. 3. The proposed flowchart recognition process can be divided into four sub-tasks: feature extraction, feature encoding, feature decoding, and target prediction.

1) FEATURE EXTRACTION
FR-DETR uses a CNN backbone to extract a feature map $f_0 \in \mathbb{R}^{C \times H \times W}$ from an input image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$. The channel dimension of $f_0$ is then reduced from $C$ to a smaller dimension $d$ by a $1 \times 1$ convolution, producing a new feature map $f \in \mathbb{R}^{d \times H \times W}$. To meet the encoder's requirement of sequence-format input, $f$ is flattened into another feature map $z \in \mathbb{R}^{d \times HW}$.
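As a concrete illustration of this step, a minimal PyTorch sketch might look as follows; the torchvision ResNet-50 backbone and the dimension $d = 256$ are assumptions consistent with the implementation details given later, not a reproduction of the authors' code.

```python
# Minimal sketch of the feature-extraction step: backbone -> 1x1 conv -> flatten.
import torch
import torch.nn as nn
import torchvision

class FeatureExtractor(nn.Module):
    def __init__(self, c_backbone=2048, d=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep everything up to and including the last residual stage (C5).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.reduce = nn.Conv2d(c_backbone, d, kernel_size=1)  # C -> d

    def forward(self, x):          # x:  (B, 3, H0, W0)
        f0 = self.backbone(x)      # f0: (B, C, H, W), H = H0/32, W = W0/32
        f = self.reduce(f0)        # f:  (B, d, H, W)
        return f.flatten(2)        # z:  (B, d, H*W), sequence format

z = FeatureExtractor()(torch.randn(1, 3, 480, 640))
print(z.shape)  # torch.Size([1, 256, 300])
```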
2) FEATURE ENCODING
The Transformer encoder is a stack of multiple encoding layers. Each layer consists of a multi-head self-attention module and a feed-forward network (FFN). Each encoding layer receives the processed features from its predecessor, learns the pairwise relations among them through self-attention, and delivers the output features to its FFN. In general, the flattened feature map $z \in \mathbb{R}^{d \times HW}$ is encoded into a new feature map $z_0 \in \mathbb{R}^{d \times HW}$. The positional encoding of $f$ is added so that the flattened feature map $z$ does not lose the spatial relations.
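A simplified sketch of the encoding step is shown below. Note two deliberate simplifications: a zero tensor stands in for the positional encoding of $f$, and the encoding is added once at the input, whereas DETR-style models typically inject it inside every attention layer.

```python
# Simplified feature encoding: six layers of self-attention + FFN over z + pos.
import torch
import torch.nn as nn

d, H, W = 256, 15, 20
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8), num_layers=6)

z = torch.randn(H * W, 1, d)    # flattened feature map z, shape (HW, B, d)
pos = torch.zeros(H * W, 1, d)  # stand-in for the positional encoding of f
z0 = encoder(z + pos)           # encoded feature map z0, same shape as z
```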
3) FEATURE DECODING
Following the standard Transformer architecture, the decoder transforms the $N$ embeddings of each target type, symbols and line segments, using multi-head self-attention and cross-attention modules. Like the positional encodings in the encoder, the input embeddings are learnable positional information that is added to the input of each layer; they are named target queries. Each decoding layer receives $z_0 \in \mathbb{R}^{d \times HW}$ from the last encoding layer and two types of target queries, symbol queries $b \in \mathbb{R}^{d \times N}$ and line queries $l \in \mathbb{R}^{d \times N}$, from its predecessor decoding layer. Both types of queries are first processed by the self-attention module, and then each entity interacts with the encoded feature map $z_0$ through the cross-attention module.
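The text does not spell out line by line how the two query types pass through the shared decoder, so the sketch below assumes the simplest reading: symbol queries and line queries are concatenated, refined jointly by one decoder (self-attention across all queries, then cross-attention to $z_0$), and split again for their task-specific heads. The head parameterizations follow the DETR/LETR conventions cited earlier.

```python
# Assumed shared decoding: concatenate symbol and line queries, decode, split.
import torch
import torch.nn as nn

d, N, HW = 256, 100, 300
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=8), num_layers=6)

sym_q = torch.zeros(N, 1, d)   # symbol queries b (learnable in practice)
line_q = torch.zeros(N, 1, d)  # line queries l (learnable in practice)
z0 = torch.randn(HW, 1, d)     # output of the last encoding layer

out = decoder(torch.cat([sym_q, line_q], dim=0), z0)  # self- then cross-attention
sym_feat, line_feat = out[:N], out[N:]

box_head = nn.Linear(d, 4)     # symbol head: box (cx, cy, w, h)
line_head = nn.Linear(d, 4)    # line head: endpoints (x1, y1, x2, y2)
boxes = box_head(sym_feat).sigmoid()
endpoints = line_head(line_feat).sigmoid()
```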
(2) The proposed dataset. To enrich the symbol categories and structural complexity, public flowchart images are collected from the Internet using image search engines such as Google Image, Bing Image, and Baidu Image. After filtering out low-quality images and removing duplicates, a dataset containing more than 1,000 images is constructed. The new dataset includes 25K+ symbol instances and 20K+ line segments; statistical details are shown in Table 1. Moreover, the backgrounds of the images are not all purely white. For example, some samples have colorful backgrounds with line textures.

The FR-DETR model is trained on the new dataset, which is randomly divided into 800 training images and 200 testing images. Data augmentation methods, including random resize, random flip, and random crop, are applied in all experiments. Usually, applying augmentations [31] such as copy-pasting is an effective and efficient way to improve the detection of small objects. However, because the arrow class, the only category of small objects in the dataset, already has the largest number of instances, specific augmentation for this class is unnecessary. The input images are resized to 600 pixels on the longer side during training and to at least 600 pixels on the shorter side during evaluation.

B. IMPLEMENTATION
1) NETWORK ARCHITECTURE
ResNet-50 and ResNet-101 are used as backbones to generate feature maps from the input image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$. The coarse encoder and the fine encoder take the outputs of the C5 and C4 layers, respectively. The output of the C5 layer is a low-resolution feature map $f_0 \in \mathbb{R}^{C \times H \times W}$ with $C = 2048$, $H = H_0/32$, and $W = W_0/32$; the output of the C4 layer is a higher-resolution feature map with $C = 1024$, $H = H_0/16$, and $W = W_0/16$. Before being fed into the Transformer encoder, each feature map is compressed from $C$ channels to 256 by a $1 \times 1$ convolution. Then, with a fixed sine/cosine positional encoding, the new feature map is sent into the Transformer encoder. To process the multi-scale features, the network sets up two independent encoder-decoder structures. Similar to the standard Transformer, each structure has six encoding layers and six decoding layers with eight attention heads. The two tasks share one encoder-decoder structure at each stage and use respective prediction heads to produce the final predictions. The shared structure efficiently reduces the number of network parameters and the costs of training and inference.
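A configuration sketch of the multi-scale feature preparation described above is given below. It assumes torchvision's IntermediateLayerGetter (the semi-private helper also used by DETR's reference code) to tap the C4/C5 stages; the encoder-decoder stacks themselves are only indicated in the trailing comment.

```python
# Tapping ResNet-50's C4/C5 stages and compressing both to 256 channels.
import torch
import torch.nn as nn
import torchvision
from torchvision.models._utils import IntermediateLayerGetter

resnet = torchvision.models.resnet50(weights=None)
body = IntermediateLayerGetter(resnet, return_layers={"layer3": "c4",
                                                      "layer4": "c5"})

reduce_c4 = nn.Conv2d(1024, 256, kernel_size=1)  # fine-stage input, stride 16
reduce_c5 = nn.Conv2d(2048, 256, kernel_size=1)  # coarse-stage input, stride 32

x = torch.randn(1, 3, 480, 640)
feats = body(x)
fine_in = reduce_c4(feats["c4"])    # (1, 256, 30, 40) = (d, H0/16, W0/16)
coarse_in = reduce_c5(feats["c5"])  # (1, 256, 15, 20) = (d, H0/32, W0/32)
# Each stage then adds a fixed sine/cosine positional encoding and runs its own
# encoder-decoder with six encoding layers, six decoding layers, and 8 heads.
```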
2) TRAINING STRATEGY
A single GeForce RTX 3090 GPU is used to train and test the model in all experiments. The backbone weights are initialized with the pre-trained model provided by DETR. To accomplish coarse-to-fine detection, the network first trains only the coarse stage for 500 epochs. Subsequently, all parameters of the coarse encoder-decoder are frozen, and the fine stage is trained for 300 epochs after being initialized with the coarse weights. Two training paradigms, an end-to-end strategy and a task-by-task strategy, are applied to train the model. The end-to-end training simply trains the entire network at once, so symbol detection and line segment detection are trained jointly. The task-by-task training allows the model to focus on one task at a time. In detail, the entire model is first trained for 50 warm-up epochs until it is roughly convergent. Then, while all the parameters of the other branch are frozen, each task is trained for 300 epochs in the coarse stage and 200 epochs in the fine stage. Finally, all the parameters are unfrozen and trained together for another 100 epochs. The specific training steps are presented in Algorithm 1. In the end-to-end strategy, the learning rate is initially set to 0.0001 and multiplied by a factor of 0.1 every 200 epochs in the coarse stage and every 120 epochs in the fine stage. In the task-by-task strategy, learning rates are shared between tasks only during the warm-up and joint training stages. AdamW is used as the optimizer, and the weight decay is set to $10^{-4}$.
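Algorithm 1 is not reproduced in this excerpt; the following is only a hedged sketch of the task-by-task schedule as described in the text. The attribute names `model.symbol_head` and `model.line_head`, the loss callables, and the exact extent of each frozen branch are illustrative assumptions, not the authors' API.

```python
# Sketch of the task-by-task schedule: warm-up, per-task training with the
# other branch frozen, then joint fine-tuning (epoch counts from the text).
from typing import Callable
import torch
from torch import nn
from torch.utils.data import DataLoader

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def train_task_by_task(model: nn.Module, loader: DataLoader,
                       joint_loss: Callable, symbol_loss: Callable,
                       line_loss: Callable) -> None:
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

    def run(epochs: int, loss_fn: Callable) -> None:
        for _ in range(epochs):
            for images, targets in loader:
                opt.zero_grad()
                loss_fn(model(images), targets).backward()
                opt.step()

    run(50, joint_loss)                    # warm-up until roughly convergent
    for task_loss, other in [(symbol_loss, model.line_head),
                             (line_loss, model.symbol_head)]:
        set_requires_grad(other, False)    # freeze the other branch
        run(300, task_loss)                # 300 coarse epochs (200 in fine stage)
        set_requires_grad(other, True)
    run(100, joint_loss)                   # unfreeze everything, train jointly
```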
TABLE 2. Detection results of symbols and line segments/edges. (Edge applies to Arrow R-CNN only.)
TABLE 3. Comparison of parameter volume and inference time among different models using ResNet50 as backbone.
FIGURE 4. Detection results of FR-DETR on the new dataset. (a) shows the results of recognizing flowcharts with complex structures, dense targets and
broken edges. (b) shows the recognition results of flowcharts with texture backgrounds.
FIGURE 5. Visualization of FR-DETR's coarse-to-fine decoding process. The first column shows the input images. Columns (b) to (d) show the decoding results for symbols, line segments, and both at the end of the coarse stage. Likewise, columns (e) to (g) show the decoding results at the end of the fine stage. The last column shows the detection results.

TABLE 4. Comparison of different training strategies.
C. EVALUATION METRIC
Following pioneering flowchart recognition works, the evaluation metric selected in this paper is the F1-score, a standard metric in flowchart recognition. The F1-score is the harmonic mean of Precision and Recall, where Precision measures the percentage of predictions that are correct and Recall measures the percentage of targets that are correctly detected:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (9)$$

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (10)$$

where TP (true positives) denotes correct positive predictions of positive targets, FP (false positives) denotes incorrect positive predictions of negative targets, and FN (false negatives) denotes incorrect negative predictions of positive targets.
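For reference, a small helper that evaluates Eqs. (8)–(10) from raw counts:

```python
# Precision, recall, and F1 from true-positive/false-positive/false-negative
# counts, guarding against empty denominators.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0                 # Eq. (8)
    recall = tp / (tp + fn) if tp + fn else 0.0                    # Eq. (9)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)                          # Eq. (10)
    return precision, recall, f1

# Example: 940 correct detections, 60 spurious ones, 70 missed targets.
print(precision_recall_f1(940, 60, 70))  # ~(0.940, 0.931, 0.935)
```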
D. RESULTS AND COMPARISONS
Regarding the evaluation metric, Table 2 reports the detection results for symbols and line segments on the new dataset. The overall performance of FR-DETR is clearly better than that of Arrow R-CNN. Table 2 also shows that all two-stage models achieve better results than the single-stage model, which means that the two-stage structure can effectively improve detection accuracy. The effectiveness of multiple decoders is also tested. In this case, the network still has a shared backbone and shared encoders, but the decoding results sent into each prediction head are produced by respective decoders. This structure allows each decoding module to focus on its specific task. The results show that the multi-decoder design does improve the model's performance. Compared with DETR and LETR, there is a slight gap between the results of FR-DETR and those of the single-task models, since the multi-task setting makes it difficult to find an optimal global solution for all tasks. As shown in Table 3, using ResNet50 as the backbone, FR-DETR has an inference time similar to that of DETR or LETR alone; because it replaces the two separate models, it reduces the total inference time with only a slight increase in the number of parameters. Generally, FR-DETR can significantly reduce time consumption with a small decrease in accuracy.

The performance comparison of the different training strategies is shown in Table 4. The two strategies achieve almost the same results, with the task-by-task training being slightly better than the end-to-end strategy.

FR-DETR is robust and can handle complex flowchart images. Unlike the CLEF-IP dataset, the new dataset provides flowcharts with more complex structures and backgrounds, enabling FR-DETR to address challenging problems such as broken edges and misleading line textures. As shown in Fig. 4, the images on the left have a dense layout of symbols and broken connecting edges, while the images on the right are flowcharts with texture backgrounds. The detection results show that FR-DETR can accurately localize each symbol and line segment regardless of structural complexity, backgrounds, and broken edges.
TABLE 5. Comparison of FR-DETR and other methods on the CLEF-IP dataset.

The demonstration of FR-DETR's coarse-to-fine decoding process is shown in Fig. 5. The results of the coarse stage are produced by the coarse decoder decoding features from ResNet's C5 layer. Although the target information is well captured, the low-resolution features limit the model from making precise predictions. The outputs of the fine stage are generated by the fine decoder decoding high-resolution features from ResNet's C4 layer with the target queries from the coarse stage. The overlay of attention heatmaps shows more detailed relations in the image space, which is the key to the detector's performance.

The overall recognition results on the CLEF-IP dataset, as shown in Table 5, are improved to 98.7% precision and 98.1% recall by FR-DETR, which indicates that the proposed method outperforms the prior approaches.
V. CONCLUSION
This paper presents a new method for machine-generated flowchart recognition, which accomplishes the task by detecting symbols and connecting edges using deep-learning-based object detection and line segment detection. An end-to-end multi-task model named FR-DETR, which contains a multi-scale Transformer structure, is introduced to reduce the high cost of using separate models. Its effective joint detection of symbols and line segments significantly simplifies the flowchart recognition task. A new machine-generated flowchart dataset is also constructed for practical model training and evaluation.

Although FR-DETR outperforms other flowchart recognition methods, it is not lightweight enough to meet the requirements of real-time processing and mobile application development. In addition, recognizing flowcharts and understanding the corresponding structural semantics with an end-to-end model is still a challenge.
REFERENCES
[1] J. Lladós, "Two decades of GREC workshop series. Conclusions of GREC2017," in Graphics Recognition. Current Trends and Evolutions (Lecture Notes in Computer Science), vol. 11009, A. Fornés and B. Lamiroy, Eds. Cham, Switzerland: Springer, Nov. 2018, pp. 163–168, doi: 10.1007/978-3-030-02284-6_14.
[2] N. Bhatti and A. Hanbury, "Image search in patents: A review," Int. J. Document Anal. Recognit., vol. 16, no. 4, pp. 309–329, Dec. 2013, doi: 10.1007/S10032-012-0197-5.
[3] S. Adams, "Electronic non-text material in patent applications—Some questions for patent offices, applicants and searchers," World Patent Inf., vol. 27, no. 2, pp. 99–103, Jun. 2005, doi: 10.1016/J.WPI.2004.12.005.
[4] F. Piroi, M. Lupu, A. Hanbury, A. P. Sexton, W. Magdy, and I. V. Filippov, "CLEF-IP 2012: Retrieval experiments in the intellectual property domain," in Proc. CLEF Eval. Labs Workshop, Rome, Italy, Sep. 2012, pp. 1–13. [Online]. Available: http://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-PiroiEt2012.pdf
[5] C. Supaartagorn, "Web application for automatic code generator using a structured flowchart," in Proc. 8th IEEE Int. Conf. Softw. Eng. Service Sci. (ICSESS), Beijing, China, Nov. 2017, pp. 114–117, doi: 10.1109/ICSESS.2017.8342876.
[6] J. Andreas et al., "Task-oriented dialogue as dataflow synthesis," Trans. Assoc. Comput. Linguistics, vol. 8, pp. 556–571, Sep. 2020, doi: 10.1162/TACL_A_00333.
[7] M. Rusiñol, L.-P. de las Heras, J. Mas, O. R. Terrades, D. Karatzas, A. Dutta, G. Sánchez, and J. Lladós, "CVC-UAB's participation in the flowchart recognition task of CLEF-IP," in Proc. CLEF Eval. Labs Workshop, Rome, Italy, Sep. 2012, pp. 1–11. [Online]. Available: http://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-RusinolEt2012.pdf
[8] M. Rusiñol, L.-P. de las Heras, and O. R. Terrades, "Flowchart recognition for non-textual information retrieval in patent search," Inf. Retr., vol. 17, nos. 5–6, pp. 545–562, Oct. 2014, doi: 10.1007/S10791-013-9234-3.
[9] R. Mörzinger, R. Schuster, and A. Horti, "Visual structure analysis of flow charts in patent images," in Proc. CLEF Eval. Labs Workshop, Rome, Italy, Sep. 2012, pp. 1–8. [Online]. Available: http://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-MorzingerEt2012.pdf
[10] A. Thean, J.-M. Deltorn, P. Lopez, and L. Romary, "Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012," in Proc. CLEF Eval. Labs Workshop, Rome, Italy, Sep. 2012, pp. 1–18. [Online]. Available: http://ceur-ws.org/Vol-1178/CLEF2012wn-CLEFIP-TheanEt2012.pdf
[11] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, Dec. 2015, pp. 1440–1448, doi: 10.1109/ICCV.2015.169.
[12] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017, doi: 10.1109/TPAMI.2016.2577031.
[13] G. Jocher et al., "Ultralytics/YOLOv5: V6.1—TensorRT, TensorFlow edge TPU and OpenVINO export and inference," Los Angeles, CA, USA, Tech. Rep., Feb. 2022. [Online]. Available: https://zenodo.org/record/6222936 and https://ultralytics.com, doi: 10.5281/zenodo.6222936.
[14] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Glasgow, U.K., Aug. 2020, pp. 213–229, doi: 10.1007/978-3-030-58452-8_13.
[15] G. Gu, B. Ko, S. Go, S.-H. Lee, J. Lee, and M. Shin, "Towards light-weight and real-time line segment detection," 2021, arXiv:2106.00186.
[16] Z. Zhang, Z. Li, N. Bi, J. Zheng, J. Wang, K. Huang, W. Luo, Y. Xu, and S. Gao, "PPGNet: Learning point-pair graph for line segment detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 7098–7107, doi: 10.1109/CVPR.2019.00727.
[17] Y. Zhou, H. Qi, and Y. Ma, "End-to-end wireframe parsing," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, South Korea, Oct. 2019, pp. 962–971, doi: 10.1109/ICCV.2019.00105.
[18] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, "Learning to parse wireframes in images of man-made environments," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 625–626, doi: 10.1109/CVPR.2018.00072.
[19] Y. Xu, W. Xu, D. Cheung, and Z. Tu, "Line segment detection using transformers without edges," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4257–4266, doi: 10.1109/CVPR46437.2021.00424.
[20] Y. Qian, J. M. Dolan, and M. Yang, "DLT-Net: Joint detection of drivable areas, lane lines, and traffic objects," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 11, pp. 4670–4679, Nov. 2019.
[21] D. Wu, M. Liao, W. Zhang, X. Wang, X. Bai, W. Cheng, and W. Liu, "YOLOP: You only look once for panoptic driving perception," 2021, arXiv:2108.11250.
[22] D. Vu, B. Ngo, and H. Phan, "HybridNets: End-to-end perception network," 2022, arXiv:2203.09035.
[23] S. Zhang, "Research on flowchart recognition by fusing structural model and corner feature," M.S. thesis, College Electr. Inf. Eng., Shaanxi Univ. Sci. Technol., Xi'an, China, 2018. [Online]. Available: https://oversea.cnki.net/KCMS/detail/detail.aspx?dbcode=CMFD&dbname=CMFD201802&filename=1018204634.nh
[24] S. Vandenhende, S. Georgoulis, W. Van Gansbeke, M. Proesmans, D. Dai, and L. Van Gool, "Multi-task learning for dense prediction tasks: A survey," IEEE Trans. Pattern Anal. Mach. Intell., early access, Jan. 26, 2021, doi: 10.1109/TPAMI.2021.3054719.
[25] S. Vandenhende, S. Georgoulis, and L. van Gool, "MTI-Net: Multi-scale task interaction networks for multi-task learning," in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Glasgow, U.K., Aug. 2020, pp. 527–543, doi: 10.1007/978-3-030-58548-8_31.
[26] R. Hu and A. Singh, "UniT: Multimodal multitask learning with a unified transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, QC, Canada, Oct. 2021, pp. 1419–1429, doi: 10.1109/ICCV48922.2021.00147.
[27] C. Carton, A. Lemaitre, and B. Coüasnon, "Fusion of statistical and structural information for flowchart recognition," in Proc. 12th Int. Conf. Document Anal. Recognit., Washington, DC, USA, Aug. 2013, pp. 1210–1214, doi: 10.1109/ICDAR.2013.245.
[28] A. Lemaitre, H. Mouchère, J. Camillerapp, and B. Coüasnon, "Interest of syntactic knowledge for on-line flowchart recognition," in Graphics Recognition. New Trends and Challenges (Lecture Notes in Computer Science), vol. 7423, Y.-B. Kwon and J.-M. Ogier, Eds. Berlin, Germany: Springer, Feb. 2013, pp. 89–98, doi: 10.1007/978-3-642-36824-0_9.
[29] M. Bresler, D. Průša, and V. Hlaváč, "Modeling flowchart structure recognition as a max-sum problem," in Proc. 12th Int. Conf. Document Anal. Recognit., Washington, DC, USA, Aug. 2013, pp. 1215–1219, doi: 10.1109/ICDAR.2013.246.
[30] B. Schäfer and H. Stuckenschmidt, "Arrow R-CNN for flowchart recognition," in Proc. Int. Conf. Document Anal. Recognit. Workshops (ICDARW), Sydney, NSW, Australia, Sep. 2019, pp. 7–13, doi: 10.1109/ICDARW.2019.00007.
[31] M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho, "Augmentation for small object detection," 2019, arXiv:1902.07296.
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, Dec. 2017, pp. 6000–6010, doi: 10.48550/arXiv.1706.03762.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[34] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., early access, Feb. 18, 2022, doi: 10.1109/TPAMI.2022.3152247.

LIANSHAN SUN received the Ph.D. degree in software engineering from Peking University in 2009. He is an Associate Professor with the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology. His main research interests include blockchain, information security, and artificial intelligence.

HANCHAO DU received the graduate degree in computer and its application from Xidian University in 2015. He is currently pursuing the master's degree with the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology. His main research interests include image recognition and artificial intelligence.

TAO HOU is a Lecturer with the School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology. His main research interests include software engineering and artificial intelligence.