
International Journal of Production Research
Journal homepage: https://www.tandfonline.com/loi/tprs20

Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review

Chandan K. Sahu, Crystal Young & Rahul Rai

To cite this article: Chandan K. Sahu, Crystal Young & Rahul Rai (2020): Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review, International Journal of Production Research, DOI: 10.1080/00207543.2020.1859636

To link to this article: https://doi.org/10.1080/00207543.2020.1859636

Published online: 28 Dec 2020.


Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review

Chandan K. Sahu a∗, Crystal Young b∗ and Rahul Rai a

a Geometric Reasoning and Artificial Intelligence Lab (GRAIL), Department of Automotive Engineering, Clemson University, Greenville, SC, USA
b Manufacturing and Design Lab (MADLab), Department of Mechanical and Aerospace Engineering, University at Buffalo, Buffalo, NY, USA

CONTACT Rahul Rai [email protected], rahulrai@buffalo.edu, Geometric Reasoning and Artificial Intelligence Lab (GRAIL), Department of Automotive Engineering, Clemson University, Greenville, SC 29607, USA
∗ These authors have contributed equally to this work.

© 2020 Informa UK Limited, trading as Taylor & Francis Group

ABSTRACT

Augmented reality (AR) has proven to be an invaluable interactive medium to reduce cognitive load by bridging the gap between the task-at-hand and relevant information by displaying information without disturbing the user's focus. AR is particularly useful in the manufacturing environment where a diverse set of tasks such as assembly and maintenance must be performed in the most cost-effective and efficient manner possible. While AR systems have seen immense research innovation in recent years, the current strategies utilised in AR for camera calibration, detection, tracking, camera position and orientation (pose) estimation, inverse rendering, procedure storage, virtual object creation, registration, and rendering are still mostly dominated by traditional non-AI approaches. This restricts their practicability to controlled environments with limited variations in the scene. Classical AR methods can be greatly improved through the incorporation of various AI strategies like deep learning, ontology, and expert systems for adapting to broader scene variations and user preferences. This research work provides a review of current AR strategies, critical appraisal for these strategies, and potential AI solutions for every component of the computational pipeline of AR systems. Given the review of current work in both fields, future research work directions are also outlined.

ARTICLE HISTORY
Received 16 June 2020
Accepted 14 November 2020

KEYWORDS
Augmented reality; artificial intelligence; deep learning; machine learning; tracking; manufacturing

1. Introduction

Although reality can never be altered, its perception can be improved. Augmented reality (AR) is an interface between reality and the perception of reality. AR is the medium that superimposes digital information on the physical world, thereby bridging the rift between the real and virtual worlds. AR enables a novel transformational paradigm of information delivery and assimilation to a human operator, and if properly implemented, leads to cognitive load¹ reduction.

AR can be defined as a human-computer interaction technology that interactively registers three-dimensional (3D) real and virtual objects in real time (Azuma et al. 2001). Although the term 'Augmented Reality' was coined by Thomas Caudell and David Mizell in 1992, the first application of AR technology, a head-mounted display (HMD), dates back to Sutherland in 1968. AR is in the process of revolutionising many industries including education (Radu 2012; Wu et al. 2013), construction and building management (Chi, Kang, and Wang 2013; Gheisari et al. 2014), e-commerce (Dacko 2017), robotics (Green et al. 2008; Makris et al. 2016), automotive (Regenbrecht, Baratoff, and Wilke 2005; Alhaija et al. 2017), defense (Livingston et al. 2011), healthcare (Fischer, Bartz, and Straßer 2004; Yamamoto et al. 2012), tourism (Wei, Ren, and O'Neill 2014; tom Dieck and Jung 2018), gaming (Piekarski and Thomas 2002), and product lifecycle management (PLM) (Doil et al. 2003; Ong, Yuan, and Nee 2008).

AR has the opportunity to be integrated into every stage of PLM. For instance, AR can be utilised in the facility development process by visualising the floorplan in an interactive environment for the developer (Wang et al. 2020). In product development, AR can simulate the various stages of manufacturing for a product or show the various design iterations of a product (Ong, Yuan, and Nee 2008). While the product is being manufactured, AR can be utilised for assembly (Wang et al. 2013), quality assurance (Schwerdtfeger et al. 2008), and for collaboration and safety (Makris et al. 2016). Once the product is ready to be marketed to consumers, AR can be used by the consumer to visualise the product as an integrated part of their daily life (Scholz and Smith 2016). When the user buys the product, AR can be used as a diagnostic tool in case of troubleshooting (Sanna et al. 2015). Finally, at the end of the product's lifecycle, AR can be utilised again in the manufacturing environment in order to disassemble the final product for material recovery (Osti et al. 2017; Chang, Nee, and Ong 2020).

AR can be an effective way to decrease costs and increase efficiency during the manufacturing stage of PLM. AR is a technology that can reduce cognitive load by supplementing human labour with the right information at the right moment and in the right mode without disturbing the user's focus (Abraham and Annunziata 2017), thereby enabling the industry to achieve the aforementioned goals. The focus of this paper is to present a review of the state-of-the-art literature at the intersection of AR, manufacturing, and artificial intelligence (AI).

Many manufacturing environments are noisy, and thus it is impractical to include audio or voice elements into AR-assisted manufacturing applications without consideration of alternative user interaction or rigorous noise reduction strategies (Syberfeldt, Danielsson, and Gustavsson 2017). Additionally, mandatory personal protective equipment limits the user from other modalities like haptics. For this reason, this review primarily focuses on the visual modality. To be specific, we focus on Red Green Blue (RGB) images or a live video stream (rather than RGB Depth (RGB-D), thermal, Light Detection and Ranging (LiDAR), audio or other sensory data) as the input data due to the fact that even miniature ergonomic devices like mobiles and wearable devices can equip both a camera to acquire the data and a display to disseminate the augmented information.

The computational pipeline for an AR system is shown in Figure 1. For a seamless and realistic integration of the real and virtual worlds in the AR device, the real and virtual environments must align with each other. The virtual camera must be calibrated with the real camera's intrinsic parameters to ensure that the real world is accurately depicted throughout the AR computational pipeline. Each frame from the real camera is then inputted into the detection, tracking, and camera position and orientation (pose) estimation modules in order to acquire the extrinsic parameters. The detection module detects the pertinent dynamic objects within the use environment. The tracking module then tracks these objects in real-time. Finally, the camera pose estimation module estimates the pose of the real camera in real-time. Additionally, each frame is rendered inversely for determining characteristics like illumination and depth to supplement the graphics processing unit to enable a realistic rendition of the virtual objects in the scene. Once the functionalities of the virtual camera imitating the real camera are achieved, any information within procedure storage is incorporated into the creation of virtual objects for AR in virtual object creation. These virtual objects are registered in the AR environment, which means that they are aligned with the detected and tracked objects in the operator's real environment. Finally, virtual objects are rendered on the main visual interface of the AR system. The rendering process places the virtual objects in an optimal way for the user and the current environment.

Figure 1. Computational pipeline for a procedure-based AR system.
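
The flow in Figure 1 can be summarised as a per-frame loop. The sketch below is purely illustrative of how the modules chain together; every function and module name is a hypothetical placeholder, not an implementation described in the paper.

```python
# Illustrative per-frame loop for the procedure-based AR pipeline in Figure 1.
# All module names below are hypothetical placeholders.

def run_ar_pipeline(camera, display, procedure_store):
    intrinsics = calibrate_camera(camera)                  # offline or online calibration
    while camera.is_open():
        frame = camera.read()
        objects = detect_objects(frame)                    # detection
        tracks = track_objects(objects, frame)             # tracking
        pose = estimate_camera_pose(tracks, intrinsics)    # extrinsics (camera pose)
        light, depth = inverse_render(frame)               # illumination and depth cues
        step = procedure_store.current_step()              # procedure storage
        virtual_objs = create_virtual_objects(step)        # virtual object creation
        registered = register(virtual_objs, tracks, pose)  # registration
        display.show(render(frame, registered, light, depth))  # rendering
```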

The pipeline underlying AR-based manufacturing applications suffers from the following drawbacks: (1) the computational strategies are computationally expensive and not robust in varying environments, (2) the current strategies utilised in AR systems are laborious to implement and restricted to known environments (i.e. camera calibration), (3) specialised and diverse feature construction is required for object detection and tracking, (4) there is a lack of robust pose estimation techniques, thereby affecting variable runtime and latency of the AR system, (5) storage methods are rigid and non-adaptable, (6) virtual objects are hard-coded within the AR system, (7) primarily open-loop registration approaches are utilised, and (8) current rendering methods implement simplified rules to promote context-awareness in the AR system and utilise destructive methods to optimise the visualisation of data. Clearly, improvements are needed in all aspects of procedure-based AR systems.

AI is a promising solution to increase the adaptability and versatility of AR systems. The potential of AI techniques in the AR space is clear from the unprecedented success of AI in image processing and its allied tasks. As AR starts with camera images and terminates with rendered images with augmentations projected onto the display, AI has the potential to bring revolutionary changes in AR applications similar to the way AR has in manufacturing applications. Already, AI has brought tremendous changes to the manufacturing field (Y. Liu et al. 2020) and there are many more applications which can be adopted into manufacturing applications like inspection (Zhang et al., "Convolutional Neural Network," 2019) and material information management (Furini et al. 2016). Deep learning (DL), a subset of AI techniques, has surpassed decades of performance of classical object detection techniques in just a decade. Similarly, AI techniques can introduce a paradigm shift in the performance of all the tasks related to AR. The AR space, which is mostly dominated by fiducial markers, can potentially adapt to natural features. AR, when powered by AI, can be adopted without modification to the environment. DL can improve the accuracy and robustness of calibration, detection, tracking, camera pose estimation, and registration tasks. Ontology, a subset of AI, can simplify storage and the dissemination of procedural information. DL can be used in the development of adaptable augmentations within virtual object creation. Finally, DL can be utilised when determining what data to display and the optimal way to display this data to the user in rendering. Recently, the untapped potential of AI in AR applications is being explored. However, so far, most researchers have limited the utility of DL techniques mostly to detection and to simpler tasks like rectangular bounding boxes to localise (Abdi and Meddeb 2017; Subakti and Jiang 2018) or annotate (Rao et al. 2017; Park et al. 2020) the identified objects.

The main contributions of this research work are as follows:

(1) Review the current strategies utilised within AR systems for each of the modules of the AR computational pipeline.
(2) Identify the shortcomings of the current approaches utilised in AR systems for each module.
(3) Justify the use of AI-based methods for each module.
(4) Review current AI-based methods utilised in AR for each module.
(5) Describe relevant AI-based research for each module as related to manufacturing.
(6) Introduce various future research directions relating to the AR pipeline.
(7) Provide conclusive remarks based on the review.

This review paper is structured as follows. Section 2 contains an overview of the current research and challenges at the intersection of AR, AI, and manufacturing. Sections 3 through 11 deliberate each module of the framework within Figure 1. Each section includes, in sequential order, an introduction; current approaches in AR systems; a discussion about the limitations of current approaches and the potential for AI; current approaches within AR that utilise AI; potential AI methods that can be utilised in AR; and a conclusion. Section 12 outlines the key and immediate future research directions that are derived from and inspired by the insights gained from the preceding sections. Finally, Section 13 provides an overview and analysis of the work described in the previous sections. An overview of the topics reviewed and the conclusions drawn from all the modules of the procedure-based AR pipeline is provided in Table 1.

2. AR & AI in manufacturing

The manufacturing industry is currently undergoing an industrial revolution deemed Industry 4.0. The last three industrial revolutions included mechanisation in Industry 1.0, mass-production in Industry 2.0, and automation in Industry 3.0 (Ababsa 2020). Industry 3.0 automated all possible manufacturing activities and has seen a swarm of robots and sensors on the shop floor. Industry 4.0 or smart manufacturing is founded upon technologies whose aspirations include (1) accelerating the remaining segment of work that needs human expertise and skills; and (2) incorporating intelligence into the automated infrastructure to make robotic systems autonomous and able to make decisions from data (Rüßmann et al. 2015). Aspiration (1) made the provisions for the incorporation of AR into the scope of manufacturing. Aspiration (2) paved the way for integrating AI into the manufacturing industry. Significant progress in terms of theory and practice has been made in each of these directions. The manufacturing industry has been a huge beneficiary of these technologies.

Table 1. Overview of the concepts, limitations of classical approaches, and advantages of AI approaches for each module of the procedure-based AR system given in Figure 1.

(§3) Camera Calibration
Classical methods: Linear techniques (offline, intrinsics unknown); non-linear techniques (offline, intrinsics unknown); calibration-free (offline, intrinsics unknown); self-calibration (offline, intrinsics unknown); camera pose estimation for extrinsics (a) (offline, intrinsics known).
Limitations of classical methods: Are computationally expensive or highly complex to implement (Donné et al. 2016); are cumbersome, time-consuming, and prone to errors (Tuceryan, Genc, and Navab 2002; Bogdan et al. 2018); require prior knowledge of the scene or strong assumptions to be imposed (for offline camera calibration) (Seo and Hong 2000; Bogdan et al. 2018); and are not robust or adaptive in general scenes (Seo, Ahn, and Hong 1998; Pang et al. 2006).
AI methods: Single image techniques (online); non-single image techniques (online).
Advantages of AI methods: Enable adaptive calibration without user intervention; require minimal inputs; are applicable to general scenes; and are not limited to use in predetermined environments.

(§4) Detection
Classical methods: Fiducials (artificial features); simple features (natural features); advanced features (natural features).
Limitations of classical methods: Cannot engineer all of the features that characterise an object; and cannot achieve generic object detection capability.
AI methods: Image classification (natural features); object detection (natural features); semantic segmentation (natural features); instance segmentation (natural features).
Advantages of AI methods: Are able to work directly with raw images; are capable of learning all essential object features autonomously; and are capable of classifying and detecting a wide variety of generic objects.

(§5) Tracking
Classical methods: Feature-based; model-based.
Limitations of classical methods: Are prone to failure in cluttered environments; have limitations with complex scenes involving multiple dynamic objects, variations in illumination, motion blur, and occlusion; and explode exponentially as the number of tracked objects increases, making them ineffective for real-time tracking (for MOT (b) algorithms).
AI methods: Detection and tracking; tracking by detection.
Advantages of AI methods: Eliminate the need for hand-crafted features to design robust trackers; and do not explode exponentially as the number of tracked objects increases.

(§6) Camera Pose Estimation
Classical methods: Pose estimation from n points (PnP).
Limitations of classical methods: Lack robustness with large changes in viewpoint, variation in illumination, textureless scenes, motion blur, and other situations; are not able to find accurate matching points in all scenarios; grow in computational complexity and the proportion of outliers as the size of the model grows (for SfM (c)-based approaches); and do not have a constant runtime at inference (for RANSAC (d)).
AI methods: Without auxiliary learning; with auxiliary learning.
Advantages of AI methods: Do not require engineered features; can be trained to be robust against changes in illumination, perspective, and others; have relatively constant complexity; have consistent performance at runtime; and are able to directly infer the pose from RGB (e) images (thus skipping the detection and tracking steps) (Kendall, Grimes, and Cipolla 2015).

(§7) Inverse Rendering – Illumination
Classical methods: With auxiliary information; without auxiliary information.
Limitations of classical methods: Have a complex capturing process (for light probes and methods using auxiliary information); are computationally expensive and not scalable; disturb real-time performance (with light probes); and have low accuracy (for methods not using auxiliary information).
AI methods: Supervised; unsupervised.
Advantages of AI methods: Do not require definitive models of the geometric or photometric attributes; and exploit the attributes of the images within the FOV (f) rather than those beyond the FOV (f).

(§7) Inverse Rendering – Depth Inference
Classical methods: Perceptual depth cues; image attributes.
Limitations of classical methods: Are ineffective in recovering depth information despite the use of perceptual cues, due to the heavily under-constrained nature of the problem, camera characteristics like LDR (g), and the use of few cues.
AI methods: Supervised; semi-supervised; unsupervised.
Advantages of AI methods: Are capable of exploiting almost all cues at once, thus offering better depth inference; and are able to estimate depth information from RGB (e) images simultaneously with other tasks (detection, tracking, and pose estimation).

(§8) Procedure Storage
Classical methods: RDBMS (h), PLM (i), PIM (j), CMMS (k), and XML (l) databases.
Limitations of classical methods: Are static in nature and do not have the capability to derive further knowledge and relationships.
AI methods: Ontology.
Advantages of AI methods: Are able to dynamically manipulate data over time; and add flexibility into AR systems.

(§9) Virtual Object Creation
Classical methods: Manual creation.
Limitations of classical methods: Require extensive time and energy from the AR system developer to make augmentations; and do not enable adaptive or in-situ augmentations.
AI methods: Caption generation (1D (m)); sketch or contour generation (2D (n)); 3D reconstruction (3D (o)).
Advantages of AI methods: Enable augmentations to be created and adapted in-situ; and require minimal input for augmentation creation.

(§10) Registration
Classical methods: Open-loop; VVS (p) (closed-loop).
Limitations of classical methods: Are not fully developed; are not in widespread use in AR systems; and require predefined features in the scene (for VVS (p) methods).
AI methods: Incorporating NNs (q) in VVS (p); MPC (r) for VVS (p); reinforcement learning for VVS (p).
Advantages of AI methods: Can be used in unknown environments (Saxena et al. 2017); and are more accurate, efficient, stable, and robust than IBVS (s) strategies (for AI-based VVS (p) methods) (Comport et al. 2006).

(§11) Rendering
Classical methods: Rule-based methods (data); spatial (data visualisation, filtering); knowledge-based or semantic (data visualisation, filtering); hybrid (data visualisation, filtering); geometric-based (data visualisation, visualisation layouts); image-based (data visualisation, visualisation layouts); data clustering (data visualisation).
Limitations of classical methods: Are rigid and based on hard-coded rules (for data); can be destructive and include data loss (for data visualisation) (Tatzgern et al. 2016); and can fail when the amount of data grows too large (for data visualisation) (Tatzgern et al. 2016).
AI methods: Layout generation (data visualisation); data clustering (data visualisation).
Advantages of AI methods: Are able to incorporate context-awareness, user customisation, expert knowledge, and flexibility into AR systems; are more efficient and effective for rendering (with AI-based layout creation); and are able to transform data into clustering-friendly nonlinear representations (with DL-based data clustering).

Notes: (a) Refer to the 'Camera Pose Estimation' section (§6). (b) Multiple Object Tracking (MOT). (c) Structure from Motion (SfM). (d) RANdom SAmple Consensus (RANSAC). (e) Red Green Blue (RGB). (f) Field Of View (FOV). (g) Low Dynamic Range (LDR). (h) Relational DataBase Management System (RDBMS). (i) Product Lifecycle Management (PLM). (j) Product Information Management (PIM). (k) Computerised Maintenance Management Systems (CMMS). (l) Extensible Markup Language (XML). (m) One-Dimensional (1D). (n) Two-Dimensional (2D). (o) Three-Dimensional (3D). (p) Virtual Visual Servoing (VVS). (q) Neural Network (NN). (r) Model-Predictive Control (MPC). (s) Image-Based Visual Servoing (IBVS).

2.1. AR-assisted manufacturing

As AR can facilitate uninterrupted information exchange without the need of altering the operator's attention, it can accelerate manufacturing activities. Thus, it has become one of the nine key disruptive technologies that are driving Industry 4.0 (Rüßmann et al. 2015). The majority of the applications of AR in the manufacturing industry are in three activities: assembly, maintenance, and training. Within assembly, maintenance, and training tasks, technicians must follow specific tasks and procedures in order to perform assembly sequences and procedures. Seamlessly displaying this procedural information to the user is essential to minimise their cognitive load. This directly translates into minimal operating costs, efficient processes, and increased productivity. In AR-assisted assembly and maintenance activities, the operators can perform their work without altering their focus. Operators who use AR for assembly and maintenance can also handle parts that they are working with for the first time. AR eliminates the need to memorise operational instructions. Hence, customisation can be added to products effortlessly. AR-assisted assembly operations have seen a reduction of error of up to 82% (Tang et al. 2003). While information supply remains the primary motivation, several secondary advantages offered by AR are noteworthy in improving productivity. One of them is the reduction of the distraction that was imminent in using computers or instruction manuals as the source of information. Others include collaboration capabilities with the AR system, interconnectivity of the AR system with the manufacturing environment, and the capability for AR systems to adapt to the manufacturing facility and to users' current status.

Many projects have been established to incorporate AR in assembly operations. An AR-based interactive manual assembly design system was developed to predict the user's assembly intents and to manipulate virtual objects (Wang, Ong, and Nee 2013). Chen, Hong, and Wang (2015) used AR to assist the assembly of a gearbox. AR has also been used to remotely support assembly operations to cater to customers' customisation requests (Mourtzis, Zogopoulos, and Xanthi 2019). The AR system offered the part list and assembly instructions at the workstation by retrieving its production schedule.

AR has been utilised for maintenance tasks. Maintenance tasks encompass a wide variety of activities such as inspection, measurement, and service in addition to assembly. A robotic arm was maintained using AR, where an animated ratchet was superimposed (Zubizarreta, Aguinaga, and Amundarain 2019). The AR system estimated the 6-DOF pose of the rigid parts associated with the robotic arm and showed the motion of the ratchet. The relevant product structure was displayed in the maintenance of a ubiquitous car (Lee and Rhee 2008a). The pertinent technical information was projected on the engine of a motorbike during its maintenance (Uva et al. 2018). The electronic components of a wind turbine were inspected with the help of AR (Quandt et al. 2018). AR has successfully assisted maintenance technicians in improving overall productivity by allowing bi-directional context-aware authoring (Zhu, Ong, and Nee 2013).

AR has been recognised as an effective tool to disseminate knowledge by improving training methods. An HMD was utilised to train employees assembling engines at the vehicle manufacturer BMW (Werrlich, Nitsche, and Notni 2017). Users have been trained for assembling a milling machine (Peniche et al. 2012) as well as for welding (Quandt et al. 2018). AR-assisted training for complex manufacturing activities has been evaluated to be as effective as face-to-face training (Gonzalez-Franco et al. 2016). Gavish et al. (2015) concluded from a usability study that using AR for training in industrial assembly and maintenance tasks shall be encouraged, and that AR is more effective than virtual reality.

Additionally, AR has assisted in product design by annotating design variations (Bruno et al. 2019); in assembly and disassembly sequence planning (Wang et al. 2013; Chang, Nee, and Ong 2020); in packaging by overlaying the virtual correction template on the die itself (Álvarez et al. 2019); in planning the orientation of robots (Fang, Ong, and Nee 2013); in casting to study users' hand movement (Motoyama et al. 2020); to highlight spot-weld locations (Doshi et al. 2017); and in shop floor management (Wang et al. 2020). Furthermore, AR can assist in simulating and developing manufacturing processes and products before they are actually carried out. That will provide support in getting the activities done correctly in their first attempt. Moreover, collaborative AR applications can open up the opportunity to simultaneously leverage the expertise of multiple people from geographically diverse locations.

2.2. AI-driven manufacturing

The advent of automation (Industry 3.0) in manufacturing brought about the widespread use of sensors and robots. The use of sensors has thus resulted in a large amount of data available from the manufacturing process, or big data. Deriving decisions from big data and making robots autonomous to enable them to make their own decisions have been two of the key goals of Industry 4.0 (Rüßmann et al. 2015). Autonomous robots and big data analytics have resulted in the widespread use of AI. AI technologies provide a way to manage and utilise this big data (Aggour et al. 2019). AI strategies have been utilised in all aspects of the manufacturing process and supply chain from design, operations management, and production to maintenance and assembly (Baryannis et al. 2019; Zhang et al., "A Reference Framework," 2019). Real-time data from sensors can inform decision-making in the manufacturing process (Deac et al. 2017). This data can allow manufacturing systems to respond in real-time to changing demands and conditions throughout the PLM with respect to the industry, supply network, and customer needs (Zhang and Kwok 2018; Yao et al., "From Intelligent Manufacturing," 2017). Smart manufacturing aims to be able to utilise big data collected from the entire product lifecycle into the manufacturing system in order to improve every aspect of the process (Tao et al. 2018). The main goal behind using big data is to achieve fault-free or defect-free processes (Escobar and Morales-Menendez 2018).

AI strategies can be used to aid in the design of manufacturing systems. For instance, Stocker, Schmid, and Reinhart (2019) utilise reinforcement learning to automate the design process for vibratory bowl feeders (VBFs). The reinforcement learning mimics the manual trial-and-error approach currently used to find the trap configuration with the highest efficiency. Sanderson, Chaplin, and Ratchev (2019) developed an innovative Function-Behaviour-Structure model for Evolvable Assembly Systems (EAS) consisting of an ontology and design process. The Function-Behaviour-Structure model is used as input to a modelling design process for assembly systems. In this methodology, the design process is integrated into the manufacturing control system.

AI methods can be utilised for handling operations management. Baryannis et al. (2019) highlight the usage of various AI methods for supply chain risk management, including stochastic programming, robust optimisation, fuzzy programming, hybrid approaches with mathematical programming, network-based approaches, agent-based approaches, reasoning, and machine learning (ML) with big data. Li et al., "Machine Learning" (2020) propose a rescheduling framework that balances rescheduling frequency and the accumulation of delays. The framework combines ML with optimisation in order to solve a flexible job-shop scheduling problem (FJSP).

Various AI strategies can be applied to production decision-making and support. For instance, Palmer et al. (2018) developed a manufacturing reference ontology to support flexible manufacturing decision-making. This ontology is not only useful to elucidate understanding across domains but for information sharing and support for interoperable manufacturing systems as well. Zhou et al. (2020) present a framework for a knowledge-driven digital twin manufacturing cell (KDTMC) to support autonomous manufacturing. The framework utilises digital twins, dynamic knowledge bases, and knowledge-based intelligent skills. Possible applications for the framework include process planning, production scheduling, and production process analysis and regulation.

When considering maintenance, service, repair, and inspection, AI techniques can be used for quality inspections and predictive maintenance. Yacob, Semere, and Nordgren (2019) propose Skin Model Shapes, which are able to generate digital twins of manufactured parts. The Skin Model Shapes are used in conjunction with ML methods to detect anomalies and unfamiliar changes with parts to ensure part quality. Kang et al. (2018) compared both linear regression and artificial neural networks (ANNs) for predictive models for part failure. Part failure predictive models are used to predict product quality indicators based on both product quality and manufacturing parameters. AI has also been widely used in inspection (Yang et al. 2020), diagnosis (Matei et al. 2020), and prognostics and health monitoring of machines (Yang and Rai 2019).

Finally, AI can be used for scheduling and production in assembly lines. Huo, Zhang, and Chan (2020) use a fuzzy control system to adjust the assembly line depending on real-time machine health information. The control system is designed to smooth the workloads of workstations in the assembly line. Azadeh et al. (2012) describe a multi-layer feed-forward back-propagation ANN developed for a stochastic Two-Stage Assembly Flow-Shop Scheduling Problem (TSAFSP). The TSAFSP consists of (1) manufacturing of a product's parts by various machines in parallel and (2) assembly of all of the product's parts. The TSAFSP addressed by Azadeh et al. (2012) accounts for stochastic machine breakdown and processing times.

2.3. AI-driven AR-assisted manufacturing

Both AR-assisted manufacturing and AI-driven manufacturing have evolved independently of each other and have accelerated manufacturing separately. Most past work has used AR as a tool to supplement manufacturing activities. Those attempts, although numerous, suffer from a variety of shortcomings, as discussed in Section 1. These shortcomings can be eliminated by using AI as a tool in the computational framework of AR. There is a domain of research that uses AI as a tool in AR applications that has been overlooked so far. The succeeding sections (3 to 11) detail the mode of incorporation of AI in AR-assisted manufacturing applications. While AI-driven AR-assisted manufacturing applications will be instrumental in the future, the adoption of AI into AR-assisted manufacturing faces challenges from both technical and managerial fronts.

There are many challenges that are presented from the technical front while incorporating AI strategies within AR-assisted manufacturing applications. First, manufacturing environments offer limited variation in objects to support detection and tracking. Discerning between various objects such as nuts or bolts which differ by size, or objects that have undergone the same terminal machining process, is very difficult. That makes creating generic AI challenging. Training AI algorithms to recognise anomalies is an expensive affair because creating data that capture the anomalies is usually time and resource-intensive. Further, collecting and sharing big data from the manufacturing processes to the AR system is another challenge. Transmission of relevant data to AR systems is essential in order for the AR system to be up-to-date and context-aware. Privacy and security issues can arise when industrial data are stored in cloud platforms (Bajic et al. 2018). Finally, manufacturing processes frequently include time-varying uncertainties (Qin and Chiang 2019). These uncertainties must be accounted for in order for data-driven or AI methods to perform accurately within AR-assisted manufacturing systems.

There are many challenges that emerge from the managerial side when incorporating AI technologies into the manufacturing environment. As described by Arinez et al. (2020), implementing AI on a large scale such as a shop floor is a challenge in itself because most implementations to date are designed and tested in a laboratory. Integrating AI with current manufacturing systems is another issue; finding solutions to issues with parallelism in computing (Jordan and Mitchell 2015) and AI compatibility with first-principles models and processes (Qin and Chiang 2019) are some major challenges with implementing AI in manufacturing. When creating training datasets for AI algorithms, it may be necessary to collect data from actual machine processes. In cases of predictive maintenance, diagnostics, or prognostics, data would need to be collected from machine failures or malfunctions. This data collection process may be dangerous for the developer and costly for the manufacturer. Supervised learning, which has been the most successful segment of AI, needs labels for the data. Labelling is expensive in terms of time and can require skilled labour in manufacturing applications. Next, utilising the results from AI-driven manufacturing requires special interpretation (Wuest et al. 2016; Qin and Chiang 2019). Interpretation of results includes an analysis of the output, characteristics of the chosen algorithm, parameter settings, the expected output, and the input data along with any preprocessing done (Wuest et al. 2016). The correct interpretation is a challenge that manufacturers must face when incorporating AI into their processes. Next, manufacturing processes are often dangerous and require special safety considerations. Dagnaw (2020) describes challenges with incorporating safety into AI-driven manufacturing processes, including a need to add labels relating to safety in AI algorithms, incorporating knowledge-based reasoning into AI algorithms, and utilising AI algorithms in the manufacturing space for dynamic risk assessment. Integrating these subsystems in their current forms together in the manufacturing environment requires programming expertise, thus necessitating organisational restructuring (Martinetti et al. 2019; Masood and Egger 2020). Next, AI methods tend to be black-box in nature; they are not explainable. That questions their credibility. Lack of reasoning diminishes users' trust in AI-driven AR systems. Once AI is included in AR-assisted manufacturing, there may be a prolonged adjustment period for employees because of a lack of trust in AI (Qin and Chiang 2019; Arinez et al. 2020). Moreover, a lack of explainability can hinder the organisation's compliance with safety and legal requirements. On the organisational level, access to computing and human resources in order to implement AI in manufacturing, along with legal and regulatory issues, are other major challenges (Dagnaw 2020). Dagnaw (2020) concisely summarises this challenge as the need for the development of AI applications that are ethical, explainable, and acceptable.

While the integration of AI into AR-assisted manufacturing will face the above-mentioned technical and managerial challenges, the advantages of AI-driven AR far outweigh these shortcomings. With AI, AR systems will become intelligent like autonomous robots and sensors. They will be able to detect and track generic objects without mistake. They will be ubiquitous like sensors in Industry 3.0. AI-driven AR systems will be able to work autonomously in the manufacturing environment with minimal effort and input from humans. AI provides AR systems with the ability to work efficiently and effectively as a unified whole with the manufacturing environment and with human operators, thus enabling Industry 4.0. While the ubiquitous use of AI methods within AR systems is perceived as a logical step towards the development of smart industry, only a few AR systems in the research literature utilise AI. As a way to bridge the gap between the industry of today and Industry 4.0, we next highlight the shortcomings of classical approaches in each area of the AR pipeline and introduce better AI alternatives, starting with camera calibration.

3. Camera calibration

The first step in the creation of an AR application is camera calibration. Camera calibration is a process that captures the internal parameters of the real camera so that the virtual camera can mimic the real camera. Camera calibration readies the virtual environment for detection, tracking, and camera pose estimation so that the virtual objects can be correctly aligned with real objects. A hierarchical depiction of the topics covered within camera calibration is given in Figure 2.

Figure 2. Visual depiction of the topics reviewed in camera calibration.
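
For reference, the intrinsic parameters recovered by calibration and the extrinsic parameters recovered at runtime enter through the standard pinhole projection model; this is textbook material rather than a formula taken from any single cited work.

```latex
% Standard pinhole projection: world point X (homogeneous) maps to pixel x, up to scale s.
\[
  s\,\mathbf{x} \;=\; K\,[\,R \mid \mathbf{t}\,]\,\mathbf{X},
  \qquad
  K \;=\;
  \begin{bmatrix}
    f_x & \gamma & c_x \\
    0   & f_y    & c_y \\
    0   & 0      & 1
  \end{bmatrix}
\]
% K collects the intrinsics (focal lengths f_x, f_y, skew gamma, principal point (c_x, c_y));
% the rotation R and translation t are the extrinsics that detection, tracking, and
% camera pose estimation recover for every frame.
```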

Many AR systems are developed using Software Development Kits (SDKs) in order to streamline the tasks required to develop AR systems, including camera calibration. These SDKs are either available as a supplement to proprietary devices such as Microsoft HoloLens, Schenker META, Magic Leap, Android devices, Apple devices, and Lightform projectors, or as separate libraries available for use in development. If the SDK is associated with a proprietary device, a calibration profile is available for use. For SDKs that are independently developed, manual or direct calibration techniques are used with calibration patterns as in the Wikitude (Wikitude GmbH 2018), Vuforia (PTC 2020), and Kudan (Kudan Inc. 2020) SDKs.

3.1. Current approaches for camera calibration in AR

The camera calibration techniques currently in use can be classified into offline and online methods. While camera calibration is traditionally an offline process, many researchers have found ways to perform the necessary camera calibration steps online by utilising user intervention or by using computer vision (CV) techniques. Offline camera calibration methods can further be split into two categories: (1) methods where the camera calibration parameters do not need to be known in advance, and (2) methods where the intrinsic camera parameters are pre-calibrated (Pang et al. 2006). The majority of AR systems that perform offline camera calibration belong to the second category. In this case, the camera calibration process solely involves the determination of the extrinsic parameters.

Within the first category of methods where the calibration parameters are not known a priori, there are two main categories of strategies implemented in AR systems to find the intrinsic camera parameters: (1) methods that utilise linear techniques and (2) methods that utilise non-linear techniques (Malek et al. 2008). Additionally, some strategies find the intrinsic and extrinsic parameters separately or find the camera matrix as a whole. If the latter strategy is implemented, it is possible to decompose the resultant camera matrix into its intrinsic and extrinsic matrix parts (Axholt 2011) using the Kruppa equations (Kruppa 1913). Having both the intrinsic and extrinsic parameters defined separately is required to generate shading, shadow, and illumination effects on virtual objects (Seo and Hong 2000; Pang et al. 2006).

Linear calibration techniques involve the use of the linear least-squares method to obtain the projection matrix (Malek et al. 2008). The most common method utilised within linear calibration techniques is Direct Linear Transform (DLT), as in Malek et al. (2008) and Teichrieb et al. (2007). Numerical or iterative techniques are utilised to perform linear optimisation, as in Malek et al. (2008), where the Gauss-Jacobi numerical technique was used to solve the linear system of equations. Linear techniques are fast but do not take into account any non-linear radial or tangential lens distortion (Malek et al. 2008). Omitting this non-linear information relating to lens distortion leads to inaccurate calibration results (Gomez, Simon, and Berger 2005).
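
As a generic illustration of the linear (DLT-style) approach just described, the sketch below estimates a 3×4 projection matrix from known 3D–2D correspondences by linear least squares and then splits it into intrinsic and extrinsic parts with OpenCV. It is a minimal sketch, not the procedure of any specific cited work; the correspondence arrays `world_pts` and `image_pts` are assumed to be given.

```python
import numpy as np
import cv2

def dlt_projection_matrix(world_pts, image_pts):
    """Estimate the 3x4 projection matrix P from >= 6 known 3D-2D correspondences (DLT)."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # Homogeneous linear least squares: the solution is the right singular vector
    # associated with the smallest singular value of A.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 4)

# Decompose P into intrinsics K, rotation R, and the camera centre.
# world_pts (Nx3) and image_pts (Nx2) are assumed to be available.
P = dlt_projection_matrix(world_pts, image_pts)
K, R, c_h, *_ = cv2.decomposeProjectionMatrix(P)
K = K / K[2, 2]                      # normalise so that K[2, 2] = 1
centre = (c_h[:3] / c_h[3]).ravel()  # camera centre in world coordinates
```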

Non-linear optimisation calibration techniques estimate the camera parameters through iteration based on a predefined function (Malek et al. 2008). Some common functions used in non-linear methods are the reprojection error (Dash et al. 2018) and the residual errors in calibration (Yuan, Ong, and Nee 2006). Numerical or iterative techniques such as the Levenberg-Marquardt (LM) algorithm (Dash et al. 2018) and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Xie et al. 2018) are utilised to perform non-linear optimisation in many cases (Comport et al. 2006). Many of the non-linear optimisation strategies in AR systems derive the camera parameters based on the motion of the camera or Structure from Motion (SfM), as in Seo and Hong (2000), Teichrieb et al. (2007), Simões et al. (2013), Seo, Ahn, and Hong (1998), Gordon and Lowe (2006), and Skrypnyk and Lowe (2004). Although non-linear techniques have good accuracy, they are slow and have a high computational cost (Malek et al. 2008). Additionally, non-linear techniques tend to converge to local minima (Comport et al. 2006).
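
In practice, the pattern-based offline workflow described in Section 3.1 is commonly run with OpenCV, whose calibrateCamera routine performs this kind of iterative reprojection-error minimisation (with Levenberg-Marquardt refinement) internally. The sketch below assumes a set of checkerboard images is available; the image path and board dimensions are placeholders.

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)          # inner corners of the checkerboard (placeholder)
SQUARE = 0.025          # square size in metres (placeholder)

# 3D coordinates of the board corners in the board's own frame (Z = 0 plane).
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):        # placeholder path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS reprojection error, the intrinsic matrix K, the lens distortion
# coefficients, and per-view extrinsics (rotation and translation vectors).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```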

For augmentation of video streams or in fully offline AR applications, methods such as the bundle adjustment technique can be used for camera calibration, as in Gordon and Lowe (2006), Heyden and Åström (1997), and Heyden and Åström (1998).

A subset of camera calibration focuses upon self-calibration (i.e. auto-calibration) and calibration-free methods. Self-calibration and calibration-free methods are useful in cases where the intrinsic parameters change over time (e.g. zooming of the camera) and fiducial points are not available (Seo and Hong 2000) or when attempting to calibrate general video images (Seo, Ahn, and Hong 1998). Self-calibration methods provide the Euclidean motion of the real camera in addition to the calibration parameters (Seo and Hong 2000). The method by Faugeras, Luong, and Maybank (1992) utilises the epipolar transformation found between point matches of image sequences and the Kruppa equations to link the epipolar transformations to the absolute conic in order to find the intrinsic camera parameters. Pollefeys, Koch, and Van Gool (1999) work off of the absence of skew in the image plane in order to allow for metric self-calibration of zooming lens cameras. Taketomi et al. (2014) dynamically calculated the intrinsic and extrinsic parameters of a zooming lens camera using an energy minimisation formulation. Chendeb, Fawaz, and Guitteny (2013) apply self-calibration to AR by combining interferometry and pattern recognition. The methods by Seo, Ahn, and Hong (1998) and Seo and Hong (2000) use reference and video images to recover structure and motion information for the object of interest. The parameterised cuboid structure (PCS) method (Chen, Yu, and Hung 1999) is used to insert animated virtual objects into an AR scene by using a cuboid structure as the reference object. Finally, in Kutulakos and Vallino (1998), an affine camera model with an orthographic camera is used in order to produce virtual augmentations. Kutulakos and Vallino (1998)'s method is not generally applicable to cameras with perspective projection and is unable to take advantage of scene illumination effects because it does not utilise Euclidean geometry (Seo and Hong 2000).

3.2. Limitations of current approaches for camera calibration in AR & potential for AI

The camera calibration methods currently utilised in AR systems are computationally expensive or highly complex to implement (Donné et al. 2016). Further, any methods that require the user to align elements in the space or to collect data for the system are cumbersome, time-consuming, and prone to errors (Tuceryan, Genc, and Navab 2002; Bogdan et al. 2018). Any camera calibration methods that are performed offline require some knowledge of the scene or strong assumptions to be imposed on the scene prior to the use of the system, and this is limiting for applications where the use environment changes frequently (Seo and Hong 2000; Bogdan et al. 2018). Even in cases of self-calibration methods, none of these strategies have been found to be robust in general scenes (Pang et al. 2006).

In order to overcome the limitations of existing classical camera calibration methods, AI-based camera calibration strategies are recommended. AI-based camera calibration methods: (1) enable adaptive calibration without the need for user interaction, (2) are applicable to general scenes, and (3) are not limited to use in predetermined environments.

3.3. AI approaches for camera calibration in AR

To the best of our knowledge, there have been no AI-based camera calibration methods utilised within AR-assisted manufacturing applications.

3.4. Potential AI approaches for camera calibration in AR

The AI-based camera calibration methods for use in AR systems can be categorised into single image and non-single image methods. Single image methods are solutions that take as input a single image. Non-single image methods are those which take as input multiple consecutive frames or a video segment.

3.4.1. Single image methods

Within single image strategies, Bogdan et al. (2018) present DeepCalib, a Convolutional Neural Network (CNN) with the Inception-V3 architecture (Szegedy et al. 2016) that estimates the intrinsic parameters of the camera. The CNN is trained using omnidirectional images from the Internet. Their method does not require motion estimation, a calibration target, consecutive frame inputs, or information about the structure of the scene. The AI-based self-calibration method by Zhuang et al. (2019) finds camera intrinsics and radial distortion for Simultaneous Localization and Mapping (SLAM) systems using two-view epipolar constraints. In this work, the CNN is trained on a set of images with varying intrinsics and radial distortion. The solution by Lopez et al. (2019) finds both intrinsic and extrinsic camera parameters from single images with radial distortion. They utilise a CNN that is trained to perform regression on varying extrinsic and intrinsic parameters. Donné et al. (2016) present ML for adaptive calibration template detection, or MATE. A CNN is trained using standard checkerboard calibration patterns in order to detect these calibration patterns in single images. After the calibration pattern is detected, other methods can be used to derive the camera parameters. He, Wang, and Hu (2018) estimate depth and focal length from a single image. While this method cannot be used to find all of the intrinsic parameters of the camera, it effectively finds the focal length of an image. Hold-Geoffroy et al. (2018) utilise a deep CNN (DCNN) to infer the focal length and camera orientation parameters of a camera from a single image. The DCNN is trained using the SUN360 database (Xiao et al. 2012), a large-scale panorama dataset. A simple pinhole camera model is utilised, so distortion parameters are omitted.
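
The single-image methods above are, at their core, CNN regressors over calibration parameters. The sketch below shows the general shape of such a model in PyTorch: a backbone with a small regression head that predicts, for example, the focal length from one RGB image. It is only an illustration of the idea, not a reproduction of DeepCalib or any other cited architecture; the training data pipeline (`loader`) is assumed to exist.

```python
import torch
import torch.nn as nn
from torchvision import models

class FocalRegressor(nn.Module):
    """Toy single-image calibration regressor: image -> focal length (in pixels)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any CNN backbone would do
        backbone.fc = nn.Identity()                # keep the 512-d feature vector
        self.backbone = backbone
        self.head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, img):                        # img: (B, 3, H, W)
        return self.head(self.backbone(img))       # (B, 1) predicted focal length

# Sketch of a training step; `loader` is assumed to yield (image, true_focal) pairs.
model, loss_fn = FocalRegressor(), nn.SmoothL1Loss()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
for img, focal in loader:
    optim.zero_grad()
    loss = loss_fn(model(img).squeeze(1), focal)
    loss.backward()
    optim.step()
```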

3.4.2. Non-single image methods

Within AI-based non-single image camera calibration methods, Brunken and Gühmann (2020) utilise multiple consecutive image frames as input to a Neural Network (NN) that automatically determines intrinsic, extrinsic, and distortion parameters from a camera. In this method, two views are transformed to match a third reference view through plane homography mappings. Gordon et al. (2019) utilise frames from videos to estimate the camera calibration parameters. Depth, egomotion, object motion, and camera intrinsics are learned from video frames in an unsupervised manner. Their method utilises two NNs: one for depth estimation and one used to predict camera motion and camera intrinsics from two consecutive video frames.

3.5. Conclusion

The classical camera calibration techniques currently used in AR systems are not adaptive to general scenes (Seo, Ahn, and Hong 1998), which are commonly encountered in manufacturing. When offline calibration parameter estimations are used, any changes that occur to the internal parameters of the camera while in use, such as zooming, render the originally-estimated camera parameters as no longer useful unless known two-dimensional (2D)-3D correspondences are still present in the environment (Seo and Hong 2000). In order to make the camera calibration process more robust, AI-based approaches are recommended. AI-based methods are able to dynamically predict the calibration parameters for a camera while in use, given a single image or multiple image frames. This then saves time for developers and operators when utilising AR systems in the manufacturing environment. While AI-based NN methods have their own challenges, such as extensive effort required for training, these challenges are outweighed by the benefits gained from being able to dynamically estimate camera intrinsics in unknown environments (Zhang et al., "Line-Based Geometric," 2019).

4. Detection

In a real-time setting, object detection initialises the process of determination of camera extrinsics. The real objects in the environment are detected and localised to instantly or recurrently (with tracking) estimate the pose of the camera. The techniques discussed in this section are summarised in Figure 3.

Figure 3. Visual depiction of the topics reviewed in detection.

4.1. Current approaches for detection in AR

The classical methods of detection can be classified into marker-based and markerless. Marker-based methods place fiducials, which have characteristic properties (shape, patterns, etc.) for easier identification in the scene. Circular 2D barcodes (Naimark and Foxlin 2002), color-coded 3D markers (Mohring, Lessig, and Bimber 2004), 3D cones (Steinbis, Hoff, and Vincent 2008), and libraries like ARToolKit (Kato and Billinghurst 1999), ARTag (Fiala 2005), and ARToolKitPlus (Wagner and Schmalstieg 2007) are a few fiducial markers that are widely used for AR applications. Although they are easier to track, these fiducials need to be placed in the environment, which may not be possible (in industrial scenes) or desirable (aesthetically unpleasant or a distraction).
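
Fiducial detection of the kind described above is available off the shelf; for instance, OpenCV's aruco module detects square binary markers and returns their corners and IDs, which can then feed pose estimation. A minimal sketch follows; it assumes OpenCV 4.7 or later (where the ArucoDetector class is available), and the camera index is a placeholder.

```python
import cv2

# Assumes OpenCV >= 4.7, where the ArUco API is exposed as cv2.aruco.ArucoDetector;
# older releases expose cv2.aruco.detectMarkers() instead.
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

cap = cv2.VideoCapture(0)             # placeholder camera index
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _rejected = detector.detectMarkers(gray)
    if ids is not None:
        # Marker corners plus a calibrated camera are what the camera pose
        # estimation module consumes; here we only visualise the detections.
        cv2.aruco.drawDetectedMarkers(frame, corners, ids)
    cv2.imshow("fiducial detection", frame)
    if cv2.waitKey(1) == 27:           # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```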

The markerless methods detect natural features available in the scene. They identify specific key points that are unique in their neighbourhood and create a feature. Hence, they are local in nature, and this makes them invariant to scaling and rotation. Scale-invariant feature transform (SIFT) (Lowe 1999, 2004) and speeded-up robust features (SURF) (Bay, Tuytelaars, and Van Gool 2006) are two of the most widely used markerless methods for matching images of an object from different viewpoints. Natural features like edge and texture information (Vacchetti, Lepetit, and Fua 2004), edges and corners (Park and Park 2010), lines with a 3D model (Wuest, Vial, and Strieker 2005; Ababsa et al. 2008), SIFT (Lee and Hollerer 2008; Wagner et al. 2008, 2009; Ha et al. 2011; Teng and Wu 2012), Ferns (Ozuysal, Fua, and Lepetit 2007; Wagner et al. 2008, 2009), SURF (Takacs et al. 2008), etc. have already been used for object detection in AR applications.

Usually, such point-based natural features like SIFT and SURF are susceptible to failure in cases of variations in lighting, intrinsic parameters of the camera, and motion blur. So, they were further developed as Principal Components Analysis (PCA)-SIFT (Ke and Sukthankar 2004) to make SIFT more robust to deformations, Affine (A)SIFT (Morel and Yu 2009) to make SIFT fully affine invariant, Global (G)SIFT (Mortensen, Deng, and Shapiro 2005) for adding global characteristics, Colored (C)SIFT (Abdel-Hakim and Farag 2006) for incorporating color invariant attributes, and fully affine invariant (Fair) SURF (Pang et al. 2012) to make SIFT both fully affine invariant and computationally efficient.
4.3. AI approaches for detection in AR
4.2. Limitations of current approaches for detection
Mask Region-based CNN (R-CNN) (He et al. 2017) was
in AR & potential for AI
used for object detection and instance segmentation and
Even after several improvements on the point-based nat- subsequent 3D annotation (Park et al. 2020). Its prac-
ural features, their performance at image processing did tical utility was studied in the case of the maintenance
not portray significant improvement. They still failed in and inspection of a 3D printer. You Only Look Once
the presence of occlusion, noise, clutter, distortions, and (YOLO)9000 (Redmon and Farhadi 2017) was used for
variations in illumination characteristics. That can be detecting objects, and the exemplar-based Criminisi’s
attributed to the following factors: method (Criminisi, Perez, and Toyama 2003) was used
for filling the void created by removing the object in case
• Identification, modelling, and inclusion of all the fea- of an inpainting application (Kim et al. 2018). The use of
tures is a formidable task. Every additional improve- ML eliminated the selection of the area to be removed
ment created a new feature descriptor, which also and replaced it with label selection. A CNN powered
lacked some other characteristics, e.g. CSIFT lacked by AlexNet (Krizhevsky, Sutskever, and Hinton 2012)
14 C. K. SAHU ET AL.

was used in DeepAR (Akgul, Penekli, and Genc 2016) combat overfitting, softmax classifier, and parallel com-
for detecting 2D images of the target as part of the puting. It was succeeded by VGG-Net (Simonyan and
camera-based tracking mechanism. SqueezeNet (Iandola Zisserman 2014), which endorsed the use of deeper net-
et al. 2016), trained on the pattern analysis, statisti- works, stacked convolutional layers, and smaller filters
cal modelling, and computational learning (PASCAL) with stride and transfer learning. GoogLeNet (Szegedy
Visual Object Classes (VOC) (Everingham et al. 2010), et al. 2015) used inception modules consisting of parallel
was implemented on a mobile device for constructing convolutional layers with filters of different sizes to allow
a fast markerless indoor mobile AR system (Subakti multi-scale processing. ResNet (He et al. 2016) advocated
and Jiang 2018). Another fast markerless indoor mobile the use of extremely deeper networks to learn better fea-
AR system (Subakti and Jiang 2018) was powered by tures and the use of additional residual connections for
MobileNet (Howard et al. 2017; Sandler et al. 2018) for mitigating the exploded computational expense.
recognising different machines and retrieving machine Recurrent NNs (RNNs) have also been explored for
status in an indoor industrial setting. The MobileNet clas- image classification (Visin et al. 2015) as an alterna-
sified the image and detected the boundary for labeling. tive to CNNs. SqueezeNet (Iandola et al. 2016) based
The AR application offered the user broad characteris- on AlexNet, MobileNet (Howard et al. 2017) based
tics of the machine when the user was far and detailed on depth-wise separable convolutions, and ShuffleNet
machine information when the user was close. (Zhang et al. 2018) based on channel shuffling and point-
wise group convolutions are a few more CNNs developed
for edge devices that have performance comparable to the
4.4. Potential AI approaches for detection in AR
state-of-the-art architectures.
Most of the current work utilising DL techniques in AR
applications limits the utility of DL at detecting objects 4.4.2. Object detection
using rectangular bounding boxes. However, DL offers In addition to classifying the image (or video frame) for
precise pixel-level details. The complete understanding (an) object(s), the object detection task also has to localise
of a scene primarily consists of four tasks: image clas- it. Usually, the DL techniques detect approximate bound-
sification, object detection, semantic segmentation, and aries and bound the detected objects using rectangular
instance segmentation, in the order of increasing fine- bounding boxes. Most object detection algorithms can
grained inference (Garcia-Garcia et al. 2017; Wu, Sahoo, be broadly categorised into two classes: multi-stage and
and Hoi 2020). They are briefly discussed below. single-stage.

4.4.1. Image classification 4.4.2.1. Multi-stage approach. This is a traditional


Image classification attributes the whole scene into a sin- approach where a set of candidate bounding boxes
gle class or labels the scene using a single label. The NN (region proposals) are generated first. The features are
acts as an operator to attribute the whole scene to a sin- extracted from those boxes using a CNN. And then, the
gle class with a significantly higher probability than other proposed regions are classified into different categories
classes. Although image classification by itself does not for labeling them.
have significant importance in AR applications, it is a pre- R-CNN (Girshick et al. 2014) was the first success-
decessor to subsequent tasks of object detection and oth- ful attempt at object detection. It integrated selective
ers. The successful architectures in image classification search (Uijlings et al. 2013) for region proposal with
are adopted for all other tasks. AlexNet (Krizhevsky, Sutskever, and Hinton 2012) for
LeNet-5 (LeCun et al. 1998) was among the first feature extraction, and a Support Vector Machine (SVM)
CNNs to taste success on a comparatively simple for classification. Its multi-stage training pipeline needed
Modified National Institute of Standards and Technol- separate training for each stage. Thus, it was slow, and
ogy (MNIST) dataset consisting of handwritten digits. the overall optimisation was difficult. Hence, it was devel-
AlexNet’s (Krizhevsky, Sutskever, and Hinton 2012) suc- oped into Fast R-CNN (Girshick 2015) to jointly train
cess in classifying the relatively complex and diverse for classification and regression of the bounding boxes. It
ImageNet database (Deng et al. 2009) triggered the also supported parameter sharing. However, the external
widespread adoption of CNNs for image classification. It region proposal by the selective search was still a bottle-
revolutionised the task of image classification by includ- neck. It was replaced by a region proposal network (RPN)
ing the rectified linear unit (ReLU) (Nair and Hin- based on a Fully Convolutional Network (FCN). That
ton 2010) for improving training times, repeated blocks gave birth to Faster R-CNN (Ren et al. 2015), a completely
of convolutional layers and max-pooling layers, data aug- end-to-end trainable CNN-based framework with near
mentation, dropout regularisation (Hinton et al. 2012) to real-time performance.
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 15

Spatial Pyramid Pooling (SPP)-Net (He et al. 2015), (Simonyan and Zisserman 2014), and GoogLeNet (Szeg
Feature Pyramid Network (FPN) (Lin et al. 2017), edy et al. 2015), replaced the final fully-connected
Region-based FCN (R-FCN) (Dai et al. 2016), Mask layer with fully-convolutional layers. Although it was
R-CNN (He et al. 2017) etc. are a few more architectures quite successful, it lacked global contextual informa-
for object detection that follow a multi-stage approach. tion, instance awareness, and real-time performance.
Although accurate, all these multi-stage approaches are Encoder-decoder variants referred to as semantic pixel-
expensive in space and time. That limits their efficacy in wise segmentation network (SegNet) (Badrinarayanan,
AR applications which call for real-time performance. Kendall, and Cipolla 2017) and U-Net (Ronneberger,
Fischer, and Brox 2015) were also developed, where a
4.4.2.2. Single-stage approach. In a single-stage appro set of upsampling and deconvolutional layers preceded
ach, both the location of the objects and the class prob- the final softmax classifier which is used for predicting
abilities are predicted from the full image using a single pixel-wise class probabilities. Dilated convolutions were
CNN. used to aggregate multi-scale contextual information (Yu
Overfeat (Sermanet et al. 2013), based on FCNs, was and Koltun 2015; Chen et al., “DeepLab,” 2017; Paszke
the first single-stage object detector that received signifi- et al. 2016; Wang et al., “Understanding Convolution,”
cant recognition. It had sub-par accuracy compared to R- 2018). As the convolutional filters operate at a partic-
CNN (Girshick et al. 2014). YOLO (Redmon et al. 2016) ular scale, and hence the extracted features are limited
divides the image into a G × G grid. Each cell has to to that scale, multi-scale networks were developed. The
predict the object in the cell. Each of those grid cells multi-scale networks use multiple networks targeting dif-
predicts b bounding box locations and c class probabil- ferent scales and fuse the individual outputs to produce
ities. YOLO directly predicted the bounding boxes and the final output (Eigen and Fergus 2015; Bian, Lim, and
associated probabilities from the pixels. YOLO show- Zhou 2016; Roy and Todorovic 2016; Zhao et al. 2017).
cased outstanding real-time performance of 45 Frames ParseNet (Liu, Rabinovich, and Berg 2015) concatenated
Per Second (FPS), and a miniature version referred to global features from a preceding layer to the local fea-
as ‘Fast YOLO’ (Redmon et al. 2016) offered 155 FPS. tures of a succeeding layer. RNNs have also been utilised
YOLO was further developed as YOLO9000 (Redmon to capture the short- and long-spatial relationships in
and Farhadi 2017) for detecting more than 9000 object ReSeg (Visin et al. 2016) and other works (Pinheiro and
categories, and as YOLOv3 (Redmon and Farhadi 2018) Collobert 2014; Byeon et al. 2015). Adversarial networks
for improving speed and accuracy. However, because of have also been explored for semantic segmentation (Luc
the coarse grid structure, and as each of the cells can only et al. 2016; Souly, Spampinato, and Shah 2017; Hung
be attributed to one class, YOLO was unable to prop- et al. 2018). The Efficient NN (ENet) (Paszke et al. 2016)
erly detect small objects which may be present in groups. developed using dilated convolutions for semantic seg-
To overcome these shortcomings, Single-Shot Detector mentation offers good real-time performance. It is a
(SSD) (Liu et al. 2016) was developed. SSD consolidated promising solution for AR.
concepts from YOLO, multi-scale convolutional features
(Hariharan et al. 2016) and RPN from R-CNN (Girshick 4.4.4. Instance segmentation
et al. 2014) for faster detection without compromising the Instance segmentation is the most advanced state of scene
accuracy of detection. understanding, which requires differentiating different
AttentionNet (Yoo et al. 2015), G-CNN (Najibi, Raste- instances of the same object in addition to semantic seg-
gari, and Davis 2016), Multibox (Erhan et al. 2014), mentation. Lack of knowledge of the number of instances
Deconvolutional SSD (DSSD) (Fu et al. 2017), Deeply of a particular object in the scene makes it more diffi-
Supervised Object Detector (DSOD) (Shen et al. 2017) cult than other tasks. It is a partially solved problem and
etc. are few other single-stage object detectors. requires significant further research. Successful instance
segmentation can offer information about occlusion that
4.4.3. Semantic segmentation can help in AR-assisted precise robotic pick-and-place
Unlike classification and detection, in which a group of applications among several other applications.
pixels is attributed to a single class, semantic segmentation Simultaneous Detection and Segmentation (SDS)
attributes each pixel to a particular class. The DL learns (Hariharan et al. 2014), built upon R-CNN (Girshick
a pixel-wise classifier. Semantic segmentation identifies et al. 2014), is one of the first significant attempts at
almost precise pixel-level boundaries of the objects. instance segmentation. Mask R-CNN (He et al. 2017)
FCN (Long, Shelhamer, and Darrell 2015), built upon added a third output to Faster R-CNN (Ren et al. 2015)
state-of-the-art image classification models like AlexNet to compute the binary mask for segmentation in addi-
(Krizhevsky, Sutskever, and Hinton 2012), VGG16 tion to two other outputs: bounding box offset and class
16 C. K. SAHU ET AL.

Figure 4. Visual depiction of the topics reviewed in tracking.

label. Mask Labeling (MaskLab) (Chen et al. 2018), based 5. Tracking


on Faster R-CNN (Ren et al. 2015), produces three out-
Detection can lead to an estimation of the pose of the
puts: box detection, direction prediction, and semantic
camera in a static scene. In reality, no interactive envi-
segmentation. The direction prediction estimates each
ronment is static. Even minute movements of the camera
pixel’s direction towards its center and hence assists in
alter the viewpoint and disturb the alignment between
separating instances of the same class. Path Aggrega-
the virtual and real content. Objects in the scene can
tion Network (PANet) (Liu et al. 2018), based on FPN
also be under motion. So, tracking is essential to esti-
(Lin et al. 2017) and path augmentation, was also devel-
mate the pose of the camera in real-time. An effective
oped. Mask R-CNN (He et al. 2017), boundary-aware
tracker needs to be accurate within fractions of a mil-
instance segmentation (Hayder, He, and Salzmann 2017),
limetre in distance and degrees in orientation. It has to be
MaskLab (Chen et al. 2018), and PANet (Liu et al. 2018)
quick enough to maintain temporal coherency. The com-
produce instances with bounding boxes, not pixel-wise
bined latency of the tracker and the renderer has to be low
segmentations.
enough to maintain sensible registration. Figure 4 depicts
A multi-task network cascade (Dai, He, and Sun 2016)
the organisation of the discussed tracking techniques.
consisting of three networks was developed for instance-
aware semantic segmentation. The three networks were
dedicated to differentiating instances, categorising obje
5.1. Current approaches for tracking in AR
cts, and estimating masks. In addition to multi-task net-
work cascade, TensorMask (Chen et al. 2019), DeepMask The classical approaches of tracking can be broadly be
(Pinheiro, Collobert, and Dollár 2015), and SharpMask split into feature-based and model-based. The feature-
(Pinheiro et al. 2016) predict pixel-wise instance segmen- based approaches use feature descriptors of the current
tation. images of the object to match it with the reference image.
Techniques like mean shift (Comaniciu, Ramesh, and
Meer 2000; Comaniciu and Meer 2002; Zhou, Yuan,
4.5. Conclusion
and Shi 2009), keypoint tracking by matching the point
Object detection and its allied tasks have seen unprece- neighbourhood (Lowe 2004), homography (Benhimane
dented success with the advent of DL. Their performance and Malis 2007), differential tracking like the Lucas-
has surpassed that of classical feature-based approaches. Kanade (LK) algorithm (Lucas and Kanade 1981) etc.
Currently, object detection is mostly considered to be are used to match the feature descriptors. However, such
a solved problem because of DL. However, the object feature-based methods are not effective under occlu-
detection task in AR-assisted manufacturing applications sions and changes in illumination. Sum of conditional
is still mostly dominated by fiducial markers. More- covariance (Rogério et al. 2011), M-estimation (Com-
over, most of the existing work using DL techniques in port, Marchand, and Chaumette 2003), local normalised
AR-assisted manufacturing limit DL for object detec- correlation (Irani and Anandan 1998), and maximisa-
tion for simpler augmentations like rectangular bound- tion of mutual information (Viola and Wells 1997; Panin
ing boxes to localise (Abdi and Meddeb 2017; Subakti and and Knoll 2008; Dame and Marchand 2010) are a few
Jiang 2018) or annotate (Rao et al. 2017; Park et al. 2020) techniques to increase the robustness of feature-based
the identified objects. Advanced pixel level and instance- methods.
aware techniques are yet to be adopted into AR-assisted Virtual visual servoing (VVS) with M-estimator was
manufacturing applications. used for real-time tracking of circles, lines, and spheres
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 17

in a monocular video see-through (VST) system (Com- in wearable and handheld cameras (Castle et al. 2007).
port, Marchand, and Chaumette 2003). Visual servo- It was further developed as parallel tracking and map-
ing with Lie algebra of affine transformations (Drum- ping (PTAM) (Klein and Murray 2007) and Oriented
mond and Cipolla 1999) was used in a model-based FAST and Rotated Binary Robust Independent Elemen-
tracker that used edge features (Reitmayr and Drum- tary Features (BRIEF) (ORB)-SLAM (Mur-Artal, Mon-
mond 2006). Multi-stage optical flow estimation (Neu- tiel, and Tardos 2015) for various applications.
mann and You 1998) was used to track both fiducials (col-
ored circles and triangles (Neumann and Cho 1996)) and
5.2. Limitations of current approaches for tracking
natural features (e.g. textures and corners) (Park, You,
in AR & potential for AI
and Neumann 1999). Multiple tracking hypotheses based
on the moving edges algorithm (Bouthemy 1989) were The optimisation methods necessary for tracking are
retained for robust estimation in Vacchetti, Lepetit, and prone to failure in cluttered environments, which hinders
Fua (2004) and Wuest, Vial, and Strieker (2005), instead the matching process. In Bayesian approaches, ensuring
of one like those in Drummond and Cipolla (2002) a fair representation of all possible candidate poses is
and Marchand, Bouthemy, and Chaumette (2001). The a challenging task due to the huge number of possible
tracking algorithm discussed by Vacchetti, Lepetit, and poses.
Fua (2004) can track both textured and non-textured As a whole, the classical approaches of tracking have
objects as they combine both edges and feature points. the following shortcomings:
Point features extracted using Features from Accelerated
Segment Test (FAST) corner detector were tracked by • Complex scenes involving multiple dynamic objects,
tracking-by-synthesis approach (Simon 2011), where the variations in illumination, and motion blur are dif-
points were matched using normalised cross-correlation. ficult to model. Moreover, the tracked objects may
Real-time Attitude and Position Determination (RAPiD) undergo partial or full occlusion. Capturing the vari-
(Harris 1993) was a popular markerless 3D tracking ations and impacts of occlusion still remains a hurdle
system. in traditional approaches of tracking.
The model-based approaches use a 3D model of the • The combinatorial complexity of traditional Multi-
object (Lowe 1991; Drummond and Cipolla 2002; Com- ple Object Tracking (MOT) algorithms (e.g. mul-
port et al. 2006). From a bird’s eye view, the model- tiple hypothesis tracking (MHT) (Reid 1979) and
based approaches minimise the error between the fea- joint probabilistic data association filters (Bar-Shalom,
tures extracted from the current image and the projection Fortmann, and Cable 1990; Hamid Rezatofighi et al.
of the 3D model. Ordinarily, they minimise the distance 2015)) explode exponentially as the number of tracked
between the points of the image and the projection of the objects increases. So, their complexity renders them
3D lines (Uchiyama and Marchand 2012). Edges are used ineffective for real-time tracking of multiple objects.
widely due to their computational efficiency and robust-
ness against changes in lighting. Both non-linear optimi- The elimination of hand-crafted features to design
sation methods (e.g. VVS (Comport et al. 2006), iterative trackers that are robust against occlusion and variations
reweighted least squares (Drummond and Cipolla 2002)) in the object attributes makes DL-based techniques bet-
and Bayesian methods (e.g. particle filter (Klein and Mur- ter than the conventional ones. Moreover, the complex-
ray 2006; Pupilli and Calway 2006)) were used for solving ity of the DL-based trackers is not proportional to the
the minimisation problem. number of tracked objects in the scene. Together, these
Multiple objects were tracked using image retrieval advantages make DL-based trackers a wise choice for AR
and online SfM (Kim, Lepetit, and Woo 2010). SLAM applications.
(Bailey and Durrant-Whyte 2006; Durrant-Whyte and
Bailey 2006) has been the most significant breakthrough
5.3. AI approaches for tracking in AR
in tracking applications. It can track the location of the
agent (e.g. robot, camera) in unknown environments, in To the best of our knowledge, there is not a single research
addition to creating a map of the environment using nat- work that uses AI in the tracking component of AR-
ural features. Its widespread adoption can be attributed assisted manufacturing. The only instance of AR that
to the elimination of the requirement of prior informa- uses AI-based tracking is a DCNN. The DCNN was
tion like maps or Computer-Aided Design (CAD) mod- used for temporal tracking of the 6-Degrees of Free-
els. SLAM was used for annotating unknown environ- dom (DoF) pose of an object for superimposing the vir-
ments (Reitmayr, Eade, and Drummond 2007; Reitmayr tual model of the object on the real object using two
et al. 2010) in addition to tracking. It was also used frames of inputs (Lalonde 2018). It was more resistant
18 C. K. SAHU ET AL.

to occlusions. DL-based tracking techniques have not factorised convolution operators to reduce the number
received the deserved attention. This opens up the pos- of parameters, a conservative update strategy to increase
sibility for radical applications of DL-based trackers for robustness, and a compact generative model to reduce
adoption into AR-assisted manufacturing applications. complexity. Multi-Domain Network (MDNet) (Nam and
Han 2016) and structured output deep learning tracker
(SO-DLT) (Wang et al., “Transferring Rich Feature,”
5.4. Potential AI approaches for tracking in AR
2015) are a few early DL methods to utilise CNNs for
From a functional viewpoint, we split the tracking meth- tracking. They were limited by real-time performance.
ods into two parts: single object tracking (SOT) and Successive networks like DeepSRDCF (Spatially Reg-
MOT, depending on the number of objects being tracked. ularised Discriminative Correlation Filters) (Danelljan
Although tracking a single object is sufficient to estimate et al. 2015) and fully convolutional network based tracker
the camera pose, SOTs are prone to failure in case of (FCNT) (Wang et al., “Visual Tracking,” 2015) used
occlusion or insignificant relative motion between the shallow networks to process pre-trained CNNs as fea-
tracked object and the camera. In the case of occlu- tures to improve their performance. The first fully con-
sion, the camera pose can be estimated by tracking other volutional Siamese network for tracking, SiamFC, was
objects that are not occluded. That makes MOTs a bet- powered by AlexNet (Krizhevsky, Sutskever, and Hin-
ter choice. From a technical perspective, the tracking ton 2012) and had a cosine window for temporal con-
approaches can be categorised into detection and track- straints (Bertinetto et al. 2016). SiamFC had poor gen-
ing (Wagner et al. 2008; Pilet and Saito 2010) and tracking eralisation capability and its performance degraded if the
by detection (Lepetit and Fua 2006; Ozuysal et al. 2009). target experienced significant changes to its appearance.
In detection and tracking, the object in the current frame Semantic appearance Siamese network (SA-Siam) (He
is detected. Its instances in the future frames are tracked et al. 2018b), inspired from ensemble trackers, branched
using a tracking algorithm. Tracking by detection detects into an appearance branch and a semantic branch. Each of
the object in every frame and tracks it using associa- the branches was a Siamese network. The networks were
tion techniques. It does not utilise the location infor- trained separately and were fused after computing the
mation from the previous frames. As the detection and similarity scores. It offered comparatively better real-time
tracking methods utilise the information from the past performance.
frames, they are computationally inexpensive and are Most of the prior Siamese networks like SiamFC
quicker than tracking by detection-based methods. How- (Bertinetto et al. 2016), Siamese INstance search Tracker
ever, detection and tracking methods are prone to failure (SINT) (Tao, Gavves, and Smeulders 2016), SA-Siam (He
in the presence of multiple similar objects, or in cases et al. 2018b), and Siamese RPN (SiamRPN) (Li et al. 2018)
which affect the visibility of the tracked object (e.g. under used shallow networks like AlexNet (Krizhevsky, Sutske
occlusion or instances where the tracked object moves ver, and Hinton 2012). Despite replacing the shallower
beyond the Field Of View (FOV) of the camera). Due to AlexNet foundation with deeper networks like ResNet
the presence of numerous unknown objects, which calls (He et al. 2016) or GoogLeNet (Szegedy et al. 2015),
for extraction of unknown features, MOT tasks are heav- the performance of the network did not improve signifi-
ily dominated by tracking by detection approaches. Most cantly. The authors of SiamFC+, SiamRPN+ (Zhang and
of the SOT tasks use detection and tracking approaches Peng 2019), and SiamRPN++ (Li et al., “Siamrpn++,”
as the initial bounding box for the target object can be 2019) identified two reasons for the limited perfor-
prescribed or estimated. mance: (1) deeper networks restricted the translation
invariance; and (2) the network padding used in deeper
5.4.1. Single object tracking networks induced a positional bias in the learning.
Discriminative correlation filter (DCF) based approa It hindered the object relocalization. SiamFC+ and
ches, which exploit properties of circular correlations, SiamRPN+ (Zhang and Peng 2019) were equipped with
used to dominate most of the tracking benchmarks in a set of cropping-inside residual (CIR) units, and those
terms of accuracy and robustness. Further improve- CIR units were stacked to create deeper and wider net-
ments of DCF-based approaches for increasing the track- works. SiamRPN++ (Li et al., “Siamrpn++,” 2019)
ing speed reduced the accuracy significantly (Bolme was trained using a spatially aware sampling strategy to
et al. 2010). So, pre-trained CNNs were appended to facilitate spatial invariance. It incorporated a multi-layer
increase their performance. DCF-based methods needed aggregation module and a depth-wise correlation layer
large models and had a huge number of parameters. on top of SiamRPN (Li et al. 2018). SiamRPN++ (Li
Thus, they were prone to overfitting. Efficient Convolu- et al., “Siamrpn++,” 2019) could achieve 35 FPS with a
tion Operators (ECO) (Danelljan et al. 2017) introduced ResNet-50 (He et al. 2016) backbone and 70 FPS with a
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 19

MobileNet (Howard et al. 2017) backbone. The Siamese 5.4.2. Multiple object tracking
networks were further developed as a Distractor-aware MOT algorithms have to detect and track multiple
Siamese RPN (DaSiamRPN) (Zhu et al. 2018) for improv- objects without prior information about the number of
ing robustness against occlusion and clutter, memory target objects and their appearance. The detected objects
networks (Yang and Chan 2018) using Long Short-Term in MOT have to be tagged for proper identification. In
Memory networks (LSTMs) to capture the variation of addition to occlusion and interaction among the objects,
the target’s appearance, the Siamese better match net- inter-class similarity and intra-class variation make the
work (Siam-BM) (He et al. 2018a) to handle rotations, MOT task quite challenging. Most DL-based MOT algo-
Dynamic Siamese network (DSiam) (Guo et al. 2017) to rithms use the tracking by detection approach because of
learn background suppression and variation in the target the number of unknown objects. In a tracking by detec-
appearance, and Residual Attentional Siamese Network tion framework, there are two tasks: detecting the moving
(RASNet) (Wang et al., “Learning Attentions,” 2018) to object in every frame and tracking them by associat-
enhance discriminative power and reduce overfitting. ing the instances of the same object in each frame. So,
The networks like SiamFC (Bertinetto et al. 2016), SA- holistically, tracking is a spatio-temporal task.
Siam (He et al. 2018b), and SiamRPN (Li et al. 2018) Before the advent of DL techniques in object detec-
track the target object using axis-aligned rectangular tion, the detection task was the bottleneck of tracking
bounding boxes. Although the rectangular bounding box applications. After the success of DL techniques in object
is efficient, it is far from an accurate representation of detection, many trackers used state-of-the-art DL-based
the object. Moreover, for AR applications where the detectors to detect multiple objects. So, simple online and
pose of the object is necessary, a rectangular bound- real-time tracking (SORT) (Bewley et al. 2016) included
ing box does not meet the necessity. SiamMask (Wang faster R-CNN (Ren et al. 2015) for detecting multi-
et al. 2019) produced both rotating bounding boxes and ple objects. It employed Kalman filters (Kalman 1960)
object segmentation masks at 55 FPS by including two for motion prediction and the Hungarian method
more tasks of bounding box regression using an RPN (Kuhn 1955) for data association. To capture the tem-
(Ren et al. 2015; Li et al. 2018) and class agnostic binary poral information of the videos, the spatial features
segmentation (Pinheiro, Collobert, and Dollár 2015) in extracted by SSD (Liu et al. 2016) were fed to LSTMs
addition to the usual similarity metric of the Siamese for regressing and classifying locations of the objects in
networks. More importantly, SiamMask can estimate the addition to associating the features of the objects across
binary segmentation masks from a boundary box initial- the frames (Lu, Lu, and Tang 2017). Kieritz, Hubner,
isation. SiamMask_E (Chen and Tsotsos 2019) improved and Arens (2018) jointly detected and tracked multi-
the bounding box fitting process by fitting an ellipse to ple objects using an integrated network of SSD and
estimate the size and angle of rotation of the bounding RNNs. An end-to-end trainable enhanced Siamese net-
box. work (Kim, Alletto, and Rigazio 2016) was used to extract
Pure DL-based algorithms still have a long way to appearance and temporal geometric information for sup-
go to meet the expected performance in tracking prob- porting tracking.
lems. They are not as successful as they are in object In addition to detection, data association is an impor-
detection. Although Siamese networks are quite promis- tant task for tracking applications. Occlusion, dras-
ing and are successful, they could not fully own the stage tic variation in appearance or motion, and interac-
of visual object tracking challenge (VOTC) 2018 (Kris- tions of the objects impede data association. Fail-
tan et al. 2018) and 2019 (Kristan et al. 2019). Siamese ure in data association leads to fragmentation of the
networks shared the stage of VOTC 2019 with algorithms tracks, misidentification of the objects, etc. Tradition-
based on DCFs for short-term,2 long-term3 as well as real- ally, clique graphs (Zamir, Dehghan, and Shah 2012),
time4 tracking. Most of the successful trackers in VOTC global optimisation (Zhang, Li, and Nevatia 2008;
2019 used CNNs for object localisation. Instead of clas- Berclaz et al. 2011), discrete-continuous optimisation
sical DCFs, CNN-based formulation of DCFs was widely (Andriyenko, Schindler, and Roth 2012), object re-
used. The successful Siamese trackers used Siamese cor- identification (Kuo and Nevatia 2011) etc. were used
relation filters. Many networks combined characteristics for data association. These traditional approaches are of
of both Siamese networks and DCFs. Unlike VOTC 2018, discrete combinatorial complexity. So, to improve the
the best VOTC 2019 trackers (except FuCoLot in long- data association in the case of MHT, MHT was com-
term tracking) did not use hand-crafted features. The top bined with bi-linear LSTMs (Kim, Li, and Rehg 2018) to
trackers relied on a deeper ResNet backbone. The future incorporate motion and appearance models and LSTM-
trackers seem to be deeper hybrid networks utilising CNNs (Chen, Peng, and Ren 2018) to handle long
concepts from both Siamese networks and DCFs. temporal variations in the object. Further, only RNNs
20 C. K. SAHU ET AL.

were utilised in three blocks for motion prediction, state capture the intra-class variations, which need not be the
update, and track management (initiation and termi- case in generic MOT. Generic MOT lays almost equal
nation) for online MOT (Milan et al. 2017). Although emphasis on both intra- and inter-class variations. The
it was quicker ( ≈ 165 FPS), its accuracy was sub-par. SOT datasets include tracking of generic objects (e.g.
Multiple LSTM models were used for learning position, VOTC 20206 , object tracking benchmark (Wu, Lim, and
velocity, and appearance features in MOT (Liang and Yang 2013)). The most important reason for the lack of
Zhou 2018). However, LSTMs are known to fail at fully attention in generic MOT is the unavailability of germane
capturing the temporal information across the frames, datasets; the first generic MOT dataset was developed in
as tracking is a spatio-temporal task. So, leveraging both 2020 (Dave et al. 2020) with labeled bounding boxes. We
CNNs and LSTMs is quite intuitive. Thus, an RNN-based expect generic MOT to gain significant momentum in the
appearance model built upon a CNN was used for MOT future. That will expedite the development of AR tracking
(Sadeghian, Alahi, and Savarese 2017). The appearance technologies in general and in manufacturing applica-
model was supplemented with motion and object inter- tions. Additionally, state-of-the-art SOTs like SiamMask
action models based on LSTMs. State-of-the-art Siamese (Wang et al. 2019) and SiamMask_E (Chen and Tsot-
networks were combined with LSTMs to capture the sos 2019) can track segmentation masks, whereas MOT
appearance, motion, and velocity cues for MOT (Wan, methods are still exclusively using axis-aligned rectangu-
Wang, and Zhou 2018). The tracks were initialised using lar bounding boxes to track objects.
traditional methods (Kalman filters (Kalman 1960) with
Hungarian methods (Kuhn 1955) and LK optical flow
6. Camera pose estimation
(Lucas and Kanade 1981) with Intersection-over-Union
(IoU) distance computation) for feeding into the LSTMs. The ultimate goal of both detection and tracking is to
The data association task can also be solved by esti- estimate the camera pose. Both detection and tracking
mating a similarity score among different frames by produce information about the pose of the object, from
using CNNs only. Siamese networks, which are highly which the pose of the camera can be estimated. The trans-
successful in SOT, have become the de facto choice to formation from the pose of the object to the camera is
compute the similarity score. A Siamese CNN and an termed pose estimation. In general, pose estimation is the
appearance-based tracklet affinity model were jointly process of estimation of the position and orientation of
learned for increasing the robustness against track frag- the camera from the scene. Pose estimation is different
mentation and identity switching (Wang et al. 2016). from visual place recognition in the sense that visual
Spatio-temporal descriptors learned from a Siamese place recognition offers the topological location without
CNN were combined with gradient boosting to asso- the exact position, whereas pose estimation offers both
ciate the image patches (Leal-Taixé, Canton-Ferrer, and global position and orientation of the camera. A precise
Schindler 2016). It was validated using linear program- 6-DoF pose of the camera is essential for a realistic blend
ming. Spatio-temporal motion features were added to of virtual content. The pose estimation techniques can be
a Feature Pyramid Siamese Network (FPSN) to supple- broadly categorised as shown in Figure 5.
ment the similarity in appearance information of the
Siamese networks with motion information (Lee and
6.1. Current approaches for camera pose
Kim 2018). Additionally, reinforcement learning (Ren
estimation in AR
et al. 2018; Rosello and Kochenderfer 2018) and autoen-
coders (AEs) (Wang et al. 2014) have been explored for As the camera pose consists of six independent param-
MOT applications. eters, ideally, three points (P3P - Pose estimation from
3 points) should suffice. However, the solution involves
a quartic polynomial. That calls for an additional point
5.5. Conclusion
to disambiguate the results. However, multiple point cor-
Most of the MOT tasks focus on tracking vehicles (in respondence (PnP - pose estimation from n points) is
autonomous driving applications e.g. KITTI (Geiger, preferred, as the accuracy of the estimated pose increases
Lenz, and Urtasun 2012; Geiger et al. 2013), traffic scenes with the number of point correspondences. The over-
(Wen et al. 2015)), or people5 (pedestrians in crowded constrained pose estimation problem gets transformed
places or on the street). AR applications, in general, and into an optimisation problem. Techniques like DLT
in manufacturing applications, focus on generic MOT. and pose from orthography and scaling with iterations
So, generic MOT still needs considerable focus as the (POSIT) (Dementhon and Davis 1995) can solve the
premise of the two problems are quite different. Track- optimisation problem. DLT can find the complete pro-
ers for tracking pedestrians or vehicles have to mostly jection matrix, which consists of both camera intrinsics
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 21

Figure 5. Visual depiction of the topics reviewed in camera pose estimation.

and extrinsics, by solving the linear system of equations camera pose from square fiducial markers (Ababsa and
using Singular Value Decomposition (SVD). However, Mallem 2004). POSIT (Dementhon and Davis 1995) was
estimating only the extrinsics yields better results in 3D used to calculate the pose of a camera from binocular
tracking. If the internal parameters are known, the for- images and retro-reflective markers in a stereo-vision
mulation becomes over-parametrised; this reduces sta- system (Dorfmüller 1999). POSIT for coplanar points
bility. POSIT is an iterative process, which has become (Oberkampf, DeMenthon, and Davis 1996) was used
a standard method to solve the PnP problem. Although along with RANSAC for determining the pose of the
it does not need any initialisation, it is sensitive to noise camera in a markerless mobile AR system for geovisuali-
and is unsuitable for coplanar points. Homography (Ben- sation (Schall et al. 2008). POSIT was also used by Bleser,
himane and Malis 2007) can be used for estimating the Pastarmov, and Stricker (2005) with RANSAC, and with
pose from coplanar points. In the case of noisy data, the Tukey estimator by Lepetit et al. (2003) for AR applica-
camera pose is estimated by minimising the sum of repro- tions. A non-iterative PnP algorithm (Moreno-Noguer,
jection errors. That requires an iterative optimisation Lepetit, and Fua 2007) with RANSAC was utilised for
scheme, which again needs an initialisation. If the total approximating the camera pose from keyframes (Park,
error is linear, linear least squares or iterative re-weighted Lepetit, and Woo 2008). P3P with RANSAC was used
least squares can be used. If the total reprojection error to find the 3D camera pose from natural features in an
is non-linear, Gauss-Newton or LM algorithm can be HMD. P5P was used with RANSAC to estimate the cam-
employed. The pose increment in LM includes an addi- era pose relative to the user’s hand in a tabletop AR setup
tional stability factor. If the stability factor is small, LM (Lee and Hollerer 2008).
imitates the Gauss-Newton algorithm. If the stability fac-
tor is large, the LM behaves like gradient descent (Lepetit
and Fua 2005). As pose estimators face a large number
6.2. Limitations of current approaches for camera
of outliers from the images, Random Sample Consen-
pose estimation in AR & potential for AI
sus (RANSAC) (Fischler and Bolles 1981), Progressive
Sample Consensus (PROSAC) (Chum and Matas 2005), The classical approaches are plagued by a few
or M-estimators (like Huber estimator (Huber 1992), limitations:
Tukey estimator (Rousseeuw and Leroy 2005)) can be
used for improving the robustness. M-estimators need an • Despite the popularity of classical approaches, they
initial pose estimate, whereas RANSAC lacks precision. lack robustness. Large changes in viewpoint (e.g. at
Additionally, Bayesian techniques like Kalman filters and corners of buildings/machines), variation in illumina-
particle filters can also be utilised for pose estimation. tion, textureless scenes, motion blur, etc. deteriorates
Numerous works used classical methods to estimate their performance. They are unable to find accurate
the pose of the camera in AR applications. A particle fil- matching points in all scenarios.
tering framework was used for estimating the pose of a • The computational complexity of the SfM based
handheld camera for real-time AR applications (Ababsa approaches grows with the size of the model. With
and Mallem 2011). It tracked both point and line fea- larger models, the proportion of outliers increases
tures. Extended Kalman filter (EKF) was employed to considerably. Although RANSAC-like algorithms can
estimate the camera pose from square-shaped fiducials be used to increase robustness, computational expense
(Maidi et al. 2010). Four points were matched to esti- shoots high in the presence of a large number of out-
mate the initial pose. Orthogonal iteration algorithm (Lu, liers. Moreover, RANSAC does not have a constant
Hager, and Mjolsness 2000) was utilised to estimate the runtime at inference.
22 C. K. SAHU ET AL.

On the other hand, as DL-based techniques do not of the feature vector (Walch et al. 2017). Additionally,
require engineered features, and they can be trained to alternative architectures like VGG-Net (Naseer and Bur-
be robust against changes in illumination, perspective, gard 2017) and BranchNet (Wu, Ma, and Hu 2017) that
and others. The complexity of DL techniques remains split the GoogLeNet into two branches for learning the
relatively constant. Unlike RANSAC, the performance of high-level features of translation and orientation have
DL remains consistent at runtime. Additionally, Kendall, also attempted to estimate the camera pose.
Grimes, and Cipolla (2015) showed that monocular Even after several architectural modifications, the DL-
images have a one-to-one (injective) relationship with the methods were not able to achieve performance compa-
pose of the camera. So, while the classical approaches rable to the classical methods. The reason was identified
have to follow a sequential approach of detection, track- as the limited capacity of the model to learn the 3D
ing, and pose estimation, DL can directly infer the pose structural connections and constraints from monocular
from the RGB images by skipping the detection and images. So, auxiliary learning methods were employed by
tracking tasks. Valada, Radwan, and Burgard (2018) to improve camera
localisation. The primary task of camera localisation was
supplemented with the auxiliary task of visual odome-
6.3. AI approaches for camera pose estimation in AR
try in visual localisation and odometry network (VLoc-
To the best of our knowledge, there is not a sin- Net) (Valada, Radwan, and Burgard 2018). It used both
gle instance of research work that uses AI to estimate a global pose regression network and a Siamese (based
the pose of the camera in AR-assisted manufactur- on ResNet (He et al. 2016)) relative pose estimation net-
ing applications. The only work that used AI trained work. VLocNet++ (Radwan, Valada, and Burgard 2018)
a CNN using RGB images for fine-tuning the coarse learned the odometry and semantic segmentation of the
pose acquired from a consumer-grade Global Positioning scene as auxiliary tasks.
System (GPS)/Inertial Measurement Unit (IMU) (Wang The absolute pose of the camera is dependent on
et al., “DeLS-3D,” 2018). They used RNNs to learn the the global coordinate frame. Constructing a generic DL
temporal correlation of the pose. To the best of our model to learn the camera pose from all scenes is chal-
knowledge, neither their work nor any other work in AR lenging, as every scene has a different coordinate sys-
used RGB images for directly estimating the pose from tem. A model trained on a particular scene learns the
images. mapping from that particular scene to the camera pose.
Altering the scene alters the training data. Hence, the
trained model is likely to fail in other scenes. That neces-
6.4. Potential AI approaches for camera pose
sitates re-training of the models with new scenes. So,
estimation in AR
there has been a considerable thrust to learn the rel-
PoseNet (Kendall, Grimes, and Cipolla 2015) conceived ative pose of the camera to decouple the coordinate
the idea of directly estimating the 6-DoF pose includ- frame of the scene from the task of pose estimation.
ing the depth and out-of-plane rotation of a monocular Efforts at estimating the relative pose of sample images
camera in real-time directly from a single RGB image with respect to reference images include Siamese CNNs
by using a pre-trained GoogLeNet (Szegedy et al. 2015). (Laskar et al. 2017; Balntas, Li, and Prisacariu 2018)
For training, it used SfM to label the camera pose for and hourglass networks (Melekhov et al. 2017; a sym-
each frame in the video. It was robust to variations metric encoder-decoder structure based on ResNet
in lighting, motion blur, and camera intrinsics as it (He et al. 2016)).
localised the camera pose from deep features. More- Additionally, in structure-based learning techniques
over, it was fast (5 ms/frame) and had a low memory like Hierarchical Feature Network (HF-Net) (Sarlin
footprint. However, its accuracy in both outdoor (2 m et al. 2019), which supplements RGB images with a
and 6◦ ) and indoor (0.5m and 10◦ ) settings was poorer 3D model of the scene for camera localisation, the best
than feature-based state-of-the-art methods. Efforts at hypothesis was selected using RANSAC (Fischler and
increasing the accuracy of PoseNet included incorporat- Bolles 1981). As RANSAC (Fischler and Bolles 1981)
ing a geometric loss function (Kendall and Cipolla 2017) is not differentiable, it prohibited end-to-end learning.
and a Bayesian CNN (MacKay 1992) to model the uncer- DSAC (differentiable RANSAC, Brachmann et al. 2017)
tainty (Kendall and Cipolla 2016). PoseNet was prone to substituted the hard hypothesis selection in RANSAC
overfitting. Although it used dropout regularisation to with a probabilistic one and enabled end-to-end learn-
overcome overfitting, its accuracy did not improve. So, ing of the camera pose. DSAC used two CNNs, one
the final 2048-dimensional fully connected layer of the for predicting the scene coordinates and creating a pool
CNN was fed to an LSTM to reduce the dimensionality of hypotheses, and another scoring CNN for scoring
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 23

Figure 6. Schematic diagram of the topics reviewed in illumination.

the hypotheses for global consensus. Brachmann and 7.1. Illumination


Rother (2018) replaced the scoring CNN with a soft
For accurate visual registration and coherence, the illu-
inlier count to reduce overfitting. Moreover, Brach-
mination characteristics of the scene must be determined.
mann and Rother (2018) also showed that RGB-images
Determining the illumination of a scene from a single
with ground truth pose are enough for regressing
image, captured using a low dynamic range (LDR) cam-
the scene coordinates. The 3D model of the scene is
era with a limited FOV, is a challenging task because of
optional.
the fact that the scene is illuminated by light received
from the full sphere of directions around the object, most
6.5. Conclusion of which are beyond the FOV of the camera. Its solu-
tion needs modelling of the geometric and photometric
DL-based techniques cannot be used without appro- information contained in the image. The structure of the
priately labeled data. Acquiring the ground truth pose reviewed illumination techniques is pictorially depicted
information is not as easy as it is in detection and track- in Figure 6.
ing. Currently, classical methods are used to acquire the
ground truth pose. That makes DL-techniques depen- 7.1.1. Current approaches for illumination in AR
dent on classical methods. Moreover, as the data collec- Most of the methods for estimating the illumination char-
tion process is expensive, the pose labels in the training acteristics of a scene can be broadly classified into those
data are usually sparse. Techniques based on odometry that use auxiliary information and those that do not use
accumulate drift, thereby deteriorating the accuracy of auxiliary information. Methods utilising auxiliary infor-
the pose estimates over time. Although most DL-based mation leverage RGB-D data (Jiddi, Robert, and Marc-
techniques are pre-trained on ImageNet-like datasets, it hand 2016) or information acquired from light probes or
is important to keep in mind that pose estimation and that from other sources. The light probes are quite pop-
image classification are largely unrelated tasks. That lim- ular. They can be active (like the fisheye camera used
its the effectiveness of transfer learning. So far, DL-based by Kán and Kaufmann 2013) or passive (like reflective
techniques could not achieve accuracy at par with classi- spheres used by Debevec 2008). The auxiliary informa-
cal structure-based methods. Sattler et al. (2019) argued tion can also be assumptions of some attributes or simpler
that the reason behind the limited accuracy is that the models like Lambertian illumination (Lu et al. 2010).
CNN-based absolute pose regression models are close The radiance transfer function was used by Knorr
to image retrieval which are ordinarily used for place and Kurz (2014) to estimate the lighting on a face, and
recognition, not pose estimation. it was expressed using spherical harmonics. A hemi-
spherical photon emission model was used to estimate
the global illumination with the presumption of uni-
7. Inverse rendering
formly distributed multiple light sources (Zhang, Zhao,
An image is formed by integrating illumination, scene and Wang 2019).
geometry, camera properties, and surface reflectance. Methods that do not use any auxiliary information
Inverse rendering involves recovering the constituent attempt to directly estimate the lighting of the scene
factors from a rendered image. Thus it is an under- from various features contained in the image itself. Shad-
constrained problem. We focus on inferring illumination ows (Panagopoulos et al. 2011; Jiddi, Robert, and Marc-
(photometry) and depth (geometry) characteristics of a hand 2017) and faces (Shim 2012; Yi et al. 2018) are
scene. widely used. Kán, Unterguggenberger, and Kaufmann
24 C. K. SAHU ET AL.

(2015) acquired the reflection maps of the virtual RGB LDR image of an indoor scene. Deep NNs (DNNs)
objects by convolving the image using bidirectional have also been used for estimating the illumination char-
reflectance distribution functions (BRDFs). Sato, Sato, acteristics in real-time for AR applications (Xu, Li, and
and Ikeuchi (2003) estimated the distribution of light by Zhang 2020).
exploiting the correlation between the brightness of the
image with that of shadows. Gruber, Richter-Trummer, 7.1.4. Potential AI approaches for illumination in AR
and Schmalstieg (2012) estimated the lighting conditions Gardner et al. (2017) initiated the inference of HDR illu-
of the scene using arbitrary geometry for photometric mination from single limited FOV LDR images using
registration. DCNNs in an end-to-end manner. They used LDR and
a sphere to approximate the scene geometry, which are
7.1.2. Limitations of current approaches for weak approximations to the ground truth. A cascade of
illumination in AR & potential for AI AEs was used by Li et al., “Inverse Rendering” (2020)
Although the light probes and other methods using auxil- for inferring spatially varying illumination, material, and
iary information offer better visual fidelity, the capturing geometry of the scene. Cheng et al. (2018) trained a
process is complex, and the whole process is computa- DCNN to estimate the low-frequency scene illumination,
tionally expensive and not scalable. Using light probes expressed in terms of spherical harmonics, from a pair of
to infer the illumination characteristics disturbs the real- photos.
time performance. The accuracy of the methods that do In AR applications, the illumination of the rendered
not use auxiliary information is low. Additional prepro- objects has to be adapted for its location in the real
cessing steps like depth estimation, reflectance estima- scene. So, the illumination of a specific location (Garon
tion, and scene geometry reconstruction are employed et al. 2019) and a pixel (Song and Funkhouser 2019) were
for improvement. On the other hand, methods based estimated using AEs. As the material of the objects affects
on AI do not need the definitive models of the geo- illumination, Meka et al. (2018) estimated the material
metric or photometric attributes. Moreover, they exploit of generic objects appearing in a single image using an
the attributes of the images within the FOV rather AE. ARShadowGAN (D. Liu et al. 2020) used a GAN
than those beyond the FOV, which indirectly affect the that leveraged an attention mechanism to create realis-
scene. tic shadows of virtual objects. The 3D CNNs used by
Srinivasan et al. (2020) showed a way for learning the
7.1.3. AI approaches for illumination in AR 3D illumination of a scene without the need for ground
None of the existing works that use AI for estimating the truth scene geometry. It used a stereo pair and a spherical
photometric characteristics are applied to manufacturing panorama for training.
applications. There are several research works that use
AI to estimate illumination characteristics in AR appli-
cations. An AE, with the encoder based on MobileNetV2 7.1.5. Conclusion
(Sandler et al. 2018), was used for estimating the omni- Most of the manufacturing environments are illuminated
directional illumination in indoor and outdoor envi- by multiple sources. So, adopting state-of-the-art AI tech-
ronments from an unconstrained LDR image acquired niques for determining the illumination characteristics of
from the limited FOV camera of a mobile AR device the scene would be advantageous. Unavailability of large
(LeGendre et al. 2019). A Generative Adversarial Net- scale data is impeding the full-fledged adoption of AI into
work (GAN)-based network (Goodfellow et al. 2014) was AR-assisted manufacturing applications to estimate the
employed for learning the perspective independent illu- illumination characteristics.
7.1.5. Conclusion
Most of the manufacturing environments are illuminated by multiple sources. So, adopting state-of-the-art AI techniques for determining the illumination characteristics of the scene would be advantageous. Unavailability of large-scale data is impeding the full-fledged adoption of AI into AR-assisted manufacturing applications to estimate the illumination characteristics.

7.2. Depth inference
The modus operandi of AR systems is to overlay virtual objects directly on top of the real video feed, thereby occluding the real objects. For spatially sensible augmentations, the virtual objects should appear to be at the same depth as the co-located real objects. The co-location is infeasible without acquiring proper depth information from the image. Depth inference also aids with mutual occlusion. The reviewed techniques for depth inference can be classified, as shown in Figure 7.
Figure 7. Visual depiction of the topics reviewed in depth inference.
7.2.1. Current approaches for depth inference in AR
The lion's share of AR applications infer depth directly from the scene using the ten perceptual depth cues recognised by Swan et al. (2007). The cues are occlusion, shading, texture gradient, motion parallax, haze, linear perspective and shortening, accommodative focus, height in the visual field, binocular convergence, and binocular disparity. Although they are popular, the perception of every user is different. That leads to inconsistent interpretation. Classical image-based approaches also try to exploit these cues. They are comparatively more consistent as the depth is directly inferred from the video feed. Stereo (Kanade and Okutomi 1994; Zhu and Pan 2008), motion (Weng, Huang, and Ahuja 1989), shadow casting, object shading (Horn and Brooks 1989), and texture (Blostein and Ahuja 1989) have been used for depth inference. Monocular cameras can also be used for depth inference using multi-focus images or acquiring images from different perspectives using a moving camera (Newcombe and Davison 2010). Breen, Rose, and Whitaker (1995) used Z-buffer and Sanches et al. (2012) used frame buffer to acquire depth maps. Hahne and Alexa (2009) combined time-of-flight range data with stereo images to compute depth images at interactive rates. Stereo matching was used to estimate the depth of the real world by projecting the bounding box of the virtual object using the model view matrix (Kanbara et al. 2000). Zhu et al. (2010) combined the depth information acquired from stereo images with color and neighbourhood information to reduce noise in handling occlusion in a video-based AR system. Dense depth maps were acquired using geo-registered images and sparse 3D models acquired from geographic information system (GIS) data using back-projection (Zollmann and Reitmayr 2012). The depth information was calculated by stereo matching after background subtraction and clustering the objects based on their silhouette in video-based AR (Zhu and Pan 2008). Optical flow and sparse SLAM reconstruction were used for inferring pixel-level depth information (Holynski and Kopf 2018). Census correlation (Zabih and Woodfill 1994) was used as the stereo matching algorithm to compute the per-pixel depth map without any prior information (Gordon et al. 2002).
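The stereo approaches above ultimately reduce to triangulation: once a per-pixel disparity d has been matched between a rectified left/right pair with focal length f (in pixels) and baseline B, depth follows as Z = f·B/d. A short illustrative sketch (the calibration values are placeholders, not parameters of any cited system):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) from a rectified stereo pair into
    metric depth (metres) via Z = f * B / d. Zero disparity means no match."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > eps
    depth[valid] = (focal_px * baseline_m) / disparity[valid]
    return depth

if __name__ == "__main__":
    # Placeholder calibration: 700 px focal length, 12 cm stereo baseline.
    disparity = np.array([[35.0, 14.0], [7.0, 0.0]])
    print(disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.12))
    # 35 px -> 2.4 m, 14 px -> 6.0 m, 7 px -> 12.0 m, 0 px -> invalid (0.0)
```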
7.2.2. Limitations of current approaches for depth inference in AR & potential for AI
The main reason for the lack of effectiveness of classical methods in recovering depth information from an image is that it is a heavily under-constrained problem. While perceptual cues can aid in recovering relative depth information from images, their efficacy gets deterred by camera characteristics like LDR. Most of the classical methods utilise only a few cues. AI methods, on the other hand, can exploit almost all the cues at once, thereby offering better depth inference.

7.2.3. AI approaches for depth inference in AR
Owing to the limited effectiveness of classical approaches at depth inference, AI-based techniques for inferring depth have been explored in AR applications. A Siamese network and a self-supervised AE based on a depth estimation network (Garg et al. 2016) and DeConvNet (Noh, Hong, and Han 2015) were trained for estimating the per-pixel inverse depth (disparity maps) from stereo image pairs without ground truth data in a surgical application (Ye et al. 2017). This is the only instance of AI in AR for inferring depth information. It is not applied within manufacturing. We briefly discuss other state-of-the-art DL techniques that are not being used in AR but can be useful for depth inference in AR-assisted manufacturing applications.

7.2.4. Potential AI approaches for depth inference in AR
Eigen, Puhrsch, and Fergus (2014) were the first to estimate the depth from monocular single images using two CNNs, one for coarse global estimation and the other for local refinement, in an end-to-end manner. Frameworks based on AlexNet and VGG-Net were used for predicting the depth in addition to semantic labeling and estimating the surface normals (Eigen and Fergus 2015). A ResNet backbone with reversed Huber loss (Zwald and
Lambert-Lacroix 2012) has also been used for predicting depth from single RGB images (Laina et al. 2016). An encoder-decoder framework, inspired from SegNet (Badrinarayanan, Kendall, and Cipolla 2017) and FlowNet (Dosovitskiy et al. 2015), was designed for detecting obstacles from images and corresponding optical flow (Mancini et al. 2016). These works use absolute depth information, which is ordinarily difficult to acquire. So, a multi-scale deep network modified from GoogLeNet (Szegedy et al. 2015) was developed for inferring the pixel-wise depth information from single images using relative depth information (Chen et al. 2016). Additionally, Geometry and Context Network (GC-Net) (Kendall et al. 2017) used 3D convolutions on stereo images. Camera-Aware Multi-scale Convolutions (CAM-Convs) (Facil et al. 2019) incorporated camera models into a U-Net (Ronneberger, Fischer, and Brox 2015) based AE for estimating the depth. Li et al. (2015) and Wang et al., "Towards Unified Depth" (2015) explored Conditional Random Fields (CRFs) for refining the results by exploiting the continuous nature of depth among the pixels. Adversarial networks have also forayed into depth estimation tasks (Gwn Lore et al. 2018; Feng and Gu 2019).

Due to the expensive process of acquisition of ground truth depth information, the geometric correlation between the frames has been exploited to train the model under self-supervision (Casser et al. 2019; Godard et al. 2019). However, due to the lack of ground truth information, the performance of unsupervised learning techniques remains quite low. So, visual odometry (Wang et al., "Learning Depth," 2018) and auxiliary learning tasks (Yin and Shi 2018; Zou, Luo, and Huang 2018) were used to assist depth estimation. Even after several innovations, the performance of unsupervised learning remained low. So, semi-supervised methods using stereo knowledge (Tosi et al. 2019) and sparse ground truth data (Kuznietsov, Stuckler, and Leibe 2017) have also been explored to balance accurate depth estimation with the reliance on ground truth data.
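As a concrete illustration of the supervised branch of this literature, the scale-invariant log-depth loss introduced by Eigen, Puhrsch, and Fergus (2014) can be written in a few lines; the tiny encoder-decoder wrapped around it is only a stand-in for the far deeper backbones (AlexNet, VGG-Net, ResNet) discussed above.

```python
import torch
import torch.nn as nn

def scale_invariant_loss(pred_log_depth, gt_log_depth, lam=0.5):
    """Scale-invariant error of Eigen, Puhrsch, and Fergus (2014):
    mean(d^2) - lam * mean(d)^2, with d the per-pixel log-depth difference."""
    d = pred_log_depth - gt_log_depth
    return (d ** 2).mean() - lam * d.mean() ** 2

class TinyDepthNet(nn.Module):
    """Placeholder encoder-decoder predicting per-pixel log depth."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):                  # (B, 3, H, W) -> (B, 1, H, W)
        return self.decoder(self.encoder(rgb))

if __name__ == "__main__":
    net = TinyDepthNet()
    rgb = torch.rand(2, 3, 64, 64)
    gt = torch.rand(2, 1, 64, 64)            # mock ground-truth log depth
    loss = scale_invariant_loss(net(rgb), gt)
    loss.backward()
    print(float(loss))
```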
7.2.5. Conclusion
While DL-based techniques can estimate the depth information from RGB images simultaneously with other tasks such as detection, tracking, and pose estimation, the major hurdle that still remains is the acquisition of ground truth data for training. The existing datasets are limited in their capabilities: Cityscapes (Cordts et al. 2016) lack ground truth data; KITTI (Geiger, Lenz, and Urtasun 2012) has sparse ground truth data; and New York University (NYU) Depth dataset (Silberman et al. 2012) is restricted to indoor settings. Although promising, unsupervised and semi-supervised learning still have a long way to go to achieve the desirable performance in order to be applied within the industrial setting.

Figure 8. Visual depiction of the topics reviewed in procedure storage.

8. Procedure storage
Procedural information is required for both assembly and maintenance tasks within the manufacturing environment. Having accurate and effective procedures available for use in the AR system is important so that the AR system can be utilised to its fullest capacity. A hierarchical description of the topics covered within procedure storage can be found in Figure 8.

8.1. Current approaches for procedure storage in AR
The most common procedural storage mechanisms in AR systems are databases. Databases store and organise data so that the data can be easily accessed and managed (Tonnis 2003). The most popular types of databases used in AR systems are relational database management systems (DBMS) or RDBMS, PLM systems, product information management (PIM) systems, computerised maintenance management systems (CMMS), and extensible markup language (XML) databases.

An RDBMS facilitates the organisation of relational databases. Relational databases provide structure to data by storing it in tables that comply with database schema (Tonnis 2003). MySQL databases store relational databases on servers (Lytridis, Tsinakos, and Kazanidis 2018). The smart maintenance solution given by Terenzi and Basile (2013) includes two main parts: an administration web client and a mobile user client. The administration web client utilises a server-based
RDBMS to store maintenance and user information and to connect users to maintenance information depending on the environmental context. The AR solution for distance learning given by Lytridis, Tsinakos, and Kazanidis (2018) stores book and asset information in a server-based MySQL database. Their ARTutor platform is able to access the database in order to download images and files. While RDBMS is optimised to perform queries quickly, this optimisation results in varying data structures that are difficult to distribute between systems (Tonnis 2003).

PLM systems store enterprise data (Rentzos et al. 2013). The AR solution given by Setti, Bosetti, and Ragni (2016) for computer numerical control (CNC) machine setup and maintenance utilises a PLM to store system description and diagnostic information for the CNC machine. The PLM database is composed of YAML Ain't Markup Language (YAML) files. The AR solution for manual assembly processes given in Rentzos et al. (2013) stores virtual instructions in a repository through Visual C# .NET files and can be incorporated with an external enterprise database such as PLM in order to gather geometrical information about parts. This geometrical information is utilised to match parts with corresponding equipment or tools that relate to the virtual instructions for assembly.

PIM systems support the management of product and part structure and development (Lee and Rhee 2008b). The solution for AR as a part of ubiquitous computing in cars by Lee and Rhee (2008b) utilises a PIM system to store information about a car for the purposes of maintenance.

CMMS, also referred to as enterprise asset management or EAM systems, store maintenance data (Neges, Wolf, and Abramovici 2015). The solution for mobile maintenance support through AR by Neges, Wolf, and Abramovici (2015) utilises CMMS to manage maintenance data, employees, and tasks.

XML databases store XML documents that are either text-based or model-based (Tonnis 2003). Schmalstieg and Wagner (2007) utilise a server called Muddleware that has three components: (1) a memory-mapped database, (2) a networking component, and (3) a controller component in the form of a finite state machine (FSM). The memory-mapped database is structured using a model-based XML database. The AR navigation system presented by Reitmayr and Schmalstieg (2003) uses an XML database to store domain-specific preprocessing operations. While the XML format is convenient because of its uniform structure, XML databases are not suitable for AR systems because they are resource-intensive (Tonnis 2003).
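As an intentionally simplified illustration of database-backed procedure storage, the snippet below keeps procedure steps in an embedded SQLite table and retrieves the ordered instructions for one task; the table layout, task names, and fields are invented for illustration and do not reproduce the schema of any cited system.

```python
import sqlite3

# In-memory relational store; a production system would use a served RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE procedure_steps (
           task_id TEXT, step_no INTEGER, instruction TEXT, media_file TEXT)"""
)
conn.executemany(
    "INSERT INTO procedure_steps VALUES (?, ?, ?, ?)",
    [
        ("pump_maintenance", 1, "Isolate and lock out the pump", "step1.png"),
        ("pump_maintenance", 2, "Remove the housing bolts", "step2.png"),
        ("pump_maintenance", 3, "Inspect the impeller for wear", "step3.png"),
    ],
)

# Query the ordered steps that the AR client would render for this task.
rows = conn.execute(
    "SELECT step_no, instruction, media_file FROM procedure_steps "
    "WHERE task_id = ? ORDER BY step_no",
    ("pump_maintenance",),
).fetchall()
for step_no, instruction, media in rows:
    print(step_no, instruction, media)
```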
8.2. Limitations of current approaches for procedure storage in AR & potential for AI
While all of the database structures described are effective in their storage and organisation of data for manufacturing, they are static in nature and do not have the capability to derive further knowledge and relationships based on the data they hold. One possible solution that is able to dynamically manipulate data over time is AI-based storage methods such as ontology.

8.3. AI approaches for procedure storage in AR
An ontology is an AI-based storage method that is able to not only store data in a logical and organised manner but is able to derive relationships between data instances. Ontologies are data storage mechanisms that are a part of the Semantic Web (Crowder et al. 2009) and can be defined as a specification of a conceptualisation (Gruber 1993). One of the benefits of using an ontology to describe domain knowledge as opposed to a database is that semantic queries can be completed in ontologies using SPARQL Protocol and Resource Description Framework (RDF) Query Language (SPARQL) (Zhao and Qian 2017). Ontologies are composed of concepts or classes and instances or individuals where concepts describe categories within a domain, and instances are specifications of these categories or concepts. Ontologies that have instances or individuals defined are deemed knowledge bases (Noy and McGuinness 2000). Many file formats have been created to store ontologies, including RDF, web ontology language (OWL) and its variants, and United States Defense Advanced Research Project Agency (DARPA) agent markup language and the ontology inference layer (DAML + OIL) (Munir and Anjum 2018). The main advantage of using an ontology to store data is the ability to use an ontological or semantic reasoner such as Jena, HermiT, ELK, or Pellet. These reasoners are able to infer logical consequences between ontology instances based on asserted facts or axioms (Abburu 2012). An additional advantage of using an ontology is the ability to integrate various sources of data throughout the PLM (Ali et al. 2019).

Ontologies have been utilised in AR systems in order to store procedural information. Flatt et al. (2015) uses an ontology and knowledge base to store information for their AR system for maintenance applications. Jo et al. (2014) stores Component-Task-Subtask-Instruction pairs in a knowledge-based system (KBS). The AR system is used for maintenance tasks and the stored information is imported from maintenance manuals such as Aircraft Maintenance Manual (AMM), Illustrated Tool and Equipment Manual (TEM), Component Maintenance
Manual (CMM), and Illustrated Parts Catalog (IPC) for aircraft maintenance. Wang, Ong, and Nee (2014) present an integrated AR-assisted assembly environment (IARAE) that includes an ontology-based assembly information model (OAIM) to assist users with AR assembly tasks. During the assembly process, the OAIM is queried for information. Toro, Vaquero, and Posada (2009) used an ontology to store knowledge about industrial maintenance by extending the standard ontology for ubiquitous and pervasive applications (SOUPA) set of ontologies. This maintenance information is combined with AR virtual elements to create a knowledge-based industrial maintenance system structure called UDKE (user, device, knowledge, and experience). Chang, Ong, and Nee (2017) utilise an ontology to store product information in order to form a disassembly sequence table as a part of the automatic content generation module in the AR-guided product disassembly (ARDIS) system. Rentzos et al. (2013) use a semantic-based AR system to integrate product and process in an ontology model that is used to generate work instructions for the user.
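The ontology idea can be illustrated with rdflib: a few task-instruction assertions are added as RDF triples and then retrieved with a SPARQL query. The namespace, classes, and properties below are invented for illustration and are not taken from any of the ontologies cited above.

```python
from rdflib import Graph, Literal, Namespace, RDF

AR = Namespace("http://example.org/ar-maintenance#")  # hypothetical namespace
g = Graph()

# Assert a tiny knowledge base: a component, a task on it, and its instructions.
g.add((AR.Pump01, RDF.type, AR.Component))
g.add((AR.ReplaceSeal, RDF.type, AR.Task))
g.add((AR.ReplaceSeal, AR.performedOn, AR.Pump01))
g.add((AR.ReplaceSeal, AR.hasInstruction, Literal("Drain the pump housing")))
g.add((AR.ReplaceSeal, AR.hasInstruction, Literal("Remove the worn seal")))

# Semantic query: which instructions apply to tasks performed on Pump01?
query = """
PREFIX ar: <http://example.org/ar-maintenance#>
SELECT ?task ?instruction WHERE {
    ?task ar:performedOn ar:Pump01 .
    ?task ar:hasInstruction ?instruction .
}"""
for task, instruction in g.query(query):
    print(task, "->", instruction)
```

A reasoner (for example, one of those named above) could additionally infer unstated relationships, such as which tools a task implies, from axioms attached to the classes.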
8.4. Potential AI approaches for procedure storage in AR
To the best of our knowledge, there are no additional AI strategies other than ontology that can be utilised for procedure storage in AR-assisted manufacturing applications.

8.5. Conclusion
Ontologies or knowledge bases introduce added flexibility into AR systems. If knowledge bases are formed with additional domain knowledge that is available from enterprise management systems and geometric data, such as in Rentzos et al. (2013) and Järvenpää et al. (2019), additional capabilities and automation can be introduced into AR systems that allow for greater integration of multiple sources of data. This utilisation of process information increases the adaptability of AR systems within the manufacturing space.

9. Virtual object creation
Virtual objects are a major component in any AR system. For AR-assisted manufacturing applications, in particular, virtual objects are essential to display data and procedures to users. A hierarchical description of the topics covered within virtual object creation can be found in Figure 9.

Figure 9. Visual depiction of the topics reviewed in virtual object creation.

9.1. Current approaches for virtual object creation in AR
Virtual objects or augmentations for AR can take the form of one-dimensional (1D) text, 2D graphics or images, and 3D models. The most complex, but also the most intuitive, of these options is 3D models. An additional layer of interactivity can be added to 3D models with animation. Animation can be incorporated into the system for any type of augmentation (Chang, Ong, and Nee 2017).

In traditional AR systems, virtual objects or augmentations are created manually and stored with respect to tracked objects in the user's environment (Chang, Ong, and Nee 2017). These augmentations are developed offline by experts and loaded into the application before use. The manual virtual object creation process is costly in terms of both time and capital. Even in adaptive AR systems such as the AR Maintenance System (ARMS) by Siew, Ong, and Nee (2019) where augmentations are chosen based on the user's actions, augmentations are predetermined and stored within the system before use.

9.2. Limitations of current approaches for virtual object creation in AR & potential for AI
AI-based methods provide an alternative approach to virtual object creation: in situ augmentation creation. In this way, minimal input, such as a single image, can be utilised to automatically create 1D, 2D, and 3D augmentations. This saves the developer of the AR system time and energy in manually creating and predefining these augmentations for use in the system.

9.3. AI approaches for virtual object creation in AR
To the best of our knowledge, there have been no AI-based virtual object creation strategies utilised within AR-assisted manufacturing applications.
9.4. Potential AI approaches for virtual object creation in AR
For AI-based augmentations, there are strategies that can be used to automatically create 1D, 2D, and 3D augmentations for AR systems given minimal input information.

9.4.1. 1D
In order to automatically generate 1D text-based augmentations based on objects seen within a camera frame, AI-based caption generation methods can be utilised. DNN-based image captioning methods can be categorised into (Bai and An 2018):

• DNN-based retrieval and template-based methods;
• Multimodal learning methods;
• Encoder-decoder methods;
• Attention-guided image captioning;
• Compositional architectures; and
• Methods for images with novelties.

The outputs of all DNN-based image captioning methods besides those belonging to the last group of methods are limited to word dictionaries and their training data that comes in the form of image-sentence pairs (Bai and An 2018). Thus, image captioning strategies that deal with novel data (i.e. 'novel captioning') are the most flexible for systems that aim for minimal training time and effort. Mao et al. (2015), Yao et al., "Incorporating Copying Mechanism" (2017), and Li et al., "Pointing Novel Objects" (2019) utilise LSTM for novel captioning. Specifically, Yao et al., "Incorporating Copying Mechanism" (2017) utilise LSTM with a copying mechanism (LSTM-C) as a new architecture to incorporate copying into the CNN plus RNN captioning framework. Li et al., "Pointing Novel Objects" (2019) use LSTM with pointing (LSTM-P) to facilitate vocabulary expansion and produce novel objects via pointing. Hendricks et al. (2016) train a deep multimodal caption model to integrate a lexical classifier and a language model in their Deep Compositional Captioner (DCC) method. Venugopalan et al. (2017) present a Novel Object Captioner (NOC), a semantic captioning model that can produce captions for object categories not present in the original training dataset. Wu et al., "Decoupled Novel" (2018) and Baig et al. (2018) utilise Zero-Shot Learning concepts for the novel captioning task. Finally, Feng et al. (2020) present Cascaded Revision Network (CRN), an approach that integrates out-of-domain knowledge in the novel captioning process.
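To ground the encoder-decoder family referenced above, the following is a heavily simplified PyTorch skeleton of a CNN-encoder/LSTM-decoder captioner of the kind these novel-captioning methods build on; the vocabulary size, dimensions, and single-layer design are placeholder assumptions, and the copying (LSTM-C) and pointing (LSTM-P) extensions are not shown.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Minimal CNN encoder + LSTM decoder for image captioning."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                   # stand-in for a deep CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, caption_tokens):
        img_feat = self.encoder(image).unsqueeze(1)      # (B, 1, embed_dim)
        words = self.embed(caption_tokens)               # (B, T, embed_dim)
        seq = torch.cat([img_feat, words], dim=1)        # image feature primes the LSTM
        hidden, _ = self.lstm(seq)
        return self.out(hidden[:, :-1, :])               # logits predicting each caption token

if __name__ == "__main__":
    model = TinyCaptioner()
    image = torch.rand(2, 3, 64, 64)
    tokens = torch.randint(0, 1000, (2, 6))              # mock token ids of a short caption
    logits = model(image, tokens)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tokens.reshape(-1))
    loss.backward()
    print(logits.shape, float(loss))
```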
9.4.2. 2D
For 2D graphics or image-based augmentations, AI-based strategies can be utilised to produce 2D sketches based on inputted images. These sketches can then be utilised on the screen of the AR system as added visual cues for the user. Multiple examples of producing sketches from images can be found in AI literature. Ha and Eck (2017) present Sketch-RNN, an RNN that uses a sequence-to-sequence Variational AE (VAE) to construct stroke-based drawings of common objects. Chen et al., "Sketch-pix2seq" (2017) provide improvements to the Sketch-RNN architecture by using a CNN instead of a bidirectional RNN encoder and by removing the Kullback-Leibler (KL) divergence from the objective function of the VAE. These improvements to the original Sketch-RNN model were found to outperform the state-of-the-art methods with RNN encoders. The method by Song et al. (2018) takes as input one photo of a single object and outputs a sketch of the object. The image is encoded via a CNN and fed into a neural sketcher that composes a generative sequence model with an RNN. The neural sketcher is trained to learn the sketch stroke-by-stroke. Riaz Muhammad et al. (2018) use reinforcement learning to produce a sketch given an input image. Wu et al., "SketchsegNet" (2018) present SketchSegNet, a sequence-to-sequence VAE-based model that implements stroke-level sketch segmentation. Li et al., "Photo-sketching" (2019) propose a contour generation algorithm or image-conditioned contour generator that detects salient boundaries in input images and outputs hand-drawn contour drawings based on objects in the input image.

9.4.3. 3D
Some AR systems have already found ways to derive 3D models while the system is in use. All of these methods derive a 3D structure from multiple sequential images. Reitinger, Zach, and Schmalstieg (2007) describe an interactive 3D reconstruction system that is used to model urban scenes. Their 'Reconstruction Engine' takes as input a sequence of 2D images, extracts corners from the image using Harris Corner Detector, finds point correspondences between the images using RANSAC, and finally utilises the plane sweeping principle for depth estimation to create a depth map. Yang et al. (2013) derive reconstructed 3D models from a pair of images by combining multiple techniques: SfM, Clustering Views for Multi-View Stereo (CMVS), Patch-based Multi-View Stereo (PMVS), and Poisson surface reconstruction.

AI-based 3D reconstruction methods can produce 3D models, point clouds, or meshes using a single image. The method by Wu et al. (2017) takes as input a single image, converts the objects within the image to two-and-a-half dimensional (2.5D) sketches, and outputs a 3D model. By deriving 2.5D sketches from the image, this method avoids modelling object appearance variations within the
real image. The method by Fan, Su, and Guibas (2017) produces point cloud coordinates from a single image using PointOutNet, a network with encoder and predictor stages. Pontes et al. (2018) present Image2Mesh which takes as input a single image and produces a mesh of the objects found in the image. 3D meshes are used because 3D models and point clouds have higher computational complexity, are less ordered, and lack finer geometry when compared to 3D meshes (Pontes et al. 2018).
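A minimal sketch of the point-cloud branch of this literature, in the spirit of (but far simpler than) PointOutNet: a small CNN encoder maps one image to a fixed set of 3D points and is trained with a symmetric Chamfer distance. The sizes and the two-layer predictor are illustrative assumptions.

```python
import torch
import torch.nn as nn

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets of shape (B, N, 3)."""
    diff = pred.unsqueeze(2) - gt.unsqueeze(1)           # (B, N, M, 3)
    dist = (diff ** 2).sum(-1)                           # squared pairwise distances
    return dist.min(2)[0].mean() + dist.min(1)[0].mean()

class Image2Points(nn.Module):
    """Predict a fixed-size 3D point set from a single RGB image."""
    def __init__(self, n_points=256):
        super().__init__()
        self.n_points = n_points
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.predictor = nn.Sequential(
            nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, n_points * 3),
        )

    def forward(self, image):                            # (B, 3, H, W) -> (B, N, 3)
        return self.predictor(self.encoder(image)).view(-1, self.n_points, 3)

if __name__ == "__main__":
    net = Image2Points()
    image = torch.rand(2, 3, 64, 64)
    gt_points = torch.rand(2, 256, 3)                    # mock ground-truth surface samples
    loss = chamfer_distance(net(image), gt_points)
    loss.backward()
    print(float(loss))
```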
9.5. Conclusion
The use of descriptive and expressive virtual objects or augmentations within AR systems is essential in order for the user to derive the most benefit from the system. While augmentations are typically predetermined and saved in the AR application before use, methods are available to create 1D, 2D, and 3D augmentations while the application is being used. Caption generation is a well-researched method within DL that can produce 1D text descriptions of objects within a single image. AI-based sketch or contour generation methods are able to produce a simple 2D sketch based on objects within an image. Current strategies within AR systems are able to use sequential images with SfM formulations to determine the 3D structure of objects within a scene. AI-based virtual object creation methods are even more powerful in that they can determine 3D structure from single images. Further, animation can be combined with these augmentations to increase the interactivity of the system. Automatic virtual object creation methods not only save the system developer time during the AR development process but are able to adapt in real time to the manufacturing environment.

10. Registration
Registration refers to the alignment of the virtual objects or augmentations with the user's environment. Registration combines both the intrinsic and extrinsic parameters found during the camera calibration and detection and tracking steps in order to determine the correct placement and orientation of the virtual objects or augmentations on the screen of the visual interface for the user. One major issue in AR is misregistration, which is when the pose of the virtual objects rendered on the screen does not align perfectly with the real world. There are various types of registration errors that contribute to misregistration, including (Holloway 1997):

• Linear registration error;
• Lateral registration error;
• Depth registration error; and
• Angular registration error.

Linear registration error describes the amount of separation between the real and virtual objects. It is measured by the length of the distance vector between the pose of these objects. Lateral registration error describes the offset between the real and virtual objects if the virtual object were placed at the same distance as the real object with respect to the user's eye. It is measured by the length of the vector that describes the distance between the real object and the virtual object at the same depth as the real object. Depth registration error describes how close or far the virtual object is relative to the real object from the user's perspective. It is measured by subtracting the length of the line of sight for the real object from the line of sight for the virtual object. Finally, angular registration error describes the pitch offset between the objects from the user's perspective. It is measured by the angle that subtends the lateral error line segment (Holloway 1997). A visual depiction of these various types of error can be found in Figure 10.

Figure 10. Visual depiction of the registration errors present in any AR system: angular error, lateral error, linear error, and depth error.
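The four registration error measures above can be computed directly from the eye position and the real and virtual object positions. The helper below follows the verbal definitions of Holloway (1997) as paraphrased here; the exact vector conventions (for example, measuring the lateral error at the real object's depth) are our interpretation for illustration only.

```python
import numpy as np

def registration_errors(eye, real, virtual):
    """Linear, lateral, depth, and angular registration error for one object pair.
    All positions are 3D points expressed in the same world frame."""
    eye, real, virtual = map(np.asarray, (eye, real, virtual))
    d_real = real - eye                        # line of sight to the real object
    d_virt = virtual - eye                     # line of sight to the virtual object

    linear = np.linalg.norm(virtual - real)    # separation between the two objects
    depth = np.linalg.norm(d_virt) - np.linalg.norm(d_real)

    # Re-project the virtual object to the real object's depth along its line of
    # sight, then measure the remaining offset (lateral error).
    virt_at_real_depth = eye + d_virt / np.linalg.norm(d_virt) * np.linalg.norm(d_real)
    lateral = np.linalg.norm(virt_at_real_depth - real)

    # Angular error: angle subtended at the eye between the two lines of sight.
    cos_a = np.dot(d_real, d_virt) / (np.linalg.norm(d_real) * np.linalg.norm(d_virt))
    angular = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return linear, lateral, depth, angular

if __name__ == "__main__":
    # A virtual label drifting 5 cm sideways and 20 cm deeper than its real anchor.
    print(registration_errors(eye=[0, 0, 0], real=[0, 0, 2.0], virtual=[0.05, 0, 2.2]))
```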
Registration error has two main sources: static or dynamic (Axholt 2011; Zheng 2015). A static registration error is spatial and is visible to the user when they are stationary (Axholt 2011). Static or spatial errors
result from inaccurate measurements of geometrical relationships in the system (Henderson and Feiner 2007). A dynamic registration error is temporal and is apparent when the user moves the AR system (MacIntyre, Coelho, and Julier 2002; Axholt 2011). Dynamic registration errors tend to have a higher impact on AR systems than static errors (Bauer 2007). System latency or delay, which is the result of delays in every process of the system (Holloway 1997), is the main source of dynamic or temporal registration error (Holloway 1997; Zheng 2015). An additional type of registration error that relates to inverse rendering is photometric error. Zheng (2015) considers five components that contribute to system latency or synchronisation delay: tracking delay, application delay, rendering delay, scanout delay, and display delay. The four approaches currently utilised to reduce system latency and thus dynamic registration errors are latency minimisation, just-in-time image generation, predictive tracking, and video feedback. For a full review of these latency-reducing methods, see Section 2.3.2 of Zheng (2015). For the current review, we are focussing upon closed-loop registration error minimisation strategies, not strategies used within open-loop systems to reduce registration error. This is because the only way to reduce registration error in an open-loop system is to improve the error in each individual source of error throughout the computational pipeline and, as discussed in Section 10.1, this is not a sustainable solution for correcting registration error.

A hierarchical description of the topics covered within registration can be found in Figure 11.

Figure 11. Visual depiction of the topics reviewed in registration.

10.1. Current approaches for registration in AR
The solutions developed to minimise registration error in AR systems are diverse. It is difficult to estimate registration error because of the wide array of sources of the error, but a starting point to estimate this error can be the error estimate from the manufacturer for the tracking system that is used (MacIntyre and Coelho 2000).

The majority of AR systems in use and development utilise an open-loop registration approach. In an open-loop system, the registration error propagates through the system and is seen by the user, but not by the system (Zheng, Schubert, and Weich 2013). The only way to improve the registration error in open-loop AR systems is to make each component of the system more accurate (Bajura and Neumann 1995b). In a closed-loop system, the visual feedback from registration is utilised during the detection and tracking steps in order to minimise registration error (Zheng, Schubert, and Weich 2013). Closed-loop AR systems are tolerant of transformation errors (Bajura and Neumann 1995b).

Within open- or closed-loop AR systems, one solution to account for registration error is to modify the final graphics rendered in order to account for the error. Forward warping for AR, or FW-AR, corrects for registration error by distorting augmentations to match the real camera image. Backward warping for Augmented Virtuality (AV), or BW-AV, distorts the real camera image in order to match the augmentations on the screen. FW-AR is preferred over BW-AV because FW-AR does not modify the real camera image. Because backward warping modifies the real camera image, it converts the system into AV, not AR (Zheng, Schmalstieg, and Welch 2014). Additionally, BW-AV only works for video-based interfaces where the real environment is not visible to the user (Bajura and Neumann 1995b). According to Holloway (1997), the only possible way to decrease registration error in Optical See-Through (OST) HMDs is to use predictive head tracking methods.

Many researchers have attempted to better estimate or measure registration error. Holloway (1997) separates the registration error into the following categories:
acquisition or alignment, tracking, display, and viewing (for HMDs). MacIntyre and Coelho (2000) and MacIntyre, Coelho, and Julier (2002) determine registration error based on the expansion and contraction of objects' 2D convex hulls within images. MacIntyre and Coelho (2000) and MacIntyre, Coelho, and Julier (2002) utilise the measured registration error in a Level of Error (LOE) filtering approach where different representations or versions of augmentations are automatically rendered based on the changes in the registration error. Baillot et al. (2003) measure registration error using a calibration method to estimate transformations between the tracking system and the world.

Bajura and Neumann (1995a, 1995b) were some of the earliest investigators into closed-loop AR systems. In their works, 3D coordinate system registration error is dynamically estimated and corrected using the registration error estimate from 2D objects in images. These works use a backward warping or BW-AV approach to distort the real camera image in order to match the parameters of the virtual objects in the scene; thus, this method can only be used for video-based systems. Zheng, Schubert, and Weich (2013) utilise the registration error within 2D images within the model-based tracking procedure in order to create a closed registration loop. Zheng, Schmalstieg, and Welch (2014) handle registration errors on both the global scale and local scale by correcting the camera pose in 3D and by correcting pixel-wise registration errors in 2D. The method utilises a reference model of the real scene in a registration-enforcing model-based tracking (RE-MBT) method to correct the 3D camera pose. FW-AR is utilised to distort the augmentations to match the real camera image. Naik et al. (2015) take inspiration from Bajura and Neumann (1995b) in implementing a closed-loop registration approach that dynamically corrects augmentations for projector-based AR by detecting projected features in each camera image and modifying the augmentations in an FW-AR approach to correct for registration error.

VVS control theory has been utilised by various researchers for closed-loop registration error minimisation. In VVS control theory, features are extracted from images and the error in reference 2D image features is minimised in a closed-loop fashion (Behringer, Park, and Sundareswaran 2002; Liu and Li 2019). Sundareswaran and Behringer's motion estimation-based tracking algorithm is based on the theory of VVS (1999). Behringer, Park, and Sundareswaran (2002) utilise VVS in an outdoor AR system based on 3D model-based tracking. Marchand and Chaumette (2002, 2005) present a pose estimation algorithm that iteratively uses the VVS control law in order to extract the desired features from an image. Comport et al. (2006) utilise VVS control theory with an M-estimator for pose estimation. Pressigout and Marchand (2004) utilise VVS control theory for markerless tracking with limited knowledge of the real scene. Zheng (2015) utilise VVS theory in the development of their closed-loop registration method published in Zheng, Schmalstieg, and Welch (2014), which utilises model-based tracking for global pose estimation and optical flow methods for pixel-based registration adjustment.

10.2. Limitations of current approaches for registration in AR & potential for AI
Closed-loop registration provides AR systems with the ability to minimise registration error without explicitly determining the causes and potential solutions for the registration error present within the system. While closed-loop registration for AR systems is an improvement over open-loop registration, the current strategies for closed-loop registration are not fully developed and are not in widespread use in AR systems. Additionally, VVS closed-loop registration methods require features to be defined in the scene before use.

One possible way to further the effectiveness of closed-loop registration within AR systems is to incorporate DL strategies in the VVS-inspired closed-loop registration approaches. AI-based VVS strategies are designed to work in unknown environments in many cases (Saxena et al. 2017). Additionally, AI-based VVS methods are more accurate, efficient, stable, and robust than traditional image-based visual servoing (IBVS) strategies (Comport et al. 2006).

10.3. AI approaches for registration in AR
To the best of our knowledge, there have been no AI-based registration minimisation strategies utilised within AR-assisted manufacturing applications.

10.4. Potential AI approaches for registration in AR
One AI-based non-VVS registration error minimisation method was found in the literature. Jiang et al. (2020) use two deep networks: (1) initial registration network and (2) registration error network. The initial registration network is a feed-forward network based on VGG-11 (Simonyan and Zisserman 2014) that is used to warp the feature template to match the input image. The registration error DNN then estimates the error of the resultant warping and optimises the image transformation parameters.

The AI-based VVS or IBVS methods either incorporate NNs into the current framework for VVS,
utilise predictive models (model-predictive control or MPC) to enable VVS, or use reinforcement learning to enable VVS. Reinforcement learning-based methods for image-based VVS are termed RL-IBVS methods.

Within AI-based VVS methods that use NNs in the current VVS framework, Bateux et al. (2017, 2018) incorporate an AlexNet-based CNN with direct visual servoing (DVS) in order to estimate the relative pose between the current and reference image. DVS is a method that enables VVS using a full image with no feature extraction (Collewet and Marchand 2011) and no prior information of the scene or object (Cui 2016). Once the relative pose is determined, the resulting registration error is incorporated back into a classical VVS control law. Saxena et al. (2017) use a similar approach in using a CNN for visual feedback to find the relative camera pose between a pair of images. Liu and Li (2019) use two CNNs, one that takes as input the reference image and one that takes as input the current image, and calculate the error between the reference and current pose of a robot arm end effector.

MPC is a method that enables real-time prediction and correction for image-to-image registration (Ebert et al. 2018). Finn and Levine (2017) perform feedback control at the pixel level by utilising an LSTM RNN structure to predict the pose of objects in images. Ebert et al. (2017) introduce a video prediction model called Skip Connection Neural Advection Model (SNA), based on the dynamic neural advection (DNA) model (Finn, Goodfellow, and Levine 2016), that keeps track of images over multiple frames. Ebert et al. (2018) measure the distance to a goal within a calibrated image as the cost function for their visual MPC VVS approach. Lampe and Riedmiller (2013) use reinforcement learning for use in a visual feedback loop for control of the reaching and grasping mechanism of a robotic arm. Finn et al. (2016) utilise reinforcement learning to relate feature points with a specific task. An automated approach is used for deriving the state-space representation from inputted camera images and a deep spatial AE is used to determine these features. Lee, Levine, and Abbeel (2017) utilise a hybrid reinforcement learning approach between MPC and conventional IBVS. Predictive models learn with pre-trained visual features to track objects within images over time. A Q-iteration procedure is used to choose the most relevant visual features. Sampedro et al. (2018) present deep deterministic policy gradients (DDPG), a deep reinforcement learning algorithm for VVS-based controlling of multirotor aerial robots. The DDPG algorithm controls the aerial robots based on the determined errors in the image plane compared to a reference image.
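For reference, the classical image-based visual servoing loop that these learned components plug into can be summarised in a few lines: the feature error e = s − s* is driven to zero with the control law v = −λL⁺e, where L is the interaction matrix (image Jacobian) and L⁺ its pseudo-inverse. The sketch below uses the standard point-feature interaction matrix with an assumed depth; in the AI-based variants, the error or relative pose would instead be supplied by a CNN.

```python
import numpy as np

def interaction_matrix(x, y, Z=1.0):
    """Interaction (image Jacobian) matrix of a normalised image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x ** 2), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y ** 2, -x * y, -x],
    ])

def ibvs_velocity(features, desired, gain=0.5, depth=1.0):
    """One IBVS step: camera twist v = -gain * pinv(L) @ (s - s*)."""
    error = (features - desired).reshape(-1)                  # stacked 2D feature errors
    L = np.vstack([interaction_matrix(x, y, depth) for x, y in features])
    return -gain * np.linalg.pinv(L) @ error                  # (vx, vy, vz, wx, wy, wz)

if __name__ == "__main__":
    current = np.array([[0.10, 0.05], [-0.08, 0.12], [0.02, -0.10], [-0.11, -0.04]])
    target = np.array([[0.05, 0.05], [-0.05, 0.10], [0.05, -0.05], [-0.05, -0.05]])
    print(ibvs_velocity(current, target))                     # commanded camera velocity
```

Iterating this update drives the observed features towards the reference configuration, which is exactly the closed-loop behaviour that the registration methods above exploit.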
10.5. Conclusion
While most AR systems in use today utilise open-loop registration methods, correcting registration errors in open-loop systems is difficult because registration error has many sources throughout the AR pipeline. One possible solution to deal with registration error is implementing a closed-loop registration method within the AR system. While closed-loop registration is not widely used in AR systems, these methods have the advantage of being able to reduce or minimise registration error without having to explicitly define the sources of the registration error. Some closed-loop registration methods used in AR systems are derived from VVS within the robotics field, where visual differences in object motion throughout image frames are minimised through a feedback system, but these VVS methods require features within images to be predefined. Utilising AI-based methods within VVS makes for even more efficient closed-loop registration. Further, AI-based VVS methods are able to work without prior knowledge or features defined in the working space. In this way, these methods can be used in generic environments, such as those in manufacturing.

11. Rendering
Once the virtual objects are correctly registered with the real environment, both with respect to pose (detection and tracking) and appearance (inverse rendering), the next step is to render the virtual augmentations onto the screen of the main interface of the AR system. There are two main areas that must be considered in this case: what data to display and how to display this data (i.e. data visualisation). A hierarchical description of the topics covered within rendering can be found in Figure 12.

11.1. Current approaches for rendering in AR
When determining what information to display on the AR interface, many AR systems are context-aware and user-customisable. For instance, the Knowledge-based AR for Maintenance Assistance (KARMA) system by Feiner, Macintyre, and Seligmann (1993) uses a rule-based illustration generation system that incorporates user preferences into data visualisation. The AR system by Mengoni et al. (2015) utilises a 3D coordinate system to balance user preferences for the purposes of data visualisation.

A major issue in data visualisation on a screen is information overload (Van Krevelen and Poelman 2007). In order to combat information overload, two methods are recommended: filtering and visualisation layouts.
Figure 12. Visual depiction of the topics reviewed in rendering.
Filtering, as in Julier et al. (2000), reduces the number of augmentations given on the screen at any one time based on a metric such as spatial distance (Virkkunen 2018). There are two main types of filtering used in data visualisation: spatial filtering as in Bane and Hollerer (2004) and knowledge-based or semantic filtering as in Höllerer et al. (2001) (Tatzgern et al. 2016). Spatial filtering removes data based on physical distance from the user while knowledge-based or semantic filtering removes data based on user preferences or context (Tatzgern et al. 2016). There are also hybrid methods that combine the use of spatial and knowledge-based filters, as in Julier et al. (2000). With any filtering method, data loss can occur, and the reduction of visual clutter is not guaranteed (Tatzgern et al. 2016).

Visualisation layouts, as in Bell and Feiner (2003), model the environment and track elements in the scene to make sure augmentations do not occlude important environmental features (Virkkunen 2018). In simple terms, visualisation layout methods reorganise data in a logical way (Tatzgern et al. 2016). The two main visualisation layout methods are geometric-based layout and image-based layout (Ichihashi and Fujinami 2019). Visualisation layout methods fail when the amount of data grows too large (Tatzgern et al. 2016).

Clustering, as in Tatzgern et al. (2016), is an alternative approach in data visualisation that combines information based on spatial and semantic attributes; it not only reduces the amount of information shown in the display but also preserves all of the information, unlike filtering (Virkkunen 2018). Clustering is also robust with large amounts of data, unlike the visualisation layout methods. The classical data clustering methods include partition-based methods as in Boomija and Phil (2008), density-based methods as in Kriegel et al. (2011), and hierarchical methods as in Johnson (1967).

11.2. Limitations of current approaches for rendering in AR & potential for AI
Current approaches for determining what data to display to the user are rigid and are based on hard-coded rules. Innovative AI-based tools such as expert systems are available that can incorporate context-awareness into AR systems. These AI-based strategies not only include expert knowledge into the system but also provide more flexibility for the use of the system in varying contexts. With regard to data visualisation, AI-based layout creation strategies can be used to aid in the development of efficient and effective AR rendering. Further, AI-based data clustering approaches have the ability to transform data into clustering-friendly nonlinear representations, thus easing the data clustering process (Min et al. 2018).

11.3. AI approaches for rendering in AR
Expert systems are AI-based computer programs that emulate the decision-making abilities of human experts (Patel, Virparia, and Patel 2012). Expert systems can aid in determining what data is appropriate to display to the user, thus enabling user customisation. Expert systems have already been incorporated into AR-assisted manufacturing systems. The AR solutions by Syberfeldt et al. (2016) and Holm et al. (2017) utilise expert systems
to determine the level of detail displayed for AR-guided maintenance instructions.

One article was found that incorporates AI-based methods for data visualisation within AR-assisted manufacturing applications. Ivaschenko, Milutkin, and Sitnikov (2019) utilise a NN within the Navigator or context module of their AR system in order to handle context-awareness of visualised content.
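As a toy illustration of the expert-system idea, a small rule base can map operator experience and task criticality to the level of detail rendered; the rules below are invented for illustration and are not those of the cited systems.

```python
def detail_level(operator_experience, task_criticality, first_time_on_task):
    """Rule-based choice of instruction detail: 'minimal', 'standard', or 'step-by-step'."""
    rules = [
        # (condition, conclusion) pairs evaluated in order of priority.
        (lambda e, c, f: f or e == "novice", "step-by-step"),
        (lambda e, c, f: c == "high" and e != "expert", "step-by-step"),
        (lambda e, c, f: e == "expert" and c == "low", "minimal"),
    ]
    for condition, conclusion in rules:
        if condition(operator_experience, task_criticality, first_time_on_task):
            return conclusion
    return "standard"                                    # default when no rule fires

if __name__ == "__main__":
    print(detail_level("expert", "low", first_time_on_task=False))              # minimal
    print(detail_level("intermediate", "high", first_time_on_task=False))        # step-by-step
    print(detail_level("intermediate", "medium", first_time_on_task=False))      # standard
```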
11.4. Potential AI approaches for rendering in AR
A subset of DL research is dedicated to automating layout design for data visualisation using DL. For instance, Li et al., "Layoutgan" (2019) describe LayoutGAN, a GAN that takes as input a set of randomly placed 2D graphical elements and outputs a layout using the 2D elements that maps to a wireframe image. A CNN-based discriminator is used to optimise the layout within the image. Similarly, Nguyen et al. (2018) present DeepUI, which is a GAN that is trained on current phone apps and outputs optimised wireframe User Interface (UI) designs for the developer. The architecture is a GAN based on a generator, UIGenerator, and a discriminator, UIDiscriminator. Jia et al. (2019) utilise a novel feature map or guidance map to identify important image regions for label placement in AR. Zheng et al. (2019) incorporate context-awareness into graphic design layouts. A deep generative model learns complex layout structures from input images and keyword summaries. Li et al., "Auto Completion of User Interface" (2020) utilise transformer-based tree decoders for auto-completion of UIs.

Data clustering is another subset of DL research that can be applied to data visualisation in AR systems. Aljalbout et al. (2018) developed a taxonomy of DL-based data clustering and found that DNN-based clustering methods follow the same format of (1) implementing representation learning or feature mapping using DNNs, and (2) using the representations as input to a clustering method. Some classical unsupervised data clustering methods include k-means as in Tatzgern et al. (2016), Gaussian mixture model (GMM) clustering as in Biernacki, Celeux, and Govaert (2000), maximum-margin clustering as in Xu et al. (2005), agglomerative clustering as in Gowda and Krishna (1978), and information-theoretic clustering as in Li, Zhang, and Jiang (2004). Min et al. provided a review of deep clustering methods in 2018 and found that the current DNN-based clustering methods can be categorised into one of four types:

• Methods that utilise an AE to obtain a feasible feature space;
• Methods that utilise feed-forward networks, termed Clustering DNN (CDNN) methods;
• Methods that use VAEs; and
• Methods based on GANs.

For a full review of DL-based clustering methods, the reader is referred to Min et al. (2018).
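Following the two-stage recipe identified by Aljalbout et al. (2018), namely learning a representation and then clustering it, the sketch below skips the representation-learning stage and simply groups annotation anchors by their 3D position plus a coarse semantic code using scikit-learn's k-means; the feature design and number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each AR annotation: anchor position (x, y, z) in metres plus a coarse semantic code
# (0 = safety notice, 1 = work instruction, 2 = sensor reading).
annotations = np.array([
    [0.2, 1.1, 2.0, 1], [0.3, 1.0, 2.1, 1], [0.2, 1.2, 2.2, 2],
    [3.1, 0.4, 5.0, 0], [3.0, 0.5, 5.2, 0], [3.2, 0.6, 5.1, 2],
])

# Weight the semantic code so that it influences grouping alongside distance.
features = annotations.astype(float).copy()
features[:, 3] *= 0.5

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for cluster_id in range(2):
    members = np.where(kmeans.labels_ == cluster_id)[0]
    print(f"cluster {cluster_id}: annotations {members.tolist()}")
    # The AR renderer would draw one aggregated label per cluster instead of six.
```

Unlike filtering, such a grouping reduces on-screen clutter while still preserving every underlying annotation.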
11.5. Conclusion
With respect to the data displayed in AR systems, many researchers have identified the importance of adapting their systems to the user's environment and preferences. While classical methods of incorporating customisation into AR systems are strictly rule-based, AI-based methods are able to incorporate customisation into AR systems in a more flexible way either through DNNs or through AI-based expert systems. This adaptation of the AR system allows the operator to work in the manufacturing environment with their highest productivity.

With regards to the visualisation of this data, the majority of classical methods utilised are destructive, as in data filtering, or non-adaptive for large datasets, as in visualisation layouts. While classical clustering methods are effective, they are not as flexible as DL-based clustering methods. Further, DL research into layout optimisation is another strategy that can be utilised to better AR system data visualisation. Showing the operator concise information on intuitive interfaces will decrease their cognitive load and thus increase their efficiency in the manufacturing environment.

12. Future work
Based on the review, there are multiple innovative research directions that can be explored for parts of the AR system and for the entire AR system as a whole. These innovative future research directions are described next.

12.1. Reconstruction using shape primitives
Humans interpret a 2D image not as a projection of a 3D object from a specific viewpoint; rather, they infer an abstract 3D representation of the actual 3D object (see Rai and Deshpande 2016). This inference is aided by the fact that the 2D images contain projections of a set of 3D objects. Directly inferring 3D objects/scenes from an image is a challenging problem. However, two basic facts aid human interpretation. Firstly, all the 3D objects are composed of a set of primitive geometric shapes called geons (Biederman 1987). The geons represent irreducible components like spheres, cylinders, cuboids, etc. Secondly, the contents of the image are not limited to monochromatic textureless projections of objects. The images contain textured details. The
illumination characteristics of a cylinder differ from a cuboid, although both of them can project rectangles. These cues can aid in depth inference. Moreover, the real-time nature of AR can offer details from additional perspectives. Such information can be leveraged for inverse rendering and for an enhanced inference of the 3D environment from the live video feed.

12.2. Synergy of classical and AI-based CV
While AI offers several advantages over classical approaches, it is not prudent to completely replace classical approaches with pure DL techniques in all aspects of the AR system. This is particularly evident in the task of estimating the absolute pose of the camera. Although Kendall, Grimes, and Cipolla (2015) argue that there exists an injective relationship between the camera pose and the images, Sattler et al. (2019) counter-argue that image-based pose estimation is indeed similar to image retrieval. The counter-argument is reflected in the subpar performance of DL-based techniques in camera pose estimation. Several researchers have fused data from additional sensors (e.g. GPS, IMU) to increase accuracy. As the accuracy of DL-based methods is not guaranteed, this limits the applicability of DL in precision manufacturing applications. O'Mahony et al. (2019) listed AR as one of the problems that is not suitable for a differentiable implementation. Moreover, DL can sometimes be excessive for simpler tasks like accepting or rejecting objects of a certain color from a conveyor belt. While a DL algorithm would need an enormous amount of data to complete this task, a simple color thresholding algorithm can solve this problem easily. Additionally, due to the unavailability of dedicated comprehensive datasets for all of the various tasks in AR, DL techniques should be implemented in conjunction with classical methods (Rai and Sahu 2020).

12.3. Human-AR synergistic relationship
The typical AR system is designed to aid humans in performing tasks, and thus the AR system is very one-sided. One potential future advancement for AR systems includes utilising input from the user for the betterment of the computational aspects of the AR system itself. This is especially appealing because it is well known that DL algorithms are data hungry and require large training datasets. Data acquisition is an expensive process just because of the sheer amount of data or the limited capability of the equipment. Cases where humans perceive better than machines, like dynamic range captured by cameras and other perceptual cues like occlusion from both the real scene and its image, can heavily support ground truth data collection. AR can expedite the development of state-of-the-art AI techniques.

12.4. Reconfigurability
While the flexibility of procedure storage has been greatly improved through the use of ontologies instead of databases, the data that is stored within these structures is still very rigid as it is directly derived from procedural manuals or from expert developers. One major area of improvement within any procedure-based AR application is reducing the rigidity of these procedures so that they are able to adapt to changes in the use environment and the user. Particularly, current AR-assisted manufacturing applications have limited options for assistance if the operator encounters an unprecedented issue while performing a procedure. If modification of the procedures within the system is possible, the procedure can be modified off-site through a non-mobile workstation, on-site either by the operator themselves or remotely by an expert, or automatically by the AR system itself. Very few AR systems include strategies for automated reconfiguring of procedures, and the inclusion of automated strategies for procedure reconfiguration is recommended for the increased flexibility of the system for use in uncertain or inconsistent environments. Automated procedure reconfiguring can be brought to fruition through automated diagnostic methods (Chou and Yao 2009; Matei et al. 2020), disassembly sequence planning, or another complex representation of the relationships between different aspects of the procedures. Incorporation of domain knowledge into this process is of the utmost importance to ensure that the modified procedures are realistic and safe to perform.

12.5. Usability: human factors of AR
Depending upon the main interaction strategy utilised in the AR system, AR may assist or impede manufacturing activities. For instance, while the computational power of handheld AR devices like tablets and smartphones is enticing, these devices take up the hands of users and thus make it difficult to perform actions relating to procedures. Thus, in cases of maintenance or assembly tasks, the use of VST, OST, or peripheral HMDs may be more beneficial as a trade-off between hands-free work and computational power. In the case of manual assembly tasks, user mental fatigue from performing the same task repetitively is a major safety hazard. When considering the future development of AR systems in the manufacturing environment, AR systems should be customised not only to environmental context and user preferences but also to current user status. This is especially important
when considering the increase in the complexity of AR systems with the incorporation of AI-based algorithms. AI should be incorporated into AR-assisted manufacturing tasks with caution, as even a ≤ 1% failure rate of AI can prove disastrous. Synthesizing this data for the user is not only essential for decreasing cognitive load and increasing focus, but it is also essential for ensuring safety in the manufacturing environment.
12.6. Multimodal AR

AR was first theorised as a technology that could augment all of the user's senses. Although for this review, we only focus on the visual modality of the AR system, a diverse set of input modalities can also be incorporated. Inputs from additional sensors like IMU and GPS can supplement vision for various tasks. This is especially true when considering the vast success of visual-inertial detection and tracking algorithms. The data acquired from such sensors can aid in training the DL techniques. While vision-based pose estimation is expensive and not robust to 3D transformations, inertial sensors are susceptible to noise and calibration errors. These modalities complement each other because inertial sensors allow for fast computation and vision can correct any drift errors from the inertial sensor (Bostanci et al. 2013).
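As a minimal illustration of this complementarity, the sketch below fuses a fast, drifting gyroscope integration with slower, drift-free vision-based corrections through a simple complementary filter on a single yaw angle. The sampling rate, blending gain, bias, and one-dimensional state are simplifying assumptions, not a full visual-inertial pipeline.

import numpy as np

DT = 0.005     # 200 Hz IMU rate (assumed)
ALPHA = 0.2    # blending gain towards the vision estimate (assumed)

class YawComplementaryFilter:
    """Fuses gyro rate (fast but drifting) with vision yaw (slow but drift-free)."""

    def __init__(self, yaw0=0.0):
        self.yaw = yaw0

    def propagate(self, gyro_rate_z):
        # Fast path: integrate the gyroscope between camera frames.
        self.yaw += gyro_rate_z * DT
        return self.yaw

    def correct(self, vision_yaw):
        # Slow path: pull the estimate towards the vision-based yaw,
        # removing the drift accumulated by integration.
        self.yaw += ALPHA * (vision_yaw - self.yaw)
        return self.yaw

# Toy run: constant rotation, biased and noisy gyro, a vision update every 40 IMU samples.
rng = np.random.default_rng(0)
filt, true_yaw = YawComplementaryFilter(), 0.0
for k in range(400):
    true_yaw += 0.1 * DT
    filt.propagate(0.1 + 0.05 + rng.normal(0.0, 0.02))   # gyro sample with bias
    if k % 40 == 0:
        filt.correct(true_yaw + rng.normal(0.0, 0.01))   # vision correction
print(f"estimate={filt.yaw:.4f} rad, truth={true_yaw:.4f} rad")

In a full system the same pattern extends to 6-DOF pose, typically with an extended Kalman filter rather than a fixed-gain filter.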
12.7. Feedback in manufacturing

While AR can be utilised in all stages of PLM and impacts managers, employees, and consumers alike, the potential use of AR for each user in PLM is mostly separated. In order to implement a closed-loop system, feedback from consumers should be incorporated into the manufacturing cycle. For instance, AR can be utilised by consumers to troubleshoot issues with the final product. Active feedback and data that results from this process of troubleshooting can be collected and utilised to determine potential product weak points or defects, similar to how product design features are extracted from written customer feedback in Ali et al. (2020). While this information can be estimated using simulation strategies during the product development stage, incorporating direct consumer input would be beneficial in the creation of improved products. So, a centralised AR application with a customised UI for each user, displaying only the information relevant to them, can be an integral part of the PLM.
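The sketch below gives a deliberately simplified, keyword-based version of this idea (it is not the ontology-based method of Ali et al. (2020)): component terms and failure cues are counted when they co-occur in a consumer comment, surfacing candidate weak points. The vocabulary and the comments are hypothetical.

import re
from collections import Counter
from itertools import product

# Hypothetical vocabulary an analyst might define for one product family.
COMPONENTS = ["hinge", "battery", "display", "seal"]
FAILURE_CUES = ["broke", "broken", "leaking", "rattles", "dead", "cracked"]

def weak_points(feedback):
    """Count co-occurrences of a component term and a failure cue per comment."""
    counts = Counter()
    for comment in feedback:
        tokens = set(re.findall(r"[a-z]+", comment.lower()))
        for comp, cue in product(COMPONENTS, FAILURE_CUES):
            if comp in tokens and cue in tokens:
                counts[comp] += 1
    return counts

comments = [
    "The hinge broke after two weeks of normal use.",
    "Battery is dead within a year; the hinge also rattles.",
    "Display works fine, but the seal is leaking.",
]
print(weak_points(comments).most_common())

Feedback gathered through an AR troubleshooting interface could be funnelled into exactly this kind of aggregation before being reported back to the design team.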
12.8. AR system flexibility

The computational pipeline of any AR system is rigidly set based on the requirements of the system, the user, and the use environment. Incorporating AI strategies into an AR system leads to increased overall flexibility of the system. For instance, if AI-based virtual object creation strategies are utilised, virtual objects or augmentations do not need to be created and loaded into the system before use. Similarly, if a closed-loop registration strategy is used, camera calibration is no longer needed (Espiau 1994). In the future, all AR systems should be closed-loop so that they automatically correct their intrinsic and extrinsic camera calibration parameters based on the feedback from the system. Future research should aim for AR systems that can work without any parameters or data loaded into the system before use. This is a feasible goal if the AR system is computationally prepared and sufficiently interconnected with its use environment, current conditions, user actions and preferences, and domain knowledge. Further, any information that is collected with one AR device can be shared with other devices in the manufacturing facility, thus creating an increasingly robust and stable solution over time.
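A minimal sketch of this closed-loop idea is given below: the focal length estimates are nudged to reduce the reprojection error of tracked 3D-2D correspondences, so a drifting intrinsic calibration is corrected from the system's own feedback. The pinhole model, the numeric-gradient update, and the synthetic correspondences are illustrative assumptions, not a production calibration routine.

import numpy as np

def project(pts, fx, fy, cx, cy):
    """Pinhole projection of camera-frame 3D points to pixel coordinates."""
    x, y, z = pts.T
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def refine_focal(pts, observed_px, fx, fy, cx, cy, step=2.0, iters=200):
    """Nudge fx, fy to reduce mean reprojection error (numeric gradient)."""
    def err(fx_, fy_):
        return np.mean(np.linalg.norm(project(pts, fx_, fy_, cx, cy) - observed_px, axis=1))
    for _ in range(iters):
        e0 = err(fx, fy)
        gx = (err(fx + 1e-2, fy) - e0) / 1e-2
        gy = (err(fx, fy + 1e-2) - e0) / 1e-2
        fx, fy = fx - step * gx, fy - step * gy
    return fx, fy, err(fx, fy)

# Synthetic feedback: true focal length 800 px, current (drifted) estimate 760 px.
rng = np.random.default_rng(1)
pts = rng.uniform([-0.5, -0.5, 1.0], [0.5, 0.5, 3.0], size=(50, 3))
observed = project(pts, 800.0, 800.0, 320.0, 240.0) + rng.normal(0.0, 0.2, (50, 2))
print(refine_focal(pts, observed, fx=760.0, fy=760.0, cx=320.0, cy=240.0))

The same feedback signal can drive corrections of the extrinsic parameters, which is essentially what virtual visual servoing formulations of registration already do.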
13. Conclusion

Our work identifies the scope of AI in AR-assisted manufacturing applications. We have explored the current manufacturing environment and the place for AR and AI in industrial processes. We have reviewed recent literature in both AR-assisted manufacturing and AI-driven manufacturing applications. As a result of this review, current key technical and managerial challenges for incorporating AI into AR-assisted manufacturing applications were outlined and described. The main limitations include issues with the handling of big data for AI within AR systems and larger organisational issues such as safety, regulations, and trust of AI technologies. However, the potential benefits of AI-driven AR-assisted manufacturing applications substantially outweigh these limitations. This review endorses the inclusion of AI as a tool within AR. This inclusion has been overlooked within manufacturing applications thus far.

We split the whole AR computational framework into camera calibration for acquiring the intrinsic parameters; detection, tracking and camera pose estimation for acquiring the extrinsic parameters of the camera; inverse rendering for inferring the illumination and depth characteristics of the scene for supporting non-intrusive rendering of virtual objects; procedure storage for bringing in the stored procedures for augmentation; virtual object creation for creating the augmentations; registration for registering the virtual objects into the scene; and rendering for optimally displaying the virtual objects in the display. In each of the sections, the current practices, their limitations and potential for AI, existing work incorporating
AI, and more importantly, additional AI techniques that can be adopted are detailed.

AI can extend AR applications from controlled objects (fiducials) and environments to generic objects and unrestricted work environments. Moreover, as AI can interpret the incoming video feed to simultaneously accomplish detection, tracking, and other tasks, the latency of the AR systems can be drastically reduced with parallel processing. By introducing AI in all stages of AR, we hope for an expedited adoption of AI in AR-assisted manufacturing applications.

Although there are challenges to incorporating AI strategies within AR-assisted manufacturing applications, further development and innovation will result in great gains in efficiency, reliability, and usefulness of AR systems. Although the ubiquitous use of AR applications in the manufacturing sector is not yet a reality, the idea of pervasive AR is quickly becoming tangible.
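Purely as an illustration of how these stages can be composed, the sketch below chains placeholder implementations of the stages listed above into a per-frame pipeline; the function names and the shared-state dictionary are hypothetical, not an interface used by any reviewed system.

from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[Any, State], State]

# The stages of the AR computational framework, in the order reviewed.
STAGE_ORDER: List[str] = [
    "camera_calibration",        # intrinsic parameters
    "detection_tracking_pose",   # extrinsic parameters
    "inverse_rendering",         # illumination and depth of the scene
    "procedure_storage",         # stored procedures for augmentation
    "virtual_object_creation",   # the augmentations themselves
    "registration",              # placing virtual objects in the scene
    "rendering",                 # final display
]

def make_stage(name: str) -> Stage:
    """Placeholder stage that records its output under its own name."""
    def stage(frame: Any, state: State) -> State:
        state[name] = f"{name} output"
        return state
    return stage

def run_pipeline(frame: Any, stages: Dict[str, Stage]) -> State:
    """Pass one video frame through the AR stages in a fixed order."""
    state: State = {}
    for name in STAGE_ORDER:
        state = stages[name](frame, state)
    return state

print(run_pipeline(frame=object(), stages={n: make_stage(n) for n in STAGE_ORDER}))

Replacing any placeholder with an AI-based implementation leaves the rest of the chain untouched, which is one way to read the flexibility argument made in Section 12.8.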
Notes

1. Cognitive load is the mental effort needed for processing information (Porter and Heppelmann 2017).
2. Short-term trackers were supplied with the position of the target in each frame and did not include a target re-detection module. So, they failed at the first instance of occlusion (Kristan et al. 2019).
3. In addition to tracking, long-term trackers had to identify tracking failure and trigger the included target re-detection module. So, long-term trackers were evaluated for their performance under occlusion (Kristan et al. 2019).
4. Short-term trackers were evaluated for real-time performance (Kristan et al. 2019).
5. https://motchallenge.net/data/MOT20Det/
6. https://www.votchallenge.net/vot2020/dataset.html

Disclosure statement

No potential conflict of interest was reported by the author(s).

Funding

This work was supported by Naval Surface Warfare Center (NSWC) Naval Engineering Education Consortium (NEEC) award No. N00174-19-0025.
Notes on contributors

Chandan K. Sahu received his B.Tech. degree from National Institute of Technology Rourkela, India in 2014, and his M.S. degree from the State University of New York at Buffalo, USA, in 2020, both in mechanical engineering. He worked at Honda Motorcycle and Scooter India from 2014 to 2017. He is currently a research assistant at the Geometric Reasoning and Artificial Intelligence Laboratory (GRAIL) at Clemson University's International Center for Automotive Research (CU-ICAR). He is interested in machine learning in general, with a focus on its geometric attributes and mechanical applications. His research is focussed on augmented reality, additive manufacturing, cyber-physical systems and autonomous vehicle applications. His current research work emphasises supplementing neural networks with physics models for modelling complex systems where only limited data and limited physics are available.

Crystal Young received her B.S. degree in physics from Stony Brook University, Stony Brook, NY, in 2018, and her M.S. degree in mechanical engineering from the University at Buffalo, Buffalo, NY, in 2020. She also holds an Advanced Certificate in Advanced Manufacturing from the University at Buffalo, 2020. She recently completed an internship with the Naval Research Enterprise Internship Program in September 2020. Her current research interests include smart and sustainable manufacturing, optimal design, augmented reality, artificial intelligence, and assistive technologies. She is a member of Sigma Pi Sigma, The National Physics Honor Society.

Rahul Rai received the B.Tech. degree from the National Institute of Foundry and Forge Technology (NIFFT), Ranchi, India, in 2000, the M.S. degree in manufacturing engineering from the Missouri University of Science and Technology (Missouri S&T), USA, in 2002, and the Ph.D. degree in mechanical engineering from The University of Texas at Austin, USA, in 2006. He is currently a Dean's Distinguished Professor in the Department of Automotive Engineering at the International Center for Automotive Engineering (ICAR) at Clemson University. Dr Rai's research is focussed on developing computational tools for Manufacturing, Cyber-Physical System (CPS) Design, Autonomy, Collaborative Human-Technology Systems, Diagnostics and Prognostics, and Extended Reality (XR) domains. By combining engineering innovations with methods from machine learning, AI, statistics and optimisation, and geometric reasoning, his research strives to solve important problems in the above-mentioned domains. His research has been supported by NSF, DARPA, ONR, NSWC, DMDII, HP, NYSERDA, and NYSPII. He has authored over 100 articles to date in peer-reviewed conferences and journals covering a wide array of problems. He was a recipient of the 2009 HP Design Innovation and the 2017 ASME IDETC/CIE Young Engineer Award. He also received the 2019 PHM Society Conference Best Paper Award.

ORCID

Chandan K. Sahu http://orcid.org/0000-0001-5893-4031
Crystal Young http://orcid.org/0000-0001-5291-7468
Rahul Rai http://orcid.org/0000-0002-6478-4065
References

Ababsa, Fakhreddine. 2020. “Augmented Reality Application in Manufacturing Industry: Maintenance and Non-destructive Testing (NDT) Use Cases.” International conference on Augmented Reality, Virtual Reality and Computer Graphics, Lecce, Italy, 333–344. Springer. doi:10.1007/978-3-030-58468-9_24.
Ababsa, Fakhreddine, Jean-Yves Didier, Imane Zendjebil, and Malik Mallem. 2008. “Markerless Vision-based Tracking of Partially Known 3D Scenes for Outdoor Augmented Reality Applications.” International symposium on Visual Computing, Las Vegas, NV, 498–507. Springer.
Ababsa, Fakhr-eddine, and Malik Mallem. 2004. “Robust Camera Pose Estimation Using 2d Fiducials Tracking for Real-Time Augmented Reality Systems.” Proceedings of the 2004 ACM SIGGRAPH international conference on Virtual Reality Continuum and its Applications in Industry, Singapore, 431–435.
Ababsa, Fakhreddine, and Malik Mallem. 2011. “Robust Camera Pose Tracking for Augmented Reality Using Particle Filtering Framework.” Machine Vision and Applications 22 (1): 181–195.
Abburu, Sunitha. 2012. “A Survey on Ontology Reasoners and Comparison.” International Journal of Computer Applications 57 (17): 33–39.
Abdel-Hakim, Alaa E., and Aly A. Farag. 2006. “CSIFT: A SIFT Descriptor with Color Invariant Characteristics.” 2006 IEEE computer society conference on Computer Vision and Pattern Recognition (CVPR’06), Hong Kong, Vol. 2, 1978–1983. IEEE.
Abdi, Lotfi, and Aref Meddeb. 2017. “Driver Information System: A Combination of Augmented Reality and Deep Learning.” Proceedings of the symposium on Applied Computing, Marrakech, Morocco, 228–230.
Abraham, Magid, and Marco Annunziata. 2017. “Augmented Reality is Already Improving Worker Performance.” Accessed 31 May 2020. https://hbr.org/2017/03/augmented-reality-is-already-improving-worker-performance.
Aggour, Kareem S., Vipul K. Gupta, Daniel Ruscitto, Leonardo Ajdelsztajn, Xiao Bian, Kristen H. Brosnan, Natarajan Chennimalai Kumar, et al. 2019. “Artificial Intelligence/Machine Learning in Manufacturing and Inspection: A GE Perspective.” MRS Bulletin 44 (7): 545–558.
Akgul, Omer, H. Ibrahim Penekli, and Yakup Genc. 2016. “Applying Deep Learning in Augmented Reality Tracking.” 12th international conference on Signal-Image Technology & Internet-based Systems (SITIS), Naples, Italy, 47–54. IEEE.
Alhaija, Hassan Abu, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. 2017. “Augmented Reality Meets Deep Learning for Car Instance Segmentation in Urban Scenes.” British Machine Vision Conference, London, UK, Vols. 1, 2.
Ali, Munira Mohd, Mamadou Bilo Doumbouya, Thierry Louge, Rahul Rai, and Mohamed Hedi Karray. 2020. “Ontology-based Approach to Extract Product’s Design Features From Online Customers’ Reviews.” Computers in Industry 116: 103175.
Ali, Munira Mohd, Rahul Rai, J. Neil Otte, and Barry Smith. 2019. “A Product Life Cycle Ontology for Additive Manufacturing.” Computers in Industry 105: 191–203.
Aljalbout, Elie, Vladimir Golkov, Yawar Siddiqui, Maximilian Strobel, and Daniel Cremers. 2018. “Clustering with Deep Learning: Taxonomy and New Methods.” Preprint arXiv:1801.07648.
Álvarez, Hugo, Igor Lajas, Andoni Larrañaga, Luis Amozarrain, and Iñigo Barandiaran. 2019. “Augmented Reality System to Guide Operators in the Setup of Die Cutters.” The International Journal of Advanced Manufacturing Technology 103 (1–4): 1543–1553.
Andriyenko, Anton, Konrad Schindler, and Stefan Roth. 2012. “Discrete-continuous Optimization for Multi-target Tracking.” IEEE conference on Computer Vision and Pattern Recognition, Providence, RI, 1926–1933. IEEE.
Arinez, Jorge F., Qing Chang, Robert X. Gao, Chengying Xu, and Jianjing Zhang. 2020. “Artificial Intelligence in Advanced Manufacturing: Current Status and Future Outlook.” Journal of Manufacturing Science and Engineering 142 (11): 110804.
Axholt, Magnus. 2011. “Pinhole Camera Calibration in the Presence of Human Noise.” PhD diss., Linköping University Electronic Press.
Azadeh, A., M. Jeihoonian, B. Maleki Shoja, and S. H. Seyedmahmoudi. 2012. “An Integrated Neural Network–Simulation Algorithm for Performance Optimisation of the Bi-criteria Two-stage Assembly Flow-shop Scheduling Problem with Stochastic Activities.” International Journal of Production Research 50 (24): 7271–7284.
Azuma, Ronald, Yohan Baillot, Reinhold Behringer, Steven Feiner, Simon Julier, and Blair MacIntyre. 2001. “Recent Advances in Augmented Reality.” IEEE Computer Graphics and Applications 21 (6): 34–47.
Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. 2017. “Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12): 2481–2495.
Bai, Shuang, and Shan An. 2018. “A Survey on Automatic Image Caption Generation.” Neurocomputing 311: 291–304.
Baig, Mirza Muhammad Ali, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, and Omar Arif. 2018. “Image Caption Generator with Novel Object Injection.” Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 1–8. IEEE.
Bailey, Tim, and Hugh Durrant-Whyte. 2006. “Simultaneous Localization and Mapping (SLAM): Part II.” IEEE Robotics & Automation Magazine 13 (3): 108–117.
Baillot, Yohan, Simon J. Julier, Dennis Brown, and Mark A. Livingston. 2003. “A Tracker Alignment Framework for Augmented Reality.” Proceedings of the 2nd IEEE and ACM international symposium on Mixed and Augmented Reality, Tokyo, Japan, 142–150. IEEE.
Bajic, B., I. Cosic, M. Lazarevic, N. Sremcev, and A. Rikalovic. 2018. “Machine Learning Techniques for Smart Manufacturing: Applications and Challenges in Industry 4.0.” 9th International Scientific and Expert Conference, TEAM 2018, Novi Sad, Serbia, 29.
Bajura, Michael, and Ulrich Neumann. 1995a. “Dynamic Registration Correction in Augmented-reality Systems.” Proceedings Virtual Reality Annual International Symposium’95, Research Triangle Park, NC, 189–196. IEEE.
Bajura, Michael, and Ulrich Neumann. 1995b. “Dynamic Reg- Inspection.” IEEE Winter conference on Applications of
istration Correction in Video-based Augmented Reality Sys- Computer Vision (WACV), Lake Placid, NY, 1–8. IEEE.
tems.” IEEE Computer Graphics and Applications 15 (5): Biederman, Irving. 1987. “Recognition-by-Components: A
52–60. Theory of Human Image Understanding.” Psychological
Balntas, Vassileios, Shuda Li, and Victor Prisacariu. 2018. Review 94 (2): 115.
“Relocnet: Continuous Metric Learning Relocalisation using Biernacki, Christophe, Gilles Celeux, and Gérard Govaert.
Neural Nets.” Proceedings of the European Conference on 2000. “Assessing a Mixture Model for Clustering with the
Computer Vision (ECCV), Munich, Germany, 751–767. Integrated Completed Likelihood.” IEEE Transactions on
Bane, Ryan, and Tobias Hollerer. 2004. “Interactive Tools for Pattern Analysis and Machine Intelligence 22 (7): 719–725.
Virtual X-ray Vision in Mobile Augmented Reality.” Third Bleser, Gabriele, Yulian Pastarmov, and Didier Stricker. 2005.
IEEE and ACM international symposium on Mixed and “Real-time 3D Camera Tracking for Industrial Augmented
Augmented Reality, Arlington, VA, 231–239. IEEE. Reality Applications.” The 13th international conference in
Bar-Shalom, Yaakov, Thomas E. Fortmann, and Peter G. Central Europe on Computer Graphics, Visualization and
Cable. 1990. “Tracking and Data Association.” The Jour- Computer Vision, University of West Bohemia, Campus
nal of the Acoustical Society of America 87: 918–919. Bory, Plzen-Bory, Czech Republic, 47–54.
doi:10.1121/1.398863. Blostein, Dorothea, and Narendra Ahuja. 1989. “Shape From
Baryannis, George, Sahar Validi, Samir Dani, and Grigoris Texture: Integrating Texture-Element Extraction and Sur-
Antoniou. 2019. “Supply Chain Risk Management and Arti- face Estimation.” IEEE Transactions on Pattern Analysis and
ficial Intelligence: State of the Art and Future Research Machine Intelligence11 (12): 1233–1251.
Directions.” International Journal of Production Research 57 Bogdan, Oleksandr, Viktor Eckstein, Francois Rameau, and
(7): 2179–2202. Jean-Charles Bazin. 2018. “DeepCalib: A Deep Learning
Bateux, Quentin, Eric Marchand, Jürgen Leitner, François Approach for Automatic Intrinsic Calibration of Wide Field-
Chaumette, and Peter Corke. 2017. “Visual Servoing from of-View Cameras.” Proceedings of the 15th ACM SIG-
Deep Neural Networks.” Preprint arXiv:1705.08940. GRAPH European conference on Visual Media Production,
Bateux, Quentin, Eric Marchand, Jürgen Leitner, François Munich, Germany, 1–10.
Chaumette, and Peter Corke. 2018. “Training Deep Neural Bolme, David S., J. Ross Beveridge, Bruce A. Draper, and Yui
Networks for Visual Servoing.” IEEE International Confer- Man Lui. 2010. “Visual Object Tracking using Adaptive Cor-
ence on Robotics and Automation (ICRA), Brisbane, Aus- relation Filters.” IEEE computer society conference on Com-
tralia, 1–8. IEEE. puter Vision and Pattern Recognition, San Francisco, CA,
Bauer, Martin. 2007. “Tracking Errors in Augmented Reality.” 2544–2550. IEEE.
PhD diss., Technische Universität München. Boomija, M. D., and M. Phil. 2008. “Comparison of Partition
Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. 2006. “Surf: Based Clustering Algorithms.” Journal of Computer Applica-
Speeded Up Robust Features.” European Conference on tions 1 (4): 18–21.
Computer Vision, Graz, Austria, 404–417. Springer. Bostanci, Erkan, Nadia Kanwal, Shoaib Ehsan, and Adrian F.
Behringer, Reinhold, Jun Park, and Venkataraman Sun- Clark. 2013. “User Tracking Methods for Augmented Real-
dareswaran. 2002. “Model-based Visual Tracking for Out- ity.” International Journal of Computer Theory and Engineer-
door Augmented Reality Applications.” Proceedings of the ing 5 (1): 93.
international symposium on Mixed and Augmented Reality, Bouthemy, Patrick. 1989. “A Maximum Likelihood Frame-
Darmstadt, Germany, 277–322. IEEE. work for Determining Moving Edges.” IEEE Transactions on
Bell, B., and S. Feiner. 2003. “Augmented Reality for Collabora- Pattern Analysis and Machine Intelligence 11 (5): 499–511.
tive Exploration of Unfamiliar Environments.” NFS Work- Brachmann, Eric, Alexander Krull, Sebastian Nowozin, Jamie
shop on Collaborative Virtual Reality and Visualization, Shotton, Frank Michel, Stefan Gumhold, and Carsten
Lake Tahoe, NV, 26–28. Citeseer. Rother. 2017. “Dsac-differentiable RANSAC for Camera
Benhimane, Selim, and Ezio Malis. 2007. “Homography-based Localization.” Proceedings of the IEEE conference on
2D Visual Tracking and Servoing.” The International Journal Computer Vision and Pattern Recognition, Honolulu, HI,
of Robotics Research 26 (7): 661–676. 6684–6692.
Berclaz, Jerome, Francois Fleuret, Engin Turetken, and Pascal Brachmann, Eric, and Carsten Rother. 2018. “Learning Less is
Fua. 2011. “Multiple Object Tracking Using K-shortest Paths More-6D Camera Localization via 3D Surface Regression.”
Optimization.” IEEE Transactions on Pattern Analysis and Proceedings of the IEEE conference on Computer Vision
Machine Intelligence33 (9): 1806–1819. and Pattern Recognition, Salt Lake City, UT, 4654–4662.
Bertinetto, Luca, Jack Valmadre, Joao F. Henriques, Andrea Breen, David E., Eric Rose, and Ross T. Whitaker. 1995.
Vedaldi, and Philip H. S. Torr. 2016. “Fully-convolutional Interactive Occlusion and Collision of Real and Virtual
Siamese Networks for Object Tracking.” European Confer- Objects in Augmented Reality. Munich: European Com-
ence on Computer Vision, Amsterdam, The Netherlands, puter Industry Research Center. Accessed 28 May 2020.
850–865. Springer. http://www.cs.iupui.edu/ tuceryan/research/AR/ECRC-95-
Bewley, Alex, Zongyuan Ge, Lionel Ott, Fabio Ramos, and 02.pdf.
Ben Upcroft. 2016. “Simple Online and Realtime Tracking.” Brunken, Hauke, and Clemens Gühmann. 2020. “Deep Learn-
IEEE International Conference on Image Processing (ICIP), ing Self-calibration from Planes.” 12th International Con-
Phoenix, AZ, 3464–3468. IEEE. ference on Machine Vision (ICMV 2019), Amsterdam, The
Bian, Xiao, Ser Nam Lim, and Ning Zhou. 2016. “Multiscale Netherlands, Vol. 11433, 114333L. International Society for
Fully Convolutional Network with Application to Industrial Optics and Photonics.
Bruno, Fabio, Loris Barbieri, Emanuele Marino, Maurizio Muz- Chen, Chu-Song, Chi-Kuo Yu, and Yi-Ping Hung. 1999.
zupappa, Luigi D’Oriano, and Biagio Colacino. 2019. “An “New Calibration-free Approach for Augmented Reality
Augmented Reality Tool to Detect and Annotate Design based on Parameterized Cuboid Structure.” Proceedings
Variations in An Industry 4.0 Approach.” The International of the 7th IEEE international conference on Computer
Journal of Advanced Manufacturing Technology 105 (1–4): Vision, Kerkyra, Greece, Vol. 1, 30–37. IEEE.
875–887. Chendeb, Safwan, Mohamad Fawaz, and Vincent Guitteny.
Byeon, Wonmin, Thomas M. Breuel, Federico Raue, and Mar- 2013. “Calibration of a Moving Zoom-lens Camera for Aug-
cus Liwicki. 2015. “Scene Labeling with LSTM Recurrent mented Reality Applications.” IEEE international sympo-
Neural Networks.” Proceedings of the IEEE conference on sium on Industrial Electronics, Taipei, Taiwan, 1–5. IEEE.
Computer Vision and Pattern Recognition, Boston, MA, Cheng, Dachuan, Jian Shi, Yanyun Chen, Xiaoming Deng,
3547–3555. and Xiaopeng Zhang. 2018. “Learning Scene Illumination
Casser, Vincent, Soeren Pirk, Reza Mahjourian, and Anelia by Pairwise Photos from Rear and Front Mobile Cameras.”
Angelova. 2019. “Depth Prediction Without the Sensors: Computer Graphics Forum 37: 213–221.
Leveraging Structure for Unsupervised Learning from Chi, Hung-Lin, Shih-Chung Kang, and Xiangyu Wang. 2013.
Monocular Videos.” Proceedings of the AAAI conference on “Research Trends and Opportunities of Augmented Reality
Artificial Intelligence, Honolulu, HI, Vol. 33, 8001–8008. Applications in Architecture, Engineering, and Construc-
Castle, Robert Oliver, Darren J. Gawley, Georg Klein, and tion.” Automation in Construction 33: 116–122.
David W. Murray. 2007. “Towards Simultaneous Recogni- Chou, Ying-Chieh, and Leehter Yao. 2009. “Automatic Diag-
tion, Localization and Mapping for Hand-held and Wearable nostic System of Electrical Equipment using Infrared Ther-
Cameras.” Proceedings of the IEEE international confer- mography.” International conference of Soft Computing and
ence on Robotics and Automation, Roma, Italy, 4102–4107. Pattern Recognition, Malacca, Malaysia, 155–160. IEEE.
IEEE. Chum, Ondrej, and Jiri Matas. 2005. “Matching with PROSAC-
Chang, M. M. L., A. Y. C. Nee, and S. K. Ong. 2020. “Inter- progressive Sample Consensus.” IEEE computer society
active AR-assisted Product Disassembly Sequence Planning conference on Computer Vision and Pattern Recognition
(ARDIS).” International Journal of Production Research 58: (CVPR’05), San Diego, CA, Vol. 1, 220–226. IEEE.
4916–4931. Collewet, Christophe, and Eric Marchand. 2011. “Photomet-
Chang, M. M. L., S. K. Ong, and A. Y. C. Nee. 2017. “AR-guided ric Visual Servoing.” IEEE Transactions on Robotics 27 (4):
Product Disassembly for Maintenance and Remanufactur- 828–834.
ing, Kamakura, Japan.” Procedia Cirp 61 (1): 299–304. Comaniciu, Dorin, and Peter Meer. 2002. “Mean Shift: A
Chen, Weifeng, Zhao Fu, Dawei Yang, and Jia Deng. 2016. Robust Approach Toward Feature Space Analysis.” IEEE
“Single-image Depth Perception in the Wild.” Advances in Transactions on Pattern Analysis and Machine Intelligence 24
Neural Information Processing Systems, Barcelona, Spain, (5): 603–619.
730–738. Comaniciu, Dorin, Visvanathan Ramesh, and Peter Meer.
Chen, Xinlei, Ross Girshick, Kaiming He, and Piotr Dollár. 2000. “Real-time Tracking of Non-rigid Objects using Mean
2019. “Tensormask: A Foundation for Dense Object Seg- Shift.” Proceedings of the IEEE conference on Computer
mentation.” Proceedings of the IEEE international confer- Vision and Pattern Recognition, CVPR 2000 (Cat. No.
ence on Computer Vision, Seoul, Korea, 2061–2069. PR00662), Hilton Head Island, SC, Vol. 2, 142–149. IEEE.
Chen, Liang-Chieh, Alexander Hermans, George Papandreou, Comport, Andrew I., Éric Marchand, and François Chaumette.
Florian Schroff, Peng Wang, and Hartwig Adam. 2018. 2003. “A Real-time Tracker for Markerless Augmented Real-
“MaskLab: Instance Segmentation by Refining Object Detec- ity.” Proceedings of the 2nd IEEE and ACM international
tion with Semantic and Direction Features.” Proceedings symposium on Mixed and Augmented Reality, Tokyo, Japan,
of the IEEE conference on Computer Vision and Pattern 36–45. IEEE.
Recognition, Salt Lake City, UT, 4013–4022. Comport, Andrew I., Eric Marchand, Muriel Pressigout, and
Chen, C. J., J. Hong, and S. F. Wang. 2015. “Automated Posi- Francois Chaumette. 2006. “Real-time Markerless Tracking
tioning of 3D Virtual Scene in AR-based Assembly and for Augmented Reality: The Virtual Visual Servoing Frame-
Disassembly Guiding System.” The International Journal of work.” IEEE Transactions on Visualization and Computer
Advanced Manufacturing Technology 76 (5–8): 753–764. Graphics 12 (4): 615–628.
Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo
Kevin Murphy, and Alan L. Yuille. 2017. “DeepLab: Semantic Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke,
Image Segmentation with Deep Convolutional Nets, Atrous Stefan Roth, and Bernt Schiele. 2016. “The Cityscapes
Convolution, and Fully Connected CRFs.” IEEE Transac- Dataset for Semantic Urban Scene Understanding.” Pro-
tions on Pattern Analysis and Machine Intelligence 40 (4): ceedings of the IEEE conference on Computer Vision and
834–848. Pattern Recognition, Las Vegas, NV, 3213–3223.
Chen, Longtao, Xiaojiang Peng, and Mingwu Ren. 2018. Criminisi, Antonio, Patrick Perez, and Kentaro Toyama. 2003.
“Recurrent Metric Networks and Batch Multiple Hypothesis “Object Removal by Exemplar-based Inpainting.” Proceed-
for Multi-object Tracking.” IEEE Access 7: 3093–3105. ings of the 2003 IEEE computer society conference on Com-
Chen, Bao Xin, and John K. Tsotsos. 2019. “Fast Visual puter Vision and Pattern Recognition, Madison, WI, Vol. 2,
Object Tracking with Rotated Bounding Boxes.” Preprint II–II. IEEE.
arXiv:1907.03892. Crowder, Richard M., Max L. Wilson, David Fowler, Nigel
Chen, Yajing, Shikui Tu, Yuqi Yi, and Lei Xu. 2017. “Sketch- Shadbolt, Gary Wills, and Sylvia Wong. 2009. “Navigation
pix2seq: A Model to Generate Sketches of Multiple Cate- Over a Large Ontology for Industrial Web Applications.”
gories.” Preprint arXiv:1709.04121. ASME 2009 International Design Engineering Technical
Conferences and Computers and Information in Engineer- Doil, Fabian, W. Schreiber, T. Alt, and C. Patron. 2003. “Aug-
ing Conference, San Diego, CA, 1333–1340. American Soci- mented Reality for Manufacturing Planning.” Proceedings
ety of Mechanical Engineers Digital Collection. of the workshop on Virtual Environments 2003, Zurich,
Cui, Le. 2016. “Robust Micro/Nano-Positioning by Visual Ser- Switzerland, 71–76.
voing.” PhD diss., Institut de Recherche en Informatique et Donné, Simon, Jonas De Vylder, Bart Goossens, and Wilfried
Système Aléatoires, Université de Rennes. Philips. 2016. “MATE: Machine Learning for Adaptive Cali-
Dacko, Scott G. 2017. “Enabling Smart Retail Settings Via bration Template Detection.” Sensors 16 (11): 1858.
Mobile Augmented Reality Shopping Apps.” Technological Dorfmüller, Klaus. 1999. “Robust Tracking for Augmented
Forecasting and Social Change 124: 243–256. Reality Using Retroreflective Markers.” Computers & Graph-
Dagnaw, Gizealew. 2020. “Artificial Intelligence towards Future ics 23 (6): 795–800.
Industrial Opportunities and Challenges.” The 6th Annual Doshi, Ashish, Ross T. Smith, Bruce H. Thomas, and Con
ACIST Proceedings, Addis Ababa University, Ethiopia. Bouras. 2017. “Use of Projector Based Augmented Reality
Accessed 11 May 2020. https://digitalcommons.kennesaw. to Improve Manual Spot-welding Precision and Accuracy
edu/cgi/viewcontent.cgi?article = 1027&context = acist. for Automotive Manufacturing.” The International Journal of
Dai, Jifeng, Kaiming He, and Jian Sun. 2016. “Instance-aware Advanced Manufacturing Technology 89 (5–8): 1279–1293.
Semantic Segmentation via Multi-task Network Cascades.” Dosovitskiy, Alexey, Philipp Fischer, Eddy Ilg, Philip Hausser,
Proceedings of the IEEE conference on Computer Vision Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt,
and Pattern Recognition, Las Vegas, NV, 3150–3158. Daniel Cremers, and Thomas Brox. 2015. “FlowNet: Learn-
Dai, Jifeng, Yi Li, Kaiming He, and Jian Sun. 2016. “R-FCN: ing Optical Flow with Convolutional Networks.” Proceed-
Object Detection via Region-based Fully Convolutional Net- ings of the IEEE international conference on Computer
works.” Advances in Neural Information Processing Sys- Vision, Santiago, Chile, 2758–2766.
tems, Barcelona, Spain, 379–387. Drummond, Tom, and Roberto Cipolla. 1999. “Visual Track-
Dame, Amaury, and Eric Marchand. 2010. “Accurate Real-time ing and Control using Lie Algebras.” Proceedings of the 1999
Tracking Using Mutual Information.” IEEE international IEEE computer society conference on Computer Vision and
symposium on Mixed and Augmented Reality, Seoul, Korea, Pattern Recognition (Cat. No PR00149), Fort Collins, CO,
47–56. IEEE. Vol. 2, 652–657. IEEE.
Danelljan, Martin, Goutam Bhat, Fahad Shahbaz Khan, and Drummond, Tom, and Roberto Cipolla. 2002. “Real-time
Michael Felsberg. 2017. “Eco: Efficient Convolution Opera- Visual Tracking of Complex Structures.” IEEE Transactions
tors for Tracking.” Proceedings of the IEEE conference on on Pattern Analysis and Machine Intelligence 24 (7): 932–946.
Computer Vision and Pattern Recognition, Honolulu, HI, Durrant-Whyte, Hugh, and Tim Bailey. 2006. “Simultaneous
6638–6646. Localization and Mapping: Part I.” IEEE Robotics & Automa-
Danelljan, Martin, Gustav Hager, Fahad Shahbaz Khan, and tion Magazine 13 (2): 99–110.
Michael Felsberg. 2015. “Convolutional Features for Cor- Ebert, Frederik, Sudeep Dasari, Alex X. Lee, Sergey Levine,
relation Filter based Visual Tracking.” Proceedings of the and Chelsea Finn. 2018. “Robustness via Retrying: Closed-
IEEE international conference on Computer Vision Work- loop Robotic Manipulation with Self-supervised Learning.”
shops, Santiago, Chile, 58–66. Preprint arXiv:1810.03043.
Dash, Ajaya Kumar, Santosh Kumar Behera, Debi Prosad Ebert, Frederik, Chelsea Finn, Alex X. Lee, and Sergey Levine.
Dogra, and Partha Pratim Roy. 2018. “Designing of Marker- 2017. “Self-supervised Visual Planning with Temporal Skip
based Augmented Reality Learning Environment for Kids Connections.” Preprint arXiv:1710.05268.
Using Convolutional Neural Network Architecture.” Dis- Eigen, David, and Rob Fergus. 2015. “Predicting Depth, Surface
plays 55: 46–54. Normals and Semantic Labels with a Common Multi-scale
Dave, Achal, Tarasha Khurana, Pavel Tokmakov, Cordelia Convolutional Architecture.” Proceedings of the IEEE inter-
Schmid, and Deva Ramanan. 2020. “TAO: A Large- national conference on Computer Vision, Santiago, Chile,
Scale Benchmark for Tracking Any Object.” Preprint 2650–2658.
arXiv:2005.10356. Eigen, David, Christian Puhrsch, and Rob Fergus. 2014. “Depth
Deac, Crina Narcisa, Gicu Calin Deac, Cicerone Lauren- Map Prediction from a Single Image using a Multi-scale
tiu Popa, Mihalache Ghinea, and Costel Emil Cotet. Deep Network.” Advances in Neural Information Processing
2017. “Using Augmented Reality in Smart Manufacturing.” Systems, Montreal, Canada, 2366–2374.
Proceedings of the 28th DAAAM International Sympo- Erhan, Dumitru, Christian Szegedy, Alexander Toshev, and
sium, Zadar, Croatia, 0727–0732. Dragomir Anguelov. 2014. “Scalable Object Detection Using
Debevec, Paul. 2008. “Rendering Synthetic Objects into Real Deep Neural Networks.” Proceedings of the IEEE Confer-
Scenes: Bridging Traditional and Image-based Graphics ence on Computer Vision and Pattern Recognition, Colum-
with Global Illumination and High Dynamic Range Pho- bus, OH, 2147–2154.
tography.” ACM SIGGRAPH 2008 Classes, Orlando, FL, Escobar, Carlos A., and Ruben Morales-Menendez. 2018.
1–10. “Machine Learning Techniques for Quality Control in High
Dementhon, Daniel F., and Larry S. Davis. 1995. “Model-based Conformance Manufacturing Environment.” Advances in
Object Pose in 25 Lines of Code.” International Journal of Mechanical Engineering 10 (2): 1687814018755519.
Computer Vision 15 (1–2): 123–141. Espiau, Bernard. 1994. “Effect of Camera Calibration Errors
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li on Visual Servoing in Robotics.” Experimental Robotics
Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image III, Kyoto, Japan, 182–192. Springer.
Database.” IEEE conference on Computer Vision and Pat- Everingham, Mark, Luc Van Gool, Christopher K. I. Williams,
tern Recognition, Miami, FL, 248–255. IEEE. John Winn, and Andrew Zisserman. 2010. “The Pascal
Visual Object Classes (VOC) Challenge.” International Jour- conference on Emerging Technologies & Factory Automa-
nal of Computer Vision 88 (2): 303–338. tion (ETFA), Luxembourg, 1–4. IEEE.
Facil, Jose M., Benjamin Ummenhofer, Huizhong Zhou, Luis Fu, Cheng-Yang, Wei Liu, Ananth Ranga, Ambrish Tyagi, and
Montesano, Thomas Brox, and Javier Civera. 2019. “CAM- Alexander C. Berg. 2017. “Dssd: Deconvolutional Single
Convs: Camera-aware Multi-scale Convolutions for Single- Shot Detector.” Preprint arXiv:1701.06659.
view Depth.” Proceedings of the IEEE conference on Com- Furini, Francesco, Rahul Rai, Barry Smith, Giorgio Colombo,
puter Vision and Pattern Recognition, Long Beach, CA, and Venkat Krovi. 2016. “Development of a Manufactur-
11826–11835. ing Ontology for Functionally Graded Materials.” ASME
Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. 2017. “A Point 2016 International Design Engineering Technical Confer-
Set Generation Network for 3D Object Reconstruction from ences and Computers and Information in Engineering Con-
a Single Image.” Proceedings of the IEEE conference on ference, Charlotte, NC, American Society of Mechanical
Computer Vision and Pattern Recognition, Honolulu, HI, Engineers Digital Collection.
605–613. Garcia-Garcia, Alberto, Sergio Orts-Escolano, Sergiu Oprea,
Fang, H. C., S. K. Ong, and A. Y. C. Nee. 2013. “Orientation Victor Villena-Martinez, and Jose Garcia-Rodriguez. 2017.
Planning of Robot End-effector Using Augmented Reality.” “A Review on Deep Learning Techniques Applied to Seman-
The International Journal of Advanced Manufacturing Tech- tic Segmentation.” Preprint arXiv:1704.06857.
nology 67 (9–12): 2033–2049. Gardner, Marc-André, Kalyan Sunkavalli, Ersin Yumer, Xiao-
Faugeras, Olivier D., Q.-T. Luong, and Stephen J. Maybank. hui Shen, Emiliano Gambaretto, Christian Gagné, and Jean-
1992. “Camera Self-calibration: Theory and Experiments.” François Lalonde. 2017. “Learning to Predict Indoor Illumi-
European conference on Computer Vision, Santa Margherita nation from a Single Image.” Preprint arXiv:1704.00090.
Ligure, Italy, 321–334. Springer. Garg, Ravi, B. G. Vijay Kumar, Gustavo Carneiro, and Ian Reid.
Feiner, Steven, Blair Macintyre, and Dorée Seligmann. 1993. 2016. “Unsupervised CNN for Single View Depth Estima-
“Knowledge-based Augmented Reality.” Communications of tion: Geometry to the Rescue.” European conference on
the ACM 36 (7): 53–62. Computer Vision, Amsterdam, The Netherlands, 740–756.
Feng, Tuo, and Dongbing Gu. 2019. “Sganvo: Unsuper- Springer.
vised Deep Visual Odometry and Depth Estimation with Garon, Mathieu, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr,
Stacked Generative Adversarial Networks.” IEEE Robotics and Jean-François Lalonde. 2019. “Fast Spatially-Varying
and Automation Letters 4 (4): 4431–4437. Indoor Lighting Estimation.” Proceedings of the IEEE con-
Feng, Qianyu, Yu Wu, Hehe Fan, Chenggang Yan, Mingliang ference on Computer Vision and Pattern Recognition, Long
Xu, and Yi Yang. 2020. “Cascaded Revision Network for Beach, CA, 6908–6917.
Novel Object Captioning.” IEEE Transactions on Circuits and Gavish, Nirit, Teresa Gutiérrez, Sabine Webel, Jorge Rodríguez,
Systems for Video Technology. Matteo Peveri, Uli Bockholt, and Franco Tecchia. 2015.
Fiala, Mark. 2005. “ARTag, A Fiducial Marker System using “Evaluating Virtual Reality and Augmented Reality Training
Digital Techniques.” IEEE computer society conference on for Industrial Maintenance and Assembly Tasks.” Interactive
Computer Vision and Pattern Recognition (CVPR’05), San Learning Environments 23 (6): 778–798.
Diego, CA, Vol. 2, 590–596. IEEE. Geiger, Andreas, Philip Lenz, Christoph Stiller, and Raquel
Finn, Chelsea, Ian Goodfellow, and Sergey Levine. 2016. Urtasun. 2013. “Vision Meets Robotics: The Kitti Dataset.”
“Unsupervised Learning for Physical Interaction Through The International Journal of Robotics Research 32 (11):
Video Prediction.” Advances in Neural Information Process- 1231–1237.
ing Systems, Barcelona, Spain, 64–72. Geiger, Andreas, Philip Lenz, and Raquel Urtasun. 2012.
Finn, Chelsea, and Sergey Levine. 2017. “Deep Visual Foresight “Are We Ready for Autonomous Driving? The Kitti
for Planning Robot Motion.” IEEE international conference Vision Benchmark Suite.” IEEE conference on Computer
on Robotics and Automation (ICRA), Mariana Bay Sands, Vision and Pattern Recognition, Providence, RI, 3354–3361.
Singapore, 2786–2793. IEEE. IEEE.
Finn, Chelsea, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Gheisari, Masoud, Shane Goodman, Justin Schmidt, Grace-
Levine, and Pieter Abbeel. 2016. “Deep Spatial Autoencoders line Williams, and Javier Irizarry. 2014. “Exploring BIM and
for Visuomotor Learning.” IEEE International Conference Mobile Augmented Reality Use in Facilities Management.”
on Robotics and Automation (ICRA), Stockholm, Sweden, Construction Research Congress 2014: Construction in a
512–519. IEEE. Global Network, Atlanta, GA, 1941–1950.
Fischer, Jan, Dirk Bartz, and Wolfgang Straßer. 2004. “Occlu- Girshick, Ross. 2015. “Fast R-CNN.” Proceedings of the IEEE
sion Handling for Medical Augmented Reality using a Vol- international conference on Computer Vision, Santiago,
umetric Phantom Model.” Proceedings of the ACM sym- Chile, 1440–1448.
posium on Virtual Reality Software and Technology, Hong Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra
Kong, 174–177. Malik. 2014. “Rich Feature Hierarchies for Accurate Object
Fischler, Martin A., and Robert C. Bolles. 1981. “Random Sam- Detection and Semantic Segmentation.” Proceedings of the
ple Consensus: A Paradigm for Model Fitting with Appli- IEEE conference on Computer Vision and Pattern Recogni-
cations to Image Analysis and Automated Cartography.” tion, Columbus, OH, 580–587.
Communications of the ACM 24 (6): 381–395. Godard, Clément, Oisin Mac Aodha, Michael Firman, and
Flatt, Holger, Nils Koch, Carsten Röcker, Andrei Günter, and Gabriel J. Brostow. 2019. “Digging into Self-supervised
Jürgen Jasperneite. 2015. “A Context-aware Assistance Sys- Monocular Depth Estimation.” Proceedings of the IEEE
tem for Maintenance Applications in Smart Factories based international conference on Computer Vision, Seoul, Korea,
on Augmented Reality and Indoor Localization.” IEEE 20th 3828–3838.
Gomez, J.-F. V., Gilles Simon, and M.-O. Berger. 2005. “Cal- Hamid Rezatofighi, Seyed, Anton Milan, Zhen Zhang, Qinfeng
ibration Errors in Augmented Reality: A Practical Study.” Shi, Anthony Dick, and Ian Reid. 2015. “Joint Probabilistic
4th IEEE and ACM International Symposium on Mixed and Data Association Revisited.” Proceedings of the IEEE inter-
Augmented Reality (ISMAR’05), Vienna, Austria, 154–163. national conference on Computer Vision, Santiago, Chile,
IEEE. 3047–3055.
Gonzalez-Franco, Mar, Julio Cermeron, Katie Li, Rodrigo Hariharan, Bharath, Pablo Arbeláez, Ross Girshick, and Jiten-
Pizarro, Jacob Thorn, Windo Hutabarat, Ashutosh Tiwari, dra Malik. 2014. “Simultaneous Detection and Segmenta-
and Pablo Bermell-Garcia. 2016. “Immersive Augmented tion.” European conference on Computer Vision, Zurich,
Reality Training for Complex Manufacturing Scenarios.” Switzerland, 297–312. Springer.
Preprint arXiv:1602.01944. Hariharan, Bharath, Pablo Arbelaez, Ross Girshick, and Jiten-
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing dra Malik. 2016. “Object Instance Segmentation and Fine-
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, grained Localization Using Hypercolumns.” IEEE Transac-
and Yoshua Bengio. 2014. “Generative Adversarial Nets.” tions on Pattern Analysis and Machine Intelligence 39 (4):
Advances in Neural Information Processing Systems, Mon- 627–639.
treal, Canada, 2672–2680. Harris, Chris. 1993. “Tracking with Rigid Models.” In Active
Gordon, Gaile, Mark Billinghurst, Melanie Bell, John Wood- Vision, 59–73.
fill, Bill Kowalik, Alex Erendi, and Janet Tilander. 2002. “The Hayder, Zeeshan, Xuming He, and Mathieu Salzmann. 2017.
Use of Dense Stereo Range Data in Augmented Reality.” “Boundary-aware Instance Segmentation.” Proceedings of
Proceedings of the international symposium on Mixed and the IEEE Conference on Computer Vision and Pattern
Augmented Reality, Darmstadt, Germany, 14–23. IEEE. Recognition, Honolulu, HI, 5696–5704.
Gordon, Ariel, Hanhan Li, Rico Jonschkowski, and Anelia He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
Angelova. 2019. “Depth from Videos in the Wild: Unsu- shick. 2017. “Mask R-CNN.” Proceedings of the IEEE inter-
pervised Monocular depth Learning from Unknown Cam- national conference on Computer Vision, Venice, Italy,
eras.” Proceedings of the IEEE international conference on 2961–2969.
Computer Vision, Seoul, Korea, 8977–8986. He, Anfeng, Chong Luo, Xinmei Tian, and Wenjun Zeng.
Gordon, Iryna, and David G. Lowe. 2006. “What and Where: 2018a. “Towards a Better Match in Siamese Network based
3D Object Recognition with Accurate Pose.” In Toward Visual Object Tracker.” Proceedings of the European Con-
Category-level Object Recognition, 67–82. Springer. ference on Computer Vision (ECCV), Munich, Germany,
Gowda, K. Chidananda, and G. Krishna. 1978. “Agglomerative 1–16.
Clustering Using the Concept of Mutual Nearest Neighbour- He, Anfeng, Chong Luo, Xinmei Tian, and Wenjun Zeng.
hood.” Pattern Recognition 10 (2): 105–112. 2018b. “A Twofold Siamese Network for Real-Time Object
Green, Scott A., Mark Billinghurst, XiaoQi Chen, and J. Geof- Tracking.” Proceedings of the IEEE conference on Com-
frey Chase. 2008. “Human-robot Collaboration: A Literature puter Vision and Pattern Recognition, Salt Lake City, UT,
Review and Augmented Reality Approach in Design.” Inter- 4834–4843.
national Journal of Advanced Robotic Systems 5 (1): 1–18. He, Lei, Guanghui Wang, and Zhanyi Hu. 2018. “Learning
Gruber, Thomas R. 1993. “A Translation Approach to Portable Depth From Single Images with Deep Neural Network
Ontology Specifications.” Knowledge Acquisition 5 (2): Embedding Focal Length.” IEEE Transactions on Image Pro-
199–221. cessing 27 (9): 4676–4689.
Gruber, Lukas, Thomas Richter-Trummer, and Dieter Schmal- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
stieg. 2012. “Real-time Photometric Registration from Arbi- 2015. “Spatial Pyramid Pooling in Deep Convolutional Net-
trary Geometry.” IEEE International Symposium on Mixed works for Visual Recognition.” IEEE Transactions on Pattern
and Augmented Reality (ISMAR), Atlanta, GA, 119–128. Analysis and Machine Intelligence 37 (9): 1904–1916.
IEEE. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Guo, Qing, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, 2016. “Deep Residual Learning for Image Recognition.” Pro-
and Song Wang. 2017. “Learning Dynamic Siamese Net- ceedings of the IEEE conference on Computer Vision and
work for Visual Object Tracking.” Proceedings of the IEEE Pattern Recognition, Las Vegas, NV, 770–778.
international conference on Computer Vision, Venice, Italy, Henderson, Steven J., and Steven K. Feiner. 2007. Augmented
1763–1771. Reality for Maintenance and Repair (ARMAR). Technical
Gwn Lore, Kin, Kishore Reddy, Michael Giering, and Edgar A. Report. Columbia Univ New York Dept of Computer Sci-
Bernal. 2018. “Generative Adversarial Networks for Depth ence.
Map Estimation from RGB Video.” Proceedings of the IEEE Hendricks, Lisa Anne, Subhashini Venugopalan, Marcus
conference on Computer Vision and Pattern Recognition Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Dar-
Workshops, Salt Lake City, UT, 1177–1185. rell. 2016. “Deep Compositional Captioning: Describing
Ha, David, and Douglas Eck. 2017. “A Neural Representation Novel Object Categories Without Paired Training Data.”
of Sketch Drawings.” Preprint arXiv:1704.03477. Proceedings of the IEEE conference on Computer Vision
Ha, Jaewon, Jinki Jung, ByungOk Han, Kyusung Cho, and Hyun and Pattern Recognition, Las Vegas, NV, 1–10. IEEE.
S. Yang. 2011. “Mobile Augmented Reality using Scalable Heyden, Anders, and Kalle Åkström. 1998. “Minimal Con-
Recognition and Tracking.” IEEE Virtual Reality Confer- ditions on Intrinsic Parameters for Euclidean Reconstruc-
ence, Singapore, 211–212. IEEE. tion.” Asian conference on Computer Vision, Hong Kong,
Hahne, Uwe, and Marc Alexa. 2009. “Depth Imaging by Com- 169–176. Springer.
bining Time-of-Flight and On-demand Stereo.” Workshop Heyden, Anders, and Kalle Astrom. 1997. “Euclidean Recon-
on Dynamic 3D Imaging, Jena, Germany, 70–83. Springer. struction from Image Sequences with Varying and Unknown
Focal Length and Principal Point.” Proceedings of IEEE Ivaschenko, Anton, Michael Milutkin, and Pavel Sitnikov. 2019.
computer society conference on Computer Vision and Pat- “Industrial Application of Accented Visualization Based
tern Recognition, San Juan, Puerto Rico, 438–443. IEEE. on Augmented Reality.” 24th conference of Open Inno-
Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya vations Association (FRUCT), Moscow, Russia, 123–129.
Sutskever, and Ruslan R. Salakhutdinov. 2012. “Improving IEEE.
Neural Networks by Preventing Co-adaptation of Feature Järvenpää, Eeva, Niko Siltala, Otto Hylli, and Minna Lanz.
Detectors.” Preprint arXiv:1207.0580. 2019. “The Development of An Ontology for Describing
Hold-Geoffroy, Yannick, Kalyan Sunkavalli, Jonathan Eisen- the Capabilities of Manufacturing Resources.” Journal of
mann, Matthew Fisher, Emiliano Gambaretto, Sunil Hadap, Intelligent Manufacturing 30 (2): 959–978.
and Jean-François Lalonde. 2018. “A Perceptual Measure Jia, Jianqing, Semir Elezovikj, Heng Fan, Shuojin Yang, Jing Liu,
for Deep Single Image Camera Calibration.” Proceedings Wei Guo, Chiu C. Tan, and Haibin Ling. 2019. “Semantic-
of the IEEE conference on Computer Vision and Pattern Aware Label Placement for Augmented Reality in Street
Recognition, Salt Lake City, UT, 2354–2363. View.” Preprint arXiv:1912.07105.
Höllerer, Tobias, Steven Feiner, Drexel Hallaway, Blaine Bell, Jiang, Wei, Juan Camilo Gamboa Higuera, Baptiste Angles,
Marco Lanzagorta, Dennis Brown, Simon Julier, Yohan Bail- Weiwei Sun, Mehrsan Javan, and Kwang Moo Yi. 2020.
lot, and Lawrence Rosenblum. 2001. “User Interface Man- “Optimizing through Learned Errors for Accurate Sports
agement Techniques for Collaborative Mobile Augmented Field Registration.” The IEEE Winter conference on Appli-
Reality.” Computers & Graphics 25 (5): 799–810. cations of Computer Vision, Snowmass Village, Colorado,
Holloway, Richard L. 1997. “Registration Error Analysis for 201–210.
Augmented Reality.” Presence: Teleoperators & Virtual Envi- Jiddi, Salma, Philippe Robert, and Eric Marchand. 2016.
ronments 6 (4): 413–432. “Reflectance and Illumination Estimation for Realistic Aug-
Holm, Magnus, Oscar Danielsson, Anna Syberfeldt, Philip mentations of Real Scenes.” IEEE International Symposium
Moore, and Lihui Wang. 2017. “Adaptive Instructions to on Mixed and Augmented Reality (ISMAR-Adjunct), Merida,
Novice Shop-floor Operators Using Augmented Reality.” Yucatan, Mexico, 244–249. IEEE.
Journal of Industrial and Production Engineering 34 (5): Jiddi, Salma, Philippe Robert, and Eric Marchand. 2017.
362–374. “[POSTER] Illumination Estimation Using Cast Shadows
Holynski, Aleksander, and Johannes Kopf. 2018. “Fast Depth for Realistic Augmented Reality Applications.” IEEE Inter-
Densification for Occlusion-aware Augmented Reality.” national Symposium on Mixed and Augmented Reality
ACM Transactions on Graphics (TOG) 37 (6): 1–11. (ISMAR-Adjunct), Nantes, France, 192–193. IEEE.
Horn, Berthold K. P., and Michael J. Brooks. 1989. Shape Jo, Geun-Sik, Kyeong-Jin Oh, Inay Ha, Kee-Sung Lee, Myung-
from Shading. Vol. 2. Cambridge, MA: MIT Press. ISBN: Duk Hong, Ulrich Neumann, and Suya You. 2014. “A Uni-
9780262519175. fied Framework for Augmented Reality and Knowledge-
Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry based Systems in Maintaining Aircraft.” 26th IAAI Confer-
Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andree ence, Toronto, Canada, 2990–2997.
tto, and Hartwig Adam. 2017. “Mobilenets: Efficient Convo- Johnson, Stephen C.. 1967. “Hierarchical Clustering Schemes.”
lutional Neural Networks for Mobile Vision Applications.” Psychometrika 32 (3): 241–254.
Preprint arXiv:1704.04861. Jordan, Michael I., and Tom M. Mitchell. 2015. “Machine
Huber, Peter J. 1992. “Robust Estimation of a Location Param- Learning: Trends, Perspectives, and Prospects.” Science 349
eter.” In Breakthroughs in Statistics, edited by S. Kotz (6245): 255–260.
and N. L. Johnson, 492–518. New York, NY: Springer. Julier, Simon, Marco Lanzagorta, Yohan Baillot, Lawrence
doi:10.1007/978-1-4612-4380-9_35. Rosenblum, Steven Feiner, Tobias Hollerer, and Sabrina Ses-
Hung, Wei-Chih, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu tito. 2000. “Information Filtering for Mobile Augmented
Lin, and Ming-Hsuan Yang. 2018. “Adversarial Learn- Reality.” Proceedings IEEE and ACM International Sympo-
ing for Semi-supervised Semantic Segmentation.” Preprint sium on Augmented Reality (ISAR 2000), Munich, Germany,
arXiv:1802.07934. 3–11. IEEE.
Huo, Jiage, Jianghua Zhang, and Felix T. S. Chan. 2020. “A Kalman, Rudolph Emil. 1960. “A New Approach to Linear
Fuzzy Control System for Assembly Line Balancing with a Filtering and Prediction Problems.” Journal of Fluids Engi-
Three-state Degradation Process in the Era of Industry 4.0.” neering 82 (1): 35–45.
International Journal of Production Research 58: 1–18. Kán, Peter, and Hannes Kaufmann. 2013. “Differential Irradi-
Iandola, Forrest N., Song Han, Matthew W. Moskewicz, ance Caching for Fast High-quality Light Transport between
Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. Virtual and Real Worlds.” IEEE International Symposium
“SqueezeNet: AlexNet-level Accuracy with 50× Fewer on Mixed and Augmented Reality (ISMAR), Adelaide, S.A.,
Parameters and < 0.5 MB Model Size.” Preprint arXiv:1602. Australia, 133–141. IEEE.
07360. Kán, Peter, Johannes Unterguggenberger, and Hannes Kauf-
Ichihashi, Keita, and Kaori Fujinami. 2019. “Estimating Visi- mann. 2015. “High-quality Consistent Illumination in
bility of Annotations for View Management in Spatial Aug- Mobile Augmented Reality by Radiance Convolution on the
mented Reality Based on Machine-Learning Techniques.” GPU.” International symposium on Visual Computing, Las
Sensors 19 (4): 939–966. Vegas, NV, 574–585. Springer.
Irani, Michal, and P. Anandan. 1998. “Robust Multi-sensor Kanade, Takeo, and Masatoshi Okutomi. 1994. “A Stereo
Image Alignment.” 6th International conference on Com- Matching Algorithm with An Adaptive Window: Theory
puter Vision (IEEE Cat. No. 98CH36271), Bombay, India, and Experiment.” IEEE Transactions on Pattern Analysis and
959–966. IEEE. Machine Intelligence 16 (9): 920–932.
Kanbara, Masayuki, Takashi Okuma, Haruo Takemura, and Klein, Georg, and David Murray. 2007. “Parallel Tracking and
Naokazu Yokoya. 2000. “A Stereoscopic Video See-through Mapping for Small AR Workspaces.” 6th IEEE and ACM
Augmented Reality System based on Real-time Vision-based international symposium on Mixed and Augmented Real-
Registration.” Proceedings IEEE Virtual Reality 2000 (Cat. ity, Nara, Japan, 225–234. IEEE.
No. 00CB37048), New Brunswick, NJ, 255–262. IEEE. Knorr, Sebastian B., and Daniel Kurz. 2014. “Real-time Illu-
Kang, Seokho, Eunji Kim, Jaewoong Shim, Wonsang Chang, mination Estimation from Faces for Coherent Rendering.”
and Sungzoon Cho. 2018. “Product Failure Prediction with IEEE international symposium on Mixed and Augmented
Missing Data.” International Journal of Production Research Reality (ISMAR), Munich, Germany, 113–122. IEEE.
56 (14): 4849–4859. Kriegel, Hans-Peter, Peer Kröger, Jörg Sander, and Arthur
Kato, Hirokazu, and Mark Billinghurst. 1999. “Marker Tracking Zimek. 2011. “Density-based Clustering.” Wiley Interdisci-
and HMD Calibration for a Video-based Augmented Real- plinary Reviews: Data Mining and Knowledge Discovery 1 (3):
ity Conferencing System.” Proceedings 2nd IEEE and ACM 231–240.
International Workshop on Augmented Reality (IWAR’99), Kristan, Matej, Ales Leonardis, Jiri Matas, Michael Felsberg,
Bellevue, WA, 85–94. IEEE. Roman Pflugfelder, Luka Cehovin Zajc, and Tomas Vojir.
Ke, Yan, and Rahul Sukthankar. 2004. “PCA-SIFT: A More 2018. “The Sixth Visual Object Tracking VOT2018 Chal-
Distinctive Representation for Local Image Descriptors.” lenge Results.” Proceedings of the European Conference on
Proceedings of the 2004 IEEE computer society conference Computer Vision (ECCV), Munich, Germany, 3–53.
on Computer Vision and Pattern Recognition, 2004. CVPR Kristan, Matej, Jiri Matas, Ales Leonardis, Michael Felsberg,
2004, Washington, DC, Vol. 2, II–II. IEEE. Roman Pflugfelder, Joni-Kristian Kamarainen, and Luka
Kendall, Alex, and Roberto Cipolla. 2016. “Modelling Uncer- Cehovin Zajc. 2019. “The Seventh Visual Object Tracking
tainty in Deep Learning for Camera Relocalization.” IEEE VOT2019 Challenge Results.” Proceedings of the IEEE inter-
International Conference on Robotics and Automation national conference on Computer Vision Workshops, Seoul,
(ICRA), Stockholm, Sweden, 4762–4769. IEEE. Korea, 2206–2241.
Kendall, Alex, and Roberto Cipolla. 2017. “Geometric Loss Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012.
Functions for Camera Pose Regression with Deep Learning.” “Imagenet Classification with Deep Convolutional Neural
Proceedings of the IEEE conference on Computer Vision Networks.” Advances in Neural Information Processing Sys-
and Pattern Recognition, Honolulu, HI, 5974–5983. tems, Lake Tahoe, NV, 1097–1105.
Kendall, Alex, Matthew Grimes, and Roberto Cipolla. 2015. Kruppa, Erwin. 1913. “Zur Ermittlung eines Objektes aus
“Posenet: A Convolutional Network for Real-time 6-DOF zwei Perspektiven mit innerer Orientierung.” Sitzungs-
Camera Relocalization.” Proceedings of the IEEE inter- berichte der Mathematisch-Naturwissenschaftlichen Kaiser-
national conference on Computer Vision, Santiago, Chile, lichen Akademie der Wissenschaften 122: 1939–1948.
2938–2946. Kudan Inc. 2020. “Home — Kudan.” https://www.kudan.io/.
Kendall, Alex, Hayk Martirosyan, Saumitro Dasgupta, Peter Kuhn, Harold W. 1955. “The Hungarian Method for the Assign-
Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. ment Problem.” Naval Research Logistics Quarterly 2 (1–2):
2017. “End-to-End Learning of Geometry and Context for 83–97. Accessed 14 May 2020. http://www.bioinfo.org.cn/
Deep Stereo Regression.” Proceedings of the IEEE interna- ∼ dbu/AlgorithmCourses/Lectures/50YearsIP.pdf#page = 46.
tional conference on Computer Vision, Venice, Italy, 66–75. Kuo, Cheng-Hao, and Ram Nevatia. 2011. “How Does Person
Kieritz, Hilke, Wolfgang Hubner, and Michael Arens. 2018. Identity Recognition Help Multi-person Tracking?” CVPR
“Joint Detection and Online Multi-object Tracking.” Pro- 2011, Colorado Springs, CO, 1217–1224. IEEE.
ceedings of the IEEE conference on Computer Vision Kutulakos, Kiriakos N., and James R. Vallino. 1998. “Calibra
and Pattern Recognition Workshops, Salt Lake City, UT, tion-free Augmented Reality.” IEEE Transactions on Visual-
1459–1467. ization and Computer Graphics 4 (1): 1–20.
Kim, Minyoung, Stefano Alletto, and Luca Rigazio. 2016. “Sim- Kuznietsov, Yevhen, Jorg Stuckler, and Bastian Leibe. 2017.
ilarity Mapping with Enhanced Siamese Network for Multi- “Semi-supervised Deep Learning for Monocular Depth
object Tracking.” Preprint arXiv:1609.09156. Map Prediction.” Proceedings of the IEEE Conference on
Kim, Chanran, Younkyoung Lee, Jong-Il Park, and Jaeha Lee. Computer Vision and Pattern Recognition, Honolulu, HI,
2018. “Diminishing Unwanted Objects based on Object 6647–6655.
Detection using Deep Learning and Image Inpainting.” Laina, Iro, Christian Rupprecht, Vasileios Belagiannis, Fed-
International Workshop on Advanced Image Technology erico Tombari, and Nassir Navab. 2016. “Deeper Depth Pre-
(IWAIT), Chiang Mai, Thailand, 1–3. IEEE. diction with Fully Convolutional Residual Networks.” 4th
Kim, Kiyoung, Vincent Lepetit, and Woontack Woo. 2010. international conference on 3D Vision (3DV), Stanford, CA,
“Keyframe-based Modeling and Tracking of Multiple 3D 239–248. IEEE.
Objects.” IEEE international symposium on Mixed and Aug- Lalonde, Jean-François. 2018. “Deep Learning for Aug-
mented Reality, Seoul, Korea, 193–198. IEEE. mented Reality.” 17th Workshop on Information Optics
Kim, Chanho, Fuxin Li, and James M Rehg. 2018. “Multi-object (WIO), Quebec, QC, Canada, 1–3. IEEE.
Tracking with Neural Gating using Bilinear LSTM.” Pro- Lampe, Thomas, and Martin Riedmiller. 2013. “Acquiring
ceedings of the European Conference on Computer Vision Visual Servoing Reaching and Grasping Skills using Neural
(ECCV), Munich, Germany, 200–215. Reinforcement Learning.” International Joint Conference on
Klein, Georg, and David W. Murray. 2006. “Full-3D Edge Neural Networks (IJCNN), Dallas, TX, 1–8. IEEE.
Tracking with a Particle Filter.” BMVC, Edinburgh, UK, Laskar, Zakaria, Iaroslav Melekhov, Surya Kalia, and Juho Kan-
1119–1128. nala. 2017. “Camera Relocalization by Computing Pairwise
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 47

Relative Poses using Convolutional Neural Network.” Pro- Li, Zhengqin, Mohammad Shafiei, Ravi Ramamoorthi, Kalyan
ceedings of the IEEE international conference on Computer Sunkavalli, and Manmohan Chandraker. 2020. “Inverse
Vision Workshops, Venice, Italy, 929–938. Rendering for Complex Indoor Scenes: Shape, Spatially-
Leal-Taixé, Laura, Cristian Canton-Ferrer, and Konrad Schindler. varying Lighting and SVBRDF from a Single Image.” Pro-
2016. “Learning by Tracking: Siamese CNN for Robust Tar- ceedings of the IEEE/CVF conference on Computer Vision
get Association.” Proceedings of the IEEE conference on and Pattern Recognition, Long Beach, CA, 2475–2484.
Computer Vision and Pattern Recognition Workshops, Las Li, Bo, Chunhua Shen, Yuchao Dai, Anton Van Den Hen-
Vegas, NV, 33–40. gel, and Mingyi He. 2015. “Depth and Surface Normal
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Estimation from Monocular Images using Regression on
Haffner. 1998. “Gradient-based Learning Applied to Doc- Deep Features and Hierarchical CRFs.” Proceedings of the
ument Recognition.” Proceedings of the IEEE 86 (11): IEEE conference on Computer Vision and Pattern Recogni-
2278–2324. tion, Boston, MA, 1119–1127.
Lee, Taehee, and Tobias Hollerer. 2008. “Hybrid Feature Track- Li, Bo, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing,
ing and user Interaction for Markerless Augmented Reality.” and Junjie Yan. 2019. “Siamrpn++: Evolution of Siamese
IEEE Virtual Reality Conference, Reno, NV, 145–152. IEEE. Visual Tracking with Very Deep Networks.” Proceedings
Lee, Sangyun, and Euntai Kim. 2018. “Multiple Object Track- of the IEEE conference on Computer Vision and Pattern
ing Via Feature Pyramid Siamese Networks.” IEEE Access 7: Recognition, Long Beach, CA, 4282–4291.
8181–8194. Li, Bo, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. 2018.
Lee, Alex X., Sergey Levine, and Pieter Abbeel. 2017. “Learning “High Performance Visual Tracking with Siamese Region
Visual Servoing with Deep Features and Fitted Q-iteration.” Proposal Network.” Proceedings of the IEEE conference on
Preprint arXiv:1703.11000. Computer Vision and Pattern Recognition, Salt Lake City,
Lee, Jae Yeol, and Guewon Rhee. 2008a. “Context-aware 3D UT, 8971–8980.
Visualization and Collaboration Services for Ubiquitous Li, Jianan, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and
Cars Using Augmented Reality.” The International Journal of Tingfa Xu. 2019. “Layoutgan: Generating Graphic Layouts
Advanced Manufacturing Technology 37 (5–6): 431–442. with Wireframe Discriminators.” Preprint arXiv:1901.06767.
Lee, Jae Yeol, and Guewon Rhee. 2008b. “Context-aware 3D Li, Yehao, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao
Visualization and Collaboration Services for Ubiquitous Mei. 2019. “Pointing Novel Objects in Image Captioning.”
Cars Using Augmented Reality.” The International Journal of Proceedings of the IEEE conference on Computer Vision
Advanced Manufacturing Technology 37 (5–6): 431–442. and Pattern Recognition, Long Beach, CA, 12497–12506.
LeGendre, Chloe, Wan-Chun Ma, Graham Fyffe, John Flynn, Li, Haifeng, Keshu Zhang, and Tao Jiang. 2004. “Minimum
Laurent Charbonnel, Jay Busch, and Paul Debevec. 2019. Entropy Clustering and Applications to Gene Expression
“Deeplight: Learning Illumination for Unconstrained Mobile Analysis.” Proceedings of the IEEE Computational Sys-
Mixed Reality.” Proceedings of the IEEE conference on tems Bioinformatics Conference, CSB 2004, Stanford, CA,
Computer Vision and Pattern Recognition, Long Beach, CA, 142–151. IEEE.
5918–5928. Liang, Yiming, and Yue Zhou. 2018. “LSTM Multiple Object
Lepetit, Vincent, and Pascal Fua. 2005. “Monocular Model- Tracker Combining Multiple Cues.” 25th IEEE International
based 3D Tracking of Rigid Objects.” Foundations and Conference on Image Processing (ICIP), Athens, Greece,
Trends in Computer Graphics and Vision 1 (1): 1–89. 2351–2355. IEEE.
doi:10.1561/0600000001. Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He,
Lepetit, Vincent, and Pascal Fua. 2006. “Keypoint Recogni- Bharath Hariharan, and Serge Belongie. 2017. “Feature Pyra-
tion Using Randomized Trees.” IEEE Transactions on Pattern mid Networks for Object Detection.” Proceedings of the
Analysis and Machine Intelligence 28 (9): 1465–1479. IEEE conference on Computer Vision and Pattern Recogni-
Lepetit, Vincent, Luca Vacchetti, Daniel Thalmann, and Pas- tion, Honolulu, HI, 2117–2125.
cal Fua. 2003. “Fully Automated and Stable Registration for Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian
Augmented Reality Applications.” Proceedings of the 2nd Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.
IEEE and ACM international symposium on Mixed and 2016. “SSD: Single Shot multibox Detector.” European con-
Augmented Reality, Tokyo, Japan, 93–102. IEEE. ference on Computer Vision, Amsterdam, The Netherlands,
Li, Yang, Julien Amelot, Xin Zhou, Samy Bengio, and Si 21–37. Springer.
Si. 2020. “Auto Completion of User Interface Layout Liu, Jingshu, and Yuan Li. 2019. “An Image Based Visual Servo
Design Using Transformer-Based Tree Decoders.” Preprint Approach with Deep Learning for Robotic Manipulation.”
arXiv:2001.05308. Preprint arXiv:1909.07727.
Li, Yuanyuan, Stefano Carabelli, Edoardo Fadda, Daniele Liu, Daquan, Chengjiang Long, Hongpan Zhang, Hanning
Manerba, Roberto Tadei, and Olivier Terzo. 2020. “Machine Yu, Xinzhi Dong, and Chunxia Xiao. 2020. “ARShadow-
Learning and Optimization for Production Rescheduling in GAN: Shadow Generative Adversarial Network for Aug-
Industry 4.0.” The International Journal of Advanced Manu- mented Reality in Single Light Scenes.” Proceedings of the
facturing Technology 110: 2445–2463. IEEE/CVF conference on Computer Vision and Pattern
Li, Mengtian, Zhe Lin, Radomir Mech, Ersin Yumer, and Deva Recognition, Long Beach, CA, 8139–8148.
Ramanan. 2019. “Photo-sketching: Inferring Contour Draw- Liu, Shu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018.
ings from Images.” IEEE Winter conference on Applications “Path Aggregation Network for Instance Segmentation.”
of Computer Vision (WACV), Waikoloa Village, Hawaii, Proceedings of the IEEE conference on Computer Vision
1403–1412. IEEE. and Pattern Recognition, Salt Lake City, UT, 8759–8768.
48 C. K. SAHU ET AL.

Liu, Wei, Andrew Rabinovich, and Alexander C. Berg. MacIntyre, Blair, Enylton Machado Coelho, and Simon J.
2015. “Parsenet: Looking Wider to See Better.” Preprint Julier. 2002. “Estimating and Adapting to Registration Errors
arXiv:1506.04579. in Augmented Reality Systems.” Proceedings IEEE Virtual
Liu, Ying, Rahul Rai, Anurag Purwar, Bin He, and Mahesh Reality 2002, Orlando, FL, 73–80. IEEE.
Mani. 2020. “Machine Learning Applications in Manufac- MacKay, David J. C. 1992. “A Practical Bayesian Framework
turing.” Journal of Computing and Information Science in for Backpropagation Networks.” Neural Computation 4 (3):
Engineering 20 (2): 020301. doi:10.1115/1.4046427. 448–472.
Livingston, Mark A., Lawrence J. Rosenblum, Dennis G. Maidi, Madjid, Jean-Yves Didier, Fakhreddine Ababsa, and
Brown, Gregory S. Schmidt, Simon J. Julier, Yohan Baillot, J. Malik Mallem. 2010. “A Performance Study for Camera Pose
Edward Swan, Zhuming Ai, and Paul Maassel. 2011. “Mil- Estimation Using Visual Marker Based Tracking.” Machine
itary Applications of Augmented Reality.” In Handbook of Vision and Applications21 (3): 365–376.
Augmented Reality edited by Borko Furht, 671–706. New Makris, Sotiris, Panagiotis Karagiannis, Spyridon Koukas, and
York, NY: Springer. Aleksandros-Stereos Matthaiakis. 2016. “Augmented Reality
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. System for Operator Support in Human–Robot Collabora-
“Fully Convolutional Networks for Semantic Segmentation.” tive Assembly.” CIRP Annals 65 (1): 61–64.
Proceedings of the IEEE conference on Computer Vision Malek, S., N. Zenati-Henda, M. Belhocine, and S. Benbelkacem.
and Pattern Recognition, Boston, MA, 3431–3440. 2008. “Calibration Method for An Augmented Reality Sys-
Lopez, Manuel, Roger Mari, Pau Gargallo, Yubin Kuang, Javier tem.” World Academy of Science, Engineering and Technology
Gonzalez-Jimenez, and Gloria Haro. 2019. “Deep Single 45: 309–314.
Image Camera Calibration with Radial Distortion.” Pro- Mancini, Michele, Gabriele Costante, Paolo Valigi, and Thomas
ceedings of the IEEE conference on Computer Vision and A. Ciarfuglia. 2016. “Fast Robust Monocular Depth Esti-
Pattern Recognition, Long Beach, CA, 11817–11825. mation for Obstacle Detection with Fully Convolutional
Lowe, David G. 1991. “Fitting Parameterized Three-dimensio Networks.” IEEE/RSJ international conference on Intelligent
nal Models to Images.” IEEE Transactions on Pattern Analy- Robots and Systems (IROS), Daejeon, Korea, 4296–4303.
sis & Machine Intelligence 13 (5): 441–450. IEEE.
Lowe, David G. 1999. “Object Recognition from Local Scale- Mao, Junhua, Xu Wei, Yi Yang, Jiang Wang, Zhiheng Huang,
invariant Features.” Proceedings of the 7th IEEE interna- and Alan L. Yuille. 2015. “Learning Like a Child: Fast Novel
tional conference on Computer Vision, Kerkyra, Greece, Vol. Visual Concept Learning from Sentence Descriptions of
2, 1150–1157. IEEE. Images.” Proceedings of the IEEE international conference
Lowe, David G. 2004. “Distinctive Image Features From Scale- on Computer Vision, Santiago, Chile, 2533–2541.
invariant Keypoints.” International Journal of Computer Marchand, Eric, Patrick Bouthemy, and François Chaumette.
Vision 60 (2): 91–110. 2001. “A 2D–3D Model-based Approach to Real-time Visual
Lu, C.-P., Gregory D. Hager, and Eric Mjolsness. 2000. “Fast and Tracking.” Image and Vision Computing 19 (13): 941–
Globally Convergent Pose Estimation From Video Images.” 955.
IEEE Transactions on Pattern Analysis and Machine Intelli- Marchand, Éric, and François Chaumette. 2002. “Virtual Visual
gence 22 (6): 610–622. Servoing: A Framework for Real-Time Augmented Reality.”
Lu, Boun Vinh, Tetsuya Kakuta, Rei Kawakami, Takeshi Computer Graphics Forum 21: 289–297.
Oishi, and Katsushi Ikeuchi. 2010. “Foreground and Shadow Marchand, Éric, and François Chaumette. 2005. “Feature
Occlusion Handling for Outdoor Augmented Reality.” IEEE Tracking for Visual Servoing Purposes.” Robotics and
international symposium on Mixed and Augmented Real- Autonomous Systems 52 (1): 53–70.
ity, Seoul, Korea, 109–118. IEEE. Marques, Bruno Augusto Dorta, Esteban Walter Gonzalez
Lu, Yongyi, Cewu Lu, and Chi-Keung Tang. 2017. “Online Clua, and Cristina Nader Vasconcelos. 2018. “Deep Spher-
Video Object Detection using Association LSTM.” Proceed- ical Harmonics Light Probe Estimator for Mixed Reality
ings of the IEEE international conference on Computer Games.” Computers & Graphics76: 96–106.
Vision, Venice, Italy, 2344–2352. Marques, Bruno A. D., Rafael Rego Drumond, Cristina
Luc, Pauline, Camille Couprie, Soumith Chintala, and Jakob Nader Vasconcelos, and Esteban Clua. 2018. “Deep Light
Verbeek. 2016. “Semantic Segmentation using Adversarial Source Estimation for Mixed Reality.” VISIGRAPP (1:
Networks.” Preprint arXiv:1611.08408. GRAPP), Funchal, Madeira, Portugal, 303–311.
Lucas, Bruce D., and Takeo Kanade. 1981. “An Iterative Martinetti, Alberto, Henrique Costa Marques, Sarbjeet Singh,
Image Registration Technique with an Application to Stereo and Leo van Dongen. 2019. “Reflections on the Limited
Vision.” Proceedings of the 7th international joint confer- Pervasiveness of Augmented Reality in Industrial Sectors.”
ence on Artificial Intelligence, Vancouver, BC, Canada, Vol. Applied Sciences 9 (16): 3382.
2, 674–679. Morgan Kaufmann Publishers. Masood, Tariq, and Johannes Egger. 2020. “Adopting Aug-
Lytridis, Chris, Avgoustos Tsinakos, and Ioannis Kazanidis. mented Reality in the Age of Industrial Digitalisation.” Com-
2018. “ARTutor–An Augmented Reality Platform for Inter- puters in Industry 115: 103112.
active Distance Learning.” Education Sciences 8 (1): 6. Matei, Ion, Johan de Kleer, Alexander Feldman, Rahul Rai, and
MacIntyre, Blair, and E. Machado Coelho. 2000. “Adapting to Souma Chowdhury. 2020. “Hybrid Modeling: Applications
Dynamic Registration Errors using Level of Error (LOE) Fil- in Real-time Diagnosis.” Preprint arXiv:2003.02671.
tering.” Proceedings IEEE and ACM International Sympo- Meka, Abhimitra, Maxim Maximov, Michael Zollhoefer,
sium on Augmented Reality (ISAR 2000), Munich, Germany, Avishek Chatterjee, Hans-Peter Seidel, Christian Richardt,
85–88. IEEE. and Christian Theobalt. 2018. “Lime: Live Intrinsic Material
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 49

Estimation.” Proceedings of the IEEE Conference on Com- Naimark, Leonid, and Eric Foxlin. 2002. “Circular Data Matrix
puter Vision and Pattern Recognition, Salt Lake City, UT, Fiducial System and Robust Image Processing for a Wear-
6315–6324. able Vision-inertial Self-tracker.” Proceedings. international
Melekhov, Iaroslav, Juha Ylioinas, Juho Kannala, and Esa symposium on Mixed and Augmented Reality, Darmstadt,
Rahtu. 2017. “Image-based Localization using Hourglass Germany, 27–36. IEEE.
Networks.” Proceedings of the IEEE international con- Nair, Vinod, and Geoffrey E. Hinton. 2010. “Rectified Lin-
ference on Computer Vision Workshops, Venice, Italy, ear Units Improve Restricted Boltzmann Machines.” Pro-
879–886. ceedings of the 27th International Conference on Machine
Mengoni, Maura, Matteo Iualè, Margherita Peruzzini, and Learning (ICML-10), Haifa, Israel, 807–814.
Michele Germani. 2015. “An Adaptable AR User Interface Najibi, Mahyar, Mohammad Rastegari, and Larry S. Davis.
to Face the Challenge of Ageing Workers in Manufacturing.” 2016. “G-CNN: An Iterative Grid based Object Detector.”
International conference on Human Aspects of IT for the Proceedings of the IEEE conference on Computer Vision
Aged Population, Los Angeles, CA, 311–323. Springer. and Pattern Recognition, Las Vegas, NV, 2369–2377.
Milan, Anton, S. Hamid Rezatofighi, Anthony Dick, Ian Reid, Nam, Hyeonseob, and Bohyung Han. 2016. “Learning Multi-
and Konrad Schindler. 2017. “Online Multi-target Track- domain Convolutional Neural Networks for Visual Track-
ing using Recurrent Neural Networks.” 31st AAAI confer- ing.” Proceedings of the IEEE conference on Computer
ence on Artificial Intelligence, San Francisco, CA, 4225– Vision and Pattern Recognition, Las Vegas, NV, 4293–4302.
4232. Naseer, Tayyab, and Wolfram Burgard. 2017. “Deep Regression
Min, Erxue, Xifeng Guo, Qiang Liu, Gen Zhang, Jianjing Cui, for Monocular Camera-based 6-DOF Global Localization in
and Jun Long. 2018. “A Survey of Clustering with Deep Outdoor Environments.” IEEE/RSJ international conference
Learning: From the Perspective of Network Architecture.” on Intelligent Robots and Systems (IROS), Vancouver, BC,
IEEE Access 6: 39501–39514. Canada, 1525–1530. IEEE.
Mohring, Mathias, Christian Lessig, and Oliver Bimber. 2004. Neges, Matthias, Mario Wolf, and Michael Abramovici. 2015.
“Video See-through AR on Consumer Cell-phones.” 3rd “Secure Access Augmented Reality Solution for Mobile
IEEE and ACM international symposium on Mixed and Maintenance Support Utilizing Condition-oriented Work
Augmented Reality, Arlington, VA, 252–253. IEEE. Instructions.” Procedia CIRP 38: 58–62.
Morel, Jean-Michel, and Guoshen Yu. 2009. “ASIFT: A New Neumann, Ulrich, and Youngkwan Cho. 1996. “A Self-tracking
Framework for Fully Affine Invariant Image Comparison.” Augmented Reality System.” Proceedings of the ACM sym-
SIAM Journal on Imaging Sciences 2 (2): 438–469. posium on Virtual Reality Software and Technology, Hong
Moreno-Noguer, Francesc, Vincent Lepetit, and Pascal Fua. Kong, 109–115.
2007. “Accurate Non-Iterative O(n) Solution to the PnP Neumann, Ulrich, and Suya You. 1998. “Integration of Region
Problem.” IEEE 11th international conference on Computer Tracking and Optical Flow for Image Motion Estima-
Vision, Rio de Janeiro, Brazil, 1–8. IEEE. tion.” Proceedings 1998 International Conference on Image
Mortensen, Eric N., Hongli Deng, and Linda Shapiro. 2005. “A Processing, ICIP98 (Cat. No. 98CB36269), Chicago, IL,
SIFT Descriptor with Global Context.” IEEE computer soci- 658–662. IEEE.
ety conference on Computer Vision and Pattern Recognition Newcombe, Richard A., and Andrew J. Davison. 2010. “Live
(CVPR’05), San Diego, CA, Vol. 1, 184–190. IEEE. Dense Reconstruction with a Single Moving Camera.” IEEE
Motoyama, Yuichi, Kazuyo Iwamoto, Hitoshi Tokunaga, and computer society conference on Computer Vision and Pat-
Toshimitsu Okane. 2020. “Measuring Hand-pouring Motion tern Recognition, San Francisco, CA, 1498–1505. IEEE.
in Casting Process Using Augmented Reality Marker Track- Nguyen, Tam, Phong Vu, Hung Pham, and Tung Nguyen.
ing.” The International Journal of Advanced Manufacturing 2018. “Deep Learning UI Design Patterns of Mobile Apps.”
Technology 106 (11): 5333–5343. IEEE/ACM 40th International Conference on Software
Mourtzis, Dimitris, Vasilios Zogopoulos, and Fotini Xan- Engineering: New Ideas and Emerging Technologies Results
thi. 2019. “Augmented Reality Application to Support the (ICSE-NIER), Gothenburg, Sweden, 65–68. IEEE.
Assembly of Highly Customized Products and to Adapt Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. 2015.
to Production Re-scheduling.” The International Journal “Learning Deconvolution Network for Semantic Segmenta-
of Advanced Manufacturing Technology 105 (9): 3899– tion.” Proceedings of the IEEE international conference on
3910. Computer Vision, Santiago, Chile, 1520–1528.
Munir, Kamran, and M. Sheraz Anjum. 2018. “The Use of Noy, Natalya F., and Deborah L. McGuinness. 2000. “What is
Ontologies for Effective Knowledge Modelling and Informa- an Ontology and Why We Need It.” Accessed 10 May 2020.
tion Retrieval.” Applied Computing and Informatics 14 (2): https://protege.stanford.edu/publications/ontology_
116–126. development/ontology101-noy-mcguinness.html.
Mur-Artal, Raul, Jose Maria Martinez Montiel, and Juan D. Tar- Oberkampf, Denis, Daniel F. DeMenthon, and Larry S. Davis.
dos. 2015. “ORB-SLAM: A Versatile and Accurate Monoc- 1996. “Iterative Pose Estimation Using Coplanar Feature
ular SLAM System.” IEEE Transactions on Robotics 31 (5): Points.” Computer Vision and Image Understanding 63 (3):
1147–1163. 495–511.
Naik, Hemal, Federico Tombari, Christoph Resch, Peter Keitler, O’Mahony, Niall, Sean Campbell, Anderson Carvalho, Suman
and Nassir Navab. 2015. “A Step Closer To Reality: Closed Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova,
Loop Dynamic Registration Correction in SAR.” IEEE Daniel Riordan, and Joseph Walsh. 2019. “Deep Learning
international symposium on Mixed and Augmented Real- vs. Traditional Computer Vision.” Science and Information
ity, Fukuoka, Japan, 112–115. IEEE. Conference, Las Vegas, NV, 128–144. Springer.
50 C. K. SAHU ET AL.

Ong, S. K., M. L. Yuan, and A. Y. C. Nee. 2008. “Aug- Peniche, Amaury, Christian Diaz, Helmuth Trefftz, and Gabriel
mented Reality Applications in Manufacturing: A Survey.” Paramo. 2012. “Combining Virtual and Augmented Real-
International Journal of Production Research 46 (10): 2707– ity to Improve the Mechanical Assembly Training Process
2742. in Manufacturing.” Proceedings of the 6th WSEAS inter-
Osti, Francesco, Alessandro Ceruti, Alfredo Liverani, and national conference on Computer Engineering and Appli-
Gianni Caligiana. 2017. “Semi-automatic Design for Dis- cations, and Proceedings of the 2012 American conference
assembly Strategy Planning: An Augmented Reality Appro on Applied Mathematics, Cambridge, MA, 292–297. World
ach.” Procedia Manufacturing 11: 1481–1488. Scientific and Engineering Academy and Society (WSEAS).
Ozuysal, Mustafa, Michael Calonder, Vincent Lepetit, and Pas- Piekarski, Wayne, and Bruce Thomas. 2002. “ARQuake: The
cal Fua. 2009. “Fast Keypoint Recognition Using Random Outdoor Augmented Reality Gaming System.” Communica-
Ferns.” IEEE Transactions on Pattern Analysis and Machine tions of the ACM 45 (1): 36–38.
Intelligence 32 (3): 448–461. Pilet, Julien, and Hideo Saito. 2010. “Virtually Augmenting
Ozuysal, Mustafa, Pascal Fua, and Vincent Lepetit. 2007. “Fast Hundreds of Real Pictures: An Approach based on Learning,
Keypoint Recognition in Ten Lines of Code.” IEEE confer- Retrieval, and Tracking.” IEEE Virtual Reality Conference
ence on Computer Vision and Pattern Recognition, Min- (VR), Waltham, MA, 71–78. IEEE.
neapolis, MN, 1–8. IEEE. Pinheiro, Pedro H. O., and Ronan Collobert. 2014. “Recur-
Palmer, Claire, Zahid Usman, Osiris Canciglieri Junior, rent Convolutional Neural Networks for Scene Label-
Andreia Malucelli, and Robert I. M. Young. 2018. “Inter- ing.” 31st International Conference on Machine Learning
operable Manufacturing Knowledge Systems.” International (ICML), Beijing, China, Vol I, 82–90.
Journal of Production Research 56 (8): 2733–2752. Pinheiro, Pedro O. O., Ronan Collobert, and Piotr Dollár. 2015.
Panagopoulos, Alexandros, Chaohui Wang, Dimitris Sama- “Learning to Segment Object Candidates.” Advances in
ras, and Nikos Paragios. 2011. “Illumination Estimation and Neural Information Processing Systems, Montreal, Canada,
Cast Shadow Detection through a Higher-order Graphi- 1990–1998.
cal Model.” CVPR 2011, Colorado Springs, CO, 673–680. Pinheiro, Pedro O., Tsung-Yi Lin, Ronan Collobert, and Piotr
IEEE. Dollár. 2016. “Learning to Refine Object Segments.” Euro-
Pang, Yanwei, Wei Li, Yuan Yuan, and Jing Pan. 2012. “Fully pean conference on Computer Vision, Amsterdam, The
Affine Invariant SURF for Image Matching.” Neurocomput- Netherlands, 75–91. Springer.
ing 85: 6–10. Pollefeys, Marc, Reinhard Koch, and Luc Van Gool. 1999. “Self-
Pang, Y., M. L. Yuan, Andrew Y. C. Nee, Soh-Khim Ong, calibration and Metric Reconstruction Inspite of Varying
and Kamal Youcef-Toumi. 2006. “A Markerless Registra- and Unknown Intrinsic Camera Parameters.” International
tion Method for Augmented Reality based on Affine Prop- Journal of Computer Vision 32 (1): 7–25.
erties.” Proceedings of the 7th Australasian User Interface Pontes, Jhony K., Chen Kong, Sridha Sridharan, Simon Lucey,
Conference, Hobart, Tasmania, Vol. 50, 25–32. Citeseer. Anders Eriksson, and Clinton Fookes. 2018. “Image2Mesh:
Panin, Giorgio, and Alois Knoll. 2008. “Mutual Information- A Learning Framework for Single Image 3D Reconstruc-
based 3D Object Tracking.” International Journal of Com- tion.” Asian Conference on Computer Vision, Perth, WA,
puter Vision 78 (1): 107–118. Australia, 365–381. Springer.
Park, Kyeong-Beom, Minseok Kim, Sung Ho Choi, and Jae Porter, Michael E., and James E. Heppelmann. 2017. “A Man-
Yeol Lee. 2020. “Deep Learning-based Smart Task Assistance ager’s Guide to Augmented Reality.” November–December.
in Wearable Augmented Reality.” Robotics and Computer- Accessed 27 May 2020. https://hbr.org/2017/11/a-managers-
Integrated Manufacturing63: 101887. guide-to-augmented-reality.
Park, Youngmin, Vincent Lepetit, and Woontack Woo. 2008. Pressigout, Muriel, and Eric Marchand. 2004. “Model-free Aug-
“Multiple 3D Object Tracking for Augmented Reality.” 7th mented Reality by Virtual Visual Servoing.” Proceedings of
IEEE/ACM international symposium on Mixed and Aug- the 17th International Conference on Pattern Recognition,
mented Reality, Cambridge, UK, 117–120. IEEE. ICPR 2004, Cambridge, England, UK, Vol. 2, 887–890. IEEE.
Park, Joonsuk, and Jun Park. 2010. “3DOF Tracking Accu- PTC. 2020. “Vuforia: Market-Leading Enterprise AR — PTC.”
racy Improvement for Outdoor Augmented Reality.” IEEE Accessed 19 May 2020. https://www.ptc.com/en/products/
international symposium on Mixed and Augmented Real- augmented-reality/vuforia.
ity, Seoul, Korea, 263–264. IEEE. Pupilli, Mark, and Andrew Calway. 2006. “Real-time Cam-
Park, Jun, Suya You, and Ulrich Neumann. 1999. “Natural Fea- era Tracking Using Known 3D Models and a Particle Fil-
ture Tracking for Extendible Robust Augmented Realities.” ter.” 18th International Conference on Pattern Recognition
Proceedings of the international workshop on Augmented (ICPR’06), Hong Kong, Vol. 1, 199–203. IEEE.
Reality: Placing Artificial Objects in Real Scenes: Placing Qin, S. Joe, and Leo H. Chiang. 2019. “Advances and Oppor-
Artificial Objects in Real Scenes, Bellevue, WA, 209–217. AK tunities in Machine Learning for Process Data Analytics.”
Peters, Ltd. Computers & Chemical Engineering 126: 465–473.
Paszke, Adam, Abhishek Chaurasia, Sangpil Kim, and Eugenio Quandt, Moritz, Benjamin Knoke, Christian Gorldt, Michael
Culurciello. 2016. “Enet: A Deep Neural Network Archi- Freitag, and Klaus-Dieter Thoben. 2018. “General Require-
tecture for Real-time Semantic Segmentation.” Preprint ments for Industrial Augmented Reality Applications.” Pro-
arXiv:1606.02147. cedia Cirp 72 (1): 1130–1135.
Patel, Maitri, Paresh Virparia, and Dharmendra Patel. 2012. Radu, Iulian. 2012. “Why Should My Students Use AR? A Com-
“Web Based Fuzzy Expert System and its Applications: A parative Review of the Educational Impacts of Augmented-
Survey.” International Journal of Applied Information Systems reality.” IEEE International Symposium on Mixed and Aug-
1 (7): 11–15. mented Reality (ISMAR), Atlanta, GA, 313–314. IEEE.
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 51

Radwan, Noha, Abhinav Valada, and Wolfram Burgard. 2018. Conference on Computer Vision (ECCV), Munich, Ger-
“Vlocnet++: Deep Multitask Learning for Semantic Visual many, 586–602.
Localization and Odometry.” IEEE Robotics and Automation Rentzos, Loukas, Stergios Papanastasiou, Nikolaos Papakostas,
Letters 3 (4): 4407–4414. and George Chryssolouris. 2013. “Augmented Reality for
Rai, Rahul, and Akshay V. Deshpande. 2016. “Fragmentary Human-based Assembly: Using Product and Process Seman
Shape Recognition: A BCI Study.” Computer-Aided Design tics.” IFAC Proceedings Volumes 46 (15): 98–101.
71: 51–64. Riaz Muhammad, Umar, Yongxin Yang, Yi-Zhe Song, Tao
Rai, Rahul, and Chandan K. Sahu. 2020. “Driven by Data Xiang, and Timothy M. Hospedales. 2018. “Learning Deep
Or Derived Through Physics? A Review of Hybrid Physics Sketch Abstraction.” Proceedings of the IEEE conference on
Guided Machine Learning Techniques With Cyber-Physical Computer Vision and Pattern Recognition, Salt Lake City,
System (CPS) Focus.” IEEE Access 8: 71050–71073. UT, 8014–8023.
Rao, Jinmeng, Yanjun Qiao, Fu Ren, Junxing Wang, and Rogério, Richa, Raphael Sznitman, Russell Taylor, and Gregory
Qingyun Du. 2017. “A Mobile Outdoor Augmented Real- Hager. 2011. “Visual Tracking using the Sum of Conditional
ity Method Combining Deep Learning Object Detection and Variance.” IEEE/RSJ international conference on Intelligent
Spatial Relationships for Geovisualization.” Sensors 17 (9): Robots and Systems, San Francisco, CA, 2953–2958. IEEE.
1951. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015.
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali “U-net: Convolutional Networks for Biomedical Image Seg-
Farhadi. 2016. “You Only Look Once: Unified, Real-time mentation.” International conference on Medical Image
Object Detection.” Proceedings of the IEEE conference on Computing and Computer-assisted Intervention, Munich,
Computer Vision and Pattern recognition, Las Vegas, NV, Germany, 234–241. Springer.
779–788. Rosello, Pol, and Mykel J. Kochenderfer. 2018. “Multi-
Redmon, Joseph, and Ali Farhadi. 2017. “YOLO9000: Better, agent Reinforcement Learning for Multi-object Track-
Faster, Stronger.” Proceedings of the IEEE conference on ing.” Proceedings of the 17th international conference on
Computer Vision and Pattern Recognition, Honolulu, HI, Autonomous Agents and MultiAgent Systems, Munich, Ger-
7263–7271. many, 1397–1404. International Foundation for Autonomous
Redmon, Joseph, and Ali Farhadi. 2018. “YOLOv3: An Incre- Agents and Multiagent Systems.
mental Improvement.” Preprint arXiv:1804.02767. Rousseeuw, Peter J., and Annick M. Leroy. 2005. Robust Regres-
Regenbrecht, Holger, Gregory Baratoff, and Wilhelm Wilke. sion and Outlier Detection. Vol. 589. Hoboken, NJ: John
2005. “Augmented Reality Projects in the Automotive and Wiley & Sons. doi:10.1002/0471725382.
Aerospace Industries.” IEEE Computer Graphics and Appli- Roy, Anirban, and Sinisa Todorovic. 2016. “A Multi-scale CNN
cations 25 (6): 48–56. for Affordance Segmentation in RGB Images.” European
Reid, Donald. 1979. “An Algorithm for Tracking Multiple Conference on Computer Vision, Amsterdam, The Nether-
Targets.” IEEE Transactions on Automatic Control 24 (6): lands, 186–201. Springer.
843–854. Rüßmann, Michael, Markus Lorenz, Philipp Gerbert, Manuela
Reitinger, Bernhard, Christopher Zach, and Dieter Schmal- Waldner, Jan Justus, Pascal Engel, and Michael Harnisch.
stieg. 2007. “Augmented Reality Scouting for Interactive 3D 2015. “Industry 4.0: The Future of Productivity and Growth
Reconstruction.” IEEE Virtual Reality Conference, Char- in Manufacturing Industries.” Boston Consulting Group 9
lotte, NC, 219–222. IEEE. (1): 54–89. Accessed 14 May 2020. https://www.bcg.com/
Reitmayr, Gerhard, and Tom W. Drummond. 2006. “Going publications/2015/engineered_products_project_business_
Out: Robust Model-based Tracking for Outdoor Augmented industry_4_future_productivity_growth_manufacturing_
Reality.” IEEE/ACM international symposium on Mixed and industries.
Augmented Reality, Santa Barbara, CA, 109–118. IEEE. Sadeghian, Amir, Alexandre Alahi, and Silvio Savarese. 2017.
Reitmayr, Gerhard, Ethan Eade, and Tom W. Drummond. 2007. “Tracking the Untrackable: Learning to Track Multiple Cues
“Semi-automatic Annotations in Unknown Environments.” with Long-term Dependencies.” Proceedings of the IEEE
6th IEEE and ACM international symposium on Mixed and international conference on Computer Vision, Venice, Italy,
Augmented Reality, Nara, Japan, 67–70. IEEE. 300–311.
Reitmayr, Gerhard, Tobias Langlotz, Daniel Wagner, Alessan- Sampedro, Carlos, Alejandro Rodriguez-Ramos, Ignacio Gil,
dro Mulloni, Gerhard Schall, Dieter Schmalstieg, and Qi Luis Mejias, and Pascual Campoy. 2018. “Image-Based
Pan. 2010. “Simultaneous Localization and Mapping for Visual Servoing Controller for Multirotor Aerial Robots
Augmented Reality.” International symposium on Ubiqui- Using Deep Reinforcement Learning.” IEEE/RSJ inter-
tous Virtual Reality, Gwangji, South Korea, 5–8. IEEE. national conference on Intelligent Robots and Systems
Reitmayr, Gerhard, and Dieter Schmalstieg. 2003. “Data Man- (IROS), Madrid, Spain, 979–986. IEEE.
agement Strategies for Mobile Augmented Reality.” Proceed- Sanches, Silvio R. R., Daniel M. Tokunaga, Valdinei F. Silva,
ings of the international workshop on Software Technology Antonio C. Sementille, and Romero Tori. 2012. “Mutual
for Augmented Reality Systems, Tokyo, Japan, 47–52. Occlusion between Real and Virtual Elements in Augmented
Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2015. Reality based on Fiducial Markers.” IEEE Workshop on the
“Faster R-CNN: Towards Real-time Object Detection with Applications of Computer Vision (WACV), Breckenridge,
Region Proposal Networks.” Advances in Neural Informa- CO, 49–54. IEEE.
tion Processing Systems, Montreal, Canada, 91–99. Sanderson, David, Jack C. Chaplin, and Svetan Ratchev. 2019.
Ren, Liangliang, Jiwen Lu, Zifeng Wang, Qi Tian, and Jie “A Function-Behaviour-Structure Design Methodology for
Zhou. 2018. “Collaborative Deep Reinforcement Learning Adaptive Production Systems.” The International Journal of
for Multi-object Tracking.” Proceedings of the European Advanced Manufacturing Technology 105 (9): 3731–3742.
52 C. K. SAHU ET AL.

Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Setti, Amedeo, Paolo Bosetti, and Matteo Ragni. 2016.
Zhmoginov, and Liang-Chieh Chen. 2018. “MobileNetV2: “ARTool-Augmented Reality Platform for Machining Setup
Inverted Residuals and Linear Bottlenecks.” Proceedings and Maintenance.” Proceedings of SAI Intelligent Systems
of the IEEE conference on Computer Vision and Pattern Conference, London, UK, 457–475. Springer.
Recognition, Salt Lake City, UT, 4510–4520. Shen, Zhiqiang, Zhuang Liu, Jianguo Li, Yu-Gang Jiang,
Sanna, Andrea, Federico Manuri, Fabrizio Lamberti, Gian- Yurong Chen, and Xiangyang Xue. 2017. “DSOD: Learning
luca Paravati, and Pietro Pezzolla. 2015. “Using Handheld Deeply Supervised Object Detectors from Scratch.” Pro-
Devices to Support Augmented Reality-based Maintenance ceedings of the IEEE international conference on Computer
and Assembly Tasks.” IEEE International Conference on Vision, Venice, Italy, 1919–1927.
Consumer Electronics (ICCE), Las Vegas, NV, 178–179. Shim, Hyunjung. 2012. “Faces As Light Probes for Relighting.”
IEEE. Optical Engineering 51 (7): 077002.
Sarlin, Paul-Edouard, Cesar Cadena, Roland Siegwart, and Siew, C. Y., S. K. Ong, and A. Y. C. Nee. 2019. “A Practical Aug-
Marcin Dymczyk. 2019. “From Coarse to Fine: Robust Hier- mented Reality-assisted Maintenance System Framework for
archical Localization at Large Scale.” Proceedings of the Adaptive User Support.” Robotics and Computer-Integrated
IEEE conference on Computer Vision and Pattern Recogni- Manufacturing 59: 115–129.
tion, Long Beach, CA, 12716–12725. Silberman, Nathan, Derek Hoiem, Pushmeet Kohli, and Rob
Sato, Imari, Yoichi Sato, and Katsushi Ikeuchi. 2003. “Illumina- Fergus. 2012. “Indoor Segmentation and Support Inference
tion From Shadows.” IEEE Transactions on Pattern Analysis from RGBD Images.” European conference on Computer
and Machine Intelligence 25 (3): 290–300. Vision, Firenze, Italy, 746–760. Springer.
Sattler, Torsten, Qunjie Zhou, Marc Pollefeys, and Laura Leal- Simões, Francisco Paulo Magalhãws, Rafael Alves Roberto,
Taixe. 2019. “Understanding the Limitations of CNN-based Lucas Silva Figueiredo, João Paulo Silva do Monte Lima,
Absolute Camera Pose Regression.” Proceedings of the Mozart William Almeida, and Veronica Teichrieb. 2013. “3D
IEEE conference on Computer Vision and Pattern Recogni- Tracking in Industrial Scenarios: A Case Study at the ISMAR
tion, Long Beach, CA, 3302–3312. Tracking Competition.” XV symposium on Virtual and Aug-
Saxena, Aseem, Harit Pandya, Gourav Kumar, Ayush Gaud, and mented Reality, Cuiaba, Brazil, 97–106. IEEE.
K. Madhava Krishna. 2017. “Exploring Convolutional Net- Simon, Gilles. 2011. “Tracking-by-Synthesis using Point Fea-
works for End-to-End Visual Servoing.” IEEE International tures and Pyramidal Blurring.” 10th IEEE international sym-
Conference on Robotics and Automation (ICRA), Mariana posium on Mixed and Augmented Reality, Basel, Switzer-
Bay Sands, Singapore, 3817–3823. IEEE. land, 85–92. IEEE.
Schall, Gerhard, Helmut Grabner, Michael Grabner, Paul Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep
Wohlhart, Dieter Schmalstieg, and Horst Bischof. 2008. “3D Convolutional Networks for Large-scale Image Recogni-
Tracking in Unknown Environments using On-line Key- tion.” Preprint arXiv:1409.1556.
point Learning for Mobile Augmented Reality.” IEEE com- Skrypnyk, Iryna, and David G. Lowe. 2004. “Scene Modelling,
puter society conference on Computer Vision and Pattern Recognition and Tracking with Invariant Image Features.”
Recognition Workshops, Anchorage, AK, 1–8. IEEE. 3rd IEEE and ACM international symposium on Mixed and
Schmalstieg, Dieter, and Daniel Wagner. 2007. “Experiences Augmented Reality, Arlington, VA, 110–119. IEEE.
with Handheld Augmented Reality.” 6th IEEE and ACM Song, Shuran, and Thomas Funkhouser. 2019. “Neural Illu-
international symposium on Mixed and Augmented Real- mination: Lighting Prediction for Indoor Environments.”
ity, Nara, Japan, 3–18. IEEE. Proceedings of the IEEE conference on Computer Vision
Scholz, Joachim, and Andrew N. Smith. 2016. “Augmented and Pattern Recognition, Long Beach, CA, 6918–6926.
Reality: Designing Immersive Experiences that Maximize Song, Jifei, Kaiyue Pang, Yi-Zhe Song, Tao Xiang, and Timo-
Consumer Engagement.” Business Horizons 59 (2): 149– thy M. Hospedales. 2018. “Learning to Sketch with Shortcut
161. Cycle Consistency.” Proceedings of the IEEE conference on
Schwerdtfeger, Björn, Daniel Pustka, Andreas Hofhauser, and Computer Vision and Pattern Recognition, Salt Lake City,
Gudrun Klinker. 2008. “Using Laser Projectors for Aug- UT, 801–810.
mented Reality.” Proceedings of the 2008 ACM sympo- Souly, Nasim, Concetto Spampinato, and Mubarak Shah. 2017.
sium on Virtual Reality Software and Technology, Bordeaux, “Semi supervised semantic segmentation using generative
France, 134–137. adversarial network.” Proceedings of the IEEE international
Seo, Yongduek, Min-Ho Ahn, and Ki Sang Hong. 1998. conference on Computer Vision, Venice, Italy, 5688–5696.
“Video Augmentation by Image-based Rendering Under Srinivasan, Pratul P., Ben Mildenhall, Matthew Tancik, Jonathan
the Perspective Camera Model.” Proceedings of the 14th T. Barron, Richard Tucker, and Noah Snavely. 2020. “Light-
international conference on Pattern Recognition (Cat. house: Predicting Lighting Volumes for Spatially-Coherent
No. 98EX170), Brisbane, Queensland, Australia, Vol. 2, Illumination.” Proceedings of the IEEE/CVF conference on
1694–1696. IEEE. Computer Vision and Pattern Recognition, Long Beach, CA,
Seo, Yongduek, and Ki Sang Hong. 2000. “Calibration-free 8080–8089.
Augmented Reality in Perspective.” IEEE Transactions on Steinbis, John, William Hoff, and Tyrone L. Vincent. 2008. “3D
Visualization and Computer Graphics 6 (4): 346–359. Fiducials for Scalable AR Visual Tracking.” 7th IEEE/ACM
Sermanet, Pierre, David Eigen, Xiang Zhang, Michaël Mathieu, international symposium on Mixed and Augmented Real-
Rob Fergus, and Yann LeCun. 2013. “Overfeat: Integrated ity, Cambridge, UK, 183–184. IEEE.
Recognition, Localization and Detection using Convolu- Stocker, Cosima, Marc Schmid, and Gunther Reinhart. 2019.
tional Networks.” Preprint arXiv:1312.6229. “Reinforcement Learning–based Design of Orienting Devi
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 53

ces for Vibratory Bowl Feeders.” The International Journal of of the IEEE conference on Computer Vision and Pattern
Advanced Manufacturing Technology 105 (9): 3631–3642. Recognition, Las Vegas, NV, 1420–1429.
Subakti, Hanas, and Jehn-Ruey Jiang. 2018. “Indoor Aug- Tao, Fei, Qinglin Qi, Ang Liu, and Andrew Kusiak. 2018. “Data-
mented Reality using Deep Learning for Industry 4.0 Smart driven Smart Manufacturing.” Journal of Manufacturing Sys-
Factories.” IEEE 42nd Annual Computer Software and tems 48: 157–169.
Applications Conference (COMPSAC), Tokyo, Japan, Vol. 2, Tatzgern, Markus, Valeria Orso, Denis Kalkofen, Giulio Jacucci,
63–68. IEEE. Luciano Gamberini, and Dieter Schmalstieg. 2016. “Adaptive
Sundareswaran, Venkataraman, and Reinhold Behringer. 1999. Information Density for Augmented Reality Displays.” IEEE
“Visual Servoing-based Augmented Reality.” IWAR’98: Pro- Virtual Reality (VR), Greenville, SC, 83–92. IEEE.
ceedings of the International Workshop on Augmented Real- Teichrieb, Veronica, J. P. S. M. Lima, E. Apolinário, T. S. M.
ity: Placing Artificial Objects in Real Scenes, Bellevue, WA, C. Farias, Márcio Bueno, Judith Kelner, and Ismael San-
193–200. Natick, MA: AK Peters, Ltd. tos. 2007. “A Survey of Online Monocular Markerless Aug-
Sutherland, Ivan E. 1968. “A Head-mounted Three Dimen- mented Reality.” International Journal of Modeling and Sim-
sional Display.” Proceedings of the December 9–11, 1968, ulation for the Petroleum Industry 1 (1): 1–7.
Fall Joint Computer Conference, San Francisco, CA, Part I, Teng, Chin-Hung, and Bing-Shiun Wu. 2012. “Developing
757–764. QR Code based Augmented Reality using SIFT Features.”
Swan, J. Edward, Adam Jones, Eric Kolstad, Mark A. Livingston, 9th international conference on Ubiquitous Intelligence
and Harvey S. Smallman. 2007. “Egocentric Depth Judg- and Computing and 9th international conference on Auto-
ments in Optical, See-through Augmented Reality.” IEEE nomic and Trusted Computing, Fukuoka, Japan, 985–990.
Transactions on Visualization and Computer Graphics 13 (3): IEEE.
429–442. Terenzi, Graziano, and Giuseppe Basile. 2013. “Smart Mainte-
Syberfeldt, Anna, Oscar Danielsson, and Patrik Gustavsson. nance.” In An Augmented Reality Platform for Training and
2017. “Augmented Reality Smart Glasses in the Smart Fac- Field Operations in the Manufacturing Industry. ARMedia,
tory: Product Evaluation Guidelines and Review of Available 2nd White Paper of Inglobe Technologies Srl.
Products.” IEEE Access5: 9118–9130. Thomas, P. C., and W. M. David. 1992. “Augmented Reality:
Syberfeldt, Anna, Oscar Danielsson, Magnus Holm, and Lihui An Application of Heads-up Display Technology to Manual
Wang. 2016. “Dynamic Operator Instructions based on Aug- Manufacturing Processes.” Hawaii international conference
mented Reality and Rule-based Expert Systems.” CIRP CMS on System Sciences, Kauai, HI, 659–669.
2015, 48th CIRP Conference on Manufacturing Systems, tom Dieck, M. Claudia, and Timothy Jung. 2018. “A Theo-
Research and Innovation in Manufacturing: Key Enabling retical Model of Mobile Augmented Reality Acceptance in
Technologies for the Factories of the Future, Naples, Italy, Urban Heritage Tourism.” Current Issues in Tourism 21 (2):
24–26 June 2015, Ischia (Naples), Italy, Vol. 41, 346–351. 154–174.
Elsevier. Tonnis, Marcus. 2003. “Data Management for Augmented
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Reality Applications.” Master’s thesis, Technische Universitat
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Munchen.
Vanhoucke, and Andrew Rabinovich. 2015. “Going Deeper Toro, Carlos, Javier Vaquero, and Jorge Posada. 2009. “User
with Convolutions.” Proceedings of the IEEE conference Experience, Ambient Intelligence and Virtual Reality in an
on Computer Vision and Pattern Recognition, Boston, MA, Industrial Maintenance domain using Protégé.”, Amster-
1–9. dam, The Netherlands.
Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Tosi, Fabio, Filippo Aleotti, Matteo Poggi, and Stefano Mat-
Shlens, and Zbigniew Wojna. 2016. “Rethinking the Incep- toccia. 2019. “Learning Monocular Depth Estimation Infus-
tion Architecture for Computer Vision.” Proceedings of the ing Traditional Stereo Knowledge.” Proceedings of the
IEEE conference on Computer Vision and Pattern Recogni- IEEE conference on Computer Vision and Pattern Recogni-
tion, Las Vegas, NV, 2818–2826. tion, Long Beach, CA, 9799–9809.
Takacs, Gabriel, Vijay Chandrasekhar, Natasha Gelfand, Yin- Tuceryan, Mihran, Yakup Genc, and Nassir Navab. 2002.
gen Xiong, Wei-Chao Chen, Thanos Bismpigiannis, Radek “Single-point Active Alignment Method (SPAAM) for Opti-
Grzeszczuk, Kari Pulli, and Bernd Girod. 2008. “Out- cal See-through HMD Calibration for Augmented Real-
doors Augmented Reality on Mobile Phone using Loxel- ity.” Presence: Teleoperators & Virtual Environments, Munich,
based Visual Feature Organization.” Proceedings of the 1st Germany 11 (3): 259–276.
ACM international conference on Multimedia Information Uchiyama, Hideaki, and Eric Marchand. 2012. “Object Detec-
Retrieval, Vancouver, BC, Canada, 427–434. tion and Pose Tracking for Augmented Reality: Recent
Taketomi, Takafumi, Kazuya Okada, Goshiro Yamamoto, Jun Approaches.” 18th Korea-Japan Joint Workshop on Frontiers
Miyazaki, and Hirokazu Kato. 2014. “Camera Pose Estima- of Computer Vision (FCV) Conference, Kawasaki, Japan,
tion Under Dynamic Intrinsic Parameter Change for Aug- 1–8.
mented Reality.” Computers & Graphics 44: 11–19. Uijlings, Jasper R. R., Koen E. A. Van De Sande, Theo Gev-
Tang, Arthur, Charles Owen, Frank Biocca, and Weimin Mou. ers, and Arnold W. M. Smeulders. 2013. “Selective Search
2003. “Comparative Effectiveness of Augmented Reality in for Object Recognition.” International Journal of Computer
Object Assembly.” Proceedings of the SIGCHI conference on Vision 104 (2): 154–171.
Human Factors in Computing Systems, Ft. Lauderdale, FL, Uva, Antonio E., Michele Gattullo, Vito M. Manghisi, Daniele
73–80. Spagnulo, Giuseppe L. Cascella, and Michele Fiorentino.
Tao, Ran, Efstratios Gavves, and Arnold W. M. Smeulders. 2018. “Evaluating the Effectiveness of Spatial Augmented
2016. “Siamese Instance Search for Tracking.” Proceedings Reality in Smart Manufacturing: A Solution for Manual
54 C. K. SAHU ET AL.

Working Stations.” The International Journal of Advanced Videos using Direct Methods.” Proceedings of the IEEE con-
Manufacturing Technology 94 (1–4): 509–521. ference on Computer Vision and Pattern Recognition, Salt
Vacchetti, Luca, Vincent Lepetit, and Pascal Fua. 2004. “Com- Lake City, UT, 2022–2030.
bining Edge and Texture Information for Real-time Accurate Wang, Panqu, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang,
3D Camera Tracking.” 3rd IEEE and ACM international Xiaodi Hou, and Garrison Cottrell. 2018. “Understand-
symposium on Mixed and Augmented Reality, Arlington, ing Convolution for Semantic Segmentation.” IEEE Winter
VA, 48–56. IEEE. Conference on Applications of Computer Vision (WACV),
Valada, Abhinav, Noha Radwan, and Wolfram Burgard. 2018. Lake Tahoe, NV, 1451–1460. IEEE.
“Deep Auxiliary Learning for Visual Localization and Wang, Naiyan, Siyi Li, Abhinav Gupta, and Dit-Yan Yeung.
Odometry.” IEEE International Conference on Robotics 2015. “Transferring Rich Feature Hierarchies for Robust
and Automation (ICRA), Brisbane, Australia, 6939–6946 Visual Tracking.” Preprint arXiv:1501.04587.
. IEEE. Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Van Krevelen, D., and R. Poelman. 2007. “Augmented Real- Jan Kautz, and Bryan Catanzaro. 2018. “High-resolution
ity: Technologies, Applications, and Limitations.” Vrije Univ. Image Synthesis and Semantic Manipulation with Condi-
Amsterdam, Dep. Comput. Sci., 1–25. doi:10.13140/RG.2.1. tional Gans.” Proceedings of the IEEE conference on Com-
1874.7929. puter Vision and Pattern Recognition, Salt Lake City, UT,
Venugopalan, Subhashini, Lisa Anne Hendricks, Marcus 8798–8807.
Rohrbach, Raymond Mooney, Trevor Darrell, and Kate Wang, Z. B., L. X. Ng, S. K. Ong, and A. Y. C. Nee. 2013. “Assem-
Saenko. 2017. “Captioning Images with Diverse Objects.” bly Planning and Evaluation in An Augmented Reality Envi-
Proceedings of the IEEE conference on Computer Vision ronment.” International Journal of Production Research 51
and Pattern Recognition, Honolulu, HI, 5753–5761. (23–24): 7388–7404.
Viola, Paul, and William M. Wells III. 1997. “Alignment by Wang, Z. B., S. K. Ong, and A. Y. C. Nee. 2013. “Augmented
Maximization of Mutual Information.” International Journal Reality Aided Interactive Manual Assembly Design.” The
of Computer Vision 24 (2): 137–154. International Journal of Advanced Manufacturing Technology
Virkkunen, Anja. 2018. “Automatic Speech Recognition for the 69 (5–8): 1311–1321.
Hearing Impaired in an Augmented Reality Application.” Wang, X., S. K. Ong, and A. Y. C. Nee. 2014. “Augmented Real-
Master’s thesis, School of Science, Aalto University. ity Interfaces for Industrial Assembly Design and Planning.”
Visin, Francesco, Marco Ciccone, Adriana Romero, Kyle Kast- Proceedings of Eighth International Conference on Inter-
ner, Kyunghyun Cho, Yoshua Bengio, Matteo Matteucci, faces and Human Computer Interaction, Lisbon, Portugal,
and Aaron Courville. 2016. “Reseg: A Recurrent Neural 83–90.
Network-based Model for Semantic Segmentation.” Pro- Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan
ceedings of the IEEE conference on Computer Vision and Lu. 2015. “Visual Tracking with Fully Convolutional Net-
Pattern Recognition Workshops, Las Vegas, NV, 41–48. works.” Proceedings of the IEEE international conference on
Visin, Francesco, Kyle Kastner, Kyunghyun Cho, Matteo Mat- Computer Vision, Santiago, Chile, 3119–3127.
teucci, Aaron Courville, and Yoshua Bengio. 2015. “Renet: Wang, Li, Nam Trung Pham, Tian-Tsong Ng, Gang Wang, Kap
A Recurrent Neural Network based Alternative to Convolu- Luk Chan, and Karianto Leman. 2014. “Learning Deep Fea-
tional Networks.” Preprint arXiv:1505.00393. tures for Multiple Object Tracking by using a Multi-task
Wagner, Daniel, Gerhard Reitmayr, Alessandro Mulloni, Tom Learning Strategy.” IEEE International Conference on Image
Drummond, and Dieter Schmalstieg. 2008. “Pose Tracking Processing (ICIP), Paris, France, 838–842. IEEE.
from Natural Features on Mobile Phones.” 7th IEEE/ACM Wang, Peng, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price,
international symposium on Mixed and Augmented Real- and Alan L. Yuille. 2015. “Towards Unified Depth and
ity, Cambridge, UK, 125–134. IEEE. Semantic Prediction from a Single Image.” Proceedings
Wagner, Daniel, Gerhard Reitmayr, Alessandro Mulloni, Tom of the IEEE conference on Computer Vision and Pattern
Drummond, and Dieter Schmalstieg. 2009. “Real-time Recognition, Boston, MA, 2800–2809.
Detection and Tracking for Augmented Reality on Mobile Wang, Qiang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu,
Phones.” IEEE Transactions on Visualization and Computer and Stephen Maybank. 2018. “Learning Attentions: Residual
Graphics 16 (3): 355–368. Attentional Siamese Network for High Performance Online
Wagner, Daniel, and Dieter Schmalstieg. 2007. “ARToolKitPlus Visual Tracking.” Proceedings of the IEEE conference on
for POSE Tracking on Mobile Devices.” Proceedings of 12th Computer Vision and Pattern Recognition, Salt Lake City,
Computer Vision Winter Workshop, St. Lambrecht, Austria. UT, 4854–4863.
Walch, Florian, Caner Hazirbas, Laura Leal-Taixe, Torsten Wang, Xiang, Kai Wang, and Shiguo Lian. 2019. “Deep Consis-
Sattler, Sebastian Hilsenbeck, and Daniel Cremers. 2017. tent Illumination in Augmented Reality.” IEEE International
“Image-based Localization using LSTMs for Structured Fea- Symposium on Mixed and Augmented Reality Adjunct
ture Correlation.” Proceedings of the IEEE international (ISMAR-Adjunct), Beijing, China, 189–194. IEEE.
conference on Computer Vision, Venice, Italy, 627–637. Wang, Bing, Li Wang, Bing Shuai, Zhen Zuo, Ting Liu, Kap
Wan, Xingyu, Jinjun Wang, and Sanping Zhou. 2018. “An Luk Chan, and Gang Wang. 2016. “Joint Learning of Con-
Online and Flexible Multi-object Tracking Framework using volutional Neural Networks and Temporally Constrained
Long Short-term Memory.” Proceedings of the IEEE confer- Metrics for Tracklet Association.” Proceedings of the IEEE
ence on Computer Vision and Pattern Recognition Work- conference on Computer Vision and Pattern Recognition
shops, Venice, Italy, 1230–1238. Workshops, Las Vegas, NV, 1–8.
Wang, Chaoyang, José Miguel Buenaposada, Rui Zhu, and Wang, Peng, Ruigang Yang, Binbin Cao, Wei Xu, and Yuanqing
Simon Lucey. 2018. “Learning Depth from Monocular Lin. 2018. “DeLS-3D: Deep Localization and Segmentation
INTERNATIONAL JOURNAL OF PRODUCTION RESEARCH 55
