Virtual Testing and Policy Deployment Framework for Autonomous Navigation of an Unmanned Ground Vehicle Using Reinforcement Learning
Abstract—The use of deep reinforcement learning (DRL) as a framework for training a mobile robot to perform optimal navigation in an unfamiliar environment is a suitable choice for implementing AI with real-time robotic systems. In this study, the environment and surrounding obstacles of an Ackermann-steered UGV are reconstructed into a virtual setting for training the UGV to centrally learn the optimal route (guidance actions to be taken at any given state) towards a desired goal position using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) for various model configurations. The trained model policies are to be transferred to a physical vehicle and compared based on their individual effectiveness for performing autonomous waypoint navigation. Prior to incorporating the learned model with the physical UGV for testing, this paper outlines the development of a GUI application to provide an interface for remotely deploying the vehicle, and a virtual reality framework reconstruction of the training environment to assist in safely testing the system using the reinforcement learning model.

Index Terms—Reinforcement Learning, Virtual Reality, Autonomous Car, Simulation, Multi-Agent Systems, Exploration

*This work was supported by Grant number FA8750-15-2-0116 from the Air Force Research Laboratory and OSD through a contract with North Carolina Agricultural and Technical State University.

I. INTRODUCTION

A. Reinforcement Learning for Robotic Systems

The use of deep reinforcement learning (DRL) as a framework for training a UGV to perform optimal navigation in an unfamiliar environment is a suitable choice for implementing AI with real-time robotic systems. Deep Q-Network (DQN) and similar variations are capable of approximating functions to provide generalized solutions for model-free Markov Decision Processes (MDP) and optimizing complex control problems by producing a learned optimal policy function to govern the behavior of an agent in an unfamiliar environment. The state-action space of the robotic system is a representation of all possible "actions" (motor actuation commands, task execution) that a robot can take given any "state", which often describes the robot's position, orientation, and any other information considered pertinent to its surrounding environment. The classical approach to reinforcement learning, as it pertains to navigation, aims to discretize the space into a grid-world representation to allow for a fixed number of cell-based movement actions that can be feasibly implemented by a low-level controller. This technique can be extended to reduce the planning horizon of modern applications by decomposing complex tasks into options, or sub-tasks composed of elementary components. These options can be executed to help ensure safe exploration of the environment, a key necessity for reliable performance in the learning process [1], [2].

B. Reinforcement Learning Model Implementation

In this implementation, the environment and surrounding obstacles of an Ackermann-steered UGV are reconstructed into a virtual setting for training the UGV (represented as a bicycle model) to centrally learn the optimal route (guidance actions to be taken at any given state) towards a desired goal position using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) [3], which has been shown to converge faster than standard Q-Learning techniques. Combining this method of learning with a physical UGV can be accomplished by making inferences about newly-seen obstacles from the UGV's distance/ranging sensors and incorporating them into the simulated environment for iterative learning (hence re-learning an optimal route given unfamiliar conditions). By designing a navigation controller with this "hands-off" approach and allowing the DRL model to perform the work, it is possible to modify the configuration of the physical system (a non-holonomic dynamical system) and have it autonomously navigate through unfamiliar environments without having to strictly modify the design of a controller, instead re-learning the optimal path based on state configuration changes.

The state of the virtual UGV s(t), or "agent", at any given time t is represented by the following equations:

\vec{\phi} = \left[\, \lVert \vec{d} \rVert_{\mathrm{scaled}},\ \sin(\rho),\ \cos(\rho),\ s_{-30^{\circ}},\ s_{0^{\circ}},\ s_{30^{\circ}} \,\right] \quad (1)

\vec{d} = (x_a - x_d)\,\hat{i} + (y_a - y_d)\,\hat{j} \quad (2)

\rho = \angle\vec{d} - \omega \quad (3)
where (x_a, y_a) and (x_d, y_d) represent the positions of the agent and the destination respectively, ρ is defined as the angle offset between the vector directed towards the destination and the current position, ω is the agent's orientation relative to the world frame, and s_{-30°}, s_{0°}, s_{30°} are inverse-scaled sensor values that indicate the presence of an obstacle as the agent approaches it. The actions a that can be taken consist of velocity and steering commands allowing movement towards a desired state position, and are notated as a = {(υ, ψ); υ ∈ {0.2, 1.5}, ψ ∈ {0.0, 0.6, −0.6}}.
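As a concrete reading of the state and action definitions in Eqs. (1) to (3), the following Python sketch assembles the feature vector for a single agent and enumerates the discrete action set. The function name build_state, the arena_diag normalization constant, and the 1/r form of the inverse-scaled sensor values are illustrative assumptions rather than details taken from the paper.

```python
import math

# Discrete action set from the paper: forward velocity v and steering angle psi,
# a = {(v, psi); v in {0.2, 1.5}, psi in {0.0, 0.6, -0.6}}.
ACTIONS = [(v, psi) for v in (0.2, 1.5) for psi in (0.0, 0.6, -0.6)]

def build_state(agent_xy, agent_heading, dest_xy, sensor_readings, arena_diag):
    """Assemble the feature vector phi of Eq. (1).

    agent_xy, dest_xy : (x, y) positions of the agent and the destination
    agent_heading     : omega, agent orientation in the world frame (rad)
    sensor_readings   : raw ranges from the -30, 0, +30 degree sensors
    arena_diag        : normalization constant for the scaled distance (assumed)
    """
    dx = agent_xy[0] - dest_xy[0]              # Eq. (2): d = (x_a - x_d) i + (y_a - y_d) j
    dy = agent_xy[1] - dest_xy[1]
    d_scaled = math.hypot(dx, dy) / arena_diag
    rho = math.atan2(dy, dx) - agent_heading   # Eq. (3): angle of d minus omega
    # Inverse-scaled sensor values grow as an obstacle gets closer (1/r form assumed).
    s = [1.0 / max(r, 1e-3) for r in sensor_readings]
    return [d_scaled, math.sin(rho), math.cos(rho), *s]
```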
II. DESIGN GOALS

A. Problem Overview

This project aims to develop a framework for efficient deployment of a trained deep reinforcement-learning (DRL) model that will optimally navigate an autonomous wheeled mobile robot to a user-provided waypoint goal location while avoiding collisions with various obstacles in the surrounding environment. The DRL model has been trained using reinforcement learning (Multi-Agent Virtual Exploration in Deep Q-Learning, MVEDQL) in a custom Python simulation environment [3], [4] that can spawn multiple virtual agents representing the vehicle, which perform reward-based learned actions to optimally navigate to the goal location after spawning from a random position within the virtual environment during each iteration.

The results obtained from simulation demonstrated the fastest success using MVEDQL with 6 agents and a box collision model. The algorithm parameters affecting the robustness of the robot's performance are those that randomize the environment to elicit greater exploration, and they will likely have a significant effect on policy transferability, or the effectiveness with which the physical robot takes actions to perform a task in real-time given a policy that was learned offline. These training parameters will be unique for several simulation model configurations (shown in Table I) to be compared during test evaluation.

Figure 1: Reinforcement Learning Model Simulation Case Training Results

The primary focus of this research is to compare the performance of implementing these pre-trained model configurations based on the shortest time taken to reach a destination position (if at all) for a fixed number of trials. The total trajectory length needed to reach the goal point within this time will also be recorded, in addition to the percentage of successful trials that reach the destination without collision. The model configuration parameters are all boolean-valued, and the list of configurations being tested is not exhaustive. The individual training performances are presented in Figure 1, where a fixed agent position, fixed destination, and no inter-agent collision learns the fastest for the early stages of training. While this model may perform best in an arena constructed to be identical to the one it used for training, it will likely suffer in practice when the environment has changed, as it is not adapted for exploration. However, this can be used to offer insight into the effectiveness of the framework developed in the following sections.
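Since the comparison above reduces to a few per-configuration statistics, a minimal sketch of how logged trials could be summarized is shown below. The trial record fields (reached, time_s, path_length_m) are hypothetical names used only to illustrate the three metrics named in this section: time to destination, trajectory length, and success percentage.

```python
def summarize_trials(trials):
    """Summarize a list of trial records for one model configuration.

    Each record is assumed to be a dict with keys:
      'reached'       : True if the goal was reached without collision
      'time_s'        : time taken to reach the goal (meaningful only if reached)
      'path_length_m' : total trajectory length driven during the trial
    """
    successes = [t for t in trials if t["reached"]]
    n = len(trials)
    return {
        "success_rate": len(successes) / n if n else 0.0,
        "mean_time_s": (sum(t["time_s"] for t in successes) / len(successes)
                        if successes else float("nan")),
        "mean_path_m": (sum(t["path_length_m"] for t in successes) / len(successes)
                        if successes else float("nan")),
    }

# Example: two logged trials for one configuration.
print(summarize_trials([
    {"reached": True, "time_s": 14.2, "path_length_m": 6.1},
    {"reached": False, "time_s": None, "path_length_m": 3.8},
]))
```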
III. BACKGROUND

A. Reinforcement Learning for Robotic Navigation Systems

The claim that reinforcement learning is an ideal machine learning framework for training real-time robotic systems should be prefaced with a review of the challenging complications involved with these often highly complex physical systems [5]. Firstly, training a DRL model requires repeated sampling from the robot's internal and external sensors so that an adequate representation of its state-action space can be properly constructed, one that will ultimately converge to a globally optimal solution without requiring an unreasonable amount of exploration.

For mobile navigation concerns, time-discretization of the actuation is challenging since neither the motors nor the vehicle can move instantaneously (an inherent time delay), which can distort the distance measurements collected between states and therefore delay the observable actions by multiple time steps. While this problem can be handled by placing some of the most recent actions into a buffer, this strategy increases the dimensionality of the problem. Instead, the duration of the time steps can be increased, with the penalty of reducing the precision of the controller.

The internal sensors are used to monitor the states of the system that pertain to the position and orientation of the robot's joints and linkages. As the number of joints and linkages increases (as with modern robotic systems), so does the total number of degrees of freedom (DoF), all of which must be monitored.
Figure 2: Training a Reinforcement Learning Model for Autonomous Waypoint Trajectory Generation Using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) [3]
Over time, the dynamics of the system can change due to external effects such as wear and corrosion, affecting the robot's behavior and ultimately the learning process. This effect is more drastic for higher-velocity systems such as drones and UAVs.

External sensors are then used to monitor the environment surrounding the robot (e.g., nearby obstacles, landmarks, chemical concentrations, local wind velocity, ambient temperature, etc.), expanding the complexity of the overall system. Even with a limited quantity of sensors used to sample from the environment for a simple system, obtaining these samples can be very costly due to time-extensive task execution periods and any necessary interventions from human supervisors (maintenance, repairs, position resets, etc.). Furthermore, to efficiently implement an autonomous navigation algorithm in real-time, any delays in communication and actuation, filtering of signal noise, and task execution must be accounted for.

Increasing the number of sensors for systems of higher complexity not only demands attention to these challenges, but additionally introduces a new problem commonly referred to as the curse of dimensionality, where exponentially more data collection (and processing) is needed to cover the complete state-action space. Clearly, limitations on the amount of real-world interaction time, along with sample-efficient algorithms that can learn from a small number of trials, are essential, especially for non-episodic settings where movements cannot be paused while undergoing training and actions must be selected within a restricted time budget. A real-time architecture for model-based value-function reinforcement learning methods has been proposed in [6], where individual threads are created for planning, model-learning, and action selection. This particular example offers thorough insight into the difficulties of real-time policy deployment.

Training the reinforcement learning model online involves a major component of agent exploration to sample data from the environment in order to learn an optimal policy that accomplishes a given objective (reach a destination without collision with obstacles). A consequence of this approach is that learning a policy while attempting to select new actions comes at a significant computational cost. Alternatively, training the model in simulation and transferring the learned policy to the physical robot is an idealistic approach that largely depends on the accuracy of the system's dynamic model. Small model errors can accumulate for under-modelled systems (e.g., by neglecting contact friction between mechanical interactions, neglecting air drag, etc.), resulting in divergence from the real system. The technique of transferring policies works best for self-stabilizing systems that do not require active control to maintain a safe state for the robot. For unstable tasks or approximate models, this approach still has utility for testing algorithms in simulation, verifying theoretically optimal solutions, and identifying alternative data-collection strategies. As addressed in [5], a general framework that can be used for issuing comparable experiments and consistent evaluation of reinforcement learning models on real robotic systems is highly desirable, but has yet to be developed. This paper seeks to identify a possible solution to this problem.
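The thread-per-concern structure cited above ([6]) can be outlined roughly as follows. This is a bare structural sketch with placeholder learning, planning, and sensing logic, not the implementation from [6]; its only point is that action selection runs at the control rate while model-learning and planning proceed asynchronously.

```python
import threading, queue, time, random

experience_q = queue.Queue()                              # observations streamed from the robot
shared = {"model": [], "policy": lambda obs: (0.2, 0.0)}  # placeholder model and policy
lock = threading.Lock()
stop = threading.Event()

def model_learning_loop():
    # Fold newly collected samples into the (placeholder) model.
    while not stop.is_set():
        try:
            sample = experience_q.get(timeout=0.1)
        except queue.Empty:
            continue
        with lock:
            shared["model"].append(sample)

def planning_loop():
    # Periodically derive an updated (placeholder) policy from the current model.
    while not stop.is_set():
        with lock:
            n = len(shared["model"])
            shared["policy"] = lambda obs, n=n: (1.5 if n > 10 else 0.2, 0.0)
        time.sleep(0.05)

def action_selection_loop():
    # Runs at the control rate and never blocks on learning or planning.
    for _ in range(200):
        obs = (random.random(), random.random())  # stand-in for a sensor observation
        experience_q.put(obs)
        with lock:
            policy = shared["policy"]
        _command = policy(obs)                     # would be sent to the actuators
        time.sleep(0.01)
    stop.set()

threads = [threading.Thread(target=f) for f in
           (model_learning_loop, planning_loop, action_selection_loop)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```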
IV. METHOD

A. Testing Environment for the Trained Model

A demonstration of the learning process using MVEDQL for a simple box-shaped indoor environment with a single obstacle is provided in Figure 2, where multiple vehicles are spawned at the same location with different initial orientations with respect to the desired goal position. During testing, the agents employ the learned policies and attempt to reach the input destination position. The process of testing for the chosen environment is visualized in Figure 3.

Accomplishing the primary objective of implementing the trained DRL model for autonomous navigation of the robot can seemingly be satisfied by constructing a simple indoor arena with a physical obstacle placed between the vehicle and a goal point. Short walls restrict the domain or search space, and the gray centered box acts as the obstructing object. The lighthouse sensors on the robot are visible at all positions within the domain, as required to accurately track the pose of the robot. However, when using a testing arena with real obstacles, it is apparent that testing reinforcement learning models with the mobile robot will often result in a collision for
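To make the testing procedure described above concrete, the sketch below pairs a greedy roll-out of a trained Q-network with a kinematic bicycle-model update of the kind used to propagate the virtual agent. It reuses build_state and ACTIONS from the earlier sketch in Section I-B; the q_values interface, wheelbase, time step, arena scaling, and goal tolerance are illustrative assumptions, not parameters taken from the paper.

```python
import math
# Assumes build_state() and ACTIONS from the earlier sketch in Section I-B.

def bicycle_step(x, y, heading, v, psi, wheelbase=0.3, dt=0.1):
    """Propagate a kinematic bicycle model one time step (assumed parameters)."""
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += (v / wheelbase) * math.tan(psi) * dt
    return x, y, heading

def run_test_episode(q_values, pose, goal, read_sensors, max_steps=500, tol=0.1):
    """Greedy roll-out of a trained policy: take the argmax-Q action at each step."""
    x, y, heading = pose
    for _ in range(max_steps):
        phi = build_state((x, y), heading, goal, read_sensors(), arena_diag=5.0)
        q = q_values(phi)                              # one forward pass of the trained DQN
        v, psi = ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[i])]
        x, y, heading = bicycle_step(x, y, heading, v, psi)
        if math.hypot(x - goal[0], y - goal[1]) < tol:
            return True                                # destination reached
    return False                                       # step budget exhausted
```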