Virtual Testing and Policy Deployment Framework for Autonomous Navigation of an Unmanned Ground Vehicle Using Reinforcement Learning
Abstract—The use of deep reinforcement learning (DRL) as a framework for training a mobile robot to perform optimal navigation in an unfamiliar environment is a suitable choice for implementing AI with real-time robotic systems. In this study, the environment and surrounding obstacles of an Ackermann-steered UGV are reconstructed into a virtual setting for training the UGV to centrally learn the optimal route (guidance actions to be taken at any given state) towards a desired goal position using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) for various model configurations. The trained model policies are to be transferred to a physical vehicle and compared based on their individual effectiveness for performing autonomous waypoint navigation. Prior to incorporating the learned model with the physical UGV for testing, this paper outlines the development of a GUI application to provide an interface for remotely deploying the vehicle, and a virtual reality framework reconstruction of the training environment to assist in safely testing the system using the reinforcement learning model.

Index Terms—Reinforcement Learning, Virtual Reality, Autonomous Car, Simulation, Multi-Agent Systems, Exploration

*This work was supported by Grant number FA8750-15-2-0116 from the Air Force Research Laboratory and OSD through a contract with North Carolina Agricultural and Technical State University.

I. INTRODUCTION

A. Reinforcement Learning for Robotic Systems

The use of deep reinforcement learning (DRL) as a framework for training a UGV to perform optimal navigation in an unfamiliar environment is a suitable choice for implementing AI with real-time robotic systems. Deep Q-Network (DQN) and similar variations are capable of approximating functions to provide generalized solutions for model-free Markov Decision Processes (MDP) and optimizing complex control problems by producing a learned optimal policy function to govern the behavior of an agent in an unfamiliar environment. The state-action space of the robotic system is a representation of all possible "actions" (motor actuation commands, task execution) that a robot can take given any "state", which often describes the robot's position, orientation, and any other information considered pertinent to its surrounding environment. The classical approach to reinforcement learning, as it pertains to navigation, aims to discretize the space into a grid-world representation to allow for a fixed number of cell-based movement actions that can be feasibly implemented by a low-level controller. This technique can be extended to reduce the planning horizon of modern applications by decomposing complex tasks into options, or sub-tasks composed of elementary components. These options can be executed to help ensure safe exploration of the environment, a key necessity for reliable performance in the learning process [1], [2].

B. Reinforcement Learning Model Implementation

In this implementation, the environment and surrounding obstacles of an Ackermann-steered UGV are reconstructed into a virtual setting for training the UGV (represented as a bicycle model) to centrally learn the optimal route (guidance actions to be taken at any given state) towards a desired goal position using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) [3], which has been shown to converge faster than standard Q-Learning techniques. Combining this method of learning with a physical UGV can be accomplished by making inferences about newly-seen obstacles from the UGV's distance/ranging sensors and incorporating them into the simulated environment for iterative learning (hence re-learning an optimal route given unfamiliar conditions). By designing a navigation controller with this "hands-off" approach and allowing the DRL model to perform the work, it is possible to modify the configuration of the physical system (a non-holonomic dynamical system) and have it autonomously navigate through unfamiliar environments without having to strictly modify the design of a controller, instead re-learning the optimal path based on state configuration changes.

The state of the virtual UGV s(t), or "agent", at any given time t is represented by the following equations:

\vec{\phi} = \left[\, \lVert \vec{d} \rVert_{\mathrm{scaled}},\ \sin(\rho),\ \cos(\rho),\ s_{-30^{\circ}},\ s_{0^{\circ}},\ s_{30^{\circ}} \,\right] \quad (1)

\vec{d} = (x_a - x_d)\,\hat{i} + (y_a - y_d)\,\hat{j} \quad (2)

\rho = \angle\vec{d} - \omega \quad (3)
where (x_a, y_a) and (x_d, y_d) represent the positions of the agent and the destination respectively, ρ is defined as the angle offset between the vector directed towards the destination and the current position, ω is the agent's orientation relative to the world frame, and s_{-30°}, s_{0°}, s_{30°} are inverse-scaled sensor values that indicate the presence of an obstacle as the agent approaches it. The actions a that can be taken consist of velocity and steering commands allowing movement towards a desired state position, and are notated as a = {(υ, ψ); υ ∈ {0.2, 1.5}, ψ ∈ {0.0, 0.6, −0.6}}.
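As a concrete reading of the state and action definitions in Eqs. (1) to (3), the following Python sketch assembles the feature vector for a single agent and enumerates the discrete action set. The function name build_state, the arena_diag normalization constant, and the 1/r form of the inverse-scaled sensor values are illustrative assumptions rather than details taken from the paper.

```python
import math

# Discrete action set from the paper: forward velocity v and steering angle psi,
# a = {(v, psi); v in {0.2, 1.5}, psi in {0.0, 0.6, -0.6}}.
ACTIONS = [(v, psi) for v in (0.2, 1.5) for psi in (0.0, 0.6, -0.6)]

def build_state(agent_xy, agent_heading, dest_xy, sensor_readings, arena_diag):
    """Assemble the feature vector phi of Eq. (1).

    agent_xy, dest_xy : (x, y) positions of the agent and the destination
    agent_heading     : omega, agent orientation in the world frame (rad)
    sensor_readings   : raw ranges from the -30, 0, +30 degree sensors
    arena_diag        : normalization constant for the scaled distance (assumed)
    """
    dx = agent_xy[0] - dest_xy[0]              # Eq. (2): d = (x_a - x_d) i + (y_a - y_d) j
    dy = agent_xy[1] - dest_xy[1]
    d_scaled = math.hypot(dx, dy) / arena_diag
    rho = math.atan2(dy, dx) - agent_heading   # Eq. (3): angle of d minus omega
    # Inverse-scaled sensor values grow as an obstacle gets closer (1/r form assumed).
    s = [1.0 / max(r, 1e-3) for r in sensor_readings]
    return [d_scaled, math.sin(rho), math.cos(rho), *s]
```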
II. DESIGN GOALS

A. Problem Overview

This project aims to develop a framework for efficient deployment of a trained deep reinforcement-learning (DRL) model that will optimally navigate an autonomous wheeled mobile robot to a user-provided waypoint goal location while avoiding collisions with various obstacles in the surrounding environment. The DRL model has been trained using reinforcement learning (Multi-Agent Virtual Exploration in Deep Q-Learning, MVEDQL) in a custom Python simulation environment [3], [4] that can spawn multiple virtual agents representing the vehicle, which perform reward-based learned actions to optimally navigate to the goal location after spawning from a random position within the virtual environment during each iteration.

The results obtained from simulation demonstrated the fastest success using MVEDQL with 6 agents and a box collision model. The algorithm parameters affecting the robustness of the robot's performance are those that randomize the environment to elicit greater exploration, and they will likely have a significant effect on policy transferability, or the effectiveness with which the physical robot takes actions to perform a task in real-time given a policy that was learned offline. These training parameters will be unique for several simulation model configurations (shown in Table I) to be compared during test evaluation.

Figure 1: Reinforcement Learning Model Simulation Case Training Results

The primary focus of this research is to compare the performance of implementing these pre-trained model configurations based on the shortest time taken to reach a destination position (if at all) for a fixed number of trials. The total trajectory length needed to reach the goal point within this time will also be recorded, in addition to the percentage of successful trials that reach the destination without collision. The model configuration parameters are all boolean-valued, and the list of configurations being tested is not exhaustive. The individual training performances are presented in Figure 1, where a fixed agent position, fixed destination, and no inter-agent collision learns the fastest for the early stages of training. While this model may perform best in an arena constructed to be identical to the one it used for training, it will likely suffer in practice when the environment has changed, as it is not adapted for exploration. However, this can be used to offer insight into the effectiveness of the framework developed in the following sections.
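Since the comparison above reduces to a few per-configuration statistics, a minimal sketch of how logged trials could be summarized is shown below. The trial record fields (reached, time_s, path_length_m) are hypothetical names used only to illustrate the three metrics named in this section: time to destination, trajectory length, and success percentage.

```python
def summarize_trials(trials):
    """Summarize a list of trial records for one model configuration.

    Each record is assumed to be a dict with keys:
      'reached'       : True if the goal was reached without collision
      'time_s'        : time taken to reach the goal (meaningful only if reached)
      'path_length_m' : total trajectory length driven during the trial
    """
    successes = [t for t in trials if t["reached"]]
    n = len(trials)
    return {
        "success_rate": len(successes) / n if n else 0.0,
        "mean_time_s": (sum(t["time_s"] for t in successes) / len(successes)
                        if successes else float("nan")),
        "mean_path_m": (sum(t["path_length_m"] for t in successes) / len(successes)
                        if successes else float("nan")),
    }

# Example: two logged trials for one configuration.
print(summarize_trials([
    {"reached": True, "time_s": 14.2, "path_length_m": 6.1},
    {"reached": False, "time_s": None, "path_length_m": 3.8},
]))
```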
III. BACKGROUND

A. Reinforcement Learning for Robotic Navigation Systems

The claim that reinforcement learning is an ideal machine learning framework for training real-time robotic systems should be prefaced with a review of the challenging complications involved with these often highly complex physical systems [5]. Firstly, training a DRL model requires repeated sampling from the robot's internal and external sensors so that an adequate representation of its state-action space can be properly constructed, one that will ultimately converge to a globally optimal solution without requiring an unreasonable amount of exploration.

For mobile navigation concerns, time-discretization of the actuation is challenging since neither the motors nor the vehicle can move instantaneously (an inherent time delay), which can distort the distance measurements collected between states and therefore delay the observable actions by multiple time steps. While this problem can be handled by placing some of the most recent actions into a buffer, this strategy increases the dimensionality of the problem. Instead, the duration of the time steps can be increased, with the penalty of reducing the precision of the controller.

The internal sensors are used to monitor the states of the system that pertain to the position and orientation of the robot's joints and linkages. As the number of joints and linkages increases (as with modern robotic systems), so does the total number of degrees of freedom (DoF), all of which must be monitored.
Figure 2: Training a Reinforcement Learning Model for Autonomous Waypoint Trajectory Generation Using Multi-Agent Virtual Exploration in Deep Q-Learning (MVEDQL) [3]
Over time, the dynamics of the system can change due to external effects such as wear and corrosion, affecting the robot's behavior and ultimately the learning process. This effect is more drastic for higher-velocity systems such as drones and UAVs.

External sensors are then used to monitor the environment surrounding the robot (e.g., nearby obstacles, landmarks, chemical concentrations, local wind velocity, ambient temperature, etc.), expanding the complexity of the overall system. Even with a limited quantity of sensors used to sample from the environment for a simple system, obtaining these samples can be very costly due to time-extensive task execution periods and any necessary interventions from human supervisors (maintenance, repairs, position resets, etc.). Furthermore, to efficiently implement an autonomous navigation algorithm in real-time, any delays in communication and actuation, filtering of signal noise, and task execution must be accounted for.

Increasing the number of sensors for systems of higher complexity not only demands attention to these challenges, but additionally introduces a new problem commonly referred to as the curse of dimensionality, where exponentially more data collection (and processing) is needed to cover the complete state-action space. Clearly, limitations on the amount of real-world interaction time, along with sample-efficient algorithms that can learn from a small number of trials, are essential, especially for non-episodic settings where movements cannot be paused while undergoing training and actions must be selected within a restricted time budget. A real-time architecture for model-based value-function reinforcement learning methods has been proposed in [6], where individual threads are created for planning, model-learning, and action selection. This particular example offers thorough insight into the difficulties of real-time policy deployment.

Training the reinforcement learning model online involves a major component of agent exploration to sample data from the environment in order to learn an optimal policy that accomplishes a given objective (reach a destination without collision with obstacles). A consequence of this approach is that learning a policy while attempting to select new actions comes at a significant computational cost. Alternatively, training the model in simulation and transferring the learned policy to the physical robot is an idealistic approach that largely depends on the accuracy of the system's dynamic model. Small model errors can accumulate for under-modelled systems (e.g., by neglecting contact friction between mechanical interactions, neglecting air drag, etc.), resulting in divergence from the real system. The technique of transferring policies works best for self-stabilizing systems that do not require active control to maintain a safe state for the robot. For unstable tasks or approximate models, this approach still has utility for testing algorithms in simulation, verifying theoretically optimal solutions, and identifying alternative data-collection strategies. As addressed in [5], a general framework that can be used for issuing comparable experiments and consistent evaluation of reinforcement learning models on real robotic systems is highly desirable, but has yet to be developed. This paper seeks to identify a possible solution to this problem.
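The thread-per-concern structure cited above ([6]) can be outlined roughly as follows. This is a bare structural sketch with placeholder learning, planning, and sensing logic, not the implementation from [6]; its only point is that action selection runs at the control rate while model-learning and planning proceed asynchronously.

```python
import threading, queue, time, random

experience_q = queue.Queue()                              # observations streamed from the robot
shared = {"model": [], "policy": lambda obs: (0.2, 0.0)}  # placeholder model and policy
lock = threading.Lock()
stop = threading.Event()

def model_learning_loop():
    # Fold newly collected samples into the (placeholder) model.
    while not stop.is_set():
        try:
            sample = experience_q.get(timeout=0.1)
        except queue.Empty:
            continue
        with lock:
            shared["model"].append(sample)

def planning_loop():
    # Periodically derive an updated (placeholder) policy from the current model.
    while not stop.is_set():
        with lock:
            n = len(shared["model"])
            shared["policy"] = lambda obs, n=n: (1.5 if n > 10 else 0.2, 0.0)
        time.sleep(0.05)

def action_selection_loop():
    # Runs at the control rate and never blocks on learning or planning.
    for _ in range(200):
        obs = (random.random(), random.random())  # stand-in for a sensor observation
        experience_q.put(obs)
        with lock:
            policy = shared["policy"]
        _command = policy(obs)                     # would be sent to the actuators
        time.sleep(0.01)
    stop.set()

threads = [threading.Thread(target=f) for f in
           (model_learning_loop, planning_loop, action_selection_loop)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```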
IV. METHOD

A. Testing Environment for the Trained Model

A demonstration of the learning process using MVEDQL for a simple box-shaped indoor environment with a single obstacle is provided in Figure 2, where multiple vehicles are spawned at the same location with different initial orientations with respect to the desired goal position. During testing, the agents employ the learned policies and attempt to reach the input destination position. The process of testing for the chosen environment is visualized in Figure 3.

Accomplishing the primary objective of implementing the trained DRL model for autonomous navigation of the robot can seemingly be satisfied by constructing a simple indoor arena with a physical obstacle placed between the vehicle and a goal point. Short walls restrict the domain or search space, and the gray centered box acts as the obstructing object. The lighthouse sensors on the robot are visible at all positions within the domain, as required to accurately track the pose of the robot. However, when using a testing arena with real obstacles, it is apparent that testing reinforcement learning models with the mobile robot will often result in a collision for
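To make the testing procedure described above concrete, the sketch below pairs a greedy roll-out of a trained Q-network with a kinematic bicycle-model update of the kind used to propagate the virtual agent. It reuses build_state and ACTIONS from the earlier sketch in Section I-B; the q_values interface, wheelbase, time step, arena scaling, and goal tolerance are illustrative assumptions, not parameters taken from the paper.

```python
import math
# Assumes build_state() and ACTIONS from the earlier sketch in Section I-B.

def bicycle_step(x, y, heading, v, psi, wheelbase=0.3, dt=0.1):
    """Propagate a kinematic bicycle model one time step (assumed parameters)."""
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += (v / wheelbase) * math.tan(psi) * dt
    return x, y, heading

def run_test_episode(q_values, pose, goal, read_sensors, max_steps=500, tol=0.1):
    """Greedy roll-out of a trained policy: take the argmax-Q action at each step."""
    x, y, heading = pose
    for _ in range(max_steps):
        phi = build_state((x, y), heading, goal, read_sensors(), arena_diag=5.0)
        q = q_values(phi)                              # one forward pass of the trained DQN
        v, psi = ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[i])]
        x, y, heading = bicycle_step(x, y, heading, v, psi)
        if math.hypot(x - goal[0], y - goal[1]) < tol:
            return True                                # destination reached
    return False                                       # step budget exhausted
```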