Final Report
ON
“Explorations of Deep Reinforcement Learning (DRL) models, for Industrial
Robots”
BY
Sriram B 2024H1410105P
ABSTRACT
1. INTRODUCTION
2. INTRODUCTION TO DEEP REINFORCEMENT LEARNING
2.1 MDP, Q-NETWORK, BELLMAN EQUATION AND POLICY GRADIENT
3. BLOCKS OF DRL:
3.1 AGENT
3.2 ENVIRONMENT
3.3 STATE
3.4 ACTION
3.5 REWARD, POLICY, VALUE FUNCTION, AND DISCOUNT FACTOR
4. MODELS OF DRL:
4.1 VALUE-BASED METHODS
4.1.1 DOUBLE DEEP Q-NETWORK (DDQN)
4.2 ACTOR-CRITIC METHODS
4.2.1 DEEP DETERMINISTIC POLICY GRADIENT
4.2.2 ENERGY EFFICIENT TRAJECTORY PLANNING
4.2.3 PROXIMAL POLICY OPTIMISATION
4.3 TRUST REGION POLICY OPTIMIZATION
5. CONCLUSION
1. INTRODUCTION
The robotic modelling and control field has witnessed numerous leaps in understanding and advancement in
every process stage, from construction to deployment. These advances have provided renewed impetus,
driving more research into versatile applications of robotics, catering to various demands of the modern world.
Natural Language Processing (NLP), Collaborative Robots (Cobots), Digital Twins, Soft Robotics, Deep
Learning, and Reinforcement Learning have all resulted in significant advances in robotics, particularly for
tasks requiring complex sensory inputs and elaborate action spaces. A concentric circle graph shows the
hierarchical links between different AI concepts.
2. INTRODUCTION TO DEEP REINFORCEMENT LEARNING
2.1 MDP, Q-NETWORK, BELLMAN EQUATION AND POLICY GRADIENT
2.1.1 MARKOV DECISION PROCESS (MDP)
In a DRL model, the Markov Decision Process (MDP) is the mathematical framework commonly used to define the decision-making problem. It models how an agent interacts with the environment and how the environment responds to those interactions. The MDP serves as the cornerstone of DRL, allowing the agent to learn by exploring and exploiting its surroundings. The aim is to teach the agent a policy π(a|s) that maximises the expected cumulative reward over time.
2.1.2 Q-NETWORK IN DRL
A Q-Network is a neural network used in Deep Q-Learning to represent the Q-function. The Q-function, Q(s, a), gives the expected total reward (return) obtained by starting in state s, taking action a, and thereafter following a given policy. The Q-Network approximates this function in high-dimensional state spaces; it is trained with backpropagation, using targets derived from the Bellman equation.
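As a rough illustration of this idea, the sketch below defines a Q-network in PyTorch; the layer widths and the state/action dimensions are placeholders rather than values taken from any of the cited works.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)   # shape: (batch, n_actions)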
2.1.3 BELLMAN EQUATION FOR THE Q-FUNCTION:
The Bellman equation for the Q-function is a recursive relationship that defines how to calculate the Q-value for each state-action pair by taking into account the immediate reward and the discounted future Q-value of the next state:
Q(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')    …1
Here:
● Q(s_t, a_t) is the Q-value for taking action a_t in state s_t.
● r_t is the immediate reward.
● γ is the discount factor.
● max_{a'} Q(s_{t+1}, a') is the maximum Q-value for the next state s_{t+1}, given that the agent follows the optimal policy.
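As a concrete illustration of this target, the snippet below computes y_t = r_t + γ max_{a'} Q(s_{t+1}, a') for a small batch; the numbers are made up purely to show the arithmetic.

import torch

gamma = 0.99
r_t = torch.tensor([1.0, 0.0])                 # immediate rewards for two samples
q_next = torch.tensor([[0.5, 2.0],             # Q(s_{t+1}, a') over two actions
                       [1.5, 0.3]])
y_t = r_t + gamma * q_next.max(dim=1).values   # Bellman targets
# y_t == [1.0 + 0.99 * 2.0, 0.0 + 0.99 * 1.5] == [2.98, 1.485]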
2.1.4 POLICY GRADIENT
A family of reinforcement learning algorithms known as policy gradient methods updates the policy parameters in the direction of the gradient of the expected return, thus directly optimising the policy. These methods are particularly helpful when learning a complicated, stochastic policy or when the action space is continuous. In policy gradient approaches, the goal is to maximise the expected return
J(θ) = E_{τ∼π_θ}[ Σ_t γ^t r_t ]    …2
where J(θ) is the expected return and θ denotes the parameters of the policy network.
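A minimal REINFORCE-style sketch of such an update is given below, assuming a PyTorch policy network that outputs action logits; the function name and hyperparameters are illustrative, not taken from the report.

import torch

def reinforce_update(policy, optimizer, states, actions, returns):
    """states: (T, state_dim), actions: (T,) int64, returns: (T,) discounted returns G_t."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on J(θ): minimise the negative of E[G_t * log π(a_t|s_t)].
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()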
3. BLOCKS OF DRL:
3.1 AGENT
In DRL, the agent is the entity that learns and makes decisions. The agent interacts with the environment through action selection, state observation, and reward acquisition. The agent's goal is to discover a strategy that maximises the cumulative long-term reward. The agent consists of two primary parts:
● Policy: A function (or network) that maps states to actions. In DRL, this is commonly represented by a neural network.
● Value Function (or Q-function): A function that estimates the expected return (cumulative reward) for a state or state-action pair. This is used to decide which action to take.
3.2 ENVIRONMENT
The Environment outlines the problem that the agent is attempting to address. It is the universe with which
the agent interacts, where the agent performs activities and receives feedback in the form of reward. The
environment follows the principles of a Markov Decision Process (MDP), offering states and rewards and
transitioning between them based on the agent's action.
3.3 STATE
The state space contains all the potential states the environment can be in. Each state represents a snapshot of
the environment containing all the information the agent requires to make decisions.
3.4 ACTION
An Action is any decision or movement made by the agent within the environment. The action is chosen based on the present state, in accordance with the agent's policy. The agent's purpose is to choose actions that result in larger rewards.
3.5 REWARD, POLICY, VALUE FUNCTION, AND DISCOUNT FACTOR
The following DRL components work together to direct the agent's learning process
3.5.1 REWARD
The Reward is a scalar feedback signal the agent receives after acting in a given state. It indicates how beneficial or harmful the action was in accomplishing the agent's aim. The nature of the environment and task determines whether rewards are immediate or delayed.
3.5.2 POLICY
The Policy describes the agent's decision-making strategy. It represents a mapping from the state space to the action space. A policy can be either deterministic (always returning the same action for a given state) or stochastic (returning a probability distribution over actions).
3.5.3 VALUE FUNCTION
The Value Function estimates how good a state (or state-action pair) is in terms of expected return. Two common forms are:
● State-value function (V(s)): The expected return from state s, following the policy.
● Action-value function (Q(s, a)): The expected return from state s after taking action a, following the policy.
3.5.4 DISCOUNT FACTOR
The Discount factor (γ) weighs the value of future rewards against immediate rewards. It normally ranges from 0 to 1. A lower discount factor (close to 0) means the agent values immediate rewards more than future ones, whereas a value close to 1 encourages long-term planning.
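A small numerical illustration of this effect, with made-up rewards, is given below: a late, large reward barely contributes to the return when γ is small, but dominates it when γ is close to 1.

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):     # G_t = r_t + γ * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]
print(discounted_return(rewards, 0.1))    # ≈ 1.12 : myopic agent
print(discounted_return(rewards, 0.99))   # ≈ 12.67: far-sighted agent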
4. MODELS OF DRL:
4.1 VALUE-BASED METHODS
4.1.1 DOUBLE DEEP Q-NETWORK (DDQN)
Since its introduction, the classic DQN model has been used to solve several challenging technical problems. With the extensive application and development of DQN, numerous enhanced variants have been presented. Hasselt et al. created the Double DQN (DDQN) method, inspired by the double Q-learning technique, to enhance learning efficiency. The DDQN target in Eq. 4 replaces the DQN target function in Eq. 3.
y_t^DQN = r_t + γ max_{a'} Q(s_{t+1}, a'; θ^-)    …3
y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ^-)    …4
where θ denotes the online-network parameters and θ^- the target-network parameters.
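The sketch below contrasts the two targets in PyTorch, assuming discrete-action networks q_online and q_target; the tensor names and shapes are illustrative.

import torch

def dqn_target(q_target, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # DQN (Eq. 3): the target network both selects and evaluates the next action.
        max_q = q_target(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * max_q

def ddqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Double DQN (Eq. 4): the online network selects the action,
        # the target network evaluates it, reducing overestimation bias.
        best_a = q_online(next_states).argmax(dim=1, keepdim=True)
        q_eval = q_target(next_states).gather(1, best_a).squeeze(1)
        return rewards + gamma * (1.0 - dones) * q_eval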
Many transition experiences e_t = (s_t, a_t, r_t, s_{t+1}) are recorded in an experience replay buffer and sampled randomly to update the network parameters. Because individual experiences differ in how informative they are, uniform random sampling makes learning inefficient; the prioritized experience replay mechanism addresses this. The figure shows the algorithm's framework.
Figure 3: The framework of DDQN with the prioritized experience replay mechanism
4.1.1.1 VISUAL PATH-FOLLOWING SKILLS USING A VISUAL SENSOR IN THE ROS SIMULATION ENVIRONMENT
IMPLEMENTATION: "A visual path-following algorithm based on artificial intelligence deep reinforcement
learning double deep Q-network (DDQN)".
The suggested technique enables the robot to develop path-following skills on its own, employing a visual
sensor in the Robot Operating System (ROS) simulation environment. The robot can learn pathways with
various textures, colours, and forms, increasing its versatility for many industrial robot applications.
Skills learnt in simulation may be applied straight to the real world. To evaluate the route-following skill, the
six-jointed robot Universal Robots 5 (UR5) draws a path randomly on the workpiece. The simulation and real-
world experiment findings show that robots can follow paths efficiently and precisely utilising visual input
without knowing the path's characteristics or scripting the path ahead of time.
Simulation environment: As it explores its surroundings, the robotic agent acquires experiences and learns
from them. There are two ways of gathering experience information: the first involves gathering it from a simulation environment, while the second involves gathering it from a real-world setting.
I. The figure shows the construction of a virtual simulation platform based on ROS. It illustrates three training samples with various appearance characteristics, reflecting three distinct task skill-learning scenarios. Skill 1 is learnt from training sample 1; skills 2 and 3 correspond to the second and third training samples.
II. A training sample with a complicated B-spline curve is utilised for skill acquisition.
Figure 4: Path-following simulation environment
Figure 5: Physical experimental platform for path-following
Real experiment: Experimental sample #1 is examined with skill 1, whereas experimental sample #2 is
evaluated using skill 2. The camera's observation centre is fixed at the curve's start before the test. The
computer gathers picture information from the camera using ROS messaging topics and delivers control orders
to the control cabinet to control the robot's activity. The signal flow diagram is depicted in Figure 13.
Throughout the trial, the robot monitors the path through the camera and determines how to execute the action
using the trained deep Q network. Following the output action, the camera's observation centre will move to
a new location on the route. The robot then proceeds to observe and follow the course.
The TCP (tool centre point) coordinates are recorded and utilised as interpolation points during the experiment. The figure
shows the experimental findings of simulated laser cutting. They confirm that the robot can accomplish route-
following tasks effectively and autonomously in real-world settings. The centre of the laser beam is on the
path's centre-line, indicating that the robot is very good at following paths.
…5
4.2 ACTOR-CRITIC METHODS
4.2.1 DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
4.2.1.1 TARGET REACHING TASK FOR A 2DOF ROBOT
Network components: DDPG employs two networks, the Actor and the Critic. The actor network takes a state s as input and returns an action a; it is used to deterministically approximate the optimal policy. In DDPG, we want the best possible action every time we query the actor network. The critic network
receives an input of a state s and an action a and outputs a Q value. Figure 3 depicts the Actor and Critic
network.
a_t = μ(s_t; θ^μ) + N_t    …6
This noise facilitates exploration. The Ornstein-Uhlenbeck process generates noise correlated with the previous noise value, preventing the noise from cancelling itself out or freezing the overall dynamics.
(c) Actor (policy) and Critic (value) network updates: The Critic network is updated similarly to Q-learning; the Bellman equation yields the target Q value
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'})    …7
where Q' and μ' denote the target critic and target actor networks.
(d) Target network: If the actor and critic network weights are updated directly using gradients computed from the TD-error signal (which depends on samples from the replay buffer and on the outputs of the actor and critic networks themselves), the learning method diverges or fails to train at all. It was found that employing a set of target networks to produce the targets for the TD-error computation improves the learning algorithm's stability. In DDPG, the target networks are updated using a soft-update technique, which gradually blends the regular network weights into the target network weights, as shown by
θ^{Q'} ← τ θ^{Q} + (1 - τ) θ^{Q'}    …8
θ^{μ'} ← τ θ^{μ} + (1 - τ) θ^{μ'}    …9
with τ ≪ 1.
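A short sketch of this soft update for PyTorch modules is shown below; τ = 0.005 is a commonly used value and is only an assumption here.

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.005):
    # θ_target ← τ * θ_online + (1 - τ) * θ_target  (Eqs. 8 and 9)
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.mul_(1.0 - tau).add_(tau * src)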
IMPLEMENTATION:
The simulation environment for this case is depicted in the figure. It is a two-dimensional robotic arm with two
arm segments connected by a joint. Another joint links the entire arm to a black circle in the centre of the
simulation area. The red circle represents the target. The green circle depicts the end-effector, the location of
an actual robot's hand or gripper. Target reaching aims to move the end-effector to the target's position via
regulating the joints.
Regardless of the number of iterations, the PPO network takes less time to train than the DDPG network. The reward r2 was utilised to guarantee that the joints moved smoothly throughout the task. This property was verified for the first two types of rewards by comparing their joint velocities for the same experiment, as seen in Fig. 9.
An example of training and testing the controller using the DDPG method is shown below:
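Since the original figure is not reproduced here, a condensed pseudo-Python sketch of such a training and testing loop is given instead; the environment interface, the Actor/Critic/replay-buffer classes, and all hyperparameters are assumptions, not the report's actual setup.

import numpy as np

def train_ddpg(env, actor, critic, replay_buffer, episodes=500, noise_scale=0.1):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Deterministic action plus exploration noise (Eq. 6); Gaussian noise is
            # used here for brevity, whereas the text describes Ornstein-Uhlenbeck noise.
            action = actor.act(state) + noise_scale * np.random.randn(env.action_dim)
            next_state, reward, done = env.step(action)
            replay_buffer.add(state, action, reward, next_state, done)
            if len(replay_buffer) > 1000:
                batch = replay_buffer.sample(64)
                critic.update(batch, actor)     # Bellman target, Eq. 7
                actor.update(batch, critic)     # deterministic policy gradient
                critic.soft_update_target()     # Eqs. 8 and 9
                actor.soft_update_target()
            state = next_state

def test_ddpg(env, actor, episodes=10):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            state, _, done = env.step(actor.act(state))   # no exploration noise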
4.2.2 ENERGY EFFICIENT TRAJECTORY PLANNING
A reinforcement learning agent interacts with the environment while continuously optimizing its strategy by
receiving immediate rewards. This process helps the agent find a state trajectory that maximizes the total
return. The DRL method used here utilizes Deep Neural Networks (DNN) to estimate state or action value
functions, extending the state space of a traditional reinforcement learning agent from discrete to continuous.
DNN also facilitates policy expression, making the agent handle continuous action scenarios.
The combined use of DNNs to approximate both the policy and the value function is known as the Actor-Critic method. This method is well suited to optimizing complex systems with continuous inputs and outputs. Popular actor-critic algorithms include Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3).
This approach achieves significant energy savings (a 23.21% reduction compared to default trajectories) without the computational overhead of traditional nonlinear methods. It simplifies the dynamic model for improved training efficiency and lower computational cost.
The main advantage of this method is that it achieves real-time trajectory generation, in contrast with slower traditional optimization techniques such as genetic algorithms or dynamic programming. Its main limitation is that it is effective only in fixed scenarios and requires retraining the model when the robot parameters or the environment change.
4.2.3 PROXIMAL POLICY OPTIMISATION
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm designed to optimize policies
efficiently and reliably by improving upon Trust Region Policy Optimization (TRPO). PPO uses a clipped
objective function to limit the size of policy updates, ensuring they stay within a safe range without requiring
the computational complexity of TRPO's trust region constraints. The algorithm alternates between collecting
trajectories through the current policy and updating the policy using a simplified surrogate objective, which
balances exploration and exploitation. PPO is computationally efficient, easy to implement, and works well
in environments with continuous or discrete action spaces.
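A minimal sketch of the clipped surrogate loss is shown below, assuming PyTorch tensors of per-step log-probabilities and advantage estimates; ε = 0.2 is the commonly used default, not a value from the cited work.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)              # π_θ / π_θ_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the element-wise minimum keeps updates within a safe range of the old policy.
    return -torch.min(unclipped, clipped).mean()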
In the study of trajectory tracking for robotic arms, traditional tracking methods have low accuracy and cannot realise complex tracking tasks. Compared with traditional methods, deep reinforcement learning is an effective scheme, with the advantages of robustness and the ability to solve complex problems. Deep reinforcement learning continuous-control algorithms mainly include asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO).
The A3C algorithm is an asynchronous, multi-threaded algorithm that optimizes the actor-critic method with better convergence. However, A3C must first run the collected data to calculate the gradient and only then update the global network. To make up for this shortcoming of A3C, the PPO algorithm, which is based on the policy gradient (PG), performs the policy update offline, transforming the on-policy update into an off-policy one and adding constraints to the objective function.
The pseudocode of the Improved-PPO algorithm is shown in Table 1. KL represents the divergence between the policy before and after the update. When KL falls below half of the target value, the penalty coefficient β is halved; when KL exceeds twice the target value, β is doubled.
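A small sketch of this adaptive penalty rule follows; the threshold factors (half and twice the target) mirror the description above, and the exact constants used in the Improved-PPO paper may differ.

def update_beta(beta: float, kl: float, kl_target: float) -> float:
    """Adapt the KL-penalty coefficient after each policy update."""
    if kl < 0.5 * kl_target:
        beta *= 0.5        # policy changed too little: relax the penalty
    elif kl > 2.0 * kl_target:
        beta *= 2.0        # policy changed too much: strengthen the penalty
    return beta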
The trajectory tracking curve of the end-effector of the robotic arm is shown in Fig. 12. The solid blue line in Fig. 12a is the expected trajectory of the robotic arm, and the solid red line in Fig. 12b shows the actual trajectory.
Figure 12: The end trajectory tracking curve of the robotic arm; a) the expected trajectory of the robotic arm;
b) the actual trajectory of the robotic arm.
Results and Conclusion: The improved PPO algorithm is employed for trajectory tracking. The Improved-
PPO algorithm is further compared with the A3C and PPO algorithms. Compared with the A3C algorithm,
the improved PPO algorithm increased the convergence speed by 84.3% and the reward value by 77.8%.
Compared with the PPO algorithm, the Improved-PPO algorithm improved the convergence speed by 15.4%
and increased the reward value by 54.2%. The simulation results show that the Improved-PPO algorithm
outperforms the A3C and PPO algorithms for robotic arm trajectory tracking. The improved PPO algorithm
converges faster and has the shortest trajectory tracking time. This method provides a new research idea for
robotic arm trajectory tracking.
4.3 TRUST REGION POLICY OPTIMIZATION
Hindsight Trust Region Policy Optimization (HTRPO) extends trust-region policy optimization to sparse-reward, goal-conditioned manipulation tasks. When dealing with dynamic-object tasks, HTRPO solves the tasks under low-speed motion quite well; however, tasks under high-speed motion remain challenging for policy gradient methods and are left to future work.
Experiment
The four challenging sparse-reward environments include two types of tasks: one is a reaching task with obstacles, and the other comprises three dynamic-object tasks. Both types of tasks are goal-conditioned, meaning the robot receives a goal observation at every time step. For every task, the reward is binary: when the robot successfully reaches the goal within tolerance ε, the environment gives a reward of 0; otherwise it gives -1.
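The binary reward described above can be written as a one-line function; the tolerance value below is only a placeholder.

import numpy as np

def sparse_reward(achieved_pos, goal_pos, eps=0.05):
    """Return 0 on success (within tolerance eps of the goal), -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved_pos - goal_pos) < eps else -1.0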
Figure 13: Four challenging sparse-reward tasks on 7-DoF robot arms. (a) The red block is the goal, surrounded by three bigger blocks representing obstacles. (b)(c)(d): The red ball is the goal, moving along a certain trajectory (as the red arrow indicates) at different speeds. In (b) and (c) the robot tries to reach the moving goal, while in (d) it needs to push the dark block to the moving goal.
Obstacle-Reaching. A 7-DoF Sawyer robot is fixed on a base. The state space contains the speeds and angles of the robot arm and the objects' positions and quaternions, represented by a vector of 70 dimensions. The action is an eight-dimensional vector that controls the movement of the joints and gripper. The goal of the Sawyer robot is to touch the red block, which is surrounded by obstacles.
Dy-Reaching. A 3-link Fetch robot arm is fixed on a base. The state space has the same structure as in Obstacle-Reaching but only 10 dimensions. The action is a four-dimensional vector that controls the gripper's movement; the gripper remains closed throughout the task. The goal represents the desired position in space, given as a three-dimensional vector. The robot's task is to chase the goal, which moves along the x-axis at a speed of 0.005 m/s.
Dy-Circling. The state and action dimensions are the same as in the Dy-Reaching task. The robot's goal is to reach the target, which moves in a circle at a speed of 0.005 m/s.
Dy-Pushing. The action dimension is the same as in the Dy-Reaching task, while the state has 25 dimensions. The robot needs to push the dark block to the position of the red goal, which moves at a speed of 0.002 m/s.
5. CONCLUSION
In conclusion, the application of Deep Reinforcement Learning (DRL) in industrial robots represents a promising frontier for enhancing robotic structure and performance in complex, real-world environments. However, several challenges remain, particularly in terms of sample efficiency, safety, and the scalability of models to handle diverse industrial settings. Further advancements are needed in transfer learning and domain adaptation to ensure that DRL models can seamlessly transition from virtual to physical spaces.
The future of DRL in industrial robots is likely to be shaped by improvements in model robustness, the
development of hybrid approaches combining DRL with classical control methods, and innovations in
hardware that enable faster computation and real-time decision-making. Continued research into these areas
will pave the way for more intelligent, flexible, and efficient robotic systems that can handle the complexities
of advanced industrial tasks.
REFERENCES
[1] Mehran Taghian and Shotaro Miwa, “Explainability of deep reinforcement learning algorithms in robotic domains by using Layer-wise Relevance Propagation,” Engineering Applications of Artificial Intelligence, vol. 137 (2024), 109131.
[2] Guoliang Liu et al., “Learning visual path-following skills for industrial robot using deep reinforcement learning,” International Journal of Advanced Manufacturing Technology, vol. 122 (2022), pp. 1099-1111.
[3] Deepak Raina and Subir Kumar Saha, “AI-Based Modeling and Control of Robotic Systems: A Brief Tutorial,” 2021 3rd International Conference on Robotics and Computer Vision (ICRCV), IEEE, 2021.
[5] Xiaolong Wang et al., “Energy-efficient trajectory planning for a class of industrial robots using parallel deep reinforcement learning,” Nonlinear Dynamics (2024), pp. 1-21.
[6] Deyu Yang, Hanbo Zhang, and Xuguang Lan, “Research on Complex Robot Manipulation Tasks Based on Hindsight Trust Region Policy Optimization,” 2020 Chinese Automation Congress (CAC), IEEE, 2020.
[7] Qingchun Zheng et al., “Robotic arm trajectory tracking method based on improved proximal policy optimization,” Proceedings of the Romanian Academy, Series A: Mathematics, Physics, Technical Sciences, Information Science, 24.3 (2023).
[8] Chayan Banerjee and Zhiyong Chen, “Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 3, March 2024.