Final Report
ON
“Explorations of Deep Reinforcement Learning (DRL) models, for Industrial
Robots”
BY
Sriram B 2024H1410105P
ABSTRACT
1. INTRODUCTION
2. INTRODUCTION TO DEEP REINFORCEMENT LEARNING
2.1 MDP, Q-NETWORK, BELLMAN EQUATION AND POLICY GRADIENT
3. BLOCKS OF DRL:
3.1 AGENT
3.2 ENVIRONMENT
3.3 STATE
3.4 ACTION
3.5 REWARD, POLICY, VALUE FUNCTION, AND DISCOUNT FACTOR
4. MODELS OF DRL:
4.1 VALUE-BASED METHODS
4.1.1 DOUBLE DEEP Q-NETWORK (DDQN)
4.2 ACTOR-CRITIC METHODS
4.2.1 DEEP DETERMINISTIC POLICY GRADIENT
4.2.2 ENERGY EFFICIENT TRAJECTORY PLANNING
4.2.3 PROXIMAL POLICY OPTIMISATION
4.3 TRUST REGION POLICY OPTIMIZATION
5. CONCLUSION
1. INTRODUCTION
The robotic modelling and control field has witnessed numerous leaps in understanding and advancement in
every process stage, from construction to deployment. These advances have provided renewed impetus,
driving more research into versatile applications of robotics, catering to various demands of the modern world.
Natural Language Processing (NLP), Collaborative Robots (Cobots), Digital Twins, Soft Robotics, Deep
Learning, and Reinforcement Learning have all resulted in significant advances in robotics, particularly for
tasks requiring complex sensory inputs and elaborate action spaces. A concentric circle graph shows the
hierarchical links between different AI concepts.
2. INTRODUCTION TO DEEP REINFORCEMENT LEARNING
2.1 MDP, Q-NETWORK, BELLMAN EQUATION AND POLICY GRADIENT
2.1.1 MARKOV DECISION PROCESS (MDP)
In a DRL model, the Markov Decision Process (MDP) is the mathematical framework commonly used to define the decision-making problem. It models how an agent interacts with the environment and how the environment responds to those interactions. The MDP serves as the cornerstone of DRL, allowing the agent to learn by exploring and exploiting its surroundings. The aim is to teach the agent a policy π(a|s) that maximises the expected cumulative reward over time.
2.1.2 Q-NETWORK IN DRL
A Q-Network is a neural network used in Deep Q-Learning to represent the Q-function. The Q-function, Q(s, a), gives the expected total reward (return) obtained by starting in state s, taking action a, and thereafter following a given policy. The Q-Network approximates this function in high-dimensional state spaces; it is trained with backpropagation, using targets derived from the Bellman equation.
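As a rough illustration of this idea, the sketch below defines a Q-network in PyTorch; the layer widths and the state/action dimensions are placeholders rather than values taken from any of the cited works.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)   # shape: (batch, n_actions)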
2.1.3 BELLMAN EQUATION FOR THE Q-FUNCTION:
The Bellman equation for the Q-function is a recursive relationship that defines how to calculate the Q-value for each state-action pair by taking into account the immediate reward and the discounted future Q-value of the next state:
Q(s_t, a_t) = r_t + γ max_{a'} Q(s_{t+1}, a')    …1
Here:
● Q(s_t, a_t) is the Q-value for taking action a_t in state s_t.
● r_t is the immediate reward.
● γ is the discount factor.
● max_{a'} Q(s_{t+1}, a') is the maximum Q-value for the next state s_{t+1}, given that the agent follows the optimal policy.
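As a concrete illustration of this target, the snippet below computes y_t = r_t + γ max_{a'} Q(s_{t+1}, a') for a small batch; the numbers are made up purely to show the arithmetic.

import torch

gamma = 0.99
r_t = torch.tensor([1.0, 0.0])                 # immediate rewards for two samples
q_next = torch.tensor([[0.5, 2.0],             # Q(s_{t+1}, a') over two actions
                       [1.5, 0.3]])
y_t = r_t + gamma * q_next.max(dim=1).values   # Bellman targets
# y_t == [1.0 + 0.99 * 2.0, 0.0 + 0.99 * 1.5] == [2.98, 1.485]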
2.1.4 POLICY GRADIENT
A family of reinforcement learning algorithms known as policy gradient methods updates the policy parameters in the direction of the gradient of the expected return, thus directly optimising the policy. These methods are particularly helpful when learning a complicated, stochastic policy or when the action space is continuous. In policy gradient approaches, the goal is to maximise the expected return
J(θ) = E_{τ∼π_θ}[ Σ_t γ^t r_t ]    …2
where J(θ) is the expected return and θ denotes the parameters of the policy network.
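A minimal REINFORCE-style sketch of such an update is given below, assuming a PyTorch policy network that outputs action logits; the function name and hyperparameters are illustrative, not taken from the report.

import torch

def reinforce_update(policy, optimizer, states, actions, returns):
    """states: (T, state_dim), actions: (T,) int64, returns: (T,) discounted returns G_t."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on J(θ): minimise the negative of E[G_t * log π(a_t|s_t)].
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()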
3. BLOCKS OF DRL:
3.1 AGENT
In DRL, the agent is the entity that learns and makes decisions. The agent interacts with the environment through action selection, state observation, and reward acquisition. The agent's goal is to discover a strategy that maximises the cumulative long-term reward. The agent consists of two primary parts:
● Policy: A function (or network) that maps states to actions. In DRL, this is commonly represented by a neural network.
● Value Function (or Q-function): A function that estimates the expected return (cumulative reward) for a state or state-action pair. This is used to decide which action to take.
3.2 ENVIRONMENT
The Environment outlines the problem that the agent is attempting to address. It is the universe with which
the agent interacts, where the agent performs activities and receives feedback in the form of reward. The
environment follows the principles of a Markov Decision Process (MDP), offering states and rewards and
transitioning between them based on the agent's action.
3.3 STATE
The state space contains all the potential states the environment can be in. Each state represents a snapshot of
the environment containing all the information the agent requires to make decisions.
3.4 ACTION
An Action is any decision or movement made by the agent within the environment. The action is chosen based on the present state, in accordance with the agent's policy. The agent's purpose is to choose actions that result in larger rewards.
3.5 REWARD, POLICY, VALUE FUNCTION, AND DISCOUNT FACTOR
The following DRL components work together to direct the agent's learning process
3.5.1 REWARD
The Reward is a scalar feedback signal the agent receives after acting in a given state. It indicates how beneficial or harmful the action was in accomplishing the agent's aim. The nature of the environment and task determines whether rewards are immediate or delayed.
3.5.2 POLICY
The Policy describes the agent's decision-making strategy. It represents a mapping from the state space to the action space. A policy can be either deterministic (always returning the same action for a given state) or stochastic (returning a probability distribution over actions).
3.5.3 VALUE FUNCTION
The Value Function estimates how good a state (or state-action pair) is in terms of expected return. Two common forms are:
● State-value function (V(s)): The expected return from state s, following the policy.
● Action-value function (Q(s, a)): The expected return from state s after taking action a, following the policy.
3.5.4 DISCOUNT FACTOR
The Discount factor (γ) weighs the value of future rewards against immediate rewards. It normally ranges from 0 to 1. A lower discount factor (close to 0) means the agent values immediate rewards more than future ones, whereas a value close to 1 encourages long-term planning.
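A small numerical illustration of this effect, with made-up rewards, is given below: a late, large reward barely contributes to the return when γ is small, but dominates it when γ is close to 1.

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):     # G_t = r_t + γ * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]
print(discounted_return(rewards, 0.1))    # ≈ 1.12 : myopic agent
print(discounted_return(rewards, 0.99))   # ≈ 12.67: far-sighted agent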
4. MODELS OF DRL:
4.1 VALUE-BASED METHODS
4.1.1 DOUBLE DEEP Q-NETWORK (DDQN)
Since its introduction, the classic DQN model has been used to solve several challenging technical problems. With the extensive application and development of DQN, numerous enhanced variants have been presented. Hasselt et al. created the Double DQN (DDQN) method, inspired by the double Q-learning technique, to enhance learning efficiency. The DDQN target in Eq. 4 replaces the DQN target function in Eq. 3.
y_t^DQN = r_t + γ max_{a'} Q(s_{t+1}, a'; θ^-)    …3
y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_{a'} Q(s_{t+1}, a'; θ); θ^-)    …4
where θ denotes the online-network parameters and θ^- the target-network parameters.
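The sketch below contrasts the two targets in PyTorch, assuming discrete-action networks q_online and q_target; the tensor names and shapes are illustrative.

import torch

def dqn_target(q_target, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # DQN (Eq. 3): the target network both selects and evaluates the next action.
        max_q = q_target(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * max_q

def ddqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Double DQN (Eq. 4): the online network selects the action,
        # the target network evaluates it, reducing overestimation bias.
        best_a = q_online(next_states).argmax(dim=1, keepdim=True)
        q_eval = q_target(next_states).gather(1, best_a).squeeze(1)
        return rewards + gamma * (1.0 - dones) * q_eval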
Many transition experiences e_t = (s_t, a_t, r_t, s_{t+1}) are recorded in an experience replay buffer and sampled randomly to update the network parameters. Because individual experiences differ in how informative they are, uniform random sampling makes learning inefficient; the prioritized experience replay mechanism addresses this. The figure shows the algorithm's framework.
Figure 3: The framework of DDQN with the prioritized experience replay mechanism
4.1.1.1 VISUAL PATH-FOLLOWING SKILLS USING A VISUAL SENSOR IN THE ROS SIMULATION ENVIRONMENT
IMPLEMENTATION: "A visual path-following algorithm based on artificial intelligence deep reinforcement
learning double deep Q-network (DDQN)".
The suggested technique enables the robot to develop path-following skills on its own, employing a visual
sensor in the Robot Operating System (ROS) simulation environment. The robot can learn pathways with
various textures, colours, and forms, increasing its versatility for many industrial robot applications.
Skills learnt in simulation may be applied straight to the real world. To evaluate the route-following skill, the
six-jointed robot Universal Robots 5 (UR5) draws a path randomly on the workpiece. The simulation and real-
world experiment findings show that robots can follow paths efficiently and precisely utilising visual input
without knowing the path's characteristics or scripting the path ahead of time.
Simulation environment: As it explores its surroundings, the robotic agent acquires experiences and learns
from them. There are two ways of gathering experience information: the first involves gathering it from a simulation environment, while the second involves gathering it from a real-world setting.
I. The figure shows the construction of a virtual simulation platform based on ROS. It illustrates three training samples with various appearance characteristics, reflecting three distinct task skill-learning scenarios. Skill 1 is learnt from training sample 1; skills 2 and 3 correspond to the second and third training samples.
II. A training sample with a complicated B-spline curve is utilised for skill acquisition.
Figure 4: Path-following simulation environment
Figure 5: Physical experimental platform for path-following
Real experiment: Experimental sample #1 is examined with skill 1, whereas experimental sample #2 is
evaluated using skill 2. The camera's observation centre is fixed at the curve's start before the test. The
computer gathers picture information from the camera using ROS messaging topics and delivers control orders
to the control cabinet to control the robot's activity. The signal flow diagram is depicted in Figure 13.
Throughout the trial, the robot monitors the path through the camera and determines how to execute the action
using the trained deep Q network. Following the output action, the camera's observation centre will move to
a new location on the route. The robot then proceeds to observe and follow the course.
The TCP (tool centre point) coordinates are recorded and utilised as interpolation points during the experiment. The figure
shows the experimental findings of simulated laser cutting. They confirm that the robot can accomplish route-
following tasks effectively and autonomously in real-world settings. The centre of the laser beam is on the
path's centre-line, indicating that the robot is very good at following paths.
…5
4.2 ACTOR-CRITIC METHODS
4.2.1 DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
4.2.1.1 TARGET REACHING TASK FOR A 2DOF ROBOT
Network components: DDPG employs two networks, the Actor and the Critic. The actor network takes a state s as input and returns an action a; it is used to deterministically approximate the optimal policy. In DDPG, we want the best possible action every time we query the actor network. The critic network
receives an input of a state s and an action a and outputs a Q value. Figure 3 depicts the Actor and Critic
network.
a_t = μ(s_t; θ^μ) + N_t    …6
This noise facilitates exploration. The Ornstein-Uhlenbeck process generates noise correlated with the previous noise value, preventing the noise from cancelling itself out or freezing the overall dynamics.
(c) Actor (policy) and Critic (value) network updates: The Critic network is updated similarly to Q-learning; the Bellman equation yields the target Q value
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'})    …7
where Q' and μ' denote the target critic and target actor networks.
(d) Target network: If the actor and critic network weights are updated directly using gradients computed from the TD-error signal (which depends on samples from the replay buffer and on the outputs of the actor and critic networks themselves), the learning method diverges or fails to train at all. It was found that employing a set of target networks to produce the targets for the TD-error computation improves the learning algorithm's stability. In DDPG, the target networks are updated using a soft-update technique, which gradually blends the regular network weights into the target network weights, as shown by
θ^{Q'} ← τ θ^{Q} + (1 - τ) θ^{Q'}    …8
θ^{μ'} ← τ θ^{μ} + (1 - τ) θ^{μ'}    …9
with τ ≪ 1.
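A short sketch of this soft update for PyTorch modules is shown below; τ = 0.005 is a commonly used value and is only an assumption here.

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau: float = 0.005):
    # θ_target ← τ * θ_online + (1 - τ) * θ_target  (Eqs. 8 and 9)
    for tgt, src in zip(target_net.parameters(), online_net.parameters()):
        tgt.mul_(1.0 - tau).add_(tau * src)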
IMPLEMENTATION:
The simulation environment for this case is depicted in the figure. It is a two-dimensional robotic arm with two
arm segments connected by a joint. Another joint links the entire arm to a black circle in the centre of the
simulation area. The red circle represents the target. The green circle depicts the end-effector, the location of
an actual robot's hand or gripper. Target reaching aims to move the end-effector to the target's position via
regulating the joints.
Regardless of the number of iterations, the PPO network takes less time to train than the DDPG network. The reward r2 was utilised to guarantee that the joints moved smoothly throughout the task. This property was verified for the first two types of rewards by comparing their joint velocities for the same experiment, as seen in Fig. 9.
An example of training and testing the controller using the DDPG method is shown below:
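Since the original figure is not reproduced here, a condensed pseudo-Python sketch of such a training and testing loop is given instead; the environment interface, the Actor/Critic/replay-buffer classes, and all hyperparameters are assumptions, not the report's actual setup.

import numpy as np

def train_ddpg(env, actor, critic, replay_buffer, episodes=500, noise_scale=0.1):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Deterministic action plus exploration noise (Eq. 6); Gaussian noise is
            # used here for brevity, whereas the text describes Ornstein-Uhlenbeck noise.
            action = actor.act(state) + noise_scale * np.random.randn(env.action_dim)
            next_state, reward, done = env.step(action)
            replay_buffer.add(state, action, reward, next_state, done)
            if len(replay_buffer) > 1000:
                batch = replay_buffer.sample(64)
                critic.update(batch, actor)     # Bellman target, Eq. 7
                actor.update(batch, critic)     # deterministic policy gradient
                critic.soft_update_target()     # Eqs. 8 and 9
                actor.soft_update_target()
            state = next_state

def test_ddpg(env, actor, episodes=10):
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            state, _, done = env.step(actor.act(state))   # no exploration noise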
4.2.2 ENERGY EFFICIENT TRAJECTORY PLANNING
A reinforcement learning agent interacts with the environment while continuously optimizing its strategy by
receiving immediate rewards. This process helps the agent find a state trajectory that maximizes the total
return. The DRL method used here utilizes Deep Neural Networks (DNN) to estimate state or action value
functions, extending the state space of a traditional reinforcement learning agent from discrete to continuous.
DNN also facilitates policy expression, making the agent handle continuous action scenarios.
The combined use of DNNs to approximate both the policy and the value function is known as the Actor-Critic method. This method is well suited to optimizing complex systems with continuous inputs and outputs. Popular actor-critic algorithms include Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Twin Delayed DDPG (TD3).
This approach achieves significant energy savings (a 23.21% reduction compared to default trajectories) without the computational overhead of traditional nonlinear methods. It simplifies the dynamic model for improved training efficiency and lower computational cost.
The main advantage of this method is that it achieves real-time trajectory generation, in contrast with slower traditional optimization techniques such as genetic algorithms or dynamic programming. Its main limitation is that it is effective only in fixed scenarios and requires retraining the model when the robot parameters or the environment change.
4.2.3 PROXIMAL POLICY OPTIMISATION
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm designed to optimize policies
efficiently and reliably by improving upon Trust Region Policy Optimization (TRPO). PPO uses a clipped
objective function to limit the size of policy updates, ensuring they stay within a safe range without requiring
the computational complexity of TRPO's trust region constraints. The algorithm alternates between collecting
trajectories through the current policy and updating the policy using a simplified surrogate objective, which
balances exploration and exploitation. PPO is computationally efficient, easy to implement, and works well
in environments with continuous or discrete action spaces.
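A minimal sketch of the clipped surrogate loss is shown below, assuming PyTorch tensors of per-step log-probabilities and advantage estimates; ε = 0.2 is the commonly used default, not a value from the cited work.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)              # π_θ / π_θ_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the element-wise minimum keeps updates within a safe range of the old policy.
    return -torch.min(unclipped, clipped).mean()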
In the study of trajectory tracking for robotic arms, traditional tracking methods have low accuracy and cannot realise complex tracking tasks. Compared with traditional methods, deep reinforcement learning is an effective scheme, with the advantages of robustness and the ability to solve complex problems. Deep reinforcement learning continuous-control algorithms mainly include asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO).
The A3C algorithm is an asynchronous, multi-threaded algorithm that optimizes the actor-critic method with better convergence. However, A3C must first run the collected data to calculate the gradient and only then update the global network. To make up for this shortcoming of A3C, the PPO algorithm, which is based on the policy gradient (PG), performs the policy update offline, transforming the on-policy update into an off-policy one and adding constraints to the objective function.
The pseudocode of the Improved-PPO algorithm is shown in Table 1. KL represents the divergence between the policy before and after the update. When KL falls below half of the target value, the penalty coefficient β is halved; when KL exceeds twice the target value, β is doubled.
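A small sketch of this adaptive penalty rule follows; the threshold factors (half and twice the target) mirror the description above, and the exact constants used in the Improved-PPO paper may differ.

def update_beta(beta: float, kl: float, kl_target: float) -> float:
    """Adapt the KL-penalty coefficient after each policy update."""
    if kl < 0.5 * kl_target:
        beta *= 0.5        # policy changed too little: relax the penalty
    elif kl > 2.0 * kl_target:
        beta *= 2.0        # policy changed too much: strengthen the penalty
    return beta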
The trajectory tracking curve of the end-effector of the robotic arm is shown in Fig. 12. The solid blue line in Fig. 12a is the expected trajectory of the robotic arm, and the solid red line in Fig. 12b shows the actual trajectory.
Figure 12: The end trajectory tracking curve of the robotic arm; a) the expected trajectory of the robotic arm;
b) the actual trajectory of the robotic arm.
Results and Conclusion: The improved PPO algorithm is employed for trajectory tracking. The Improved-
PPO algorithm is further compared with the A3C and PPO algorithms. Compared with the A3C algorithm,
the improved PPO algorithm increased the convergence speed by 84.3% and the reward value by 77.8%.
Compared with the PPO algorithm, the Improved-PPO algorithm improved the convergence speed by 15.4%
and increased the reward value by 54.2%. The simulation results show that the Improved-PPO algorithm
outperforms the A3C and PPO algorithms for robotic arm trajectory tracking. The improved PPO algorithm
converges faster and has the shortest trajectory tracking time. This method provides a new research idea for
robotic arm trajectory tracking.
4.3 TRUST REGION POLICY OPTIMIZATION
Hindsight Trust Region Policy Optimization (HTRPO) extends trust-region policy optimization to sparse-reward, goal-conditioned manipulation tasks. When dealing with dynamic-object tasks, HTRPO solves the tasks under low-speed motion quite well; however, tasks under high-speed motion remain challenging for policy gradient methods and are left to future work.
Experiment
The four challenging sparse-reward environments include two types of tasks: one is a reaching task with obstacles, and the other comprises three dynamic-object tasks. Both types of tasks are goal-conditioned, meaning the robot receives a goal observation at every time step. For every task, the reward is binary: when the robot successfully reaches the goal within tolerance ε, the environment gives a reward of 0; otherwise it gives -1.
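The binary reward described above can be written as a one-line function; the tolerance value below is only a placeholder.

import numpy as np

def sparse_reward(achieved_pos, goal_pos, eps=0.05):
    """Return 0 on success (within tolerance eps of the goal), -1 otherwise."""
    return 0.0 if np.linalg.norm(achieved_pos - goal_pos) < eps else -1.0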
Figure 13: Four challenging sparse-reward tasks on 7-DoF robot arms. (a) The red block is the goal, surrounded by three bigger blocks representing obstacles. (b)(c)(d): The red ball is the goal, moving along a certain trajectory (as the red arrow indicates) at different speeds. In (b) and (c) the robot tries to reach the moving goal, while in (d) it needs to push the dark block to the moving goal.
Obstacle-Reaching. A 7-DoF Sawyer robot is fixed on a base. The state space contains the speeds and angles of the robot arm and the objects' positions and quaternions, represented by a vector of 70 dimensions. The action is an eight-dimensional vector that controls the movement of the joints and gripper. The goal of the Sawyer robot is to touch the red block, which is surrounded by obstacles.
Dy-Reaching. A 3-link Fetch robot arm is fixed on a base. The state space has the same structure as in Obstacle-Reaching but only 10 dimensions. The action is a four-dimensional vector that controls the gripper's movement; the gripper remains closed throughout the task. The goal represents the desired position in space, given as a three-dimensional vector. The robot's task is to chase the goal, which moves along the x-axis at a speed of 0.005 m/s.
Dy-Circling. The state and action dimensions are the same as in the Dy-Reaching task. The robot's goal is to reach the target, which moves in a circle at a speed of 0.005 m/s.
Dy-Pushing. The action dimension is the same as in the Dy-Reaching task, while the state has 25 dimensions. The robot needs to push the dark block to the position of the red goal, which moves at a speed of 0.002 m/s.
5. CONCLUSION
In conclusion, the application of Deep Reinforcement Learning (DRL) in industrial robots represents a promising frontier for enhancing robotic structure and performance in complex, real-world environments. However, several challenges remain, particularly in terms of sample efficiency, safety, and the scalability of models to handle diverse industrial settings. Further advancements are needed in transfer learning and domain adaptation to ensure that DRL models can seamlessly transition from virtual to physical spaces.
The future of DRL in industrial robots is likely to be shaped by improvements in model robustness, the
development of hybrid approaches combining DRL with classical control methods, and innovations in
hardware that enable faster computation and real-time decision-making. Continued research into these areas
will pave the way for more intelligent, flexible, and efficient robotic systems that can handle the complexities
of advanced industrial tasks.
REFERENCES
[1] Mehran Taghian and Shotaro Miwa, “Explainability of deep reinforcement learning algorithms in robotic domains by using Layer-wise Relevance Propagation,” Engineering Applications of Artificial Intelligence, vol. 137 (2024), 109131.
[2] Guoliang Liu et al., “Learning visual path-following skills for industrial robot using deep reinforcement learning,” International Journal of Advanced Manufacturing Technology, vol. 122 (2022), pp. 1099-1111.
[3] Deepak Raina and Subir Kumar Saha, “AI-Based Modeling and Control of Robotic Systems: A Brief Tutorial,” 2021 3rd International Conference on Robotics and Computer Vision (ICRCV), IEEE, 2021.
[5] Xiaolong Wang et al., “Energy-efficient trajectory planning for a class of industrial robots using parallel deep reinforcement learning,” Nonlinear Dynamics (2024), pp. 1-21.
[6] Deyu Yang, Hanbo Zhang, and Xuguang Lan, “Research on Complex Robot Manipulation Tasks Based on Hindsight Trust Region Policy Optimization,” 2020 Chinese Automation Congress (CAC), IEEE, 2020.
[7] Qingchun Zheng et al., “Robotic arm trajectory tracking method based on improved proximal policy optimization,” Proceedings of the Romanian Academy, Series A: Mathematics, Physics, Technical Sciences, Information Science, 24.3 (2023).
[8] Chayan Banerjee and Zhiyong Chen, “Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 3, March 2024.