\addbibresource

bibliography/references.bib

Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion

Ho Jae Lee1, Seungwoo Hong1, and Sangbae Kim1 1All authors are with the Biomimetic Robotics Lab, Mechanical Engineering Department, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA (e-mail: {hjlee201, swhong, sangbae}@mit.edu). This work was supported by NAVER Labs.
Abstract

In this work, we introduce a control framework that combines model-based footstep planning with Reinforcement Learning (RL), leveraging desired footstep patterns derived from the Linear Inverted Pendulum (LIP) dynamics. Utilizing the LIP model, our method forward predicts robot states and determines the desired foot placement given the velocity commands. We then train an RL policy to track the foot placements without following the full reference motions derived from the LIP model. This partial guidance from the physics model allows the RL policy to integrate the predictive capabilities of the physics-informed dynamics and the adaptability characteristics of the RL controller without overfitting the policy to the template model. Our approach is validated on the MIT Humanoid, demonstrating that our policy can achieve stable yet dynamic locomotion for walking and turning. We further validate the adaptability and generalizability of our policy by extending the locomotion task to unseen, uneven terrain. During the hardware deployment, we have achieved forward walking speeds of up to 1.5 m/s on a treadmill and have successfully performed dynamic locomotion maneuvers such as 90-degree and 180-degree turns.

I INTRODUCTION

Legged biological systems are capable of navigating through unstructured, complex, and discontinuous terrains, such as stepping stones. In the realm of legged robotics, researchers have long strived to enable legged robots to achieve mobility comparable to their natural counterparts, which would provide numerous practical real-world applications. However, designing controllers for legged robots is non-trivial because they have high degrees-of-freedom and their under-actuated floating base can only be indirectly controlled through external contact wrenches, making their equations of motion highly nonlinear and non-smooth.

Refer to caption
Figure 1: Our control hierarchy that employs a 3D-LIPM to determine the desired footstep location for locomotion. We train an RL policy to track the given steps and deploy the policy on MIT Humanoid.

Model-based control approaches, such as reactive and predictive control methods, have emerged as a highly effective strategy for solving these complex control challenges, showcasing remarkable performance in quadrupedal and bipedal robotics applications [kajita20013d, hof2008extrapolated, englsberger2011bipedal, semini2016design, herdt2010online, rutschmann2012nonlinear, di2018dynamic, hong2022agile, kuindersma2016optimization]. The major advantage of the model-based control approach lies in leveraging insights from physics models to predict robot behavior, thereby enhancing controller design. In particular, foot placement emerges as one of the main components in model-based control on uneven or discontinuous terrain, providing an interface to control the robot through contact forces. Numerous studies have successfully used simplified models, such as the linear inverted pendulum model (LIPM) [kajita20013d, hof2008extrapolated, englsberger2011bipedal] to calculate foot placements or utilized optimization-based approaches that leverage simplified models like spring-loaded inverted pendulum (SLIP) [rutschmann2012nonlinear], single rigid body dynamics (SRBD) [bledt2019implementing], centroidal dynamics [grandia2023perceptive] for simultaneous computation of foot placements and contact forces. However, these model-based control strategies that purely rely on simplified dynamics are inherently constrained by their simplifications and model mismatches, resulting in conservative locomotion that does not fully exploit the robot’s capabilities.

Parallel to these developments, model-free Reinforcement Learning (RL) has emerged as a powerful tool for robotic control, demonstrating remarkable success in managing complex, dynamic environments [kober2013reinforcement]. The application of model-free RL to both quadrupedal and bipedal robots has showcased a great performance in the given task and robustness against external perturbations [lee2020learning, miki2022learning, duan2022learning, xie2020learning]. Nonetheless, the model-free RL lacks interpretability, making the process of reward shaping and hyperparameter tuning less straightforward and challenging. Furthermore, it has difficulty in generalizing learned policies to new tasks or environments without undergoing retraining. In response, numerous studies used heuristic-based references or model-based physics insights to inform and guide policy learning. Specifically, some of the studies have employed heuristic or sampling-based methods to generate footstep locations and track these references using RL policy [duan2022learning, duan2022sim]. However, the absence of physical reasoning in footstep selection makes it challenging to accurately track target footstep locations while simultaneously maintaining balance without leading to instability or falls. Other studies have used reduced-order models to guide the RL policy to follow the reference trajectories generated by these models [green2021learning, batke2022optimizing, jenelten2024dtc]. However, directly tracking the reference body and joint trajectories [jenelten2024dtc] or imitating the offline motion library [green2021learning, batke2022optimizing] from simplified models causes the RL policy to become overly aligned with the model, restricting exploration during training. Consequently, the resulting policy may be excessively constrained by the simplified model, failing to fully utilize the potential of whole-body dynamics.

In this work, we aim to bridge the gap between these two paradigms, integrating the physics-driven insights of model-based approaches with the adaptive and robust characteristics of RL. Specifically, we propose a hierarchical control framework that employs physics-informed step placements, utilizing linear inverted pendulum (LIP) dynamics to generate target step patterns, while concurrently training an RL policy to ensure the robot adheres to these prescribed step placements. This partial guidance from the physics-based template model prevents the policy from being confined to the model and results in a stable control policy capable of dynamic locomotion tasks, such as fast walking and sharp turns. Furthermore, our approach exhibits the robustness and adaptability inherent to RL policies, extending its capability to navigate unseen, uneven terrains by dynamically adjusting desired steps during the swing phase. The effectiveness of our approach is demonstrated through simulations and hardware experiments on the MIT Humanoid robot [saloutos2023design], showcasing its potential in advancing robotic locomotion in complex environments (see video111Supplementary Video Link and code222Open-Source Code Link).

II BACKGROUND

A single inverted pendulum that connects the supporting foot with its center of mass (CoM) via a massless telescopic leg is commonly used as a simplified model to represent bipedal locomotion [kajita20013d]. By applying constraints to the inverted pendulum’s motion, including a constant CoM velocity along the z-axis and a point-foot model without an actuated ankle joint, an analytical solution for the 3D-LIPM (Fig. 2(a)[kajita20013d] can be formulated, governed by a linearly independent equation of motion:

𝒓¨=ω02(𝒓𝒑)¨𝒓superscriptsubscript𝜔02𝒓𝒑\ddot{\boldsymbol{r}}=\omega_{0}^{2}(\boldsymbol{r}-\boldsymbol{p})over¨ start_ARG bold_italic_r end_ARG = italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_italic_r - bold_italic_p ) (1)

where 𝒓=(rx,ry)T𝒓superscriptsubscript𝑟𝑥subscript𝑟𝑦𝑇\boldsymbol{r}=(r_{x},r_{y})^{T}bold_italic_r = ( italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT denotes the position of the CoM, ω0=g/z0subscript𝜔0𝑔subscript𝑧0\omega_{0}=\sqrt{g/z_{0}}italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = square-root start_ARG italic_g / italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG indicates the natural frequency of the pendulum, and 𝒑=(px,py)T𝒑superscriptsubscript𝑝𝑥subscript𝑝𝑦𝑇\boldsymbol{p}=(p_{x},p_{y})^{T}bold_italic_p = ( italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT represents the position of the foot. During the derivation of (1), it is assumed that the foot is in contact with the ground.

Integrating (1) yields the ”orbital energy” [kajita20013d], leading to the formulation of the Instantaneous Capture Point (ICP), which is a point on the ground that the system comes to a stop if it were to instantaneously place its foot there [koolen2012capturability]:

𝝃=𝒓+𝒓˙ω0𝝃𝒓˙𝒓subscript𝜔0\boldsymbol{\xi}=\boldsymbol{r}+\frac{\dot{\boldsymbol{r}}}{\omega_{0}}bold_italic_ξ = bold_italic_r + divide start_ARG over˙ start_ARG bold_italic_r end_ARG end_ARG start_ARG italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG (2)

where 𝝃=(ξx,ξy)T𝝃superscriptsubscript𝜉𝑥subscript𝜉𝑦𝑇\boldsymbol{\xi}=(\xi_{x},\xi_{y})^{T}bold_italic_ξ = ( italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT denotes the position of the ICP. By differentiating (2) with respect to time, and inserting (1) into that equation, we obtain the ICP dynamics:

𝝃˙=ω0(𝝃𝒑)˙𝝃subscript𝜔0𝝃𝒑\dot{\boldsymbol{\xi}}=\omega_{0}(\boldsymbol{\xi}-\boldsymbol{p})over˙ start_ARG bold_italic_ξ end_ARG = italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( bold_italic_ξ - bold_italic_p ) (3)

We can derive the solution of (3) as follows:

𝝃(t)=eω0t𝝃(0)+(1eω0t)𝒑𝝃𝑡superscript𝑒subscript𝜔0𝑡𝝃01superscript𝑒subscript𝜔0𝑡𝒑\boldsymbol{\xi}(t)=e^{\omega_{0}t}\boldsymbol{\xi}(0)+(1-e^{\omega_{0}t})% \boldsymbol{p}bold_italic_ξ ( italic_t ) = italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT bold_italic_ξ ( 0 ) + ( 1 - italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_t end_POSTSUPERSCRIPT ) bold_italic_p (4)

These principles, described in (3) and (4), are pivotal for generating stable step patterns, as discussed in the next section.

III STEP PATTERN GENERATION ALGORITHMS

In this section, we describe the process of generating a suitable step pattern for the 3D-LIPM to achieve a velocity tracking task [hof2008extrapolated, englsberger2011bipedal]. We incorporate these strategies into the RL problem in Sec. IV, ensuring the bipedal robot aligns its foot placement with calculated step locations.

Refer to caption
(a) 3D view
Refer to caption
(b) 2D top-view
Figure 2: Step pattern generation algorithms for 3D-LIPM from 3D (Figure 2(a)), 2D top-view (Figure 2(b)) perspective. Figure 2(a) depicts the LIPM with two legs. The LIP dynamics can predict the CoM trajectory (green lines, and green dashed lines). Our method calculates ICP trajectory (yellow lines) and adds offsets (bx,by)subscript𝑏𝑥subscript𝑏𝑦(b_{x},b_{y})( italic_b start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) to the final ICP (ξxf,ξyf)superscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑦f(\xi_{x}^{\text{f}},\xi_{y}^{\text{f}})( italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT , italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT ) to determine desired step locations for tracking velocity commands. Figure 2(b) depicts the top view of the proposed method.
Refer to caption
Figure 3: Overall control diagram and training framework for both learning in simulation and deployment to hardware. Step pattern generation algorithms generate the desired step location by utilizing the robot’s CoM position, velocity, and foot states. These algorithms and NN update at a frequency of 100 Hz where both the actor and critic are trained using the PPO algorithm. Once the policy (actor) outputs joint position targets, the joint PD controller is evaluated at 1 KHz, and the command torques are sent to the motor.

Our main objective is to track the desired base velocity command. Fig. 2(a) outlines our step pattern generation algorithms, deriving the ICP trajectory and calculating the necessary offsets to determine the step locations for the left and right steps. Fig. 2(b) presents a top-view of our algorithms.

We assume the LIPM moves along the positive x𝑥xitalic_x-axis. First, we calculate the desired step length sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT based on the velocity command 𝒗^=(v^x,v^y)^𝒗subscript^𝑣𝑥subscript^𝑣𝑦\hat{\boldsymbol{v}}=(\hat{v}_{x},\hat{v}_{y})over^ start_ARG bold_italic_v end_ARG = ( over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ):

sd=|𝒗^|δTsubscript𝑠𝑑^𝒗𝛿𝑇s_{d}=|\hat{\boldsymbol{v}}|\cdot\delta Titalic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT = | over^ start_ARG bold_italic_v end_ARG | ⋅ italic_δ italic_T (5)

Here, δT𝛿𝑇\delta Titalic_δ italic_T denotes the remaining step duration calculated by δT=Tst𝛿𝑇subscript𝑇s𝑡\delta T=T_{\text{s}}-titalic_δ italic_T = italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT - italic_t, where Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT represents the user-defined step duration, and t𝑡titalic_t indicates the elapsed time since the beginning of the step. At the start of each step, t𝑡titalic_t is reset to 0 and progresses to Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT.

Similarly, we calculate the desired step width wdsubscript𝑤𝑑w_{d}italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT based on the step width command w^^𝑤\hat{w}over^ start_ARG italic_w end_ARG:

wd=|w^|δT/Tssubscript𝑤𝑑^𝑤𝛿𝑇subscript𝑇sw_{d}=|\hat{w}|\cdot\delta T/T_{\text{s}}italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT = | over^ start_ARG italic_w end_ARG | ⋅ italic_δ italic_T / italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT (6)

Given the initial (i.e., at the beginning of the step) body state 𝒓o=(rxo,ryo)superscript𝒓osuperscriptsubscript𝑟𝑥osuperscriptsubscript𝑟𝑦o\boldsymbol{r}^{\text{o}}=(r_{x}^{\text{o}},r_{y}^{\text{o}})bold_italic_r start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT = ( italic_r start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT ), 𝒓˙o=(r˙xo,r˙yo)superscript˙𝒓osuperscriptsubscript˙𝑟𝑥osuperscriptsubscript˙𝑟𝑦o\dot{\boldsymbol{r}}^{\text{o}}=(\dot{r}_{x}^{\text{o}},\dot{r}_{y}^{\text{o}})over˙ start_ARG bold_italic_r end_ARG start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT = ( over˙ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT , over˙ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT ), we calculate the initial ICP 𝝃o=(ξxo,ξyo)superscript𝝃osuperscriptsubscript𝜉𝑥osuperscriptsubscript𝜉𝑦o\boldsymbol{\xi}^{\text{o}}=(\xi_{x}^{\text{o}},\xi_{y}^{\text{o}})bold_italic_ξ start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT = ( italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT , italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT ). Then we predict the LIP’s final (i.e., at the end of step) ICP 𝝃f=(ξxf,ξyf)superscript𝝃fsuperscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑦f\boldsymbol{\xi}^{\text{f}}=(\xi_{x}^{\text{f}},\xi_{y}^{\text{f}})bold_italic_ξ start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT = ( italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT , italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT ) after δT𝛿𝑇\delta Titalic_δ italic_T using (4):

(ξxfξyf)=(ξx(δT)ξy(δT))=(eω0δTξxo+(1eω0δT)pxeω0δTξyo+(1eω0δT)py)matrixsuperscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑦fmatrixsubscript𝜉𝑥𝛿𝑇subscript𝜉𝑦𝛿𝑇matrixsuperscript𝑒subscript𝜔0𝛿𝑇superscriptsubscript𝜉𝑥o1superscript𝑒subscript𝜔0𝛿𝑇subscript𝑝𝑥superscript𝑒subscript𝜔0𝛿𝑇superscriptsubscript𝜉𝑦o1superscript𝑒subscript𝜔0𝛿𝑇subscript𝑝𝑦\begin{pmatrix}\xi_{x}^{\text{f}}\\ \xi_{y}^{\text{f}}\\ \end{pmatrix}=\begin{pmatrix}\xi_{x}(\delta T)\\ \xi_{y}(\delta T)\\ \end{pmatrix}=\begin{pmatrix}e^{\omega_{0}\delta T}\cdot\xi_{x}^{\text{o}}+(1-% e^{\omega_{0}\delta T})\cdot p_{x}\\ e^{\omega_{0}\delta T}\cdot\xi_{y}^{\text{o}}+(1-e^{\omega_{0}\delta T})\cdot p% _{y}\\ \end{pmatrix}( start_ARG start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ( italic_δ italic_T ) end_CELL end_ROW start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ( italic_δ italic_T ) end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT ⋅ italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT + ( 1 - italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT ) ⋅ italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT ⋅ italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT o end_POSTSUPERSCRIPT + ( 1 - italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT ) ⋅ italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) (7)

Based on Fig. 2(b), we observe that sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and wdsubscript𝑤𝑑w_{d}italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT can be readily expressed as follows:

(sdwd)=((ξxfξx0)(ξyfξy0)+2(ξy0py))matrixsubscript𝑠𝑑subscript𝑤𝑑matrixsuperscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑥0superscriptsubscript𝜉𝑦fsuperscriptsubscript𝜉𝑦02superscriptsubscript𝜉𝑦0subscript𝑝𝑦\begin{pmatrix}s_{d}\\ w_{d}\\ \end{pmatrix}=\begin{pmatrix}(\xi_{x}^{\text{f}}-\xi_{x}^{\text{0}})\\ (\xi_{y}^{\text{f}}-\xi_{y}^{\text{0}})+2(\xi_{y}^{\text{0}}-p_{y})\\ \end{pmatrix}( start_ARG start_ROW start_CELL italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL ( italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT - italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL ( italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT - italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) + 2 ( italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT - italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ) (8)

By inserting (8) into (7), we obtain the constant offset vector (bx,by)subscript𝑏𝑥subscript𝑏𝑦(b_{x},b_{y})( italic_b start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ), which when added to the final ICP, guarantees the step pattern has the desired step length sdsubscript𝑠𝑑s_{d}italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT and width wdsubscript𝑤𝑑w_{d}italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT:

(bxby)matrixsubscript𝑏𝑥subscript𝑏𝑦\displaystyle\begin{pmatrix}b_{x}\\ b_{y}\\ \end{pmatrix}( start_ARG start_ROW start_CELL italic_b start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_b start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) =(ξx0pxξy0py)=(ξxfξx0eω0δT1(ξyfξy0)+2(ξy0py)eω0δT+1)=(sdeω0δT1wdeω0δT+1)absentmatrixsuperscriptsubscript𝜉𝑥0subscript𝑝𝑥superscriptsubscript𝜉𝑦0subscript𝑝𝑦matrixsuperscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑥0superscript𝑒subscript𝜔0𝛿𝑇1superscriptsubscript𝜉𝑦fsuperscriptsubscript𝜉𝑦02superscriptsubscript𝜉𝑦0subscript𝑝𝑦superscript𝑒subscript𝜔0𝛿𝑇1matrixsubscript𝑠𝑑superscript𝑒subscript𝜔0𝛿𝑇1subscript𝑤𝑑superscript𝑒subscript𝜔0𝛿𝑇1\displaystyle=\begin{pmatrix}\xi_{x}^{\text{0}}-p_{x}\\ \xi_{y}^{\text{0}}-p_{y}\\ \end{pmatrix}=\begin{pmatrix}\frac{\xi_{x}^{\text{f}}-\xi_{x}^{\text{0}}}{e^{% \omega_{0}\delta T}-1}\\ \frac{(\xi_{y}^{\text{f}}-\xi_{y}^{\text{0}})+2(\xi_{y}^{\text{0}}-p_{y})}{e^{% \omega_{0}\delta T}+1}\\ \end{pmatrix}=\begin{pmatrix}\frac{s_{d}}{e^{\omega_{0}\delta T}-1}\\ \frac{w_{d}}{e^{\omega_{0}\delta T}+1}\\ \end{pmatrix}= ( start_ARG start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT - italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT - italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL divide start_ARG italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT - italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT - 1 end_ARG end_CELL end_ROW start_ROW start_CELL divide start_ARG ( italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT - italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) + 2 ( italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT - italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT + 1 end_ARG end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL divide start_ARG italic_s start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT - 1 end_ARG end_CELL end_ROW start_ROW start_CELL divide start_ARG italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_ARG start_ARG italic_e start_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_δ italic_T end_POSTSUPERSCRIPT + 1 end_ARG end_CELL end_ROW end_ARG ) (9)

Then we determine the desired step location 𝒑^=(p^x,p^y)T^𝒑superscriptsubscript^𝑝𝑥subscript^𝑝𝑦𝑇\hat{\boldsymbol{p}}=(\hat{p}_{x},\hat{p}_{y})^{T}over^ start_ARG bold_italic_p end_ARG = ( over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT by adding this constant offset vector to the final ICP [hof2008extrapolated]:

(p^xp^y)=(ξxfbxξyf+(1)nby))\begin{pmatrix}\hat{p}_{x}\\ \hat{p}_{y}\\ \end{pmatrix}=\begin{pmatrix}\xi_{x}^{\text{f}}-b_{x}\\ \xi_{y}^{\text{f}}+(-1)^{n}b_{y})\\ \end{pmatrix}( start_ARG start_ROW start_CELL over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT - italic_b start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT + ( - 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_b start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ) (10)

where n𝑛nitalic_n indicates the step cycle (even n𝑛nitalic_n for the left step and odd n𝑛nitalic_n for the right step). In the case of turning in xy𝑥𝑦xyitalic_x italic_y-plane, we modify (10) by rotating the constant offset vector:

(p^xp^y)=(ξxfξyf)+(cos(θ)sin(θ)sin(θ)cos(θ))(bx(1)nby)matrixsubscript^𝑝𝑥subscript^𝑝𝑦matrixsuperscriptsubscript𝜉𝑥fsuperscriptsubscript𝜉𝑦fmatrix𝑐𝑜𝑠𝜃𝑠𝑖𝑛𝜃𝑠𝑖𝑛𝜃𝑐𝑜𝑠𝜃matrixsubscript𝑏𝑥superscript1𝑛subscript𝑏𝑦\begin{pmatrix}\hat{p}_{x}\\ \hat{p}_{y}\\ \end{pmatrix}=\begin{pmatrix}\xi_{x}^{\text{f}}\\ \xi_{y}^{\text{f}}\\ \end{pmatrix}+\begin{pmatrix}cos(\theta)&-sin(\theta)\\ sin(\theta)&cos(\theta)\\ \end{pmatrix}\begin{pmatrix}-b_{x}\\ (-1)^{n}b_{y}\\ \end{pmatrix}( start_ARG start_ROW start_CELL over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_ξ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT f end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ) + ( start_ARG start_ROW start_CELL italic_c italic_o italic_s ( italic_θ ) end_CELL start_CELL - italic_s italic_i italic_n ( italic_θ ) end_CELL end_ROW start_ROW start_CELL italic_s italic_i italic_n ( italic_θ ) end_CELL start_CELL italic_c italic_o italic_s ( italic_θ ) end_CELL end_ROW end_ARG ) ( start_ARG start_ROW start_CELL - italic_b start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ( - 1 ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_b start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) (11)

where θ𝜃\thetaitalic_θ refers to the rotation angle around z-axis defined by θ=tan1(v^y/v^x)𝜃𝑡𝑎superscript𝑛1subscript^𝑣𝑦subscript^𝑣𝑥\theta=tan^{-1}(\hat{v}_{y}/\hat{v}_{x})italic_θ = italic_t italic_a italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT / over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ). During the training, we calculate the desired step location only at the beginning of the step when t=0𝑡0t=0italic_t = 0 (i.e., δT=Ts𝛿𝑇subscript𝑇s\delta T=T_{\text{s}}italic_δ italic_T = italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT). The detailed values for each user-defined variable are given in Table I.

TABLE I: User-defined Variables for Step Pattern Generation Algorithms
Variable Value
Velocity commands 𝒗^^𝒗\hat{\boldsymbol{v}}over^ start_ARG bold_italic_v end_ARG U2[2.0,2.0]superscript𝑈22.02.0U^{2}[-2.0,~{}2.0]italic_U start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT [ - 2.0 , 2.0 ] m/s
Step width command w^^𝑤\hat{w}over^ start_ARG italic_w end_ARG 0.3 m
Step duration Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT 0.35 s
Base height p^base,zsubscript^𝑝base𝑧\hat{p}_{\text{base},z}over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT 0.62 m
Base heading θ^basesubscript^𝜃base\hat{\theta}_{\text{base}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT base end_POSTSUBSCRIPT tan1(v^y/v^x)𝑡𝑎superscript𝑛1subscript^𝑣𝑦subscript^𝑣𝑥tan^{-1}(\hat{v}_{y}/\hat{v}_{x})italic_t italic_a italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT / over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT )

IV RL PROBLEM FORMULATION

In this section, we describe our RL training framework to ensure the robot tracks the velocity commands and the step pattern generated by (11).

Fig. 3 shows the overview of our control framework. Our control policy is a fully connected neural network with 3 hidden layers, each layer with 256 nodes. The policy takes as input the robot states, step commands derived from our proposed step pattern algorithms, and user velocity commands, and outputs the desired residual joint PD setpoints. We train our policy in the IsaacGym simulation engine using PPO [schulman2017proximal] algorithms with parallelization of 4096 environments and input normalization. Detailed information on PPO hyperparameters can be found in Table II. Now, we introduce the state space, action space, and reward formulation for the RL problem.

IV-A State Space

The state space of our policy consists of the observed robot states, step commands, and user-defined velocity commands with a size of 𝒮51𝒮superscript51\mathcal{S}\in\mathbb{R}^{51}caligraphic_S ∈ blackboard_R start_POSTSUPERSCRIPT 51 end_POSTSUPERSCRIPT. In detail, 𝒮𝒮\mathcal{S}caligraphic_S includes the base height, base linear velocity in the world frame, base angular velocity, projection of the gravity vector in the base frame, left and right foot location and heading in the base frame, left and right desired step location and heading in the base frame, velocity commands, phase clock in sine and cosine functions, joint position, joint velocity. The phase identifiers indicate the swing and stance phase of each foot through the contact scheduler. The base states are measured through the phase-based state estimator [mit-biomimetics_cheetah-software_2023] that assumes the foot contact on the ground at the specified contact schedule.

IV-B Action Space

We define the action space 𝒜10𝒜superscript10\mathcal{A}\in\mathbb{R}^{10}caligraphic_A ∈ blackboard_R start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT as the desired residual joint PD setpoints Δ𝒒^Δ^𝒒\Delta\hat{\boldsymbol{q}}roman_Δ over^ start_ARG bold_italic_q end_ARG, representing a deviation from the nominal joint position 𝒒refsuperscript𝒒ref\boldsymbol{q}^{\text{ref}}bold_italic_q start_POSTSUPERSCRIPT ref end_POSTSUPERSCRIPT for hip yaw, hip abduction, hip pitch, knee and ankle joint respectively. The action from our policy is updated at a frequency of 100 Hz and fed into the joint PD controller. Then, the fixed-gain joint PD controller operates at 1 kHz. To be specific, the joint PD controller uses the following equation to convert the action into the desired torque command:

𝝉^=𝐊𝐩(𝒒ref+Δ𝒒^𝒒)+𝐊𝐝(𝟎𝒒˙)^𝝉subscript𝐊𝐩superscript𝒒refΔ^𝒒𝒒subscript𝐊𝐝0˙𝒒\hat{\boldsymbol{\tau}}=\mathbf{K_{p}}(\boldsymbol{q}^{\text{ref}}+\Delta\hat{% \boldsymbol{q}}-\boldsymbol{q})+\mathbf{K_{d}}(\mathbf{0}-\dot{\boldsymbol{q}})over^ start_ARG bold_italic_τ end_ARG = bold_K start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_italic_q start_POSTSUPERSCRIPT ref end_POSTSUPERSCRIPT + roman_Δ over^ start_ARG bold_italic_q end_ARG - bold_italic_q ) + bold_K start_POSTSUBSCRIPT bold_d end_POSTSUBSCRIPT ( bold_0 - over˙ start_ARG bold_italic_q end_ARG ) (12)

For the joint PD controller’s gains, we have configured 𝐊𝐩subscript𝐊𝐩\mathbf{K_{p}}bold_K start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT to diag(30,,30)𝑑𝑖𝑎𝑔3030diag(30,...,30)italic_d italic_i italic_a italic_g ( 30 , … , 30 ), and 𝐊𝐝subscript𝐊𝐝\mathbf{K_{d}}bold_K start_POSTSUBSCRIPT bold_d end_POSTSUBSCRIPT to diag(1,,1)𝑑𝑖𝑎𝑔11diag(1,...,1)italic_d italic_i italic_a italic_g ( 1 , … , 1 ).

IV-C Rewards

TABLE II: PPO Hyperparameters
Parameter Value
Horizon (H) 24
Adam learning rate 1×1051superscript1051\times 10^{-5}1 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT
Number of epochs 5
Number of mini-batches 4
Discount (γ𝛾\gammaitalic_γ) 0.99
Clipping parameter (ϵitalic-ϵ\epsilonitalic_ϵ) 0.2
Max gradient norm 1
TABLE III: Regularization Rewards
Reward Weight Expression
 
Joint torques 1e-4 |𝝉|2superscript𝝉2-\lvert\boldsymbol{\tau}\rvert^{2}- | bold_italic_τ | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Joint torque limits 1e-2 max(|𝝉|0.9𝝉max,0)max𝝉0.9subscript𝝉𝑚𝑎𝑥0-\operatorname{max}(\lvert\boldsymbol{\tau}\rvert-0.9\boldsymbol{\tau}_{max},0)- roman_max ( | bold_italic_τ | - 0.9 bold_italic_τ start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT , 0 )
Joint velocity 1e-3 |𝒒˙|2superscript˙𝒒2-\lvert\dot{\boldsymbol{q}}\rvert^{2}- | over˙ start_ARG bold_italic_q end_ARG | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Joint limits 10 clip(|𝒒|0.9𝒒max,0,1)clip𝒒0.9subscript𝒒𝑚𝑎𝑥01-\operatorname{clip}(\lvert\boldsymbol{q}\rvert-0.9\boldsymbol{q}_{max},0,1)- roman_clip ( | bold_italic_q | - 0.9 bold_italic_q start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT , 0 , 1 )
Action smoothness 1 1e-3 |(𝒂t𝒂t1)/Δt|2superscriptsubscript𝒂𝑡subscript𝒂𝑡1Δ𝑡2-\lvert(\boldsymbol{a}_{t}-\boldsymbol{a}_{t-1})/\Delta t\rvert^{2}- | ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) / roman_Δ italic_t | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Action smoothness 2 1e-4 |(𝒂t2𝒂t1+𝒂t2)/Δt|2superscriptsubscript𝒂𝑡2subscript𝒂𝑡1subscript𝒂𝑡2Δ𝑡2-\lvert(\boldsymbol{a}_{t}-2\boldsymbol{a}_{t-1}+\boldsymbol{a}_{t-2})/\Delta t% \rvert^{2}- | ( bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - 2 bold_italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + bold_italic_a start_POSTSUBSCRIPT italic_t - 2 end_POSTSUBSCRIPT ) / roman_Δ italic_t | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Hip joint regularization 1.25 exp((𝒒hip,xz)2/σ)superscriptsubscript𝒒hip𝑥𝑧2𝜎\exp(-(\boldsymbol{q}_{\text{hip},xz})^{2}/\sigma)roman_exp ( - ( bold_italic_q start_POSTSUBSCRIPT hip , italic_x italic_z end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_σ )
Base roll-pitch velocity 1e-2 (ωbase,x2+ωbase,y2)superscriptsubscript𝜔base𝑥2superscriptsubscript𝜔base𝑦2-(\omega_{\text{base},x}^{2}+\omega_{\text{base},y}^{2})- ( italic_ω start_POSTSUBSCRIPT base , italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ω start_POSTSUBSCRIPT base , italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT )
Base z-axis velocity 1e-1 |vbase,z|2superscriptsubscript𝑣base𝑧2-\lvert v_{\text{base},z}\rvert^{2}- | italic_v start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Base tilting 1 exp((gbase,x2+gbase,y2)/σ)superscriptsubscript𝑔base𝑥2superscriptsubscript𝑔base𝑦2𝜎\exp(-(g_{\text{base},x}^{2}+g_{\text{base},y}^{2})/\sigma)roman_exp ( - ( italic_g start_POSTSUBSCRIPT base , italic_x end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_g start_POSTSUBSCRIPT base , italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) / italic_σ )
Termination 100 {1,selfcollision,1,|𝒗base|10 [m/s],1,|𝝎base|5 [rad/s],1,gbase,x,gbase,y0.7,1,pbase,z<0.3 [m],0,otherwise.cases1selfcollisionotherwise1subscript𝒗base10 [m/s]otherwise1subscript𝝎base5 [rad/s]otherwise1subscript𝑔base𝑥subscript𝑔base𝑦0.7otherwise1subscript𝑝base𝑧0.3 [m]otherwise0otherwiseotherwise\begin{cases}-1,\;\operatorname{self-collision},\\ -1,\;\lvert\boldsymbol{v}_{\text{base}}\rvert\geq 10\text{ [m/s]},\\ -1,\;\lvert\boldsymbol{\omega}_{\text{base}}\rvert\geq 5\text{ [rad/s]},\\ -1,\;g_{\text{base},x},g_{\text{base},y}\geq 0.7,\\ -1,\;p_{\text{base},z}<0.3\text{ [m]},\\ 0,\;\operatorname{otherwise}.\end{cases}{ start_ROW start_CELL - 1 , start_OPFUNCTION roman_self - roman_collision end_OPFUNCTION , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - 1 , | bold_italic_v start_POSTSUBSCRIPT base end_POSTSUBSCRIPT | ≥ 10 [m/s] , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - 1 , | bold_italic_ω start_POSTSUBSCRIPT base end_POSTSUBSCRIPT | ≥ 5 [rad/s] , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - 1 , italic_g start_POSTSUBSCRIPT base , italic_x end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT base , italic_y end_POSTSUBSCRIPT ≥ 0.7 , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - 1 , italic_p start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT < 0.3 [m] , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL 0 , roman_otherwise . end_CELL start_CELL end_CELL end_ROW

We formulate the reward structure to ensure the robot tracks the desired step location while maintaining stability and adaptability. Since the desired step location is derived based on the LIPM, we incorporate specific rewards to satisfy the assumption of the LIPM. To retain the inherent flexibility and adaptability characteristic of RL policy, however, we do not impose explicit rewards to follow the LIPM’s CoM trajectory.

The overall reward function is formulated as follows:

r=rbh+rbo+rvt+rcs+rRegularization𝑟subscript𝑟𝑏subscript𝑟𝑏𝑜subscript𝑟𝑣𝑡subscript𝑟𝑐𝑠subscript𝑟Regularizationr=r_{bh}+r_{bo}+r_{vt}+r_{cs}+r_{\text{Regularization}}italic_r = italic_r start_POSTSUBSCRIPT italic_b italic_h end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_b italic_o end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_v italic_t end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_c italic_s end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT Regularization end_POSTSUBSCRIPT (13)

First, to address the LIPM’s assumption of a constant height, we introduce a reward rbhsubscript𝑟𝑏r_{bh}italic_r start_POSTSUBSCRIPT italic_b italic_h end_POSTSUBSCRIPT that encourages the robot to keep a constant base height p^base,zsubscript^𝑝base𝑧\hat{p}_{\text{base},z}over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT:

rbh=exp((p^base,zpbase,z)2/σ)subscript𝑟𝑏superscriptsubscript^𝑝base𝑧subscript𝑝base𝑧2𝜎r_{bh}=\exp(-(\hat{p}_{\text{base},z}-p_{\text{base},z})^{2}/\sigma)italic_r start_POSTSUBSCRIPT italic_b italic_h end_POSTSUBSCRIPT = roman_exp ( - ( over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT base , italic_z end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_σ ) (14)

Given that the LIPM is represented solely by a point mass and lacks any orientation, it offers no direct control over the robot’s orientation. Therefore, we assume that the robot’s base should consistently orient towards the desired base heading θ^basesubscript^𝜃base\hat{\theta}_{\text{base}}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT base end_POSTSUBSCRIPT direction. To encapsulate this concept, the reward rbosubscript𝑟𝑏𝑜r_{bo}italic_r start_POSTSUBSCRIPT italic_b italic_o end_POSTSUBSCRIPT is designed:

rbo=2exp(|θ^baseθbase|/σ)subscript𝑟𝑏𝑜2subscript^𝜃basesubscript𝜃base𝜎r_{bo}=2\exp(-\lvert\hat{\theta}_{\text{base}}-\theta_{\text{base}}\rvert/\sigma)italic_r start_POSTSUBSCRIPT italic_b italic_o end_POSTSUBSCRIPT = 2 roman_exp ( - | over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT base end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT base end_POSTSUBSCRIPT | / italic_σ ) (15)

The desired step location calculated by step pattern generation algorithms in Sec. III results in the LIPM’s passive dynamics naturally fulfilling the velocity command 𝒗^^𝒗\hat{\boldsymbol{v}}over^ start_ARG bold_italic_v end_ARG. Given the robot’s deviation from the LIPM, we implement the velocity tracking reward rvtsubscript𝑟𝑣𝑡r_{vt}italic_r start_POSTSUBSCRIPT italic_v italic_t end_POSTSUBSCRIPT to ensure tracking of the velocity command 𝒗^^𝒗\hat{\boldsymbol{v}}over^ start_ARG bold_italic_v end_ARG:

rvt=4exp((𝒗^𝒗world1+|𝒗^|)2/σ)subscript𝑟𝑣𝑡4superscript^𝒗subscript𝒗world1^𝒗2𝜎r_{vt}=4\exp(-(\frac{\hat{\boldsymbol{v}}-\boldsymbol{v}_{\text{world}}}{1+% \lvert\hat{\boldsymbol{v}}\rvert})^{2}/\sigma)italic_r start_POSTSUBSCRIPT italic_v italic_t end_POSTSUBSCRIPT = 4 roman_exp ( - ( divide start_ARG over^ start_ARG bold_italic_v end_ARG - bold_italic_v start_POSTSUBSCRIPT world end_POSTSUBSCRIPT end_ARG start_ARG 1 + | over^ start_ARG bold_italic_v end_ARG | end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_σ ) (16)

Upon determining the desired step location, the robot must place its foot at this location for the specified step duration. The reward rcssubscript𝑟𝑐𝑠r_{cs}italic_r start_POSTSUBSCRIPT italic_c italic_s end_POSTSUBSCRIPT is crafted to incentivize the robot to conform to the contact schedule at the desired step location:

rcs=9(𝟙r,contact𝟙l,contact)ϕcontactexp((𝒑^𝒑2)/σ)subscript𝑟𝑐𝑠9subscript1r,contactsubscript1l,contactsubscriptitalic-ϕcontactsubscriptnorm^𝒑𝒑2𝜎r_{cs}=9(\mathbbm{1}_{\text{r,contact}}-\mathbbm{1}_{\text{l,contact}})\phi_{% \text{contact}}\cdot\exp(-(||\hat{\boldsymbol{p}}-\boldsymbol{p}||_{2})/\sigma)italic_r start_POSTSUBSCRIPT italic_c italic_s end_POSTSUBSCRIPT = 9 ( blackboard_1 start_POSTSUBSCRIPT r,contact end_POSTSUBSCRIPT - blackboard_1 start_POSTSUBSCRIPT l,contact end_POSTSUBSCRIPT ) italic_ϕ start_POSTSUBSCRIPT contact end_POSTSUBSCRIPT ⋅ roman_exp ( - ( | | over^ start_ARG bold_italic_p end_ARG - bold_italic_p | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) / italic_σ ) (17)

Here, 𝟙r,contactsubscript1r,contact\mathbbm{1}_{\text{r,contact}}blackboard_1 start_POSTSUBSCRIPT r,contact end_POSTSUBSCRIPT and 𝟙l,contactsubscript1l,contact\mathbbm{1}_{\text{l,contact}}blackboard_1 start_POSTSUBSCRIPT l,contact end_POSTSUBSCRIPT are indicator functions for right and left foot ground contact, respectively. The contact schedule ϕcontactsubscriptitalic-ϕcontact\phi_{\text{contact}}italic_ϕ start_POSTSUBSCRIPT contact end_POSTSUBSCRIPT is a continuous function that oscillates between -1 and 1 across each two-step duration 2Ts2subscript𝑇s2T_{\text{s}}2 italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT:

ϕcontact=sin(2πϕ)sin2(2πϕ)+0.04,ϕ=t2Tsformulae-sequencesubscriptitalic-ϕcontact𝑠𝑖𝑛2𝜋italic-ϕ𝑠𝑖superscript𝑛22𝜋italic-ϕ0.04italic-ϕsuperscript𝑡2subscript𝑇s\phi_{\text{contact}}=\frac{sin(2\pi\phi)}{\sqrt{sin^{2}(2\pi\phi)+0.04}}~{},~% {}~{}\phi=\frac{t^{\prime}}{2T_{\text{s}}}italic_ϕ start_POSTSUBSCRIPT contact end_POSTSUBSCRIPT = divide start_ARG italic_s italic_i italic_n ( 2 italic_π italic_ϕ ) end_ARG start_ARG square-root start_ARG italic_s italic_i italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 italic_π italic_ϕ ) + 0.04 end_ARG end_ARG , italic_ϕ = divide start_ARG italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT end_ARG (18)

where tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT denotes the elapsed time from the start of the right-foot step, which is reset to 0 every two-step cycle, 2Ts2subscript𝑇s2T_{\text{s}}2 italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT.

Furthermore, to mitigate any undesirable motions, a set of regularization rewards rRegularizationsubscript𝑟Regularizationr_{\text{Regularization}}italic_r start_POSTSUBSCRIPT Regularization end_POSTSUBSCRIPT is imposed to penalize excessive joint torque, velocities, unnecessary angular motion, policy termination due to falls, etc (see Table III). The reward shaping parameter σ𝜎\sigmaitalic_σ for the exponential function is set to 0.25 during training.

V EXPERIMENT RESULTS

We now present our simulation and hardware test results on MIT Humanoid to evaluate the effectiveness of our approach. The training process takes about three hours of wall clock time using a Nvidia GeForce 3090 GPU.

V-A Simulation Results

1) Velocity tracking performance:

Refer to caption
Figure 4: Comparison of velocity tracking performance between our method, End-to-End policies trained on flat terrain versus mixed terrains (flat, rough, and gap), and Raibert heuristic policy. Commands were given in flat terrain. Our method exceeds the performance of the End-to-End approach trained on varied terrains and shows comparable results to the End-to-End policy trained exclusively on flat terrain.

Fig. 4 presents the velocity tracking performance of our method compared to: 1) End-to-End policy, which is trained to track the velocity commands without foot placement constraints; and 2) Raibert heuristic [raibert1983dynamically] policy, which replaces step pattern generation algorithms with Raibert heuristic. Our method and Raibert heuristic policy are trained exclusively on flat terrain. Additionally, we train two End-to-End policies: one on flat terrain; and the other on multiple terrains, including not only flat but also rough and gap-containing terrains. This plot depicts that our method exceeds the velocity tracking performance of the End-to-End policy trained across these diverse terrains. It also shows that our tracking accuracy is comparable to the End-to-End policy trained solely on flat terrain, which is recognized for its proficiency in single-task scenarios. The results validate that our method reliably tracks velocities up to 2.0 m/s. Furthermore, we observe that with the application of lateral velocity commands, the robot can execute dynamic maneuvers, including 90-degree and 180-degree turns.

2) Learning desired step duration:

Refer to caption
Figure 5: Learned foot contact schedule. The step duration Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT that was set to 0.35 seconds was encouraged by a contact schedule reward. The error between the measured and desired foot contact schedule is less than 0.01 seconds.

Fig. 5 shows the step duration Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT learned by the policy. Throughout the training phase, Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT was fixed at 0.35 seconds, indicating the ground contact duration for a single step. In our setting, a foot is considered to be in contact with the ground if either the toe or heel is touching the ground. This behavior is encouraged through a contact schedule reward (17). The plot confirms that the policy has successfully learned to maintain ground contact for 0.35 seconds for each leg, alternating between the left and right. This consistent step sequence is subsequently beneficial for employing a phase-based state estimator in hardware deployment.

3) Tracking desired step location:

Refer to caption
Figure 6: Tracking desired step location and its resulting foot trajectory. Both left and right foot trajectories show a consistent and smooth path that culminates in an accurate touchdown at the target step locations. A notable observation is the robot’s measured CoM trajectory exhibits a close correspondence to the LIPM CoM trajectory

Fig. 6 shows the robot’s successful tracking of desired step location generated by step pattern generation algorithms. Both right and left foot trajectories form a smooth and regular trajectory ensuring accurate touch down on the target step location. Notably, the measured CoM trajectory of the robot closely aligns with the analytical LIPM CoM trajectory. This behavior is attributed to the implementation of the rewards that encourage the robot to satisfy the assumptions of LIPM.

4) Extension to rough terrain and gap terrain:

Refer to caption
(a) Rough terrain
Refer to caption
(b) Gap terrain
Figure 7: Adaptive locomotion on varied terrain. For rough terrain (Fig. 7(a)), the policy dynamically updates the desired step location by modifying δT𝛿𝑇\delta Titalic_δ italic_T in the step pattern generation algorithms to compensate for the irregularities in terrain that affect the robot’s CoM height. On gap terrain (Fig. 7(b)), the policy adjusts step location to the nearest flat surface, demonstrating the robot’s capability to navigate discontinuities in the surface and maintain a forward velocity v^xsubscript^𝑣𝑥\hat{v}_{x}over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT.
Refer to caption
Figure 8: Success rate for walking on the rough terrain. To navigate rough terrain, we modify the robot’s desired step locations at each time step to account for the deviation from the LIPM dynamics caused by uneven terrain. These adjustments, along with step elevation refinements based on heightmap data, helped maintain the robot’s balance while tracking velocity command. Our method proved more effective than an End-to-End and Raibert heuristic policy trained on flat terrain, as evidenced by on average a higher success rate in maintaining forward velocity without falling.

To evaluate the adaptability of our policy to unseen and uneven terrains, we conducted tests on both rough and gap terrain (Fig. 7). For the rough terrain, (Fig. 7(a)), we implemented dynamic adjustments of the desired step location by modifying δT𝛿𝑇\delta Titalic_δ italic_T in equations (5)-(7) every time step. This approach compensates for the inevitable deviations from the LIP dynamics due to the rough terrain’s impact on the constancy of the robot’s CoM height. Additionally, we refined the desired step elevation in accordance with the ground height data obtained from a heightmap. The efficacy of our method was quantified by comparing it to an End-to-End policy and Raibert heuristic policy, originally trained on flat terrain, using a success metric defined by the robot’s ability to maintain a predetermined forward velocity command v^xsubscript^𝑣𝑥\hat{v}_{x}over^ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT for five seconds without falling. As depicted in Fig. 8, our policy showed on average a higher success rate. In gap terrain scenarios (Fig. 7(b)), if the desired step location falls into a gap, we adjust it to the closest flat ground using heightmap data. Through these deployments on both rough and gapped terrains, we have validated the robustness and adaptability of our policy: it can successfully modify the desired step location in response to real-time terrain alterations, thereby sustaining effective locomotion.

V-B Hardware Results

We successfully transferred the policies developed in simulation to robot hardware, showcasing the robust sim-to-real transfer capabilities of our policies (Fig. 9). The robot demonstrated the ability to maintain a consistent height and precisely track the desired step locations for the given step duration Tssubscript𝑇sT_{\text{s}}italic_T start_POSTSUBSCRIPT s end_POSTSUBSCRIPT. To compensate for state estimator noise, we dynamically modified the step locations at each timestep. The performance was evaluated through two specific locomotion tasks:

Refer to caption
(a) Forward Walking
Refer to caption
(b) 90° Turning
Figure 9: Hardware experiment results for forward walking and 90-degree turning. We successfully achieved forward walking at speeds up to 1.5 m/s and executed 90-degree turns, both featuring a heel-to-toe motion during touchdown similar to human walking. These motions in the hardware precisely mirrored those observed in the simulation environment.
Refer to caption
(a) Forward Walking
Refer to caption
(b) Turning
Figure 10: Velocity tracking plot for both forward walking and turning. Fig. 10(a) corresponds to Fig. 9(a), and Fig. 10(b) corresponds to Fig. 9(b). Despite the presence of noise in the base linear velocity readings from the state estimator, our policy is able to track the velocity command and execute the given locomotion tasks.

1) Forward walking: We evaluated the robot’s ability to follow a forward velocity command on flat terrain using a treadmill, as shown in Fig. 9(a), confirming its capacity to walk at speeds up to 1.5 m/s. Notably, the robot demonstrated a heel-to-toe motion closely resembling human walking. Despite the noise in the base linear velocity from the state estimator, the policy enabled stable walking while accurately tracking velocity commands, as shown in Fig. 10(a).

2) Dynamic turning: Dynamic locomotion tasks including 90-degree and 180-degree turns were evaluated, with the results showcased in the supplementary video. Due to spatial limitations of the testing area, only small lateral velocity commands could be issued, resulting in the robot’s inability to track these commands precisely. However, the robot was still able to execute stable turns as demonstrated in Fig. 9(b), and Fig. 10(b).

VI CONCLUSION AND FUTURE WORKS

In this work, we present an approach that combines LIPM with RL to learn the policy capable of accurately tracking desired step locations determined by LIP dynamics. Specifically, our control framework forward predicts the robot states and determines the desired step location to track a given velocity command based on LIP dynamics. We demonstrated our approach on MIT Humanoid and confirmed that tracking these steps enables stable forward walking and dynamic turning. The learned policy further showcased flexibility and adaptability by adjusting desired steps during the swing phase proving its extendability to unseen and uneven terrains. We were able to deploy our policy on MIT Humanoid achieving a forward walking speed of 1.5 m/s and dynamic 90 and 180-degree turning.

In future work, our aim is twofold: 1) We plan to incorporate vision algorithms into our system to detect the height of the terrain. This will allow us to identify stable stepping locations, enhancing the robot’s ability to navigate real-world uneven terrain. 2) We aim to refine our method of determining desired step locations by replacing the LIP dynamics with whole-body dynamics, employing a model predictive controller. This refinement is expected to further improve our control framework to predict better step locations across various locomotion tasks.

ACKNOWLEDGMENT

We thank the members of the Biomimetic Robotics Laboratory at MIT for insightful discussions and feedback on the paper. We especially thank Se Hwan Jeon, Elijah Stanger-Jones, and Charles Khazoom for their helpful support in setting up and conducting the hardware experiments. \printbibliography