Integrating Model-Based Footstep Planning with Model-Free Reinforcement Learning for Dynamic Legged Locomotion
Abstract
In this work, we introduce a control framework that combines model-based footstep planning with Reinforcement Learning (RL), leveraging desired footstep patterns derived from the Linear Inverted Pendulum (LIP) dynamics. Utilizing the LIP model, our method forward predicts robot states and determines the desired foot placement given the velocity commands. We then train an RL policy to track the foot placements without following the full reference motions derived from the LIP model. This partial guidance from the physics model allows the RL policy to integrate the predictive capabilities of the physics-informed dynamics and the adaptability characteristics of the RL controller without overfitting the policy to the template model. Our approach is validated on the MIT Humanoid, demonstrating that our policy can achieve stable yet dynamic locomotion for walking and turning. We further validate the adaptability and generalizability of our policy by extending the locomotion task to unseen, uneven terrain. During the hardware deployment, we have achieved forward walking speeds of up to 1.5 m/s on a treadmill and have successfully performed dynamic locomotion maneuvers such as 90-degree and 180-degree turns.
I INTRODUCTION
Legged biological systems are capable of navigating through unstructured, complex, and discontinuous terrains, such as stepping stones. In the realm of legged robotics, researchers have long strived to enable legged robots to achieve mobility comparable to their natural counterparts, which would provide numerous practical real-world applications. However, designing controllers for legged robots is non-trivial because they have high degrees-of-freedom and their under-actuated floating base can only be indirectly controlled through external contact wrenches, making their equations of motion highly nonlinear and non-smooth.

Model-based control approaches, such as reactive and predictive control methods, have emerged as highly effective strategies for solving these complex control challenges, showcasing remarkable performance in quadrupedal and bipedal robotics applications [kajita20013d, hof2008extrapolated, englsberger2011bipedal, semini2016design, herdt2010online, rutschmann2012nonlinear, di2018dynamic, hong2022agile, kuindersma2016optimization]. The major advantage of the model-based control approach lies in leveraging insights from physics models to predict robot behavior, thereby enhancing controller design. In particular, foot placement emerges as one of the main components in model-based control on uneven or discontinuous terrain, providing an interface to control the robot through contact forces. Numerous studies have successfully used simplified models, such as the linear inverted pendulum model (LIPM) [kajita20013d, hof2008extrapolated, englsberger2011bipedal], to calculate foot placements, or have used optimization-based approaches that leverage simplified models such as the spring-loaded inverted pendulum (SLIP) [rutschmann2012nonlinear], single rigid body dynamics (SRBD) [bledt2019implementing], and centroidal dynamics [grandia2023perceptive] for simultaneous computation of foot placements and contact forces. However, model-based control strategies that rely purely on simplified dynamics are inherently constrained by their simplifications and model mismatches, resulting in conservative locomotion that does not fully exploit the robot’s capabilities.
Parallel to these developments, model-free Reinforcement Learning (RL) has emerged as a powerful tool for robotic control, demonstrating remarkable success in managing complex, dynamic environments [kober2013reinforcement]. The application of model-free RL to both quadrupedal and bipedal robots has demonstrated strong task performance and robustness against external perturbations [lee2020learning, miki2022learning, duan2022learning, xie2020learning]. Nonetheless, model-free RL lacks interpretability, which makes reward shaping and hyperparameter tuning less straightforward. Furthermore, learned policies are difficult to generalize to new tasks or environments without retraining. In response, numerous studies have used heuristic-based references or model-based physics insights to inform and guide policy learning. Specifically, some studies have employed heuristic or sampling-based methods to generate footstep locations and tracked these references with an RL policy [duan2022learning, duan2022sim]. However, the absence of physical reasoning in footstep selection makes it challenging to accurately track target footstep locations while simultaneously maintaining balance, which can lead to instability or falls. Other studies have used reduced-order models to guide the RL policy to follow the reference trajectories generated by these models [green2021learning, batke2022optimizing, jenelten2024dtc]. However, directly tracking the reference body and joint trajectories [jenelten2024dtc] or imitating an offline motion library [green2021learning, batke2022optimizing] derived from simplified models causes the RL policy to become overly aligned with the model, restricting exploration during training. Consequently, the resulting policy may be excessively constrained by the simplified model, failing to fully utilize the potential of whole-body dynamics.
In this work, we aim to bridge the gap between these two paradigms, integrating the physics-driven insights of model-based approaches with the adaptive and robust characteristics of RL. Specifically, we propose a hierarchical control framework that employs physics-informed step placements, utilizing linear inverted pendulum (LIP) dynamics to generate target step patterns, while concurrently training an RL policy to ensure the robot adheres to these prescribed step placements. This partial guidance from the physics-based template model prevents the policy from being confined to the model and results in a stable control policy capable of dynamic locomotion tasks, such as fast walking and sharp turns. Furthermore, our approach exhibits the robustness and adaptability inherent to RL policies, extending its capability to navigate unseen, uneven terrains by dynamically adjusting desired steps during the swing phase. The effectiveness of our approach is demonstrated through simulations and hardware experiments on the MIT Humanoid robot [saloutos2023design], showcasing its potential in advancing robotic locomotion in complex environments (see the supplementary video and the open-source code).
II BACKGROUND
A single inverted pendulum that connects the supporting foot with the center of mass (CoM) via a massless telescopic leg is commonly used as a simplified model of bipedal locomotion [kajita20013d]. By constraining the inverted pendulum’s motion to a constant CoM height along the z-axis and assuming a point foot without an actuated ankle joint, the 3D-LIPM (Fig. 2(a)) [kajita20013d] admits an analytical solution and is governed by a linear equation of motion that is independent along each horizontal axis:
$$\ddot{\mathbf{x}} = \omega^{2}\,(\mathbf{x} - \mathbf{p}) \tag{1}$$
where $\mathbf{x}$ denotes the position of the CoM, $\omega$ indicates the natural frequency of the pendulum, and $\mathbf{p}$ represents the position of the foot. During the derivation of (1), it is assumed that the foot is in contact with the ground.
Integrating (1) yields the "orbital energy" [kajita20013d], leading to the formulation of the Instantaneous Capture Point (ICP), which is the point on the ground at which the system would come to a stop if it were to instantaneously place its foot there [koolen2012capturability]:
$$\boldsymbol{\xi} = \mathbf{x} + \frac{\dot{\mathbf{x}}}{\omega} \tag{2}$$
where $\boldsymbol{\xi}$ denotes the position of the ICP. By differentiating (2) with respect to time and inserting (1) into the resulting equation, we obtain the ICP dynamics:
$$\dot{\boldsymbol{\xi}} = \omega\,(\boldsymbol{\xi} - \mathbf{p}) \tag{3}$$
For a foot position $\mathbf{p}$ that stays fixed during the step, we can derive the solution of (3) as follows, where $\boldsymbol{\xi}_0$ is the ICP at $t = 0$:
$$\boldsymbol{\xi}(t) = e^{\omega t}\,(\boldsymbol{\xi}_0 - \mathbf{p}) + \mathbf{p} \tag{4}$$
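The relations above can be checked numerically. The following is a minimal one-axis sketch, assuming the standard LIPM form (CoM acceleration equal to ω² times the CoM-to-foot offset, with ω = sqrt(g / z_c)); it compares the closed-form ICP prediction with a direct integration of the pendulum dynamics. The function names, the initial state, and the use of the 0.62 m base height as the CoM height z_c are illustrative assumptions, not values prescribed by the method.

```python
import math


def icp(x, xdot, omega):
    """Instantaneous capture point for one axis: xi = x + xdot / omega."""
    return x + xdot / omega


def icp_closed_form(xi0, p, omega, t):
    """Closed-form ICP evolution for a stance foot fixed at p."""
    return math.exp(omega * t) * (xi0 - p) + p


def simulate_lipm(x, xdot, p, omega, t_end, dt=1e-5):
    """Integrate xddot = omega^2 * (x - p) with semi-implicit Euler steps."""
    steps = int(round(t_end / dt))
    for _ in range(steps):
        xdot += omega ** 2 * (x - p) * dt
        x += xdot * dt
    return x, xdot


g = 9.81                      # gravitational acceleration [m/s^2]
z_c = 0.62                    # assumed constant CoM height [m]
omega = math.sqrt(g / z_c)    # natural frequency of the pendulum [1/s]

x0, xdot0, p = 0.0, 0.5, 0.05  # hypothetical initial CoM state and foot position
T = 0.35                       # prediction horizon [s]

xi0 = icp(x0, xdot0, omega)
xi_pred = icp_closed_form(xi0, p, omega, T)      # prediction from the closed form
x_T, xdot_T = simulate_lipm(x0, xdot0, p, omega, T)
xi_sim = icp(x_T, xdot_T, omega)                 # ICP of the integrated state
```

Under these assumptions, `xi_pred` and `xi_sim` agree up to the integration error, and calling `icp_closed_form(xi0, xi0, omega, t)` returns `xi0` for any `t`, which reflects the capture property: placing the foot on the ICP keeps the ICP stationary.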
III STEP PATTERN GENERATION ALGORITHMS
In this section, we describe the process of generating a suitable step pattern for the 3D-LIPM to achieve a velocity tracking task [hof2008extrapolated, englsberger2011bipedal]. We incorporate these strategies into the RL problem in Sec. IV, ensuring the bipedal robot aligns its foot placement with calculated step locations.



Our main objective is to track the desired base velocity command. Fig. 2(a) outlines our step pattern generation algorithms, deriving the ICP trajectory and calculating the necessary offsets to determine the step locations for the left and right steps. Fig. 2(b) presents a top-view of our algorithms.
We assume the LIPM moves along the positive x-axis. First, we calculate the desired step length based on the velocity command:
(5) |
Here, the remaining step duration is calculated as the user-defined step duration minus the time elapsed since the beginning of the step. At the start of each step, the elapsed time is reset to 0, and it then progresses to the full step duration.
Similarly, we calculate the desired step width based on the step width command :
(6) |
Given the initial (i.e., at the beginning of the step) body position and velocity, we calculate the initial ICP using (2). Then we predict the LIP’s final (i.e., at the end of the step) ICP after the remaining step duration using (4):
(7) |
Based on Fig. 2(b), we observe that and can be readily expressed as follows:
(8) |
By inserting (8) into (7), we obtain the constant offset vector, which, when added to the final ICP, guarantees that the step pattern has the desired step length and width:
(9) |
Then we determine the desired step location by adding this constant offset vector to the final ICP [hof2008extrapolated]:
(10) |
where the step-cycle index is even for the left step and odd for the right step. In the case of turning in the xy-plane, we modify (10) by rotating the constant offset vector:
(11) |
where refers to the rotation angle around z-axis defined by . During the training, we calculate the desired step location only at the beginning of the step when (i.e., ). The detailed values for each user-defined variable are given in Table I.
TABLE I: User-defined variables
Variable | Value |
Velocity commands | m/s |
Step width command | 0.3 m |
Step duration | 0.35 s |
Base height | 0.62 m |
Base heading | |
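The step pattern idea of this section can be illustrated with a small sketch. It follows the constant-offset strategy of [hof2008extrapolated]: the end-of-step ICP is predicted with the closed-form LIP solution, and the next foot is placed at that ICP plus a constant offset. The offset formulas are derived here under stated assumptions (step length equal to the velocity command times the step duration, left foot at positive y, even step index meaning the left foot lands next) and are not claimed to reproduce equations (5)-(11) term by term. All function names are hypothetical.

```python
import math


def step_offset(v_cmd, w_cmd, T, omega):
    """Constant offsets (b_x, b_y) that yield step length v_cmd*T and width w_cmd."""
    E = math.exp(omega * T)
    L = v_cmd * T  # assumed step-length rule
    return -L / (E - 1.0), w_cmd / (E + 1.0)


def next_step(xi0, p, n, v_cmd, w_cmd, T, omega, yaw=0.0):
    """Predict the end-of-step ICP and return (xi_f, next foot position)."""
    E = math.exp(omega * T)
    # Closed-form ICP propagation about the current stance foot p.
    xi_f = (E * (xi0[0] - p[0]) + p[0], E * (xi0[1] - p[1]) + p[1])
    b_x, b_y = step_offset(v_cmd, w_cmd, T, omega)
    sign = 1.0 if n % 2 == 0 else -1.0  # even index: left foot lands next (+y)
    ox, oy = b_x, sign * b_y
    # For turning, the constant offset vector is rotated about the z-axis.
    c, s = math.cos(yaw), math.sin(yaw)
    foot = (xi_f[0] + c * ox - s * oy, xi_f[1] + s * ox + c * oy)
    return xi_f, foot


omega = math.sqrt(9.81 / 0.62)   # base height of Table I used as CoM height
T, v_cmd, w_cmd = 0.35, 1.0, 0.3  # step duration, hypothetical speed, step width

p = (0.0, -0.15)    # right foot is the first stance foot (assumption)
xi = (0.12, -0.10)  # arbitrary initial ICP (hypothetical)
feet = [p]
for n in range(6):
    xi, p = next_step(xi, p, n, v_cmd, w_cmd, T, omega)
    feet.append(p)
```

In this sketch the ICP relative to the new stance foot is reset to the negative offset at every touchdown, so after the first (transient) step the foot placements in `feet` advance by `v_cmd * T` along x and alternate by `w_cmd` along y, regardless of the initial ICP.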
IV RL PROBLEM FORMULATION
In this section, we describe our RL training framework to ensure the robot tracks the velocity commands and the step pattern generated by (11).
Fig. 3 shows an overview of our control framework. Our control policy is a fully connected neural network with 3 hidden layers of 256 nodes each. The policy takes as input the robot states, the step commands derived from our proposed step pattern algorithms, and the user velocity commands, and outputs the desired residual joint PD setpoints. We train our policy in the IsaacGym simulation engine using the PPO algorithm [schulman2017proximal] with 4096 parallel environments and input normalization. Detailed PPO hyperparameters are listed in Table II. We now introduce the state space, action space, and reward formulation for the RL problem.
IV-A State Space
The state space of our policy consists of the observed robot states, step commands, and user-defined velocity commands with a size of . In detail, includes the base height, base linear velocity in the world frame, base angular velocity, projection of the gravity vector in the base frame, left and right foot location and heading in the base frame, left and right desired step location and heading in the base frame, velocity commands, phase clock in sine and cosine functions, joint position, joint velocity. The phase identifiers indicate the swing and stance phase of each foot through the contact scheduler. The base states are measured through the phase-based state estimator [mit-biomimetics_cheetah-software_2023] that assumes the foot contact on the ground at the specified contact schedule.
IV-B Action Space
We define the action space as the desired residual joint PD setpoints , representing a deviation from the nominal joint position for hip yaw, hip abduction, hip pitch, knee and ankle joint respectively. The action from our policy is updated at a frequency of 100 Hz and fed into the joint PD controller. Then, the fixed-gain joint PD controller operates at 1 kHz. To be specific, the joint PD controller uses the following equation to convert the action into the desired torque command:
(12) |
For the joint PD controller’s gains, we have configured to , and to .
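The interplay between the 100 Hz policy and the 1 kHz joint PD loop can be sketched as follows. This is a minimal single-joint illustration, assuming a common residual PD form in which the setpoint is the nominal joint position plus the action and the desired joint velocity is zero. The gains, the toy joint inertia, and the function names are hypothetical placeholders and are not the values used on the robot.

```python
POLICY_HZ = 100                 # policy update rate stated in the text
PD_HZ = 1000                    # joint PD controller rate stated in the text
SUBSTEPS = PD_HZ // POLICY_HZ   # PD updates executed per policy action


def pd_torque(action, q_nominal, q, qdot, kp, kd):
    """Torque for one joint: track (nominal + residual action), zero target velocity."""
    q_des = q_nominal + action
    return kp * (q_des - q) - kd * qdot


def run_policy_step(action, q_nominal, q, qdot, kp, kd, inertia=0.05):
    """Hold one policy action for SUBSTEPS PD ticks on a toy single-joint model."""
    dt = 1.0 / PD_HZ
    for _ in range(SUBSTEPS):
        tau = pd_torque(action, q_nominal, q, qdot, kp, kd)
        qdot += (tau / inertia) * dt  # toy dynamics: inertia * qddot = tau
        q += qdot * dt
    return q, qdot


# Hypothetical usage: a constant residual action of 0.1 rad around a nominal -0.3 rad.
q, qdot = 0.0, 0.0
for _ in range(200):  # 200 policy steps = 2 s
    q, qdot = run_policy_step(0.1, -0.3, q, qdot, kp=30.0, kd=1.0)
```

Holding the action constant between policy updates while the inner loop runs ten times faster is what lets a fixed-gain PD controller reject fast disturbances that the slower policy cannot react to.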
IV-C Rewards
TABLE II: PPO hyperparameters
Parameter | Value |
Horizon (H) | 24 |
Adam learning rate | |
Number of epochs | 5 |
Number of mini-batches | 4 |
Discount ($\gamma$) | 0.99 |
Clipping parameter ($\epsilon$) | 0.2 |
Max gradient norm | 1 |
TABLE III: Regularization rewards
Reward | Weight | Expression |
Joint torques | 1e-4 | |
Joint torque limits | 1e-2 | |
Joint velocity | 1e-3 | |
Joint limits | 10 | |
Action smoothness 1 | 1e-3 | |
Action smoothness 2 | 1e-4 | |
Hip joint regularization | 1.25 | |
Base roll-pitch velocity | 1e-2 | |
Base z-axis velocity | 1e-1 | |
Base tilting | 1 | |
Termination | 100 | |
We formulate the reward structure to ensure the robot tracks the desired step location while maintaining stability and adaptability. Since the desired step location is derived from the LIPM, we incorporate specific rewards that encourage the robot to satisfy the assumptions of the LIPM. To retain the inherent flexibility and adaptability of an RL policy, however, we do not impose explicit rewards for following the LIPM’s CoM trajectory.
The overall reward function is formulated as follows:
(13) |
First, to address the LIPM’s assumption of a constant height, we introduce a reward that encourages the robot to keep a constant base height :
(14) |
Given that the LIPM is represented solely by a point mass and lacks any orientation, it offers no direct control over the robot’s orientation. Therefore, we assume that the robot’s base should consistently orient towards the desired base heading direction. To encapsulate this concept, the reward is designed:
(15) |
The desired step location calculated by step pattern generation algorithms in Sec. III results in the LIPM’s passive dynamics naturally fulfilling the velocity command . Given the robot’s deviation from the LIPM, we implement the velocity tracking reward to ensure tracking of the velocity command :
(16) |
Upon determining the desired step location, the robot must place its foot at this location for the specified step duration. The reward is crafted to incentivize the robot to conform to the contact schedule at the desired step location:
(17) |
Here, and are indicator functions for right and left foot ground contact, respectively. The contact schedule is a continuous function that oscillates between -1 and 1 across each two-step duration :
(18) |
where denotes the elapsed time from the start of the right-foot step, which is reset to 0 every two-step cycle, .
Furthermore, to mitigate undesirable motions, a set of regularization rewards is imposed to penalize excessive joint torques and velocities, unnecessary angular motion, policy termination due to falls, etc. (see Table III). The reward shaping parameter for the exponential function is set to 0.25 during training.
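To make the contact schedule and the exponential reward shaping more concrete, the sketch below shows one plausible realization. The sinusoidal form of the schedule and the squared-error exponential kernel are assumptions chosen to match the description (a continuous signal between -1 and 1 with a period of two steps, and a shaping parameter of 0.25); they are not claimed to be the exact expressions of (17) and (18). The function names are hypothetical.

```python
import math


def contact_schedule(t, step_duration):
    """Smooth phase signal in [-1, 1] with a period of two steps (assumed sinusoid)."""
    return math.sin(2.0 * math.pi * t / (2.0 * step_duration))


def active_step(t, step_duration):
    """Label the half-cycle: time is measured from the start of the right-foot step."""
    return "right" if contact_schedule(t, step_duration) >= 0.0 else "left"


def exp_kernel(error, sigma=0.25):
    """Exponential shaping: 1 at zero error, decaying with the squared error."""
    return math.exp(-(error ** 2) / sigma)


T_s = 0.35                                   # step duration from Table I
mid_right = contact_schedule(0.5 * T_s, T_s)  # peak of the right-foot half-cycle
mid_left = contact_schedule(1.5 * T_s, T_s)   # trough of the left-foot half-cycle
reward_near = exp_kernel(0.05)                # small tracking error [m]
reward_far = exp_kernel(0.50)                 # large tracking error [m]
```

With this kernel, a smaller shaping parameter makes the reward fall off more sharply around zero error, which tightens tracking but provides less gradient far from the target.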
V EXPERIMENT RESULTS
We now present our simulation and hardware test results on the MIT Humanoid to evaluate the effectiveness of our approach. The training process takes about three hours of wall-clock time on an NVIDIA GeForce RTX 3090 GPU.
V-A Simulation Results
1) Velocity tracking performance:

Fig. 4 presents the velocity tracking performance of our method compared to: 1) an End-to-End policy, which is trained to track the velocity commands without foot placement constraints; and 2) a Raibert heuristic [raibert1983dynamically] policy, which replaces the step pattern generation algorithms with the Raibert heuristic. Our method and the Raibert heuristic policy are trained exclusively on flat terrain. Additionally, we train two End-to-End policies: one on flat terrain, and the other on multiple terrains, including flat, rough, and gap-containing terrains. The plot shows that our method exceeds the velocity tracking performance of the End-to-End policy trained across these diverse terrains. It also shows that our tracking accuracy is comparable to that of the End-to-End policy trained solely on flat terrain, which is specialized for this single task. The results indicate that our method reliably tracks velocities up to 2.0 m/s. Furthermore, we observe that, when lateral velocity commands are applied, the robot can execute dynamic maneuvers, including 90-degree and 180-degree turns.
2) Learning desired step duration:

Fig. 5 shows the step duration learned by the policy. Throughout training, the step duration was fixed at 0.35 seconds, which specifies the ground contact duration for a single step. In our setting, a foot is considered to be in contact with the ground if either the toe or the heel is touching the ground. This behavior is encouraged through the contact schedule reward (17). The plot confirms that the policy has learned to maintain ground contact for 0.35 seconds for each leg, alternating between the left and right. This consistent step sequence is subsequently beneficial for employing a phase-based state estimator in hardware deployment.
3) Tracking desired step location:

Fig. 6 shows the robot’s successful tracking of the desired step locations generated by the step pattern generation algorithms. Both the right and left feet follow smooth, regular trajectories that touch down accurately on the target step locations. Notably, the measured CoM trajectory of the robot closely aligns with the analytical LIPM CoM trajectory. This behavior is attributed to the rewards that encourage the robot to satisfy the assumptions of the LIPM.
4) Extension to rough terrain and gap terrain:



To evaluate the adaptability of our policy to unseen and uneven terrains, we conducted tests on both rough and gap terrain (Fig. 7). For the rough terrain (Fig. 7(a)), we dynamically adjusted the desired step location by updating the remaining step duration in equations (5)-(7) at every time step. This approach compensates for the inevitable deviations from the LIP dynamics that arise because rough terrain disturbs the constancy of the robot’s CoM height. Additionally, we set the desired step elevation according to the ground height obtained from a heightmap. We compared our method against an End-to-End policy and a Raibert heuristic policy, both trained on flat terrain, using a success metric defined as the robot’s ability to maintain a predetermined forward velocity command for five seconds without falling. As depicted in Fig. 8, our policy achieved a higher success rate on average. In gap terrain scenarios (Fig. 7(b)), if the desired step location falls into a gap, we adjust it to the closest flat ground using heightmap data. These tests on both rough and gap terrains indicate the robustness and adaptability of our policy: it can track desired step locations that are modified online in response to the terrain, thereby sustaining effective locomotion.
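The gap-terrain adjustment described above can be sketched as a nearest-valid-cell search over a grid heightmap. The grid layout, the height threshold used to classify a cell as a gap, and the function name are assumptions made for illustration; the search returns the ground height as well, since the desired step elevation is also taken from the heightmap.

```python
import math


def nearest_flat_cell(heightmap, target, cell_size, min_height=-0.05):
    """Return (x, y, z) of the non-gap cell closest to target, or None if none exists.

    heightmap[i][j] is the ground height at x = i * cell_size, y = j * cell_size.
    Cells lower than min_height are treated as gaps (assumed threshold).
    """
    best, best_dist = None, math.inf
    for i, row in enumerate(heightmap):
        for j, height in enumerate(row):
            if height < min_height:
                continue  # skip gap cells
            cx, cy = i * cell_size, j * cell_size
            dist = math.hypot(cx - target[0], cy - target[1])
            if dist < best_dist:
                best, best_dist = (cx, cy, height), dist
    return best


# Hypothetical 5 x 3 map with 0.1 m cells and a gap spanning the row at x = 0.2 m.
heightmap = [[0.0] * 3 for _ in range(5)]
heightmap[2] = [-1.0] * 3

desired_step = (0.22, 0.1)  # falls inside the gap
adjusted_step = nearest_flat_cell(heightmap, desired_step, cell_size=0.1)
```

A brute-force search is sufficient for a small local heightmap around the robot; a larger map would call for restricting the search to a window around the desired step location.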
V-B Hardware Results
We successfully transferred the policies developed in simulation to the robot hardware, demonstrating their sim-to-real transfer capability (Fig. 9). The robot was able to maintain a consistent height and precisely track the desired step locations for the given step duration. To compensate for state estimator noise, we dynamically modified the step locations at each timestep. The performance was evaluated through two specific locomotion tasks:




1) Forward walking: We evaluated the robot’s ability to follow a forward velocity command on flat terrain using a treadmill, as shown in Fig. 9(a), confirming its capacity to walk at speeds up to 1.5 m/s. Notably, the robot demonstrated a heel-to-toe motion closely resembling human walking. Despite the noise in the base linear velocity from the state estimator, the policy enabled stable walking while accurately tracking velocity commands, as shown in Fig. 10(a).
2) Dynamic turning: Dynamic locomotion tasks including 90-degree and 180-degree turns were evaluated, with the results showcased in the supplementary video. Due to spatial limitations of the testing area, only small lateral velocity commands could be issued, and the robot could not track these commands precisely. However, the robot was still able to execute stable turns, as demonstrated in Fig. 9(b) and Fig. 10(b).
VI CONCLUSION AND FUTURE WORKS
In this work, we present an approach that combines the LIPM with RL to learn a policy capable of accurately tracking desired step locations determined by LIP dynamics. Specifically, our control framework forward predicts the robot states and determines the desired step location to track a given velocity command based on LIP dynamics. We demonstrated our approach on the MIT Humanoid and confirmed that tracking these steps enables stable forward walking and dynamic turning. The learned policy further showed flexibility and adaptability by following desired steps that are adjusted during the swing phase, which allowed it to extend to unseen and uneven terrains. We deployed our policy on the MIT Humanoid, achieving a forward walking speed of 1.5 m/s and dynamic 90-degree and 180-degree turns.
In future work, our aim is twofold: 1) We plan to incorporate vision algorithms into our system to detect the height of the terrain. This will allow us to identify stable stepping locations, enhancing the robot’s ability to navigate real-world uneven terrain. 2) We aim to refine our method of determining desired step locations by replacing the LIP dynamics with whole-body dynamics, employing a model predictive controller. This refinement is expected to further improve our control framework to predict better step locations across various locomotion tasks.
ACKNOWLEDGMENT
We thank the members of the Biomimetic Robotics Laboratory at MIT for insightful discussions and feedback on the paper. We especially thank Se Hwan Jeon, Elijah Stanger-Jones, and Charles Khazoom for their helpful support in setting up and conducting the hardware experiments.