\equalcont{These authors contributed equally to this work.}

[3]\fnm{Jinliang} \sur{Ding} \equalcont{These authors contributed equally to this work.}

[1]\orgdiv{State Key Laboratory of Synthetical Automation for Process Industries}, \orgname{Northeastern University}, \orgaddress{\street{WenHua Road}, \city{Shenyang}, \postcode{110819}, \state{Liaoning}, \country{China}}

Hierarchical learning control for autonomous robots inspired by central nervous system

\fnm{Pei} \sur{Zhang}, \fnm{Zhaobo} \sur{Hua}

Abstract

Mammals can generate autonomous behaviors in various complex environments through the coordination and interaction of activities at different levels of their central nervous system. In this paper, we propose a novel hierarchical learning control framework that mimics the hierarchical structure of the central nervous system together with its coordination and interaction behaviors. The framework combines active and passive control systems to improve both the flexibility and reliability of the control system and to achieve more diverse autonomous behaviors in robots. Specifically, the framework has a backbone of independent neural network controllers at different levels and adopts a three-level dual descending pathway structure inspired by the functionality of the cerebral cortex, cerebellum, and spinal cord. We comprehensively validate the proposed approach in simulation and in physical experiments with a hexapod robot in various complex environments, including obstacle crossing and rapid recovery after partial damage. This study reveals the principle that governs autonomous behavior in the central nervous system and demonstrates the effectiveness of the hierarchical control approach, whose salient features are the hierarchical learning control architecture and the combination of active and passive control systems.

Figure 1: Overview of the central nervous system and hierarchical learning control framework. a, Structure of the mammalian central nervous system, showing the partition of the cerebral cortex, the internal structure of the spinal cord, and the double-layer structure of the CPG neural circuits. b, Schematic diagram of the proposed hierarchical control framework. The gray nodes in a and the gray box in b represent the sensing mechanism in the nervous system and the control framework, respectively, and are responsible for acquiring sensing signals. In the nervous system, S1 and the visual cortex are mainly responsible for this function. In the control framework, it is provided by sensor measurements. The green nodes and boxes in a and b represent the high-level structures in the nervous system and control framework, respectively, responsible for observing the environment and making decisions. In the nervous system, most cortical regions are responsible for this function. In the control framework, this part is realized by the deep reinforcement learning neural network policy. The yellow nodes and boxes in a and b represent the mid-level structures responsible for coordinating the limbs and generating various motion patterns. In the nervous system, the cerebellum and primary motor cortex are responsible. In the control framework, this part adopts an unsupervised reinforcement learning algorithm and a skill-driven neural network. The purple nodes and boxes in a and b represent the low-level structures responsible for generating and executing motion signals. In the nervous system, the brain stem and spinal cord are responsible. In the control framework, this part is realized by the CPG module, which contains an oscillator and a desired pose solver to provide the desired joint positions and uses the built-in PID feedback loop of the robot to control 18 motors. The solid lines in a connect different nerve regions and represent the information flow, and the thin purple solid line on the right represents the ascending and descending spinal nerves. Dotted lines indicate descending pathway feedback of the CPGs. The solid lines in b represent the action relationship between the sensors and the control signals, and the black dotted lines connect each module to its detailed view. c, Four different indoor obstacle terrain crossing tasks. d, Various new obstacle terrain crossing tasks that were never encountered during learning.

Animals can develop a large number of motor skills to complete various complex tasks, which stem from the hierarchical structure of their central nervous system (CNS) and the complex interaction mechanisms between multiple regions^{[1, 2]}. Studying the hierarchical structure and mechanisms of the central nervous system can help understand the principle of autonomous movement and promote the design of autonomous robot control systems^{[3, 4]}.

Recent approaches to developing autonomous robot control systems inspired by the central nervous system can be divided into two groups: hierarchical model active control approaches based on hierarchical structure, and bionic passive control approaches based on neural mechanisms. The hierarchical model active control approaches aim to obtain a multi-level active control system with high flexibility and adaptability (with conscious participation) by mimicking the hierarchical structure of the nervous system. The internal controllers continuously and actively provide control signals based on real-time feedback to achieve specific goals^{[5, 6, 7, 8, 9, 10, 11]}. These approaches generally include two stages: in the pre-training stage, the low-level controller learns low-level skills; in the task-training stage, the high-level controller learns to use these skills for autonomous decision-making^{[5, 6, 7]}. However, finding effective reusable skills in complex robot systems is difficult. Learning low-level skills from data is a promising method, but the flexibility and versatility of the skills are limited by the quality of the data sets^{[8, 9]}. Collecting high-quality motion data for some special robots (such as insect robots) is also particularly difficult. Unsupervised reinforcement learning methods encourage robots to learn diverse skills through intrinsic objectives and eliminate the dependence on data. However, in complex systems with high degrees of freedom, they usually fail to find useful skills^[10]. Goal-driven methods transform low-level skill learning in complex systems into target tracking tasks by designing the target space^[11]. Still, a manually designed target space (such as plane coordinates and joint angles) limits full exploration during learning. In addition, because continuous decision-making requires real-time data feedback, active control systems rely heavily on sensor information.

The bionic passive control approaches aim to obtain a structurally simple and highly reliable passive control system (unconsciously participated) by simulating various mechanisms within the central nervous system. Through preset intrinsic internal mechanisms, the robot can spontaneously generate various behaviors, thereby overcoming many problems related to hierarchical model approaches. For example, by simulating the conditional reflex neural circuits inside the spinal cord (SC) of vertebrates and the ventral nerve cord (VNC) of invertebrates, control methods based on reflex mechanisms can directly and quickly trigger a series of rhythmic reflex behaviors^{[12, 13]} (such as crawling, etc.) based on sensory information, enabling robots to autonomously adapt to unpredictable irregular terrain^{[14, 15]}. Central pattern generators (CPGs) are another type of neurons or neural circuits that exist within SC or VNC. They can generate periodic motor signals through internal oscillations without a bursting input and can be regulated by a small number of descending neurons in the brain^{[16, 17]}. By simulating this oscillation mechanism, CPG-based control methods have less dependence on sensory information and can enable robots to spontaneously generate rhythmic signals and form various biomimetic motion behaviors^{[18, 19]} (such as swimming, walking, etc.). These methods rely on the inherent internal mechanisms of the system to generate motion behavior in robots, with simple structures and no need for complex real-time calculations. However, the simple structure also limits the flexibility of passive control systems.

Some studies use learning-based methods combined with partial sensory feedback to adjust the internal parameters of CPG controllers and achieve simple target tasks^{[20, 21, 22]} (such as target location navigation and speed tracking). Although this semi-active control system enhances the flexibility of passive control, a control mode that integrates only the CPG and simple descending drives is not conducive to learning complex motion skills and limits the diversity of autonomous behavior.

Among vertebrates, mammals have the most complete hierarchical central nervous system and exhibit highly autonomous, flexible, and reliable motor behaviors. The developed spinal cord and complex reflex circuits in the central nervous system form a passive control system that can generate motion signals, while the highly evolved cerebral cortex forms an active control system capable of processing complex sensory information and making advanced decisions^[1]. Analyzing the hierarchical structure of the central nervous system and the principles of sensory feedback, CPGs, and various descending drives mechanisms, and combining the advantages of active and passive control systems to design an autonomous robot control system, is expected to solve the problems of existing methods. Therefore, based on the analysis of the hierarchical collaboration relationship and functional mechanism of the mammalian central nervous system (Fig. 1a), we propose a hierarchical learning control framework for autonomous robots inspired by the central nervous system (Fig. 1b), which enables robots to acquire diverse and reusable skills, autonomously solve multiple tasks and adapt to unknown environments, and maintain their mobility in the absence of some sensing information. This framework combines the advantages of active and passive control systems, with flexibility and reliability.

Specifically, we make the following contributions to the design of autonomous robot control systems: 1) We analyze the working mechanism of the mammalian central nervous system, abstract it as a three-layer semi-active control system, and describe the division of labor and cooperation relationships at different levels, providing a new perspective for the design of control systems. 2) We design and propose a semi-active hierarchical learning control framework based on the analysis results. The framework uses a learning-based high-level neural network controller to mimic the cerebral cortex and basal ganglia, responsible for actively making decisions through sensory feedback and visual information; a mid-level neural network controller to mimic the cerebellum and primary motor cortex, responsible for learning diverse and reusable skills and coordinating movements; and a low-level CPG module to mimic the brainstem and spinal cord, responsible for passively generating motion signals and being regulated by higher-level controllers. By mimicking the descending drive mechanism of the spinal cord’s lateral and ventromedial pathways, the high-level controller can actively control the lower-level controllers through two control loops, thereby directly or indirectly adjusting the movement pattern. 3) We propose a CPG module embedded with independent phases, which enhances the system’s fault tolerance and reduces the dependence of active control systems on sensing information. 4) We design a generalized skill pre-training method that references the learning mode of cerebellar motor skills, enabling the mid-level neural network controller to effectively explore reusable skills in complex systems with high degrees of freedom without the need for external data. 5) We design a multi-task learning method that references the planning patterns of movement in the cortex and basal ganglia, enabling the high-level neural network controller to quickly make autonomous decisions in various downstream tasks and unknown environments by using learned skills.

Following the proposed framework, we obtain a complete hierarchical controller and apply it to a complex hexapod robot, PHAGE^[23]. We observe that the PHAGE robot exhibits strong autonomous decision-making and flexible movement abilities in various obstacle crossing tasks, as well as rapid adaptive capabilities in unknown situations. Specifically, we design various obstacle crossing tasks, shown in Fig. 1c (climbing stairs, crossing ravines, crossing narrow alleys, climbing slopes), to demonstrate animal-level autonomous decision-making and movement abilities. In addition, we demonstrate the robot’s ability to adapt to new environments and its rapid recovery capability under partial body damage in the various unknown obstacle environments shown in Fig. 1d. We also conduct comprehensive ablation studies in both simulation and real-world environments to evaluate the effectiveness of different components and demonstrate the advantages of the proposed framework.

The organizational structure of this paper is as follows: In the Results section, we show the working principle of the central nervous system and the detailed composition and performance evaluation results of the proposed framework. In the Discussion section, we discuss the similarity between the proposed control framework and the neural system and provide directions for further improvement and prospects for future work. In the Methods section, we provide specific implementation details.

1 Results

We first present the analysis results of the mammalian central nervous system and provide an overview of our proposed framework based on the analysis results. Subsequently, in a bottom-up order, we evaluate the motion generation results of the CPG module within the framework, the skill learning and regulation results of the mid-level controller, and the multi-task learning and decision-making results of the high-level controller to demonstrate the effectiveness of different core components of the framework and the excellent autonomous decision-making and motion control capabilities of the proposed control method. Finally, through the rapid motion recovery effect of the robot under limb amputation conditions, as well as the rapid adaptation results of the robot in various unknown environments, we demonstrate that the proposed control framework has less dependence on sensor information and strong adaptive ability.

1.1 Central nervous system inspired hierarchical learning control framework

Combined with related research, we analyze the mammalian central nervous system. As shown in Fig. 1a, the central nervous system of mammals (taking humans as an example) is a three-level control system from the cerebral cortex to the spinal cord^{[1, 2]}. The highest level includes most cortical regions, such as the prefrontal cortex, premotor area (PMA), supplementary motor area (SMA), posterior parietal cortex, and basal ganglia. They are responsible for processing various information and generating high-level decisions. First, the ascending nerves of the spinal cord will project proprioception of the body to the primary somatosensory cortex (S1), and the visual nerves will project visual information of the retina to the visual cortex. Areas 5 and 7 of the posterior parietal cortex process proprioception from S1 and visual information from the visual cortex, respectively, and generate abstract perceptual information. Subsequently, the prefrontal cortex decides what action to take according to the perceptual information, and then the basal ganglia enhances these motor intentions through the internal direct pathway while inhibiting inappropriate motor programs through the indirect pathway^{[24, 25]}. This information mainly converges to the SMA of the cortex through the thalamus. Then, the SMA and PMA jointly carry out motor planning and control distal and postural muscles. The control information travels down the spinal cord through two main pathways^{[2, 26, 27]}. On the one hand, through the ventromedial pathway, the high-level information is processed by the mid-level CNS composed of the cerebellum and primary motor cortex (M1) and transmitted to the brain stem and spinal cord at the lowest level to control the coordinated movement of limbs. 
In the above process, proprioception and decision-making information are projected from the pontine nucleus to the lateral cerebellum and then back to M1 through the thalamic ventral lateral nucleus (VLC) to generate finer motor signals and transmit them to the brainstem. Then, the postural muscles are controlled by spinal tracts such as the reticulospinal tract of the ventromedial column of the spinal cord. On the other hand, signals from SMA and PMA directly control the distal muscles through the corticospinal tract of the lateral column of the spinal cord. These afferent feedbacks will regulate CPGs neural circuits in the spinal cord, generating complex rhythmic signals to control the formation of autonomous movement of muscles. CPGs in the spinal cord have two levels of organization, in which the half-center rhythm generator layer (RG layer) is responsible for generating motor rhythms. The pattern formation layer (PF layer) is responsible for determining the exact form of muscle activation signals, mobilizing motoneurons to control muscles through spinal nerves^[28].

According to the above hierarchical structure and functional division of the central nervous system, we design the hierarchical learning control framework shown in Fig. 1b. The framework contains three levels of controllers from high to low. The high-level controller, which simulates the operating principle of the cortex and basal ganglia, is responsible for processing the proprioception and visual information provided by the sensors and making active and rapid decisions. Its internal convolutional neural network and somatosensory neural network process the depth image and proprioception, respectively, to analyze the environmental features and infer the next movement direction. The subsequent neural network processes these perceptual signals and provides high-level decisions. These decision vectors are divided into skill vectors and morphological parameters. The length of the skill vector determines the frequency of movement and simulates the regulation of voluntary movement by the basal ganglia. The decisions descend to the bottom CPG module through two control loops. The skill vectors are sent to the mid-level controller through the ventromedial loop. Inside the mid-level controller, the internal neural network combines the skills and proprioception to calculate the differential control signals. These signals are then sent to the CPG module, which generates various coordinated motion patterns. The morphological parameters are directly transmitted to the CPG module through the lateral loop, thus directly changing the shape of the limbs (such as width and height). The CPG module comprises an oscillator and a desired pose solver, which simulate the functions of the RG and PF layers, respectively. The oscillator generates varying amplitude and phase signals, and the desired pose solver maps the amplitude and phase information into Cartesian space. The desired angle signals of the motors are then calculated by inverse kinematics to control the position of the motors.

Based on the above hierarchical structure, we further analyze the autonomous behavioral learning mode of the central nervous system. Due to the role of the genetic system, CPGs in the spinal cord are encoded at birth^[29], and rhythmic movement can be generated without acquired learning. The cerebellum has the ability for motor balance and coordination control. When learning new actions, these abilities are repeatedly strengthened, eventually leading to unconscious motor skills that can be repeatedly called by cortical regions^[30]. Based on the above characteristics, mammals can devote more energy to the decision-making and learning process of the cerebral cortex on tasks to quickly solve various tasks and adapt to emergencies and new environments autonomously.

According to the characteristics of CPGs, we design a CPG module to generate autonomous motion signals without learning. We introduce the pre-encoded independent tripod gait phase to generate a stable phase signal. Based on the skill learning mechanism of the cerebellum, we propose a skill pre-training method based on unsupervised reinforcement learning and the CPG module, which can make the mid-level controller learn a variety of coordinated motor skills and can be repeatedly called by the high-level controller. According to the decision-making mode of the cerebral cortex on motion, we propose a two-stage multi-task learning method. The high-level controller can acquire the guided multi-task decision-making ability through reinforcement learning. Then, through distillation learning, the high-level controller learns to independently infer environmental features and motion direction from visual information to complete various tasks. This hierarchical independent learning mode can enhance the autonomy of the control system (the absence of higher levels will not affect the operation of lower levels), reduce the dependence on sensor information (the CPG module can also generate motion signals from internal oscillation in the case of sensor failure), and generate autonomous motion to cope with a variety of situations (solve multi-tasks and adapt to new environments).

1.2 Results of the CPG module’s motion generation

The purple box in Fig. 1b shows the structure of the proposed CPG module. We use the amplitude-phase oscillator to simulate the RG layer and the desired pose solver to simulate the PF layer, and the high-level and mid-level controllers can regulate them directly and indirectly.

The oscillator consists of six elements, one acting on each of the robot’s six legs. We refer to the stable gait mode of insects^[31] (Fig. 2a,b) and design the independent tripod gait phase $\bm{\alpha}$, which is affected only by the maximum oscillation frequency $\omega_{m}$, as the fixed component of the mixed phase $\bm{\varphi}$. The adjustable phase $\bm{\theta}$ provides the other component of $\bm{\varphi}$. This mixed phase enhances the autonomy of the CPG module while retaining flexibility. Even if the higher-level control information is missing, the independent phase component ensures that the CPG module produces a stable forward or backward gait (Fig. 2c). In this mode, the oscillator generates a regular amplitude signal $\bm{r}$, the tripod gait phase $\bm{\alpha}$, and the adjustable phase $\bm{\theta}$ (Fig. 2e). The amplitude $\bm{r}$ and adjustable phase $\bm{\theta}$ can be adjusted by $\bm{\mu},\bm{\omega}$, which the mid-level controller generates based on sensory feedback, so that the CPG module produces more complex gaits (the range of $\bm{\mu},\bm{\omega}$ in the figure is mapped to $[-1,1]$). Compared with the single-phase mode^{[20, 21, 22]}, the introduction of the independent phase enables the oscillator to generate a stable phase signal even when the higher-level signal takes a boundary value (such as $\omega_{i}=-1$, light blue curve). This simple passive control module enhances the system’s fault tolerance and further reduces its dependence on sensory information. In addition, the overall phase signals are adjusted by the maximum frequency $\omega_{m}$, which controls the overall speed of the robot. During 12-24 s, $\omega_{m}$ decreases, and the frequency of the three phases decreases simultaneously (Fig. 2e).
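The mixed-phase idea can be illustrated with a minimal sketch. The oscillator equations are not given in this section, so the code below assumes a commonly used amplitude-phase form (a critically damped second-order amplitude and a first-order phase) and assumes that the mixed phase is the sum of the independent tripod phase and the adjustable phase. The leg order, the tripod grouping, the gain `a`, and the function names are hypothetical choices for illustration, not the implementation described in the Methods.

```python
import math

TWO_PI = 2.0 * math.pi
# Hypothetical leg order and tripod grouping: (LF, RM, LH) versus (RF, LM, RH).
LEGS = ["LF", "RF", "LM", "RM", "LH", "RH"]
TRIPOD_OFFSETS = [0.0, math.pi, math.pi, 0.0, 0.0, math.pi]


def init_state():
    """Amplitudes at rest, adjustable phases at zero, tripod offsets preset."""
    return {"r": [0.0] * 6, "dr": [0.0] * 6,
            "theta": [0.0] * 6, "alpha": list(TRIPOD_OFFSETS)}


def step(state, mu, omega, omega_m, dt=0.01, a=50.0):
    """One Euler step of six amplitude-phase oscillators (assumed dynamics)."""
    nxt = {"r": [], "dr": [], "theta": [], "alpha": []}
    for i in range(6):
        # Amplitude converges to the set point mu[i] (critically damped).
        ddr = a * (a / 4.0 * (mu[i] - state["r"][i]) - state["dr"][i])
        dr = state["dr"][i] + ddr * dt
        nxt["dr"].append(dr)
        nxt["r"].append(state["r"][i] + dr * dt)
        # Adjustable phase is driven by the mid-level command omega[i].
        nxt["theta"].append((state["theta"][i] + omega[i] * dt) % TWO_PI)
        # Independent tripod phase depends only on omega_m.
        nxt["alpha"].append((state["alpha"][i] + omega_m * dt) % TWO_PI)
    return nxt


def mixed_phase(state):
    """Mixed phase as the sum of fixed and adjustable components (assumption)."""
    return [(al + th) % TWO_PI for al, th in zip(state["alpha"], state["theta"])]
```

Under these assumptions, setting every `omega[i]` to zero (no higher-level input) still advances `alpha` at the rate `omega_m`, so the two leg groups stay half a cycle apart, which mirrors the fallback tripod gait described above.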

According to the amplitude and mixed-phase signals, the desired pose solver calculates the desired Cartesian coordinates of the end of each leg and converts them into joint commands through inverse kinematics. Fig. 2e shows the Z- and X-coordinate curves and the smooth joint control signals. In addition, by changing the morphological parameters $\bm{m}_{p}=\{l,h,g_{p},g_{c},w_{y}\}$ (step length, height, support height, swing height, width) through the high-level controller, the robot can also adopt different shapes (Fig. 2d).
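A minimal sketch of how a pose solver of this kind might map amplitude and phase to a foot target is given below. The mapping is an assumption borrowed from common CPG foot-trajectory parameterizations (a cosine sweep along X and a two-sided sine profile along Z); the solver actually used here is specified in the Methods and may differ. The function name `foot_target` and the `side` argument are hypothetical, and inverse kinematics is omitted because it depends on the leg geometry.

```python
import math


def foot_target(r, phi, l, h, g_p, g_c, w_y, side=1.0):
    """Map amplitude r and mixed phase phi to a foot position in the leg frame.

    l: step length, h: body height, g_p: support (stance) height,
    g_c: swing height, w_y: lateral width; side is +1 (left) or -1 (right).
    The formulas are illustrative assumptions, not the paper's solver.
    """
    x = -l * r * math.cos(phi)           # fore-aft sweep scaled by amplitude
    y = side * w_y                       # lateral offset sets the body width
    s = math.sin(phi)
    # Swing half-cycle (s > 0) lifts the foot; stance half-cycle presses down.
    z = -h + (g_c if s > 0.0 else g_p) * s
    return x, y, z
```

In this sketch, reducing `w_y` pulls the feet toward the body and increasing `h` lowers the feet relative to the body, which is how a change in the morphological parameters would alter the robot's shape.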

The results show that the CPG module is semi-autonomous and can generate stable motion mode by independent phase without external feedback. The incoming signals from higher-level controllers can affect the module, thus generating more complex control signals. These characteristics are the basis of subsequent multi-level active control.

1.3 Results of the mid-level controller’s skill learning and control

The mid-level controller is a reinforcement learning policy composed of neural networks, responsible for the robot’s active skill regulation and gait control. It receives proprioception and skill vectors from the high-level controller. Then, it generates various differential parameters $\bm{\mu},\bm{\omega}$ for the oscillator, thus adjusting the CPG module to generate gait signals of different modes. The maximum internal frequency $\omega_{m}$ is proportional to the skill length $\|\bm{z}\|$, which allows the robot’s speed to be adjusted. Combining the unsupervised reinforcement learning algorithm^[32] and the CPG module, we propose a skill pre-training method that learns the mid-level policy in simulation to find diversified motor skills (Fig. 3a). The CPG module uses a low-dimensional parametric action space to generate high-dimensional joint control signals, so that unsupervised learning can discover useful skills in robot systems with high degrees of freedom. (The learning process is described in the Methods.)

Fig. 3b shows the movement trajectories of the robot in the XoY plane within 30 seconds under the influence of different skills. To test the control effect of the mid-level policy, we select 24 skills with three different lengths in 8 different directions to drive 1024 robots with different morphological parameter combinations. Fig. 3b shows that the mid-level policy can generate movement trajectories in different directions under any combination of morphological parameters according to different skill vectors. As the skill length $\|\bm{z}\|$ increases, the motion frequency of the robot’s legs also increases, resulting in a longer travel distance. Fig. 3c shows the gait maps corresponding to four skill vectors $\bm{z}_{1},\bm{z}_{3},\bm{z}_{5},\bm{z}_{7}$ with a length of 1. The mid-level policy produces various gait patterns under different skill vectors. For example, the robot switches between biped and tripod gaits under $\bm{z}_{1}$. Under $\bm{z}_{3}$, the robot learns to switch between quadruped, tripod, and biped gaits (Supplementary Section 6 further visualizes and analyzes the discovered skills). The results show that, without external data, the mid-level controller can effectively learn diversified skills and coordinate the limbs to produce multiple gaits and trajectories, like the cerebellum. The higher-level controller can later reuse these bounded continuous skills for rapid decision-making in various situations.
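The test set of 24 skills (three lengths in 8 directions) can be generated as in the following sketch. The dimensionality of the skill space and the three length values are not stated in this section, so the code assumes a two-dimensional skill vector and uses placeholder lengths; both are hypothetical.

```python
import math


def skill_grid(n_dirs=8, lengths=(0.5, 0.75, 1.0)):
    """Return 2-D skill vectors: n_dirs headings at each skill length.

    The 2-D skill space and the length values are illustrative assumptions.
    """
    skills = []
    for length in lengths:
        for k in range(n_dirs):
            heading = 2.0 * math.pi * k / n_dirs
            skills.append((length * math.cos(heading),
                           length * math.sin(heading)))
    return skills


def skill_length(z):
    """Euclidean norm of a skill vector."""
    return math.hypot(z[0], z[1])
```

With the default arguments, `skill_grid()` returns 24 vectors whose norms take the three chosen lengths, so direction and speed can be varied independently when probing the mid-level policy.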

1.4 Results of the high-level controller’s multi-task learning and decision

The high-level controller is also a reinforcement learning policy composed of neural networks. It processes the robot’s proprioception and environmental information and then gives the appropriate high-level decision $\bm{a}_{h}=[\delta\bm{m}_{p},\bm{z}]$ to actively command the lower levels to produce a movement pattern suited to the obstacles. $\delta\bm{m}_{p}$ directly changes the morphological parameters of the CPG module through the lateral loop. At the same time, $\bm{z}$ controls the mid-level controller to change the internal rhythm of the oscillator through the ventromedial loop. It is worth noting that $\|\bm{z}\|$ determines the maximum gait frequency $\omega_{m}$ of the robot, and this information is also transmitted directly to the CPG module through the lateral loop, which reflects the regulatory function of the basal ganglia on the willingness to perform voluntary movement.
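The routing of the decision vector through the two loops can be sketched as follows. The ordering of the entries in $\bm{a}_{h}$, the clipping bounds, and the proportionality gain `k_omega` are assumptions made for illustration; only the split into $\delta\bm{m}_{p}$ and $\bm{z}$ and the dependence of $\omega_{m}$ on $\|\bm{z}\|$ come from the text.

```python
import math

MORPH_KEYS = ("l", "h", "g_p", "g_c", "w_y")


def route_decision(a_h, m_p, bounds, k_omega=1.0):
    """Split a_h = [delta_m_p, z] and route it to the two descending loops.

    bounds and k_omega are hypothetical; the real limits are robot-specific.
    """
    delta = a_h[:len(MORPH_KEYS)]
    z = list(a_h[len(MORPH_KEYS):])
    new_m_p = {}
    for key, d in zip(MORPH_KEYS, delta):
        lo, hi = bounds[key]
        # Lateral loop: the increment acts on the CPG morphology directly.
        new_m_p[key] = min(max(m_p[key] + d, lo), hi)
    # The skill length sets the maximum gait frequency.
    omega_m = k_omega * math.sqrt(sum(c * c for c in z))
    return {"lateral": {"m_p": new_m_p, "omega_m": omega_m},
            "ventromedial": {"z": z}}
```

The `ventromedial` entry would then be passed to the mid-level policy, while the `lateral` entry bypasses it and reaches the CPG module directly.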

We propose a two-stage multi-task learning method in which the mid-level policy remains unchanged after pre-training, so that only the high-level policy needs to be learned and learning is fast. In the first stage, we use the reinforcement learning algorithm to train multiple high-level policies in the multi-task target simulation environment (Fig. 4b), allowing them to make decisions quickly. In the second stage, we use the distillation learning algorithm in the multi-task visual simulation environment to distill the learned high-level policies into a student policy, so that it can independently infer the environmental features and target heading directions from the depth image and then give appropriate high-level decision instructions (the two-stage learning process is described in the Methods section). After completing the learning in the simulation environment, the student policy is directly deployed to the physical robot to perform four kinds of obstacle terrain crossing tasks autonomously.
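The second stage can be illustrated with a toy distillation loop in which a student that sees a different observation learns to reproduce a teacher's actions by regression. This is a generic behavior-cloning sketch under strong simplifying assumptions (a linear teacher, a linear student, and hand-made observations); the teacher, the observation model, and all function names are hypothetical stand-ins and do not reproduce the distillation algorithm described in the Methods.

```python
import random


def teacher_action(privileged):
    """Hypothetical teacher: acts on privileged state (p0, p1)."""
    p0, p1 = privileged
    return 0.8 * p0 - 0.3 * p1


def student_observation(privileged):
    """Hypothetical stand-in for features the student extracts from depth images."""
    p0, p1 = privileged
    return (p0 + p1, p0 - p1)


def distill(n_samples=50, epochs=200, lr=0.1, seed=0):
    """Fit a linear student to the teacher's actions by stochastic gradient descent."""
    rng = random.Random(seed)
    states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_samples)]
    data = [(student_observation(s), teacher_action(s)) for s in states]
    w = [0.0, 0.0]
    for _ in range(epochs):
        for obs, target in data:
            # Squared-error imitation loss between student and teacher actions.
            err = w[0] * obs[0] + w[1] * obs[1] - target
            w[0] -= lr * err * obs[0]
            w[1] -= lr * err * obs[1]
    return w
```

In this toy setting the teacher's action is exactly expressible in the student's observation space, so the student weights approach 0.25 and 0.55; with real depth images the student would instead be a neural network trained on the same kind of imitation loss.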

We use different baselines for the first learning stage and conduct tests with increasing difficulty in the simulation (see Supplementary Section 2 for the configuration of different difficulties of each task) to evaluate the influence of different components of the proposed control framework on the control effect (including the introduction of the independent phase, hierarchical structure, and dual descending pathway). Fig. 4c shows the success rate of each baseline. The settings of each baseline are as follows: 1) NoAlpha: This removes the stable tripod gait phase in the CPG module, leaving the rest of the structure unchanged. 2) NoVpath: This removes the mid-level policy and the ventromedial pathway, and the high-level policy directly provides all control signals reaching the CPG module through the lateral loop. 3) NoLpath: This removes the lateral loop. The morphological parameters $\bm{m}_{p}$ cannot be adjusted.

Under the most difficult conditions of each terrain, the proposed method has a higher success rate (10%-80%). NoAlpha performs the worst. Lacking a stable gait, the robot needs to learn to crawl first, so its learning efficiency is the lowest under the same number of iterations. NoVpath performs better than the other baselines. Although it lacks the gait adjustment function of the mid-level policy, it can directly adjust the morphological parameters according to the terrain to change the robot’s shape to deal with obstacles. However, the robot must relearn the gait adjustment method when facing more difficult obstacles because it lacks reusable motion skills. Therefore, its performance decreases rapidly as the difficulty increases. NoLpath is inferior to NoVpath. Without the lateral loop, the robot cannot directly adjust its shape. This defect is most obvious in the narrow alley task, where the robot cannot change its width, resulting in a low success rate. In addition, we compare with the state-of-the-art double-layer CPG-RL algorithm^[22]. Since it has no mid-level gait adjustment network, this method behaves similarly to NoVpath, and its success rate drops sharply as the difficulty increases.

Fig. 4d visualizes the changes of the five morphological parameters and the maximum frequency $\omega_{m}$ while the physical robot passes through the narrow alley. At the 2nd second, the robot detects the obstacle ahead and starts to move. From 6 to 13 seconds, the high-level controller analyzes the characteristics of the obstacle and makes the robot curl up and enter the alley. Its width $w_{y}$ decreases rapidly, its height $h$ increases, and its $\omega_{m}$ is high, making it move quickly in small steps. The squatting posture lasts until about the 34th second, after which the robot walks out of the alley and the width returns to normal. (Supplementary Section 3 provides motion analysis of other tasks.)

The above results show that the robot can use the learned skills to solve multiple tasks independently with the cooperation of high-level and lower-level controllers. The dual descending pathway structure enables the robot to have the ability of rapid decision-making and fine adjustment. The high-level controller can directly and quickly adjust the robot’s shape (width, height, etc.) through the lateral loop according to the environmental information and provide skill instructions to the mid-level controller through the ventromedial loop to produce accurate gait and motion trajectory. The reusable generalization skills of the mid-level policy improve the learning speed of multiple tasks, while distillation learning enables the robot to have the ability of autonomous reasoning.

1.5 Results of rapid recovery of movement in case of limb damage

Animals exhibit rapid motor recovery after injury through various compensatory behaviors^[33]. Within the bionic hierarchical structure, our method retains the reliability of passive control systems and enables the robot to swiftly relearn its high-level controller in the event of limb damage, thereby coordinating the different levels to restore its motion ability. The semi-autonomous CPG module further diminishes the control system’s reliance on sensor information, enhancing its robustness and resilience.

To simulate the effect of limb damage, we randomly disable the control and feedback signals of a leg of the hexapod robot. Fig. 5a demonstrates the robot’s adaptability in crossing a gap (difficulty 0.6) when the robot’s front, middle, and hind legs are broken, respectively. Regardless of which leg fails, the controller effectively guides the robot to adapt and utilize the remaining healthy legs to cross the gap and maintain a stable posture during crawling. Fig. 5b further illustrates the gait changes of the robot in this task. After the failure of the middle leg, the other five legs of the robot adjust their gait and redistribute their workload to complete the obstacle crossing. Overall, the robot’s gait frequency decreases, and the remaining legs have a longer time to swing and stance each time to maintain a stable crawling posture. Fig. 5c presents the changes in X-coordinate at the ends of all legs of the robot during this process. In the healthy state, the trajectories of the robot’s right front leg (RF), left middle leg (LM), right middle leg (RM), and left hind leg (LH) are very close, resulting in a coordinated pace. This coordination relationship is redistributed after the LM leg breaks down, leading the robot to adopt a new motion mode. Supplementary Section 4 provides a more detailed analysis of the impact of the different positions of the broken limb on recovery ability.

1.6 Results of unknown environment adaptation

After entering the unknown environment, animals can recombine the learned knowledge and solve new tasks instead of learning from scratch. The high-level active control system of the proposed method also has this advantage. As shown in Fig. 6a, the hierarchical controller trained in the simulation environment is directly deployed to the robot system without modification, and its adaptability is tested in various unknown terrains (see Supplementary Video S5 for relevant videos). The robot can move independently on the outdoor dirt ground and cross the stone path to the opposite side through the high-level controller’s autonomous inference and decision-making ability. It can climb the stone bridge steeper than the learned slope task. It can adjust its shape to drill into an irregularly shaped cave and quickly climb the curb. Moreover, it can navigate autonomously in multi-objective combined terrain without being affected by limb fracture (Supplementary Section 10 introduces the composition of the combined terrain and further analyzes the motion effect of the robot).

Fig. 6b shows how the morphological parameters adjusted by the high-level controller change over time while the robot climbs the curb. After seeing the stone curb, the robot decides to reduce its height and increase its width to ensure stability during climbing. At the same time, it increases $g_{c}$ so that the legs can be lifted higher to climb the stone platform. After the robot has climbed the stone platform, the maximum frequency $\omega_{m}$ decreases, keeping the motion activity low to save energy. Fig. 6c shows how the skill vector $\bm{z}$ generated by the high-level controller changes with the number of decision steps in this process. Thanks to the hierarchical structure of the proposed method, the high-level controller samples reusable skills from the skill space according to the environmental information at different times and combines them into continuous signals that guide the lower-level controllers to generate an appropriate gait and trajectory for the new environment.

2 Discussion

We propose a general hierarchical learning control framework based on the structure and function of the central nervous system, which theoretically integrates active and passive control systems, and we demonstrate in practice how to use hierarchical semi-active control systems to enhance the autonomous movement ability of robots. This framework provides new ideas and methods for the design and application of future robot control systems. It draws on the structure and mechanisms of the mammalian central nervous system and includes three interdependent hierarchical controllers, which can learn various motor skills and enable robots to use sensory feedback to produce autonomous motor behavior. Taking the complex hexapod robot system as the experimental object, we show that this method can be combined with a reinforcement learning algorithm to quickly learn the controller in the simulation environment and directly deploy it to the robot system to handle complex tasks and independently solve new tasks. The low-level CPG module can independently generate a stable motion mode and can be adjusted by the higher levels. The mid-level controller can learn various skills to be repeatedly called by the high-level controller, while the high-level controller can quickly learn to handle various tasks. The dual descending pathway structure between the higher and lower controllers realizes fast and accurate motion control.

This design concept based on the structure and mechanisms of the central nervous system significantly enhances the transparency and interpretability of the proposed control framework, giving it the advantages of the mammalian hierarchical nervous system^[1]. Different sensory signals are specially processed at different levels. The high-level controller processes visual information and proprioception, while the mid-level controller only focuses on proprioception. This specialized processing method can promote the controller to fine-tune future tasks, reflecting the information factorization ability of the nervous system. The lower-level controllers can run semi-autonomously. Even without the high-level controller, the mid-level controller and the CPG module can produce many motion modes. The CPG module can even produce a stable gait without any external information. This passive control mode can reduce the interference between different levels, enhance the system’s robustness, and reflect partial autonomy. The high-level controller can repeatedly call the mid-level and low-level controllers without relearning, which saves the learning cost and reflects the amortized control. The reward functions of the mid-level controller and the high-level controller adopt different settings. The mid-level controller adopts the intensive skill information reward, while the high-level controller uses the goal-conditioned global reward, reflecting the principle of modular objectives. The output of the high-level controller reaches the CPG module through two loops respectively, which is responsible for changing the overall shape and gait of the robot. Depending on the coupling relationship within the oscillator, the CPG module controls all limbs of the robot rather than independently controlling all joints. This coupling control can enhance the robot’s motion coordination ability, reflecting the multi-joint coordination. 
The high-level controller generates control signals on a coarser time scale, improving decision-making efficiency. In comparison, the lower-level controllers generate accurate periodic signals on a finer time scale, reflecting the temporal abstraction. We believe that the research results are helpful to the development of more complex artificial intelligence systems. By studying the hierarchical control principle generated by the system, it is also helpful to understand the details of biological motion control.

In ref [34], a modular motion controller structure based on CPGs is proposed. It contains multiple passive control modules at different levels. Each module can be learned separately, and different combinations of modules allow a hexapod robot to produce versatile behaviors that solve many tasks. However, the combination of modules still needs to be provided by the operator according to the actual situation, and this design logic weakens the autonomous movement ability of the robot. Ref [6] proposed a three-layer generative model structure based on the principles of motion systems to simulate the deep temporal structure of human motion control, achieving multi-task active control of a humanoid robot in the simulation environment. However, this method relies on real-time feedback to generate control signals, and sensor or actuator failures can lead to unpredictable consequences. Our hierarchical learning control framework combines the advantages of the passive modular control structure and the hierarchical active control structure. The passive motion generation mode of the CPG module, modeled on the spinal cord, reduces sensory dependence, while the hierarchical active control structure, modeled on the central nervous system, promotes the generation of autonomous behavior.

As mentioned above, the method proposed in this work improves the autonomous movement ability of robots and inherits desirable characteristics of the central nervous system. However, the absence of external force feedback from the foot or body may adversely affect the motion behavior when an irregular external force is applied to the robot. This disadvantage can be addressed by introducing a local reflex loop to the CPG module^{[35, 36]}. With a local conditioned reflex, the robot can respond to external interference automatically. On the other hand, although the CPG controller can produce smooth rhythm signals to enhance the stability of motion, it also makes it difficult for the robot to complete some explosive actions (such as jumping and galloping). One solution is introducing bias into the PF layer or directly outputting the joint position^{[37, 38]} to generate a burst control signal. Another solution is to use dynamic motion primitives (DMPs)^[39] instead of the CPG module. DMPs have similar functions to CPGs and can generate discrete or rhythmic actions, and they are often used for trajectory generation^{[40, 41]}. This discrete control signal is expected to make the robot produce more agile and powerful behavior.

The proposed control framework integrates the hierarchical structure and principles of the mammalian central nervous system. However, the cognitive and reasoning ability of the high-level decision controller can be further improved. In future work, large visual or language models can be integrated into the decision-making level to upgrade the high-level controller^{[42, 43, 44, 45]}. On the other hand, combining the mechanical model or structural features of robots to give them reconstruction capabilities may lead to stronger robustness^{[33, 46]}. In addition, the agile movement of animals also comes from their precise body mechanics. Through innovation in the mechanical structure of the robot, its movement ability can be further improved. For example, changing the position of the motor joints can bring the mass ratio of the joints closer to that of the biological structure. Using “artificial muscle” instead of motors can make the mechanics of the robot more biologically compatible^{[47, 48]}.

This work generally shows how to develop a hierarchical learning control framework to help robots adapt to challenging environments by analyzing the mammalian central nervous system’s hierarchical structure and mechanisms. It provides theoretical practice and engineering experience for designing an autonomous controller for a future robot system and further promotes the landing and application of a bionic autonomous controller.

3 Methods

This section describes in detail the composition and learning methods of each part of the proposed hierarchical learning control framework and the application process for the hexapod robot (see Supplementary Section 1 for details of the robot platform and physical simulation). The CPG module includes the oscillator and the desired pose solver. We introduce the differential equations of the oscillator and the internal stable phase embedding method, then show the desired pose solver’s signal adjustment process and the robot motor’s control method. Next, we give a detailed description of the pre-training method of the mid-level controller. Using the learned mid-level controller, we show how to obtain a high-level controller with autonomous decision-making ability through the two-stage learning process.

3.1 Half-center rhythm generator layer

To generate the basic motion rhythm signal, we use the Hopf oscillation differential equations^{[20, 49]} to implement the RG layer of CPGs. The following first-order differential equations give the dynamic system:

\begin{aligned} \dot{r}_{i}&=v_{i}\\ \dot{v}_{i}&=\frac{a^{2}}{4}(f({\mu_{i}})-r_{i})-av_{i}\\ \dot{\theta}_{i}&=f({\omega_{i}})\\ \dot{\alpha}_{i}&=\frac{1}{2}{\omega_{m}}+\sum_{j}{r_{j}m_{ij}\sin(\alpha_{j}-\alpha_{i}-\Psi_{ij})}\end{aligned},

(1)

where $r_{i}$ is the amplitude of the oscillator, $v_{i}$ is the time derivative of the amplitude, $\theta_{i}$ is the adjustable phase, $\alpha_{i}$ is the tripod gait phase, $a$ is a positive constant representing the convergence factor, $\mu_{i}$ and $\omega_{i}$ are the amplitude and phase adjustment factors, and $\omega_{m}$ is the maximum oscillation frequency. The oscillator finally produces a mixed phase $\varphi_{i}=(\alpha_{i}+\theta_{i})$, where $i=1,2,...,6$ is the index of each leg. The coupling weight and bias between oscillation elements are $m_{ij}$ and $\Psi_{ij}$. They form an additive coupling term that generates an independent tripod gait for the robot, where $m_{ij}=1$ and the bias matrix $\bm{\Psi}$ is given by

\bm{\Psi}_{6\times 6}=2\pi\left[\begin{array}{cccccc}0&0.5&0&0.5&0&0.5\\ -0.5&0&-0.5&0&-0.5&0\\ 0&0.5&0&0.5&0&0.5\\ -0.5&0&-0.5&0&-0.5&0\\ 0&0.5&0&0.5&0&0.5\\ -0.5&0&-0.5&0&-0.5&0\\ \end{array}\right].

(2)

Due to the coupling term, the left front leg (LF), the left hind leg (LH) and the right middle leg (RM) of the robot form one group with the same $\alpha$, while the other three legs form a second group whose $\alpha$ lags by $\pi$ rad. This setting makes the six legs form a tripod gait. On this basis, the mid-level controller can adjust the $\mu_{i},\omega_{i}$ of each leg to directly change the amplitude $r_{i}$ and adjustable phase $\theta_{i}$ of the oscillator, and thereby adjust the mixed phase $\varphi_{i}$ so that the CPG module produces different gaits. $f(\mu_{i})$ and $f(\omega_{i})$ give the internal natural amplitude and frequency, where $f(\mu_{i})=\mu_{min}+\frac{\mu_{i}+1}{2}(\mu_{max}-\mu_{min})$ and $f(\omega_{i})=\omega_{min}+\frac{\omega_{i}+1}{2}(\omega_{max}-\omega_{min})$. They map $\mu_{i}\in[-1,1]$ to $[\mu_{min},\mu_{max}]$ with $\mu_{min}=1$, $\mu_{max}=4$, and $\omega_{i}\in[-1,1]$ to $[\omega_{min},\omega_{max}]$ with $\omega_{min}=0$, $\omega_{max}=\omega_{m}=u(\|\bm{z}\|)\Omega$. Here $u$ is a linear mapping that maps $\|\bm{z}\|\in[0,1]$ to $[0.2,1.0]$, and $\Omega$ is fixed at $8\pi$ Hz. This ensures that $\omega_{m}$ is always positive, so that the independent tripod gait phase $\alpha$ is not affected by any external factors and always produces periodic tripod gait signals. This differs from previous work^{[20, 21, 22]}, which adds the external feedback signal $\omega_{i}$ and the coupling term directly and takes their sum as the derivative of a single phase. When the feedback signal takes a boundary value (such as $f(\omega_{i})=0$), the coupling term alone cannot make the phase oscillate periodically, which makes the oscillator invalid.
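The input mappings described above can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the authors' implementation: the function names are hypothetical, and clamping $\|\bm{z}\|$ to $[0,1]$ is an assumption added for safety.

```python
import math

MU_MIN, MU_MAX = 1.0, 4.0   # amplitude range stated in the text
OMEGA_CAP = 8.0 * math.pi   # fixed value Omega from the text


def u(z_norm):
    """Linear map of ||z|| in [0, 1] to [0.2, 1.0]."""
    z_norm = min(max(z_norm, 0.0), 1.0)  # clamping is an assumption
    return 0.2 + 0.8 * z_norm


def max_frequency(z):
    """omega_m = u(||z||) * Omega for a 2-D skill vector z."""
    return u(math.hypot(z[0], z[1])) * OMEGA_CAP


def f_mu(mu):
    """Map mu in [-1, 1] to [MU_MIN, MU_MAX]."""
    return MU_MIN + (mu + 1.0) / 2.0 * (MU_MAX - MU_MIN)


def f_omega(omega, omega_m):
    """Map omega in [-1, 1] to [0, omega_m] (omega_min = 0)."""
    return (omega + 1.0) / 2.0 * omega_m
```

Because `u` never returns less than 0.2, `max_frequency` stays strictly positive even for a zero skill vector, which is the property the text relies on to keep the tripod phase oscillating.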

We use the following formula to solve the state of the differential equations:

\begin{aligned} r_{i}^{t}&=r_{i}^{t-1}+(\dot{r}_{i}^{t-1}+\dot{r}_{i}^{t})\frac{dt}{2}\\ v_{i}^{t}&=v_{i}^{t-1}+(\dot{v}_{i}^{t-1}+\dot{v}_{i}^{t})\frac{dt}{2}\\ \theta_{i}^{t}&=\theta_{i}^{t-1}+(\dot{\theta}_{i}^{t-1}+\dot{\theta}_{i}^{t})\frac{dt}{2}\\ \alpha_{i}^{t}&=\alpha_{i}^{t-1}+(\dot{\alpha}_{i}^{t-1}+\dot{\alpha}_{i}^{t})\frac{dt}{2}\end{aligned},

(3)

where $dt=0.005$ s.
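To make Eqs. (1)-(3) concrete, the following Python sketch integrates the six coupled oscillators. It is a minimal illustration under stated assumptions: the value of the convergence factor `A` is not given in the text and is chosen arbitrarily, and because Eq. (3) uses the derivative at the new time step, the sketch substitutes Heun's method (an explicit Euler predictor followed by a trapezoidal corrector) rather than claiming to reproduce the authors' solver. All function names are hypothetical.

```python
import math

N_LEGS = 6
DT = 0.005   # integration step from the text, in seconds
A = 20.0     # convergence factor a; this value is an assumption

# Phase-bias matrix of Eq. (2): +pi from even-indexed to odd-indexed legs,
# -pi in the other direction, and 0 between legs of the same group.
PSI = [[0.0 if (i - j) % 2 == 0 else (math.pi if i % 2 == 0 else -math.pi)
        for j in range(N_LEGS)] for i in range(N_LEGS)]


def derivatives(state, amp_target, freq, omega_m):
    """Right-hand side of Eq. (1); amp_target = f(mu), freq = f(omega)."""
    r, v, theta, alpha = state
    dr = list(v)
    dv = [A * A / 4.0 * (amp_target[i] - r[i]) - A * v[i]
          for i in range(N_LEGS)]
    dtheta = list(freq)
    dalpha = [0.5 * omega_m
              + sum(r[j] * math.sin(alpha[j] - alpha[i] - PSI[i][j])
                    for j in range(N_LEGS))      # m_ij = 1
              for i in range(N_LEGS)]
    return [dr, dv, dtheta, dalpha]


def _advance(state, rates, h):
    """Return state + h * rates, component by component."""
    return [[x + h * dx for x, dx in zip(xs, dxs)]
            for xs, dxs in zip(state, rates)]


def heun_step(state, amp_target, freq, omega_m, dt=DT):
    """One predictor-corrector step approximating the rule in Eq. (3)."""
    k1 = derivatives(state, amp_target, freq, omega_m)
    predicted = _advance(state, k1, dt)            # explicit Euler predictor
    k2 = derivatives(predicted, amp_target, freq, omega_m)
    mean = [[(a + b) / 2.0 for a, b in zip(d1, d2)]
            for d1, d2 in zip(k1, k2)]
    return _advance(state, mean, dt)               # trapezoidal corrector
```

A state is a list `[r, v, theta, alpha]` of four length-6 lists. Starting from phases that are only roughly in tripod formation and calling `heun_step` repeatedly with `freq` set to zero, the coupling term should pull neighbouring groups towards a phase difference of $\pi$ while $\alpha$ keeps advancing at $\omega_{m}/2$.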

3.2 Pattern formation layer

To reshape the rhythm signal, we use the desired pose solver to realize the PF layer function of CPGs. After the oscillator generates $r_{i},\varphi_{i}$ , we calculate the desired pose of the end of each leg and then obtain the position under Cartesian Coordinates of the end of the leg, then convert it into the desired motor angles through the inverse kinematics, to generate the control signal of the motors. The end position of each leg is calculated as follows:

\begin{aligned} p_{x_{i}}&=-l(r_{i}-1)\cos(\varphi_{i})\\ p_{y_{i}}&=Lw_{y}\\ p_{z_{i}}&=\begin{cases}-h+g_{c}\sin(\varphi_{i}),&\text{if}\ \sin(\varphi_{i})>0\\ -h+g_{p}\sin(\varphi_{i}),&\text{otherwise}\end{cases}\end{aligned},

(4)

where $\varphi_{i}$ is the mixed phase, and $\{p_{x_{i}},p_{y_{i}},p_{z_{i}}\}$ is the position of the leg end in the leg’s local Cartesian coordinates. $l$ is the step length, $L=l_{1}+l_{2}$, where $l_{1}$, $l_{2}$, $l_{3}$ are the lengths of the robot’s coxa, femur and tibia links, $w_{y}$ is the width adjustment variable, $h$ is the height of the robot, $g_{c}$ is the maximum ground clearance in the swing process, and $g_{p}$ is the maximum ground penetration in the support process. These parameters constitute the CPG morphological parameter set $\bm{m}_{p}=\{l,h,g_{p},g_{c},w_{y}\}$ (Fig. 1b shows the relationship between the above parameters and the gait). The high-level controller can directly adjust the shape of the robot by providing the deviation value $\delta\bm{m}_{p}=\{\delta l,\delta h,\delta g_{p},\delta g_{c},\delta w_{y}\}\in[-1,1]$. The adjustment process is as follows:

\begin{aligned} l&\leftarrow l+g(\delta l)\\ h&\leftarrow h+g(\delta h)\\ g_{p}&\leftarrow g_{p}+g(\delta g_{p})\\ g_{c}&\leftarrow g_{c}+g(\delta g_{c})\\ w_{y}&\leftarrow w_{y}+g(\delta w_{y})\\ \end{aligned},

(5)

where $g(\delta x)=0.02(x_{max}-x_{min})\delta x$ scales the deviation by the range of the corresponding parameter, which is given in Table 1. When $l<0$, the foot trajectory rotates clockwise, and the robot moves backward. Conversely, when $l>0$, the robot moves forward.

Table 1: Range of morphological parameters

$l$	$h$	$g_{p}$	$g_{c}$	$w_{y}$
$[-0.12,0.12]$	$[0.02,0.08]$	$[0.03,0.06]$	$[0.04,0.08]$	$[1.1,1.5]$
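The pose solver of Eq. (4) and the parameter update of Eq. (5) can be sketched together in Python, using the ranges from Table 1. This is an illustrative sketch only: the function names and the `link_sum` argument (standing for $L$) are hypothetical, and clipping the updated parameters to their Table 1 range is an assumption, since the text does not state how the bounds are enforced.

```python
import math

# Parameter ranges from Table 1.
RANGES = {
    "l": (-0.12, 0.12),
    "h": (0.02, 0.08),
    "g_p": (0.03, 0.06),
    "g_c": (0.04, 0.08),
    "w_y": (1.1, 1.5),
}


def adjust(params, delta):
    """Eq. (5): shift each parameter by 2% of its range per unit deviation."""
    updated = {}
    for name, (low, high) in RANGES.items():
        d = min(max(delta[name], -1.0), 1.0)     # deviations lie in [-1, 1]
        value = params[name] + 0.02 * (high - low) * d
        updated[name] = min(max(value, low), high)  # clipping is an assumption
    return updated


def foot_position(r, phi, params, link_sum):
    """Eq. (4): leg-end position in the leg's local frame; link_sum is L."""
    s = math.sin(phi)
    px = -params["l"] * (r - 1.0) * math.cos(phi)
    py = link_sum * params["w_y"]
    # Swing phase uses the ground clearance, stance phase the penetration.
    gain = params["g_c"] if s > 0 else params["g_p"]
    pz = -params["h"] + gain * s
    return px, py, pz
```

With this update rule a single high-level action moves a parameter by at most 2% of its range, so sweeping a parameter from one bound to the other takes on the order of 50 consecutive decisions in the same direction.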

After obtaining the position of the leg end, we calculate the desired angles of the $J_{0}$, $J_{1}$ and $J_{2}$ joints of each leg through the inverse kinematics model (see Supplementary Section 11 for the calculation process). The PID controller inside the robot then drives the motors to the specified angles.

3.3 Skill learning of the mid-level controller

The mid-level reinforcement learning control policy can combine with the CPG module to form many coordinated motor skills. To achieve this, we use the parameterized neural network $\pi$ as the mid-level policy, which outputs $\bm{a}=[\bm{\mu},\bm{\omega}]\in\mathbb{R}^{12}$ to adjust the internal amplitude and frequency of the oscillation, i.e. $\bm{a}_{t}\sim\pi(\bm{a}_{t}|\bm{s}_{t},\bm{z})$, with a control frequency of 16.67 Hz. Its inputs are the skill vector $\bm{z}$ from the higher level and the robot’s proprioception $\bm{s}_{t}$ (including the 18 joint angles of the legs, the rotational quaternion, angular velocities and linear accelerations measured by the inertial measurement unit (IMU), as well as the morphological parameters and maximum oscillation frequency of the CPG module).

Combining the unsupervised reinforcement learning method of work [32] and the CPG module, we propose a new pre-training method, which enables the mid-level policy to explore motor skills with excellent dynamic performance in the high degree of freedom system. The training process is shown in Fig. 3a. The mid-level policy, CPG module, and robot environment form a closed loop. The CPG module provides the control signal, while the robot environment feeds back the state $\bm{s}_{t}$ and motion reward $r_{t}^{m}$ to the mid-level policy. During the learning process, the skill vector $\bm{z}$ is randomly sampled from within the unit circle using polar coordinates, while the morphological parameters are uniformly sampled within the ranges in Table 1. By designing the parameterized state representation function $\phi$, we can map the homologous state $\bm{s}^{d}_{t}$ of the robot to the latent space and align it with the skills, then feed back the skill rewards $r^{lsd}_{t}$ to the mid-level policy. By maximizing the sum of rewards through the reinforcement learning algorithm SAC^[50], we can learn a policy $\pi$ that applies to different morphological parameter sets. The algorithm maximizes the difference between the initial state and the final state under a given skill, so that the policy produces different trajectories under different skills. Its optimization objective is as follows:

\begin{aligned} J_{1}(\pi)&=\mathbb{E}_{\bm{z},\tau}\left[(\phi(\bm{s}^{d}_{T})-\phi(\bm{s}^{d}_{0}))^{\top}\bm{z}\right]\\ &=\mathbb{E}_{\bm{z},\tau}\left[\sum_{t=0}^{T-1}(\phi(\bm{s}^{d}_{t+1})-\phi(\bm{s}^{d}_{t}))^{\top}\bm{z}\right]\\ \text{s.t.}\quad&\forall x,y\in\mathcal{S}^{d}\quad\|\phi(x)-\phi(y)\|\leq\|x-y\|\end{aligned},

(6)

where $\bm{z}$ is the skill vector, $\tau$ is the trajectory generated by the policy under the effect of the skill, and $\bm{s}^{d}_{t}$ and $\bm{s}_{t}$ are generated at the same time. However, unlike $\bm{s}_{t}$, $\bm{s}^{d}_{t}$ mainly contains the robot’s XYZ coordinates, IMU information (roll, pitch, yaw angles, linear velocities, and angular velocities) and the oscillator’s internal state $[\bm{r},\dot{\bm{r}},\bm{\theta},\dot{\bm{\theta}},\bm{\alpha},\dot{\bm{\alpha}}]$. The state representation function $\phi$ is also a learnable neural network, responsible for mapping this information into the skill space and aligning skills with it. This asymmetric state structure makes the policy $\pi$ respond to the skill vectors associated with different internal states of the oscillator while considering the proprioception. To keep $\phi(\bm{s}^{d}_{t})$ bounded, a 1-Lipschitz constraint is imposed on $\phi$. On the other hand, the robot must keep its body stable while moving, so its vertical speed should be reduced. Therefore, another optimization objective is as follows:

J_{2}(\pi)=\mathbb{E}_{\bm{z},\tau}[\sum_{t=0}^{T-1}-w_{z}v_{z}^{2}],

(7)

where $w_{z}=0.1$ is the speed penalty weight, and $v_{z}$ is the speed of the robot in the z-axis direction. Thus, the reward for each transition ( $\bm{s}_{t},\bm{a}_{t},r_{t},\bm{s}_{t+1}$ ) can be written as follows:

r_{t}=r^{lsd}_{t}+r^{m}_{t}=(\phi(\bm{s}^{d}_{t+1})-\phi(\bm{s}^{d}_{t}))^{\top}\bm{z}+(-w_{z}v_{z}^{2}).

(8)

Considering the discount factor $\gamma$ , the overall optimization objective of the mid-level policy is as follows:

J(\pi)=\mathbb{E}_{\bm{z},\tau}[\sum^{T-1}_{t=0}\gamma^{t}r_{t}].

(9)

We use SAC to optimize $\pi$ and Stochastic Gradient Descent (SGD) to optimize $\phi$ (see Supplementary Section 5 for details of the algorithm and pseudo code).
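The per-step reward of Eq. (8) is simple enough to write out directly. The Python sketch below assumes the latent states $\phi(\bm{s}^{d}_{t})$ have already been computed by some representation network and are passed in as plain sequences; the function names are hypothetical and the network itself is not modelled.

```python
W_Z = 0.1  # vertical-speed penalty weight from the text


def skill_reward(phi_next, phi_curr, z):
    """r^lsd_t: change of the latent state projected onto the skill vector."""
    return sum((a - b) * c for a, b, c in zip(phi_next, phi_curr, z))


def step_reward(phi_next, phi_curr, z, v_z, w_z=W_Z):
    """Eq. (8): skill reward plus the vertical-speed penalty."""
    return skill_reward(phi_next, phi_curr, z) - w_z * v_z ** 2


def episode_skill_return(latents, z):
    """Undiscounted sum of skill rewards along a list of latent states."""
    return sum(skill_reward(nxt, cur, z)
               for cur, nxt in zip(latents, latents[1:]))
```

Because the skill rewards telescope, `episode_skill_return` equals the projection of the total latent displacement $\phi(\bm{s}^{d}_{T})-\phi(\bm{s}^{d}_{0})$ onto $\bm{z}$, which is the equality between the two lines of Eq. (6).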

Instead of sampling skills from a Gaussian or von Mises distribution^{[32, 51]}, we uniformly sample the skill vector $\bm{z}$ from within the unit circle. Skills of different lengths then cover the whole space evenly, and the mid-level policy can learn rich motion patterns under different oscillation frequencies. Another advantage is that the skill space can be conveniently used as the abstract action space of the high-level policy. For this purpose, we design the following random skill generator:

\begin{aligned} R&\sim U(0,1)\\ \beta&\sim U(0,2\pi)\\ z_{x}&=\sqrt{R}\cos(\beta)\\ z_{y}&=\sqrt{R}\sin(\beta)\end{aligned}

(10)

Through the polar coordinate transformation, we obtain a vector $\bm{z}=(z_{x},z_{y})$ that is uniformly distributed within the unit circle (see Supplementary Section 7 for the proof of uniform sampling).
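The skill generator of Eq. (10) translates directly into Python. The sketch below uses only the standard library; the function name and the optional `rng` argument are hypothetical conveniences, not part of the original method.

```python
import math
import random


def sample_skill(rng=random):
    """Draw z = (z_x, z_y) uniformly from within the unit circle, Eq. (10)."""
    radius_sq = rng.random()                 # R ~ U(0, 1)
    beta = rng.uniform(0.0, 2.0 * math.pi)   # beta ~ U(0, 2*pi)
    radius = math.sqrt(radius_sq)            # sqrt makes the area density flat
    return radius * math.cos(beta), radius * math.sin(beta)
```

The square root is what makes the sampling uniform over area: the probability that $\|\bm{z}\|\leq\rho$ is $P(R\leq\rho^{2})=\rho^{2}$, which matches the fraction of the unit disk's area inside radius $\rho$. Sampling the radius directly from $U(0,1)$ would instead concentrate skills near the origin.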

3.4 Multi-task reinforcement learning of the high-level controller

To give the high-level reinforcement learning control policy the ability of autonomous decision-making, we propose a two-stage multi-task reinforcement learning method. The first stage is carried out in the multi-task simulation environment (Fig. 4b). We use the parameterized neural network $\eta$ as the high-level policy. It accepts the robot’s proprioception $\bm{s}_{t}$ and the environmental information $\bm{s}^{e}_{t}$ (including the height sampling points centered on the robot and the heading directions from the robot to the two target points), and generates the high-level decision action $\bm{a}^{h}=[\delta\bm{m}_{p},\bm{z}]\in\mathbb{R}^{7}$, i.e. $\bm{a}^{h}_{t}\sim\eta(\bm{a}^{h}_{t}|\bm{s}_{t},\bm{s}^{e}_{t})$. Using the learned skills to control the robot’s movement, we obtain the environmental reward $r^{h}_{t}$. Due to the temporal abstraction of the hierarchical structure, the action execution frequency of the high-level policy (1.67 Hz) is only $\frac{1}{10}$ of that of the mid-level policy, which saves computational resources and improves efficiency.
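The three time scales mentioned so far can be related by a short calculation. The Python sketch below assumes that the oscillator is advanced once per tick with the integration step $dt=0.005$ s and that the mid-level period is 0.06 s (about 1/16.67 Hz); the scheduling scheme and all names are hypothetical and only count how often each level would act.

```python
DT = 0.005          # oscillator integration step from the text (s)
MID_PERIOD = 0.06   # about 1 / 16.67 Hz, the mid-level control period (s)
HIGH_EVERY = 10     # the high level acts at 1/10 of the mid-level rate

TICKS_PER_MID = round(MID_PERIOD / DT)  # oscillator steps per mid-level action


def count_calls(n_ticks):
    """Count how often each level acts during n_ticks oscillator steps."""
    calls = {"cpg": 0, "mid": 0, "high": 0}
    for tick in range(n_ticks):
        if tick % TICKS_PER_MID == 0:
            if calls["mid"] % HIGH_EVERY == 0:
                calls["high"] += 1   # eta would pick a new (delta m_p, z)
            calls["mid"] += 1        # pi would pick a new (mu, omega)
        calls["cpg"] += 1            # one oscillator integration step
    return calls
```

Under these assumptions each mid-level action spans 12 oscillator steps and each high-level decision spans 10 mid-level actions, i.e. 0.6 s, which is consistent with the stated 1.67 Hz.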

The learning process is shown in Fig. 4a, phase 1. After learning the mid-level policy, we fix it and only learn the high-level policy. Each time the high-level policy interacts with the environment, it receives a reward $r^{h}_{t}$. The model-free reinforcement learning algorithm SAC maximizes the total reward of each task, and the learning process can be completed in a short time (about $\frac{1}{3}$ of the learning time of the mid-level policy). The overall optimization objective of $\eta$ can be written as:

J(\eta)=\mathbb{E}_{\tau}[\sum^{T-1}_{t=0}\gamma^{t}r^{h}_{t}],

(11)

where $\tau$ is the trajectory generated in each episode. Due to the temporal abstraction, fewer transitions are collected during high-level policy training. To address the resulting low sample efficiency, we use the step-conditioned critic SAC structure^[52] (see Supplementary Section 8 for details of the algorithm and pseudo code). The reward function takes the following form:

r^{h}=d_{l}\times(w_{v}\times r_{v}+w_{d}\times r_{d}+w_{b}\times r_{b}+w_{s}\times r_{s}+w_{T}\times r_{T}),

(12)

where $d_{l}$ is the difficulty level ( $d_{l}=1,2,3,4,5$ ). The higher the difficulty level, the larger the reward. To enable the robot to adapt to obstacles quickly, we adopt a dynamic curriculum learning scheme and divide the obstacle environment into different difficulty levels. When the robot reaches the final goal, it is transferred to a more difficult environment. If the robot cannot achieve half of the goal, it is transferred to a simpler environment. The total reward includes five sub-rewards: speed tracking reward $r_{v}$, direction tracking reward $r_{d}$, balance reward $r_{b}$, collision reward $r_{s}$ and completion time reward $r_{T}$. The weight set of the sub-rewards is $\{w_{v},w_{d},w_{b},w_{s},w_{T}\}$.
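The level-switching rule described above can be sketched as a small Python function. The text does not define how "half of the goal" is measured, so the sketch assumes a hypothetical `progress` value in $[0,1]$ (for example the fraction of the distance to the final goal that was covered in the episode); the function name and the bounds handling are likewise assumptions.

```python
MIN_LEVEL, MAX_LEVEL = 1, 5   # difficulty levels d_l from the text


def next_level(level, progress):
    """Pick the difficulty level for the next episode.

    progress is assumed to be the fraction of the goal achieved (0 to 1).
    """
    if progress >= 1.0:                      # reached the final goal
        return min(level + 1, MAX_LEVEL)     # move to a harder environment
    if progress < 0.5:                       # did not achieve half of the goal
        return max(level - 1, MIN_LEVEL)     # move to a simpler environment
    return level                             # otherwise stay at this level
```

Since the reward in Eq. (12) is multiplied by $d_{l}$, promotion under this rule also scales up the reward the policy can collect, which gives it an incentive to keep succeeding at harder levels.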

To enable the robot to climb in any direction and overcome obstacles instead of moving around obstacles, we use the target points in the world coordinates to calculate the desired heading directions, which can be written as

\bm{d}=\frac{\bm{g}-\bm{x}}{\|\bm{g}-\bm{x}\|},

(13)

where $\bm{g}$ is the coordinates of the target points, and $\bm{x}$ is the coordinate of the robot in the world coordinates. The following speed tracking reward $r_{v}$ encourages the robot to move towards the target points

r_{v}=\min(\langle\bm{d},\bm{v}\rangle,v_{max}),

(14)

where $\bm{v}$ is the velocity of the robot in the world coordinates and $v_{max}$ is the maximum running speed of the robot, set here to 0.3 m/s.

The following direction tracking reward $r_{d}$ prompts the robot to quickly adjust its motion direction

r_{d}=\exp(-|\bm{d}-\bm{d}_{c}|),

(15)

where $\bm{d}_{c}$ is the current heading directions vector of the robot in the world coordinates.

The following balance reward $r_{b}$ penalizes the robot’s vertical movement to prevent it from shaking up and down while moving

r_{b}=-v_{z}^{2},

(16)

where $v_{z}$ is the speed of the robot in the vertical direction.

The following collision reward $r_{s}$ reduces the number of collisions between the foot and the obstacle during robot movement

r_{s}=\begin{cases}-1,&\text{if}\ |f_{xy}|>4|f_{z}|\\ 0,&\text{otherwise}\end{cases},

(17)

where $f_{xy}$ is the resultant force in the XY direction and $f_{z}$ is the force in the Z direction.

The following time reward $r_{T}$ urges the robot to complete the task quickly

r_{T}=-0.1.

(18)

The weight set differs across the four obstacle environments to improve the learning effect. For stair climbing, we focus on the robot’s ability to climb quickly, so the weight of the balance reward is reduced, and the reward weight set is $\{w_{v}=1.0,w_{d}=1.0,w_{b}=0.5,w_{s}=1.0,w_{T}=1.0\}$. For crossing the gap, we focus more on balance, and the reward weight set is $\{w_{v}=1.0,w_{d}=1.0,w_{b}=2.0,w_{s}=1.0,w_{T}=1.0\}$. For crossing the narrow alley, the robot should move quickly and avoid hitting the walls on both sides, so the weights of the speed and collision rewards are increased, giving $\{w_{v}=2.0,w_{d}=1.0,w_{b}=2.0,w_{s}=2.0,w_{T}=1.0\}$. When the robot climbs over the slope, it needs to adjust its heading direction quickly to avoid falling, so the direction tracking reward has a higher weight, giving $\{w_{v}=1.0,w_{d}=1.5,w_{b}=1.0,w_{s}=1.0,w_{T}=1.0\}$.
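Putting Eqs. (12)-(18) and the weight sets above together, the high-level reward can be sketched in Python as follows. Several details are assumptions made for the sake of a runnable example: vectors are treated as planar (x, y) pairs, $|\bm{d}-\bm{d}_{c}|$ is read as a Euclidean norm, a zero heading is returned when the robot is exactly at the target, and the task keys and function names are hypothetical.

```python
import math

V_MAX = 0.3  # maximum running speed from the text (m/s)

# Per-task weights (w_v, w_d, w_b, w_s, w_T) as listed in the text.
WEIGHTS = {
    "stairs": {"v": 1.0, "d": 1.0, "b": 0.5, "s": 1.0, "T": 1.0},
    "gap":    {"v": 1.0, "d": 1.0, "b": 2.0, "s": 1.0, "T": 1.0},
    "alley":  {"v": 2.0, "d": 1.0, "b": 2.0, "s": 2.0, "T": 1.0},
    "slope":  {"v": 1.0, "d": 1.5, "b": 1.0, "s": 1.0, "T": 1.0},
}


def desired_heading(goal, pos):
    """Eq. (13): unit vector from the robot to the target point."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)  # assumption: no preferred heading at the target
    return (dx / norm, dy / norm)


def high_level_reward(task, level, goal, pos, vel, heading, v_z, f_xy, f_z):
    """Eq. (12) with the sub-rewards of Eqs. (14)-(18)."""
    d = desired_heading(goal, pos)
    r_v = min(d[0] * vel[0] + d[1] * vel[1], V_MAX)       # Eq. (14)
    r_d = math.exp(-math.dist(d, heading))                # Eq. (15)
    r_b = -v_z ** 2                                       # Eq. (16)
    r_s = -1.0 if abs(f_xy) > 4.0 * abs(f_z) else 0.0     # Eq. (17)
    r_t = -0.1                                            # Eq. (18)
    w = WEIGHTS[task]
    total = (w["v"] * r_v + w["d"] * r_d + w["b"] * r_b
             + w["s"] * r_s + w["T"] * r_t)
    return level * total
```

As a worked case, a robot in the gap task at level 2 that heads straight for the target at 0.2 m/s with no vertical motion and no collision receives $2\times(0.2+1.0+0-0-0.1)=2.2$.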

3.5 Distillation learning of the high-level controller

In the first stage of the learning process, the robot uses the surrounding height field and the target heading directions as the external environmental information in the simulation environment, which accelerates the learning and sampling process. However, the real robot must perceive the external environment through the camera. Therefore, in the second stage, distillation learning is used in the multi-task visual simulation environment, and the multiple learned high-level policies are distilled into a student policy so that it can independently infer the environmental features and target heading directions from the depth image and then give appropriate high-level decision instructions. We use teacher-student distillation learning and collect virtual image information for training in the simulation.

The distillation learning process is shown in Fig. 4a, phase 2. We use multiple high-level neural networks for different tasks as the teacher policies and a neural network that can receive image input as the student policy, which learns in the most difficult obstacle environment (see Supplementary Section 9 for the structure of the student network). The student policy needs to predict the height information near the robot and the target heading directions vector $\hat{\bm{d}}$ from the depth image and generate the predicted action $\hat{\bm{a}}^{h}_{t}$ from the predicted information. If the heading directions vector predicted by the student policy is inaccurate initially, directly using the predicted vector as the state may lead to catastrophic distribution drift. To deal with this problem, we follow ref [38] and use DAgger^[53] to train the student policy. Specifically, we use the mixed heading directions of the teacher and student as the target heading directions. The heading directions are calculated as follows:

\bm{s}^{e}_{\theta}=\begin{cases}\hat{\bm{d}}, & \text{if } |\bm{d}-\hat{\bm{d}}|<0.6,\\ \bm{d}, & \text{otherwise},\end{cases}

(19)

where $\hat{\bm{d}}$ is the heading directions vector predicted by the student policy, and $\bm{d}$ is the actual heading directions vector computed in simulation. The predicted heading directions are adopted when the difference between the two is within the allowable range of 0.6; otherwise, the actual heading directions are adopted. $\bm{s}^{e}_{\theta}$ is the heading directions vector included in the observation of the student policy.
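The selection rule in Eq. (19) can be sketched in a few lines. This is a minimal illustration, assuming that $|\cdot|$ denotes the Euclidean norm of the difference vector; the function name `mixed_heading` and the constant name are hypothetical.

```python
import math

HEADING_TOLERANCE = 0.6  # threshold taken from Eq. (19)


def mixed_heading(d_true, d_pred, tol=HEADING_TOLERANCE):
    """Select the heading vector fed to the student policy.

    Returns the student's prediction when it lies within `tol` of the
    actual heading, and falls back to the actual heading otherwise.
    Assumption: the distance is the Euclidean norm of the difference.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(d_true, d_pred)))
    return list(d_pred) if diff < tol else list(d_true)
```

Under this reading, an accurate student is gradually exposed to its own predictions, while a badly wrong prediction is replaced by the actual heading so that the state distribution stays close to that seen by the teachers.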

During distillation, supervised learning is used to train the student policy. The loss function is as follows:

\mathrm{loss}=\frac{1}{N}\sum_{i=1}^{N}\left(\|\hat{\bm{d}}-\bm{d}\|+\|\hat{\bm{a}}^{h}-\bm{a}^{h}\|\right),

(20)

where $\hat{\bm{a}}^{h}$ is the high-level action predicted by the student policy, and $\bm{a}^{h}$ is the guiding action generated by the teacher policies. During training, we add random noise to the depth images obtained in the simulation environment to make the policy more robust to image input in the physical environment.
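The loss in Eq. (20) and the depth-image noise augmentation can be sketched as follows. This is a plain-Python illustration rather than the training code: it assumes Euclidean norms, treats $N$ as the number of samples in a batch, and uses zero-mean Gaussian pixel noise with a hypothetical standard deviation `sigma`, since the text does not specify the noise model. All function names are hypothetical.

```python
import math
import random


def l2_distance(u, v):
    """Euclidean norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distillation_loss(batch):
    """Mean of heading error plus action error over a batch, as in Eq. (20).

    `batch` is a list of tuples (d_pred, d_true, a_pred, a_true).
    """
    total = sum(
        l2_distance(d_pred, d_true) + l2_distance(a_pred, a_true)
        for d_pred, d_true, a_pred, a_true in batch
    )
    return total / len(batch)


def add_depth_noise(depth, sigma=0.01, rng=random):
    """Return a noisy copy of a depth image given as a list of rows.

    Assumption: zero-mean Gaussian noise per pixel; the noise model and
    the value of `sigma` are illustrative choices, not from the paper.
    """
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in depth]
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which is convenient when comparing training runs.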

4 Acknowledgements

Parts of Fig. 1a were drawn using pictures from Servier Medical Art. Servier Medical Art by Servier is licensed under a Creative Commons Attribution 3.0 Unported License (https://creativecommons.org/licenses/by/3.0/).

References

[1] Merel, J., Botvinick, M. & Wayne, G. Hierarchical motor control in mammals and machines. Nature Communications 10, 1–12 (2019).
[2] Bear, M., Connors, B. & Paradiso, M. A. Neuroscience: Exploring the Brain (Wolters Kluwer, Philadelphia, 2016).
[3] Ijspeert, A. J. Biorobotics: Using robots to emulate and investigate agile locomotion. Science 346, 196–203 (2014).
[4] Ramdya, P. & Ijspeert, A. J. The neuromechanics of animal locomotion: From biology to robotics and back. Science Robotics 8, eadg0279 (2023).
[5] Merel, J. et al. Neural probabilistic motor primitives for humanoid control (2019). Paper presented at the 7th International Conference on Learning Representations, Ernest N. Morial Convention Center, New Orleans, 6–9 May 2019.
[6] Yuan, K., Sajid, N., Friston, K. & Li, Z. Hierarchical generative modelling for autonomous robots. Nature Machine Intelligence 5, 1402–1414 (2023).
[7] Won, J., Gopinath, D. & Hodgins, J. Control strategies for physically simulated characters performing two-player competitive sports. ACM Transactions on Graphics (TOG) 40, 1–11 (2021).
[8] Peng, X. B., Berseth, G., Yin, K. & Van De Panne, M. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 1–13 (2017).
[9] Peng, X. B., Guo, Y., Halper, L., Levine, S. & Fidler, S. ASE: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics (TOG) 41, 1–17 (2022).
[10] Eysenbach, B., Gupta, A., Ibarz, J. & Levine, S. Diversity is all you need: Learning skills without a reward function (2019). Paper presented at the 7th International Conference on Learning Representations, Ernest N. Morial Convention Center, New Orleans, 6–9 May 2019.
[11] Gehring, J., Synnaeve, G., Krause, A. & Usunier, N. Hierarchical skills for efficient exploration (2021). Paper presented at the 34th Advances in Neural Information Processing Systems, Virtual-only Conference, 6–14 December 2021.
[12] Boyle, J. H., Berri, S. & Cohen, N. Gait modulation in C. elegans: an integrated neuromechanical model. Frontiers in Computational Neuroscience 6, 10 (2012).
[13] Geyer, H. & Herr, H. A muscle-reflex model that encodes principles of legged mechanics produces human walking dynamics and muscle activities. IEEE Transactions on Neural Systems and Rehabilitation Engineering 18, 263–273 (2010).
[14] Roennau, A., Kerscher, T., Ziegenmeyer, M., Zöllner, J. M. & Dillmann, R. Adaptation of a six-legged walking robot to its local environment. In Kozłowski, K. R. (ed.) Robot Motion and Control 2009, 155–164 (Springer London, London, 2009).
[15] Paskarbeit, J., Schilling, M., Schmitz, J. & Schneider, A. Obstacle crossing of a real, compliant robot based on local evasion movements and averaging of stance heights using singular value decomposition (2015). Paper presented at the 2015 IEEE International Conference on Robotics and Automation, Washington State Convention Center in Seattle, Washington, 26–30 May 2015.
[16] Namiki, S., Dickinson, M. H., Wong, A. M., Korff, W. & Card, G. M. The functional organization of descending sensory-motor pathways in Drosophila. eLife 7, e34272 (2018).
[17] Grillner, S. & Wallen, P. Central pattern generators for locomotion, with special reference to vertebrates. Annual Review of Neuroscience 8, 233–261 (1985).
[18] Ijspeert, A. J., Crespi, A., Ryczko, D. & Cabelguen, J.-M. From swimming to walking with a salamander robot driven by a spinal cord model. Science 315, 1416–1420 (2007).
[19] Owaki, D., Kano, T., Nagasawa, K., Tero, A. & Ishiguro, A. Simple robot suggests physical interlimb communication is essential for quadruped walking. Journal of The Royal Society Interface 10, 20120669 (2013).
[20] Shafiee, M., Bellegarda, G. & Ijspeert, A. Viability leads to the emergence of gait transitions in learning agile quadrupedal locomotion on challenging terrains. Nature Communications 15, 3073 (2024).
[21] Bellegarda, G. & Ijspeert, A. CPG-RL: Learning central pattern generators for quadruped locomotion. IEEE Robotics and Automation Letters 7, 12547–12554 (2022).
[22] Bellegarda, G., Shafiee, M. & Ijspeert, A. Visual CPG-RL: Learning central pattern generators for visually-guided quadruped locomotion (2024). Paper presented at the 2024 IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024.
[23] Xiao-R.com. Phage ros slam hexapod bionic robot (2024). Product introduction https://www.xiao-r.com/Product/page/id/50.
[24] Graybiel, A. M. The basal ganglia. Current Biology 10, R509–R511 (2000).
[25] Groenewegen, H. J. et al. The basal ganglia and motor control. Neural Plasticity 10, 107–120 (2003).
[26] Kiehn, O. Locomotor circuits in the mammalian spinal cord. Annu. Rev. Neurosci. 29, 279–306 (2006).
[27] Kuypers, H. G. The descending pathways to the spinal cord, their anatomy and function. Progress in Brain Research 11, 178–202 (1964).
[28] Rybak, I. A., Shevtsova, N. A., Lafreniere-Roula, M. & McCrea, D. A. Modelling spinal circuitry involved in locomotor pattern generation: insights from deletions during fictive locomotion. The Journal of Physiology 577, 617–639 (2006).
[29] Kullander, K. et al. Role of EphA4 and ephrinB3 in local neuronal circuits that control walking. Science 299, 1889–1892 (2003).
[30] Glickstein, M. & Yeo, C. The cerebellum and motor learning. Journal of Cognitive Neuroscience 2, 69–80 (1990).
[31] Ramdya, P. et al. Climbing favours the tripod gait over alternative faster insect gaits. Nature Communications 8, 14494 (2017).
[32] Park, S., Choi, J., Kim, J., Lee, H. & Kim, G. Lipschitz-constrained unsupervised skill discovery (2022). Paper presented at the 10th International Conference on Learning Representations, Virtual-only Conference, 25–29 April 2022.
[33] Bongard, J., Zykov, V. & Lipson, H. Resilient machines through continuous self-modeling. Science 314, 1118–1121 (2006).
[34] Thor, M. & Manoonpong, P. Versatile modular neural locomotion control with fast learning. Nature Machine Intelligence 4, 169–179 (2022).
[35] Owaki, D., Horikiri, S.-y., Nishii, J. & Ishiguro, A. Tegotae-based control produces adaptive inter-and intra-limb coordination in bipedal walking. Frontiers in Neurorobotics 15, 629595 (2021).
[36] Thandiackal, R. et al. Emergence of robust self-organized undulatory swimming based on local hydrodynamic force sensing. Science Robotics 6, eabf6354 (2021).
[37] Shafiee, M., Bellegarda, G. & Ijspeert, A. Puppeteer and marionette: Learning anticipatory quadrupedal locomotion based on interactions of a central pattern generator and supraspinal drive (2023). Paper presented at the 2023 IEEE International Conference on Robotics and Automation, ExCeL, London, 29 May – 2 June 2023.
[38] Cheng, X., Shi, K., Agarwal, A. & Pathak, D. Extreme parkour with legged robots (2023). Paper presented at the 2023 Conference on Robot Learning, Atlanta, Georgia, 6–9 November 2023.
[39] Gams, A., Nemec, B., Ijspeert, A. J. & Ude, A. Coupling movement primitives: Interaction with the environment and bimanual tasks. IEEE Transactions on Robotics 30, 816–830 (2014).
[40] Kober, J. & Peters, J. Learning motor primitives for robotics (2009). Paper presented at the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009.
[41] Ude, A., Nemec, B., Petrić, T. & Morimoto, J. Orientation in Cartesian space dynamic movement primitives (2014). Paper presented at the 2014 IEEE International Conference on Robotics and Automation, Convention and Exhibition Center, Hong Kong, China, 31 May – 5 June 2014.
[42] Brohan, A. et al. Do as I can, not as I say: Grounding language in robotic affordances (2022). Paper presented at the 6th Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022.
[43] Driess, D. et al. PaLM-E: An embodied multimodal language model (2023). Paper presented at the 40th International Conference on Machine Learning, Honolulu, Hawaii, 23–29 July 2023.
[44] Shah, D., Osiński, B., Levine, S. et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action (2022). Paper presented at the 6th Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022.
[45] Goyal, A. et al. RVT: Robotic view transformer for 3D object manipulation (2023). Paper presented at the 7th Conference on Robot Learning, Atlanta, Georgia, 6–9 November 2023.
[46] Zhang, T. et al. Configuration synthesis of under-actuated resilient robotic systems (2014). Paper presented at the 2014 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Besançon, France, 7–11 July 2014.
[47] Kingsley, D. A., Quinn, R. D. & Ritzmann, R. E. A cockroach inspired robot with artificial muscles (2006). Paper presented at the 19th IEEE/RSJ International Conference on Intelligent Robots and Systems, International Convention Center, Beijing, 9–15 October 2006.
[48] Szczecinski, N. S., Goldsmith, C. A., Nourse, W. R. & Quinn, R. D. A perspective on the neuromorphic control of legged locomotion in past, present, and future insect-like robots. Neuromorphic Computing and Engineering 3, 023001 (2023).
[49] Liu, H., Jia, W. & Bi, L. Hopf oscillator based adaptive locomotion control for a bionic quadruped robot (2017). Paper presented at the 14th IEEE International Conference on Mechatronics and Automation, Takamatsu, Japan, 6–9 August 2017.
[50] Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor (2018). Paper presented at the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm, 10–15 July 2018.
[51] Hansen, S. et al. Fast task inference with variational intrinsic successor features (2020). Paper presented at the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
[52] Whitney, W. F., Agarwal, R., Cho, K. & Gupta, A. Dynamics-aware embeddings (2020). Paper presented at the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
[53] Ross, S., Gordon, G. & Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning (2011). Paper presented at the 14th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, USA, 11–13 April 2011.