
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

AKB-48: A Real-World Articulated Object Knowledge Base

Liu Liu Wenqiang Xu Haoyuan Fu Sucheng Qian Qiaojun Yu Yang Han Cewu Lu†
Shanghai Jiao Tong University
{liuliu1993, vinjohn, simon-fuhaoyuan, qiansucheng, yqjllxs, lucewu}@sjtu.edu.cn

[email protected]

Figure 1. AKB-48 consists of 2,037 articulated object models of 48 categories scanned from the real world. The objects are annotated with
ArtiKG, and can support a full task spectrum from computer vision to robotics manipulation.

Abstract

Human life is populated with articulated objects. A comprehensive understanding of articulated objects, covering appearance, structure, physical properties, and semantics, will benefit many research communities. Current articulated object understanding solutions are usually built on synthetic CAD-model datasets without physical properties, which prevents satisfactory generalization from simulation to real-world applications in vision and robotics tasks. To bridge the gap, we present AKB-48: a large-scale Articulated object Knowledge Base consisting of 2,037 real-world 3D articulated object models across 48 categories. Each object is described by a knowledge graph, ArtiKG. To build AKB-48, we present a Fast Articulation Knowledge Modeling (FArM) pipeline, which can fill in the ArtiKG for an articulated object within 10-15 minutes and largely reduces the cost of object modeling in the real world. Using our dataset, we propose AKBNet, an integral pipeline for the Category-level Visual Articulation Manipulation (C-VAM) task, in which we benchmark three sub-tasks, namely pose estimation, object reconstruction, and manipulation. Dataset, code, and models are publicly available at https://liuliu66.github.io/AKB-48.

† Cewu Lu is the corresponding author. He is a member of the Qing Yuan Research Institute and the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, and the Shanghai Qi Zhi Institute, China.

1. Introduction

Articulated objects, composed of more than one rigid part connected by joints that allow rotational or translational movement in 3D space, are pervasive in our daily life. Knowledge about articulated objects can benefit many research communities, such as computer vision, robotics, and embodied AI. Thus, many articulated object datasets have been proposed to facilitate this research, such as PartNet-Mobility [31], ReArt-48 [17], and RBO [20].



However, these datasets generally focus on structural information (e.g., part segmentation, kinematic structure) but pay less attention to appearance (e.g., texture, fine geometry), physical properties (e.g., per-part mass, inertia, material, and friction), and semantics (e.g., category, affordance). Since important tasks such as object detection (texture) [2], 3D reconstruction (fine geometry) [19], and object manipulation (physical properties) [5] rely heavily on this information, the lack of such object knowledge in these datasets can prevent learning models from generalizing well.

To boost research on articulated objects, in this paper we present AKB-48: a large-scale real-world Articulated object Knowledge Base covering 48 categories and 2,037 instances. For each instance, the object model is scanned from its real counterpart and refined manually (Sec. 3.2), and the object knowledge is organized into a graph, named the Articulation Knowledge Graph (ArtiKG), which contains detailed annotations of different kinds of object attributes and properties (Sec. 3.1). To make the scanning and annotation process feasible for a large dataset, we present a Fast Articulation Knowledge Modeling (FArM) pipeline (Sec. 3.3). In detail, we develop an object recording system with 3D sensors and turntables, a GUI that integrates structural and semantic annotation, and standard real-world experiments for physical property annotation (Fig. 3). In this way, we save a large amount of money and time when modeling real-world articulated objects (∼$3 to buy and 10-15 min to annotate per object). A thorough comparison between CAD modeling and reverse scanning is given in Sec. 3.2. In summary, our pipeline saves 33x on the money budget and 5x on the time budget.

To utilize AKB-48 for research, we propose AKBNet, an integral pipeline for the Category-level Visual Articulation Manipulation (C-VAM) task. To address the C-VAM problem, the vision system AKBNet should be able to estimate object pose, reconstruct object geometry, and learn a manipulation policy at the category level. Thus, it consists of three sub-modules:

• Pose Module for Category-level Articulated Object Pose Estimation. This module aims to estimate the per-part 6D pose of an unseen articulated object within a category. Prior research generally studies kinematic categories, i.e., objects of a category are defined to share the same kinematic structure. Our pose module extends the concept of "category" to semantic categories, in which a category is defined by semantics and different kinematic structures are allowed. (Sec. 4.1)

• Shape Module for Articulated Object Reconstruction. After the pose is obtained, along with a shape code encoded from the input images, we can reconstruct the shape of each part [25]. Full geometry is critical for manipulation to determine where to interact. (Sec. 4.2)

• Manipulation Module for Articulated Object Manipulation. Once we obtain the articulation information (e.g., part segments, per-part pose, joint properties, full mesh) through perception, we can learn the interaction policy over the observations. We benchmark manipulation with opening and pulling tasks, corresponding to revolute and prismatic joints respectively. (Sec. 4.3)

To evaluate AKBNet, we report results both individually and systematically. For individual evaluation of each module, we assume its input is the ground truth of the previous module, while for systematic evaluation, the input is the output of the previous module. Naturally, we cannot benchmark all the tasks that AKB-48 can support. We hope it can serve as a good platform for future articulation research in the computer vision and robotics communities.

Our contributions can be summarized as follows:

• We introduce AKB-48, containing 2,037 articulated models across 48 categories, with a multi-modal knowledge graph, ArtiKG, that organizes the rich annotations. It helps close the gap between current vision and embodied AI research. To the best of our knowledge, it is the first large-scale articulation dataset with rich annotations collected from the real world.

• We propose a fast articulated object knowledge modeling pipeline, FArM, which makes it much easier to collect articulated objects from the real world. Our pipeline greatly reduces the time and money cost of building real-world 3D model datasets.

• We propose an integral pipeline, AKBNet, for the category-level visual articulation manipulation (C-VAM) task. Experiments show our approach is effective both individually and systematically in the real world.

2. Related Work

3D Model Repositories and Datasets. An unavoidable challenge in analyzing 3D objects, especially articulated objects, is the lack of large-scale training data with sufficient 3D models and full annotations. To the best of our knowledge, current 3D model repositories prefer to collect CAD models by searching the Internet, such as Trimble 3D Warehouse and Onshape [14]. ShapeNet [4] collects approximately 3 million shapes from online model repositories and categorizes them based on the WordNet [22] taxonomy. Although ShapeNet contains many articulated categories, its models can only be considered rigid shapes since they do not define parts within them.

Figure 2. The Articulation Knowledge Graph (ArtiKG) defined in the AKB-48 dataset. In ArtiKG, we annotate four types of knowledge: Appearance, Structure, Semantics, and Physical Property. (The figure shows an example record: its UUID, category and taxonomy, multi-view appearance, scale, per-part segmentation, joint types with their locations, axes, and limits, and per-part mass, friction, and inertia; values are rounded in the figure for presentation.)

To deal with this problem, Mo et al. [24] present PartNet, a large-scale dataset that annotates hierarchical semantic part segmentation on a subset of ShapeNet [4]. One critical problem with PartNet is that it pays much attention to labeling each semantic part but ignores kinematic structure. To address this, PartNet-Mobility [31] and Shape2Motion [30] further annotate joint properties on the shapes, targeting articulation research.

These datasets mostly follow the model construction paradigm of ShapeNet: collecting CAD models from the Internet and providing task-specific annotations. This allowed the early works (ShapeNet [4], the ABC dataset [14], etc.) to quickly build large-scale object model bases. However, when a task requires new categories or kinematic structures, artists need to manually build proper CAD models from scratch, which is time-consuming and laborious. On the other hand, current real-world research focuses on instance-level tasks and therefore tends to build small-scale model datasets such as YCB [3] and RBO [20]. This limited data volume makes them hard to adopt for our category-level articulation tasks, which require generalization across different instances. In this paper, we present AKB-48 as the first large-scale real-world knowledge base for articulation analysis.

Articulation-related Tasks. Articulated objects have been investigated for decades in both the vision and robotics communities, but with different emphases. In vision, current works tend to solve category-level object recognition, segmentation, or pose estimation, focusing on generalization across objects. Yi et al. [32] take a pair of unsegmented shape representations as input to predict part segmentation and deformation. To tackle unseen objects, Li et al. [16] follow the pose estimation setting and propose a normalized coordinate space to estimate the 6D pose and joint state of articulated objects. For joint-centered perception, several works attempt to mine the joint configurations of articulated objects [11, 18, 33]. To investigate manipulation points for articulated objects from visual input, Mo et al. define six types of action primitives and predict interactions [23]. In the robotics community, researchers usually solve interaction or manipulation tasks to achieve articulation inference, such as robot interactive perception [12], feedback from visual observation [9], and task integration [21]. Besides, some works attempt to bridge the gap between vision and manipulation but still suffer from the small-scale issue. Therefore, we propose AKBNet to deal with category-level articulation tasks.

Figure 3. The task-specific model acquisition equipment. 1 is a rotating turntable for objects of multiple scales. 2 is a tracking marker. 3 is a light-absorbing item. 4 is a lift bracket. 5 is the Shining 3D scanner. 6-8 are the RealSense L515 cameras for capturing multi-views of objects.

3. Articulation Knowledge Base, AKB-48

When constructing the knowledge base, three immediate questions must be answered: (1) What kinds of knowledge should we annotate on the object? (2) What objects should we annotate, those from the real or the simulated world? (3) How do we annotate the object knowledge efficiently? To answer these questions, we describe ArtiKG in Sec. 3.1, discuss object selection in Sec. 3.2, propose the FArM pipeline in Sec. 3.3, and provide an analysis (diversity, difficulty) of the dataset in Sec. 3.4.

3.1. Articulated Object Knowledge Graph, ArtiKG

Different tasks require different kinds of object knowledge. To unify the annotation representation, we organize it into a multi-modal knowledge graph named ArtiKG. ArtiKG consists of four major parts, namely appearance, structure, physics, and semantics. The details are described in the following and visualized in Fig. 2.

Appearance. For each instance, we store its shape as a mesh along with its textures. When scanning the object from the real world, we also collect multi-view RGB-D snapshots of the object.

Structure. The key difference between an articulated object and a rigid object is the kinematic structure. An articulated object has concepts like joints and parts, which are not meaningful for a rigid object. For each joint, we annotate the joint type, parameters, and movement limits. We also segment each kinematic part.

Semantics. After the basic geometric and structural information is annotated, we assign semantic information to the object in a coarse-to-fine process. We give a UUID to each instance, then assign the category and the corresponding taxonomy to the object according to WordNet [22]. We also label the semantic parts. Though we already annotate the kinematic parts, they are not quite the same as the semantic parts. Take a mug with a handle, for example: the handle is not attached to the mug body through a joint, so it is not a kinematic part, but it is a semantic part, as it indicates where a human normally grabs the mug.

Physical property. Real objects exist in the physical world and typically have physical properties, which are important for accurate simulation and for real-world manipulation and interaction with articulated objects. Thus, we store physical attribute annotations for our models, covering per-part mass, per-part inertia, material, and surface friction.

Discussion. In this section, we only describe the object knowledge that takes human effort to annotate; knowledge that can be computed by algorithms or trivially inferred, such as surface normals, collision/simplified meshes, and intrinsic dimensions, is not discussed. Besides, as the annotation information is organized modularly, new attributes can conveniently be added to ArtiKG. Moreover, though ArtiKG is designed for articulated objects, it can also be trivially extended to rigid and flexible objects.
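To make the modular organization concrete, the sketch below shows how a single ArtiKG record could be laid out as nested key-value groups, mirroring Fig. 2. The field names, file paths, and example values are illustrative assumptions for presentation, not the released AKB-48 schema.

```python
# Illustrative sketch of one ArtiKG record; field names, paths, and the
# pairing of values with parts are assumptions, not the released schema.
artikg_record = {
    "uuid": "daea5652-66bc-4001-8df3-40cee0051db5",
    "semantics": {
        "category": "box",
        "taxonomy": ["container", "box", "pencil box"],   # WordNet-style path
        "semantic_parts": ["lid", "body"],
    },
    "appearance": {
        "mesh": "meshes/daea5652.obj",                     # textured scan
        "multiview_rgbd": "snapshots/daea5652/",
        "scale": {"H": 0.08, "W": 0.16, "L": 0.16},        # meters
    },
    "structure": {
        "kinematic_parts": ["lid", "body"],
        "joints": [{
            "type": "revolute",
            "parent": "body",
            "child": "lid",
            "location": [0.0, -0.22, 0.08],                # joint origin (m)
            "axis": [-1.0, 0.01, -0.02],                   # joint axis
            "limit": [0.0, 1.71],                          # radians
        }],
    },
    "physics": {
        "lid":  {"mass_g": 118.23, "friction": 0.15, "inertia": [0.22]},
        "body": {"mass_g": 78.52,  "friction": 0.15, "inertia": [0.19]},
    },
}
```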
3.2. Object Selection: Real-world Scanning vs. CAD Modeling

The choice between real-world scanning and CAD modeling is considered from two perspectives: annotation accuracy, and cost in time and money.

Annotation Accuracy. Judging from the content of ArtiKG, objects from the real world have multiple advantages over CAD models, such as appearance and physical properties. Admittedly, a CAD model can capture inner structures, as in a Gundam or a Transformer toy, while scanning techniques focus more on the surface. Since such objects with inner structures that cannot be easily disassembled pose challenges for both artists and scanners, we would like to add these objects when the techniques are more mature. Fortunately, most daily objects can be disassembled, so scanning techniques handle them properly.

Cost in Time and Money. As discussed earlier, the ShapeNet-like model collection paradigm is limited by the large time and money cost of artists' manual CAD modeling when investigating new categories or kinematic structures. On the other hand, many daily articulated objects are cheap in reality and can be scanned by a layperson. We compare the average money and time budgets in Table 1. The cost of CAD modeling is estimated from outsourcing services on the Taobao website¹. From our survey, most artists spend more than 2 hours (over 120 minutes) to model an articulated object, and the labor cost averages over 100 dollars per object.

             | CAD modeling | Real-world Scanning
Time (min)   | >120         | 20
Money ($)    | >100         | 3

Table 1. Budget comparison between our real-world scanning and CAD modeling for articulated objects.

To note, we are aware that many important articulated objects in the real world are rather expensive, such as laptops, microwave ovens, and doors. In such cases, we either collect only the ones we can gather from homes without re-buying, or buy one to measure the basic information and propagate it to existing simulated objects as in PartNet-Mobility [31]. For these objects, the ArtiKG is labeled ArtiKG-sim.

¹ https://www.taobao.com

Dataset               | Num   | Appearance     | Structure    | Semantics | Physics
                      |       | AV    | AT     | Part | Joint | ST  | PS  | PM | PI | PF
Synthetic Model Datasets
ShapeNet [4]          | >50K  | <2K   | <5K    | -    | -     | ✓   | -   | -  | -  | -
PartNet [24]          | >20K  | <2K   | <5K    | ✓    | -     | ✓   | -   | -  | -  | -
Shape2Motion [30]     | 2K    | <0.5K | <1K    | ✓    | ✓     | -   | -   | -  | -  | -
PartNet-Mobility [31] | 2K    | <0.5K | <1K    | ✓    | ✓     | ✓   | ✓   | -  | -  | -
Real-World Model Datasets
YCB [3]               | 21    | ∼40K  | ∼90K   | -    | -     | -   | -   | -  | -  | -
LineMod [10]          | 15    | ∼19K  | ∼39K   | -    | -     | -   | -   | -  | -  | -
RBO [20]              | 14    | ∼5K   | ∼10K   | ✓    | ✓     | -   | -   | -  | -  | -
AKB-48 (Ours)         | 2,037 | ∼56K  | ∼110K  | ✓    | ✓     | ✓   | ✓   | ✓  | ✓  | ✓

Table 2. Comparison with other popular model datasets. Our AKB-48 dataset provides four types of information for the rich annotations in our ArtiKG: Appearance, Structure, Semantics, and Physics. AV: average number of vertices. AT: average number of triangles. ST: semantic taxonomy. PS: per-part semantic label. PM: per-part mass. PI: per-part inertia moment. PF: per-part friction.

3.3. Fast Articulation Knowledge Modeling (FArM) Pipeline

Once we determine what to annotate and which objects to annotate, the remaining problem is how to make the annotation process affordable.

3.3.1 Model Acquisition Equipment

To efficiently collect real-world articulated models, we set up a recording system whose configuration is illustrated in Fig. 3. The apparatus comprises three components: an EinScan Pro 2020 for scanning², Intel RealSense D435 cameras for multi-view RGB-D snapshots, and multi-scale rotating turntables with a lift bracket. In our setup, each object can be scanned within 5 minutes.

3.3.2 Articulation Modeling

After model acquisition, we use an articulated object modeling interface with a 3D GUI for annotation guidance. Specifically, our modeling workflow splits the whole process into three sub-processes:

Object Alignment. This process requires the annotator to align the scanned articulated object from camera space into a canonical space shared within a category. To assist the alignment, we define several primitive shapes such as cubes, spheres, and cylinders with predefined axes, which are used to fit the target object.

Part Segmentation. Unlike synthetic models from the Internet, which often include original mesh subgroups and part information, real-world scanned models require manual segmentation of each rigid part. In our interface, we provide a mesh cutting method with multi-view observation. The annotators draw boundary polygons on the aligned watertight surface, and the interface automatically splits the mesh into multiple smaller sub-components. To note, if the parts can be disassembled in the real world, we simply scan each part and assemble them into an integral model.

Joint Annotation. In contrast to other object modeling pipelines, articulated objects require joint annotations that link two rigid segmented parts and describe the kinematic structure as a tree. Our interface provides an inspector window that allows the annotator to organize the parts into a tree structure. The annotators then add joint information to each link, annotating a 6D vector (3 values for the joint location and 3 for the joint axis) in a 3D view that contains the parent and child parts. To ensure the correctness of the joint annotation, we provide an animation that demonstrates the motion under the current joint parameters, and the annotators can further refine the annotation.

² https://www.einscan.com
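Such a 6D joint annotation, together with the joint limits, fully determines how a child part moves relative to its parent, which is what the verification animation plays back. The sketch below, under assumed conventions (NumPy, radians, meters), articulates a part given a revolute or prismatic annotation; it is an illustration, not the annotation tool's code.

```python
import numpy as np

def rotate_about_joint(points, joint_loc, joint_axis, angle):
    """Rotate child-part points by `angle` (rad) about a revolute joint.

    `joint_loc` and `joint_axis` follow the 6D annotation described above
    (3 values for location, 3 for axis); the conventions are assumptions.
    """
    loc = np.asarray(joint_loc, dtype=float)
    axis = np.asarray(joint_axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    # Rodrigues' formula for the rotation matrix about `axis`.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    # Rotate about the joint location rather than the world origin.
    return (np.asarray(points) - loc) @ R.T + loc

def translate_along_joint(points, joint_axis, displacement):
    """Prismatic counterpart: slide the child part along the joint axis."""
    axis = np.asarray(joint_axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.asarray(points) + displacement * axis
```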
3.3.3 Physics Annotation

Real-world articulated objects exist in the physical world and have physical properties. To enable AKB-48 in real-world robotic manipulation and interaction tasks, we also annotate physical attributes for each part of the articulated object.

Per-part Mass. We record each rigid part's weight in grams. For objects whose parts are inseparable, we adopt the drainage (water displacement) method [6] to measure the volume of these parts and compute their weight from the densities of their materials.

Per-part Inertia Moment. It is hard to obtain per-part inertia moments in the real world, since scanned articulated models may contain hundreds of thousands of triangles in a very complicated structure. In our method, we simplify these models with a finite set of primitive shapes, such as cuboids and cones, and then compute the inertia moment in simulation based on the combination of these primitives.
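As an illustration of this primitive-based approximation, the sketch below combines the analytic inertia of cuboid primitives with the parallel-axis theorem. The restriction to cuboids and the volume-proportional mass split are assumptions for the example, not the exact procedure used for AKB-48.

```python
import numpy as np

def cuboid_inertia(mass, size):
    """Inertia tensor of a solid cuboid about its own center (principal axes)."""
    x, y, z = size
    return np.diag([mass * (y**2 + z**2) / 12.0,
                    mass * (x**2 + z**2) / 12.0,
                    mass * (x**2 + y**2) / 12.0])

def composite_inertia(primitives, part_mass):
    """Approximate a part's inertia from cuboid primitives (assumed scheme).

    `primitives` is a list of (center, size) tuples in meters; the measured
    part mass is distributed in proportion to primitive volume.
    """
    volumes = np.array([np.prod(size) for _, size in primitives])
    masses = part_mass * volumes / volumes.sum()
    centers = np.array([c for c, _ in primitives], dtype=float)
    com = (masses[:, None] * centers).sum(0) / part_mass   # center of mass
    total = np.zeros((3, 3))
    for m, (center, size) in zip(masses, primitives):
        d = np.asarray(center, dtype=float) - com
        # Parallel-axis theorem: move each primitive's inertia to the COM.
        total += cuboid_inertia(m, size) + m * (np.dot(d, d) * np.eye(3) - np.outer(d, d))
    return total
```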

Figure 4. The overall pipeline of AKBNet. The input of AKBNet is a single RGB-D image with a detected box, and three components are involved: (1) a pose module for predicting per-part segmentation, 6D pose, joint type, and joint properties; (2) a shape module for generating the full mesh of the articulated object at the current joint state; (3) a manipulation module for enabling the RL agent (a UR5 robot arm with a Robotiq 85 gripper) to manipulate the object, and also predicting per-part physics information.

Per-part Material and Friction. We also annotate the surface material and related parameters. For example, a transparent material is annotated with its index of refraction, and ordinary materials are annotated with friction coefficients. These values are obtained by consulting Machinery's Handbook [26].

3.4. Dataset Analysis

Object Categories. To build the AKB-48 dataset, we take the following requirements into consideration: (1) Commonality. AKB-48 should cover most of the articulated object categories in common daily scenes, such as the kitchen, bedroom, and office. (2) Variety. We consider objects with a wide variety of shapes, deformability, textures, and kinematic structures within one category. (3) Usage. The chosen objects should cover various functionalities in use; in addition, suitability for manipulation is prioritized.

Statistics. We first compare AKB-48 with other popular datasets in Table 2. As shown, our object models cover the full set of features needed for real-world articulated object analysis. Specifically, compared to synthetic model repositories, we provide a much finer surface, with an average of around 126K triangles, and real textures, while synthetic models only contain thousands of triangles and synthetic textures. In terms of annotation, we provide part and joint annotations that are sufficient for visual articulation tasks. Furthermore, we also annotate physical information for each model, which has never been considered in either synthetic or real-world model repositories before. We believe the rich annotations can promote further development in articulation research. As for the number of models, we have a comparable number of objects to the current largest articulated object dataset, PartNet-Mobility [31], yet that dataset comprises only CAD models. More statistics, such as the category specification and intra-category variety, can be found in the supplementary materials.

4. AKBNet

In this section, we describe AKBNet, an integral pipeline for the C-VAM problem. The input to AKBNet is a single RGB-D image with detected 2D bounding boxes. We build three sub-modules in AKBNet that aim to estimate per-part 6D pose (Sec. 4.1), reconstruct the full geometry of the articulated object (Sec. 4.2), and reason about the interaction policy from the perception results (Sec. 4.3). The overall pipeline of AKBNet is illustrated in Fig. 4.

4.1. Pose Module

Given an image with a detected 2D bounding box, we can obtain a partial point cloud P ∈ R^{N×3}. First, the input P is processed by a PointNet++ [28] for feature extraction, and we build two branches at the end for predicting the per-point segmentation S and the part-level Normalized Object Coordinate Space [16] (NOCS) map P′ ∈ R^{N×3}. To handle the unknown kinematic structure and joint type, we introduce three extra branches on the feature extractor to classify the joint type δ for its corresponding part k, and to predict the joint properties, including per-point joint location q_i and joint axis u_i. Finally, we apply a voting scheme to obtain the final joint properties q ∈ R^3 and u ∈ R^3. We use cross-entropy loss for part segmentation (L_seg) and joint type classification (L_type), and L2 loss for the NOCS map (L_nocs), joint location (L_loc), and joint axis (L_ax) predictions.
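A minimal PyTorch-style sketch of how such per-point branch heads could sit on a shared point feature is given below; the layer sizes, number of parts, and joint-type count are assumptions rather than the released AKBNet architecture.

```python
import torch
import torch.nn as nn

class PoseHeads(nn.Module):
    """Sketch of per-point prediction heads on a shared PointNet++-style
    feature map. Layer widths, K parts, and joint-type count are assumptions."""
    def __init__(self, feat_dim=128, num_parts=4, num_joint_types=3):
        super().__init__()
        def head(out_dim):
            return nn.Sequential(nn.Conv1d(feat_dim, 128, 1), nn.ReLU(),
                                 nn.Conv1d(128, out_dim, 1))
        self.seg_head = head(num_parts)               # per-point part label S
        self.nocs_head = head(3)                      # per-point NOCS map P'
        self.joint_type_head = head(num_joint_types)  # joint type logits (pooled per part)
        self.joint_loc_head = head(3)                 # per-point vote for joint location q_i
        self.joint_axis_head = head(3)                # per-point vote for joint axis u_i

    def forward(self, feat):                          # feat: (B, feat_dim, N)
        return {
            "seg": self.seg_head(feat),
            "nocs": torch.sigmoid(self.nocs_head(feat)),   # assumed [0, 1] NOCS range
            "joint_type": self.joint_type_head(feat),
            "joint_loc": self.joint_loc_head(feat),
            "joint_axis": self.joint_axis_head(feat),
        }
```

The per-point joint-location and joint-axis votes would then be averaged (or otherwise aggregated) over the points of the corresponding part to obtain the final q and u, as described above.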

Taking all of these terms into consideration, the overall loss L_pos for the pose module is:

    L_pos = λ_seg L_seg + λ_nocs L_nocs + λ_loc L_loc + λ_ax L_ax + λ_type L_type        (1)

Finally, we follow the pose optimization algorithm with kinematic constraints [16] to recover the 6D pose {R, t} for each rigid part, where R ∈ SO(3) denotes rotation and t ∈ R^3 denotes translation.

4.2. Shape Module

Given a partial point cloud P, the shape module aims to rebuild the full geometry M_θ at joint state θ. Following A-SDF [25], we build a feature extractor that processes the partial point cloud P concatenated with a Gaussian-initialized shape embedding φ and joint embedding ψ, where φ encodes the shape of the full articulated object and ψ encodes the joint state, which is shared across the same instance. We use SDF values [27] d_i as supervision and an L1 loss for training the shape module F_sha:

    L_sha = λ_sha (1/N) Σ_{i=1}^{N} ‖F_sha(p_i, φ, ψ) − d_i‖ + λ_φ ‖φ‖_2        (2)

During inference, based on the predicted shape embedding φ and joint embedding ψ, we follow the reconstruction procedure of [27] to recover the full mesh M_θ.
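Conceptually, this reconstruction evaluates the learned SDF on a dense grid conditioned on (φ, ψ) and extracts the zero level set with marching cubes. The sketch below illustrates the idea under assumed grid bounds and decoder signature; f_sha here is a stand-in for the trained shape module, not its actual interface.

```python
import numpy as np
import torch
from skimage import measure

def reconstruct_mesh(f_sha, phi, psi, resolution=64, bound=0.5):
    """Extract the zero level set of a learned SDF (a sketch; the grid size,
    bounds, and decoder signature are assumptions)."""
    lin = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)
    with torch.no_grad():
        pts = torch.from_numpy(grid).float()
        sdf = f_sha(pts, phi, psi).reshape(resolution, resolution, resolution)
    # Marching cubes on the signed distance volume.
    verts, faces, _, _ = measure.marching_cubes(sdf.cpu().numpy(), level=0.0)
    # Map voxel indices back to metric coordinates.
    verts = verts / (resolution - 1) * 2 * bound - bound
    return verts, faces
```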
4.3. Manipulation Module

The manipulation module performs two tasks, opening and pulling, corresponding to the revolute and prismatic joints in articulation respectively. To achieve these tasks, we train two Reinforcement Learning (RL) agents (a UR5 robot arm with a Robotiq 85 gripper), one per task. We provide two state representations: (1) the object state, consisting of the 6D pose {R, t}, joint location q, joint axis u, and full geometry M_θ at the current joint state θ; and (2) the agent state, consisting of the gripper's pose {R_g, t_g} and the gripper's width w_g. We assume the agent can access all information about itself, so the agent state is ground truth in our method. The actions comprise the agent's end-effector 3D translation and the opening width of the gripper. The reward is the rotation angle along the joint axis of the target part for a revolute joint, and the translation distance of that part for a prismatic joint. The RL agent is trained with two popular RL baselines: Truncated Quantile Critics (TQC) [15] and Soft Actor-Critic (SAC) [8], each with the Hindsight Experience Replay (HER) [1] algorithm.

We also perform physics prediction in AKBNet. Specifically, the input is a feature vector of the point cloud P^k of the k-th part. We train a 3-layer MLP with three parallel branches to predict the per-part mass m^k, friction µ^k, and inertia moment I^k, using an L2 loss for training the physics prediction submodule. Please refer to the supplementary materials for more details.
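The reward described above amounts to progress along the target joint's degree of freedom, and the success criterion used later in Sec. 5.1 thresholds that progress against the joint limits. A hedged sketch of both follows; the sign convention and shaping details are assumptions, not the exact terms used to train AKBNet's agents.

```python
def articulation_reward(prev_joint_state, curr_joint_state):
    """Progress-style reward (a sketch): the change in joint state, i.e. the
    rotation angle about the joint axis for a revolute joint, or the
    translation along the axis for a prismatic joint."""
    return curr_joint_state - prev_joint_state

def is_success(initial_state, curr_state, joint_limit):
    """Success criterion from Sec. 5.1: the target part is moved through at
    least 50% of its motion range."""
    low, high = joint_limit
    return abs(curr_state - initial_state) >= 0.5 * (high - low)
```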
5. Experiments

5.1. Experimental Setup

Dataset. For the pose and shape modules, we generate 100K RGB-D images with AKB-48 models for training AKBNet, using the SAMERT data generation scheme [17] with scenes from NOCS [29]. We also capture 10K real-world images, of which 5K are used for fine-tuning the model and the other 5K form the test set. For the manipulation module, we select 68 and 32 instances for training and testing the RL agent, where the former are used for the opening task and the latter for the pulling task. During training, we use different instances at every episode.

Implementation Details. When training the pose and shape modules, we use the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16. The total number of training epochs is 50 and 100 for the two modules, respectively. The loss weights are: λ_seg = 1, λ_nocs = 10, λ_loc = 1, λ_ax = 0.5, λ_type = 1, λ_sha = 1, λ_φ = 0.0001. For the manipulation module, the hyper-parameters are: batch size 512, learning rate 0.001, replay buffer size 100K, soft update coefficient 0.05, and discount factor 0.95. We use RFUniverse [7] as the environment to train the RL agent. For more details, please refer to the supplementary materials.

Metrics. We adopt the following metrics to measure AKBNet's performance. For the pose module, we report three part-based metrics: rotation error in degrees, translation error in meters, and 3D IoU for each part. We also report joint-based metrics: angle error of the joint axis in degrees, location error as a line-to-line distance in meters, and joint type classification accuracy (%). For the shape module, we report the average Chamfer-L1 distance [25] for reconstruction evaluation. For the manipulation module, we report the success rate (%): if the agent can grip the target part and move it through 50% of its motion range, the trial is regarded as a success.
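For the joint-based metrics, the axis error is the angle between the predicted and ground-truth axis directions, and the location error is the minimum distance between the two (infinite) joint lines. A small sketch of one possible implementation, with the sign of the axis ignored and NumPy assumed:

```python
import numpy as np

def joint_axis_angle_error(u_pred, u_gt):
    """Angle (degrees) between predicted and ground-truth joint axes."""
    u_pred = np.asarray(u_pred, float) / np.linalg.norm(u_pred)
    u_gt = np.asarray(u_gt, float) / np.linalg.norm(u_gt)
    cos = np.clip(abs(np.dot(u_pred, u_gt)), 0.0, 1.0)
    return np.degrees(np.arccos(cos))

def joint_line_distance(q_pred, u_pred, q_gt, u_gt):
    """Line-to-line distance (meters) between the predicted and ground-truth
    joint lines, each given by a point q and a direction u."""
    u_pred = np.asarray(u_pred, float) / np.linalg.norm(u_pred)
    u_gt = np.asarray(u_gt, float) / np.linalg.norm(u_gt)
    d = np.asarray(q_gt, float) - np.asarray(q_pred, float)
    n = np.cross(u_pred, u_gt)
    if np.linalg.norm(n) < 1e-8:                 # parallel lines
        return np.linalg.norm(np.cross(d, u_gt))
    return abs(np.dot(d, n)) / np.linalg.norm(n)
```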

Figure 5. Qualitative results. For one instance, from left to right: input RGB-D image, output of pose module, output of shape module,
manipulation demonstration.

5.2. Pose Module Performance

We evaluate NPCS [16], A-NCSH [16], and AKBNet on the real-world test set for the category-level articulation pose estimation task. For the A-NCSH baseline, we use a direct regression and classification scheme to predict the kinematic structure and joint type. The experimental results are shown in Table 3. For pose estimation, we achieve 10.0, 0.023, and 52.7 on rotation error, translation error, and 3D IoU, which are better than NPCS and A-NCSH. For the joint-related evaluation, we can precisely predict the joint type of unseen articulated objects with 94.2% accuracy. Besides, AKBNet achieves errors of 8.7 and 0.019 in joint axis and joint location prediction, respectively.

               Part-based Metrics
Method        | rotation↓ | translation↓ | 3D IoU↑
NPCS [16]     | 12.6      | 0.038        | 48.3
A-NCSH* [16]  | 10.5      | 0.026        | 50.8
AKBNet        | 10.0      | 0.023        | 52.7

               Joint-based Metrics
Method        | angle↓    | distance↓    | type↑
NPCS [16]     | -         | -            | -
A-NCSH* [16]  | 9.2       | 0.021        | 93.8
AKBNet        | 8.7       | 0.019        | 94.2

Table 3. Category-level articulation pose estimation results. ↓ means lower is better; ↑ means higher is better. * indicates that A-NCSH is re-implemented with the extra kinematic structure and joint type prediction modules.
5.3. Shape Module Performance

The experimental results of the shape module are shown in Table 4. With the ground-truth joint state as input, the shape module reconstructs the articulated object with a Chamfer-L1 distance of 5.6. We also evaluate the shape module systematically, given the joint state deduced from the poses of the two linked parts predicted by the pose module. The Chamfer-L1 distance is 3.3 higher than with the ground-truth joint state, indicating that the predicted poses largely affect reconstruction performance.

Mode              | Chamfer-L1 Distance
Joint State GT    | 5.6
Joint State Pre.  | 8.9

Table 4. Articulated object reconstruction results. Pre. means that we use the predicted joint state from the pose module.

5.4. Manipulation Module Performance

We evaluate the opening and pulling tasks on the manipulation module of AKBNet trained with the TQC+HER algorithm, compared with training using SAC+HER. The experimental results are shown in Table 5. With the ground-truth object state, AKBNet completes the opening and pulling manipulation tasks with 68.6% and 92.4% success rates. However, our method does not perform as well when the object state is predicted, with only 26.4% and 32.6% success rates. Qualitative results of AKBNet are shown in Fig. 5.

Our AKBNet can also predict physics information, including per-part mass, friction, and inertia moment. The predicted physics can enable force sensing for AKB-48 objects in simulation, which has the potential to realize force control. For more details, please refer to the supplementary materials.

Method                   | Mode              | Opening | Pulling
AKBNet+SAC [8]+HER [1]   | Object State GT   | 53.8    | 92.4
                         | Object State Pre. | 22.8    | 28.5
AKBNet+TQC [15]+HER [1]  | Object State GT   | 68.6    | 89.7
                         | Object State Pre. | 26.4    | 32.6

Table 5. Success rate (%) on the articulated object manipulation task. Pre. means we use the predicted object state from the pose and shape modules.

6. Conclusion and Crowd-Sourcing Data-Collection Invitation

In this paper, we present AKB-48, a large-scale articulated object knowledge base, and benchmark the C-VAM problem for dealing with articulation tasks. Admittedly, a few articulated object categories may not be covered in AKB-48, although we have covered a large range of categories in daily life. In the future, we will release our FArM tool for collecting more articulated objects; it can also support other scanned shapes, such as those from a mobile reconstructor [13]. We will further publish an online articulation model platform and invite crowd-sourced data collection to contribute to the articulation research community.

Acknowledgement. This work was supported by the National Key R&D Program of China (No. 2021ZD0110700), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046).
14796

References

[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5055–5065, 2017.

[2] Joao Borrego, Atabak Dehban, Rui Figueiredo, Plinio Moreno, Alexandre Bernardino, and José Santos-Victor. Applying domain randomization to synthetic data for object category detection. arXiv preprint arXiv:1807.09834, 2018.

[3] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517. IEEE, 2015.

[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[5] Peng Chang and Taşkın Padır. Sim2real2sim: Bridging the gap between simulation and real-world in flexible object manipulation. In 2020 Fourth IEEE International Conference on Robotic Computing (IRC), pages 56–62. IEEE, 2020.

[6] John D Cutnell and Kenneth W Johnson. Physics, Volume One: Chapters 1-17, volume 1. John Wiley & Sons, 2014.

[7] Haoyuan Fu, Wenqiang Xu, Han Xue, Huinan Yang, Ruolin Ye, Yongxi Huang, Zhendong Xue, Yanfeng Wang, and Cewu Lu. RFUniverse.

[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

[9] Karol Hausman, Scott Niekum, Sarah Osentoski, and Gaurav S Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312. IEEE, 2015.

[10] Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, and Vincent Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In IEEE International Conference on Computer Vision, 2012.

[11] Ajinkya Jain, Rudolf Lioutikov, and Scott Niekum. ScrewNet: Category-independent articulation model estimation from depth images using screw theory. arXiv preprint arXiv:2008.10518, 2020.

[12] Dov Katz and Oliver Brock. Manipulating articulated objects with interactive perception. In 2008 IEEE International Conference on Robotics and Automation, pages 272–277. IEEE, 2008.

[13] Matthew Klingensmith, Ivan Dryanovski, Siddhartha S Srinivasa, and Jizhong Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems, volume 4. Citeseer, 2015.

[14] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9611, 2019.

[15] Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pages 5556–5566. PMLR, 2020.

[16] Xiaolong Li, He Wang, Li Yi, Leonidas J Guibas, A Lynn Abbott, and Shuran Song. Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3706–3715, 2020.

[17] Liu Liu, Han Xue, Wenqiang Xu, Haoyuan Fu, and Cewu Lu. Towards real-world category-level articulation pose estimation. arXiv preprint arXiv:2105.03260, 2021.

[18] Qihao Liu, Weichao Qiu, Weiyao Wang, Gregory D Hager, and Alan L Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020.

[19] Robert Maier, Kihwan Kim, Daniel Cremers, Jan Kautz, and Matthias Nießner. Intrinsic3D: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3114–3122, 2017.

[20] Roberto Martín-Martín, Clemens Eppner, and Oliver Brock. The RBO dataset of articulated objects and interactions. The International Journal of Robotics Research, 38(9):1013–1019, 2019.

[21] Roberto Martín-Martín, Sebastian Höfer, and Oliver Brock. An integrated approach to visual perception of articulated objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5091–5097. IEEE, 2016.

[22] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[23] Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. arXiv preprint arXiv:2101.02692, 2021.

[24] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.

[25] Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. A-SDF: Learning disentangled signed distance functions for articulated shape representation. arXiv preprint arXiv:2104.07645, 2021.

[26] Erik Oberg and Franklin Day Jones. Machinery's Handbook. Industrial Press, 1914.
[27] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.

[28] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5105–5114, 2017.

[29] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019.

[30] Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2Motion: Joint analysis of motion parts and attributes from 3D shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019.

[31] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.

[32] Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, and Leonidas Guibas. Deep part induction from articulated object pairs. ACM Transactions on Graphics, 37(6), 2019.

[33] Vicky Zeng, Timothy E Lee, Jacky Liang, and Oliver Kroemer. Visual identification of articulated object parts. arXiv preprint arXiv:2012.00284, 2020.


