AKB-48: A Real-World Articulated Object Knowledge Base
Liu Liu, Wenqiang Xu, Haoyuan Fu, Sucheng Qian, Qiaojun Yu, Yang Han, Cewu Lu†
Shanghai Jiao Tong University
{liuliu1993, vinjohn, simon-fuhaoyuan, qiansucheng, yqjllxs, lucewu}@sjtu.edu.cn
Figure 1. AKB-48 consists of 2,037 articulated object models of 48 categories scanned from the real world. The objects are annotated with
ArtiKG, and can support a full task spectrum from computer vision to robotics manipulation.
Figure 2. The Articulation Knowledge Graph (ArtiKG) defined in the AKB-48 dataset. In ArtiKG, we annotate four types of knowledge: Appearance, Structure, Semantics, and Physical Property. In this figure, values are rounded to two decimal places for presentation.
3. Articulation Knowledge Base, AKB-48

When constructing the knowledge base, three immediate questions must be answered: (1) What kinds of knowledge should we annotate on each object? (2) Which objects should we annotate: those from the real world or from the simulated world? (3) How can the object knowledge be annotated efficiently? To answer these questions, we describe the ArtiKG in Sec. 3.1, discuss object selection in Sec. 3.2, propose the FArM pipeline in Sec. 3.3, and analyze the dataset (diversity, difficulty) in Sec. 3.4.

3.1. Articulated Object Knowledge Graph, ArtiKG

Different tasks require different kinds of object knowledge. To unify the annotation representation, we organize it into a multi-modal knowledge graph, named ArtiKG. The ArtiKG consists of four major parts, namely appearance, structure, physics, and semantics. The details are described in the following and visualized in Fig. 2.

Appearance. For each instance, we store its shape as a mesh along with its textures. When scanning the object from the real world, we also collect multi-view RGB-D snapshots of the object.

Structure. The key difference between an articulated object and a rigid object is the kinematic structure. An articulated object has concepts such as joint and part, which are not meaningful for a rigid object. For each joint, we annotate the joint type, parameters, and movement limits. For each part, we segment the corresponding kinematic part.

Semantics. After the basic geometric and structural information is annotated, we assign semantic information to the object in a coarse-to-fine process. We give a UUID to each instance, then assign the category and the corresponding taxonomy to the object according to WordNet [22]. We also label the semantic parts. Although we already annotate the kinematic parts, they are not quite the same as semantic parts. Take a mug with a handle, for example: the handle is not attached to the mug body through a joint, so it is not a kinematic part, but it is a semantic part because it indicates where a human normally grabs the mug.

Physics. For each part, we annotate physical properties, including mass, friction coefficient, and moment of inertia (see Sec. 3.3.3); properties derivable directly from the mesh, such as intrinsic dimensions, are not discussed. Besides, since the annotation information is organized in a modular way, new attributes can conveniently be added to the ArtiKG. Moreover, although the ArtiKG is designed for articulated objects, it can also be trivially extended to rigid and flexible objects.
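To make this modular organization concrete, the sketch below shows one plausible way to hold a single ArtiKG entry in code. The dataclass and field names (ArtiKGEntry, PartAnno, JointAnno, and so on) are illustrative assumptions for exposition, not the released AKB-48 data format.

```python
# Illustrative sketch of how one ArtiKG entry could be organized in code.
# Field names and the container classes are assumptions for exposition only;
# they are not the released AKB-48 data format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class JointAnno:
    joint_type: str            # "revolute" or "prismatic"
    location: List[float]      # 3D point on the joint axis
    axis: List[float]          # 3D unit direction of the axis
    limits: List[float]        # [lower, upper] movement limits
    parent: int                # index of the parent kinematic part
    child: int                 # index of the child kinematic part


@dataclass
class PartAnno:
    mesh_file: str             # per-part textured mesh
    semantic_label: str        # e.g. "handle", "lid"
    mass_g: float              # per-part mass in grams
    friction: float            # surface friction coefficient
    inertia: List[float]       # moment of inertia (flattened tensor)


@dataclass
class ArtiKGEntry:
    uuid: str                  # unique instance id
    category: str              # category name, aligned to WordNet
    wordnet_synset: str        # taxonomy node, e.g. "scissors.n.01"
    whole_mesh: str            # appearance: textured mesh of the object
    rgbd_snapshots: List[str] = field(default_factory=list)
    parts: List[PartAnno] = field(default_factory=list)
    joints: List[JointAnno] = field(default_factory=list)
```

Because each block of knowledge lives in its own record, adding a new attribute (say, a new per-part material tag) only extends one record type without touching the rest of the entry.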
3.2. Object Selection: Real-World Scanning vs. CAD Modeling

The choice between real-world scanning and CAD modeling is weighed from two perspectives: annotation accuracy, and cost in time and money.

Annotation Accuracy. Judging from the content of the ArtiKG, objects from the real world have multiple advantages over CAD models, such as appearance and physical properties. Admittedly, a CAD model can capture inner structures, for example a Gundam or a Transformer figure, while scanning techniques focus mainly on the surface. Since objects with inner structures that cannot be easily disassembled pose challenges for both artists and scanners, we plan to add such objects when the techniques are more mature. Fortunately, most daily objects can be disassembled, so scanning can handle them properly.

Cost in Time and Money. As discussed earlier, the ShapeNet-like model collection paradigm is limited by the large time and money cost of artists' manual CAD model building when investigating new categories or kinematic structures. On the other hand, many daily articulated objects are cheap and can be scanned by a layman. We compare the average money and time budgets in Table 1. The CAD modeling budget is estimated from outsourcing services on the Taobao website; from our survey, most artists spend more than 2 hours (over 120 minutes) to model one articulated object, and the labor cost averages over 100 dollars per model.

            CAD modeling   Real-world scanning
Time (min)  >120           20
Money ($)   >100           3

Table 1. Budget comparison between our real-world scanning and CAD modeling for articulated objects.
Dataset                  Num    AV      AT      Part  Joint  ST  PS  PM  PI  PF
Synthetic Model Dataset
ShapeNet [4]             >50K   <2K     <5K     -     -      ✓   -   -   -   -
PartNet [24]             >20K   <2K     <5K     ✓     -      ✓   -   -   -   -
Shape2Motion [30]        2K     <0.5K   <1K     ✓     ✓      -   -   -   -   -
PartNet-Mobility [31]    2K     <0.5K   <1K     ✓     ✓      ✓   ✓   -   -   -
Real-World Model Dataset
YCB [3]                  21     ∼40K    ∼90K    -     -      -   -   -   -   -
LineMod [10]             15     ∼19K    ∼39K    -     -      -   -   -   -   -
RBO [20]                 14     ∼5K     ∼10K    ✓     ✓      -   -   -   -   -
AKB-48 (Ours)            2,037  ∼56K    ∼110K   ✓     ✓      ✓   ✓   ✓   ✓   ✓

Table 2. Comparison with other popular model datasets. Our AKB-48 dataset provides the four types of information annotated in our ArtiKG: Appearance (AV: average number of vertices; AT: average number of triangles), Structure (Part, Joint), Semantics (ST: semantic taxonomy; PS: per-part semantic label), and Physics (PM: per-part mass; PI: per-part inertia moment; PF: per-part friction).
3.3. Fast Articulation Knowledge Modeling (FArM) Pipeline

Once we have determined what to annotate and which objects to annotate, the remaining problem is how to make the annotation process affordable.

3.3.1 Model Acquisition Equipment

To efficiently collect real-world articulated models, we set up a recording system whose configuration is illustrated in Fig. 3. The apparatus comprises three components: an EinScan Pro 2020 for scanning², an Intel RealSense D435 for multi-view RGB-D snapshots, and multi-scale rotating turntables with a lift bracket. In our setup, each object can be scanned within 5 minutes.

3.3.2 Articulation Modeling

After model acquisition, we develop an articulated object modeling interface with a 3D GUI for annotation guidance. Specifically, our modeling workflow splits the whole process into three sub-processes:

Object Alignment. This process requires the annotator to align the scanned articulated object from camera space into a canonical space that is shared within a category. To assist the alignment, we define several primitive shapes, such as cube, sphere, and cylinder with predefined axes, which are used to fit the targeted object.

Part Segmentation. Unlike synthetic models from the Internet, which often include the original mesh subgroups and part information, real-world scanned models require manual segmentation of each rigid part. In our interface, we provide a mesh cutting method with multi-view observation. The annotators draw boundary polygons on the aligned watertight surface, and the interface automatically splits the mesh into multiple smaller sub-components. Note that if the parts can be disassembled in the real world, we simply scan each part and assemble them into an integral model.

Joint Annotation. In contrast to other object modeling pipelines, articulated objects require joint annotations that link two rigid segmented parts and describe the kinematic structure as a tree. Our interface provides an inspector window that allows the annotator to reorganize the parts into a tree structure. The annotators then add joint information to each link and annotate a 6D vector (3 for joint location and 3 for joint axis) in a 3D view that contains the parent and child parts. To ensure the correctness of the joint annotation, we provide an animation that demonstrates the motion under the current joint parameters, and the annotators can further refine the annotation.

² https://www.einscan.com
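As a rough illustration of the animation-based sanity check described above, the following sketch moves a child part about a 6D joint annotation (3D location q plus 3D axis u) to preview the motion. The function name and the NumPy-based transform are assumptions; this is not the actual annotation interface.

```python
# Minimal sketch of previewing a joint annotation, assuming the joint is
# stored as a 6D vector (3D location q, 3D axis u) plus a type string.
# This mirrors the animation-based check described above; it is not the
# actual annotation tool.
import numpy as np


def preview_joint(child_pts: np.ndarray, q: np.ndarray, u: np.ndarray,
                  joint_type: str, state: float) -> np.ndarray:
    """Move the child part's points to joint state `state`
    (radians for revolute joints, meters for prismatic joints)."""
    u = u / np.linalg.norm(u)                      # axis must be a unit vector
    if joint_type == "prismatic":
        return child_pts + state * u               # slide along the axis
    # Revolute: rotate about the axis through q (Rodrigues' formula).
    K = np.array([[0, -u[2], u[1]],
                  [u[2], 0, -u[0]],
                  [-u[1], u[0], 0]])
    R = np.eye(3) + np.sin(state) * K + (1 - np.cos(state)) * (K @ K)
    return (child_pts - q) @ R.T + q


# Example: swing a child part by 30 degrees to eyeball the annotation.
pts = np.random.rand(1000, 3)
moved = preview_joint(pts, q=np.zeros(3), u=np.array([0.0, 0.0, 1.0]),
                      joint_type="revolute", state=np.deg2rad(30))
```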
3.3.3 Physics Annotation

Real-world articulated objects exist in the physical world and have physical properties. To enable AKB-48 for real-world robotic manipulation and interaction tasks, we also annotate physical attributes for each part of the articulated object.

Per-part Mass. We record each rigid part's weight in grams. For objects whose parts cannot be separated, we adopt the drainage method [6] to measure the volume of these parts and compute their weight from the densities of their materials.

Per-part Inertia Moment. It is hard to obtain the per-part inertia moment in the real world, since scanned articulated models may contain hundreds of thousands of triangles arranged in a very complicated structure. In our method, we simplify these models with a finite set of primitive shapes, such as cuboids and cones, and then compute the inertia moment in simulation based on the combination of these primitive shapes.
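The primitive-based simplification can be made concrete with a short sketch that sums primitive inertias (a cuboid and a cone here) about a common origin via the parallel-axis theorem. The primitive parameters, masses, and offsets are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch: approximate a part's inertia by summing primitive inertias
# (a cuboid and a cone here) shifted to a common origin with the
# parallel-axis theorem. Shapes, masses, and offsets are illustrative only.
import numpy as np


def cuboid_inertia(m, sx, sy, sz):
    """Solid cuboid (side lengths sx, sy, sz), about its centroid."""
    return (m / 12.0) * np.diag([sy**2 + sz**2, sx**2 + sz**2, sx**2 + sy**2])


def cone_inertia(m, r, h):
    """Solid cone (base radius r, height h along z), about its centroid."""
    ixx = iyy = m * (3.0 / 20.0 * r**2 + 3.0 / 80.0 * h**2)
    return np.diag([ixx, iyy, 3.0 / 10.0 * m * r**2])


def shift_to_origin(I_c, m, c):
    """Parallel-axis theorem: move a centroidal inertia to the part origin."""
    c = np.asarray(c, dtype=float)
    return I_c + m * (np.dot(c, c) * np.eye(3) - np.outer(c, c))


# Example: a box-like body with a small conical cap.
primitives = [
    (cuboid_inertia(0.060, 0.16, 0.16, 0.08), 0.060, [0.0, 0.0, 0.04]),
    (cone_inertia(0.010, 0.03, 0.05),         0.010, [0.0, 0.0, 0.10]),
]
I_part = sum(shift_to_origin(I, m, c) for I, m, c in primitives)
print(np.round(I_part, 6))
```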
Figure 4. The overall pipeline of AKBNet. The input of AKBNet is a single RGB-D image with a detected box, and three components are involved: (1) a pose module for predicting per-part segmentation, 6D pose, joint type, and joint properties; (2) a shape module for generating the full mesh of the articulated object under the current joint state; (3) a manipulation module for enabling the RL agent (a UR5 robot arm with a Robotiq 85 gripper) to manipulate the object and for predicting per-part physics information.
Per-part Material and Friction. We also annotate the surface material and related parameters. For example, a transparent material is annotated with its index of refraction, and ordinary materials are annotated with friction coefficients. These values are obtained by consulting Machinery's Handbook [26].

3.4. Dataset Analysis

Object Categories. To build the AKB-48 dataset, we take the following requirements into consideration: (1) Commonality. AKB-48 should cover most of the articulated object categories in common daily scenes, such as the kitchen, bedroom, and office. (2) Variety. For each category, we consider objects with a wide variety of shapes, deformability, textures, and kinematic structures. (3) Usage. The chosen objects should cover various functionalities in use; in addition, suitability for manipulation experiments is prioritized.

Statistics. We first compare AKB-48 with other popular datasets in Table 2. As shown, our object models cover the full set of features needed for real-world articulated object analysis. Specifically, compared to synthetic model repositories, we provide much finer surfaces, with an average of around 126K triangles, and real textures, while synthetic models only contain thousands of triangles and synthetic textures. In terms of annotation, we provide part and joint annotations that are sufficient for visual articulation tasks. Furthermore, we also annotate physical information for each model, which has never been considered in either synthetic or real-world model repositories before. We believe the rich annotations can promote further development in articulation research. As for the number of models, we have a comparable number of objects to the current largest articulated object dataset, PartNet-Mobility [31], yet the latter comprises only CAD models. More statistics, such as category specification and intra-category variety, can be found in the supplementary materials.

4. AKBNet

In this section, we describe AKBNet, an integral pipeline for the C-VAM problem. The input to AKBNet is a single RGB-D image with detected 2D bounding boxes. We build three sub-modules in AKBNet that aim to estimate per-part 6D pose (Sec. 4.1), reconstruct the full geometry of the articulated object (Sec. 4.2), and reason about the interaction policy from the perception results (Sec. 4.3). The overall pipeline of AKBNet is illustrated in Fig. 4.
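For orientation, the skeleton below expresses this three-module layout in code. The class and method names (AKBNet, run, and the module callables) are hypothetical placeholders matching the structure described here and in Fig. 4, not released code.

```python
# Hypothetical skeleton of the AKBNet pipeline described above: a single
# RGB-D observation with a detected box flows through pose -> shape ->
# manipulation. All names below are placeholders, not the released API.
from typing import Any, Dict


class AKBNet:
    def __init__(self, pose_module: Any, shape_module: Any, manip_module: Any):
        self.pose_module = pose_module     # Sec. 4.1: per-part poses, joints
        self.shape_module = shape_module   # Sec. 4.2: full mesh at joint state
        self.manip_module = manip_module   # Sec. 4.3: RL policy + physics heads

    def run(self, rgbd: Any, box_2d: Any) -> Dict[str, Any]:
        pose = self.pose_module(rgbd, box_2d)    # segmentation, 6D poses, joint params
        mesh = self.shape_module(rgbd, pose)     # reconstruction under current joint state
        action = self.manip_module(pose, mesh)   # interaction policy + per-part physics
        return {"pose": pose, "shape": mesh, "action": action}
```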
4.1. Pose Module

Given an image with a detected 2D bounding box, we can obtain the partial point cloud P ∈ R^{N×3}. First, the input P is processed by a PointNet++ [28] for feature extraction, and we build two branches at the end for predicting the per-point segmentation S and the part-level Normalized Object Coordinate Space [16] (NOCS) map P′ ∈ R^{N×3}. To handle the unknown kinematic structure and joint type, we introduce three extra branches on the feature extractor to classify the joint type δ of its corresponding part k and to predict the joint properties, including the joint location q_i and joint axis u_i. Finally, we apply a voting scheme to obtain the final joint properties q ∈ R^3 and u ∈ R^3. We use cross-entropy losses for part segmentation (L_seg) and joint type classification (L_type), and L2 losses for the NOCS map (L_nocs), joint location (L_loc), and joint axis (L_ax) predictions. Taking all the loss terms into consideration, the overall loss L_pos for the pose module is:
\mathcal{L}_{pos} = \lambda_{seg}\mathcal{L}_{seg} + \lambda_{nocs}\mathcal{L}_{nocs} + \lambda_{loc}\mathcal{L}_{loc} + \lambda_{ax}\mathcal{L}_{ax} + \lambda_{type}\mathcal{L}_{type} \quad (1)
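A minimal PyTorch-style sketch of the multi-branch head and the combined loss of Eq. (1) is given below. It assumes a PointNet++ backbone that outputs per-point features; the layer widths, number of parts, and loss weights are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of the pose-module heads and the combined loss of Eq. (1).
# A PointNet++ backbone producing per-point features is assumed; all layer
# sizes and loss weights are illustrative, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseHeads(nn.Module):
    def __init__(self, feat_dim=128, num_parts=3, num_joint_types=2):
        super().__init__()
        self.seg = nn.Conv1d(feat_dim, num_parts, 1)         # per-point part labels
        self.nocs = nn.Conv1d(feat_dim, 3, 1)                 # per-point NOCS map
        self.jtype = nn.Conv1d(feat_dim, num_joint_types, 1)  # per-point joint type
        self.jloc = nn.Conv1d(feat_dim, 3, 1)                 # per-point vote: joint location
        self.jaxis = nn.Conv1d(feat_dim, 3, 1)                # per-point vote: joint axis

    def forward(self, feats):                                 # feats: (B, feat_dim, N)
        return (self.seg(feats), self.nocs(feats), self.jtype(feats),
                self.jloc(feats), self.jaxis(feats))


def pose_loss(pred, gt, w=(1.0, 10.0, 1.0, 1.0, 1.0)):
    seg, nocs, jtype, jloc, jaxis = pred
    return (w[0] * F.cross_entropy(seg, gt["seg"])      # part segmentation
            + w[1] * F.mse_loss(nocs, gt["nocs"])        # NOCS map (L2)
            + w[2] * F.cross_entropy(jtype, gt["jtype"]) # joint type
            + w[3] * F.mse_loss(jloc, gt["jloc"])        # joint location (L2)
            + w[4] * F.mse_loss(jaxis, gt["jaxis"]))     # joint axis (L2)
```

In practice, the final joint location and axis would then be obtained by aggregating (voting over) the per-point predictions, as described above.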
4.3. Manipulation Module

The manipulation module performs two tasks, opening and pulling, which correspond to the revolute and prismatic joints in articulation, respectively. To achieve these tasks, we train two Reinforcement Learning (RL) agents (a UR5 robot arm with a Robotiq 85 gripper). We provide two state representations: (1) the object state, consisting of the 6D pose {R, t}, joint location q, axis u, and full geometry M_θ under the current joint state θ; and (2) the agent state, consisting of the gripper's pose {R_g, t_g} and the gripper's width w_g. We assume that the agent can access all information about itself, so the agent state is ground truth in our method. The actions comprise the agent's end-effector 3D translation and the opening width of the gripper. The reward is the rotation angle along the joint axis of the target part for a revolute joint, and the corresponding translation distance for a prismatic joint. The RL agents are trained with two popular RL baselines: Truncated Quantile Critics (TQC) [15] and Soft Actor-Critic (SAC) [8], both with the Hindsight Experience Replay (HER) [1] algorithm.
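To illustrate the state, action, and reward design described above, here is a small sketch of how the observation vector and the per-step reward could be assembled. The variable names and the flat concatenation are assumptions; the reconstructed geometry M_θ is omitted from the flat vector for brevity.

```python
# Hedged sketch of the RL interface implied above: the observation packs the
# object state (6D pose, joint location/axis, joint state) and the agent
# state (gripper pose and width); the reward is progress along the target
# joint. Names and shapes are illustrative, not the training code.
import numpy as np


def build_observation(R, t, q, u, theta, R_g, t_g, w_g):
    """Concatenate object state and agent state into a flat vector.
    (The reconstructed geometry M_theta is consumed separately.)"""
    return np.concatenate([R.reshape(-1), t, q, u, [theta],
                           R_g.reshape(-1), t_g, [w_g]])


def step_reward(theta_now, theta_prev):
    """Progress of the target part along its joint: rotation angle for a
    revolute joint, translated distance for a prismatic joint."""
    return theta_now - theta_prev
```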
We also perform physics prediction in AKBNet. Specifically, the input is a feature vector of the point cloud P^k of the kth part. We train a 3-layer MLP and build three parallel branches to predict the per-part mass m^k, friction µ^k, and inertia moment I^k. We use an L2 loss to train the physics prediction submodule. Please refer to the supplementary materials for more details.

Metrics. We adopt the following metrics to measure AKBNet's performance. For the pose module, we report three part-based metrics: rotation error measured in degrees, translation error measured in meters, and 3D IoU for each part. We also report joint-based metrics: the angle error of the joint axis measured in degrees, the location error as a line-to-line distance measured in meters, and joint type classification accuracy (%). For the shape module, we report the average Chamfer-L1 distance [25] for reconstruction evaluation. For the manipulation module, we report the success rate (%): if the agent can grip the target part and move it through 50% of its motion range, the trial is regarded as a success.
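To pin down the joint-based metrics, the sketch below computes the angle error between predicted and ground-truth joint axes and the line-to-line distance between the two joint axes. The helper names are assumptions, and the 3D IoU and Chamfer-L1 computations are omitted.

```python
# Hedged sketch of two joint-based metrics used below: the angle between the
# predicted and ground-truth joint axes, and the minimum (line-to-line)
# distance between the two joint axes. Names are illustrative only.
import numpy as np


def axis_angle_error_deg(u_pred, u_gt):
    u_pred = u_pred / np.linalg.norm(u_pred)
    u_gt = u_gt / np.linalg.norm(u_gt)
    cos = np.clip(abs(np.dot(u_pred, u_gt)), -1.0, 1.0)  # axes are sign-agnostic
    return np.degrees(np.arccos(cos))


def line_to_line_distance(q_pred, u_pred, q_gt, u_gt):
    """Minimum distance between two 3D lines, each given by a point q on the
    line and a direction u."""
    n = np.cross(u_pred, u_gt)
    if np.linalg.norm(n) < 1e-8:                          # (near-)parallel axes
        d = q_gt - q_pred
        return np.linalg.norm(d - np.dot(d, u_pred) / np.dot(u_pred, u_pred) * u_pred)
    return abs(np.dot(q_gt - q_pred, n)) / np.linalg.norm(n)
```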
5.2. Pose Module Performance

We evaluate NPCS [16], A-NCSH [16], and AKBNet on the real-world test set for the category-level articulation pose estimation task. For the A-NCSH baseline, we use a direct regression and classification scheme to predict the kinematic structure and joint type. The experimental results are shown in Table 3.
Figure 5. Qualitative results. For one instance, from left to right: input RGB-D image, output of pose module, output of shape module,
manipulation demonstration.
For pose estimation, AKBNet achieves 10.0, 0.023, and 52.7 in rotation error, translation error, and 3D IoU, respectively, which is better than both NPCS and A-NCSH. For the joint-related evaluation, we can precisely predict the joint type of unseen articulated objects with 94.2% accuracy. Besides, AKBNet achieves errors of 8.7 and 0.019 in joint axis and joint location prediction, respectively.

Part-based metrics
Method          rotation↓  translation↓  3D IoU↑
NPCS [16]       12.6       0.038         48.3
A-NCSH* [16]    10.5       0.026         50.8
AKBNet          10.0       0.023         52.7

Joint-based metrics
Method          angle↓  distance↓  type↑
NPCS [16]       -       -          -
A-NCSH* [16]    9.2     0.021      93.8
AKBNet          8.7     0.019      94.2

Table 3. Category-level articulation pose estimation results. ↓ means the lower the better; ↑ means the higher the better. * indicates that A-NCSH is re-implemented with the extra kinematic structure and joint type prediction modules.
5.3. Shape Module Performance

The experimental results of the shape module are shown in Table 4. With ground-truth joint state input, the shape module reconstructs the articulated object with a Chamfer-L1 distance of 5.6. We also systematically evaluate the shape module given the predicted joint state, which is deduced from the predicted poses of the two linked parts from the pose module. The Chamfer-L1 distance is then 3.3 higher than with the ground-truth joint state, indicating that the predicted poses strongly affect reconstruction performance.
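For reference, a brute-force sketch of the symmetric Chamfer-L1 distance on point sets sampled from the predicted and ground-truth meshes is shown below; treating "L1" as unsquared nearest-neighbor distances is a common convention and an assumption about the exact variant used here.

```python
# Hedged sketch of a symmetric Chamfer-L1 distance between two point sets
# sampled from the predicted and ground-truth meshes. Brute force, for
# reference only; the exact averaging convention in the paper may differ.
import numpy as np


def chamfer_l1(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3), b: (M, 3) sampled surface points; distances are unsquared."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```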
Mode              Chamfer-L1 distance
Joint State GT    5.6
Joint State Pre.  8.9

Table 4. Articulated object reconstruction results. Pre. means that we use the predicted joint state from the pose module.

5.4. Manipulation Module Performance

We evaluate the opening and pulling tasks on the manipulation module of AKBNet using the TQC+HER training algorithm, compared with SAC+HER. The experimental results are shown in Table 5. With the ground-truth object state, AKBNet can complete the opening and pulling manipulation tasks with 68.6% and 92.4% success rates. However, our method does not perform as well when the object state is predicted, with only 26.4% and 32.6% success rates. Qualitative results of AKBNet are shown in Fig. 5.

AKBNet can also predict physics information, including per-part mass, friction, and inertia moment. The predicted physics can enable force sensing for AKB-48 objects in simulation, which has the potential to realize force control. For more details, please refer to the supplementary materials.

Method                    Mode               Opening  Pulling
AKBNet+SAC [8]+HER [1]    Object State GT    53.8     92.4
                          Object State Pre.  22.8     28.5
AKBNet+TQC [15]+HER [1]   Object State GT    68.6     89.7
                          Object State Pre.  26.4     32.6

Table 5. Success rate (%) on the articulated object manipulation task. Pre. means we use the predicted object state from the pose and shape modules.

6. Conclusion and Crowd-Sourcing Data-Collection Invitation

In this paper, we present AKB-48, a large-scale articulated object knowledge base, and benchmark the C-VAM problem for dealing with articulation tasks. Admittedly, a few articulated object categories may not yet be collected in AKB-48, although we have covered a sufficiently large set of categories from daily life. In the future, we will release our FArM tool for collecting more articulated objects, and it can also support shapes from any scanning source, such as mobile reconstruction [13]. We will further publish an online articulation model platform and invite crowd-sourced data collection to contribute to the articulation research community.

Acknowledgement. This work was supported by the National Key R&D Program of China (No. 2021ZD0110700), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046).
References

[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5055–5065, 2017.
[2] Joao Borrego, Atabak Dehban, Rui Figueiredo, Plinio Moreno, Alexandre Bernardino, and José Santos-Victor. Applying domain randomization to synthetic data for object category detection. arXiv preprint arXiv:1807.09834, 2018.
[3] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517. IEEE, 2015.
[4] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[5] Peng Chang and Taşkin Padif. Sim2real2sim: Bridging the gap between simulation and real-world in flexible object manipulation. In 2020 Fourth IEEE International Conference on Robotic Computing (IRC), pages 56–62. IEEE, 2020.
[6] John D Cutnell and Kenneth W Johnson. Physics, Volume One: Chapters 1–17, volume 1. John Wiley & Sons, 2014.
[7] Haoyuan Fu, Wenqiang Xu, Han Xue, Huinan Yang, Ruolin Ye, Yongxi Huang, Zhendong Xue, Yanfeng Wang, and Cewu Lu. RFUniverse.
[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
[9] Karol Hausman, Scott Niekum, Sarah Osentoski, and Gaurav S Sukhatme. Active articulation model estimation through interactive perception. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3305–3312. IEEE, 2015.
[10] Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, and Vincent Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In IEEE International Conference on Computer Vision, 2012.
[11] Ajinkya Jain, Rudolf Lioutikov, and Scott Niekum. ScrewNet: Category-independent articulation model estimation from depth images using screw theory. arXiv preprint arXiv:2008.10518, 2020.
[12] Dov Katz and Oliver Brock. Manipulating articulated objects with interactive perception. In 2008 IEEE International Conference on Robotics and Automation, pages 272–277. IEEE, 2008.
[13] Matthew Klingensmith, Ivan Dryanovski, Siddhartha S Srinivasa, and Jizhong Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems, volume 4. Citeseer, 2015.
[14] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9611, 2019.
[15] Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pages 5556–5566. PMLR, 2020.
[16] Xiaolong Li, He Wang, Li Yi, Leonidas J Guibas, A Lynn Abbott, and Shuran Song. Category-level articulated object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3706–3715, 2020.
[17] Liu Liu, Han Xue, Wenqiang Xu, Haoyuan Fu, and Cewu Lu. Towards real-world category-level articulation pose estimation. arXiv preprint arXiv:2105.03260, 2021.
[18] Qihao Liu, Weichao Qiu, Weiyao Wang, Gregory D Hager, and Alan L Yuille. Nothing but geometric constraints: A model-free method for articulated object pose estimation. arXiv preprint arXiv:2012.00088, 2020.
[19] Robert Maier, Kihwan Kim, Daniel Cremers, Jan Kautz, and Matthias Nießner. Intrinsic3D: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3114–3122, 2017.
[20] Roberto Martín-Martín, Clemens Eppner, and Oliver Brock. The RBO dataset of articulated objects and interactions. The International Journal of Robotics Research, 38(9):1013–1019, 2019.
[21] Roberto Martín-Martín, Sebastian Höfer, and Oliver Brock. An integrated approach to visual perception of articulated objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5091–5097. IEEE, 2016.
[22] George A Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[23] Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. arXiv preprint arXiv:2101.02692, 2021.
[24] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
[25] Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. A-SDF: Learning disentangled signed distance functions for articulated shape representation. arXiv preprint arXiv:2104.07645, 2021.
[26] Erik Oberg and Franklin Day Jones. Machinery's Handbook. Industrial Press, 1914.
[27] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.
[28] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5105–5114, 2017.
[29] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2651, 2019.
[30] Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2Motion: Joint analysis of motion parts and attributes from 3D shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019.
[31] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. SAPIEN: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.
[32] Li Yi, Haibin Huang, Difan Liu, Evangelos Kalogerakis, Hao Su, and Leonidas Guibas. Deep part induction from articulated object pairs. ACM Transactions on Graphics, 37(6), 2019.
[33] Vicky Zeng, Timothy E Lee, Jacky Liang, and Oliver Kroemer. Visual identification of articulated object parts. arXiv preprint arXiv:2012.00284, 2020.