Inferring Articulated Rigid Body Dynamics from RGBD Video

IROS 2022

1University of Southern California

2Stanford University

3Microsoft Research

4Google Research


Because they can reproduce physical phenomena ranging from light interaction to contact mechanics, simulators are becoming increasingly useful in application domains where real-world interaction or labeled data is difficult to obtain. Despite recent progress, significant human effort is needed to configure simulators to accurately reproduce real-world behavior. We introduce a pipeline that combines inverse rendering with differentiable simulation to create digital twins of real-world articulated mechanisms from depth or RGB videos. Our approach automatically discovers joint types and estimates their kinematic parameters, while the dynamic properties of the overall mechanism are tuned to attain physically accurate simulations. Control policies optimized in our derived simulation transfer successfully back to the original system, as we demonstrate on a simulated system. Further, our approach accurately reconstructs the kinematic tree of an articulated mechanism being manipulated by a robot, and the highly nonlinear dynamics of a real-world coupled pendulum mechanism.

Pipeline of our video2sim approach, which creates a simulation from a video of an articulated mechanism. In this example, a cartpole is identified by first finding the objects in the video and extracting their segmentation maps using a segmentation network. Next, the poses of the rigid objects are tracked over the image sequence by leveraging an inverse renderer, which allows us to optimize the SE(3) poses under a loss function defined on the pixel values. The tracked poses are then used to identify the articulations in the scene, i.e., the types and kinematic parameters of the joints connecting the rigid bodies. Finally, we leverage gradient-based Bayesian inference algorithms to approximate posterior distributions over simulation parameters, which allows us to reconstruct a full simulation of the observed mechanism with accurate dynamical properties.
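The joint-identification step above can be sketched as follows. This is a minimal illustration only, assuming we already have the relative SE(3) poses between a pair of tracked bodies; the function `classify_joint`, its variance test, and the `tol` threshold are hypothetical and not the paper's actual method:

```python
import numpy as np

def classify_joint(rel_rotations, rel_translations, tol=1e-3):
    """Guess the joint type connecting two rigid bodies from their
    relative poses over time (hypothetical sketch, not the paper's code).

    rel_rotations:    (T, 3, 3) rotations of body B in body A's frame.
    rel_translations: (T, 3)    translations of body B in body A's frame.
    """
    # How much the relative rotation / translation varies over the sequence,
    # measured against the first frame.
    rot_var = np.max(np.linalg.norm(rel_rotations - rel_rotations[0], axis=(1, 2)))
    trans_var = np.max(np.linalg.norm(rel_translations - rel_translations[0], axis=1))

    if rot_var < tol and trans_var < tol:
        return "fixed"      # relative pose never changes
    if rot_var < tol:
        return "prismatic"  # pure translation along some axis
    return "revolute"       # relative orientation changes over time
```

A second step would then fit the kinematic parameters of each detected joint (rotation axis and pivot, or sliding direction) to the tracked poses, before the dynamical parameters are estimated via the gradient-based Bayesian inference described above.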



Articulated Tree

We infer the articulation of a simulated tree structure that has five rigid bodies connected via two revolute joints and two fixed joints.


Given a sequence of depth images of a simulated cartpole, we find the articulations and dynamical properties to accurately reproduce the dynamics of the cartpole.

Rott's Pendulum

Our approach finds a realistic digital twin of Rott's coupled pendulum system given an RGB video of the mechanism.

Craftsman System

By recording a depth video of a robot manipulating a system made of wood parts from a Craftsman toy construction set, we are able to recover the kinematic structure.


We thank Chris Denniston and David Millard for their feedback and discussion around this work, as well as Vedant Mistry for his contributions to an earlier prototype of our implementation.

We are grateful to Maria Hakuba, Hansjörg Frei and Christoph Schar for kindly permitting us to use their video of Rott's mechanism in this work.

This research was supported by a Google Ph.D. Fellowship.

Last updated on September 11, 2022