RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
📖 Paper: RoboPEPP
📖 Pre-print: https://arxiv.org/abs/2411.17662
📹 Video: https://youtu.be/pbM60-kHSdE
Authors: Raktim Gautam Goswami1, Prashanth Krishnamurthy1, Yann LeCun2,3, Farshad Khorrami1
1 New York University Tandon School of Engineering
2 New York University Courant Institute of Mathematical Sciences
3 Meta-FAIR
- Pre-Training: A robot pose and joint angle estimation framework with embedding-predictive pre-training to enhance the network’s understanding of the robot’s physical model.
- Fine-Tuning: An efficient network for robot pose and joint angle estimation that uses the pre-trained encoder-predictor alongside joint angle and keypoint estimators, trained with randomly masked inputs to enhance occlusion robustness.
- Keypoint Filtering: A confidence-based keypoint filtering method to handle cases where only part of the robot is visible in the image (see the sketch after this list).
- Experiments: Extensive experiments showing RoboPEPP’s superior pose estimation, joint angle prediction, occlusion robustness, and computational efficiency.
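As a rough illustration of the keypoint-filtering idea, the sketch below drops low-confidence keypoints before pose estimation. It is a minimal sketch, not RoboPEPP's actual implementation; the threshold value, the heatmap-peak confidence measure, and the shapes are all assumptions.

```python
# Minimal sketch of confidence-based keypoint filtering (illustrative only).
import numpy as np

def filter_keypoints(keypoints_2d, heatmaps, threshold=0.5):
    """Keep only keypoints whose heatmap peak exceeds a confidence threshold.

    keypoints_2d: (N, 2) predicted 2D keypoint locations
    heatmaps:     (N, H, W) per-keypoint heatmaps
    """
    confidences = heatmaps.reshape(len(heatmaps), -1).max(axis=1)
    keep = confidences > threshold  # discard keypoints on occluded/out-of-frame joints
    return keypoints_2d[keep], keep

# The surviving 2D-3D correspondences would then be passed to a PnP solver
# (e.g., cv2.solvePnP) to recover the camera-to-robot pose.
```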
Fig. 1: Overview of our RoboPEPP framework.
```bash
conda create --name robopepp python=3.10
conda activate robopepp
pip install -r requirements.txt
export CUDA_HOME=/path/to/cuda/cuda-12.2/
pip install -v "git+https://github.com/facebookresearch/pytorch3d.git@stable"
```
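As an optional sanity check (not part of the official setup instructions), you can verify that the key dependencies import correctly:

```python
# Optional environment sanity check; not from the official setup instructions.
import torch
import pytorch3d

print("CUDA available:", torch.cuda.is_available())
print("PyTorch3D version:", pytorch3d.__version__)
```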
Note:
Before running the code, you may need to configure `accelerate` and log in to `wandb`.
If you prefer not to use them, you can disable them in the configuration files in the `config` folder by setting their values to `false`.
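For example, such toggles might look like the following in a config file (the key names here are hypothetical; check the actual files in the `config` folder):

```yaml
# Hypothetical key names; consult the actual config files in the config folder.
use_wandb: false       # disable Weights & Biases logging
use_accelerate: false  # disable accelerate-specific setup
```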
RoboPEPP uses the DREAM dataset.
Please download the dataset and place it in a dedicated folder on your system.
The required URDF files are already included in the repository's `urdf` folder.
Before training, make sure to update the dataset root path and the URDF file path in the appropriate configuration files located in the `config` folder.
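The path entries might look roughly like this (hypothetical key names and an illustrative URDF filename):

```yaml
# Hypothetical excerpt; match the key names used in the actual config files.
dataset_root: /path/to/DREAM
urdf_path: urdf/panda.urdf  # illustrative filename
```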
As outlined in the manuscript, RoboPEPP undergoes a two-stage training process:
- Embedding Predictive Pre-Training for the encoder-predictor.
- End-to-End Fine-Tuning of the full model.
To perform embedding predictive pre-training, follow the instructions provided here.
Alternatively, you can skip this step and use our pre-trained weights directly.
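For intuition, the embedding-predictive objective can be sketched as follows. This is a minimal JEPA-style illustration under assumed module interfaces (`encoder`, `target_encoder`, `predictor`), not the repository's actual training code. In the paper, the masked regions are chosen around the robot's joints, so predicting their embeddings pushes the network to reason about the robot's physical structure.

```python
# Minimal JEPA-style sketch of embedding-predictive pre-training (illustrative only).
import torch
import torch.nn.functional as F

def pretraining_step(encoder, target_encoder, predictor, images, mask):
    """images: (B, C, H, W); mask: (N,) boolean over patches, True = masked."""
    with torch.no_grad():
        targets = target_encoder(images)          # (B, N, D) full-image embeddings
    context = encoder(images, mask=~mask)         # encode only the visible patches
    predictions = predictor(context, mask=mask)   # predict masked-patch embeddings
    # Regress predicted embeddings toward the target embeddings of the masked patches.
    return F.smooth_l1_loss(predictions, targets[:, mask])
```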
Before end-to-end fine-tuning, update the path to the pre-trained weights (from either step above) in the configuration file.
```bash
mkdir checkpoints
accelerate launch train.py --config <path-to-config-file>
# example
accelerate launch train.py --config configs/panda.yaml
```
This will train the model end-to-end and save the weights in the checkpoints/ folder. The pre-trained weights are available here.
During testing, we leverage the GroundingDINO framework from the Grounded-SAM-2 repository to predict bounding boxes. While our test pipeline supports real-time bounding box prediction, we recommend pre-processing the test dataset using the steps here, which enables faster testing.
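For reference, bounding boxes could be pre-computed with GroundingDINO's inference utilities roughly as follows. The paths, text prompt, and thresholds below are assumptions, and the repository's own pre-processing steps may differ:

```python
# Rough sketch of pre-computing bounding boxes with GroundingDINO (paths/prompt assumed).
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "GroundingDINO_SwinT_OGC.py",   # model config (assumed path)
    "groundingdino_swint_ogc.pth",  # checkpoint (assumed path)
)
image_source, image = load_image("frame_0000.png")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="robot arm",            # assumed text prompt
    box_threshold=0.35,
    text_threshold=0.25,
)
# The boxes could then be saved per frame and loaded by the test pipeline.
```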
Run the test code using the trained weights as follows:

```bash
python test.py --config <path-to-config-file>
```

This will evaluate the model on the `test_seq` specified in your config file and print metrics including ADD AUC, PCK, and joint error. If evaluating the sim-to-real fine-tuned models, change the `checkpoint_name` in the config file to the correct `.pt` file (e.g., `robopepp_ssl_realsense.pt`).
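As background on the reported metrics, the ADD error and its AUC can be sketched as below. This is an illustrative implementation of the standard metric, not the repository's evaluation code:

```python
# Illustrative ADD error and AUC computation (not the repo's evaluation code).
import numpy as np

def add_error(points, R_gt, t_gt, R_est, t_est):
    """Mean 3D distance between model points under GT and estimated poses.

    points: (N, 3) points on the robot model; R: (3, 3) rotation; t: (3,) translation
    """
    p_gt = points @ R_gt.T + t_gt
    p_est = points @ R_est.T + t_est
    return np.linalg.norm(p_gt - p_est, axis=1).mean()

def add_auc(errors, max_threshold=0.1):
    """Area under the accuracy-vs-threshold curve over per-frame ADD errors."""
    thresholds = np.linspace(0.0, max_threshold, 1000)
    accuracy = np.array([(errors < t).mean() for t in thresholds])
    return np.trapz(accuracy, thresholds) / max_threshold
```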
RoboPEPP also supports self-supervised Sim-to-Real fine-tuning for real-world sequences in the Panda dataset.
This process adapts the model to the real-world sequence specified by `s2r_seq` in the configuration file.

```bash
accelerate launch train_s2r.py --config configs/panda.yaml
```
This will fine-tune the model and store the updated weights in the checkpoints/ directory.
Table 1: Comparison of robot pose estimation using AUC on the ADD metric. Best values among methods using unknown joint angles and bounding boxes during evaluation are bolded. HPE∗ denotes HPE [4] evaluated with the same off-the-shelf bounding box detector as RoboPEPP.
Fig. 2: Qualitative Comparison on Panda Photo (Example 1) and Occlusion (Example 2 and 3) datasets: Predicted poses and joint angles are used to generate a mesh overlaid on the original image, where closer alignment indicates greater accuracy. Highlighted rectangles indicate regions where other methods’ meshes misalign, while RoboPEPP achieves high precision.
```bibtex
@inproceedings{goswami2025robopepp,
  title={{RoboPEPP}: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training},
  author={Goswami, Raktim Gautam and Krishnamurthy, Prashanth and LeCun, Yann and Khorrami, Farshad},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6930--6939},
  year={2025}
}
```

