A PyTorch implementation of the model from the paper "OpenVLA: An Open-Source Vision-Language-Action Model", combining visual perception, natural language understanding, and robotic action prediction for embodied AI tasks.
OpenVLA is a multimodal transformer that:
- Sees: Processes RGB images using a vision encoder (DINOv2 + SigLIP)
- Understands: Interprets natural language instructions using Qwen2-0.5B
- Acts: Predicts discretized robotic actions (7-DOF: position, rotation, gripper)
```
┌─────────────┐      ┌──────────────┐      ┌─────────────────┐
│    Image    │      │ Instruction  │      │     Action      │
│ (224x224x3) │      │    (Text)    │      │   (7D vector)   │
└──────┬──────┘      └──────┬───────┘      └─────────────────┘
       │                    │                       ▲
       ▼                    ▼                       │
┌─────────────┐      ┌──────────────┐               │
│   Vision    │      │   Language   │               │
│   Encoder   │      │  Tokenizer   │               │
│  (DINOv2)   │      │   (Qwen2)    │               │
└──────┬──────┘      └──────┬───────┘               │
       │                    │                       │
       └─────────┬──────────┘                       │
                 ▼                                  │
          ┌──────────────┐                          │
          │  Qwen2-0.5B  │                          │
          │ Language LM  │                          │
          └──────┬───────┘                          │
                 │                                  │
                 ▼                                  │
          ┌──────────────┐      ┌─────────────────┐ │
          │    Logits    │      │     Action      │ │
          │ (Vocab Size) │─────▶│   Discretizer   │─┘
          └──────────────┘      └─────────────────┘
```
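The diagram's final step, mapping predicted logits back to a continuous 7-DOF action, can be sketched as follows. This is a simplified illustration: the logits here range over the 256 action bins directly, whereas in the full model they range over the LM vocabulary and the predicted token ids are first mapped to bin indices.

```python
import numpy as np

NUM_BINS = 256          # discretization bins (matches the config used below)
MIN_A, MAX_A = -1.0, 1.0

def decode_action(bin_logits: np.ndarray) -> np.ndarray:
    """Greedy-decode per-dimension bin logits into a continuous action.

    Simplified sketch: assumes one row of bin logits per action dimension.
    """
    bin_ids = bin_logits.argmax(axis=-1)         # (7,) one bin id per DOF
    bin_width = (MAX_A - MIN_A) / NUM_BINS
    return MIN_A + (bin_ids + 0.5) * bin_width   # map bin ids to bin centers

rng = np.random.default_rng(0)
logits = rng.normal(size=(7, NUM_BINS))          # one logit row per DOF
action = decode_action(logits)
print(action.shape)  # (7,)
```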
- Python 3.8+
- PyTorch 2.0+
- CUDA (recommended for training)
```shell
# Clone the repository
git clone https://github.com/pie33000/OpenVLA
cd OpenVLA

# Create conda environment
conda create -n openvla python=3.12
conda activate openvla

# Install dependencies
pip install -r requirements.txt
```

- VIOLA: Vision-based manipulation tasks
- All Open X-Embodiment datasets
Install the Google Cloud CLI (https://cloud.google.com/storage/docs/gsutil_install#linux), then download the VIOLA dataset:

```shell
gsutil -m cp -r gs://gresearch/robotics/viola/0.1.0 .
```
```shell
python convert_tf_to_numpy.py
```

This will:
- Download TensorFlow datasets
- Convert them to HDF5 format in the data/ directory
- Extract images, instructions, and 7-DOF actions
```
data/
├── viola_0.h5
├── viola_1.h5
└── viola_2.h5
```
Each HDF5 file contains:
- images: RGB images (H, W, 3)
- instrs: Natural language instructions
- actions: 7-DOF action vectors [x, y, z, rx, ry, rz, gripper]
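Reading an episode file with this layout can be sketched with h5py; the dataset names follow the description above, and the tiny synthetic file is created here only to make the example self-contained. HDF5 slicing is lazy, so only the requested items are loaded into memory.

```python
import h5py
import numpy as np

# Build a tiny synthetic file with the same layout (4 timesteps)
with h5py.File("viola_demo.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((4, 224, 224, 3), dtype=np.uint8))
    f.create_dataset("instrs", data=np.array([b"pick up the mug"] * 4))
    f.create_dataset("actions", data=np.zeros((4, 7), dtype=np.float32))

# Lazy, memory-friendly access: only the requested slice is read from disk
with h5py.File("viola_demo.h5", "r") as f:
    img = f["images"][0]            # (224, 224, 3) uint8
    instr = f["instrs"][0].decode() # "pick up the mug"
    action = f["actions"][0]        # (7,) float32

print(img.shape, instr, action.shape)
```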
```shell
# Train the model
python train.py
```

Training features:
- Scalable Data Loading: Handles large datasets efficiently
- Action Discretization: Converts continuous actions to discrete tokens
- Cross-entropy Loss: Standard language modeling objective
- Memory Optimization: File handle caching and lazy loading
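The action-discretization step listed above can be sketched with plain NumPy. The uniform binning and bin-center decoding are assumptions consistent with the 256-bin, [-1, 1] configuration shown below; the repo's ActionDiscretizer additionally maps bin ids into tokenizer vocabulary ids.

```python
import numpy as np

NUM_BINS, MIN_A, MAX_A = 256, -1.0, 1.0

def discretize(action: np.ndarray) -> np.ndarray:
    """Continuous action in [MIN_A, MAX_A] -> integer bin ids in [0, NUM_BINS - 1]."""
    clipped = np.clip(action, MIN_A, MAX_A)
    scaled = (clipped - MIN_A) / (MAX_A - MIN_A)   # normalize to [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def undiscretize(bins: np.ndarray) -> np.ndarray:
    """Bin ids -> bin centers; reconstruction error is at most half a bin width."""
    return MIN_A + (bins + 0.5) * (MAX_A - MIN_A) / NUM_BINS

action = np.array([-0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0])
bins = discretize(action)
recon = undiscretize(bins)
print(np.abs(recon - action).max())  # bounded by the bin width
```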
- Base: DINOv2 vision transformer + SigLIP
- Output: 1536-dimensional visual embeddings
- Input: 224×224 RGB images
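Fusing the two backbones into 1536-dimensional embeddings and projecting them to the language model's width can be sketched as below. The per-backbone feature dims (768 each) and the single linear projector are assumptions for illustration; the repo's VisionEncoder may differ.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Concatenate DINOv2 and SigLIP patch features, project to the LM width.

    Hypothetical sketch: dims (768 + 768 -> 1536 -> 896) are assumptions.
    """
    def __init__(self, dino_dim=768, siglip_dim=768, lm_dim=896):
        super().__init__()
        self.proj = nn.Linear(dino_dim + siglip_dim, lm_dim)

    def forward(self, dino_feats, siglip_feats):
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)  # (B, N, 1536)
        return self.proj(fused)                                # (B, N, 896)

proj = VisionProjector()
tokens = proj(torch.randn(1, 256, 768), torch.randn(1, 256, 768))
print(tokens.shape)  # torch.Size([1, 256, 896])
```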
```python
discretizer = ActionDiscretizer(
    tokenizer=tokenizer,   # Qwen2 tokenizer
    action_dim=7,          # [x, y, z, rx, ry, rz, gripper]
    num_bins=256,          # Discretization bins
    min_action=-1.0,       # Action range minimum
    max_action=1.0         # Action range maximum
)

# Convert a continuous action to tokens
action = np.array([-0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0])
tokens = discretizer(action)  # Returns token IDs

# Convert tokens back to actions
decoded_action = discretizer.decode_token_ids_to_actions(tokens)
```

```python
class OpenVLA(nn.Module):
    def __init__(self, dim=896, device="cuda", action_dim=7, num_bins=256):
        super().__init__()

        # Vision encoder (DINOv2)
        self.vision_encoder = VisionEncoder(dim=dim)

        # Language model (Qwen2-0.5B)
        self.language_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

        # Action discretizer
        self.action_discretizer = ActionDiscretizer(...)
```

```python
# 6-DOF manipulation (no gripper)
discretizer = ActionDiscretizer(
    tokenizer=tokenizer,
    action_dim=6,
    num_bins=512,      # Higher resolution
    min_action=-2.0,
    max_action=2.0
)

# Mobile manipulation (base + arm)
discretizer = ActionDiscretizer(
    tokenizer=tokenizer,
    action_dim=10,     # [base_x, base_y, base_theta, arm_joints...]
    num_bins=256,
    min_action=-1.0,
    max_action=1.0
)
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- DINOv2: Learning Robust Visual Features without Supervision
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
- Qwen2 Technical Report
- RT-1: Robotics Transformer
For questions or issues, please open a GitHub issue or contact the maintainers.