This is the Group 8 project for NUS ME5406 Part II.
This project implements various reinforcement learning algorithms to train a humanoid robot to walk in the MuJoCo environment. Our team members have implemented different algorithms:
- Dong Sihan: PPO and TD3
- Hu Bowen: SAC
- Xu Chunnan: DDPG and D4PG
Each algorithm comes with a comprehensive training framework and visualization tools for analyzing and comparing performance on the humanoid walking task.
This repository contains an implementation of Proximal Policy Optimization (PPO) for training a humanoid agent in the MuJoCo environment. The implementation features parallel environment training, generalized advantage estimation (GAE), and a Beta distribution-based policy network.
This repository also implements the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for controlling the Humanoid-v5 environment in Gymnasium. The implementation includes training and testing scripts with comprehensive visualization capabilities.
```
git checkout D4PG-xcn
```

See the `D4PG-xcn` branch's README for instructions on training DDPG and D4PG.
```
git checkout loggcc-branch
```

See the `loggcc-branch` branch's README for instructions on training SAC.
- `ppo_ours.py`: Main training script containing the PPO implementation
- `td3_ours.py`: Main TD3 implementation file
- `runs/`: Current TensorBoard logs (for ongoing training)
- `data_best/`: Best PPO model checkpoints (`model_best.pt`)
- `humanoid/`: Pretrained TensorBoard logs (for reference)
- `models/`: Saved TD3 model weights (`best_model.pth`)
The project requires the following dependencies with their minimum versions:
- Python >= 3.10
- PyTorch >= 2.6.0
- Gymnasium[mujoco] >= 1.1.1
- NumPy >= 1.24.0
- TensorBoard >= 2.19.0
- tqdm >= 4.67.1
- Install uv for Python package management:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Clone the repo:

```
git clone https://github.com/Sihanzz/ME5406_Group_Project.git
git checkout main
```
Install dependencies:

```
uv sync
```

To start training the PPO agent:

```
uv run ppo_ours.py train
```

Or just test PPO:

```
uv run ppo_ours.py
```

The training script will:
- Initialize parallel environments for training
- Use a Beta distribution-based policy network
- Implement PPO with GAE for advantage estimation
- Save the best model based on episode rewards
- Log training metrics to TensorBoard
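As a rough illustration of the GAE step above, here is a minimal NumPy sketch, not the exact code in `ppo_ours.py`; the function name and rollout layout are assumed for illustration:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` has length T + 1 (it includes a bootstrap value for the
    state after the final step); `dones` marks episode terminations.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # mask out the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:T]  # value targets for the critic
    return advantages, returns
```

The backward recursion accumulates discounted TD residuals, trading bias for variance via the `lam` parameter (`GAE_LAMBDA` in the hyperparameters below).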
To monitor training progress:
```
uv run tensorboard --logdir runs
```

This will start a TensorBoard server where you can visualize:
- Episode returns
- Policy and value losses
- Entropy
- Learning rate
Note: If you encounter issues with TensorBoard, make sure you have activated the correct virtual environment and installed all dependencies.
The best model is automatically saved in the `data_best/` directory. You can load a pretrained model by setting `load_pretrained = True` in the training script.
The implementation includes:
- Parallel environment training for efficient sampling
- Generalized Advantage Estimation (GAE)
- Beta distribution-based policy network
- Orthogonal initialization of network weights
- Cosine annealing learning rate scheduler
- Normalized observations and rewards
- Gradient clipping
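To illustrate two of the features above (the Beta distribution-based policy and orthogonal initialization), here is a minimal sketch of a policy head; the class name, hidden size, and layer layout are illustrative assumptions, not the exact architecture in `ppo_ours.py`:

```python
import torch
import torch.nn as nn

class BetaPolicy(nn.Module):
    """Policy head that outputs a Beta distribution over actions in (0, 1);
    samples are rescaled to the environment's action bounds elsewhere."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        # Orthogonal initialization of all linear layers, as listed above
        for layer in self.modules():
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight)
                nn.init.zeros_(layer.bias)

    def forward(self, obs):
        h = self.body(obs)
        # softplus + 1 keeps both concentration parameters above 1,
        # which makes the Beta distribution unimodal
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return torch.distributions.Beta(alpha, beta)
```

Because Beta samples are bounded in (0, 1), this avoids the action clipping that a Gaussian policy would need on a bounded action space.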
Key hyperparameters (defined in `ppo_ours.py`):

- `NUM_ENVS`: Number of parallel environments (default: 4)
- `SAMPLE_STEPS`: Steps sampled per iteration (default: 2048)
- `TOTAL_STEPS`: Total training steps (default: 4,000,000)
- `MINI_BATCH_SIZE`: Mini-batch size for training (default: 256)
- `EPOCHES`: Number of epochs per iteration (default: 10)
- `GAMMA`: Discount factor (default: 0.99)
- `GAE_LAMBDA`: GAE lambda parameter (default: 0.95)
- `CLIP_EPS`: PPO clipping epsilon (default: 0.2)
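`CLIP_EPS` enters PPO's clipped surrogate objective, which can be sketched as follows; this is a hedged illustration with a hypothetical function name, and the actual loss in `ppo_ours.py` may differ in details such as entropy bonuses:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the elementwise minimum gives a pessimistic (lower) bound,
    # which prevents excessively large policy updates
    return -torch.min(unclipped, clipped).mean()
```

When the new and old policies agree the ratio is 1 and the clip is inactive; once the ratio leaves `[1 − ε, 1 + ε]`, the gradient through the clipped term vanishes.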
To start training the TD3 agent:

```
uv run td3_ours.py --train
```

Key training parameters:

- `--env`: Environment name (default: `Humanoid-v5`)
- `--max_steps`: Maximum training steps (default: 1,000,000)
- `--batch_size`: Batch size (default: 256)
- `--learning_rate`: Learning rate (default: 3e-4)
- `--gamma`: Discount factor (default: 0.99)
- `--tau`: Target network update rate (default: 0.005)
To test the trained TD3 agent:

```
uv run td3_ours.py
```

The testing script will:
- Load the best saved model
- Run evaluation episodes continuously until you stop it manually
- Display performance metrics
- Show real-time visualization
- Actor Network: 256-256 hidden layers with ReLU activation
- Critic Networks: Two independent 256-256 networks
- Target networks updated via Polyak averaging (τ = 0.005)
- Policy update delay (d): 2
- Target policy smoothing noise (σ): 0.2
- Noise clip range (c): 0.5
- Initial random steps: 25,000
- Replay buffer size: 1,000,000
Training progress can be monitored using TensorBoard:
```
uv run tensorboard --logdir runs/TD3_training
```

The implementation follows the original TD3 paper with several optimizations:
- Efficient network architecture
- Comprehensive logging
- Robust model saving/loading
- Real-time visualization
- Fujimoto et al. (2018), "Addressing Function Approximation Error in Actor-Critic Methods" (the original TD3 paper): https://arxiv.org/abs/1802.09477
- Schulman et al. (2017), "Proximal Policy Optimization Algorithms" (the original PPO paper): https://arxiv.org/abs/1707.06347
- CleanRL implementation reference: https://github.com/vwxyzjn/cleanrl
This project is open source and available under the MIT License.



