Humanoid Robot Walking Using Reinforcement Learning in MuJoCo

This is a group project for NUS ME5406 Part II by Group 8.

Overview

This project implements various reinforcement learning algorithms to train a humanoid robot to walk in the MuJoCo environment. Our team members have implemented different algorithms:

  • Dong Sihan: PPO and TD3
  • Hu Bowen: SAC
  • Xu Chunnan: DDPG and D4PG

Each algorithm is implemented with comprehensive training frameworks and visualization tools to analyze and compare their performance in the humanoid walking task.

PPO and TD3 Implementation for Humanoid Control (main branch)

This repository contains an implementation of Proximal Policy Optimization (PPO) for training a humanoid agent in the MuJoCo environment. The implementation features parallel environment training, generalized advantage estimation (GAE), and a Beta distribution-based policy network.

This repository also implements the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for controlling the Humanoid-v5 environment in Gymnasium. The implementation includes training and testing scripts with comprehensive visualization capabilities.

Switch to a Different Branch

Switch to the DDPG and D4PG branch:

git checkout D4PG-xcn

Check the D4PG-xcn branch's README for instructions on training DDPG and D4PG.

Switch to the SAC branch:

git checkout loggcc-branch

Check the loggcc-branch README for instructions on training SAC.

Demo

PPO Training Demo

(demo GIF: PPO humanoid training)

TD3 Training Demo

(demo GIF: TD3 humanoid training)

Project Structure

  • ppo_ours.py: Main training script containing the PPO implementation
  • td3_ours.py: Main TD3 implementation file
  • runs/: Directory containing current tensorboard logs (for ongoing training)
  • data_best/: Directory for storing the best model checkpoints for PPO (model_best.pt)
  • humanoid/: Directory containing pretrained tensorboard logs (for reference)
  • models/: Directory for saved TD3 model weights (best_model.pth)

Dependencies

The project requires the following dependencies with their minimum versions:

  • Python >= 3.10
  • PyTorch >= 2.6.0
  • Gymnasium[mujoco] >= 1.1.1
  • NumPy >= 1.24.0
  • TensorBoard >= 2.19.0
  • tqdm >= 4.67.1

Installation

  1. Install UV for Python package management:
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the repository:
git clone https://github.com/Sihanzz/ME5406_Group_Project.git

PPO Usage

Switch to main branch

git checkout main

PPO Training

Install dependencies:

uv sync

To start training the PPO agent:

uv run ppo_ours.py train

Or to test a trained PPO agent:

uv run ppo_ours.py

The training script will:

  • Initialize parallel environments for training
  • Use a Beta distribution-based policy network
  • Implement PPO with GAE for advantage estimation
  • Save the best model based on episode rewards
  • Log training metrics to TensorBoard
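
The GAE step listed above can be sketched as follows. This is a minimal illustration of Generalized Advantage Estimation, not the repository's exact code; the function name and array layout are assumptions:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one sampled trajectory.

    `values` carries one extra entry: the bootstrap value of the final state.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        # Exponentially weighted sum of TD errors, cut off at episode ends
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```

Setting `lam=1` recovers plain Monte Carlo advantages; `lam=0` gives one-step TD errors.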

PPO Monitoring Training

To monitor training progress:

uv run tensorboard --logdir runs

This will start a TensorBoard server where you can visualize:

  • Episode returns
  • Policy and value losses
  • Entropy
  • Learning rate

Note: If you encounter any issues with TensorBoard, make sure you have activated the correct virtual environment and installed all dependencies.

PPO Model Checkpoints

The best model is automatically saved in the data_best/ directory. You can load a pretrained model by setting load_pretrained = True in the training script.
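
A checkpoint-loading helper might look like the sketch below. The file path data_best/model_best.pt comes from this README; the checkpoint dictionary keys (`"model"`, `"optimizer"`, `"step"`) are assumptions, not the script's actual layout:

```python
import torch

def load_pretrained(model, optimizer, path="data_best/model_best.pt"):
    """Hypothetical sketch: restore a saved 'best' checkpoint before training."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    # Return the step the checkpoint was saved at, if recorded
    return checkpoint.get("step", 0)
```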

PPO Implementation Details

The implementation includes:

  • Parallel environment training for efficient sampling
  • Generalized Advantage Estimation (GAE)
  • Beta distribution-based policy network
  • Orthogonal initialization of network weights
  • Cosine annealing learning rate scheduler
  • Normalized observations and rewards
  • Gradient clipping
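
Two of the items above, the Beta-distribution policy head and orthogonal weight initialization, can be sketched together. The class and layer sizes here are illustrative assumptions, not the repository's actual network:

```python
import torch
import torch.nn as nn

class BetaPolicy(nn.Module):
    """Sketch of a Beta-distribution policy head.

    Softplus + 1 keeps alpha, beta > 1, so the density stays unimodal
    and bounded on [0, 1]; actions are later rescaled to the env bounds.
    """
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        # Orthogonal initialization, as listed in the implementation details
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.orthogonal_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, obs):
        h = self.body(obs)
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return torch.distributions.Beta(alpha, beta)
```

A sample lies in [0, 1] and would be rescaled with `low + (high - low) * dist.sample()` before being sent to the environment.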

PPO Hyperparameters

Key hyperparameters (defined in ppo_ours.py):

  • NUM_ENVS: Number of parallel environments (default: 4)
  • SAMPLE_STEPS: Steps to sample per iteration (default: 2048)
  • TOTAL_STEPS: Total training steps (default: 4,000,000)
  • MINI_BATCH_SIZE: Mini batch size for training (default: 256)
  • EPOCHES: Number of epochs per iteration (default: 10)
  • GAMMA: Discount factor (default: 0.99)
  • GAE_LAMBDA: GAE lambda parameter (default: 0.95)
  • CLIP_EPS: PPO clipping epsilon (default: 0.2)
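
CLIP_EPS enters through PPO's clipped surrogate objective, which can be sketched in a few lines (an illustration of the standard PPO loss, not the script's exact code):

```python
import numpy as np

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: limits how far the new policy
    can move from the one that collected the data."""
    ratio = np.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # We maximize the surrogate, so return its negation as a loss
    return -np.mean(np.minimum(unclipped, clipped))
```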

TD3 Usage

TD3 Training

uv run td3_ours.py --train

Key training parameters:

  • --env: Environment name (default: Humanoid-v5)
  • --max_steps: Maximum training steps (default: 1,000,000)
  • --batch_size: Batch size (default: 256)
  • --learning_rate: Learning rate (default: 3e-4)
  • --gamma: Discount factor (default: 0.99)
  • --tau: Target network update rate (default: 0.005)

TD3 Testing/Visualization

uv run td3_ours.py

The testing script will:

  1. Load the best saved model
  2. Run evaluation episodes continuously until you stop it manually
  3. Display performance metrics
  4. Show real-time visualization

TD3 Model Architecture

  • Actor Network: 256-256 hidden layers with ReLU activation
  • Critic Networks: Two independent 256-256 networks
  • Target networks updated via Polyak averaging (τ = 0.005)
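
The architecture above can be sketched as follows. The observation/action sizes assume Gymnasium's Humanoid-v5 defaults, and the helper names are illustrative, not the repository's code:

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Two 256-unit hidden layers with ReLU, as described above
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 348, 17  # Humanoid-v5 default sizes (assumption)
actor = mlp(obs_dim, act_dim)
critic1 = mlp(obs_dim + act_dim, 1)  # twin critics score (state, action) pairs
critic2 = mlp(obs_dim + act_dim, 1)
actor_target = copy.deepcopy(actor)

def polyak_update(target, source, tau=0.005):
    """Soft target update: target <- tau * source + (1 - tau) * target."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1 - tau).add_(tau * s)
```

The small tau makes target networks trail the online networks slowly, which stabilizes the bootstrapped critic targets.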

TD3 Key Hyperparameters

  • Policy update delay (d): 2
  • Target policy smoothing noise (σ): 0.2
  • Noise clip range (c): 0.5
  • Initial random steps: 25,000
  • Replay buffer size: 1,000,000
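
The smoothing noise (σ) and clip range (c) above combine in TD3's target policy smoothing step, sketched here. The action limit of 0.4 assumes Humanoid's default torque bounds; the function name is hypothetical:

```python
import numpy as np

def smoothed_target_action(mu_target, sigma=0.2, clip=0.5, act_limit=0.4):
    """Add clipped Gaussian noise to the target actor's output, then
    clip to the valid action range, so the critic target averages over
    a small neighborhood of actions rather than a single point."""
    noise = np.clip(np.random.normal(0.0, sigma, mu_target.shape), -clip, clip)
    return np.clip(mu_target + noise, -act_limit, act_limit)
```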

TD3 Monitoring

Training progress can be monitored using TensorBoard:

uv run tensorboard --logdir runs/TD3_training

TD3 Implementation Details

The implementation follows the original TD3 paper with several optimizations:

  • Efficient network architecture
  • Comprehensive logging
  • Robust model saving/loading
  • Real-time visualization

License

This project is open source and available under the MIT License.
