This is the Group 8 project for NUS ME5406 Part II.
This project implements various reinforcement learning algorithms to train a humanoid robot to walk in the MuJoCo environment. Our team members have implemented different algorithms:
- Dong Sihan: PPO and TD3
- Hu Bowen: SAC
- Xu Chunnan: DDPG and D4PG
Each algorithm comes with a comprehensive training framework and visualization tools for analyzing and comparing performance on the humanoid walking task.
This repository contains an implementation of Proximal Policy Optimization (PPO) for training a humanoid agent in the MuJoCo environment. The implementation features parallel environment training, generalized advantage estimation (GAE), and a Beta distribution-based policy network.
This repository also implements the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for controlling the Humanoid-v5 environment in Gymnasium. The implementation includes training and testing scripts with comprehensive visualization capabilities.
```
git checkout D4PG-xcn
```

See the `D4PG-xcn` branch's README for instructions on training DDPG and D4PG.
```
git checkout loggcc-branch
```

See the `loggcc-branch` branch's README for instructions on training SAC.
- `ppo_ours.py`: Main training script containing the PPO implementation
- `td3_ours.py`: Main TD3 implementation file
- `runs/`: Current TensorBoard logs (for ongoing training)
- `data_best/`: Best PPO model checkpoints (`model_best.pt`)
- `humanoid/`: Pretrained TensorBoard logs (for reference)
- `models/`: Saved TD3 model weights (`best_model.pth`)
The project requires the following dependencies with their minimum versions:
- Python >= 3.10
- PyTorch >= 2.6.0
- Gymnasium[mujoco] >= 1.1.1
- NumPy >= 1.24.0
- TensorBoard >= 2.19.0
- tqdm >= 4.67.1
- Install uv for Python package management:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Clone the repo:

```
git clone https://github.com/Sihanzz/ME5406_Group_Project.git
git checkout main
```
Install dependencies:

```
uv sync
```

To start training the PPO agent:

```
uv run ppo_ours.py train
```

Or just test PPO:

```
uv run ppo_ours.py
```

The training script will:
- Initialize parallel environments for training
- Use a Beta distribution-based policy network
- Implement PPO with GAE for advantage estimation
- Save the best model based on episode rewards
- Log training metrics to TensorBoard
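As a rough illustration of the GAE step above, here is a minimal NumPy sketch, not the exact code in `ppo_ours.py`; the function name and rollout layout are assumed for illustration:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` has length T + 1 (it includes a bootstrap value for the
    state after the final step); `dones` marks episode terminations.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # mask out the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:T]  # value targets for the critic
    return advantages, returns
```

The backward recursion accumulates discounted TD residuals, trading bias for variance via the `lam` parameter (`GAE_LAMBDA` in the hyperparameters below).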
To monitor training progress:
```
uv run tensorboard --logdir runs
```

This will start a TensorBoard server where you can visualize:
- Episode returns
- Policy and value losses
- Entropy
- Learning rate
Note: If you encounter issues with TensorBoard, make sure you have activated the correct virtual environment and installed all dependencies.
The best model is automatically saved in the `data_best/` directory. You can load a pretrained model by setting `load_pretrained = True` in the training script.
The implementation includes:
- Parallel environment training for efficient sampling
- Generalized Advantage Estimation (GAE)
- Beta distribution-based policy network
- Orthogonal initialization of network weights
- Cosine annealing learning rate scheduler
- Normalized observations and rewards
- Gradient clipping
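To illustrate two of the features above (the Beta distribution-based policy and orthogonal initialization), here is a minimal sketch of a policy head; the class name, hidden size, and layer layout are illustrative assumptions, not the exact architecture in `ppo_ours.py`:

```python
import torch
import torch.nn as nn

class BetaPolicy(nn.Module):
    """Policy head that outputs a Beta distribution over actions in (0, 1);
    samples are rescaled to the environment's action bounds elsewhere."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        # Orthogonal initialization of all linear layers, as listed above
        for layer in self.modules():
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight)
                nn.init.zeros_(layer.bias)

    def forward(self, obs):
        h = self.body(obs)
        # softplus + 1 keeps both concentration parameters above 1,
        # which makes the Beta distribution unimodal
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        return torch.distributions.Beta(alpha, beta)
```

Because Beta samples are bounded in (0, 1), this avoids the action clipping that a Gaussian policy would need on a bounded action space.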
Key hyperparameters (defined in `ppo_ours.py`):

- `NUM_ENVS`: Number of parallel environments (default: 4)
- `SAMPLE_STEPS`: Steps sampled per iteration (default: 2048)
- `TOTAL_STEPS`: Total training steps (default: 4,000,000)
- `MINI_BATCH_SIZE`: Mini-batch size for training (default: 256)
- `EPOCHES`: Number of epochs per iteration (default: 10)
- `GAMMA`: Discount factor (default: 0.99)
- `GAE_LAMBDA`: GAE lambda parameter (default: 0.95)
- `CLIP_EPS`: PPO clipping epsilon (default: 0.2)
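`CLIP_EPS` enters PPO's clipped surrogate objective, which can be sketched as follows; this is a hedged illustration with a hypothetical function name, and the actual loss in `ppo_ours.py` may differ in details such as entropy bonuses:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the elementwise minimum gives a pessimistic (lower) bound,
    # which prevents excessively large policy updates
    return -torch.min(unclipped, clipped).mean()
```

When the new and old policies agree the ratio is 1 and the clip is inactive; once the ratio leaves `[1 − ε, 1 + ε]`, the gradient through the clipped term vanishes.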
To start training the TD3 agent:

```
uv run td3_ours.py --train
```

Key training parameters:

- `--env`: Environment name (default: `Humanoid-v5`)
- `--max_steps`: Maximum training steps (default: 1,000,000)
- `--batch_size`: Batch size (default: 256)
- `--learning_rate`: Learning rate (default: 3e-4)
- `--gamma`: Discount factor (default: 0.99)
- `--tau`: Target network update rate (default: 0.005)
To test the trained TD3 agent:

```
uv run td3_ours.py
```

The testing script will:
- Load the best saved model
- Run evaluation episodes continuously until you stop it manually
- Display performance metrics
- Show real-time visualization
- Actor Network: 256-256 hidden layers with ReLU activation
- Critic Networks: Two independent 256-256 networks
- Target networks updated via Polyak averaging (τ = 0.005)
- Policy update delay (d): 2
- Target policy smoothing noise (σ): 0.2
- Noise clip range (c): 0.5
- Initial random steps: 25,000
- Replay buffer size: 1,000,000
Training progress can be monitored using TensorBoard:
```
uv run tensorboard --logdir runs/TD3_training
```

The implementation follows the original TD3 paper with several optimizations:
- Efficient network architecture
- Comprehensive logging
- Robust model saving/loading
- Real-time visualization
- Fujimoto et al. (2018), "Addressing Function Approximation Error in Actor-Critic Methods" (the original TD3 paper): https://arxiv.org/abs/1802.09477
- Schulman et al. (2017), "Proximal Policy Optimization Algorithms" (the original PPO paper): https://arxiv.org/abs/1707.06347
- CleanRL implementation reference: https://github.com/vwxyzjn/cleanrl
This project is open source and available under the MIT License.



