If you'd like to get started right away, you can choose an example from the following registry. Each example will have its own README explaining how to run the code.
| Method | Model size | Suggested GPU requirement |
|---|---|---|
| SFT | 3B | ? |
| SFT | 70B | ≥ 8x H100 or 8x A100 |
| DPO | 3B | ? |
| DPO | 70B | ≥ 8x H100 or 8x A100 |
| GRPO | 3B | ? |
| GRPO | 70B | ≥ 8x H100 or 8x A100 |
- `pyproject.toml`: Defines project metadata, dependencies (core, dev, optional groups), and potentially tool configurations (like Ruff, Black, pytest). Managed by uv.
- `uv.lock`: The cross-platform lock file generated by `uv lock` to ensure reproducible environments. Should be committed to Git.
- `.python-version`: (Optional but recommended) Specifies the target Python version (e.g., 3.10 or 3.11) for consistency. Can be managed by `uv python pin`.
- `README.md`: This page. High-level overview, setup instructions, link to contribution guidelines.
- `LICENSE`: Apache 2.0 license file.
- `.gitignore`: Standard Python gitignore, including `.venv/`, `__pycache__/`, `*.pyc`, and potentially data/model cache directories if not managed elsewhere.
- `.pre-commit-config.yaml`: Configuration for pre-commit hooks.
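As a rough illustration of the layout described above, a minimal `pyproject.toml` for this repo might look like the following (the exact version pins and dev-group contents are assumptions, not the repo's actual configuration):

```toml
[project]
name = "metaflow-post-training"   # hypothetical; match your actual repo name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "outerbounds",
    "metaflow-torchrun",
]

# PEP 735 dependency groups, supported by uv
[dependency-groups]
dev = ["pytest", "ruff", "pre-commit"]
```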
```shell
cd metaflow-post-training
uv add outerbounds metaflow-torchrun
```

When Large Language Models (LLMs) are trained to predict the next token given the previous ones, it is called pre-training. After this classification task has been performed trillions of times, more steps are required to make a pre-trained LLM usable in products. Enter a set of learning regimes often referred to as post-training or alignment. These methods, and the workflows that implement them, are crucial for customizing and aligning language models with specific tasks and human preferences. For example, OpenAI's base GPT language models undergo additional post-training routines to become the models we interact with through ChatGPT and developer APIs. meta-llama/Meta-Llama-3-70B underwent rounds of SFT, rejection sampling, DPO, and PPO to become meta-llama/Meta-Llama-3-70B-Instruct, a model better suited for natural language conversation with humans. Of late, a wave of reasoning models has been front and center in the AI research community, spurred on by open-source releases like deepseek-ai/DeepSeek-R1 and highlighted by impressive closed-source models such as OpenAI's o-series, Claude's extended-thinking models, and Gemini 2.5.
Steering models towards desirable generations is an active area of research.
This repository outlines how to build LLM post-training workflows, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), among others. Each example shows how, using Metaflow, an open-source framework co-maintained by Netflix and Outerbounds, you can efficiently scale and productionize models with each method. Crucially, adopting a workflow platform to drive experimentation and deployment lets you treat these models less like special objects and more like normal models, to which we can apply data science practices such as rigorous model selection (different base models, different model sizes), hyperparameter tuning, and evaluation across these dimensions. Moreover, the examples in this repository all demonstrate how to connect the outputs of post-training workflows to inference environments relevant to ML engineers and data scientists, including notebooks, batch inference workflows, and real-time servers.
LLM post-training can be broadly divided into supervised learning approaches and preference-based (often reinforcement) learning approaches. Our repository will cover the spectrum:
- **Supervised Fine-Tuning (SFT)** – Fine-tuning the model on a dataset of prompts with desired outputs (human-written or high-quality synthetic responses). This is also known as instruction tuning when the data consists of instruction-response pairs.
- **Direct Preference Optimization (DPO)** – A newer method that skips the explicit RL step by directly optimizing the LLM on comparison data via a special loss function, effectively baking the reward model's job into the training loss.
- **Group Relative Policy Optimization (GRPO)** – A cutting-edge RL algorithm (a variant of PPO) that improves efficiency and stability by using grouped comparisons to estimate advantages, allowing the value model to be removed. It was introduced in early 2024 and became very popular as part of the DeepSeek-R1 pipeline, which excels at logical reasoning tasks related to math and coding.
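To make the preference-based objectives above concrete, here is a minimal, framework-free sketch of the DPO loss for one preference pair and the group-relative advantage computation at the heart of GRPO. This is an illustration of the math only, not the repository's training code; the function names and the `beta=0.1` default are assumptions for the example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    logp_* are summed log-probabilities of the chosen/rejected responses under
    the policy being trained; ref_logp_* are the same quantities under the
    frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already ranks the pair correctly.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's reward by
    the mean and std of its group, removing the need for a learned value model."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

In DPO, the loss shrinks as the policy's preference margin for the chosen response grows; in GRPO, completions scored above their group's mean get positive advantages and those below get negative ones, which is what replaces the PPO value model.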