If you'd like to get started right away, you can choose an example from the following registry. Each example will have its own README explaining how to run the code.
| Method | Model size | Suggested GPU requirement |
|---|---|---|
| SFT | 3B | ? |
| SFT | 70B | ≥ 8x H100 or 8x A100 |
| DPO | 3B | ? |
| DPO | 70B | ≥ 8x H100 or 8x A100 |
| GRPO | 3B | ? |
| GRPO | 70B | ≥ 8x H100 or 8x A100 |
- `pyproject.toml`: Defines project metadata, dependencies (core, dev, optional groups), and potentially tool configurations (like Ruff, Black, pytest). Managed by uv.
- `uv.lock`: The cross-platform lock file generated by `uv lock` to ensure reproducible environments. Should be committed to Git.
- `.python-version`: (Optional but recommended) Specifies the target Python version (e.g., 3.10 or 3.11) for consistency. Can be managed by `uv python pin`.
- `README.md`: This page. High-level overview, setup instructions, link to contribution guidelines.
- `LICENSE`: Apache 2.0 license file.
- `.gitignore`: Standard Python gitignore, including `.venv/`, `__pycache__/`, `*.pyc`, and potentially data/model cache directories if not managed elsewhere.
- `.pre-commit-config.yaml`: Configuration for pre-commit hooks.
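As a rough illustration of the layout described above, a minimal `pyproject.toml` for this repo might look like the following (the exact version pins and dev-group contents are assumptions, not the repo's actual configuration):

```toml
[project]
name = "metaflow-post-training"   # hypothetical; match your actual repo name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "outerbounds",
    "metaflow-torchrun",
]

# PEP 735 dependency groups, supported by uv
[dependency-groups]
dev = ["pytest", "ruff", "pre-commit"]
```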
```shell
cd metaflow-post-training
uv add outerbounds metaflow-torchrun
```

When Large Language Models (LLMs) are trained to predict the next token given the previous ones, it is called pre-training. After this classification task has been performed trillions of times, more steps are required to make a pre-trained LLM usable in products. Enter a set of learning regimes often referred to as post-training or alignment. These methods, and the workflows that implement them, are crucial for customizing and aligning language models with specific tasks and human preferences. For example, OpenAI's base GPT language models undergo additional post-training routines to become the models we interact with through ChatGPT and developer APIs. meta-llama/Meta-Llama-3-70B underwent rounds of SFT, rejection sampling, DPO, and PPO to become meta-llama/Meta-Llama-3-70B-Instruct, a model better suited for natural language conversation with humans. Of late, a wave of reasoning models has been front and center in the AI research community, spurred on by open-source releases like deepseek-ai/DeepSeek-R1 and highlighted by impressive closed-source models such as OpenAI's o-series, Claude's extended-thinking models, and Gemini 2.5.
Steering models towards desirable generations is an active area of research.
This repository outlines how to build LLM post-training workflows, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), among others. Each example shows how, using Metaflow, an open-source framework co-maintained by Netflix and Outerbounds, you can efficiently scale and productionize models with each method. Crucially, adopting a workflow platform to drive experimentation and deployment lets you treat these models less like special objects and more like normal models, to which we can apply data science practices such as rigorous model selection (different base models, different model sizes), hyperparameter tuning, and evaluation across these dimensions. Moreover, the examples in this repository all demonstrate how to connect the outputs of post-training workflows to inference environments relevant to ML engineers and data scientists, including notebooks, batch inference workflows, and real-time servers.
LLM post-training can be broadly divided into supervised learning approaches and preference-based (often reinforcement) learning approaches. Our repository will cover the spectrum:
- **Supervised Fine-Tuning (SFT)** – Fine-tuning the model on a dataset of prompts with desired outputs (human-written or high-quality synthetic responses). This is also known as instruction tuning when the data consists of instruction-response pairs.
- **Direct Preference Optimization (DPO)** – A newer method that skips the explicit RL step by directly optimizing the LLM on comparison data via a special loss function, effectively baking the reward model's job into the training loss.
- **Group Relative Policy Optimization (GRPO)** – A cutting-edge RL algorithm (a variant of PPO) that improves efficiency and stability by using grouped comparisons to estimate advantages, allowing the value model to be removed. It was introduced in early 2024 and became very popular as part of the DeepSeek-R1 pipeline, which excels at logical reasoning tasks related to math and coding.
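To make the preference-based objectives above concrete, here is a minimal, framework-free sketch of the DPO loss for one preference pair and the group-relative advantage computation at the heart of GRPO. This is an illustration of the math only, not the repository's training code; the function names and the `beta=0.1` default are assumptions for the example.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    logp_* are summed log-probabilities of the chosen/rejected responses under
    the policy being trained; ref_logp_* are the same quantities under the
    frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already ranks the pair correctly.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's reward by
    the mean and std of its group, removing the need for a learned value model."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

In DPO, the loss shrinks as the policy's preference margin for the chosen response grows; in GRPO, completions scored above their group's mean get positive advantages and those below get negative ones, which is what replaces the PPO value model.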