# Metaflow post-training resources

## Quick start

If you'd like to get started right away, choose an example from the registry below. Each example has its own README explaining how to run the code.

| Method | Model size | Suggested GPU requirement |
| ------ | ---------- | ------------------------- |
| SFT    | 3B         | ?                         |
| SFT    | 70B        | ≥ 8x H100 or 8x A100      |
| DPO    | 3B         | ?                         |
| DPO    | 70B        | ≥ 8x H100 or 8x A100      |
| GRPO   | 3B         | ?                         |
| GRPO   | 70B        | ≥ 8x H100 or 8x A100      |

## Detailed directory structure

- `pyproject.toml`: Defines project metadata, dependencies (core, dev, optional groups), and potentially tool configurations (like Ruff, Black, pytest). Managed by `uv`.
- `uv.lock`: The cross-platform lock file generated by `uv lock` to ensure reproducible environments. Should be committed to Git.
- `.python-version`: (Optional but recommended) Specifies the target Python version (e.g., 3.10 or 3.11) for consistency. Can be managed by `uv python pin`.
- `README.md`: This page. High-level overview, setup instructions, and a link to contribution guidelines.
- `LICENSE`: Apache 2.0 license file.
- `.gitignore`: Standard Python gitignore, including `.venv/`, `__pycache__/`, `*.pyc`, and potentially data/model cache directories if not managed elsewhere.
- `.pre-commit-config.yaml`: Configuration for pre-commit hooks.

## Detailed setup

Add the dependencies to the project:

```shell
cd metaflow-post-training
uv add outerbounds metaflow-torchrun
```

## Overview of methods

### Background

When Large Language Models (LLMs) are trained to predict the next token given the previous ones, it is called pre-training. After performing this classification task trillions of times, a pre-trained LLM still needs further steps to become usable in products. Enter a set of learning regimes often referred to as post-training or alignment. These methods, and the workflows that implement them, are crucial for customizing language models and aligning them with specific tasks and human preferences. For example, OpenAI's base GPT models undergo additional post-training routines to become the models we interact with through ChatGPT and the developer APIs. meta-llama/Meta-Llama-3-70B underwent rounds of SFT, rejection sampling, DPO, and PPO to become meta-llama/Meta-Llama-3-70B-Instruct, a model better suited for natural language conversation with humans. Of late, a wave of reasoning models has been front and center in the AI research community, spurred on by open-source releases like deepseek-ai/DeepSeek-R1 and highlighted by impressive closed-source models such as OpenAI's o-series, Anthropic's Claude models with extended thinking, and Gemini 2.5.

Steering models towards desirable generations is an active area of research.

This repository outlines how to build LLM post-training workflows, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), among others. Each example shows how, using Metaflow, an open-source framework co-maintained by Netflix and Outerbounds, you can efficiently scale and productionize models trained with each method. Crucially, adopting a workflow platform to drive experimentation and deployment lets you treat these models less like special objects and more like normal models, to which we can apply data science practices such as rigorous model selection (different base models, different model sizes), hyperparameter tuning, and evaluation across these dimensions. Moreover, the examples in this repository all demonstrate how to connect the outputs of post-training workflows to inference environments relevant to ML engineers and data scientists, including notebooks, batch inference workflows, and real-time servers.

LLM post-training can be broadly divided into supervised learning approaches and preference-based (often reinforcement learning) approaches. This repository covers the spectrum:

- Supervised Fine-Tuning (SFT) – Fine-tuning the model on a dataset of prompts with desired outputs (human-written or high-quality synthetic responses). This is also known as instruction tuning when the data consists of instruction-response pairs.

- Direct Preference Optimization (DPO) – A newer method that skips the explicit RL step by directly optimizing the LLM on comparison data via a special loss function, effectively baking the reward model's job into the training loss.

- Group Relative Policy Optimization (GRPO) – A cutting-edge RL algorithm (variant of PPO) that improves efficiency and stability by using grouped comparisons to estimate advantages, allowing the value model to be removed. This was introduced in early 2024 and became very popular as part of the DeepSeek-R1 pipeline that excels in logical reasoning tasks related to math and coding.
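To make the two preference-based objectives above concrete, here is a small, self-contained sketch of the per-pair DPO loss and the group-relative advantage computation at the heart of GRPO. It operates on pre-computed log-probabilities and rewards in plain Python; in a real pipeline these numbers come from forward passes of the policy and reference models and from a reward function:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-prob ratio on the chosen
    vs. rejected response against the frozen reference model's."""
    margin = (policy_chosen_logp - ref_chosen_logp) - \
             (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and std of its group, so no learned value model
    is needed to estimate a baseline."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# The policy prefers the chosen answer more than the reference does,
# so the margin is positive and the loss drops below log(2).
loss = dpo_loss(-1.0, -3.0, -1.5, -2.5, beta=0.1)

# Four sampled completions for one prompt, scored by a reward function.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

With a margin of zero (policy identical to the reference), the DPO loss is exactly log 2, and GRPO's normalized advantages always sum to zero within a group, which is what makes the group itself act as the baseline.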
