MagiCompiler

Break the Boundaries of Local Compilation for Large Models


MagiCompiler Overview

πŸ“’ 1. Latest News

  • [03/23/2026] πŸš€ MagiCompiler is officially open-sourced! Delivering whole-graph compilation for multi-modality inference and FSDP-aware whole-layer compilation for large model training.

πŸ“– 2. About

MagiCompiler is an advanced compiler and runtime augmentation framework built on top of torch.compile. Designed specifically for large-scale Transformer-like architectures, it addresses the critical bottlenecks of the memory wall and per-operator overhead.

By stepping beyond traditional local operator optimization, MagiCompiler introduces system-level optimizations, seamlessly accelerating both training and multi-modality inference workloads with minimal code intrusion.


πŸ’‘ 3. Design Philosophy

Compiler as Manager

"Reimagining the compiler: from generating kernels to orchestrating the entire dataflow."

MagiCompiler's core philosophy is Compiler as Manager. We believe a modern deep learning compiler should not be restricted to mere kernel fusion. Instead, it acts as a global manager that owns the full lifecycle of execution. MagiCompiler actively manages subgraph dispatching, dynamically orchestrates dataflow (like offloading and prefetching), and controls memory allocation, ensuring optimal balance between compute efficiency and memory footprint.

Key Features

🎯 1. Unified Inference & Training

Tailored for Transformer-like architectures with scenario-specific strategies:

  • Inference: Achieves full-graph capture across Transformer boundaries, maximizing kernel fusion scope.
  • Training: Introduces FSDP-aware layer-wise compilation. Unlocks aggressive cross-op fusion while keeping distributed parameter sharding entirely transparent.
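
To illustrate the layer-wise idea, here is a minimal sketch in plain PyTorch (not MagiCompiler's internals, and without the actual FSDP wrapping): compiling each Transformer block as its own graph keeps parameter-sharding hooks outside the compiled regions, which is roughly what the FSDP-aware mode automates.

```python
import torch
from torch import nn

# Illustrative only: layer-wise compilation in plain PyTorch. Compiling each
# block as its own graph lets sharding hooks fire *between* compiled regions
# instead of breaking one monolithic graph.
class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.net(x)

layers = nn.ModuleList(Block(64) for _ in range(4))
for i, layer in enumerate(layers):
    # backend="eager" captures the graph without kernel codegen, to keep this
    # sketch portable; real runs would use the default inductor backend.
    layers[i] = torch.compile(layer, backend="eager")

x = torch.randn(2, 16, 64)
for layer in layers:
    x = layer(x)
```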

⚑️ 2. Easy to Use, Free Gain, Plug and Play

No complex model refactoring needed. Just two decorators deliver 20%+ extra speedup out of the box, integrating seamlessly into SOTA multi-modality frameworks.

🧠 3. Smart Asynchronous Offloading

For memory-constrained setups, our built-in selective offloading policy overlaps host-to-device (H2D) weight transfers with computation, eliminating pipeline bubbles.
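
The overlap technique itself can be sketched with plain PyTorch CUDA streams (illustrative only, not MagiCompiler's engine): while layer i runs on the default stream, layer i+1's weights are copied H2D on a side stream, hiding the transfer behind compute. The `run_with_prefetch` helper below is a hypothetical name for this sketch.

```python
import torch
from torch import nn

def run_with_prefetch(layers_cpu, x):
    """Sketch of compute/H2D overlap via a side copy stream (illustrative)."""
    if not torch.cuda.is_available():   # CPU fallback: nothing to overlap
        for layer in layers_cpu:
            x = layer(x)
        return x
    copy_stream = torch.cuda.Stream()
    cur = layers_cpu[0].to("cuda", non_blocking=True)  # stage the first layer
    for i, _ in enumerate(layers_cpu):
        nxt = None
        if i + 1 < len(layers_cpu):
            # Prefetch the next layer's weights on the side stream while the
            # default stream computes. (A full engine would also offload
            # finished layers back to host, D2H.)
            with torch.cuda.stream(copy_stream):
                nxt = layers_cpu[i + 1].to("cuda", non_blocking=True)
        x = cur(x)                                      # compute on default stream
        torch.cuda.current_stream().wait_stream(copy_stream)
        cur = nxt
    return x

layers = [nn.Linear(64, 64) for _ in range(4)]
inp = torch.randn(8, 64)
if torch.cuda.is_available():
    inp = inp.cuda()
out = run_with_prefetch(layers, inp)
```

For truly asynchronous copies, the host-side weights would additionally need to live in pinned memory; the sketch omits that detail.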

♻️ 4. Heuristic Activation Recomputation

Say goodbye to manual torch.utils.checkpoint. MagiCompiler automatically saves compute-bound ops (e.g., MatMul, Attention) and recomputes memory-bound ones, slashing peak memory without sacrificing throughput.
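
What the heuristic automates can be mimicked by hand with torch.utils.checkpoint (a sketch of the technique, not MagiCompiler's internals): wrap the memory-bound region so its activations are freed after forward and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        # Activations inside the checkpointed region are dropped after forward
        # and recomputed in backward, trading extra FLOPs for lower peak memory.
        # MagiCompiler's heuristic makes this save-vs-recompute choice per op.
        return x + checkpoint(self.mlp, x, use_reentrant=False)

block = Block(32)
x = torch.randn(4, 32, requires_grad=True)
block(x).sum().backward()
```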

πŸ›  5. Better Interpretability

Toggle MAGI_ENABLE_FX_GRAPH_VIZ=1 and let our introspection toolchain do the rest. All intermediate artifacts, from FX graphs to generated kernels, are automatically dumped as human-readable files, making compiler debugging far more accessible.
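
For a taste of what such artifacts look like, plain torch.fx already exposes human-readable graph dumps (this uses only stock PyTorch, not the MagiCompiler toolchain):

```python
import torch
from torch import fx, nn

class Toy(nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

gm = fx.symbolic_trace(Toy())
print(gm.graph)  # node-by-node IR: placeholders, call_function ops, output
print(gm.code)   # the Python code torch.fx generated for this graph
```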


βš™οΈ 4. Installation

Requirements:

  • Python >= 3.12
  • PyTorch >= 2.9
  • CUDA Toolkit
# Step 1 β€” Clone the repo
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler

# Step 2 β€” System dependencies (optional, for FX graph visualization; Debian/Ubuntu)
sudo apt update && sudo apt install -y graphviz

# Step 3 β€” Python dependencies
pip install -r requirements.txt

# Step 4 β€” Install MagiCompiler (pick one)
pip install .   # End users (recommended)
# pip install -e . --no-build-isolation --config-settings editable_mode=compat  # Developer / editable

πŸš€ 5. Quick Start

🧹 1. One Decorator to Rule Them All (@magi_compile)

Remove scattered torch.compile or torch.compiler.disable calls. Decorate your core Transformer block once for automatic full-graph capture and dynamic-shape support (defaulting to the batch dimension, dim 0).

import torch
from torch import nn
from magi_compiler import magi_compile

# Decorate your core module once. No more scattered compile tweaks!
# (Attention and MLP are your own submodules, defined elsewhere.)
@magi_compile
class TransformerBlock(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = Attention(hidden_dim)
        self.mlp = MLP(hidden_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        x = x + self.attn(x, mask)
        x = x + self.mlp(x)
        return x

model = TransformerBlock(hidden_dim=1024).cuda()

# Execute normally - whole-graph compilation handles dynamic batches automatically!
out = model(torch.randn(4, 128, 1024, device="cuda"), None)
out = model(torch.randn(8, 128, 1024, device="cuda"), None)

πŸ› οΈ 2. Bridge Custom Kernels (@magi_register_custom_op)

Using custom kernels (FlashAttention, MoE routers) that break FX tracing? Don't disable compilation. Wrap them to teach the compiler how to handle them during graph partitioning and recomputation.

import torch

from magi_compiler import magi_register_custom_op

@magi_register_custom_op(
    name="athena::flash_attn",
    infer_output_meta_fn=["q"],       # Output shape matches parameter 'q'
    is_subgraph_boundary=True,        # Split graph here for subgraph compilation
    is_compute_sensitive=True,        # Retain this output during recomputation
)
def flash_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    ... # Your custom kernel or C++ extension

πŸ”§ 3. Advanced Configurations

Explore magi_compiler/config.py for power-user features like custom backend toggles and fine-grained memory management. (Comprehensive guides for popular training/inference frameworks are coming soon!)


πŸ“Š 6. Benchmark

πŸ”₯ H100 Extreme Acceleration

On a single NVIDIA H100, MagiCompiler outperforms current SOTA solutions (like LightX2V) by 9% to 26% across mainstream open-source video generation models.

H100 Inference Benchmark

πŸ’» RTX 5090 Near Real-Time

Thanks to our underlying JIT offloading engine, daVinci-MagiHuman achieves near real-time speeds, even on heavily VRAM-constrained consumer GPUs.

RTX 5090 Inference Latency


πŸ—Ί 7. Roadmap

We are actively developing MagiCompiler. Here is a glimpse into our upcoming milestones:

  • Ecosystem Integration: Benchmarks and out-of-the-box integration guides for popular frameworks (e.g., sglang-diffusion, vllm-omni, and LLaMA training).
  • Official Hub & Tech Blog: A dedicated website for advanced tutorials, documentation, and frontier engineering insights.
  • Hardware-Aware Auto-Scheduler: An adaptive engine that dynamically orchestrates optimal strategies (auto-recomputation boundaries, offloading) based on your hardware constraints.
  • Next-Gen Custom Backend (v2.0): Pushing hardware limits with extreme kernel-level efficiency, native distributed communication, and MegaKernels.

πŸ“ 8. Citation

If you find MagiCompiler useful in your research or production, please consider citing us:

@software{magi_compiler_2026,
  author = {Hongyu Jia and Zhiyao Cen and Taoran Wang and Yunbo Zhang},
  title = {MagiCompiler: Break the Boundaries of Local Compilation for Large Models},
  year = {2026},
  url = {https://github.com/SandAI-org/MagiCompiler}
}

πŸ™ 9. Acknowledgement

MagiCompiler is deeply inspired by and builds upon the shoulders of giants. We extend our heartfelt gratitude to the PyTorch team for their foundational work on torch.compile and torch.fx, and to the vLLM community for their pioneering contributions to large model inference.

We are moving fast, and we want you on board! MagiCompiler is under rapid development. If you are passionate about pushing the limits of large model compilation, we'd love to have you with us. From opening issues and discussing architectures to submitting core PRs, every contribution matters. Let's engineer the future of AI infrastructure together!


πŸ“œ 10. License

This project is licensed under the Apache License 2.0.
