Add zeco and lasp2 implementation for benchmarking#1

Open
petrpan26 wants to merge 22 commits into main from feat/zeco-and-lasp2

Conversation

@petrpan26
Owner

No description provided.

Hoang Phan added 22 commits November 4, 2025 04:38
This PR implements Blelloch parallel prefix scan to reduce inter-GPU
communication from O(P) sequential steps (ring) to O(log P) parallel
steps (tree-based).

Key improvements:
- O(log P) communication complexity (e.g., 128 GPUs: 128 steps → 14 steps)
- Work-efficient tree-based algorithm
- Supports non-power-of-2 GPU counts
- Reuses KV/DKV buffers to avoid allocation overhead

Implementation details:

1. **BlellochScanner** (lasp/utils/blelloch_ops.py):
   - Tree-based up-sweep and down-sweep communication
   - Correct sender/receiver logic using "right edge" of subtrees
   - Distance-based decay in down-sweep for proper accumulation
   - Support for reverse scan (suffix) for backward pass
   - Global rank conversion for multi-group data parallelism

2. **lasp_blelloch** (lasp/lasp_blelloch.py):
   - Combines Blelloch scan with fused Triton kernels
   - Correct inclusive-to-exclusive conversion: λ^(-C) * (inclusive - local)
   - Buffer reuse pattern matching lasp_fuse_parallel
   - Forward: prefix scan, Backward: suffix scan

3. **Tests and benchmarks**:
   - test_blelloch_correctness.py: Gradient correctness tests
   - test_non_power_of_two.py: Non-power-of-2 world sizes
   - benchmark_blelloch.py: Performance benchmarks
   - benchmark_all_methods.py: Comprehensive comparison

Tested with:
- Single GPU and multi-GPU (4-8 GPUs)
- Data parallelism (dp_size > 1) with sequence parallelism
- Power-of-2 and non-power-of-2 world sizes
- Forward and backward pass correctness
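The up-sweep/down-sweep structure described above can be illustrated with a single-process simulation (a minimal sketch, not the lasp/utils/blelloch_ops.py implementation: ranks are modeled as array slots, each `while` level corresponds to one communication round, and for brevity it assumes a power-of-2 size, unlike the PR's scanner):

```python
def blelloch_exclusive_scan(vals, identity=0, op=lambda a, b: a + b):
    """Work-efficient Blelloch exclusive scan, simulated in one process.
    Each while-loop level models one communication round, so P slots
    need 2*log2(P) rounds instead of the ring's P sequential steps."""
    x = list(vals)
    n = len(x)
    assert n & (n - 1) == 0, "sketch assumes a power-of-2 size"
    # Up-sweep: build partial reductions up the tree; the "right edge"
    # of each subtree accumulates the subtree's total.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d *= 2
    # Down-sweep: seed the root with the identity, then push prefixes
    # back down, swapping left-child values into place.
    x[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = x[i - d]
            x[i - d] = x[i]
            x[i] = op(left, x[i])
        d //= 2
    return x

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# → [0, 3, 4, 11, 11, 15, 16, 22]
```

With 128 slots this is 7 up-sweep rounds plus 7 down-sweep rounds, matching the "128 steps → 14 steps" figure above.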

Changed the Blelloch scan to compute the exclusive prefix directly
instead of converting from the inclusive result, avoiding division by
lambda^n, which overflows when lambda is small.

Implementation:
1. Compute inclusive prefix using standard up-sweep + down-sweep
2. Convert to exclusive via a simple rank shift: each rank i receives
   inclusive[i-1] from rank i-1, and rank 0 gets zero

This matches the pattern used in lasp_naive where the ring naturally
produces exclusive prefix, avoiding the numerical issues of computing
1/lambda^n which overflows to infinity when s >= 1.0.

Fixes NaN gradients in backward pass.
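The shift-based conversion can be sketched as follows (a hypothetical helper name; the real version communicates inclusive[i-1] between ranks rather than slicing a list):

```python
def exclusive_from_inclusive(inclusive, identity=0.0):
    """Convert an inclusive scan result to an exclusive one by a rank
    shift: slot i takes inclusive[i-1], and slot 0 takes the identity.
    No term is ever divided by lambda**n, so small lambda values can
    no longer overflow to infinity."""
    return [identity] + inclusive[:-1]

# inclusive prefix sums of [3, 1, 7] are [3, 4, 11]
print(exclusive_from_inclusive([3, 4, 11]))  # → [0.0, 3, 4]
```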

Root cause: In suffix scan (backward pass), the rank shift was sending
in the wrong direction. For suffix scan, rank i should receive from
rank i+1 (not i-1) and send to rank i-1 (not i+1).

The bug: Used scan_rank±1 for both prefix and suffix, which worked for
prefix but was backwards for suffix due to the scan_rank reversal.

The fix:
- Separate logic for prefix vs suffix scan in rank shift
- Prefix: rank i receives from i-1, sends to i+1 (left to right)
- Suffix: rank i receives from i+1, sends to i-1 (right to left)
- Use actual rank (not scan_rank) for the shift communication
- Add actual_to_global_rank() helper to avoid scan_rank confusion

This should fix the 10x larger backward gradient errors (dk: 0.209,
dv: 0.297) by ensuring the suffix scan produces correct exclusive
values for each rank.
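The corrected direction logic can be sketched as (a hypothetical helper; the actual code additionally maps actual ranks to global ranks via actual_to_global_rank() for multi-group data parallelism):

```python
def shift_neighbors(rank, world_size, reverse):
    """Return (src, dst) partners for the exclusive-conversion rank
    shift. Prefix scan (reverse=False): receive from rank-1, send to
    rank+1 (left to right). Suffix scan (reverse=True): receive from
    rank+1, send to rank-1 (right to left). None means no partner, so
    that boundary rank starts from the identity value."""
    if not reverse:  # prefix: data flows left to right
        src = rank - 1 if rank > 0 else None
        dst = rank + 1 if rank < world_size - 1 else None
    else:            # suffix: data flows right to left
        src = rank + 1 if rank < world_size - 1 else None
        dst = rank - 1 if rank > 0 else None
    return src, dst

print(shift_neighbors(2, 4, reverse=False))  # → (1, 3)
print(shift_neighbors(2, 4, reverse=True))   # → (3, 1)
```

Using the actual rank here, rather than the reversed scan_rank, is what removes the direction confusion described above.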

Root cause: With 32+ GPUs, the rank shift was hanging because blocking
send/recv created a sequential dependency chain. Each rank had to wait
for the previous rank to send before it could send to the next rank,
creating O(P) latency and potential deadlock.

The fix: Use dist.irecv() and dist.isend() (non-blocking) instead of
blocking send/recv. This allows all ranks to initiate their send/recv
operations simultaneously, then wait for completion.

Benefits:
- Prevents deadlock with large GPU counts (tested hang at 32 GPUs)
- Allows parallel execution of send/recv operations
- Maintains O(1) latency for the rank shift step

This preserves the O(log P) overall complexity of Blelloch scan.
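The non-blocking pattern can be sketched as follows (a hypothetical helper written against torch.distributed's irecv/isend signatures; `comm` would be `torch.distributed` in real use, and is injected here only so the sketch can be exercised without a process group):

```python
def nonblocking_shift(comm, send_buf, recv_buf, src, dst):
    """Post the receive, then the send, both non-blocking, and only
    then wait on both. `comm` must expose irecv/isend with the
    torch.distributed signatures: comm.irecv(buf, src=...) and
    comm.isend(buf, dst=...), each returning a request with .wait().
    Because nothing blocks at post time, all P ranks issue their
    transfers in the same round: O(1) latency for the shift instead
    of an O(P) blocking chain. None means no partner on that side."""
    reqs = []
    if src is not None:
        reqs.append(comm.irecv(recv_buf, src=src))
    if dst is not None:
        reqs.append(comm.isend(send_buf, dst=dst))
    for req in reqs:
        req.wait()
```

Posting the receive before the send and only then waiting means no rank's progress depends on a neighbor having already entered its own send, which is the dependency that hung the blocking version at 32 GPUs.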