Add zeco and lasp2 implementation for benchmarking#1

Open
petrpan26 wants to merge 22 commits into main from feat/zeco-and-lasp2

Conversation

@petrpan26
Owner

No description provided.

Hoang Phan added 22 commits November 4, 2025 04:38
This PR implements Blelloch parallel prefix scan to reduce inter-GPU
communication from O(P) sequential steps (ring) to O(log P) parallel
steps (tree-based).

Key improvements:
- O(log P) communication complexity (e.g., 128 GPUs: 128 steps → 14 steps)
- Work-efficient tree-based algorithm
- Supports non-power-of-2 GPU counts
- Reuses KV/DKV buffers to avoid allocation overhead

Implementation details:

1. **BlellochScanner** (lasp/utils/blelloch_ops.py):
   - Tree-based up-sweep and down-sweep communication
   - Correct sender/receiver logic using "right edge" of subtrees
   - Distance-based decay in down-sweep for proper accumulation
   - Support for reverse scan (suffix) for backward pass
   - Global rank conversion for multi-group data parallelism

2. **lasp_blelloch** (lasp/lasp_blelloch.py):
   - Combines Blelloch scan with fused Triton kernels
   - Correct inclusive-to-exclusive conversion: λ^(-C) * (inclusive - local)
   - Buffer reuse pattern matching lasp_fuse_parallel
   - Forward: prefix scan, Backward: suffix scan

3. **Tests and benchmarks**:
   - test_blelloch_correctness.py: Gradient correctness tests
   - test_non_power_of_two.py: Non-power-of-2 world sizes
   - benchmark_blelloch.py: Performance benchmarks
   - benchmark_all_methods.py: Comprehensive comparison

Tested with:
- Single GPU and multi-GPU (4-8 GPUs)
- Data parallelism (dp_size > 1) with sequence parallelism
- Power-of-2 and non-power-of-2 world sizes
- Forward and backward pass correctness
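The up-sweep/down-sweep structure described above can be illustrated with a single-process simulation (a minimal sketch, not the lasp/utils/blelloch_ops.py implementation: ranks are modeled as array slots, each `while` level corresponds to one communication round, and for brevity it assumes a power-of-2 size, unlike the PR's scanner):

```python
def blelloch_exclusive_scan(vals, identity=0, op=lambda a, b: a + b):
    """Work-efficient Blelloch exclusive scan, simulated in one process.
    Each while-loop level models one communication round, so P slots
    need 2*log2(P) rounds instead of the ring's P sequential steps."""
    x = list(vals)
    n = len(x)
    assert n & (n - 1) == 0, "sketch assumes a power-of-2 size"
    # Up-sweep: build partial reductions up the tree; the "right edge"
    # of each subtree accumulates the subtree's total.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d *= 2
    # Down-sweep: seed the root with the identity, then push prefixes
    # back down, swapping left-child values into place.
    x[n - 1] = identity
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = x[i - d]
            x[i - d] = x[i]
            x[i] = op(left, x[i])
        d //= 2
    return x

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# → [0, 3, 4, 11, 11, 15, 16, 22]
```

With 128 slots this is 7 up-sweep rounds plus 7 down-sweep rounds, matching the "128 steps → 14 steps" figure above.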

Changed the Blelloch scan to compute the exclusive prefix directly
instead of converting from the inclusive result, avoiding division by
lambda^n, which overflows when lambda is small.

Implementation:
1. Compute inclusive prefix using standard up-sweep + down-sweep
2. Convert to exclusive via a simple rank shift: each rank i receives
   inclusive[i-1] from rank i-1, and rank 0 gets zero

This matches the pattern used in lasp_naive where the ring naturally
produces exclusive prefix, avoiding the numerical issues of computing
1/lambda^n which overflows to infinity when s >= 1.0.

Fixes NaN gradients in backward pass.
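The shift-based conversion can be sketched as follows (a hypothetical helper name; the real version communicates inclusive[i-1] between ranks rather than slicing a list):

```python
def exclusive_from_inclusive(inclusive, identity=0.0):
    """Convert an inclusive scan result to an exclusive one by a rank
    shift: slot i takes inclusive[i-1], and slot 0 takes the identity.
    No term is ever divided by lambda**n, so small lambda values can
    no longer overflow to infinity."""
    return [identity] + inclusive[:-1]

# inclusive prefix sums of [3, 1, 7] are [3, 4, 11]
print(exclusive_from_inclusive([3, 4, 11]))  # → [0.0, 3, 4]
```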

Root cause: In suffix scan (backward pass), the rank shift was sending
in the wrong direction. For suffix scan, rank i should receive from
rank i+1 (not i-1) and send to rank i-1 (not i+1).

The bug: Used scan_rank±1 for both prefix and suffix, which worked for
prefix but was backwards for suffix due to the scan_rank reversal.

The fix:
- Separate logic for prefix vs suffix scan in rank shift
- Prefix: rank i receives from i-1, sends to i+1 (left to right)
- Suffix: rank i receives from i+1, sends to i-1 (right to left)
- Use actual rank (not scan_rank) for the shift communication
- Add actual_to_global_rank() helper to avoid scan_rank confusion

This should fix the 10x larger backward gradient errors (dk: 0.209,
dv: 0.297) by ensuring the suffix scan produces correct exclusive
values for each rank.
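The corrected direction logic can be sketched as (a hypothetical helper; the actual code additionally maps actual ranks to global ranks via actual_to_global_rank() for multi-group data parallelism):

```python
def shift_neighbors(rank, world_size, reverse):
    """Return (src, dst) partners for the exclusive-conversion rank
    shift. Prefix scan (reverse=False): receive from rank-1, send to
    rank+1 (left to right). Suffix scan (reverse=True): receive from
    rank+1, send to rank-1 (right to left). None means no partner, so
    that boundary rank starts from the identity value."""
    if not reverse:  # prefix: data flows left to right
        src = rank - 1 if rank > 0 else None
        dst = rank + 1 if rank < world_size - 1 else None
    else:            # suffix: data flows right to left
        src = rank + 1 if rank < world_size - 1 else None
        dst = rank - 1 if rank > 0 else None
    return src, dst

print(shift_neighbors(2, 4, reverse=False))  # → (1, 3)
print(shift_neighbors(2, 4, reverse=True))   # → (3, 1)
```

Using the actual rank here, rather than the reversed scan_rank, is what removes the direction confusion described above.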

Root cause: With 32+ GPUs, the rank shift was hanging because blocking
send/recv created a sequential dependency chain. Each rank had to wait
for the previous rank to send before it could send to the next rank,
creating O(P) latency and potential deadlock.

The fix: Use dist.irecv() and dist.isend() (non-blocking) instead of
blocking send/recv. This allows all ranks to initiate their send/recv
operations simultaneously, then wait for completion.

Benefits:
- Prevents deadlock with large GPU counts (tested hang at 32 GPUs)
- Allows parallel execution of send/recv operations
- Maintains O(1) latency for the rank shift step

This preserves the O(log P) overall complexity of Blelloch scan.
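The non-blocking pattern can be sketched as follows (a hypothetical helper written against torch.distributed's irecv/isend signatures; `comm` would be `torch.distributed` in real use, and is injected here only so the sketch can be exercised without a process group):

```python
def nonblocking_shift(comm, send_buf, recv_buf, src, dst):
    """Post the receive, then the send, both non-blocking, and only
    then wait on both. `comm` must expose irecv/isend with the
    torch.distributed signatures: comm.irecv(buf, src=...) and
    comm.isend(buf, dst=...), each returning a request with .wait().
    Because nothing blocks at post time, all P ranks issue their
    transfers in the same round: O(1) latency for the shift instead
    of an O(P) blocking chain. None means no partner on that side."""
    reqs = []
    if src is not None:
        reqs.append(comm.irecv(recv_buf, src=src))
    if dst is not None:
        reqs.append(comm.isend(send_buf, dst=dst))
    for req in reqs:
        req.wait()
```

Posting the receive before the send and only then waiting means no rank's progress depends on a neighbor having already entered its own send, which is the dependency that hung the blocking version at 32 GPUs.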