Review updated until commit d49ac87

Description
Relevant files:

- Enhancement
- Bug fix
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review
Function implementation correctness

The `inputsHaveContiguousInnerDim` implementation looks correct: it iterates from the innermost dimension outward, skips broadcast dimensions, and checks for contiguity. However, verify that the function handles two edge cases: (1) inputs with only broadcast dimensions (should return false), and (2) inputs with multiple consecutive broadcast dimensions followed by a contiguous dimension. The current logic with the `found_inner` flag appears to handle these correctly.
Greptile Summary

This PR correctly adds compile-time contiguity checks to prevent TMA assertion failures in five scheduler entry points (pointwise, reduction inner/outer, normalization-inner, and normalization-inner-outer). The new `inputsHaveContiguousInnerDim` check guards each of these paths.

Confidence Score: 5/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[computeHeuristics / preferWarpSpecialized] --> B{Hardware check\nSM >= 9 or 10?}
    B -- No --> Z[Disable TMA]
    B -- Yes --> C{inputsHaveContiguousInnerDim?}
    C -- No --> Z
    C -- Yes --> D{Scheduler-specific\nruntime checks\ne.g. vectorize_factor,\ndtype size, smem}
    D -- Fail --> Z
    D -- Pass --> E[Use TMA scheduler]
    subgraph inputsHaveContiguousInnerDim
        F[For each TensorView input] --> G{contiguity empty?\n0-D tensor}
        G -- Yes --> F
        G -- No --> H[Walk alloc domain\nfrom innermost]
        H --> I{dim is reduction\nor broadcast?}
        I -- Yes --> H
        I -- No --> J{contig known\nand true?}
        J -- Yes, found_inner=true --> F
        J -- No --> K[return false]
        H -->|exhausted all dims\nfound_inner=false| F
        F -->|all inputs checked| L[return true]
    end
```
Last reviewed commit: 7845bb1
Additional Comments (1)
Consider clarifying the comment to reflect its actual purpose. The same stale comment exists in …
Additional Comments (1)
Now that compile-time contiguity is verified up-front by the new check, update the comment to reflect what this guard actually checks.
liqiangxl
left a comment
LGTM. Thanks for the fix.
The err msg doesn't seem related to TMA load. I was expecting something like …
My mistake, that was the error from a different set of failing tests. For …
TMA requires contiguity to be set at compile time. The TMA schedules were not checking for this. Instead, they sometimes check `vectorization >= 2`, which is a runtime value. Symbolic (not marked as contiguous) tensors will fail to compile, even if they are contiguous. `ReductionWithIterDim` has many such cases, for example:

```
C++ exception with description " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/device_lower/analysis/tma.cpp:1186, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Expected std::get<1>(raw_tma_domain.back()) . The innermost dimension of the TMA domain must be contiguous
Exception raised from getTMAInfo at /opt/pytorch/nvfuser/csrc/device_lower/analysis/tma.cpp:1186 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xcc (0xae719770b68c in ./bin/test_nvfuser)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x54 (0xae719770b73c in ./bin/test_nvfuser)
```