Review updated until commit d49ac87

Description
Relevant files:

- Enhancement
- Bug fix
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review
Function implementation correctness

The `inputsHaveContiguousInnerDim` implementation looks correct: it iterates from the innermost dimension outward, skips broadcast dimensions, and checks for contiguity. However, verify that the function handles two edge cases: (1) inputs with only broadcast dimensions (should return false), and (2) inputs with multiple consecutive broadcast dimensions followed by a contiguous dimension. The current logic with the `found_inner` flag appears to handle these correctly.
Greptile Summary

This PR correctly adds compile-time contiguity checks to prevent TMA assertion failures in five scheduler entry points (pointwise, reduction inner/outer, normalization-inner, and normalization-inner-outer). The new `inputsHaveContiguousInnerDim` check guards each of these paths.

Confidence Score: 5/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[computeHeuristics / preferWarpSpecialized] --> B{Hardware check\nSM >= 9 or 10?}
    B -- No --> Z[Disable TMA]
    B -- Yes --> C{inputsHaveContiguousInnerDim?}
    C -- No --> Z
    C -- Yes --> D{Scheduler-specific\nruntime checks\ne.g. vectorize_factor,\ndtype size, smem}
    D -- Fail --> Z
    D -- Pass --> E[Use TMA scheduler]
    subgraph inputsHaveContiguousInnerDim
        F[For each TensorView input] --> G{contiguity empty?\n0-D tensor}
        G -- Yes --> F
        G -- No --> H[Walk alloc domain\nfrom innermost]
        H --> I{dim is reduction\nor broadcast?}
        I -- Yes --> H
        I -- No --> J{contig known\nand true?}
        J -- Yes, found_inner=true --> F
        J -- No --> K[return false]
        H -->|exhausted all dims\nfound_inner=false| F
        F -->|all inputs checked| L[return true]
    end
```
Last reviewed commit: 7845bb1
Additional Comments (1)
Consider clarifying the comment to reflect its actual purpose. The same stale comment exists in …
Additional Comments (1)
Now that compile-time contiguity is verified up-front by the new check, update the comment to reflect what this guard actually checks.
liqiangxl
left a comment
LGTM. Thanks for the fix.
The err msg doesn't seem related to TMA load. I was expecting something like …
My mistake, that was the error from a different set of failing tests. For …
TMA requires contiguity to be set at compile time. The TMA schedules were not checking for this. Instead, they sometimes check `vectorization >= 2`, which is a runtime value. Symbolic (not marked as contiguous) tensors will fail to compile, even if they are contiguous. `ReductionWithIterDim` has many such cases, for example:

```
C++ exception with description " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/device_lower/analysis/tma.cpp:1186, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Expected std::get<1>(raw_tma_domain.back()) . The innermost dimension of the TMA domain must be contiguous
Exception raised from getTMAInfo at /opt/pytorch/nvfuser/csrc/device_lower/analysis/tma.cpp:1186 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xcc (0xae719770b68c in ./bin/test_nvfuser)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, long, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x54 (0xae719770b73c in ./bin/test_nvfuser)
```