Skip to content

Add PTOAS support for TAXPY, THISTOGRAM, TGET_SCALE_ADDR, TROWARGMAX and TROWARGMIN#461

Open
HecreReed wants to merge 4 commits intohw-native-sys:mainfrom
HecreReed:codex/add-missing-other-ops
Open

Add PTOAS support for TAXPY, THISTOGRAM, TGET_SCALE_ADDR, TROWARGMAX and TROWARGMIN#461
HecreReed wants to merge 4 commits intohw-native-sys:mainfrom
HecreReed:codex/add-missing-other-ops

Conversation

@HecreReed
Copy link
Copy Markdown
Collaborator

Summary

  • add PTO IR ops and verifiers for pto.taxpy, pto.thistogram, pto.tget_scale_addr, pto.trowargmax, and pto.trowargmin
  • add EmitC lowering for the 5 new ops
  • update ptobc opcode mapping and add basic smoke/verifier coverage

Validation

  • ninja -C build ptoas
  • manual smoke with build/tools/ptoas/ptoas on the 7 new test/basic/*.pto cases
  • ctest --output-on-failure -R ptobc_opcode_coverage_check

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds several new operations to the PTO dialect, including THistogramOp, TGetScaleAddrOp, TAxpyOp, TRowArgMaxOp, and TRowArgMinOp, along with their associated TableGen definitions, IR verifiers, memory effects, and EmitC lowering patterns. Feedback focuses on improving the IR verifiers: specifically, the TAxpyOp verifier should account for bf16 to f32 widening on the A5 architecture, and both TRowArgMaxOp and TRowArgMinOp should be extended to support bf16 source elements on A5 for consistency with other operations.

Comment on lines +2637 to +2640
bool widenF16ToF32 = srcElem.isF16() && dstElem.isF32();
if (!(sameType || widenF16ToF32))
return emitOpError(
"expects dst/src element types to match, or dst=f32 and src=f16");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The verifyA5 lambda for TAxpyOp does not account for bf16 to f32 widening. Since bf16 is explicitly allowed for both src and dst in the subsequent checks (lines 2641-2644), it should also be allowed to widen to f32. The error message should also be updated to reflect this.

Suggested change
bool widenF16ToF32 = srcElem.isF16() && dstElem.isF32();
if (!(sameType || widenF16ToF32))
return emitOpError(
"expects dst/src element types to match, or dst=f32 and src=f16");
bool widenF16ToF32 = srcElem.isF16() && dstElem.isF32();
bool widenBF16ToF32 = srcElem.isBF16() && dstElem.isF32();
if (!(sameType || widenF16ToF32 || widenBF16ToF32))
return emitOpError(
"expects dst/src element types to match, or dst=f32 and src=f16/bf16");

Comment on lines +7286 to +7288
auto srcElem = getElemTy(srcTy).dyn_cast<mlir::FloatType>();
if (!srcElem || (!srcElem.isF16() && !srcElem.isF32()))
return emitOpError("expects src element type to be f16 or f32");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The TRowArgMaxOp verifier currently only allows f16 and f32 for the source element type. Given that TAxpyOp supports bf16 on A5, it is likely that TRowArgMaxOp should also support bf16 on A5. Consider updating the verifier to allow bf16 when the architecture is A5 by splitting the verifyByArch lambda or adding an architecture check.

Comment on lines +7360 to +7362
auto srcElem = getElemTy(srcTy).dyn_cast<mlir::FloatType>();
if (!srcElem || (!srcElem.isF16() && !srcElem.isF32()))
return emitOpError("expects src element type to be f16 or f32");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to TRowArgMaxOp, the TRowArgMinOp verifier should likely support bf16 source elements on A5 architecture.

@HecreReed
Copy link
Copy Markdown
Collaborator Author

/run a3

@reedhecre
Copy link
Copy Markdown

A3 板测失败

失败用例

  • rsqrt (run, exit=2)
  • rems (run, exit=2)
  • rem (run, exit=2)

@reedhecre
Copy link
Copy Markdown

A3 板测失败详情:PR #461

rsqrt

stage=run info=exit=2

[ERROR] Mismatch: golden_v2.bin vs v2.bin, max diff=1.618960976600647 at idx=417 (golden=1.742221474647522, out=3.361182451248169, dtype=float32)
[ERROR] compare failed
[2026-04-08 18:04:45] ERROR: testcase failed (exit 2): rsqrt
rems

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260408_175504_manual_pr461/npu_validation/Rems/rems/rems_kernel.cpp:93:3: error: no matching function for call to 'TREMS'
  TREMS(v13, v12, v4);
  ^~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260408_175504_manual_pr461/payload/pto-isa/include/pto/common/pto_instr.hpp:1446:22: note: candidate function template not viable: requires at least 4 arguments, but 3 were provided
PTO_INST RecordEvent TREMS(TileDataDst &dst, TileDataSrc &src, typename TileDataSrc::DType scalar, TileDataTmp &tmp,
                     ^
1 error generated.
gmake[2]: *** [CMakeFiles/rems_kernel.dir/build.make:76: CMakeFiles/rems_kernel.dir/rems_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/rems_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-04-08 18:07:31] ERROR: testcase failed (exit 2): rems
rem

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260408_175504_manual_pr461/npu_validation/Rem/rem/rem_kernel.cpp:99:3: error: no matching function for call to 'TREM'
  TREM(v18, v16, v17);
  ^~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260408_175504_manual_pr461/payload/pto-isa/include/pto/common/pto_instr.hpp:1679:22: note: candidate function template not viable: requires at least 4 arguments, but 3 were provided
PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp,
                     ^
1 error generated.
gmake[2]: *** [CMakeFiles/rem_kernel.dir/build.make:76: CMakeFiles/rem_kernel.dir/rem_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/rem_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-04-08 18:17:02] ERROR: testcase failed (exit 2): rem

@HecreReed HecreReed marked this pull request as ready for review April 9, 2026 01:02
@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db7b6d8d1a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +9147 to +9149
PTO_ADD_READ(getSrcMutable());
PTO_ADD_READ(getScalarMutable());
PTO_ADD_WRITE(getDstMutable());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark TAXPY destination as read in memory effects

pto.taxpy is defined as an in-place accumulate (dst += src * scalar), so it depends on the previous contents of dst. In TAxpyOp::getEffects the destination is marked only as Write, which can let transformations that consume MemoryEffectsOpInterface treat earlier updates to the same tile as dead/overwritable. This can change results when multiple taxpy operations target the same destination.

Useful? React with 👍 / 👎.

Comment on lines +9117 to +9119
PTO_ADD_READ(getSrcMutable());
PTO_ADD_READ(getIdxMutable());
PTO_ADD_WRITE(getDstMutable());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark THISTOGRAM destination as read in memory effects

pto.thistogram performs histogram accumulation into dst, so it semantically reads existing bin values before writing updated ones. THistogramOp::getEffects currently models dst as write-only, which can permit incorrect scheduling or dead-store style rewrites for sequences that repeatedly update the same histogram tile.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants