fix(a5): normalize risky vec col-major TMOV to row-major via treshape#440

Open
TaoTao-real wants to merge 13 commits into hw-native-sys:main from TaoTao-real:codex/tmov-colmajor-align-repro-clean

Conversation

@TaoTao-real
Contributor

@TaoTao-real TaoTao-real commented Apr 3, 2026

Summary

  • Add PTOA5NormalizeTMovPass to normalize risky A5 pto.tmov patterns:
    • Match: vec -> vec and both src/dst tiles are col_major + none_box.
    • Rewrite: treshape(row_major src) + treshape(row_major dst) + tmov(row_major -> row_major).
  • Wire the pass into the ptoas pipeline before PTOViewToMemref.
  • Keep/expand regression guards in test/samples/runop.sh for:
    • test_tmov_col_major_16x1_align_a5
    • test_tmov_row_major_1x16_control_a5
    • decode_projection_incore_0
    • rmsnorm_incore_0
  • Add the decode_projection_incore_0 sample to test/samples/Sync for A5 regression coverage.
  • Update test/npu_validation/scripts/generate_testcase.py to generate board-friendly params for decode_projection_incore_0 / rmsnorm_incore_0:
    • bf16 pointer buffers sized to full [16, hidden] windows
    • decode gamma pointer sized to 8192
    • scalar row offset (arg3) forced to 0 in single-block validation

Motivation

On A5, a vec -> vec tmov with col_major tiles can enter an alignment-sensitive backend path and trigger UB (Unified Buffer) address alignment exceptions in real kernels (observed in rmsnorm_incore_0 and decode_projection_incore_0).

The pass avoids this unsafe lowering path by reinterpreting both tiles as row-major via treshape before the move. Tile alias semantics are preserved, and treshape introduces no real data movement.
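
As a rough illustration of why a reinterpret can stand in for a copy here, the sketch below compares linear element offsets for a col_major [16, 1] tile and a row_major [1, 16] tile. It assumes dense, unpadded layouts and that the row-major view simply swaps the two dimensions; neither assumption is stated in this PR, and `linear_offset` is a hypothetical helper, not part of ptoas.

```python
def linear_offset(row, col, rows, cols, layout):
    """Element offset in a dense 2D tile (hypothetical helper, not a ptoas API)."""
    if layout == "row_major":
        return row * cols + col
    if layout == "col_major":
        return col * rows + row
    raise ValueError(f"unknown layout: {layout}")


# Offsets of a 16x1 col_major tile, walked down its single column.
col_major_offsets = [linear_offset(i, 0, 16, 1, "col_major") for i in range(16)]

# Offsets of a 1x16 row_major tile, walked along its single row.
row_major_offsets = [linear_offset(0, j, 1, 16, "row_major") for j in range(16)]

# If both views visit the same offsets in the same order, re-viewing the
# buffer changes only the tile metadata, not where any element lives.
same_addresses = col_major_offsets == row_major_offsets
print(same_addresses)
```

Under these assumptions both views touch offsets 0 through 15 in the same order, which matches the PR's point that the rewrite only changes how the buffer is described.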

Design Notes

  • A5-only behavior (pto.target_arch == a5).
  • Fail-fast safety checks:
    • only tiles with static 2D shapes and valid-shapes are rewritten;
    • if a risky tmov remains after the rewrite, the pass emits an error and fails.
  • Existing op attributes on tmov are preserved on the rewritten op.
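
The match and fail-fast rules above can be sketched as a small decision function. This is only a model of the behavior described in this PR, not the pass itself (the real pass operates on pto IR): `TileInfo`, `is_risky_tile`, and `classify_tmov` are hypothetical names, the string encodings of layout and box are assumptions, and valid-shapes are not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TileInfo:
    """Hypothetical stand-in for the tile properties the pass inspects."""
    space: str                         # address space, e.g. "vec"
    layout: str                        # "col_major" or "row_major"
    box: str                           # boxing mode, e.g. "none_box"
    shape: Optional[Tuple[int, ...]]   # None models a non-static shape


def is_risky_tile(tile: TileInfo) -> bool:
    # Risky pattern from the PR description: vec tile, col_major + none_box.
    return (tile.space, tile.layout, tile.box) == ("vec", "col_major", "none_box")


def classify_tmov(target_arch: str, src: TileInfo, dst: TileInfo) -> str:
    """Return "skip", "rewrite", or "error" for one tmov (illustrative only)."""
    if target_arch != "a5":
        return "skip"      # A5-only behavior
    if not (is_risky_tile(src) and is_risky_tile(dst)):
        return "skip"      # both src and dst must match the risky pattern
    static_2d = all(t.shape is not None and len(t.shape) == 2 for t in (src, dst))
    # A risky tmov that cannot be rewritten would remain in the IR,
    # which the PR says makes the pass fail.
    return "rewrite" if static_2d else "error"


col = TileInfo("vec", "col_major", "none_box", (16, 1))
row = TileInfo("vec", "row_major", "none_box", (1, 16))
dyn = TileInfo("vec", "col_major", "none_box", None)

print(classify_tmov("a5", col, col))
print(classify_tmov("a5", row, row))
print(classify_tmov("a5", dyn, col))
```

Modeling the outcome as three states keeps the fail-fast case explicit: a risky move is either rewritten or reported, never silently left in place.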

Validation

  • Build: ninja -C build ptoas
  • A5 compile checks (--pto-arch=a5 --pto-level=level3 --enable-insert-sync):
    • test_tmov_col_major_16x1_align_a5: TRESHAPE present
    • test_tmov_row_major_1x16_control_a5: no TRESHAPE
    • decode_projection_incore_0 / rmsnorm_incore_0: TRESHAPE present
  • runop.sh targeted guard run for the 4 samples: pass
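
The compile checks above amount to a presence test on the emitted C++. A minimal sketch of that idea follows, assuming the guard is a plain substring search for TRESHAPE; the actual logic lives in test/samples/runop.sh and may differ. `has_treshape`, `check_sample`, and the placeholder strings are hypothetical and are not real ptoas output.

```python
# Expected TRESHAPE presence per sample, taken from the Validation list above.
EXPECT_TRESHAPE = {
    "test_tmov_col_major_16x1_align_a5": True,
    "test_tmov_row_major_1x16_control_a5": False,
    "decode_projection_incore_0": True,
    "rmsnorm_incore_0": True,
}


def has_treshape(emitted_cpp: str) -> bool:
    """True if the emitted C++ text mentions TRESHAPE (assumed substring check)."""
    return "TRESHAPE" in emitted_cpp


def check_sample(name: str, emitted_cpp: str) -> bool:
    """True if TRESHAPE presence matches the expectation for this sample."""
    return has_treshape(emitted_cpp) == EXPECT_TRESHAPE[name]


# Placeholder text standing in for generated code, for illustration only.
rewritten = "/* placeholder */ TRESHAPE ... TMOV ..."
untouched = "/* placeholder */ TMOV ..."

print(check_sample("test_tmov_col_major_16x1_align_a5", rewritten))
print(check_sample("test_tmov_row_major_1x16_control_a5", untouched))
```

The row-major control case matters as much as the positive cases: it guards against the pass rewriting moves that were never risky.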

Risk / Rollback

  • Risk is scoped to A5 and a narrow tmov pattern.
  • Rollback is straightforward: revert this PR, or disable or remove PTOA5NormalizeTMovPass in the pipeline.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new test samples and regression guards for TMOV alignment on the A5 architecture, specifically covering 16x1 column-major and 1x16 row-major tile configurations. The changes include new .pto and .py test files and updates to the runop.sh test runner to validate the emitted C++ code. A critical issue was identified in the test runner where an undefined variable target_arch_lc would cause the new tests to be incorrectly skipped; a suggestion was provided to use a consistent inline transformation for the architecture check.

fi
if [[ ( "$base" == "test_tmov_col_major_16x1_align_a5" || \
"$base" == "test_tmov_row_major_1x16_control_a5" ) && \
"${target_arch_lc}" != "a5" ]]; then

high

The variable target_arch_lc is not defined in this script. This will cause the condition to always evaluate to true (as an undefined variable expands to an empty string), resulting in these tests being skipped even when the target architecture is correctly set to a5. You should use the inline transformation consistent with the rest of the file to check the architecture.

Suggested change
- "${target_arch_lc}" != "a5" ]]; then
+ "$(printf '%s' "$target_arch" | tr '[:upper:]' '[:lower:]')" != "a5" ]]; then

@TaoTao-real
Contributor Author

/run a5 test_tmov_col_major_16x1_align_a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 77007c944e9b
  • Result summary: OK 0 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260403_165306_manual_pr440.log
  • Manual command: /run a5 test_tmov_col_major_16x1_align_a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_col_major_16x1_align_a5,test_tmov_row_major_1x16_control_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: sample-build-and-test / exit=1

Log tail

es/validation_runtime.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_reuse_sequential.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_peak_exact_capacity.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_peak_8_overlapping.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_no_reuse_overlap.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_nested_loops.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_loop_no_reuse_outer_live.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_loop_in_if.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_in_loop.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_two_holes.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_hole_fit.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_for_iter_args_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_bind_tile_alias_liveness.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_golden.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_compare.py [not in RUN_ONLY_CASES], test/samples/Xors/xors.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_golden.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_compare.py [not in RUN_ONLY_CASES], test/samples/Xor/xor.py [not in RUN_ONLY_CASES], ... (+413 more)

===== STAGE sample-build-and-test @ 2026-04-03 16:54:44 =====
bash test/samples/runop.sh --enablebc all
PTOAS_OUT_DIR=/tmp/ptoas-board-monitor-a5/runs/20260403_165306_manual_pr440/payload/test/samples
test/samples/runop.sh: line 274: target_arch_lc: unbound variable
===== END STAGE sample-build-and-test rc=1 @ 2026-04-03 16:54:46 =====

@TaoTao-real
Contributor Author

/run a5 test_tmov_col_major_16x1_align_a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 7d02cb803083
  • Result summary: OK 0 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260403_171805_manual_pr440.log
  • Manual command: /run a5 test_tmov_col_major_16x1_align_a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_col_major_16x1_align_a5,test_tmov_row_major_1x16_control_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: sample-build-and-test / exit=1

Log tail

planmemory/plan_memory_loop_no_reuse_outer_live.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_loop_in_if.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_in_loop.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_two_holes.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_hole_fit.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_for_iter_args_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_bind_tile_alias_liveness.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_golden.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_compare.py [not in RUN_ONLY_CASES], test/samples/Xors/xors.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_golden.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_compare.py [not in RUN_ONLY_CASES], test/samples/Xor/xor.py [not in RUN_ONLY_CASES], ... (+413 more)

===== STAGE sample-build-and-test @ 2026-04-03 17:19:52 =====
bash test/samples/runop.sh --enablebc all
PTOAS_OUT_DIR=/tmp/ptoas-board-monitor-a5/runs/20260403_171805_manual_pr440/payload/test/samples
========== SUMMARY ==========
Sync(test_tmov_col_major_16x1_align_a5.pto) FAIL ptoas failed: test_tmov_col_major_16x1_align_a5.pto
Sync(test_tmov_col_major_16x1_align_a5.py) FAIL ptoas failed: test_tmov_col_major_16x1_align_a5-pto-ir.pto
Sync(test_tmov_row_major_1x16_control_a5.pto) FAIL ptoas failed: test_tmov_row_major_1x16_control_a5.pto
Sync(test_tmov_row_major_1x16_control_a5.py) FAIL ptoas failed: test_tmov_row_major_1x16_control_a5-pto-ir.pto
-----------------------------
OK=0  FAIL=4  SKIP=0
=============================
===== END STAGE sample-build-and-test rc=1 @ 2026-04-03 17:19:55 =====

@TaoTao-real
Contributor Author

/run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: a1f5bfbb13d4
  • Result summary: OK 0 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_104405_manual_pr440.log
  • Manual command: /run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_col_major_16x1_align_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: sample-build-and-test / exit=1

Log tail

ak_8_overlapping.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_no_reuse_overlap.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_nested_loops.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_loop_no_reuse_outer_live.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_loop_in_if.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_if_in_loop.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_two_holes.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_fragmentation_hole_fit.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_for_iter_args_yield.py [not in RUN_ONLY_CASES], test/samples/planmemory/plan_memory_bind_tile_alias_liveness.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_golden.py [not in RUN_ONLY_CASES], test/samples/Xors/xors_compare.py [not in RUN_ONLY_CASES], test/samples/Xors/xors.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_golden.py [not in RUN_ONLY_CASES], test/samples/Xor/xor_compare.py [not in RUN_ONLY_CASES], test/samples/Xor/xor.py [not in RUN_ONLY_CASES], ... (+429 more)

===== STAGE sample-build-and-test @ 2026-04-07 10:45:49 =====
bash test/samples/runop.sh --enablebc all
PTOAS_OUT_DIR=/tmp/ptoas-board-monitor-a5/runs/20260407_104405_manual_pr440/payload/test/samples
========== SUMMARY ==========
Sync(test_tmov_col_major_16x1_align_a5.pto) FAIL ptoas failed: test_tmov_col_major_16x1_align_a5.pto
Sync(test_tmov_col_major_16x1_align_a5.py) FAIL ptoas failed: test_tmov_col_major_16x1_align_a5-pto-ir.pto
-----------------------------
OK=0  FAIL=2  SKIP=0
=============================
===== END STAGE sample-build-and-test rc=1 @ 2026-04-07 10:45:51 =====

@TaoTao-real
Contributor Author

/run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3

@reedhecre

A5 board test passed

  • Trigger mode: manual
  • Source commit: b11f4c417509
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_111211_manual_pr440.log
  • Results TSV: /root/ptoas-board-monitor-a5/logs/20260407_111211_manual_pr440.tsv
  • Manual command: /run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_col_major_16x1_align_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)

@TaoTao-real
Contributor Author

/run a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3

@reedhecre

A5 board test passed

  • Trigger mode: manual
  • Source commit: e95697640388
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_112405_manual_pr440.log
  • Results TSV: /root/ptoas-board-monitor-a5/logs/20260407_112405_manual_pr440.tsv
  • Manual command: /run a5 test_tmov_row_major_1x16_control_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_row_major_1x16_control_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 089e31d1bee2
  • Result summary: OK 0 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_120611_manual_pr440.log
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • rmsnorm_incore_0 (run, exit=1)

@reedhecre

A5 board test failure details: PR #440

rmsnorm_incore_0

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260407_120611_manual_pr440/npu_validation/Sync/rmsnorm_incore_0/main.cpp:100)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3016698] 2026-04-07-12:08:03.540.730 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 133, there is an aivec error exception, core id is 0, error code = 340, dump info: pc start: 0x100040800000, current: 0x10004080019c, sc error info: 0xffffffffffff, su error info: 0xe7ffd23d1fdc0017,0x4240141410009bfd, mte error info: 0x2005d, vec error info: 0x408008bc0031007a, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(340) errorStr: The instruction access UB address is not aligned. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, fault kernel info ext=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, program id=0, hash=14225444651633779129.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-04-07 12:08:39] ERROR: testcase failed (exit 1): rmsnorm_incore_0
[2026-04-07 12:08:39] === SUMMARY ===
[2026-04-07 12:08:39] OK=0 FAIL=1 SKIP=0
[2026-04-07 12:08:39] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260407_120611_manual_pr440/remote_npu_validation_results.tsv

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 7b9bd705e92e
  • Result summary: OK 0 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_121411_manual_pr440.log
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • rmsnorm_incore_0 (run, exit=1)

@reedhecre

A5 board test failure details: PR #440

rmsnorm_incore_0

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260407_121411_manual_pr440/npu_validation/Sync/rmsnorm_incore_0/main.cpp:100)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3020870] 2026-04-07-12:16:04.565.902 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 134, there is an aivec error exception, core id is 0, error code = 340, dump info: pc start: 0x100040800000, current: 0x10004080019c, sc error info: 0xffffffffffff, su error info: 0xe7ffd23d1fdc0017,0x4240141410009bfd, mte error info: 0x2005d, vec error info: 0x408008bc0031007a, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(340) errorStr: The instruction access UB address is not aligned. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, fault kernel info ext=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, program id=0, hash=14225444651633779129.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-04-07 12:16:41] ERROR: testcase failed (exit 1): rmsnorm_incore_0
[2026-04-07 12:16:41] === SUMMARY ===
[2026-04-07 12:16:41] OK=0 FAIL=1 SKIP=0
[2026-04-07 12:16:41] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260407_121411_manual_pr440/remote_npu_validation_results.tsv

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 1e007a2aca08
  • Result summary: OK 0 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_125605_manual_pr440.log
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • rmsnorm_incore_0 (run, exit=1)

@reedhecre

A5 board test failure details: PR #440

rmsnorm_incore_0

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260407_125605_manual_pr440/npu_validation/Sync/rmsnorm_incore_0/main.cpp:100)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3028425] 2026-04-07-12:57:57.458.821 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 135, there is an aivec error exception, core id is 0, error code = 340, dump info: pc start: 0x100040800000, current: 0x1000408002a4, sc error info: 0xffffffffffff, su error info: 0xe7ffd23d1fdc0017,0x4240141410009bfd, mte error info: 0x2005d, vec error info: 0x40800c10003100a4, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(340) errorStr: The instruction access UB address is not aligned. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, fault kernel info ext=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, program id=0, hash=2490582683396417117.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-04-07 12:58:29] ERROR: testcase failed (exit 1): rmsnorm_incore_0
[2026-04-07 12:58:29] === SUMMARY ===
[2026-04-07 12:58:29] OK=0 FAIL=1 SKIP=0
[2026-04-07 12:58:29] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260407_125605_manual_pr440/remote_npu_validation_results.tsv

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test passed

  • Trigger mode: manual
  • Source commit: 267e9789303e
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_130705_manual_pr440.log
  • Results TSV: /root/ptoas-board-monitor-a5/logs/20260407_130705_manual_pr440.tsv
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test passed

  • Trigger mode: manual
  • Source commit: 3b4a6c98f181
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_132807_manual_pr440.log
  • Results TSV: /root/ptoas-board-monitor-a5/logs/20260407_132807_manual_pr440.tsv
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test passed

  • Trigger mode: manual
  • Source commit: eb3f57e424dc
  • Result summary: OK 1 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_134105_manual_pr440.log
  • Results TSV: /root/ptoas-board-monitor-a5/logs/20260407_134105_manual_pr440.tsv
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)

@TaoTao-real
Contributor Author

/run a5 rmsnorm_incore_0 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: f3ef4b53f98c
  • Result summary: OK 0 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260407_164205_manual_pr440.log
  • Manual command: /run a5 rmsnorm_incore_0 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: rmsnorm_incore_0
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • rmsnorm_incore_0 (run, exit=1)

@reedhecre

A5 board test failure details: PR #440

rmsnorm_incore_0

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260407_164205_manual_pr440/npu_validation/Sync/rmsnorm_incore_0/main.cpp:100)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3092131] 2026-04-07-16:43:58.604.194 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 142, there is an aivec error exception, core id is 0, error code = 340, dump info: pc start: 0x100040800000, current: 0x1000408002a4, sc error info: 0xffffffffffff, su error info: 0xe7ffd23d1fdc0017,0x4240141410009bfd, mte error info: 0x2005d, vec error info: 0x40800c10003100a4, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(340) errorStr: The instruction access UB address is not aligned. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, fault kernel info ext=_Z16rmsnorm_incore_0Pu6__bf16PfS_i, program id=0, hash=2490582683396417117.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-04-07 16:44:37] ERROR: testcase failed (exit 1): rmsnorm_incore_0
[2026-04-07 16:44:37] === SUMMARY ===
[2026-04-07 16:44:37] OK=0 FAIL=1 SKIP=0
[2026-04-07 16:44:37] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260407_164205_manual_pr440/remote_npu_validation_results.tsv

@TaoTao-real
Contributor Author

/run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 7091804d65d1
  • Result summary: OK 0 / FAIL 1 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260408_093909_manual_pr440.log
  • Manual command: /run a5 test_tmov_col_major_16x1_align_a5 --pto-level=level3
  • Triggered by: TaoTao-real
  • Specified cases: test_tmov_col_major_16x1_align_a5
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: fix(a5): normalize risky vec col-major TMOV to row-major via treshape #440 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • test_tmov_col_major_16x1_align_a5 (run, exit=1)

@reedhecre

A5 board test failure details: PR #440

test_tmov_col_major_16x1_align_a5

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260408_093909_manual_pr440/npu_validation/Sync/test_tmov_col_major_16x1_align_a5/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 3225549] 2026-04-08-09:41:05.733.453 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 143, there is an aivec error exception, core id is 0, error code = 340, dump info: pc start: 0x100040800000, current: 0x1000408000f0, sc error info: 0xffffffffffff, su error info: 0xe7ffd23d1fdc0017,0x4240141410009bfd, mte error info: 0x2005d, vec error info: 0x4080021000310036, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(340) errorStr: The instruction access UB address is not aligned. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z33test_tmov_col_major_16x1_align_a5PfS_, fault kernel info ext=_Z33test_tmov_col_major_16x1_align_a5PfS_, program id=0, hash=17355159616808799143.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-04-08 09:41:43] ERROR: testcase failed (exit 1): test_tmov_col_major_16x1_align_a5
[2026-04-08 09:41:43] === SUMMARY ===
[2026-04-08 09:41:43] OK=0 FAIL=1 SKIP=0
[2026-04-08 09:41:43] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260408_093909_manual_pr440/remote_npu_validation_results.tsv

@TaoTao-real TaoTao-real changed the title from "test(sync): add minimal A5 TMOV col-major alignment repro" to "fix(a5): normalize risky vec col-major TMOV to row-major via treshape" on Apr 9, 2026