[Ilu backend] Fix the implement of normalizations and add other ops.#199
madengfei wants to merge 37 commits into XPU-Forces:master
Conversation
madengfei commented on Apr 1, 2026
- Fix the implementation of normalizations; accuracy tests pass.
- Add implementations of RoPE, KVCache, PagedPrefillAttention, PagedDecodeAttention, SwiGLU, GELU, SiLU, etc.
Merge in SWAPP/mojo_opset from adapt_ilu_operators to master * commit '49ea319f35bd27437f5f374c5ef092b80a84098c': [NO-JIRA] Add ilu backend triton ops implements.
Could you add a perf test along the same lines? Future performance optimization can then run perf directly.

OK, I've already started on it.
I have read through all the changes in this PR carefully. Below is my detailed review.

PR #199 Review: [Ilu backend] Fix the implement of normalizations and add other ops

Overview: This PR adds a large number of Triton kernel implementations for the ILU (Iluvatar) backend (+4682/-416, 43 files), covering RoPE, KVCache, PagedAttention, SwiGLU, GELU, SiLU, GroupGemm, Sampling, SDPA, and more, and fixes the existing normalization implementations.

I. Architecture design issues (High Priority)

1. Platform-specific code is injected into the core layer in multiple places, breaking the architectural layering. These should go through the backend dispatch mechanism instead.

Suggestion: create corresponding TTX operator subclasses for these operators (some already exist).
建议: 为这些算子创建对应的 TTX operator 子类(部分已有如 2. try:
from mojo_opset.backends.ttx.kernels import linear_fwd_impl
return linear_fwd_impl(input, self.weight, self.bias)
except (NotImplementedError, TypeError):
return F.linear(input, self.weight, self.bias)
II. Code duplication (Medium Priority)

3. Many utility functions/constants are defined repeatedly across multiple files.

Suggestion: extract them into a shared module.

4.
III. Correctness and robustness (Medium-High Priority)

5. The prefill attention kernel uses a two-pass scan, which carries accuracy and performance risks.
Suggestion: optimize later by referring to the Triton implementations of FlashAttention or FlashDecoding (adding a perf test is also mentioned in the PR comments).

6. KV gather, GQA expand, and block concatenation are all PyTorch + Python level loops, which will be slow for large batches. This is acceptable as an initial implementation from a correctness standpoint, but it should be clearly marked as a "functional verification version".

7. `assert block_size <= 128, f"temp: only support block_size <= 128, but got {block_size}"`, along with the corresponding limit inside the kernel.

8. `conv_state_indices: Optional[str] = None`: the `str` annotation is incorrect.

9. The reference path expands KV heads with

```python
if gqa_interleave:
    k_expanded = k_unpadded.repeat((1, num_q_heads // num_kv_heads, 1))
```

while the kernel maps heads with

```python
if GQA_INTERLEAVE:
    kv_head_id = q_head_id % NUM_KV_HEADS
else:
    kv_head_id = q_head_id // (NUM_Q_HEADS // NUM_KV_HEADS)
```
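For context on how the two head-expansion layouts relate to the two kernel mappings, here is a hypothetical NumPy sketch (all names invented, not from the PR): a `Tensor.repeat`-style tiling lays out kv heads as [k0, k1, k0, k1], which lines up with the `q_head_id % NUM_KV_HEADS` mapping, while a `repeat_interleave`-style layout gives [k0, k0, k1, k1], matching the integer-division mapping.

```python
import numpy as np

num_kv_heads, group = 2, 2              # 4 query heads share 2 kv heads
num_q_heads = num_kv_heads * group
kv = np.arange(num_kv_heads)            # stand-in kv "heads": [0, 1]

tiled = np.tile(kv, group)              # repeat-style layout: [0, 1, 0, 1]
interleaved = np.repeat(kv, group)      # repeat_interleave-style: [0, 0, 1, 1]

for q_head_id in range(num_q_heads):
    # interleaved mapping (the GQA_INTERLEAVE branch of the kernel)
    assert tiled[q_head_id] == q_head_id % num_kv_heads
    # contiguous mapping (the else branch)
    assert interleaved[q_head_id] == q_head_id // group
```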
IV. Code quality (Low-Medium Priority)

10.

```python
# device = get_platform()
device = get_torch_device()
```

The commented-out line should simply be deleted.

11.

```python
try:
    import torch_npu
except ImportError:
    torch_npu = None
```

This sits in generic shared code.

12. The meaning and unit of this value (KB? bytes?) are unclear, and it is not derived dynamically from the actual hardware.

13.

14. Unused imports.

15.

V. Positive points

VI. Summary and recommendations
Overall, this PR is acceptable as a functional baseline for the ILU backend (the accuracy tests pass), but before merging into master the P0 architecture issues should be addressed first; at a minimum, avoid invasive changes to the core layer.

From Opus.
Well-reasoned and convincing. Which master is this?

This reviewer is impressive; I'll just follow these suggestions.
…es#191) * feat(*): support quantized operators for torch_npu backend. * feat(moe_quant): support quantized fused moe. * refactor(*): refactor the LLM-generated code by LLM itself. * chore(moe.py): add TODO. * chore(README.md): udpate operator status. * misc: update MoE configurations in test_moe. --------- Co-authored-by: wwens7 <zhaowenshuo.oo@bytedance.com>
Let me also try Gemini.
/gemini review |
Code Review
This pull request introduces support for the ILU backend by adding various Triton-based kernels and operators, including activation functions, attention mechanisms, normalization, and linear operations. My review identified several critical issues regarding the implementation of post-normalization logic in fused kernels and performance concerns related to kernel launch overhead and algorithmic complexity in the attention and grouped GEMM implementations.
I am having trouble creating individual review comments, so my feedback is listed below.
mojo_opset/backends/ttx/kernels/ilu/fused_add_layernorm.py (178-181)
The logic for add_mode="post" appears to be incorrect. The kernel computes Y = LayerNorm(hidden_states + residual), which is a pre-normalization operation. For a post-normalization architecture, the operation is typically output = LayerNorm(hidden_states) + residual. This implementation does not seem to support the post-norm case correctly. Returning Y for both the output and the next residual for post-norm models is likely to cause incorrect model behavior.
mojo_opset/backends/ttx/kernels/ilu/fused_add_rmsnorm.py (192-195)
The logic for add_mode="post" appears to be incorrect. The kernel computes Y = RMSNorm(hidden_states + residual), which is a pre-normalization operation. For a post-normalization architecture, the operation is typically output = RMSNorm(hidden_states) + residual. This implementation does not seem to support the post-norm case correctly. Returning Y for both the output and the next residual for post-norm models is likely to cause incorrect model behavior.
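To make the distinction concrete, here is a minimal NumPy sketch (illustrative only; `rms_norm` is a stand-in reference, not the kernel) of the two residual-add conventions:

```python
import numpy as np

def rms_norm(x, w, eps=1e-6):
    # reference RMSNorm: scale by reciprocal RMS, then elementwise weight
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * w

h = np.array([1.0, 2.0, 3.0])   # hidden_states
r = np.array([0.5, 0.5, 0.5])   # residual
w = np.ones(3)

# pre-norm (what the kernel computes): add first, then normalize;
# the un-normalized sum becomes the residual for the next layer
pre_out = rms_norm(h + r, w)
pre_next_residual = h + r

# post-norm: normalize hidden_states first, then add the residual
post_out = rms_norm(h, w) + r

# the two conventions give different results
assert not np.allclose(pre_out, post_out)
```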
mojo_opset/backends/ttx/kernels/ilu/group_gemm.py (247-303)
The current implementation of m_grouped_matmul_impl launches a separate kernel for each group within a Python loop. This approach can be inefficient due to kernel launch overhead, especially when dealing with a large number of small groups. This file already contains more advanced kernels like _m_grouped_matmul_bNmajor_kernel and _m_grouped_matmul_bKmajor_kernel that are designed to process all groups in a single launch. It would be more performant to refactor this function to use one of those kernels.
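For reference, a NumPy sketch of the m-grouped matmul semantics (group sizes, shapes, and names here are hypothetical) showing that a per-group Python loop and a single fused pass compute the same result, which is why folding the loop into one kernel launch only removes launch overhead:

```python
import numpy as np

# Each group g multiplies its own slice of rows A[start:end]
# (group sizes vary) by its own weight matrix B[g].
group_sizes = [3, 1, 4]
k, n = 5, 6
rng = np.random.default_rng(0)
a = rng.random((sum(group_sizes), k))
b = rng.random((len(group_sizes), k, n))

# per-group loop: what a Python-loop launcher computes
out_loop = np.empty((sum(group_sizes), n))
start = 0
for g, m in enumerate(group_sizes):
    out_loop[start:start + m] = a[start:start + m] @ b[g]
    start += m

# single fused pass: map each row to its group, then one einsum
row_group = np.repeat(np.arange(len(group_sizes)), group_sizes)
out_fused = np.einsum('mk,mkn->mn', a, b[row_group])

assert np.allclose(out_loop, out_fused)
```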
mojo_opset/backends/ttx/kernels/ilu/quant_group_linear.py (105-128)
The quant_group_linear_reduce_sum_impl function launches a Triton kernel for each item in the batch via a Python loop. This can lead to significant performance degradation due to kernel launch overhead, especially with large batch sizes. To improve performance, consider refactoring the implementation to handle the batch dimension inside the Triton kernel, allowing for a single kernel launch to process the entire batch.
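The general point can be illustrated with a toy NumPy sketch (shapes are arbitrary): a Python loop over the batch and a single batched call produce identical results, so moving the batch dimension inside the kernel changes only launch overhead, not semantics:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 8, 16))   # (batch, m, k)
w = rng.random((16, 4))       # (k, n)

# one "kernel launch" per batch element
out_loop = np.stack([xi @ w for xi in x])

# one call over the whole batch
out_batched = x @ w

assert np.allclose(out_loop, out_batched)
```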
mojo_opset/backends/ttx/kernels/ilu/sdpa.py (24-108)
The forward kernel _sdpa_masked_fwd_kernel computes the full attention matrix for each query token without tiling over the sequence dimension. This results in O(S^2) complexity per head, which can be very inefficient and consume a large amount of memory for long sequences. For better performance, consider implementing a tiled approach, similar to FlashAttention, which has O(S) complexity.
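As a sketch of the suggested direction, here is a NumPy version of FlashAttention-style online softmax (a single pass over K/V tiles with a running max and normalizer), checked against the naive full-matrix formulation. This is illustrative algorithm pseudocode, not Triton:

```python
import numpy as np

def naive_attention(q, k, v):
    # materializes the full (S_q, S_k) score matrix: O(S^2) memory
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=4):
    # one pass over K/V tiles, keeping a running max (m),
    # running normalizer (l), and rescaled accumulator (acc)
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full((q.shape[0], 1), -np.inf)
    l = np.zeros((q.shape[0], 1))
    acc = np.zeros((q.shape[0], v.shape[-1]))
    for s0 in range(0, k.shape[0], block):
        kb, vb = k[s0:s0 + block], v[s0:s0 + block]
        s = q @ kb.T * scale
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)          # rescale previous statistics
        p = np.exp(s - m_new)
        l = l * alpha + p.sum(axis=-1, keepdims=True)
        acc = acc * alpha + p @ vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(10, 16))
v = rng.normal(size=(10, 16))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```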
I've made a revision following Opus's suggestions. @zhangjihang-BD
A refactor that splits RoPE was just merged; please rebase.
* style: remove redundant argument of MojoMoE * ci: add test for moe_gating * fix: set ._backend attribute for all mojo operator/function classes * ci: add dtype assertions in moe_gating test
* move runtime utils from modeling * refactor: reorganize modeling modules into model-specific packages * fix permute into mem frag --------- Co-authored-by: mc-zhang <zhangbolun6@huawei.com>
/gemini review
We still need a perf test, including ilu's perf helper func (mainly device perf).
ok |
tests/utils.py
Outdated

```python
    perf_fn = perf_npu
elif device == 'mlu':
    perf_fn = perf_mlu
elif platform == "ilu":
```
By the way, why does this divergence exist here? The current auto_switch_platform decorator is all based on get_platform; wouldn't using torch_device introduce an inconsistency?

OK, `device = get_platform()` is indeed the consistent choice; the divergence is unjustified. I'll fix it.