
Develop #221

Open
YWHyuk wants to merge 167 commits into master from develop

Conversation

Collaborator

@YWHyuk YWHyuk commented Apr 7, 2026

Changelog — develop → master

TOGSim (simulator)

  • Memory backend: updated to Ramulator 2.1.
  • Stats & robustness: Clearer DRAM bandwidth reporting, safer idle-stat handling, fixes for local/remote memory stats.
  • Scheduling: Internal graph API cleanup (non-breaking; no user-facing API changes).
  • Tooling: Trace files now support comments; improved CLI help.

Compiler & runtime (PyTorchSim / MLIR)

  • PyTorch version: 2.1 → 2.8 (PyTorch version update #196)
  • Operators: SDPA can now be routed to a dedicated NPU kernel via the torch.nn.attention.sdpa_kernel([SDPBackend.FLASH_ATTENTION]) context manager; TopK, Bitonic sort, and Cat were added. ([BUG] Support for repeat_interleave operation to enable Grouped Query Attention (GQA) #198)
  • CNNs: Added MobileNet CI; 1×1 spatial convolutions are now treated as linear layers; baseline group-convolution decomposition plus tests. ([BUG] Cannot schedule MobileNet-SSDLite model #205)
  • Dtypes / codegen: Fixed float16 codegen in MLIR templates; worked around gem5 lmul8 widening issue by avoiding the problematic vector-width in codegen.
  • TOGSim session: Run kernels inside a with TOGSimulator(config_path=...): block so the config and simulator lifecycle are scoped to that block.
  • Multi-tenant launch: Call torch.npu.launch_model(opt_fn, *args, stream_index=..., timestamp=..., **kwargs) inside that block.
  • Cleanup: Removed legacy scheduler code; standardized on the TOGSimulator-oriented API.

Device (OpenReg / NPU)

  • Device API: Use torch.device("npu") (and torch.device("npu:0"), etc.) like any built-in device type — no extra package import beyond import torch; the NPU backend registers with PyTorch's device system.
  • Eager mode: The backend falls back to eager execution automatically when graph compilation is not available; an eager_to_compile() API is provided for explicit transition.
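A minimal sketch of the device usage described above, assuming the NPU backend from this PR is installed and registered with PyTorch (on stock PyTorch the "npu" device type is not available):

```python
# Sketch: the NPU behaves like a built-in device type once the backend
# registers itself; no import beyond `import torch` is required.
import torch

dev = torch.device("npu:0")           # or torch.device("npu")
x = torch.randn(4, 4, device=dev)     # allocate directly on the NPU
y = (x @ x).cpu()                     # runs eagerly if graph compilation
                                      # is unavailable (automatic fallback)
```

The exact namespace of the eager_to_compile() helper is not stated in the changelog, so it is not shown here.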

⚠️ Breaking Changes

  • Multi-tenant API redesign: The scheduler-based multi-tenant launch pattern has been replaced. The old API required manual Scheduler instantiation, Request object construction, and a while not scheduler.is_finished(): loop. The new API uses a with TOGSimulator(config_path=...): context and torch.npu.launch_model(..., stream_index=..., timestamp=...) calls directly. See test_scheduler.py for the updated usage pattern.
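The new pattern can be sketched as follows. TOGSimulator and torch.npu.launch_model are the APIs named in this changelog; the import path, config path, models, and timestamps are placeholder assumptions (see test_scheduler.py for the authoritative usage):

```python
import torch
from pytorchsim import TOGSimulator  # hypothetical import path

model_a = torch.nn.Linear(64, 64)
model_b = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# Config and simulator lifecycle are scoped to the with-block.
with TOGSimulator(config_path="configs/togsim.json"):
    # Each model is launched onto its own stream with a start timestamp;
    # no manual Scheduler/Request construction and no
    # `while not scheduler.is_finished():` polling loop is needed.
    torch.npu.launch_model(model_a, x, stream_index=0, timestamp=0)
    torch.npu.launch_model(model_b, x, stream_index=1, timestamp=100)
```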

CI, tests, experiments

  • Added or tightened tests for DeepSeek, YOLOv5, MobileNet; CI image updated for PyTorch 2.8.

Other

  • Misc. codegen, indexing, and matmul-related bugfixes and small refactors.

YWHyuk added 30 commits December 5, 2025 13:05
[Frontend] Use ops instead of raw assembly code
YWHyuk and others added 28 commits March 16, 2026 20:44
…dening errors

Updated the frontend to strictly validate vector element counts, preventing invalid LMUL=8 configurations in Gem5. Fixed a mismatch in the ext() operation's type-checking logic.
…ftmax

Note: Known compilation errors persist when using smaller tile sizes; investigation into the tile-stride logic is ongoing.
[FIX] Fix zero systolic array utilization during SDPA execution
- Update Ramulator version to 2.1
- Update Ramulator2 DRAM configs
