
Develop #221

Open
YWHyuk wants to merge 167 commits into master from develop

Conversation

Collaborator

@YWHyuk YWHyuk commented Apr 7, 2026

Changelog — develop → master

TOGSim (simulator)

  • Memory backend: updated to Ramulator 2.1.
  • Stats & robustness: Clearer DRAM bandwidth reporting, safer idle-stat handling, fixes for local/remote memory stats.
  • Scheduling: Internal graph API cleanup (non-breaking; no user-facing API changes).
  • Tooling: Trace files now support comments; improved CLI help.

Compiler & runtime (PyTorchSim / MLIR)

  • PyTorch version: 2.1 → 2.8 (PyTorch version update #196)
  • Operators: SDPA can now be routed to a dedicated NPU kernel via the torch.nn.attention.sdpa_kernel([SDPBackend.FLASH_ATTENTION]) context manager; TopK, Bitonic sort, and Cat were added. ([BUG] Support for repeat_interleave operation to enable Grouped Query Attention (GQA) #198)
  • CNNs: Added MobileNet CI; 1×1 spatial convolutions are now treated as linear layers; baseline group-convolution decomposition plus tests. ([BUG] Cannot schedule MobileNet-SSDLite model #205)
  • Dtypes / codegen: Fixed float16 codegen in MLIR templates; worked around gem5 lmul8 widening issue by avoiding the problematic vector-width in codegen.
  • TOGSim session: Run kernels inside a with TOGSimulator(config_path=...): block so the config and simulator lifecycle are scoped to that block.
  • Multi-tenant launch: Call torch.npu.launch_model(opt_fn, *args, stream_index=..., timestamp=..., **kwargs) inside that block.
  • Cleanup: Removed legacy scheduler code; standardized on the TOGSimulator-oriented API.

Device (OpenReg / NPU)

  • Device API: Use torch.device("npu") (and torch.device("npu:0"), etc.) like any built-in device type — no extra package import beyond import torch; the NPU backend registers with PyTorch's device system.
  • Eager mode: The backend falls back to eager execution automatically when graph compilation is not available; an eager_to_compile() API is provided for explicit transition.
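A minimal sketch of the device usage described above, assuming the NPU backend from this PR is installed and registered with PyTorch (on stock PyTorch the "npu" device type is not available):

```python
# Sketch: the NPU behaves like a built-in device type once the backend
# registers itself; no import beyond `import torch` is required.
import torch

dev = torch.device("npu:0")           # or torch.device("npu")
x = torch.randn(4, 4, device=dev)     # allocate directly on the NPU
y = (x @ x).cpu()                     # runs eagerly if graph compilation
                                      # is unavailable (automatic fallback)
```

The exact namespace of the eager_to_compile() helper is not stated in the changelog, so it is not shown here.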

⚠️ Breaking Changes

  • Multi-tenant API redesign: The scheduler-based multi-tenant launch pattern has been replaced. The old API required manual Scheduler instantiation, Request object construction, and a while not scheduler.is_finished(): loop. The new API uses a with TOGSimulator(config_path=...): context and torch.npu.launch_model(..., stream_index=..., timestamp=...) calls directly. See test_scheduler.py for the updated usage pattern.
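The new pattern can be sketched as follows. TOGSimulator and torch.npu.launch_model are the APIs named in this changelog; the import path, config path, models, and timestamps are placeholder assumptions (see test_scheduler.py for the authoritative usage):

```python
import torch
from pytorchsim import TOGSimulator  # hypothetical import path

model_a = torch.nn.Linear(64, 64)
model_b = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# Config and simulator lifecycle are scoped to the with-block.
with TOGSimulator(config_path="configs/togsim.json"):
    # Each model is launched onto its own stream with a start timestamp;
    # no manual Scheduler/Request construction and no
    # `while not scheduler.is_finished():` polling loop is needed.
    torch.npu.launch_model(model_a, x, stream_index=0, timestamp=0)
    torch.npu.launch_model(model_b, x, stream_index=1, timestamp=100)
```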

CI, tests, experiments

  • Added or tightened tests for DeepSeek, YOLOv5, MobileNet; CI image updated for PyTorch 2.8.

Other

  • Misc. codegen, indexing, and matmul-related bugfixes and small refactors.

YWHyuk added 30 commits December 5, 2025 13:05
[Frontend] Use ops instead of raw assembly code
YWHyuk and others added 28 commits March 16, 2026 20:44
…dening errors

Updated the frontend to strictly validate vector element counts, preventing invalid LMUL=8 configurations in Gem5. Fixed a mismatch in the ext() operation's type-checking logic.
…ftmax

Note: Known compilation errors persist when using smaller tile sizes; investigation into the tile-stride logic is ongoing.
[FIX] Fix zero systolic array utilization during SDPA execution
- Update Ramulator version to 2.1
- Update Ramulator2 DRAM configs
