LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
- `llm_profile` extended to support MoE architectures (Mixtral-8x7B, Phi-mini-MoE) and TPU-v6e-1; a scikit-learn-based attention latency predictor replaces the tabular lookup for improved accuracy across varying batch sizes and sequence lengths
- Multi-instance simulation with configurable request routing (Round Robin, Random, Custom) and Prefill/Decode (P/D) disaggregation across separate prefill and decode instances
- MoE simulation with expert parallelism across NPUs and expert offloading to CPU or CXL memory, with configurable expert routing policies (Round Robin, Random, Fast, Custom)
- Prefix caching via RadixAttention with an optional second-tier prefix pool across CPU and CXL memory (`--enable-prefix-caching`, `--enable-prefix-sharing`, `--prefix-storage`)
- Sub-batch interleaving to overlap XPU and PIM computation (`--enable-sub-batch-interleaving`)
- Power and energy modeling per node covering NPU, CPU, DRAM, interconnect, NIC, and storage
- CXL memory expansion support with configurable multi-tier memory bandwidth and latency
- Per-request latency metrics: TTFT, TPOT, and ITL with p99 percentile reporting
See full changelog here.
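The per-request latency metrics listed above follow the usual serving definitions: TTFT is the time to first token, TPOT the average time per output token during decode, and ITL the gap between consecutive tokens. A minimal sketch of deriving them from per-token completion timestamps (the function and variable names here are illustrative, not the simulator's internal API):

```python
def latency_metrics(arrival_time, token_times):
    """Derive TTFT, TPOT, and ITLs for one request.

    arrival_time: when the request entered the system
    token_times:  completion timestamp of each generated token, in order
    """
    ttft = token_times[0] - arrival_time  # time to first token
    # inter-token latencies: gap between consecutive token completions
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    # time per output token, averaged over the decode phase
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot, itls

def p99(values):
    """p99 via nearest rank; sufficient for reporting, no interpolation."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
```

Reporting then reduces to collecting these per request and taking `p99` over each metric.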
```shell
git clone --recurse-submodules https://github.com/casys-kaist/LLMServingSim.git
cd LLMServingSim
```

This will configure and run the Docker environment. See `docker.sh` for details.

```shell
./docker.sh
```

This will compile ASTRA-Sim (analytical backend) and install Chakra. See `compile.sh` for details.

```shell
./compile.sh
```

All configurations for LLMServingSim are generated automatically by
`inference_serving/config_builder.py` from a `cluster_config` file.
The cluster_config file specifies node topology, instance layout, hardware type, memory
hierarchy, and interconnect parameters. It also supports per-layer placement rules for weights,
KV cache, and experts, as well as PIM-enabled device configuration.
Config paths:
- Cluster config: `cluster_config/{config_name}.json`
- Logical topology config (ns3 backend only): `astra-sim/inputs/logical_topology/{topology_name}.json`

Dataset paths:
- Dataset: `dataset/{dataset_name}.jsonl`
- Runtime-generated traces: `astra-sim/inputs/trace/`
See cluster_config/ for example configurations and cluster_config/README.md for the
configuration format reference.
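To make the idea concrete, here is a sketch of generating a cluster config programmatically. The keys below are invented for illustration and are NOT the simulator's real schema; consult `cluster_config/README.md` and the shipped examples for the authoritative format.

```python
import json

# Illustrative only: these keys are hypothetical, not the real schema.
# See cluster_config/README.md for the authoritative configuration format.
sketch = {
    "hardware": "A6000",                      # profiled hardware name
    "instances": [                            # instance layout (P/D disaggregation)
        {"role": "prefill", "npus": 4},
        {"role": "decode", "npus": 4},
    ],
    "memory": {"tiers": ["HBM", "CPU", "CXL"]},  # multi-tier memory hierarchy
}

print(json.dumps(sketch, indent=2))
```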
Test run:

```shell
python main.py \
  --cluster-config 'cluster_config/single_node_single_instance.json' \
  --fp 16 --block-size 16 \
  --dataset 'dataset/sharegpt_req100_rate10_llama.jsonl' \
  --output 'output/example_single_run.csv' \
  --num-req 100 --log-interval 1.0
```

See `run.sh` for additional examples covering multi-instance, P/D disaggregation, MoE,
prefix caching, CXL memory, PIM, power modeling, and sub-batch interleaving:

```shell
./run.sh
```

The current version supports the following models and hardware:
- Models: `meta-llama/Llama-3.1-8B`, `meta-llama/Llama-3.1-70B`, `microsoft/Phi-mini-MoE-instruct`, `mistralai/Mixtral-8x7B-v0.1`
- Hardware: A6000, H100, TPU-v6e-1
New models and hardware can be added using the provided profiler. See Adding a New Model & Hardware.
| Parameter | Default | Description |
|---|---|---|
| `--cluster-config` | `single_node_single_instance.json` | Node- and instance-level configuration |
| `--max-batch` | `0` | Maximum batch size; `0` means no limit |
| `--max-num-batched-tokens` | `2048` | Maximum tokens processed per iteration |
| `--fp` | `16` | Floating-point precision in bits |
| `--request-routing-policy` | `RR` | Request routing across instances (`RR`, `RAND`, `CUSTOM`) |
| `--expert-routing-policy` | `FAST` | Expert token routing for MoE (`RR`, `RAND`, `FAST`, `CUSTOM`) |
| `--enable-prefix-caching` | `False` | Enable prefix caching via RadixAttention |
| `--enable-prefix-sharing` | `False` | Enable second-tier prefix cache pooling |
| `--prefix-storage` | `None` | Storage tier for the second-tier prefix pool (`None`, `CPU`, `CXL`) |
| `--enable-local-offloading` | `False` | Enable weight offloading to local memory |
| `--enable-attn-offloading` | `False` | Enable attention computation offloading to PIM |
| `--enable-sub-batch-interleaving` | `False` | Enable sub-batch interleaving for XPU/PIM overlap |
| `--enable-attn-prediction` | `False` | Enable real-time attention latency prediction |
| `--prioritize-prefill` | `False` | Prioritize prefill requests in scheduling |
| `--block-size` | `16` | KV cache block size in tokens |
| `--dataset` | `None` | Path to `.jsonl` dataset; if `None`, add requests manually in `main.py` |
| `--output` | `None` | Path for per-request CSV output; if `None`, stdout only |
| `--gen` | `True` | Set to `False` to skip the initiation (prefill) phase |
| `--num-req` | `100` | Number of requests to simulate |
| `--log-interval` | `0.5` | Throughput logging interval in seconds |
| `--log-level` | `WARNING` | Logging verbosity (`WARNING`, `INFO`, `DEBUG`) |
| `--network-backend` | `analytical` | Network simulation backend (`analytical`, `ns3`) |
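The request-routing policies (`RR` and `RAND`) behave as their names suggest; the following is a simplified illustration of the policy semantics, not the simulator's scheduler code:

```python
import itertools
import random

class RoundRobinRouter:
    """Cycle through instances in fixed order (the RR policy)."""
    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def route(self, request):
        return next(self._cycle)

class RandomRouter:
    """Pick a uniformly random instance per request (the RAND policy)."""
    def __init__(self, instances, seed=None):
        self._instances = list(instances)
        self._rng = random.Random(seed)

    def route(self, request):
        return self._rng.choice(self._instances)
```

A `CUSTOM` policy would plug in the same way: any object exposing a per-request instance choice.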
The simulator reports runtime information through a configurable logger. It logs which requests are processed at each iteration and periodically reports throughput, memory usage, and power consumption.
Adjusting --log-level to INFO or DEBUG enables more detailed output, including per-layer
memory load and store activity.
{output_path}.csv contains per-request latency metrics. An example is provided at
output/example_run.csv.
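The per-request CSV can be inspected with the standard library. The column names in this sketch are assumptions; check the header of `output/example_run.csv` for the real field names:

```python
import csv
import io
import statistics

# Inline stand-in for an output CSV; column names here are assumptions,
# not the simulator's actual header.
sample = io.StringIO(
    "request_id,ttft,tpot\n"
    "0,0.12,0.03\n"
    "1,0.40,0.05\n"
    "2,0.15,0.04\n"
)

rows = list(csv.DictReader(sample))
mean_ttft = statistics.mean(float(r["ttft"]) for r in rows)
print(f"requests={len(rows)} mean TTFT={mean_ttft:.3f}s")
```

For a real run, replace the `io.StringIO` stand-in with `open(output_path)`.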
LLMServingSim uses the PyTorch-based profiler in llm_profile/ to generate per-layer latency,
attention latency, and power models for a given hardware target. Once profiling is complete,
create a cluster config referencing the new hardware name and run main.py as usual.
See llm_profile/README.md for full profiling instructions.
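The scikit-learn attention latency predictor mentioned earlier follows the standard profiled-latency recipe: fit a regression from workload features to measured latency, then query it at simulation time. The dependency-free sketch below illustrates the idea with a one-feature least-squares fit on made-up profiling numbers; the simulator's actual features and model live in `llm_profile/`:

```python
# Made-up profiling data: (batch_size, seq_len, measured_latency_ms).
profiled = [
    (1, 128, 0.40), (1, 256, 0.75), (2, 128, 0.76),
    (2, 256, 1.45), (4, 256, 2.85), (4, 512, 5.70),
]

def feature(batch, seq):
    return batch * seq  # total tokens attended per step (simplified)

xs = [feature(b, s) for b, s, _ in profiled]
ys = [lat for _, _, lat in profiled]

# Ordinary least squares for a single feature (closed form).
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def predict(batch, seq):
    """Estimated attention latency (ms) for an unseen configuration."""
    return intercept + slope * feature(batch, seq)
```

Unlike a tabular lookup, the fitted model interpolates between profiled points, which is why it generalizes better across unseen batch sizes and sequence lengths.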
The current version supports Llama-based model architectures. Models that deviate from this architecture may require modifications to the following:
inference_serving/memory_model.py — functions calculate_sizes and get_weight
calculate_sizes computes input, weight, and output tensor sizes for each layer type.
get_weight aggregates total model size from calculate_sizes.
Modify these according to the target model architecture.
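To illustrate the kind of bookkeeping these functions do, here is a rough sketch for a Llama-style decoder block; it is not the actual `memory_model.py` code, and it deliberately ignores grouped-query attention, norms, and biases, so it slightly overestimates real Llama checkpoints:

```python
def block_weight_bytes(hidden, intermediate, fp_bits=16):
    """Weight bytes of one decoder block (no GQA, norms, or biases)."""
    elem = fp_bits // 8
    attn = 4 * hidden * hidden        # Wq, Wk, Wv, Wo projections
    mlp = 3 * hidden * intermediate   # gate / up / down (SwiGLU MLP)
    return (attn + mlp) * elem

def model_weight_bytes(hidden, intermediate, layers, vocab, fp_bits=16):
    """Total model size: stacked blocks plus input/output embeddings."""
    elem = fp_bits // 8
    embeds = 2 * vocab * hidden * elem
    return layers * block_weight_bytes(hidden, intermediate, fp_bits) + embeds

# Llama-3.1-8B-like dimensions at fp16.
total = model_weight_bytes(hidden=4096, intermediate=14336,
                           layers=32, vocab=128256)
print(f"{total / 2**30:.1f} GiB")
```

A model that deviates from this layer inventory (e.g. MoE expert FFNs) needs the corresponding terms added.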
inference_serving/trace_generator.py — function synthesize_trace
This function constructs the per-iteration execution trace by stacking layers according to the model architecture. When modifying it, ensure:
- The ATTENTION layer is correctly separated per request
- The output size of layer i matches the input size of layer i+1
- ALLREDUCE operations are correctly placed for tensor-parallel synchronization
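The size-matching rule above can be checked mechanically when stacking layers. A toy sketch, with names that are illustrative rather than `trace_generator.py`'s actual API:

```python
def synthesize_toy_trace(layers):
    """Stack layers into a trace, validating adjacent tensor sizes.

    layers: list of (name, in_size, out_size) tuples in execution order.
    Raises ValueError if layer i's output size does not match
    layer i+1's input size.
    """
    for (name_a, _, out_a), (name_b, in_b, _) in zip(layers, layers[1:]):
        if out_a != in_b:
            raise ValueError(f"{name_a} out={out_a} != {name_b} in={in_b}")
    return [name for name, _, _ in layers]
```

The per-request ATTENTION separation and ALLREDUCE placement are model-specific and still need manual care.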
The evaluation/ directory contains the artifact evaluation flow for Figures 5 to 10
from the paper. It includes figure-specific shell scripts, plotting code, parsers, processed
reference inputs, and preserved example outputs under evaluation/artifacts/.
Before running artifact evaluation, complete the setup steps above (./docker.sh and
./compile.sh) and run the evaluation commands inside that environment.
Enter `evaluation/` first:

```shell
cd evaluation
```

Run an individual figure:

```shell
bash figure_5.sh
bash figure_6.sh
bash figure_7.sh
bash figure_8.sh
bash figure_9.sh
bash figure_10.sh
```

To reproduce the full evaluation set in one pass:

```shell
bash run_all.sh
```

To compare generated parsed outputs against preserved artifact snapshots:

```shell
# Compare all figures (5-10)
bash compare.sh
# Compare one figure
bash compare.sh 5
# Compare multiple selected figures
bash compare.sh 5 7 9
# Equivalent single-figure form
bash compare.sh figure_5
```

For visual validation, compare the generated PDFs with the corresponding `*_ref.pdf` files in each
figure folder.
See evaluation/README.md for detailed folder structure, reference-comparison guidance, and per-figure notes.
ISPASS 2026
LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Jaehong Cho*, Hyunmin Choi*, Guseul Heo, Jongse Park (KAIST) [Paper] (To Appear)
*Equal contribution
CAL 2025
LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Jaehong Cho, Hyunmin Choi, Jongse Park (KAIST) [Paper]
IISWC 2024
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park (KAIST) [Paper]
If you use LLMServingSim in your research, please cite:
@ARTICLE{11224567,
author={Cho, Jaehong and Choi, Hyunmin and Park, Jongse},
journal={IEEE Computer Architecture Letters},
title={{LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving
Techniques in LLM Infrastructure}},
year={2025},
volume={24},
number={02},
pages={361-364},
doi={10.1109/LCA.2025.3628325},
ISSN={1556-6064},
publisher={IEEE Computer Society},
address={Los Alamitos, CA, USA},
month=jul
}
@INPROCEEDINGS{10763697,
author={Cho, Jaehong and Kim, Minsu and Choi, Hyunmin and Heo, Guseul and Park, Jongse},
booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)},
title={{LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving
at Scale}},
year={2024},
pages={15-29},
doi={10.1109/IISWC63097.2024.00012}
}