feat: add openai_vision multimodal benchmark support #2
Open
AuZhoomLee wants to merge 10 commits into main from
Conversation
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: root <root@DESKTOP-5FJJCPK.localdomain>
…doc (ModelTC#1175) ### Testing Done Tested in a clean docker container without vllm installed. ```bash root@worker3218:/ws# python -m lightllm.server.api_server --model_dir /home/dist/Qwen3-0.6B/ --disable_cudagraph --host 0.0.0.0 WARNING 01-12 13:45:20 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:20 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:20 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:20 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:20 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:45:20 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:20 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm WARNING 01-12 13:45:20 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! INFO 01-12 13:45:21 [shm_size_check.py:21] SHM check: Available=500.00 GB,Recommended=2.32 GB.Sufficient: True INFO 01-12 13:45:21 [api_start.py:94] zmq mode head: ipc:///tmp/_28765_0_ INFO 01-12 13:45:21 [api_start.py:96] use tgi api: False INFO 01-12 13:45:21 [api_start.py:219] alloced ports: [10017, 10004, 10209, 10223, 10297, 10257, 10068, 10179, 10206, 10285] INFO 01-12 13:45:21 [api_start.py:270] all start args:Namespace(run_mode='normal', host='0.0.0.0', port=8000, httpserver_workers=1, zmq_mode='ipc:///tmp/_28765_0_', pd_master_ip='0.0.0.0', pd_master_port=1212, pd_decode_rpyc_port=42000, select_p_d_node_strategy='round_robin', config_server_host=None, config_server_port=None, nixl_pd_kv_page_num=16, nixl_pd_kv_page_size=1024, model_name='default_model_name', model_dir='/home/dist/Qwen3-0.6B/', tokenizer_mode='fast', load_way='HF', max_total_token_num=None, mem_fraction=0.9, batch_max_tokens=8448, eos_id=[151645], tool_call_parser=None, reasoning_parser=None, chat_template=None, running_max_req_size=1000, nnodes=1, node_rank=0, multinode_httpmanager_port=12345, multinode_router_gloo_port=20001, tp=1, dp=1, dp_balancer='bs_balancer', max_req_total_len=16384, nccl_host='127.0.0.1', nccl_port=28765, use_config_server_to_init_nccl=False, trust_remote_code=False, disable_log_stats=False, log_stats_interval=10, disable_shm_warning=False, router_token_ratio=0.0, router_max_new_token_len=1024, router_max_wait_tokens=1, disable_aggressive_schedule=False, use_dynamic_prompt_cache=False, disable_dynamic_prompt_cache=False, chunked_prefill_size=4096, disable_chunked_prefill=False, diverse_mode=False, token_healing_mode=False, output_constraint_mode='none', first_token_constraint_mode=False, enable_multimodal=False, enable_multimodal_audio=False, enable_mps=False, disable_custom_allreduce=False, enable_custom_allgather=False, enable_tpsp_mix_mode=False, enable_dp_prefill_balance=False, enable_prefill_microbatch_overlap=False, enable_decode_microbatch_overlap=False, llm_prefill_att_backend=['triton'], llm_decode_att_backend=['triton'], llm_kv_type='None', llm_kv_quant_group_size=8, cache_capacity=200, embed_cache_storage_size=4, data_type='bfloat16', return_all_prompt_logprobs=False, use_reward_model=False, long_truncation_mode=None, use_tgi_api=False, health_monitor=False, metric_gateway=None, job_name='lightllm', grouping_key=[], push_interval=10, visual_infer_batch_size=1, 
visual_send_batch_size=1, visual_gpu_ids=[0], visual_tp=1, visual_dp=1, visual_nccl_ports=[29500], enable_monitor_auth=False, disable_cudagraph=True, enable_prefill_cudagraph=False, prefll_cudagraph_max_handle_token=512, graph_max_batch_size=256, graph_split_batch_size=32, graph_grow_step_size=16, graph_max_len_in_batch=16384, quant_type='none', quant_cfg=None, vit_quant_type='none', vit_quant_cfg=None, sampling_backend='triton', penalty_counter_mode='gpu_counter', ep_redundancy_expert_config_path=None, auto_update_redundancy_expert=False, enable_fused_shared_experts=False, mtp_mode=None, mtp_draft_model_dir=None, mtp_step=0, kv_quant_calibration_config_path=None, schedule_time_interval=0.03, enable_cpu_cache=False, cpu_cache_storage_size=2, cpu_cache_token_page_size=256, enable_disk_cache=False, disk_cache_storage_size=10, disk_cache_dir=None, enable_dp_prompt_cache_fetch=False, router_port=10017, detokenization_port=10004, http_server_port=10209, visual_port=10223, audio_port=10297, cache_port=10257, metric_port=10068, multi_level_kv_cache_port=10179, pd_node_infer_rpyc_ports=[10285], pd_node_id=288479957063433772586255832729030629155, pd_p_allowed_port_min=20000, pd_p_allowed_port_max=30000) WARNING 01-12 13:45:27 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:27 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:27 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:27 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:27 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. 2026-01-12 13:45:27 | server | 140078322902144 | INFO : server started on [0.0.0.0]:10068 INFO 01-12 13:45:27 [start_utils.py:37] init func start_metric_manager : init ok WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. 
INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm INFO 01-12 13:45:33 [manager.py:36] pub_to_httpserver sendhwm 1000 WARNING 01-12 13:45:33 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! 2026-01-12 13:45:33 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 47548) with fd 25 2026-01-12 13:45:33 | server | 140046992746048 | INFO : welcome ('127.0.0.1', 47548) INFO 01-12 13:45:38 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:38 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:38 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:38 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. WARNING 01-12 13:45:38 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:38 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm INFO 01-12 13:45:38 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. WARNING 01-12 13:45:40 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! INFO 01-12 13:45:40 [model_rpc.py:67] Initialized RPC server for rank 0. INFO 01-12 13:45:40 [model_rpc.py:168] use ChunkedPrefillBackend INFO 01-12 13:45:43 [basemodel.py:169] Initial quantization. The default quantization method is none pid 45988 Loading model weights with 1 workers: 0%| | 0/1 [00:00<?, ?it/s]INFO 01-12 13:45:43 [embedding_weight.py:30] loaded weight vocab_size: 151936 pid 45988 Loading model weights with 1 workers: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.19it/s] INFO 01-12 13:45:43 [mem_utils.py:30] mode setting params: None INFO 01-12 13:45:43 [mem_utils.py:40] Model kv cache using mem_manager class: <class 'lightllm.common.kv_cache_mem_manager.mem_manager.MemoryManager'> INFO 01-12 13:45:43 [mem_manager.py:99] 69.76169700622559 GB space is available after load the model weight INFO 01-12 13:45:43 [mem_manager.py:99] 0.109375 MB is the size of one token kv cache INFO 01-12 13:45:43 [mem_manager.py:99] 653128 is the profiled max_total_token_num with the mem_fraction 0.9 INFO 01-12 13:45:43 [mem_manager.py:99] INFO 01-12 13:45:44 [basemodel.py:126] use prefill att backend: TritonAttBackend INFO 01-12 13:45:44 [basemodel.py:127] use decode att backend: TritonAttBackend warming up: 0%| | 0/12 [00:00<?, ?it/s]WARNING 01-12 13:46:16 [autotuner.py:169] No kernel config for silu_and_mul_fwd:v1 in {N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json,the performance may be suboptimal!You can use LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 to enable autotune. 
WARNING 01-12 13:46:16 [kernel_config.py:40] can not find config_path /ws/lightllm/common/all_kernel_configs/moe_silu_and_mul_kernel/{N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json kernel name moe_silu_and_mul_kernel use default kernel setting warming up: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00, 3.41s/it] INFO 01-12 13:46:25 [basemodel.py:846] begin check max_len infer INFO 01-12 13:46:25 [basemodel.py:882] check max_len 8448 infer ok INFO 01-12 13:46:40 [base_backend.py:184] loaded model class <class 'lightllm.models.qwen3.model.Qwen3TpPartModel'> INFO 01-12 13:46:40 [manager.py:194] use req queue ChunkedPrefillQueue INFO 01-12 13:46:40 [start_utils.py:37] init func start_router_process : init ok INFO 01-12 13:46:40 [start_utils.py:37] init func start_detokenization_process : init ok INFO 01-12 13:46:40 [api_start.py:58] start process pid 38328 INFO 01-12 13:46:40 [api_start.py:59] http server pid 5689 [2026-01-12 13:46:40 +0800] [5689] [INFO] Starting gunicorn 23.0.0 [2026-01-12 13:46:40 +0800] [5689] [INFO] Listening at: http://0.0.0.0:8000 (5689) [2026-01-12 13:46:40 +0800] [5689] [INFO] Using worker: uvicorn.workers.UvicornWorker [2026-01-12 13:46:40 +0800] [5690] [INFO] Booting worker with pid: 5690 WARNING 01-12 13:46:46 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:46:46 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:46:46 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:46:46 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:46:46 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:46:46 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:46:46 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm [2026-01-12 13:46:47 +0800] [5690] [INFO] Started server process [5690] [2026-01-12 13:46:47 +0800] [5690] [INFO] Waiting for application startup. INFO 01-12 13:46:47 [api_http.py:359] server start up 2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35962) with fd 26 2026-01-12 13:46:47 | server | 140046984353344 | INFO : welcome ('127.0.0.1', 35962) 2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35966) with fd 27 2026-01-12 13:46:47 | server | 140046975960640 | INFO : welcome ('127.0.0.1', 35966) INFO 01-12 13:46:48 [req_id_generator.py:34] ReqIDGenerator init finished INFO 01-12 13:46:48 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False> [2026-01-12 13:46:48 +0800] [5690] [INFO] Application startup complete. 
DEBUG 01-12 13:47:52 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:47:52 [manager.py:283] DEBUG 01-12 13:47:52 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:47:52 [manager.py:284] [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 13:48:55 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:48:55 [manager.py:283] DEBUG 01-12 13:48:55 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:48:55 [manager.py:284] DEBUG 01-12 13:49:58 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:49:58 [manager.py:283] DEBUG 01-12 13:49:58 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:49:58 [manager.py:284] DEBUG 01-12 13:51:02 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:02 [manager.py:283] DEBUG 01-12 13:51:02 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:51:02 [manager.py:284] INFO 01-12 13:51:09 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 INFO 01-12 13:51:09 [manager.py:422] router recive req id 8 cost time 0.05662369728088379 s DEBUG 01-12 13:51:09 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197069.7485027s req_ids:[8] DEBUG 01-12 13:51:09 [manager.py:320] INFO 01-12 13:51:09 [manager.py:55] detokenization recv req id 8 cost time 0.07959198951721191 s DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 contain prompt cache tree unrefed token DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 contain prompt cache tree unrefed token INFO 01-12 13:51:16 [manager.py:163] detoken release req id 8 INFO 01-12 13:51:16 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 first_token_cost:6353.325128555298ms total_cost_time:6671.096563339233ms,out_token_counter:17 mean_per_token_cost_time: 18.692437340231503ms prompt_token_num:4 gpu cache hit: False gpu_prompt_cache_len:0 gpu_prompt_cache_ratio:0.0 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:55472 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:16 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:16 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:16 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:16 [infer_batch.py:172] radix hold token num 21 DEBUG 01-12 13:51:16 
[infer_batch.py:172] mem manager can alloc token num 653107 DEBUG 01-12 13:51:16 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:16 [batch.py:56] router release req id 8 INFO 01-12 13:51:16 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:19 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 INFO 01-12 13:51:19 [manager.py:422] router recive req id 16 cost time 0.019651412963867188 s DEBUG 01-12 13:51:19 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197079.421846s req_ids:[16] DEBUG 01-12 13:51:19 [manager.py:320] INFO 01-12 13:51:19 [manager.py:55] detokenization recv req id 16 cost time 0.021979331970214844 s INFO 01-12 13:51:19 [manager.py:163] detoken release req id 16 INFO 01-12 13:51:19 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 first_token_cost:102.96440124511719ms total_cost_time:407.08088874816895ms,out_token_counter:17 mean_per_token_cost_time: 17.88920514723834ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47146 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:19 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:19 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:19 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:19 [infer_batch.py:172] radix hold token num 35 DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager can alloc token num 653093 DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:19 [batch.py:56] router release req id 16 INFO 01-12 13:51:19 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:22 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 INFO 01-12 13:51:22 [manager.py:422] router recive req id 24 cost time 0.015377998352050781 s DEBUG 01-12 13:51:22 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197082.1040523s req_ids:[24] DEBUG 01-12 13:51:22 [manager.py:320] INFO 01-12 13:51:22 [manager.py:55] detokenization recv req id 24 cost time 0.016767501831054688 s INFO 01-12 13:51:22 [manager.py:163] detoken release req id 24 INFO 01-12 13:51:22 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 first_token_cost:86.02452278137207ms total_cost_time:432.842493057251ms,out_token_counter:17 mean_per_token_cost_time: 20.4010570750517ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47156 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:22 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:22 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:22 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:22 [infer_batch.py:172] radix hold token num 51 DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager can alloc token num 653077 DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:22 [batch.py:56] router release req id 24 INFO 01-12 13:51:22 [shm_req_manager.py:111] all shm req has been release ok 
INFO 01-12 13:51:26 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 INFO 01-12 13:51:26 [manager.py:422] router recive req id 32 cost time 0.008630990982055664 s DEBUG 01-12 13:51:26 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197086.9206343s req_ids:[32] DEBUG 01-12 13:51:26 [manager.py:320] INFO 01-12 13:51:26 [manager.py:55] detokenization recv req id 32 cost time 0.011269092559814453 s INFO 01-12 13:51:27 [manager.py:163] detoken release req id 32 INFO 01-12 13:51:27 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 first_token_cost:74.12481307983398ms total_cost_time:378.31759452819824ms,out_token_counter:17 mean_per_token_cost_time: 17.89369302637437ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47160 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:27 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:27 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:27 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:27 [infer_batch.py:172] radix hold token num 68 DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager can alloc token num 653060 DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:27 [batch.py:56] router release req id 32 INFO 01-12 13:51:27 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:44 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 INFO 01-12 13:51:44 [manager.py:422] router recive req id 40 cost time 0.009232759475708008 s DEBUG 01-12 13:51:44 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197104.2886696s req_ids:[40] DEBUG 01-12 13:51:44 [manager.py:320] INFO 01-12 13:51:44 [manager.py:55] detokenization recv req id 40 cost time 0.010197639465332031 s DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.00019597996104898273 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.0002955010350191693 contain prompt cache tree unrefed token DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0002618169792138754 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0003613380531840619 contain prompt cache tree unrefed token DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0005052608370794086 not contain prompt cache tree 
unrefed token DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0006047819110495952 contain prompt cache tree unrefed token DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.0007456425080535515 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.000845163582023738 contain prompt cache tree unrefed token DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.0009875552724733895 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.001087076346443576 contain prompt cache tree unrefed token DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.0012264058500018372 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.001325926923972024 contain prompt cache tree unrefed token DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.0014086059700395635 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.00150812704400975 contain prompt cache tree unrefed token DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.0015724329687289474 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.001671954042699134 contain prompt cache tree unrefed token DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0017331977805269412 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0018327188544971277 contain prompt cache tree unrefed token DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 
estimated_peak_token_count: 2020 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0018939625923249349 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0019934836662951214 contain prompt cache tree unrefed token DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.0020531963106772333 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.00215271738464742 contain prompt cache tree unrefed token DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.002213961122475227 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.0023134821964454133 contain prompt cache tree unrefed token DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.0023731948408275256 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.002472715914797712 contain prompt cache tree unrefed token DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002509462157494396 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002608983231464583 contain prompt cache tree unrefed token DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0026288874462586202 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0027284085202288065 contain prompt cache tree unrefed token DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002746781641577149 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002846302715547335 contain prompt cache tree unrefed token DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 
paused req num: 0 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.002861613650004287 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.0029611347239744735 contain prompt cache tree unrefed token DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.002939699415734741 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.0030392204897049277 contain prompt cache tree unrefed token DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.0030116608076824146 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.003111181881652601 contain prompt cache tree unrefed token INFO 01-12 13:52:42 [manager.py:163] detoken release req id 40 INFO 01-12 13:52:42 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 first_token_cost:91.23969078063965ms total_cost_time:58654.03771400452ms,out_token_counter:2000 mean_per_token_cost_time: 29.28139901161194ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:50156 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:52:42 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:52:42 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:52:42 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:52:42 [infer_batch.py:172] radix hold token num 2068 DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager can alloc token num 651060 DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:52:42 [batch.py:56] router release req id 40 INFO 01-12 13:52:42 [shm_req_manager.py:111] all shm req has been release ok DEBUG 01-12 13:52:50 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:50 [manager.py:283] DEBUG 01-12 13:52:50 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:52:50 [manager.py:284] DEBUG 01-12 13:53:53 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:53:53 [manager.py:283] DEBUG 01-12 13:53:53 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:53:53 [manager.py:284] DEBUG 01-12 13:54:56 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:54:56 [manager.py:283] DEBUG 01-12 13:54:56 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:54:56 [manager.py:284] DEBUG 01-12 13:56:00 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:56:00 [manager.py:283] DEBUG 01-12 13:56:00 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:56:00 [manager.py:284] DEBUG 01-12 13:57:03 [manager.py:283] 
dp_i 0 frozen token num: 0 DEBUG 01-12 13:57:03 [manager.py:283] DEBUG 01-12 13:57:03 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:57:03 [manager.py:284] DEBUG 01-12 13:58:06 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:58:06 [manager.py:283] DEBUG 01-12 13:58:06 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:58:06 [manager.py:284] DEBUG 01-12 13:59:09 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:59:09 [manager.py:283] DEBUG 01-12 13:59:09 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:59:09 [manager.py:284] INFO 01-12 14:00:06 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 INFO 01-12 14:00:06 [manager.py:422] router recive req id 48 cost time 0.00828862190246582 s DEBUG 01-12 14:00:06 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197606.2045314s req_ids:[48] DEBUG 01-12 14:00:06 [manager.py:320] INFO 01-12 14:00:06 [manager.py:55] detokenization recv req id 48 cost time 0.010654926300048828 s DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 4.746389681655051e-05 not contain prompt cache tree unrefed token DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 0.0032091718621770926 contain prompt cache tree unrefed token DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.0002878455677906934 not contain prompt cache tree unrefed token DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.003449553533151235 contain prompt cache tree unrefed token INFO 01-12 14:00:10 [manager.py:163] detoken release req id 48 INFO 01-12 14:00:10 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 first_token_cost:94.14434432983398ms total_cost_time:3917.818784713745ms,out_token_counter:200 mean_per_token_cost_time: 19.118372201919556ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:53836 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 14:00:10 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 14:00:10 [infer_batch.py:172] free a batch state: DEBUG 01-12 14:00:10 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 14:00:10 [infer_batch.py:172] radix hold token num 2266 DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager can alloc token num 650862 DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 14:00:10 [batch.py:56] router release req id 48 INFO 01-12 14:00:10 [shm_req_manager.py:111] all shm req has been release ok DEBUG 01-12 14:00:12 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:12 [manager.py:283] DEBUG 01-12 14:00:12 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:00:12 [manager.py:284] DEBUG 01-12 14:01:16 
[manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:01:16 [manager.py:283] DEBUG 01-12 14:01:16 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:01:16 [manager.py:284] DEBUG 01-12 14:02:19 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:02:19 [manager.py:283] DEBUG 01-12 14:02:19 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:02:19 [manager.py:284] [2026-01-12 14:03:16 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:03:22 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:03:22 [manager.py:283] DEBUG 01-12 14:03:22 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:03:22 [manager.py:284] DEBUG 01-12 14:04:25 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:04:25 [manager.py:283] DEBUG 01-12 14:04:25 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:04:25 [manager.py:284] DEBUG 01-12 14:05:28 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:05:28 [manager.py:283] DEBUG 01-12 14:05:28 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:05:28 [manager.py:284] [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:06:31 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:06:31 [manager.py:283] DEBUG 01-12 14:06:31 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:06:31 [manager.py:284] DEBUG 01-12 14:07:35 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:07:35 [manager.py:283] DEBUG 01-12 14:07:35 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:07:35 [manager.py:284] DEBUG 01-12 14:08:38 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:08:38 [manager.py:283] DEBUG 01-12 14:08:38 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:08:38 [manager.py:284] DEBUG 01-12 14:09:41 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:09:41 [manager.py:283] DEBUG 01-12 14:09:41 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:09:41 [manager.py:284] DEBUG 01-12 14:10:44 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:10:44 [manager.py:283] DEBUG 01-12 14:10:44 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:10:44 [manager.py:284] DEBUG 01-12 14:11:47 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:11:47 [manager.py:283] DEBUG 01-12 14:11:47 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:11:47 [manager.py:284] [2026-01-12 14:11:57 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:12:51 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:12:51 [manager.py:283] DEBUG 01-12 14:12:51 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:12:51 [manager.py:284] DEBUG 01-12 14:13:54 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:13:54 [manager.py:283] DEBUG 01-12 14:13:54 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:13:54 [manager.py:284] ``` Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: shihaobai <42648726+shihaobai@users.noreply.github.com>
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: sangchengmeng <sangchengmeng@sensetime.com>
- Add get_custom_input_data_multimodal for loading multimodal JSONL data
- Add async_post_stream_openai_chat for OpenAI Chat Completions API
- Support --server_api openai_vision mode for vision model benchmarking
- Add --max_requests option to limit number of requests
- Add progress bar with tqdm for better UX
- Improve error handling and timeout configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
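For orientation, a minimal sketch of what the two pieces named above could look like: loading a multimodal JSONL file (optionally capped, as --max_requests would do) and turning one record into an OpenAI Chat Completions vision request. The JSONL field names text and image, the helper names, and the placeholder model string are illustrative assumptions, not the PR's actual implementation.

```python
import base64
import json
from pathlib import Path


def load_multimodal_jsonl(path, max_requests=None):
    """Read one JSON object per line, stopping after max_requests records."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            if max_requests is not None and len(records) >= max_requests:
                break
    return records


def to_chat_payload(record, model="vision-model"):
    """Build an OpenAI Chat Completions request body with an inline base64 image."""
    image_b64 = base64.b64encode(Path(record["image"]).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": record["text"]},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        "stream": False,
    }
```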
…mode

In non-streaming mode (openai_vision), the response contains only one time point, making first_token_time and decode_token_time statistics meaningless. This change skips these statistics when the lists are empty, showing only QPS and request_time metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
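A minimal sketch of the guard this commit describes, assuming the metrics are collected in plain Python lists (all names are illustrative): when the per-token timing lists stay empty, as they do in non-streaming openai_vision runs, only the QPS and request-time statistics are printed.

```python
def report_metrics(first_token_times, decode_token_times, request_times, total_seconds):
    """Hypothetical reporting helper illustrating the guard described above."""
    if request_times and total_seconds > 0:
        print(f"qps: {len(request_times) / total_seconds:.2f}")
        print(f"mean request_time: {sum(request_times) / len(request_times):.3f}s")

    # In non-streaming mode these lists stay empty, so the per-token
    # statistics are skipped rather than reported as meaningless zeros.
    if first_token_times:
        print(f"mean first_token_time: {sum(first_token_times) / len(first_token_times):.3f}s")
    if decode_token_times:
        print(f"mean decode_token_time: {sum(decode_token_times) / len(decode_token_times):.3f}s")
```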
Summary
- Add openai_vision server_api mode for benchmarking vision models with multimodal inputs
- Add get_custom_input_data_multimodal function for loading JSONL data with images
- Add --max_requests option to limit the number of requests for testing
- Skip first_token_time and decode_token_time statistics in non-streaming mode (only show QPS and request_time)

Test plan

- Run the benchmark with --server_api openai_vision and multimodal JSONL data
- Verify that first_token_time and decode_token_time are skipped in non-streaming mode

🤖 Generated with Claude Code
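For reference, a simplified sketch of the kind of non-streaming request loop the test plan exercises: aiohttp with an explicit timeout and a tqdm progress bar. It sends requests one at a time for clarity; the PR's async_post_stream_openai_chat may overlap requests, and the endpoint shown is just the standard OpenAI-compatible route, so adjust host, port, and path to your server.

```python
import asyncio
import time

import aiohttp
from tqdm import tqdm


async def post_chat(session, url, payload):
    """Send one non-streaming chat request and return its wall-clock latency."""
    start = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        await resp.json()
    return time.perf_counter() - start


async def run_benchmark(url, payloads, timeout_s=300.0):
    """Issue requests sequentially with a progress bar and a per-request timeout."""
    request_times = []
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for payload in tqdm(payloads, desc="requests"):
            request_times.append(await post_chat(session, url, payload))
    return request_times


# Example:
# asyncio.run(run_benchmark("http://127.0.0.1:8000/v1/chat/completions", payloads))
```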