feat: add openai_vision multimodal benchmark support #2
Open
AuZhoomLee wants to merge 10 commits into main from
Conversation
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: root <root@DESKTOP-5FJJCPK.localdomain>
…doc (ModelTC#1175) ### Testing Done Tested in a clean docker container without vllm installed. ```bash root@worker3218:/ws# python -m lightllm.server.api_server --model_dir /home/dist/Qwen3-0.6B/ --disable_cudagraph --host 0.0.0.0 WARNING 01-12 13:45:20 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:20 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:20 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:20 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:20 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:45:20 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:20 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm WARNING 01-12 13:45:20 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! INFO 01-12 13:45:21 [shm_size_check.py:21] SHM check: Available=500.00 GB,Recommended=2.32 GB.Sufficient: True INFO 01-12 13:45:21 [api_start.py:94] zmq mode head: ipc:///tmp/_28765_0_ INFO 01-12 13:45:21 [api_start.py:96] use tgi api: False INFO 01-12 13:45:21 [api_start.py:219] alloced ports: [10017, 10004, 10209, 10223, 10297, 10257, 10068, 10179, 10206, 10285] INFO 01-12 13:45:21 [api_start.py:270] all start args:Namespace(run_mode='normal', host='0.0.0.0', port=8000, httpserver_workers=1, zmq_mode='ipc:///tmp/_28765_0_', pd_master_ip='0.0.0.0', pd_master_port=1212, pd_decode_rpyc_port=42000, select_p_d_node_strategy='round_robin', config_server_host=None, config_server_port=None, nixl_pd_kv_page_num=16, nixl_pd_kv_page_size=1024, model_name='default_model_name', model_dir='/home/dist/Qwen3-0.6B/', tokenizer_mode='fast', load_way='HF', max_total_token_num=None, mem_fraction=0.9, batch_max_tokens=8448, eos_id=[151645], tool_call_parser=None, reasoning_parser=None, chat_template=None, running_max_req_size=1000, nnodes=1, node_rank=0, multinode_httpmanager_port=12345, multinode_router_gloo_port=20001, tp=1, dp=1, dp_balancer='bs_balancer', max_req_total_len=16384, nccl_host='127.0.0.1', nccl_port=28765, use_config_server_to_init_nccl=False, trust_remote_code=False, disable_log_stats=False, log_stats_interval=10, disable_shm_warning=False, router_token_ratio=0.0, router_max_new_token_len=1024, router_max_wait_tokens=1, disable_aggressive_schedule=False, use_dynamic_prompt_cache=False, disable_dynamic_prompt_cache=False, chunked_prefill_size=4096, disable_chunked_prefill=False, diverse_mode=False, token_healing_mode=False, output_constraint_mode='none', first_token_constraint_mode=False, enable_multimodal=False, enable_multimodal_audio=False, enable_mps=False, disable_custom_allreduce=False, enable_custom_allgather=False, enable_tpsp_mix_mode=False, enable_dp_prefill_balance=False, enable_prefill_microbatch_overlap=False, enable_decode_microbatch_overlap=False, llm_prefill_att_backend=['triton'], llm_decode_att_backend=['triton'], llm_kv_type='None', llm_kv_quant_group_size=8, cache_capacity=200, embed_cache_storage_size=4, data_type='bfloat16', return_all_prompt_logprobs=False, use_reward_model=False, long_truncation_mode=None, use_tgi_api=False, health_monitor=False, metric_gateway=None, job_name='lightllm', grouping_key=[], push_interval=10, visual_infer_batch_size=1, 
visual_send_batch_size=1, visual_gpu_ids=[0], visual_tp=1, visual_dp=1, visual_nccl_ports=[29500], enable_monitor_auth=False, disable_cudagraph=True, enable_prefill_cudagraph=False, prefll_cudagraph_max_handle_token=512, graph_max_batch_size=256, graph_split_batch_size=32, graph_grow_step_size=16, graph_max_len_in_batch=16384, quant_type='none', quant_cfg=None, vit_quant_type='none', vit_quant_cfg=None, sampling_backend='triton', penalty_counter_mode='gpu_counter', ep_redundancy_expert_config_path=None, auto_update_redundancy_expert=False, enable_fused_shared_experts=False, mtp_mode=None, mtp_draft_model_dir=None, mtp_step=0, kv_quant_calibration_config_path=None, schedule_time_interval=0.03, enable_cpu_cache=False, cpu_cache_storage_size=2, cpu_cache_token_page_size=256, enable_disk_cache=False, disk_cache_storage_size=10, disk_cache_dir=None, enable_dp_prompt_cache_fetch=False, router_port=10017, detokenization_port=10004, http_server_port=10209, visual_port=10223, audio_port=10297, cache_port=10257, metric_port=10068, multi_level_kv_cache_port=10179, pd_node_infer_rpyc_ports=[10285], pd_node_id=288479957063433772586255832729030629155, pd_p_allowed_port_min=20000, pd_p_allowed_port_max=30000) WARNING 01-12 13:45:27 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:27 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:27 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:27 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:27 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. 2026-01-12 13:45:27 | server | 140078322902144 | INFO : server started on [0.0.0.0]:10068 INFO 01-12 13:45:27 [start_utils.py:37] init func start_metric_manager : init ok WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:33 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:33 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:33 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:33 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:45:33 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. 
INFO 01-12 13:45:33 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm WARNING 01-12 13:45:33 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm INFO 01-12 13:45:33 [manager.py:36] pub_to_httpserver sendhwm 1000 WARNING 01-12 13:45:33 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! 2026-01-12 13:45:33 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 47548) with fd 25 2026-01-12 13:45:33 | server | 140046992746048 | INFO : welcome ('127.0.0.1', 47548) INFO 01-12 13:45:38 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:45:38 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:45:38 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:45:38 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. WARNING 01-12 13:45:38 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:45:38 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm INFO 01-12 13:45:38 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. WARNING 01-12 13:45:40 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!! INFO 01-12 13:45:40 [model_rpc.py:67] Initialized RPC server for rank 0. INFO 01-12 13:45:40 [model_rpc.py:168] use ChunkedPrefillBackend INFO 01-12 13:45:43 [basemodel.py:169] Initial quantization. The default quantization method is none pid 45988 Loading model weights with 1 workers: 0%| | 0/1 [00:00<?, ?it/s]INFO 01-12 13:45:43 [embedding_weight.py:30] loaded weight vocab_size: 151936 pid 45988 Loading model weights with 1 workers: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.19it/s] INFO 01-12 13:45:43 [mem_utils.py:30] mode setting params: None INFO 01-12 13:45:43 [mem_utils.py:40] Model kv cache using mem_manager class: <class 'lightllm.common.kv_cache_mem_manager.mem_manager.MemoryManager'> INFO 01-12 13:45:43 [mem_manager.py:99] 69.76169700622559 GB space is available after load the model weight INFO 01-12 13:45:43 [mem_manager.py:99] 0.109375 MB is the size of one token kv cache INFO 01-12 13:45:43 [mem_manager.py:99] 653128 is the profiled max_total_token_num with the mem_fraction 0.9 INFO 01-12 13:45:43 [mem_manager.py:99] INFO 01-12 13:45:44 [basemodel.py:126] use prefill att backend: TritonAttBackend INFO 01-12 13:45:44 [basemodel.py:127] use decode att backend: TritonAttBackend warming up: 0%| | 0/12 [00:00<?, ?it/s]WARNING 01-12 13:46:16 [autotuner.py:169] No kernel config for silu_and_mul_fwd:v1 in {N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json,the performance may be suboptimal!You can use LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 to enable autotune. 
WARNING 01-12 13:46:16 [kernel_config.py:40] can not find config_path /ws/lightllm/common/all_kernel_configs/moe_silu_and_mul_kernel/{N=3072,out_dtype=torch.bfloat16}_MTT_S5000.json kernel name moe_silu_and_mul_kernel use default kernel setting warming up: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00, 3.41s/it] INFO 01-12 13:46:25 [basemodel.py:846] begin check max_len infer INFO 01-12 13:46:25 [basemodel.py:882] check max_len 8448 infer ok INFO 01-12 13:46:40 [base_backend.py:184] loaded model class <class 'lightllm.models.qwen3.model.Qwen3TpPartModel'> INFO 01-12 13:46:40 [manager.py:194] use req queue ChunkedPrefillQueue INFO 01-12 13:46:40 [start_utils.py:37] init func start_router_process : init ok INFO 01-12 13:46:40 [start_utils.py:37] init func start_detokenization_process : init ok INFO 01-12 13:46:40 [api_start.py:58] start process pid 38328 INFO 01-12 13:46:40 [api_start.py:59] http server pid 5689 [2026-01-12 13:46:40 +0800] [5689] [INFO] Starting gunicorn 23.0.0 [2026-01-12 13:46:40 +0800] [5689] [INFO] Listening at: http://0.0.0.0:8000 (5689) [2026-01-12 13:46:40 +0800] [5689] [INFO] Using worker: uvicorn.workers.UvicornWorker [2026-01-12 13:46:40 +0800] [5690] [INFO] Booting worker with pid: 5690 WARNING 01-12 13:46:46 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`. WARNING 01-12 13:46:46 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it. WARNING 01-12 13:46:46 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it. WARNING 01-12 13:46:46 [vllm_utils.py:18] vllm is not installed, you can't use the api of it. You can solve it by running `pip install vllm`. INFO 01-12 13:46:46 [communication_op.py:57] deep_ep is not installed, you can't use the api of it. INFO 01-12 13:46:46 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On WARNING 01-12 13:46:46 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm [2026-01-12 13:46:47 +0800] [5690] [INFO] Started server process [5690] [2026-01-12 13:46:47 +0800] [5690] [INFO] Waiting for application startup. INFO 01-12 13:46:47 [api_http.py:359] server start up 2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35962) with fd 26 2026-01-12 13:46:47 | server | 140046984353344 | INFO : welcome ('127.0.0.1', 35962) 2026-01-12 13:46:47 | server | 140078322902144 | INFO : accepted ('127.0.0.1', 35966) with fd 27 2026-01-12 13:46:47 | server | 140046975960640 | INFO : welcome ('127.0.0.1', 35966) INFO 01-12 13:46:48 [req_id_generator.py:34] ReqIDGenerator init finished INFO 01-12 13:46:48 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False> [2026-01-12 13:46:48 +0800] [5690] [INFO] Application startup complete. 
DEBUG 01-12 13:47:52 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:47:52 [manager.py:283] DEBUG 01-12 13:47:52 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:47:52 [manager.py:284] [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 13:48:13 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 13:48:55 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:48:55 [manager.py:283] DEBUG 01-12 13:48:55 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:48:55 [manager.py:284] DEBUG 01-12 13:49:58 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:49:58 [manager.py:283] DEBUG 01-12 13:49:58 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:49:58 [manager.py:284] DEBUG 01-12 13:51:02 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:02 [manager.py:283] DEBUG 01-12 13:51:02 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:51:02 [manager.py:284] INFO 01-12 13:51:09 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 INFO 01-12 13:51:09 [manager.py:422] router recive req id 8 cost time 0.05662369728088379 s DEBUG 01-12 13:51:09 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197069.7485027s req_ids:[8] DEBUG 01-12 13:51:09 [manager.py:320] INFO 01-12 13:51:09 [manager.py:55] detokenization recv req id 8 cost time 0.07959198951721191 s DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:11 [manager.py:251] dp_i 0 token used ratio: 6.12437378278071e-06 contain prompt cache tree unrefed token DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 estimated_peak_token_count: 39 DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:14 [manager.py:251] dp_i 0 token used ratio: 7.655467228475888e-06 contain prompt cache tree unrefed token INFO 01-12 13:51:16 [manager.py:163] detoken release req id 8 INFO 01-12 13:51:16 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:09 lightllm_req_id:8 first_token_cost:6353.325128555298ms total_cost_time:6671.096563339233ms,out_token_counter:17 mean_per_token_cost_time: 18.692437340231503ms prompt_token_num:4 gpu cache hit: False gpu_prompt_cache_len:0 gpu_prompt_cache_ratio:0.0 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:55472 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:16 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:16 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:16 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:16 [infer_batch.py:172] radix hold token num 21 DEBUG 01-12 13:51:16 
[infer_batch.py:172] mem manager can alloc token num 653107 DEBUG 01-12 13:51:16 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:16 [batch.py:56] router release req id 8 INFO 01-12 13:51:16 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:19 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 INFO 01-12 13:51:19 [manager.py:422] router recive req id 16 cost time 0.019651412963867188 s DEBUG 01-12 13:51:19 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197079.421846s req_ids:[16] DEBUG 01-12 13:51:19 [manager.py:320] INFO 01-12 13:51:19 [manager.py:55] detokenization recv req id 16 cost time 0.021979331970214844 s INFO 01-12 13:51:19 [manager.py:163] detoken release req id 16 INFO 01-12 13:51:19 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:19 lightllm_req_id:16 first_token_cost:102.96440124511719ms total_cost_time:407.08088874816895ms,out_token_counter:17 mean_per_token_cost_time: 17.88920514723834ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47146 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:19 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:19 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:19 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:19 [infer_batch.py:172] radix hold token num 35 DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager can alloc token num 653093 DEBUG 01-12 13:51:19 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:19 [batch.py:56] router release req id 16 INFO 01-12 13:51:19 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:22 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 INFO 01-12 13:51:22 [manager.py:422] router recive req id 24 cost time 0.015377998352050781 s DEBUG 01-12 13:51:22 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197082.1040523s req_ids:[24] DEBUG 01-12 13:51:22 [manager.py:320] INFO 01-12 13:51:22 [manager.py:55] detokenization recv req id 24 cost time 0.016767501831054688 s INFO 01-12 13:51:22 [manager.py:163] detoken release req id 24 INFO 01-12 13:51:22 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:22 lightllm_req_id:24 first_token_cost:86.02452278137207ms total_cost_time:432.842493057251ms,out_token_counter:17 mean_per_token_cost_time: 20.4010570750517ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47156 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:22 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:22 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:22 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:22 [infer_batch.py:172] radix hold token num 51 DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager can alloc token num 653077 DEBUG 01-12 13:51:22 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:22 [batch.py:56] router release req id 24 INFO 01-12 13:51:22 [shm_req_manager.py:111] all shm req has been release ok 
INFO 01-12 13:51:26 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 INFO 01-12 13:51:26 [manager.py:422] router recive req id 32 cost time 0.008630990982055664 s DEBUG 01-12 13:51:26 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197086.9206343s req_ids:[32] DEBUG 01-12 13:51:26 [manager.py:320] INFO 01-12 13:51:26 [manager.py:55] detokenization recv req id 32 cost time 0.011269092559814453 s INFO 01-12 13:51:27 [manager.py:163] detoken release req id 32 INFO 01-12 13:51:27 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:26 lightllm_req_id:32 first_token_cost:74.12481307983398ms total_cost_time:378.31759452819824ms,out_token_counter:17 mean_per_token_cost_time: 17.89369302637437ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:47160 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:51:27 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:51:27 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:51:27 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:51:27 [infer_batch.py:172] radix hold token num 68 DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager can alloc token num 653060 DEBUG 01-12 13:51:27 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:51:27 [batch.py:56] router release req id 32 INFO 01-12 13:51:27 [shm_req_manager.py:111] all shm req has been release ok INFO 01-12 13:51:44 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 INFO 01-12 13:51:44 [manager.py:422] router recive req id 40 cost time 0.009232759475708008 s DEBUG 01-12 13:51:44 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197104.2886696s req_ids:[40] DEBUG 01-12 13:51:44 [manager.py:320] INFO 01-12 13:51:44 [manager.py:55] detokenization recv req id 40 cost time 0.010197639465332031 s DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.00019597996104898273 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:47 [manager.py:251] dp_i 0 token used ratio: 0.0002955010350191693 contain prompt cache tree unrefed token DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 estimated_peak_token_count: 2022 DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0002618169792138754 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:50 [manager.py:251] dp_i 0 token used ratio: 0.0003613380531840619 contain prompt cache tree unrefed token DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0005052608370794086 not contain prompt cache tree 
unrefed token DEBUG 01-12 13:51:53 [manager.py:251] dp_i 0 token used ratio: 0.0006047819110495952 contain prompt cache tree unrefed token DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.0007456425080535515 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:56 [manager.py:251] dp_i 0 token used ratio: 0.000845163582023738 contain prompt cache tree unrefed token DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.0009875552724733895 not contain prompt cache tree unrefed token DEBUG 01-12 13:51:59 [manager.py:251] dp_i 0 token used ratio: 0.001087076346443576 contain prompt cache tree unrefed token DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.0012264058500018372 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:02 [manager.py:251] dp_i 0 token used ratio: 0.001325926923972024 contain prompt cache tree unrefed token DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.0014086059700395635 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:05 [manager.py:251] dp_i 0 token used ratio: 0.00150812704400975 contain prompt cache tree unrefed token DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.0015724329687289474 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:08 [manager.py:251] dp_i 0 token used ratio: 0.001671954042699134 contain prompt cache tree unrefed token DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0017331977805269412 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:11 [manager.py:251] dp_i 0 token used ratio: 0.0018327188544971277 contain prompt cache tree unrefed token DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 
estimated_peak_token_count: 2020 DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0018939625923249349 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:14 [manager.py:251] dp_i 0 token used ratio: 0.0019934836662951214 contain prompt cache tree unrefed token DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.0020531963106772333 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:17 [manager.py:251] dp_i 0 token used ratio: 0.00215271738464742 contain prompt cache tree unrefed token DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.002213961122475227 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:20 [manager.py:251] dp_i 0 token used ratio: 0.0023134821964454133 contain prompt cache tree unrefed token DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.0023731948408275256 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:23 [manager.py:251] dp_i 0 token used ratio: 0.002472715914797712 contain prompt cache tree unrefed token DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002509462157494396 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:26 [manager.py:251] dp_i 0 token used ratio: 0.002608983231464583 contain prompt cache tree unrefed token DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0026288874462586202 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:29 [manager.py:251] dp_i 0 token used ratio: 0.0027284085202288065 contain prompt cache tree unrefed token DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002746781641577149 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:32 [manager.py:251] dp_i 0 token used ratio: 0.002846302715547335 contain prompt cache tree unrefed token DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 
paused req num: 0 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.002861613650004287 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:35 [manager.py:251] dp_i 0 token used ratio: 0.0029611347239744735 contain prompt cache tree unrefed token DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.002939699415734741 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:38 [manager.py:251] dp_i 0 token used ratio: 0.0030392204897049277 contain prompt cache tree unrefed token DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 estimated_peak_token_count: 2020 DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.0030116608076824146 not contain prompt cache tree unrefed token DEBUG 01-12 13:52:41 [manager.py:251] dp_i 0 token used ratio: 0.003111181881652601 contain prompt cache tree unrefed token INFO 01-12 13:52:42 [manager.py:163] detoken release req id 40 INFO 01-12 13:52:42 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 13:51:44 lightllm_req_id:40 first_token_cost:91.23969078063965ms total_cost_time:58654.03771400452ms,out_token_counter:2000 mean_per_token_cost_time: 29.28139901161194ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:50156 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 13:52:42 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 13:52:42 [infer_batch.py:172] free a batch state: DEBUG 01-12 13:52:42 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 13:52:42 [infer_batch.py:172] radix hold token num 2068 DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager can alloc token num 651060 DEBUG 01-12 13:52:42 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 13:52:42 [batch.py:56] router release req id 40 INFO 01-12 13:52:42 [shm_req_manager.py:111] all shm req has been release ok DEBUG 01-12 13:52:50 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:52:50 [manager.py:283] DEBUG 01-12 13:52:50 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:52:50 [manager.py:284] DEBUG 01-12 13:53:53 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:53:53 [manager.py:283] DEBUG 01-12 13:53:53 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:53:53 [manager.py:284] DEBUG 01-12 13:54:56 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:54:56 [manager.py:283] DEBUG 01-12 13:54:56 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:54:56 [manager.py:284] DEBUG 01-12 13:56:00 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:56:00 [manager.py:283] DEBUG 01-12 13:56:00 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:56:00 [manager.py:284] DEBUG 01-12 13:57:03 [manager.py:283] 
dp_i 0 frozen token num: 0 DEBUG 01-12 13:57:03 [manager.py:283] DEBUG 01-12 13:57:03 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:57:03 [manager.py:284] DEBUG 01-12 13:58:06 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:58:06 [manager.py:283] DEBUG 01-12 13:58:06 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:58:06 [manager.py:284] DEBUG 01-12 13:59:09 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 13:59:09 [manager.py:283] DEBUG 01-12 13:59:09 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 13:59:09 [manager.py:284] INFO 01-12 14:00:06 [manager.py:417] recieved req X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 INFO 01-12 14:00:06 [manager.py:422] router recive req id 48 cost time 0.00828862190246582 s DEBUG 01-12 14:00:06 [manager.py:320] Prefill Batch: batch_id=-1, time:1768197606.2045314s req_ids:[48] DEBUG 01-12 14:00:06 [manager.py:320] INFO 01-12 14:00:06 [manager.py:55] detokenization recv req id 48 cost time 0.010654926300048828 s DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 4.746389681655051e-05 not contain prompt cache tree unrefed token DEBUG 01-12 14:00:06 [manager.py:251] dp_i 0 token used ratio: 0.0032091718621770926 contain prompt cache tree unrefed token DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 current batch size: 1 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 paused req num: 0 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 estimated_peak_token_count: 222 DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.0002878455677906934 not contain prompt cache tree unrefed token DEBUG 01-12 14:00:09 [manager.py:251] dp_i 0 token used ratio: 0.003449553533151235 contain prompt cache tree unrefed token INFO 01-12 14:00:10 [manager.py:163] detoken release req id 48 INFO 01-12 14:00:10 [manager.py:614] X-Request-Id: X-Session-Id: start_time:2026-01-12 14:00:06 lightllm_req_id:48 first_token_cost:94.14434432983398ms total_cost_time:3917.818784713745ms,out_token_counter:200 mean_per_token_cost_time: 19.118372201919556ms prompt_token_num:4 gpu cache hit: True gpu_prompt_cache_len:3 gpu_prompt_cache_ratio:0.75 cpu cache hit: False cpu_prompt_cache_len:0 cpu_prompt_cache_ratio:0.0 disk cache hit: False disk_prompt_cache_len:0 disk_prompt_cache_ratio:0.0 mtp_avg_token_per_step:1.0 127.0.0.1:53836 - "POST /generate HTTP/1.1" 200 DEBUG 01-12 14:00:10 [req_manager.py:78] freed all request size 1008 DEBUG 01-12 14:00:10 [infer_batch.py:172] free a batch state: DEBUG 01-12 14:00:10 [infer_batch.py:172] radix refed token num 0 DEBUG 01-12 14:00:10 [infer_batch.py:172] radix hold token num 2266 DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager can alloc token num 650862 DEBUG 01-12 14:00:10 [infer_batch.py:172] mem manager total size 653128 INFO 01-12 14:00:10 [batch.py:56] router release req id 48 INFO 01-12 14:00:10 [shm_req_manager.py:111] all shm req has been release ok DEBUG 01-12 14:00:12 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:00:12 [manager.py:283] DEBUG 01-12 14:00:12 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:00:12 [manager.py:284] DEBUG 01-12 14:01:16 
[manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:01:16 [manager.py:283] DEBUG 01-12 14:01:16 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:01:16 [manager.py:284] DEBUG 01-12 14:02:19 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:02:19 [manager.py:283] DEBUG 01-12 14:02:19 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:02:19 [manager.py:284] [2026-01-12 14:03:16 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:03:22 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:03:22 [manager.py:283] DEBUG 01-12 14:03:22 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:03:22 [manager.py:284] DEBUG 01-12 14:04:25 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:04:25 [manager.py:283] DEBUG 01-12 14:04:25 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:04:25 [manager.py:284] DEBUG 01-12 14:05:28 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:05:28 [manager.py:283] DEBUG 01-12 14:05:28 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:05:28 [manager.py:284] [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch [2026-01-12 14:06:28 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:06:31 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:06:31 [manager.py:283] DEBUG 01-12 14:06:31 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:06:31 [manager.py:284] DEBUG 01-12 14:07:35 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:07:35 [manager.py:283] DEBUG 01-12 14:07:35 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:07:35 [manager.py:284] DEBUG 01-12 14:08:38 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:08:38 [manager.py:283] DEBUG 01-12 14:08:38 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:08:38 [manager.py:284] DEBUG 01-12 14:09:41 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:09:41 [manager.py:283] DEBUG 01-12 14:09:41 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:09:41 [manager.py:284] DEBUG 01-12 14:10:44 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:10:44 [manager.py:283] DEBUG 01-12 14:10:44 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:10:44 [manager.py:284] DEBUG 01-12 14:11:47 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:11:47 [manager.py:283] DEBUG 01-12 14:11:47 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:11:47 [manager.py:284] [2026-01-12 14:11:57 +0800] [5689] [INFO] Handling signal: winch DEBUG 01-12 14:12:51 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:12:51 [manager.py:283] DEBUG 01-12 14:12:51 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:12:51 [manager.py:284] DEBUG 01-12 14:13:54 [manager.py:283] dp_i 0 frozen token num: 0 DEBUG 01-12 14:13:54 [manager.py:283] DEBUG 01-12 14:13:54 [manager.py:284] dp_i 0 estimated_peak_token_count: 0 DEBUG 01-12 14:13:54 [manager.py:284] ``` Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: shihaobai <42648726+shihaobai@users.noreply.github.com>
Co-authored-by: wangzaijun <wangzaijun@sensetime.com>
Co-authored-by: sangchengmeng <sangchengmeng@sensetime.com>
- Add get_custom_input_data_multimodal for loading multimodal JSONL data
- Add async_post_stream_openai_chat for OpenAI Chat Completions API
- Support --server_api openai_vision mode for vision model benchmarking
- Add --max_requests option to limit number of requests
- Add progress bar with tqdm for better UX
- Improve error handling and timeout configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
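For orientation, a minimal sketch of what the two pieces named above could look like: loading a multimodal JSONL file (optionally capped, as --max_requests would do) and turning one record into an OpenAI Chat Completions vision request. The JSONL field names text and image, the helper names, and the placeholder model string are illustrative assumptions, not the PR's actual implementation.

```python
import base64
import json
from pathlib import Path


def load_multimodal_jsonl(path, max_requests=None):
    """Read one JSON object per line, stopping after max_requests records."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            if max_requests is not None and len(records) >= max_requests:
                break
    return records


def to_chat_payload(record, model="vision-model"):
    """Build an OpenAI Chat Completions request body with an inline base64 image."""
    image_b64 = base64.b64encode(Path(record["image"]).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": record["text"]},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        "stream": False,
    }
```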
…mode

In non-streaming mode (openai_vision), the response contains only one time point, making first_token_time and decode_token_time statistics meaningless. This change skips these statistics when the lists are empty, showing only QPS and request_time metrics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
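A minimal sketch of the guard this commit describes, assuming the metrics are collected in plain Python lists (all names are illustrative): when the per-token timing lists stay empty, as they do in non-streaming openai_vision runs, only the QPS and request-time statistics are printed.

```python
def report_metrics(first_token_times, decode_token_times, request_times, total_seconds):
    """Hypothetical reporting helper illustrating the guard described above."""
    if request_times and total_seconds > 0:
        print(f"qps: {len(request_times) / total_seconds:.2f}")
        print(f"mean request_time: {sum(request_times) / len(request_times):.3f}s")

    # In non-streaming mode these lists stay empty, so the per-token
    # statistics are skipped rather than reported as meaningless zeros.
    if first_token_times:
        print(f"mean first_token_time: {sum(first_token_times) / len(first_token_times):.3f}s")
    if decode_token_times:
        print(f"mean decode_token_time: {sum(decode_token_times) / len(decode_token_times):.3f}s")
```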
Summary
- Add openai_vision server_api mode for benchmarking vision models with multimodal inputs
- Add get_custom_input_data_multimodal function for loading JSONL data with images
- Add --max_requests option to limit the number of requests for testing
- Skip first_token_time and decode_token_time statistics in non-streaming mode (only show QPS and request_time)

Test plan

- Run the benchmark with --server_api openai_vision and multimodal JSONL data
- Verify that first_token_time and decode_token_time are skipped in non-streaming mode

🤖 Generated with Claude Code
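For reference, a simplified sketch of the kind of non-streaming request loop the test plan exercises: aiohttp with an explicit timeout and a tqdm progress bar. It sends requests one at a time for clarity; the PR's async_post_stream_openai_chat may overlap requests, and the endpoint shown is just the standard OpenAI-compatible route, so adjust host, port, and path to your server.

```python
import asyncio
import time

import aiohttp
from tqdm import tqdm


async def post_chat(session, url, payload):
    """Send one non-streaming chat request and return its wall-clock latency."""
    start = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        await resp.json()
    return time.perf_counter() - start


async def run_benchmark(url, payloads, timeout_s=300.0):
    """Issue requests sequentially with a progress bar and a per-request timeout."""
    request_times = []
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for payload in tqdm(payloads, desc="requests"):
            request_times.append(await post_chat(session, url, payload))
    return request_times


# Example:
# asyncio.run(run_benchmark("http://127.0.0.1:8000/v1/chat/completions", payloads))
```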