Guide to Service-Oriented Performance Evaluationο
Introductionο
AISBench Benchmark provides service-oriented performance evaluation capabilities. For streaming inference scenarios, it systematically evaluates key performance indicators of model services in real-world deployment environmentsβsuch as response latency (e.g., TTFT, Inter-Token Latency), throughput capacity (e.g., QPS, TPUT), and concurrent processing capabilityβby accurately recording the send time of each request, the return time of each stage, and the response content.
Users can flexibly control request content, request intervals, concurrent quantities, and other parameters by configuring service-oriented backend parameters to adapt to different evaluation scenarios (e.g., low-concurrency latency-sensitive scenarios, high-concurrency throughput-priority scenarios). The evaluation supports automated execution and outputs structured results, facilitating horizontal comparison of service performance differences across different models, deployment solutions, and hardware configurations.
Quick Start for Service-Oriented Performance Evaluationο
Command Meaningο
The meaning of the AISBench service-oriented performance evaluation command is the same as explained in π Tool Quick Start/Command Meaning. On this basis, you need to add --mode perf or -m perf to enter the performance evaluation scenario. Take the following AISBench command as an example:
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf
Among them:
--modelsspecifies the model task, i.e., thevllm_api_stream_chatmodel task.--datasetsspecifies the dataset task, i.e., thedemo_gsm8k_gen_4_shot_cot_chat_promptdataset task.--summarizerspecifies the result presentation task, i.e., thedefault_perfresult presentation task (if--summarizeris not specified, thedefault_perftask is used by default in accuracy evaluation scenarios). It is generally used by default and does not need to be specified in the command line; subsequent commands will omit this parameter.
Task Meaning Query (Optional)ο
Specific information (introduction, usage constraints, etc.) about the selected model task vllm_api_stream_chat, dataset task demo_gsm8k_gen_4_shot_cot_chat_prompt, and result presentation task default_perf can be queried from the following links:
--models: π Service-Oriented Inference Backend--datasets: π Open-Source Datasets β π Detailed Introduction--summarizer: π Result Summary Tasks
Preparations Before Running the Commandο
--models: To use thevllm_api_stream_chatmodel task, you need to prepare an inference service that supports thev1/chat/completionssub-service. You can refer to π VLLM Launch OpenAI-Compatible Server to start the inference service.--datasets: To use thedemo_gsm8k_gen_4_shot_cot_chat_promptdataset task, you need to prepare the GSM8K dataset, which can be downloaded from π GSM8K Dataset Compressed Package Provided by OpenCompass. Deploy the unzippedgsm8k/folder to theais_bench/datasetsfolder in the root path of the AISBench evaluation tool.
Modify Configuration Files Corresponding to Tasksο
Each model task, dataset task, and result presentation task corresponds to a configuration file. These files need to be modified before running the command. The paths of these configuration files can be queried by adding --search to the original AISBench command. For example:
# Note: Whether to add "--mode perf" to the search command does not affect the search result
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf --search
β οΈ Note: Executing the command with
--searchwill print the absolute paths of the configuration files corresponding to the tasks.
Executing the query command will yield the following results:
06/28 11:52:25 - AISBench - INFO - Searching configs...
ββββββββββββββββ€ββββββββββββββββββββββββββββββββββββββββ€βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Task Type β Task Name β Config File Path β
ββββββββββββββββͺββββββββββββββββββββββββββββββββββββββββͺβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β --models β vllm_api_stream_chat β /your_workspace/benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py β
ββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β --datasets β demo_gsm8k_gen_4_shot_cot_chat_prompt β /your_workspace/benchmark/ais_bench/benchmark/configs/datasets/demo/demo_gsm8k_gen_4_shot_cot_chat_prompt.py β
ββββββββββββββββ§ββββββββββββββββββββββββββββββββββββββββ§βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The configuration file
demo_gsm8k_gen_4_shot_cot_chat_prompt.pyfor the dataset task in the quick start does not require additional modifications. For an introduction to the content of the dataset task configuration file, refer to π Configure Open-Source Datasets.
The model configuration file vllm_api_stream_chat.py contains configuration content related to model operation and needs to be modified according to the actual situation. The content that needs to be modified in the quick start is marked with comments:
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
models = [
dict(
attr="service",
type=VLLMCustomAPIChatStream,
abbr='vllm-api-stream-chat',
path="", # Specify the absolute path to the model's serialized vocabulary file; generally, this is the path to the model weight folder
model="DeepSeek-R1", # Specify the name of the model loaded on the server; configure it according to the actual model name pulled by the VLLM inference service (set to an empty string to obtain it automatically)
request_rate = 0, # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.1, all requests are sent at once
retry = 2, # Maximum number of retries per request
host_ip = "localhost", # Specify the IP address of the inference service
host_port = 8080, # Specify the port of the inference service
max_out_len = 512, # Maximum number of tokens output by the inference service
batch_size=1, # Maximum concurrency for sending requests
generation_kwargs = dict( # Model inference parameters shall be configured with reference to the VLLM documentation. The AISBench evaluation tool does not process these parameters, which will be included in the sent request.
temperature = 0.5,
top_k = 10,
top_p = 0.95,
seed = None,
repetition_penalty = 1.03,
ignore_eos = True, # The inference service output ignores EOS (the output length will definitely reach max_out_len)
)
)
]
Execute the Commandο
After modifying the configuration file, execute the command to start the service-oriented performance evaluation (β οΈ It is recommended to add --debug for the first execution to print detailed logs to the screen, which makes it easier to handle errors during the request inference service process):
# Add --debug to the command line
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --debug
View Performance Resultsο
An example of the on-screen performance results is as follows:
06/05 20:22:24 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset:
ββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββββββββββββββ€βββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββββββββββββββͺβββββββ‘
β E2EL β total β 2048.2945 ms β 1729.7498 ms β 3450.96 ms β 2491.8789 ms β 2750.85 ms β 3184.9186 ms β 3424.4354 ms β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β TTFT β total β 50.332 ms β 50.6244 ms β 52.0585 ms β 50.3237 ms β 50.5872 ms β 50.7566 ms β 50.0551 ms β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β TPOT β total β 10.6965 ms β 10.061 ms β 10.8805 ms β 10.7495 ms β 10.7818 ms β 10.808 ms β 10.8582 ms β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β ITL β total β 10.6965 ms β 7.3583 ms β 13.7707 ms β 10.7513 ms β 10.8009 ms β 10.8358 ms β 10.9322 ms β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β InputTokens β total β 1512.5 β 1481.0 β 1566.0 β 1511.5 β 1520.25 β 1536.6 β 1563.06 β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β OutputTokens β total β 287.375 β 200.0 β 407.0 β 280.0 β 322.75 β 374.8 β 403.78 β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββββββββββββββΌβββββββ€
β OutputTokenThroughput β total β 115.9216 token/s β 107.6555 token/s β 116.5352 token/s β 117.6448 token/s β 118.2426 token/s β 118.3765 token/s β 118.6388 token/s β 8 β
ββββββββββββββββββββββββββββ§ββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββββββββββββββ§βββββββ
ββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββββββββ‘
β Benchmark Duration β total β 19897.8505 ms β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Total Requests β total β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Failed Requests β total β 0 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Success Requests β total β 8 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Concurrency β total β 0.9972 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Max Concurrency β total β 1 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Request Throughput β total β 0.4021 req/s β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Total Input Tokens β total β 12100 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Prefill Token Throughput β total β 17014.3123 token/s β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Total generated tokens β total β 2299 β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Input Token Throughput β total β 608.7438 token/s β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Output Token Throughput β total β 115.7835 token/s β
ββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββββββββ€
β Total Token Throughput β total β 723.5273 token/s β
ββββββββββββββββββββββββββββ§ββββββββββ§βββββββββββββββββββββ
06/05 20:22:24 - AISBench - INFO - Performance Result files locate in outputs/default/20250605_202220/performances/vllm-api-stream-chat.
π‘ For the meaning of specific performance parameters, refer to π Explanation of Performance Evaluation Results
Viewing Performance Detailsο
After executing the AISBench command, more details of task execution are finally saved to the default output path. This output path is indicated in the printed logs during runtime, for example:
06/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250628_151326
This log indicates that the task execution details are saved in outputs/default/20250628_151326 under the path where the command is executed.
After the command execution is completed, the task execution details in outputs/default/20250628_151326 are as follows:
20250628_151326 # Unique directory generated based on timestamp for each experiment
βββ configs # All automatically stored dumped configuration files
βββ logs # Logs during execution; if --debug is added to the command, no process logs will be saved to disk (all are printed directly)
β βββ performance/ # Log files of the inference stage
βββ performance # Performance evaluation results
β βββ vllm-api-stream-chat/ # Name of the "service-oriented model configuration", corresponding to the abbr parameter in models of the model task configuration file
β βββ gsm8kdataset.csv # Single request performance output (CSV), consistent with the Performance Parameters table in the printed performance results
β βββ gsm8kdataset.json # End-to-end performance output (JSON), consistent with the Common Metric table in the printed performance results
β βββ gsm8kdataset_details.json # FullζηΉζ₯εΏ (JSON) [Note: "ζηΉζ₯εΏ" refers to detailed timestamped logging of key events]
β βββ gsm8kdataset_plot.html # Request concurrency visualization report (HTML)
π‘ It is recommended to open the gsm8kdataset_plot.html (request concurrency visualization report) with browsers such as Chrome or Edge. You can view the latency of each request and the number of concurrent service tasks perceived by the client at each moment:
For instructions on using this HTML visualization file, please refer to π Instructions for Using Performance Test Concurrency Visualization Charts
Preconditions for Service-Oriented Performance Evaluationο
Before performing service-oriented inference, the following conditions must be met:
Accessible service-oriented model service: Ensure the service process is directly accessible in the current environment.
Dataset preparation:
Open-source datasets: Select a dataset from π Open-Source Datasets, and choose the dataset task to execute in the βDetailed Introductionβ document corresponding to the dataset. Prepare the dataset files by referring to the βDetailed Introductionβ document of the selected dataset task. It is recommended to manually place the open-source dataset in the default directory
ais_bench/datasets/, and the program will automatically load the dataset files during task execution.Randomly synthesized datasets: Select
synthetic_genas the dataset task. For other configurations, refer to π Randomly Synthesized Datasets.Custom datasets: No need to specify a dataset task. For other configurations, refer to π Custom Datasets.
Service-oriented model backend configuration: Select a sub-service with the interface type
streaming interfacefrom Service-Oriented Inference Backends (β οΈ Other types are not supported).
Main Functional Scenariosο
Single-Task Evaluationο
Refer to [Quick Start for Service-Oriented Performance Evaluation](#Quick Start for Service-Oriented Performance Evaluation)
Multi-Task Evaluationο
Supports simultaneous configuration of multiple models or multiple dataset tasks, enabling batch evaluation through a single command, which is suitable for serial execution of multiple test commands.
Command Descriptionο
Users can specify multiple configuration tasks through the --models and --datasets parameters. The number of subtasks is the product of the number of tasks configured in --models and --datasets, that is, one model configuration and one dataset configuration form a subtask. Example:
ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf
The above command specifies 2 model tasks (vllm_api_general_stream and vllm_api_stream_chat) and 2 dataset tasks (gsm8k_gen_4_shot_cot_str and aime2024_gen_0_shot_str), and will execute the following 4 combined performance test tasks:
vllm_api_general_stream model task + gsm8k_gen_4_shot_cot_str dataset task
vllm_api_general_stream model task + aime2024_gen_0_shot_str dataset task
vllm_api_stream_chat model task + gsm8k_gen_4_shot_cot_str dataset task
vllm_api_stream_chat model task + aime2024_gen_0_shot_str dataset task
Modifying Configuration Files Corresponding to Tasksο
The actual paths of the configuration files corresponding to model tasks and dataset tasks can be queried by executing the command with --search:
ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf --search
The following configuration files to be modified are found:
βββββββββββββββ€βββββββββββββββββββββββββββ€ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Task Type β Task Name β Config File Path β
βββββββββββββββͺβββββββββββββββββββββββββββͺββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β --models β vllm_api_general_stream β /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py β
βββββββββββββββΌβββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β --models β vllm_api_stream_chat β /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py β
βββββββββββββββΌβββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β --datasets β gsm8k_gen_4_shot_cot_str β /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py β
βββββββββββββββΌβββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β --datasets β aime2024_gen_0_shot_str β /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py β
βββββββββββββββ§βββββββββββββββββββββββββββ§ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Refer to π Description of Configuration Parameters for Service-Oriented Inference Backends to configure the configuration files corresponding to the model tasks
vllm_api_general_streamandvllm_api_stream_chataccording to the actual situation.Refer to π Configuring Open-Source Datasets to configure the configuration files corresponding to the dataset tasks
gsm8k_gen_4_shot_cot_strandaime2024_gen_0_shot_straccording to the actual situation. Note: If the dataset is placed in the default directoryais_bench/datasets/, generally no configuration is needed.
Execute the Evaluation Commandο
Execute the command:
# For the first run in the service-oriented performance evaluation scenario, it is recommended to add --debug to print the inference process
ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf --debug
During execution, a timestamp directory will be created under the path specified by π --work-dir (default: outputs/default/) to save execution details.
After the 4 performance evaluation tasks are completed, the performance results of all 4 tasks will be printed at once:
07/01 10:57:19 - AISBench - INFO - Performance Results of task: vllm-api-general-stream/gsm8kdataset:
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββ€ββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€βββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββͺββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺβββββββ‘
β E2EL β total β 2754.0929 ms β 2189.0804 ms β 3366.1463 ms β 2753.1668 ms β 3048.2929 ms β 3222.573 ms β 3303.3894 ms β 1319 β
......
ββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββββββββ‘
β Benchmark Duration β total β 38039.9928 ms β
......
07/01 10:57:19 - AISBench - INFO - Performance Result files locate in outputs/default/20250701_105506/performances/vllm-api-general-stream.
07/01 10:57:19 - AISBench - INFO - Performance Results of task: vllm-api-general-stream/aime2024dataset:
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββͺβββββββββββββββββͺβββββββββββββββββͺββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββ‘
β E2EL β total β 2868.1822 ms β 2277.1049 ms β 3307.2084 ms β 2941.6767 ms β 3158.5361 ms β 3220.2141 ms β 3307.0174 ms β 30 β
......
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββββ‘
β Benchmark Duration β total β 3346.9782 ms β
......
07/01 10:57:19 - AISBench - INFO - Performance Result files locate in outputs/default/20250701_105506/performances/vllm-api-general-stream.
07/01 10:57:19 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset:
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€βββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺβββββββ‘
β E2EL β total β 2753.3518 ms β 2189.5185 ms β 3339.4463 ms β 2755.8153 ms β 3039.7431 ms β 3219.6642 ms β 3313.0408 ms β 1319 β
......
ββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββββββββ‘
β Benchmark Duration β total β 38101.2396 ms β
......
07/01 10:57:19 - AISBench - INFO - Performance Result files locate in outputs/default/20250701_105506/performances/vllm-api-stream-chat.
07/01 10:57:19 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/aime2024dataset:
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββ€ββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββͺββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββ‘
β E2EL β total β 2745.4115 ms β 2187.5882 ms β 3288.4635 ms β 2820.7541 ms β 2988.8338 ms β 3188.436 ms β 3273.7475 ms β 30 β
......
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββββ‘
β Benchmark Duration β total β 3335.7672 ms β
......
07/01 10:57:19 - AISBench - INFO - Performance Result files locate in outputs/default/20250701_105506/performances/vllm-api-stream-chat.
Meanwhile, the final generated directory structure is as follows:
# Under output/default
20250701_105506/ # Output directory corresponding to the task creation time
βββ configs # A combined configuration file of model tasks, dataset tasks, and result presentation tasks
β βββ 20250701_105506_29250.py
βββ logs # Contains logs of inference and accuracy evaluation stages; if --debug is added to the command, logs will be printed directly to the screen without being saved to disk
β βββ performance # Log files of the inference stage
βββ performances # Performance evaluation results
βββ vllm-api-general-stream # Name of the "service-oriented model configuration", corresponding to the `abbr` parameter in `models` of the model task configuration file
β βββ aime2024dataset.csv # Single-request performance output (CSV)
β βββ aime2024dataset_details.json # End-to-end performance output (JSON)
β βββ aime2024dataset.json # Full timestamped log (JSON)
β βββ aime2024dataset_plot.html # Request concurrency visualization report (HTML)
β βββ gsm8kdataset.csv
β βββ gsm8kdataset_details.json
β βββ gsm8kdataset.json
β βββ gsm8kdataset_plot.html
βββ vllm-api-stream-chat
βββ aime2024dataset.csv
βββ aime2024dataset_details.json
βββ aime2024dataset.json
βββ aime2024dataset_plot.html
βββ gsm8kdataset.csv
βββ gsm8kdataset_details.json
βββ gsm8kdataset.json
βββ gsm8kdataset_plot.html
β οΈ Notes:
In multi-task performance evaluation scenarios, the dataset tasks specified by
--datasetsmust belong to different dataset types; otherwise, performance data will be missing due to overwriting. For example, you cannot specify bothaime2024_gen_0_shot_strandaime2024_gen_0_shot_chat_promptthrough--datasets.
Custom Sequence Length Evaluationο
1 Configure Input and Output Distribution for Custom Sequence Datasetο
For custom sequence length evaluation, you need to specify the special dataset task synthetic_gen:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen -m perf
If you want to conduct performance tests for a specific input length distribution, you first need to configure the distribution configuration file synthetic_config.py for synthetic_gen (available at synthetic_config.py). The configuration content is as follows:
synthetic_config = {
"Type": "string",
"RequestCount": 1000, # Number of requests (number of dataset entries)
"StringConfig": {
"Input": {
"Method": "uniform",
"Params": {"MinValue": 50, "MaxValue": 500} # Input length: 50-500
},
"Output": {
"Method": "uniform",
"Params": {"MinValue": 20, "MaxValue": 200} # Output length: 20-200
}
}
}
π‘ For more custom input and output distributions, refer to π Random Synthetic Dataset
2 Ensure the Inference Service Reaches the Set Maximum Outputο
To ensure the inference service achieves the set maximum output, you need to configure the special post-processing parameter ignore_eos = True in generation_kwargs of the π Service-Oriented Model Configuration to control the maximum output length of requests (preventing early termination).
For example, modify the content of the configuration file vllm_api_stream_chat.py corresponding to the vllm_api_stream_chat model task:
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
models = [
dict(
attr="service",
type=VLLMCustomAPIChatStream,
abbr='vllm-api-stream-chat',
# Configure other model task parameters such as port and IP by yourself
generation_kwargs = dict(
# .....
ignore_eos = True, # The inference service ignores EOS during output (output length will definitely reach max_out_len)
)
)
]
3 Start Performance Evaluationο
Execute the following command:
ais_bench --models vllm_api_stream_chat --datasets synthetic_gen -m perf
After completion, the output directory structure is the same as shown in the [Multi-Task Evaluation](#Multi-Task Evaluation) section. Corresponding CSV/JSON/HTML files will be generated under performance/vllm-api-stream-chat/syntheticdataset*.
β οΈ Notes:
Some model tasks do not support retrieving the actual number of returned token IDs from the service. AISBench Benchmark will first convert the string returned by the server into corresponding Token IDs using a Tokenizer, then count the actual generated Token length. This statistical value may slightly differ from the Token count directly reported by the server.
Some service-oriented backends do not support the
ignore_eospost-processing parameter. In such cases, the actual number of outputTokensmay not reach the configured maximum output length. You need to configure other post-processing parameters to achieve the maximum output length (e.g., parameters that limit the minimum output).
Fixed Request Count Evaluationο
When the dataset scale is too large and you only want to perform performance tests on a subset of samples, you can use the π --num-prompts parameter to specify the number of data entries to read. An example is as follows:
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --num-prompts 1
The above command only performs inference on the first record in the sample dataset and measures its performance.
β οΈ Note: Currently, the dataset is read sequentially in the default queue order; random sampling or shuffling is not supported.
Other Functional Scenariosο
Recalculation of Performance Resultsο
The evaluation tool for the main functional scenarios of performance testing executes a complete workflow of performance sampling β calculation β summarization:
graph LR;
A[Execute inference based on the given dataset] --> B((Performance timestamp data))
B --> C[Calculate metrics based on timestamp data]
C --> D((Performance data))
D --> E[Generate a summary report based on performance data]
E --> F((Present results))
Each link in the execution workflow is independently decoupled. Calculation and summarization can be repeatedly performed based on the results of performance sampling. If the directly printed performance data does not include data for relevant dimensions (e.g., missing 95th percentile data), you need to modify some configurations for recalculation. The specific operations are as follows:
Assume the command used for the previous performance evaluation is:
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf
At the same time, it is indicated that the timestamp for disk storage is 20250628_151326, and the printed Performance Parameters table is as follows:
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€βββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P75 β P90 β P99 β N β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺβββββββ‘
β E2EL β total β 2753.3518 ms β 2189.5185 ms β 3339.4463 ms β 2755.8153 ms β 3039.7431 ms β 3219.6642 ms β 3313.0408 ms β 1319 β
......
If you want to view performance data for the βP95β dimension, you need to modify the content of the configuration file corresponding to the default result presentation task default_perf for --summarizer. The path of default_perf can be queried using the --search command:
07/01 15:51:19 - AISBench - INFO - Searching configs...
ββββββββββββββββ€βββββββββββββββ€ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Task Type β Task Name β Config File Path β
ββββββββββββββββͺβββββββββββββββͺββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ‘
β --summarizer β default_perf β /your_workspace/ais_bench/benchmark/configs/summarizers/perf/default_perf.py β
ββββββββββββββββ§βββββββββββββββ§ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Modify the content of default_perf.py:
from mmengine.config import read_base
from ais_bench.benchmark.summarizers import DefaultPerfSummarizer
from ais_bench.benchmark.calculators import DefaultPerfMetricCalculator
summarizer = dict(
type=DefaultPerfSummarizer,
calculator=dict(
type=DefaultPerfMetricCalculator,
stats_list=["Average", "Min", "Max", "Median", "P95"],
)
)
Among them, stats_list can carry data for up to 8 performance dimensions at the same time.
After modification, you can execute the following command to recalculate performance metrics:
## Note: --summarizer default_perf must be specified
ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf_viz --pressure --debug --reuse 20250628_151326
The printed performance results are as follows:
07/01 16:08:01 - AISBench - INFO - Performance Results of task: vllm-api-stream-chat/gsm8kdataset:
ββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€ββββββββββββββββββ€βββββββββββββββββ€ββββββββββββββββββ€ββββββ
β Performance Parameters β Stage β Average β Min β Max β Median β P95 β N β
ββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββββͺββββββββββββββββββͺββββββββββββββββββͺβββββββββββββββββͺββββββββββββββββββͺββββββ‘
β E2EL β total β 2761.6153 ms β 2493.8016 ms β 3086.0523 ms β 2848.9603 ms β 3021.0043 ms β 8 β
......
ββββββββββββββββββββββββββββ€ββββββββββ€ββββββββββββββββββββ
β Common Metric β Stage β Value β
ββββββββββββββββββββββββββββͺββββββββββͺββββββββββββββββββββ‘
β Benchmark Duration β total β 3090.7835 ms β
......
07/01 16:08:01 - AISBench - INFO - Performance Result files locate in outputs/default/20250701_160106/performances/vllm-api-stream-chat.
β οΈ
gsm8kdataset.csv,gsm8kdataset_details.json, andgsm8kdataset_plot.htmlunder20250628_151326/performance/will be regenerated (overwriting the original files).
Specifications for Service-Oriented Performance Testingο
The scale of service-oriented performance testing determines the resource usage of the AISBench evaluation tool. Taking [Custom Sequence Length Evaluation](#Custom Sequence Length Evaluation) as an example, the test scale is mainly determined by the total number of requests (RequestCount), dataset input token length (Input), and output token length (Output). When tested on a CPU of model Intel(R) Xeon(R) Platinum 8480P, the resource usage under typical test scales is approximately as follows:
Total Number of Requests ( |
Dataset Input Token Length ( |
Output Token Length ( |
Maximum Memory Usage (GB) |
Maximum Disk Usage (GB) |
Performance Data Calculation Time (s) |
Remarks |
|---|---|---|---|---|---|---|
10,000 |
1024 |
1024 |
< 16 |
0.12 |
3 |
|
10,000 |
1024 |
4096 |
< 16 |
0.16 |
4 |
|
10,000 |
4096 |
4096 |
< 16 |
0.17 |
6 |
|
50,000 |
4096 |
4096 |
< 32 |
0.80 |
30 |
|
250,000 |
4096 |
4096 |
< 64 |
4.00 |
150 |
Maximum specification |
β οΈ The maximum memory usage, maximum disk usage, and calculation time of performance data are roughly proportional to the value of (
RequestCount Γ (Input + Output)). The maximum specification supported by a single machine in AISBench isRequestCount Γ (Input + Output) = 250,000 Γ (4096 + 4096) = 2,024,000,000.