Quick Start๏
Command Meaning๏
A single or multiple evaluation tasks executed by the AISBench command are defined by a combination of model tasks (single or multiple), dataset tasks (single or multiple), and result presentation tasks (single). Other command-line options of AISBench specify the scenario of the evaluation task (e.g., accuracy evaluation scenario, performance evaluation scenario). Take the following AISBench command as an example:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example
This command does not specify other command-line options, so it defaults to an accuracy evaluation scenario task, where:
--modelsspecifies the model task, i.e., thevllm_api_general_chatmodel task.--datasetsspecifies the dataset task, i.e., thedemo_gsm8k_gen_4_shot_cot_chat_promptdataset task.--summarizerspecifies the result presentation task, i.e., theexampleresult presentation task (if--summarizeris not specified, theexampletask is used by default in the accuracy evaluation scenario). It is generally recommended to use the default, so there is no need to specify it in the command line, and subsequent commands will omit it.
Task Meaning Query (Optional)๏
The specific information (introduction, usage constraints, etc.) of the selected model task vllm_api_general_chat, dataset task demo_gsm8k_gen_4_shot_cot_chat_prompt, and result presentation task example can be queried from the following links respectively:
--models: ๐ Service-Oriented Inference Backend--datasets: ๐ Open Source Datasets โ ๐ Detailed Introduction--summarizer: ๐ Result Summary Tasks
Preparations Before Running the Command๏
--models: To use thevllm_api_general_chatmodel task, you need to prepare an inference service that supports thev1/chat/completionssub-service. You can refer to ๐ Launching an OpenAI-Compatible Server with VLLM to start the inference service.--datasets: To use thedemo_gsm8k_gen_4_shot_cot_chat_promptdataset task, you need to prepare the gsm8k dataset, which can be downloaded from ๐ the gsm8k dataset zip package provided by opencompass. Deploy the unzippedgsm8k/folder to theais_bench/datasetsfolder in the root directory of the AISBench evaluation tool.
Modification of Configuration Files Corresponding to Tasks๏
Each model task, dataset task, and result presentation task corresponds to a configuration file. You need to modify the content of these configuration files before running the command. The paths of these configuration files can be queried by adding --search to the original AISBench command. For example:
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search
โ ๏ธ Note: Executing the command with the
searchoption will print the absolute paths of the configuration files corresponding to the tasks.
Executing the query command will yield the following results:
06/28 11:52:25 - AISBench - INFO - Searching configs...
โโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Task Type โ Task Name โ Config File Path โ
โโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโก
โ --models โ vllm_api_general_chat โ /your_workspace/benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ --datasets โ demo_gsm8k_gen_4_shot_cot_chat_prompt โ /your_workspace/benchmark/ais_bench/benchmark/configs/datasets/demo/demo_gsm8k_gen_4_shot_cot_chat_prompt.py โ
โโโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโงโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
The dataset task configuration file
demo_gsm8k_gen_4_shot_cot_chat_prompt.pyin the quick start does not require additional modifications. For an introduction to the content of the dataset task configuration file, please refer to ๐ Configuring Open Source Datasets.
The model configuration file vllm_api_general_chat.py contains configuration content related to model operation and needs to be modified according to the actual situation. The content that needs to be modified in the quick start is marked with comments.
from ais_bench.benchmark.models import VLLMCustomAPIChat
models = [
dict(
attr="service",
type=VLLMCustomAPIChat,
abbr='vllm-api-general-chat',
path="", # Specify the absolute path of the model serialized vocabulary file (configuration is generally not required for accuracy testing scenarios).
model="DeepSeek-R1", # Specify the name of the model loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to get it automatically)
stream=False,
request_rate=0, # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.1, all requests are sent at once
retry=2, # Maximum number of retries per request
api_key="", # Custom API key, default empty string
host_ip="localhost", # Specify the IP of the inference service
host_port=8080, # Specify the port of the inference service
url="", # Custom access path for the inference service (required when the base URL is not http://host_ip:host_port, and will ignore host_ip and host_port)
max_out_len=512, # Maximum number of tokens output by the inference service
batch_size=1, # Maximum concurrency for sending requests
trust_remote_code=False, # Whether to trust remote code in the tokenizer, default False;
generation_kwargs=dict( # Model inference parameters shall be configured with reference to the VLLM documentation. The AISBench evaluation tool does not process these parameters, which will be included in the sent request.
temperature=0.01,
ignore_eos=False,
)
)
]
Execute the Command๏
After modifying the configuration files, execute the command to start the service-oriented accuracy evaluation (โ ๏ธ It is recommended to add --debug for the first execution, which can print specific logs to the screen, making it easier to handle errors during the process of requesting the inference service):
# Add --debug to the command line
ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug
View Task Execution Details๏
After executing the AISBench command, the details of the task execution will be continuously saved to the default output path. This output path is indicated in the screen logs during runtime, for example:
06/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250628_151326
This log indicates that the details of the task execution are saved in outputs/default/20250628_151326 under the path where the command is executed.
After the command execution is completed, the details of the task execution in outputs/default/20250628_151326 are as follows:
20250628_151326/
โโโ configs # A combined configuration file of the model task, dataset task, and result presentation task
โ โโโ 20250628_151326_29317.py
โโโ logs # Logs during execution. If --debug is added to the command, no process logs will be saved to disk (all will be printed directly)
โ โโโ eval
โ โ โโโ vllm-api-general-chat
โ โ โโโ demo_gsm8k.out # Logs of the accuracy evaluation process based on the inference results in the predictions/ folder
โ โโโ infer
โ โโโ vllm-api-general-chat
โ โโโ demo_gsm8k.out # Inference process logs
โโโ predictions
โ โโโ vllm-api-general-chat
โ โโโ demo_gsm8k.json # Inference results (all outputs returned by the inference service)
โโโ results
โ โโโ vllm-api-general-chat
โ โโโ demo_gsm8k.json # Raw scores calculated by the accuracy evaluation
โโโ summary
โโโ summary_20250628_151326.csv # Final accuracy scores presented in table format
โโโ summary_20250628_151326.md # Final accuracy scores presented in markdown format
โโโ summary_20250628_151326.txt # Final accuracy scores presented in text format
โ ๏ธ Note: The content of the saved task execution details varies across different evaluation scenarios. Please refer to the guide for specific evaluation scenarios for details.
Output Results๏
Since there are only 8 pieces of data, the results will be generated quickly. An example of the result display is as follows:
dataset version metric mode vllm_api_general_chat
----------------------- -------- -------- ----- ----------------------
demo_gsm8k 401e4c accuracy gen 62.50