Model Configuration Instructions
AISBench Benchmark supports two types of model backends: service-oriented inference backends and local model backends.
⚠️ Note: The two types of backends cannot be specified simultaneously.
Service-Oriented Inference Backend
AISBench Benchmark supports multiple service-oriented inference backends, including vLLM, SGLang, Triton, MindIE, TGI, etc. These backends receive inference requests and return results through exposed HTTP API interfaces. (HTTPS interfaces are not supported currently.)
Taking the vLLM inference service deployed on GPU as an example, you can refer to the vLLM Official Documentation to start the service.
The model configurations corresponding to different service-oriented backends are as follows:
| Model Configuration Name | Description | Prerequisites for Use | Interface Type | Supported Dataset Prompt Formats | Configuration File Path |
|---|---|---|---|---|---|
| | Access the inference service via vLLM’s OpenAI-compatible API, with the interface | The vLLM version used supports the | Text Interface | String Format | |
| | Access the vLLM inference service in streaming mode, with the interface | The vLLM version used supports the | Streaming Interface | String Format | |
| | Access the inference service via vLLM’s OpenAI-compatible API, with the interface | The vLLM version used supports the | Text Interface | String Format, Dialogue Format, Multimodal Format | |
| | Access the vLLM inference service in streaming mode, with the interface | The vLLM version used supports the | Streaming Interface | String Format, Dialogue Format, Multimodal Format | |
| | Access the vLLM inference service in streaming mode for multi-turn dialogue scenarios, with the interface | The vLLM version used supports the | Streaming Interface | Dialogue Format | |
| | API for accessing the vLLM inference service in function call accuracy evaluation scenarios, with the interface | The vLLM version used supports the | Text Interface | Dialogue Format | |
| | Access the inference service via vLLM-compatible API, with the interface | The vLLM version used supports the | Text Interface | String Format, Multimodal Format | |
| | Access the inference service via MindIE streaming API, with the interface | The MindIE version used supports the | Streaming Interface | String Format, Multimodal Format | |
| | Access the inference service via Triton API, with the interface | Start an inference service that supports Triton API | Text Interface | String Format, Multimodal Format | |
| | Access the inference service via Triton streaming API, with the interface | Start an inference service that supports Triton API | Streaming Interface | String Format, Multimodal Format | |
| | Access the inference service via TGI API, with the interface | Start an inference service that supports TGI API | Text Interface | String Format, Multimodal Format | |
| | Access the inference service via TGI streaming API, with the interface | Start an inference service that supports TGI API | Streaming Interface | String Format, Multimodal Format | |
Parameter Description for Service-Oriented Inference Backend Configuration
The configuration file for the service-oriented inference backend is written in Python syntax, as shown in the example below:
from ais_bench.benchmark.models import VLLMCustomAPI

models = [
    dict(
        attr="service",                # Backend type identifier
        type=VLLMCustomAPI,            # API type
        abbr='vllm-api-general',       # Unique identifier
        path="/weight/DeepSeek-R1",    # Model path
        model="DeepSeek-R1",           # Model name
        request_rate=0,                # Request rate
        retry=2,                       # Maximum number of retries
        host_ip="localhost",           # Inference service IP
        host_port=8080,                # Inference service port
        max_out_len=512,               # Maximum output length
        batch_size=1,                  # Request concurrency count
        generation_kwargs=dict(        # Inference generation parameters
            temperature=0.5,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        )
    )
]
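To make the connection fields concrete, here is a minimal sketch of how `host_ip`, `host_port`, `model`, `max_out_len`, and `generation_kwargs` could map onto a request to a vLLM OpenAI-compatible server. It is not AISBench's internal implementation. The helper `build_completion_request` is hypothetical, `/v1/completions` is the completions route of vLLM's OpenAI-compatible server, and omitting `None`-valued generation parameters is an assumption. The request is only constructed, never sent.

```python
import ipaddress
import json
import urllib.request


def build_completion_request(cfg, prompt):
    """Build (but do not send) an HTTP request from a service backend config.

    Hypothetical helper for illustration; AISBench may build requests differently.
    """
    host = cfg["host_ip"]
    try:
        # IPv6 literals must be wrapped in brackets inside a URL.
        if ipaddress.ip_address(host).version == 6:
            host = f"[{host}]"
    except ValueError:
        pass  # Not an IP literal (e.g. "localhost"); use it unchanged.

    # Plain HTTP only: HTTPS interfaces are not currently supported.
    url = f"http://{host}:{cfg['host_port']}/v1/completions"

    payload = {
        "model": cfg["model"],
        "prompt": prompt,
        "max_tokens": cfg["max_out_len"],
    }
    # Assumption: parameters set to None (e.g. seed=None) are simply omitted.
    for key, value in cfg.get("generation_kwargs", {}).items():
        if value is not None:
            payload[key] = value

    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


cfg = dict(
    host_ip="localhost",
    host_port=8080,
    model="DeepSeek-R1",
    max_out_len=512,
    generation_kwargs=dict(temperature=0.5, top_p=0.95, seed=None),
)
req = build_completion_request(cfg, "Hello")
print(req.full_url)  # http://localhost:8080/v1/completions
```

Building the request without sending it makes it possible to inspect the target URL and JSON body before pointing a benchmark run at a live service.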
The configurable parameters for the service-oriented inference backend are described below:
| Parameter Name | Parameter Type | Configuration Description |
|---|---|---|
| | String | Identifier for the inference backend type, fixed as |
| | Python Class | Class name of the API type, automatically associated by the system; no manual configuration is required by the user. Refer to Service-Oriented Inference Backend |
| | String | Unique identifier for the service-oriented task, used to distinguish different tasks. It consists of English characters and hyphens, e.g., |
| | String | Tokenizer path, usually the same as the model path. The Tokenizer is loaded using |
| | String | Name of the model accessible on the server, which must be consistent with the name specified during service-oriented deployment |
| | String | Applicable only to Triton services. It is concatenated into the endpoint URI |
| | Float | Request sending rate (unit: requests per second). A request is sent every |
| | Dict | Parameters for controlling fluctuations in the request sending rate (for detailed usage instructions, refer to 🔗 Description of Request Rate (RPS) Distribution Control and Visualization). If this item is not filled in, the function is disabled by default |
| | Int | Maximum number of retries after failing to connect to the server. Valid range: [0, 1000] |
| | String | Server IP address, supporting valid IPv4 or IPv6, e.g., |
| | Int | Server port number, which must be consistent with the port specified during service-oriented deployment |
| | Int | Maximum output length of the inference response; the actual length may be limited by the server. Valid range: (0, 131072] |
| | Int | Batch size for concurrent requests. Valid range: (0, 64000] |
| | Dict | Configuration of inference generation parameters, depending on the specific service-oriented backend and interface type. Note: Currently, multi-sampling parameters such as |
| | Bool | Controls the extraction method of function call information. When set to |
| | Dict | Post-processing configuration for model output results. It is used to format, clean, or convert the original model output to meet the requirements of specific evaluation tasks |
Precautions:
- `request_rate` is affected by hardware performance. You can increase 📚 WORKERS_NUM to improve concurrency capability.
- The function of `request_rate` may be overwritten by the `traffic_cfg` item. For specific reasons, refer to 🔗 Parameter Interpretation Section in the Description of Request Rate (RPS) Distribution Control and Visualization.
- Setting `batch_size` too large may result in high CPU usage. Please configure it reasonably based on hardware conditions.
- The default service address used by the service-oriented inference evaluation API is `localhost:8080`. In actual use, you need to modify it to the IP and port of the service-oriented backend according to the actual deployment.
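
The valid ranges listed in the parameter table can be checked before a run is launched. The following is a minimal sketch under stated assumptions: `validate_service_config` and `duplicate_abbrs` are hypothetical helpers, not part of AISBench, and only the three integer ranges documented above plus the uniqueness of `abbr` are checked.

```python
# Documented ranges from the parameter table: (low, low_inclusive, high).
DOCUMENTED_RANGES = {
    "retry": (0, True, 1000),            # [0, 1000]
    "max_out_len": (0, False, 131072),   # (0, 131072]
    "batch_size": (0, False, 64000),     # (0, 64000]
}


def validate_service_config(cfg):
    """Return a list of range problems in one service backend config dict.

    Hypothetical helper for illustration; AISBench's own checks may differ.
    """
    problems = []
    for name, (low, low_inclusive, high) in DOCUMENTED_RANGES.items():
        if name not in cfg:
            continue  # Only check parameters that are actually set.
        value = cfg[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name} must be an int, got {value!r}")
            continue
        above_low = value >= low if low_inclusive else value > low
        if not (above_low and value <= high):
            problems.append(f"{name}={value} is outside the documented range")
    return problems


def duplicate_abbrs(models):
    """Return abbr values that appear more than once in a models list."""
    seen, duplicates = set(), set()
    for cfg in models:
        abbr = cfg.get("abbr")
        if abbr is None:
            continue
        if abbr in seen:
            duplicates.add(abbr)
        seen.add(abbr)
    return sorted(duplicates)


models = [
    dict(abbr="vllm-api-general", retry=2, max_out_len=512, batch_size=1),
    dict(abbr="vllm-api-general", retry=2000, max_out_len=0, batch_size=1),
]
for cfg in models:
    print(cfg["abbr"], validate_service_config(cfg))
print("duplicate abbr values:", duplicate_abbrs(models))
```

Returning a list of problems rather than raising on the first one lets every misconfigured field be reported in a single pass.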
Local Model Backend
| Model Configuration Name | Description | Prerequisites for Use | Supported Prompt Formats (String Format or Dialogue Format) | Corresponding Source Code Configuration File Path |
|---|---|---|---|---|
| | HuggingFace Base Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | String Format | |
| | HuggingFace Chat Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | Dialogue Format | |
Parameter Description for Local Model Backend Configuration
The configuration file for the local model backend is written in Python syntax, as shown in the example below:
from ais_bench.benchmark.models import HuggingFacewithChatTemplate

models = [
    dict(
        attr="local",                        # Backend type identifier
        type=HuggingFacewithChatTemplate,    # Model type
        abbr='hf-chat-model',                # Unique identifier
        path='THUDM/chatglm-6b',             # Model weight path
        tokenizer_path='THUDM/chatglm-6b',   # Tokenizer path
        model_kwargs=dict(                   # Model loading parameters
            device_map="auto",
            trust_remote_code=True
        ),
        max_out_len=512,                     # Maximum output length
        batch_size=1,                        # Inference batch size
        generation_kwargs=dict(              # Generation parameters
            temperature=0.5,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        )
    )
]
The configurable parameters for the local model inference backend are described below:
| Parameter Name | Parameter Type | Description & Configuration |
|---|---|---|
| | String | Identifier for the backend type, fixed as |
| | Python Class | Model class name, automatically associated by the system; no manual configuration is required by the user |
| | String | Unique identifier for the local task, used to distinguish multiple tasks. It is recommended to use a combination of English characters and hyphens, e.g., |
| | String | Model weight path, which must be an accessible local path. The model is loaded using |
| | String | Tokenizer path, usually the same as the model path. The Tokenizer is loaded using |
| | Dict | Tokenizer loading parameters. Refer to 🔗 PreTrainedTokenizerBase Documentation |
| | Dict | Model loading parameters. Refer to 🔗 AutoModel Configuration |
| | Dict | Inference generation parameters. Refer to 🔗 Text Generation Documentation |
| | Dict | Runtime configuration, including |
| | Int | Maximum number of output tokens generated by inference. Valid range: (0, 131072] |
| | Int | Batch size for inference requests. Valid range: (0, 64000] |
| | Int | Maximum input sequence length. Valid range: (0, 131072] |
| | Bool | Whether to enable batch padding. Set to |
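
Because automatic download is not supported and the model weight path must be an accessible local path, it can help to check the paths before starting a long evaluation. The sketch below is a hypothetical pre-flight helper, not part of AISBench. It assumes that weights live in a directory and that an unset `tokenizer_path` falls back to `path`, reflecting the note that the two are usually the same.

```python
import os
import tempfile


def resolve_local_paths(cfg):
    """Return (model_path, tokenizer_path) for a local backend config dict.

    Hypothetical helper for illustration. Raises FileNotFoundError if either
    path is not an existing local directory.
    """
    model_path = cfg["path"]
    if not os.path.isdir(model_path):
        raise FileNotFoundError(
            f"Model weight path is not a local directory: {model_path!r}"
        )
    # Assumption: an unset tokenizer_path defaults to the model path.
    tokenizer_path = cfg.get("tokenizer_path") or model_path
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(
            f"Tokenizer path is not a local directory: {tokenizer_path!r}"
        )
    return model_path, tokenizer_path


# Usage with a temporary directory standing in for real model weights.
with tempfile.TemporaryDirectory() as weights_dir:
    model_path, tokenizer_path = resolve_local_paths(dict(path=weights_dir))
    print(model_path == tokenizer_path)  # True
```

Under this check, a hub-style identifier would be rejected unless a directory of that name exists relative to the working directory.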