Guide to Using Random Synthetic Datasets

I. Application Introduction

This feature only supports performance evaluation scenarios and does not support accuracy evaluation scenarios.

This feature is designed for scenarios where real datasets are unavailable, and randomly constructed synthetic datasets are used for large language model inference performance benchmarking.

II. Usage Guide

2.1 Modifying the Configuration File

Configure the required parameters in the ais_bench/datasets/synthetic/synthetic_config.py configuration file.

This configuration file is used to generate two types of randomly constructed datasets:

string mode: Generates random-length strings (simulating real input)
```
synthetic_config = {
    "Type": "string",
    ... # Other parameters
}
```
In this mode, apart from the public parameters, only the parameters in StringConfig take effect; those in TokenIdConfig are ignored.
tokenid mode: Generates random token ID sequences (directly inputting encoded tokens)
```
synthetic_config = {
    "Type": "tokenid",
    ... # Other parameters
}
```
In this mode, apart from the public parameters, only the parameters in TokenIdConfig take effect; those in StringConfig are ignored.
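
The selection rule for the two modes can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the tool's own code; the helper name `effective_params` and the lookup table are assumptions made for the example.

```python
# Hypothetical helper: maps each Type value to the section that takes effect.
SECTION_FOR_TYPE = {"string": "StringConfig", "tokenid": "TokenIdConfig"}


def effective_params(config: dict) -> dict:
    """Return the sub-config that takes effect for config["Type"]."""
    dataset_type = config["Type"]
    if dataset_type not in SECTION_FOR_TYPE:
        raise ValueError(f"unsupported Type: {dataset_type!r}")
    return config[SECTION_FOR_TYPE[dataset_type]]


# Both sections are present, but only the one matching Type is used.
config = {
    "Type": "tokenid",
    "RequestCount": 100,
    "StringConfig": {"Input": {}, "Output": {}},
    "TokenIdConfig": {"RequestSize": 128},
}
print(effective_params(config))  # {'RequestSize': 128}
```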

2.2 Command Execution

Run the following command in the command line to start the evaluation:

```
ais_bench --models {model_api_file} --datasets synthetic_gen {other_option_args}
```

III. Parameter Description

The following is a general description of parameters in the synthetic_config.py configuration file. For detailed value requirements, refer to the comments in the configuration file and specific usage scenarios.

3.1 Public Parameters

| Parameter Name | Type | Description | Value Range |
|---|---|---|---|
| Type | string | Dataset type (required) | string / tokenid |
| RequestCount | int | Total number of generated requests (required) | [1, 1,048,576] |

3.2 String Type Configuration (Required when Type="string")

```
"StringConfig" : {
    "Input" : {          # Input sequence configuration
        "Method": str,    # Distribution type: uniform/gaussian/zipf
        "Params": {}      # Parameters for the corresponding distribution
    },
    "Output" : {         # Output sequence configuration (parameters same as above)
        "Method": str,
        "Params": {}
    }
}
```

Description of Input/Output Distribution Parameters

Key Rules

By default, numerical parameters should not exceed 2^20 (i.e., 1,048,576), except where the table below gives a different range.

The maximum input/output length of requests is also limited by service configuration. Refer to the comments in the configuration file for details.

| Distribution Type | Parameter | Type | Description | Value Range |
|---|---|---|---|---|
| uniform | `MinValue` | int | Minimum length of input/output sequences | [1, 1,048,576] |
| | `MaxValue` | int | Maximum length of input/output sequences (can equal MinValue) | ≥ MinValue |
| gaussian | `Mean` | float | Central value of the distribution (mean) | [-3.0e38, 3.0e38] |
| | `Var` | float | Variance (controls data dispersion) | [0, 3.0e38] |
| | `MinValue` | int | Hard lower limit for input/output sequence length | [1, 1,048,576] |
| | `MaxValue` | int | Hard upper limit for input/output sequence length | ≥ MinValue |
| zipf | `Alpha` | float | Shape parameter (larger values make the distribution decay faster and the tail lighter; values close to 1 give a heavier tail) | (1.0, 10.0] |
| | `MinValue` | int | Minimum length of input/output sequences | [1, 1,048,576] |
| | `MaxValue` | int | Maximum length of input/output sequences (must be greater than MinValue) | > MinValue |
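
To make the three distribution types concrete, here is a minimal Python sketch of how a length could be drawn from each of them. It is not the tool's implementation: the function name `sample_length`, the clamping of Gaussian samples to the hard limits, and the mapping of Zipf ranks onto the range (rank 1 corresponds to MinValue) are all assumptions chosen for illustration.

```python
import math
import random


def sample_length(method: str, params: dict, rng: random.Random) -> int:
    """Draw one sequence length in [MinValue, MaxValue] for the given Method."""
    low, high = params["MinValue"], params["MaxValue"]
    if method == "uniform":
        return rng.randint(low, high)  # inclusive on both ends
    if method == "gaussian":
        sigma = math.sqrt(params["Var"])  # Var is a variance, not a standard deviation
        value = round(rng.gauss(params["Mean"], sigma))
        return min(max(value, low), high)  # assumed: clamp to the hard limits
    if method == "zipf":
        # Assumed mapping: rank k = 1 corresponds to MinValue and
        # P(k) is proportional to k ** -Alpha over the whole range.
        ranks = range(1, high - low + 2)
        weights = [k ** -params["Alpha"] for k in ranks]
        return low + rng.choices(ranks, weights=weights)[0] - 1
    raise ValueError(f"unknown Method: {method!r}")


rng = random.Random(0)
print(sample_length("uniform", {"MinValue": 50, "MaxValue": 500}, rng))
print(sample_length("gaussian", {"Mean": 256, "Var": 100, "MinValue": 64, "MaxValue": 512}, rng))
print(sample_length("zipf", {"Alpha": 1.5, "MinValue": 10, "MaxValue": 1000}, rng))
```

Under this assumed mapping, a larger Alpha puts more weight on the lowest ranks, so lengths cluster more tightly around MinValue.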

3.3 TokenId Type Configuration (Required when Type="tokenid")

```
"TokenIdConfig" : {
    "RequestSize": int   # Number of tokens per request
}
```
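
Before starting a run, it can be useful to check a configuration against the rules in this section. The sketch below covers only the top-level rules stated above (the allowed Type values, the RequestCount range, and the section required for each Type); the function name `check_config` is hypothetical, and the tool may perform its own, stricter validation.

```python
MAX_REQUEST_COUNT = 2 ** 20  # 1,048,576, the documented upper bound


def check_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means these basic checks pass."""
    problems = []
    required = {"string": "StringConfig", "tokenid": "TokenIdConfig"}

    dataset_type = config.get("Type")
    if dataset_type not in required:
        problems.append("Type must be 'string' or 'tokenid'")
    elif required[dataset_type] not in config:
        problems.append(f"{required[dataset_type]} is required when Type={dataset_type!r}")

    count = config.get("RequestCount")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    is_int = isinstance(count, int) and not isinstance(count, bool)
    if not is_int or not 1 <= count <= MAX_REQUEST_COUNT:
        problems.append(f"RequestCount must be an int in [1, {MAX_REQUEST_COUNT}]")
    return problems


print(check_config({"Type": "tokenid", "RequestCount": 1000, "TokenIdConfig": {"RequestSize": 2048}}))  # []
print(check_config({"Type": "string", "RequestCount": 0}))
```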

IV. Configuration Examples

4.1 String Type Examples

1. Uniform Distribution

```
synthetic_config = {
    "Type": "string",
    "RequestCount": 1000,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 50, "MaxValue": 500}  # Input length: 50-500
        },
        "Output": {
            "Method": "uniform",
            "Params": {"MinValue": 20, "MaxValue": 200}  # Output length: 20-200
        }
    }
}
```

Features: Input/output lengths are evenly distributed within the range, suitable for baseline performance testing.

2. Gaussian Distribution

```
synthetic_config = {
    "Type": "string",
    "RequestCount": 800,
    "StringConfig": {
        "Input": {
            "Method": "gaussian",
            "Params": {
                "Mean": 256,       # Central value: 256
                "Var": 100,        # Variance: 100 (standard deviation: 10)
                "MinValue": 64,    # Actual range: 64-512
                "MaxValue": 512
            }
        },
        "Output": {
            "Method": "gaussian",
            "Params": {
                "Mean": 128,
                "Var": 50,
                "MinValue": 32,
                "MaxValue": 256
            }
        }
    }
}
```

Distribution Characteristics: Approximately 95% of input lengths fall within [236, 276] (μ±2σ).
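
The interval follows from σ = √Var = √100 = 10, so μ±2σ = 256 ± 20. The short Python check below reproduces the arithmetic and estimates the fraction empirically; it samples a plain normal distribution with the standard library, which is an assumption about how lengths are drawn (the tool's rounding and clamping to [64, 512] may shift the figure slightly).

```python
import math
import random

mean, var = 256, 100
sigma = math.sqrt(var)                          # 10.0
low, high = mean - 2 * sigma, mean + 2 * sigma  # 236.0 and 276.0

rng = random.Random(42)
samples = [rng.gauss(mean, sigma) for _ in range(100_000)]
inside = sum(low <= x <= high for x in samples) / len(samples)
print(f"sigma={sigma}, interval=[{low}, {high}], fraction inside={inside:.3f}")
```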

3. Zipf Distribution

```
synthetic_config = {
    "Type": "string",
    "RequestCount": 1200,
    "StringConfig": {
        "Input": {
            "Method": "zipf",
            "Params": {
                "Alpha": 1.5,      # Strong long-tail effect
                "MinValue": 10,    # Input length range: 10-1000
                "MaxValue": 1000
            }
        },
        "Output": {
            "Method": "zipf",
            "Params": {
                "Alpha": 2.0,     # Steeper decay, lighter tail than Alpha=1.5
                "MinValue": 5,
                "MaxValue": 500
            }
        }
    }
}
```

Typical Scenario: Simulates the long-tail distribution of requests in real scenarios. When Alpha=1.5, a minority of long requests (roughly 20%) accounts for the majority of the computation (around 60%); the exact shares depend on MinValue and MaxValue.
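
The concentration for a particular configuration can be measured rather than estimated. The sketch below is one way to do so, under two loudly stated assumptions: lengths follow P(length = MinValue + k - 1) proportional to k^(-Alpha), which may differ from the tool's sampler, and total sequence length is used as a rough proxy for computation. The helper `top_share` is hypothetical.

```python
import random


def top_share(lengths: list[int], top_fraction: float) -> float:
    """Share of the total length contributed by the longest top_fraction of requests."""
    ordered = sorted(lengths, reverse=True)
    top_n = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Assumed sampling model (see the lead-in); values taken from the Input example above.
alpha, low, high = 1.5, 10, 1000
ranks = range(1, high - low + 2)
weights = [k ** -alpha for k in ranks]
rng = random.Random(0)
lengths = [low + k - 1 for k in rng.choices(ranks, weights=weights, k=1200)]

print(f"top 20% of requests hold {top_share(lengths, 0.2):.0%} of the total input length")
```

Rerunning with a different Alpha or length range shows how strongly the share depends on those parameters.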

4. Mixed Distribution Configuration

```
synthetic_config = {
    "Type": "string",
    "RequestCount": 1500,
    "StringConfig": {
        "Input": {
            "Method": "zipf",    # Long-tail distribution for input
            "Params": {
                "Alpha": 1.2,
                "MinValue": 10,
                "MaxValue": 2000
            }
        },
        "Output": {
            "Method": "uniform",  # Uniform distribution for output
            "Params": {
                "MinValue": 50,
                "MaxValue": 300
            }
        }
    }
}
```

4.2 TokenId Type Examples

Long Text Stress Testing

```
synthetic_config = {
    "Type": "tokenid",
    "RequestCount": 1000,
    "TokenIdConfig": {
        "RequestSize": 2048   # 2048 tokens per request
    }
}
```

Short Text Performance Testing

```
synthetic_config = {
    "Type": "tokenid",
    "RequestCount": 5000,
    "TokenIdConfig": {
        "RequestSize": 128    # Short text processing scenario
    }
}
```

V. Frequently Asked Questions

Q1: How to choose a distribution type?

Uniform distribution: Suitable for baseline scenarios in stress testing.
Gaussian distribution: Simulates average request lengths in real scenarios.
Zipf distribution: Generates long-tail distributed data (e.g., 1% of requests account for 50% of computation).
Suggested distribution combinations:
- Stress testing: Use zipf distribution for Input and uniform distribution for Output.
- Stability testing: Use gaussian distribution for both Input and Output.

Q2: Why do the performance evaluation results show unexpected input lengths even after the input length has been specified?

tokenid mode: Before a request is sent, the prompt of the specified length (randomly generated tokens within the model's vocabulary range) is decoded back into a string. Because a model's vocabulary may contain many-to-one or one-to-many mappings, the service may tokenize that string into a different number of tokens, so the measured length can fluctuate.
string mode: The input length here refers to the length of the input string, not the number of tokens.
Preprocessing stage: When chat-related APIs are used, additional strings may be concatenated before or after the prompt.
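
The tokenid-mode fluctuation can be reproduced with a toy tokenizer. The three-entry vocabulary and the greedy longest-match `encode` below are invented purely for illustration and do not correspond to any real model; they only show how decoding random token IDs to a string and tokenizing that string again can change the token count.

```python
VOCAB = ["a", "b", "ab"]  # toy vocabulary, not a real model's


def decode(token_ids: list[int]) -> str:
    """Concatenate the text of each token ID."""
    return "".join(VOCAB[i] for i in token_ids)


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization over VOCAB."""
    by_length = sorted(range(len(VOCAB)), key=lambda i: -len(VOCAB[i]))
    ids, pos = [], 0
    while pos < len(text):
        for i in by_length:
            if text.startswith(VOCAB[i], pos):
                ids.append(i)
                pos += len(VOCAB[i])
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


random_ids = [0, 1, 0, 1]    # 4 tokens requested: "a", "b", "a", "b"
prompt = decode(random_ids)  # "abab"
print(len(random_ids), len(encode(prompt)))  # 4 2
```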

Q3: Why do the performance evaluation results show unexpected output lengths even after the output length has been specified in string mode?

If the discrepancy is significant, check whether the ignore_eos parameter in generation_kwargs of the model API configuration file is set to True (this makes the service ignore the end-of-sequence token until the preset output length is reached).
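
A quick way to confirm the setting is to inspect the configuration programmatically. The sketch below assumes the model configuration can be viewed as a plain dict with a `generation_kwargs` entry; only the names `generation_kwargs` and `ignore_eos` come from this guide, and the helper `ignores_eos` and the surrounding structure are hypothetical.

```python
def ignores_eos(model_config: dict) -> bool:
    """True only if generation_kwargs explicitly sets ignore_eos to True."""
    return model_config.get("generation_kwargs", {}).get("ignore_eos") is True


# Hypothetical model API config entry; every field other than
# generation_kwargs/ignore_eos is omitted on purpose.
model_config = {"generation_kwargs": {"ignore_eos": True}}
print(ignores_eos(model_config))  # True
print(ignores_eos({}))            # False
```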

VI. Notes

tokenid mode: The value range of tokenid depends on the vocabulary range of the model specified in the model configuration file.
string mode: A fixed-length sequence is generated when MinValue=MaxValue.
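
For the tokenid note, the dependence on the vocabulary range can be pictured as follows. This is a sketch under assumptions, not the tool's generator: the function name `random_token_ids`, the uniform draw, and the vocabulary size of 32000 are all made up for the example.

```python
import random


def random_token_ids(request_size: int, vocab_size: int, rng: random.Random) -> list[int]:
    """Return request_size token IDs drawn uniformly from [0, vocab_size)."""
    return [rng.randrange(vocab_size) for _ in range(request_size)]


rng = random.Random(0)
# 32000 is a made-up vocabulary size; the real bound comes from the model.
ids = random_token_ids(request_size=128, vocab_size=32000, rng=rng)
print(len(ids), min(ids) >= 0, max(ids) < 32000)  # 128 True True
```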