Guide to Using Random Synthetic Datasets

I. Application Introduction

This feature only supports performance evaluation scenarios and does not support accuracy evaluation scenarios.

This feature is designed for scenarios where real datasets are unavailable, and randomly constructed synthetic datasets are used for large language model inference performance benchmarking.

II. Usage Guide

2.1 Modifying the Configuration File

Configure the required parameters in the ais_bench/datasets/synthetic/synthetic_config.py configuration file.

This configuration file is used to generate two types of randomly constructed datasets:

  • string mode: Generates random-length strings (simulating real input)

    synthetic_config = {
        "Type": "string",
        ... # Other parameters
    }
    

    In this mode, except for public parameters, parameters in StringConfig take effect while parameters in TokenIDConfig do not.

  • tokenid mode: Generates random token ID sequences (directly inputting encoded tokens)

    synthetic_config = {
        "Type": "tokenid",
        ... # Other parameters
    }
    

    In this mode, except for public parameters, parameters in TokenIDConfig take effect while parameters in StringConfig do not.

2.2 Command Execution

Run the following command in the command line to start the evaluation:

ais_bench --models {model_api_file} --datasets synthetic_gen {other_option_args}

III. Parameter Description

The following is a general description of parameters in the synthetic_config.py configuration file. For detailed value requirements, refer to the comments in the configuration file and specific usage scenarios.

3.1 Public Parameters

Parameter Name

Type

Description

Value Range

Type

string

Dataset type (required)

string/tokenid

RequestCount

int

Total number of generated requests (required)

[1, 1,048,576]

3.2 String Type Configuration (Required when Type=“string”)

"StringConfig" : {
    "Input" : {          # Input sequence configuration
        "Method": str,    # Distribution type: uniform/gaussian/zipf
        "Params": {}      # Parameters for the corresponding distribution
    },
    "Output" : {         # Output sequence configuration (parameters same as above)
        "Method": str,
        "Params": {}
    }
}

Description of Input/Output Distribution Parameters

Key Rules

  • The maximum value of all numerical parameters should not exceed 2^20 (i.e., 1,048,576) by default.

  • The maximum input/output length of requests is also limited by service configuration. Refer to the comments in the configuration file for details.

Distribution Type

Parameter

Type

Description

Value Range

uniform

MinValue

int

Minimum length of input/output sequences

[1, 1,048,576]

MaxValue

int

Maximum length of input/output sequences (can equal MinValue)

[≥MinValue]

gaussian

Mean

float

Central value of the distribution (mean)

[-3.0e38, 3.0e38]

Var

float

Variance (controls data dispersion)

[0, 3.0e38]

MinValue

int

Hard lower limit for input/output sequence length

[1, 1,048,576]

MaxValue

int

Hard upper limit for input/output sequence length

[≥MinValue]

zipf

Alpha

float

Shape parameter (larger values make the distribution more uniform)

(1.0, 10.0]

MinValue

int

Minimum length of input/output sequences

[1, 1,048,576]

MaxValue

int

Maximum length of input/output sequences (must be greater than MinValue)

[>MinValue]

3.3 TokenId Type Configuration (Required when Type=“tokenid”)

"TokenIdConfig" : {
    "RequestSize": int   # Number of tokens per request
}

IV. Configuration Examples

4.1 String Type Examples

1. Uniform Distribution

synthetic_config = {
    "Type": "string",
    "RequestCount": 1000,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 50, "MaxValue": 500}  # Input length: 50-500
        },
        "Output": {
            "Method": "uniform",
            "Params": {"MinValue": 20, "MaxValue": 200}  # Output length: 20-200
        }
    }
}

Features: Input/output lengths are evenly distributed within the range, suitable for baseline performance testing.

2. Gaussian Distribution

synthetic_config = {
    "Type": "string",
    "RequestCount": 800,
    "StringConfig": {
        "Input": {
            "Method": "gaussian",
            "Params": {
                "Mean": 256,       # Central value: 256
                "Var": 100,        # Standard deviation: 10
                "MinValue": 64,    # Actual range: 64-512
                "MaxValue": 512
            }
        },
        "Output": {
            "Method": "gaussian",
            "Params": {
                "Mean": 128,
                "Var": 50,
                "MinValue": 32,
                "MaxValue": 256
            }
        }
    }
}

Distribution Characteristics: Approximately 95% of input lengths fall within [236, 276] (μ±2σ).

3. Zipf Distribution

synthetic_config = {
    "Type": "string",
    "RequestCount": 1200,
    "StringConfig": {
        "Input": {
            "Method": "zipf",
            "Params": {
                "Alpha": 1.5,      # Strong long-tail effect
                "MinValue": 10,    # Input length range: 10-1000
                "MaxValue": 1000
            }
        },
        "Output": {
            "Method": "zipf",
            "Params": {
                "Alpha": 2.0,     # Flatter distribution
                "MinValue": 5,
                "MaxValue": 500
            }
        }
    }
}

Typical Scenario: Simulates long-tail distribution of requests in real scenarios. When Alpha=1.5, approximately 20% of requests account for 60% of the computation.

4. Mixed Distribution Configuration

synthetic_config = {
    "Type": "string",
    "RequestCount": 1500,
    "StringConfig": {
        "Input": {
            "Method": "zipf",    # Long-tail distribution for input
            "Params": {
                "Alpha": 1.2,
                "MinValue": 10,
                "MaxValue": 2000
            }
        },
        "Output": {
            "Method": "uniform",  # Uniform distribution for output
            "Params": {
                "MinValue": 50,
                "MaxValue": 300
            }
        }
    }
}

4.2 TokenId Type Examples

Long Text Stress Testing

synthetic_config = {
    "Type": "tokenid",
    "RequestCount": 1000,
    "TokenIdConfig": {
        "RequestSize": 2048   # 2048 tokens per request
    }
}

Short Text Performance Testing

synthetic_config = {
    "Type": "tokenid",
    "RequestCount": 5000,
    "TokenIdConfig": {
        "RequestSize": 128    # Short text processing scenario
    }
}

V. Frequently Asked Questions

Q1: How to choose a distribution type?

  • Uniform distribution: Suitable for baseline scenarios in stress testing.

  • Gaussian distribution: Simulates average request lengths in real scenarios.

  • Zipf distribution: Generates long-tail distributed data (e.g., 1% of requests account for 50% of computation).

  • Suggested distribution combinations:

    • Stress testing: Use zipf distribution for Input and uniform distribution for Output.

    • Stability testing: Use gaussian distribution for both Input and Output.

Q2: Why does the performance evaluation result matrix show unexpected values even after specifying the input length?

  • tokenid mode: When sending requests, prompts of specified length (composed of randomly generated tokens within the model’s vocabulary range) are re-decoded into strings before being sent to the service. Fluctuations may occur due to possible many-to-one or one-to-many vocabulary mappings in different models.

  • string mode: The input length here refers to the length of the input string, not the number of tokens.

  • Preprocessing stage: Additional string concatenation may be performed before/after using chat-related APIs.

Q3: Why does the performance evaluation result matrix show unexpected values even after specifying the output length in String mode?

  • Significant discrepancy: Check if the ignore_eos parameter in generation_kwargs of the model API configuration file is correctly set to True (this ensures the service ignores the end-of-sequence token until the preset output length is reached).

VI. Notes

  1. tokenid mode: The value range of tokenid depends on the vocabulary range of the model specified in the model configuration file.

  2. string mode: A fixed-length sequence is generated when MinValue=MaxValue.