Skip to content

Latest commit

 

History

History
109 lines (94 loc) · 8.06 KB

benchmarks.md

File metadata and controls

109 lines (94 loc) · 8.06 KB

Performance Notes

We conducted performance reviews on various mainstream NVIDIA GPUs with different model sizes and precisions. For dense models ranging from 2B to 110B parameters on PCIe devices, ZhiLight demonstrates significant performance advantages compared to mainstream open-source inference engines.

To quickly start a benchmark task, refer to the guide.

Test Description:

  • Test purpose is to demonstrate performance, applicable scenarios and limitations
  • Test dataset contains approximately 3.7k prompts
  • Test metrics include:
    • QPS: Queries Per Second
    • TTFT (Time To First Token): First token generation latency
    • TPOT (Time Per Output Token): Generation latency per output token
  • Test environments include:
    • AD102 PCIe : Consumer-grade GPU for experimental research
    • A800: Data center GPU for production deployment
  • Test models include:
    • Large-scale models: Qwen1.5-110B, Qwen2-72B, LLama-3.1-70B
    • Medium-scale models: Qwen2.5-14B, Llama-3.1-8B, Minicpm-2B
  • Compared inference engines include:
    • vLLM
    • SGLang
    • ZhiLight

MiniCPM-2B-sft-bf16

  • NVIDIA AD102 PCIe * 1
  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --host 127.0.0.1 --port 8080 --max-num-seqs 100 --gpu-memory-utilization 0.9 --trust-remote-code
  • SGLang args: python -m sglang.launch_server --port 8080 --model-path /mnt/models --log-requests --chunked-prefill-size 256 --enable-metrics --mem-fraction-static 0.8 --trust-remote-code --disable-radix-cache
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models
Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 1.67 527.55 1062.96 16.71 31.95
SGLang 1.67 466.19 1181.5 33.96 59.44
ZhiLight 1.67 434.64 989.03 26.1 61.14

Llama-3.1-8B

  • NVIDIA AD102 PCIe * 2
  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --host 127.0.0.1 --port 8080 --max-num-seqs 64 --gpu-memory-utilization 0.9 --max-model-len 25000 -tp 2 --enable-chunked-prefill
  • SGlang args: python -m sglang.launch_server --model-path /mnt/models --port 8080 --enable-mixed-chunk --disable-radix-cache --tp 2 --enable-p2p-check --context-length 25000 --mem-fraction-static 0.8 --enable-torch-compile --max-num-reqs 64
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --env "HIGH_PRECISION=0;CPM_FUSE_QKV=1;CPM_FUSE_FF_IN=2;CHUNKED_PREFILL=1;CHUNKED_PREFILL_SIZE=256" --dyn-max-batch-size 64
Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.46 915.58 1472.96 18.68 23.42
SGLang 0.84 599.13 1148.46 28.99 37.95
ZhiLight 0.4 1091.12 2123.93 66.24 88.5

Llama-3.1-70B-Instruct-GPTQ-INT4

  • NVIDIA AD102 PCIe * 4
  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --port 8080 --max-num-seqs 32 --gpu-memory-utilization 0.9 --max-model-len 32000 -tp 4 --enable-chunked-prefill
  • SGLang args: python -m sglang.launch_server --port 8080 --model-path /mnt/models --tp 4 --log-requests --chunked-prefill-size 256 --enable-metrics --mem-fraction-static 0.8 --disable-radix-cache
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --env "HIGH_PRECISION=0;CPM_FUSE_QKV=1;CPM_FUSE_FF_IN=2;REDUCE_TP_INT8_THRES=100;DUAL_STREAM=1" --dyn-max-batch-size 32
Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.18 4796.32 9149.49 41.54 90.12
SGLang 0.18 3962.99 8886.13 63.22 134.48
ZhiLight 0.18 1419.74 2295.92 30.97 56.08

Qwen2.5-14B-Instruct-GPTQ-Int4

  • NVIDIA AD102 PCIe * 2
  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --port 8080 --max-num-seqs 32 --gpu-memory-utilization 0.9 --max-model-len 25000 -tp 2 --enable-chunked-prefill -q fp8
  • SGLang args: python -m sglang.launch_server --host 0.0.0.0 --port 8080 --model-path /mnt/models --tp 2 --disable-radix-cache --chunked-prefill-size 2048 --disable-custom-all-reduce --max-running-requests 32 --mem-fraction-static 0.8 --context-length 25000
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --dyn-max-batch-size 32
Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.35 1113.4 1770.56 21.69 28.96
SGLang 0.56 613.89 957.38 28.09 35.57
ZhiLight 0.57 795.33 1475.42 31.98 40.01

Qwen2-72B-Instruct-GPTQ-Int4

  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --port 8080 --max-num-seqs 40 --gpu-memory-utilization 0.9 --max-model-len 32000 --enable-chunked-prefill --max-num-batched-tokens 512 -tp 4 --distributed-executor-backend mp --disable-custom-all-reduce
  • SGLang args: python -m sglang.launch_server --port 8080 --model-path /mnt/models --disable-radix-cache --tp 4 --chunked-prefill-size 2048 --disable-custom-all-reduce --max-num-reqs 40
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_server --model-path /mnt/models --env "HIGH_PRECISION=0;CPM_FUSE_QKV=1;CPM_FUSE_FF_IN=2;REDUCE_TP_INT8_THRES=100;DUAL_STREAM=1" --dyn-max-batch-size 16

NVIDIA AD102 PCIe * 4

Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.18 3493.97 6852.07 35.47 61.74
SGLang 0.18 2276.1 3820.7 38.12 65.16
ZhiLight 0.18 1111.8 1882.5 26.75 41.81

NVIDIA A800 * 4

Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.18 1457.65 2136.5 22.14 28.96
SGLang 0.36 1113.06 1850.57 30.41 43.65
ZhiLight 0.18 1227.37 1968.95 31.95 48.53

Qwen1.5-110B-Chat-GPTQ-Int4

  • vLLM args: python -m vllm.entrypoints.openai.api_server --model /mnt/models --host 127.0.0.1 --port 8080 --max-num-seqs 100 --gpu-memory-utilization 0.95 --enable-chunked-prefill --max-num-batched-tokens 256 --max-model-len 30000 -tp 4 --disable-custom-all-reduce
  • SGLang args: python -m sglang.launch_server --port 8080 --model-path /mnt/models --disable-radix-cache --tp 4 --chunked-prefill-size 2048
  • ZhiLight args: python -m zhilight.server.openai.entrypoints.api_sever --model-path /mnt/models --env "HIGH_PRECISION=0;CPM_FUSE_QKV=1;CPM_FUSE_FF_IN=2;REDUCE_TP_INT8_THRES=100;DUAL_STREAM=1" --dyn-max-batch-size 16

NVIDIA AD102 PCIe * 4

Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.09 3085.74 4274.03 30.34 44.08
SGLang 0.09 2418.56 3187.73 31.39 53.1
ZhiLight 0.18 1671.38 2669.82 39.68 64.35

NVIDIA A800 * 4

Inference Engine QPS TTFT Mean TTFT P95 TPOT Mean TPOT P95
vLLM 0.09 1899.07 2719.59 23.8 33.02
SGLang 0.18 1514.49 2135.75 28.5 47.28
ZhiLight 0.1 1574.85 2086.8 27.07 38.82