TensorRT-LLM v0.11 Update (#1969)
kaiyux authored Jul 17, 2024
1 parent 9bd15f1 commit 05316d3
Showing 1,024 changed files with 2,084,834 additions and 868,880 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -6,9 +6,9 @@ __pycache__/
*.nsys-rep
.VSCodeCounter
build*/
!builders/
*.egg-info/
.coverage
*.csv
*.onnx
tmp/
venv/
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -46,5 +46,5 @@ repos:
args:
- --skip=".git,3rdparty"
- --exclude-file=examples/whisper/tokenizer.py
- - --ignore-words-list=rouge,inout,atleast,strat,nd
+ - --ignore-words-list=rouge,inout,atleast,strat,nd,subtile
exclude: 'tests/llm-test-defs/turtle/test_input_files'
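
The new `subtile` entry presumably keeps codespell from flagging intentional "subtile" (sub-tile) identifiers in kernel code. A minimal sketch of exercising the hook locally, assuming the hook id in `.pre-commit-config.yaml` is `codespell`:

```bash
# Install the git hooks defined in .pre-commit-config.yaml,
# then run only the codespell hook across the whole tree.
pip install pre-commit
pre-commit install
pre-commit run codespell --all-files
```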
50 changes: 46 additions & 4 deletions README.md
@@ -6,9 +6,9 @@ TensorRT-LLM

[![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
-[![cuda](https://img.shields.io/badge/cuda-12.4.0-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-10.0.1-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.10.0.dev-green)](./setup.py)
+[![cuda](https://img.shields.io/badge/cuda-12.4.1-green)](https://developer.nvidia.com/cuda-downloads)
+[![trt](https://img.shields.io/badge/TRT-10.1.0-green)](https://developer.nvidia.com/tensorrt)
+[![version](https://img.shields.io/badge/release-0.11.0-green)](./tensorrt_llm/version.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

[Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
@@ -17,7 +17,44 @@ TensorRT-LLM
<div align="left">

## Latest News
* [*Weekly*] Check out **[@NVIDIAAIDev](https://twitter.com/nvidiaaidev?lang=en)** & **[NVIDIA AI](https://www.linkedin.com/showcase/nvidia-ai/)** LinkedIn for the latest updates!
* [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference:
✅ MultiLingual
✅ NIM
✅ LoRA tuned adapters [➡️ Tech blog](https://developer.nvidia.com/blog/deploy-multilingual-llms-with-nvidia-nim/)
<div align="center">
<img src="docs/source/media/picture-07-09-2024.png" width="45%">
<div align="left">

* [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.
[➡️ Tech blog](https://developer.nvidia.com/blog/achieving-high-mixtral-8x7b-performance-with-nvidia-h100-tensor-core-gpus-and-tensorrt-llm?ncid=so-twit-928467)

* [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try)

* [2024/06/18] ICYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization [➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try)

* [2024/06/18] 🧰Deploying ComfyUI with TensorRT? Here’s your setup guide [➡️ link](https://github.com/comfyanonymous/ComfyUI_TensorRT)

* [2024/06/11] ✨#TensorRT Weight-Stripped Engines ✨
Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc. [➡️ link](https://developer.nvidia.com/blog/maximum-performance-and-minimum-footprint-for-ai-apps-with-nvidia-tensorrt-weight-stripped-engines/)

* [2024/06/04] ✨ #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸⚡ 🎥 Demo: [➡️ link](https://youtu.be/64QEVfbPHyg)
📗 DIY notebook: [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&name=ComfyUI_TensorRT&instance=L4%40g2-standard-4%3Anvidia-l4%3A1&diskStorage=500&cloudID=GCP&baseImage=docker.io%2Fpytorch%2Fpytorch%3A2.2.0-cuda12.1-cudnn8-runtime&ports=ComfUI%3A8188&file=https%3A%2F%2Fgithub.com%2Fbrevdev%2Fnotebooks%2Fblob%2Fmain%2Ftensorrt-comfyui.ipynb&launchableID=env-2hQX3n7ae5mq3NjNZ32DfAG0tJf)

* [2024/05/28] ✨#TensorRT weight stripping for ResNet-50 ✨ ✅+99% compression
✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc
👀 📚 DIY [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&launchableID=env-2h6bym7h5GFNho3vpWQQeUYMwTM&instance=L4%40g6.xlarge&diskStorage=500&cloudID=devplane-brev-1&baseImage=nvcr.io%2Fnvidia%2Ftensorrt%3A24.05-py3&file=https%3A%2F%2Fgithub.com%2FNVIDIA%2FTensorRT%2Fblob%2Frelease%2F10.0%2Fsamples%2Fpython%2Fsample_weight_stripping%2Fnotebooks%2Fweight_stripping.ipynb&name=tensorrt_weight_stripping_resnet50)

* [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:
Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)

* [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)


* [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚[➡️ link](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/)

<details close>
<summary>Previous News</summary>

* [2024/02/06] [🚀 Speed up inference with SOTA quantization techniques in TRT-LLM](./docs/source/blogs/quantization-in-TRT-LLM.md)
* [2024/01/30] [New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](./docs/source/blogs/XQA-kernel.md)
* [2023/12/04] [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](./docs/source/blogs/Falcon180B-H200.md)
@@ -29,6 +66,8 @@ TensorRT-LLM
* [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)

</details>

## TensorRT-LLM Overview

TensorRT-LLM is an easy-to-use Python API to define Large
@@ -75,3 +114,6 @@ To get started with TensorRT-LLM, visit our documentation:
- [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
- [Installation Guide for Windows](https://nvidia.github.io/TensorRT-LLM/installation/windows.html)
- [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)
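
For a quick first install on Linux, a minimal sketch — the version pin and extra index URL below are assumptions, so treat the installation guides above as authoritative:

```bash
# Requires a CUDA 12.4-compatible driver and Python 3.10.
pip3 install tensorrt_llm==0.11.0 --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```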

## Community
- [Model zoo](https://huggingface.co/TheFloat16) (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)
11 changes: 11 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,11 @@
# TensorRT-LLM Benchmarks

## Overview

There are currently three workflows to benchmark TensorRT-LLM (typical entry points for each are sketched after this list):
* [C++ benchmarks](./cpp)
- The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
* [Python benchmarks](./python)
- The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
* [The Python benchmarking suite](./suite)
- This benchmarking suite is a current work in progress and is prone to large changes.
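
A minimal sketch of the typical entry point for each workflow — the paths follow this repository's layout, and the C++ binaries exist only after the C++ runtime and its benchmarks have been built:

```bash
# C++ benchmarks (recommended), built under cpp/build
cd cpp/build && ./benchmarks/gptManagerBenchmark --help

# Python benchmarks (Python runtime only)
python benchmarks/python/benchmark.py --help

# Python benchmarking suite (work in progress)
ls benchmarks/suite
```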
148 changes: 68 additions & 80 deletions benchmarks/cpp/README.md
@@ -1,7 +1,7 @@
-# Benchmark for C++ Runtime
+# Benchmark C++ Runtime

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.

## Usage

@@ -16,58 +16,11 @@ Windows users: Follow the
instead, and be sure to set DLL paths as specified in
[Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).

-### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
-
-#### Prepare TensorRT-LLM engine(s)
-
-Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
-
-Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
-
-#### Launch benchmarking
-
-For detailed usage, you can do the following
-```
-cd cpp/build
-# You can directly execute the binary for help information
-./benchmarks/gptSessionBenchmark --help
-./benchmarks/bertBenchmark --help
-```
-
-Take GPT-350M as an example for single GPU
-
-```
-./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_350m/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
-```
-Take GPT-175B as an example for multiple GPUs
-```
-mpirun -n 8 ./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_175b/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
-```
-
-If you want to obtain context and generation logits, you could build an enigne with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enable `--gather_all_token_logits` will enable both of them.
-
-If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
-
-*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
-
-### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
+### 2. Launch C++ benchmarking (Inflight/V1 batching)

#### Prepare dataset

-Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*
+Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.

This tool can be used in 2 different modes of traffic generation.
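
As an illustration, the synthetic traffic mode (`token-norm-dist`, also used by the LoRA benchmark script later in this document) can be invoked roughly as follows; the tokenizer path is a placeholder and the mean/stdev flags are assumptions modeled on the fixed-length example further below:

```bash
# Generate a synthetic dataset of 100 requests with fixed-length
# inputs/outputs (stdev 0) that gptManagerBenchmark can consume.
python3 prepare_dataset.py \
    --output preprocessed_dataset.json \
    --tokenizer <path/to/tokenizer> \
    token-norm-dist \
    --num-requests 100 \
    --input-mean 128 --input-stdev 0 \
    --output-mean 20 --output-stdev 0
```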

@@ -127,7 +80,8 @@ For `tokenizer`, specifying the path to the local tokenizer that have already be


#### Prepare TensorRT-LLM engines
-Please make sure that the engines are built with argument `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching, for more details, please see the document in TensorRT-LLM examples.

+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).
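
As a rough sketch, building an engine for benchmarking might look like the following — the checkpoint directory and shape limits are placeholder assumptions (the flags mirror the LoRA example later in this document):

```bash
# Build a float16 engine from a converted TensorRT-LLM checkpoint.
trtllm-build \
    --checkpoint_dir ./ckpt \
    --output_dir ./engines/gpt2/fp16/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 32 \
    --max_input_len 1024 \
    --max_seq_len 2048
```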

#### Launch benchmarking

@@ -139,34 +93,24 @@ cd cpp/build
./benchmarks/gptManagerBenchmark --help
```

-Take GPT-350M as an example for single GPU V1 batching
-```
-./benchmarks/gptManagerBenchmark \
-    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type V1 \
-    --request_rate 10 \
-    --dataset ../../benchmarks/cpp/preprocessed_dataset.json
-    --max_num_samples 500
-```
-
-Take GPT-350M as an example for 2-GPU inflight batching
-```
-mpirun -n 2 ./benchmarks/gptManagerBenchmark \
-    --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
-    --type IFB \
-    --request_rate 10 \
-    --dataset ../../benchmarks/cpp/preprocessed_dataset.json
-    --max_num_samples 500
-```
-
-`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
+`gptManagerBenchmark` by default uses the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`).
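
A minimal current invocation, adapted from the removed single-GPU example above — the engine and dataset paths are illustrative and assume you have built a GPT2 engine and preprocessed a dataset as described earlier:

```bash
# Run inflight batching (the default) against a prepared dataset.
./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
    --request_rate 10 \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```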

#### Emulated static batching

-To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
+To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.

Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed completely.

-`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
+Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
```
python prepare_dataset.py \
    --output tokens-fixed-lengths.json \
@@ -181,7 +125,6 @@ Take GPT-350M as an example for single GPU with static batching
```
./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type IFB \
    --request-rate -1 \
    --static_emulated_batch_size 32 \
    --static_emulated_timeout 100 \
@@ -210,8 +153,10 @@ TP=2
PP=1
MAX_LEN=1024
MAX_BATCH=32
-MAX_LORA_RANK=32
NUM_LAYERS=40
+MAX_LORA_RANK=64
+NUM_LORA_MODS=7
+EOS_ID=2
SOURCE_LORA=chinese-llama-2-lora-13b
CPP_LORA=chinese-llama-2-lora-13b-cpp
@@ -230,14 +175,14 @@ ${HOME}/.local/bin/trtllm-build \
    --output_dir ${LORA_ENGINE} \
    --max_batch_size ${MAX_BATCH} \
    --max_input_len $MAX_LEN \
-    --max_output_len $MAX_LEN \
+    --max_seq_len $((2*${MAX_LEN})) \
    --gemm_plugin float16 \
    --lora_plugin float16 \
    --use_paged_context_fmha enable \
-    --lora_target_modules attn_qkv \
+    --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate \
    --max_lora_rank ${MAX_LORA_RANK}
-NUM_LORAS=(8 16 24 32 64 128 256)
+NUM_LORAS=(8 16)
NUM_REQUESTS=1024
# Convert LoRA to cpp format
@@ -252,8 +197,6 @@ mkdir -p $EG_DIR/data
# Prepare dataset without lora_task_id
python benchmarks/cpp/prepare_dataset.py \
--output "${EG_DIR}/data/token-norm-dist.json" \
--request-rate -1 \
--time-delay-dist constant \
--tokenizer $TOKENIZER \
token-norm-dist \
--num-requests $NUM_REQUESTS \
@@ -263,8 +206,6 @@ python benchmarks/cpp/prepare_dataset.py \
for nloras in ${NUM_LORAS[@]}; do
    python benchmarks/cpp/prepare_dataset.py \
        --output "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
-        --request-rate -1 \
-        --time-delay-dist constant \
        --rand-task-id 0 $(( $nloras - 1 )) \
        --tokenizer $TOKENIZER \
        token-norm-dist \
@@ -273,7 +214,7 @@ for nloras in ${NUM_LORAS[@]}; do
done
# Generate random lora weights for 256 adapters
-python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 256
+python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 16
# perform benchmarking
@@ -286,13 +227,13 @@ mpirun -n ${TP} --output-filename ${EG_DIR}/log-base-lora \
--dataset "${EG_DIR}/data/token-norm-dist.json" \
--lora_host_cache_bytes 8589934592 \
--lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
--kv_cache_free_gpu_mem_fraction 0.80 \
--kv_cache_free_gpu_mem_fraction 0.70 \
--log_level info \
--eos_id ${EOS_ID}
# Now run inference with various numbers of LoRAs
# The host cache is set large enough to hold all the LoRAs in lora_dir
-# GPU cache is set to hold 32 LoRAs
+# GPU cache is set to hold 16 LoRAs
# This benchmark will preload all the LoRAs into the host cache
# We run inference on a range of active LoRAs exercising different cache miss rates.
for nloras in ${NUM_LORAS[@]}; do
@@ -303,10 +244,57 @@ for nloras in ${NUM_LORAS[@]}; do
        --type IFB \
        --dataset "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
        --lora_host_cache_bytes 8589934592 \
-        --lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-        --kv_cache_free_gpu_mem_fraction 0.80 \
+        --lora_num_device_mod_layers $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
+        --kv_cache_free_gpu_mem_fraction 0.70 \
        --log_level info \
        --eos_id ${EOS_ID} \
        --lora_dir ${EG_DIR}/loras
done
```

### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)

#### Prepare TensorRT-LLM engine(s)

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the engine(s) built previously; see the [Python benchmarks document](../python/README.md).

#### Launch benchmarking

For detailed usage, run the following:
```
cd cpp/build
# You can directly execute the binary for help information
./benchmarks/gptSessionBenchmark --help
./benchmarks/bertBenchmark --help
```

Take GPT-350M as an example for a single GPU:

```
./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_350m/" \
--batch_size "1" \
--input_output_len "60,20"
# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
```
Take GPT-175B as an example for multiple GPUs:
```
mpirun -n 8 ./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_175b/" \
--batch_size "1" \
--input_output_len "60,20"
# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
```

If you want to obtain context and generation logits, you could build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` will enable both of them.

If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and can noticeably impact performance.

*Please note that the expected outputs in this document are only for reference; specific performance numbers depend on the GPU you're using.*
4 changes: 2 additions & 2 deletions benchmarks/cpp/bertBenchmark.cpp
@@ -17,6 +17,7 @@
#include "tensorrt_llm/common/memoryUtils.h"
#include "tensorrt_llm/plugins/api/tllmPlugin.h"
#include "tensorrt_llm/runtime/iTensor.h"
#include "tensorrt_llm/runtime/rawEngine.h"
#include "tensorrt_llm/runtime/tllmLogger.h"
#include "tensorrt_llm/runtime/tllmRuntime.h"
#include "tensorrt_llm/runtime/worldConfig.h"
@@ -78,11 +79,10 @@ void benchmarkBert(std::string const& modelName, std::filesystem::path const& da
{
    auto const worldConfig = WorldConfig::mpi();
    auto const enginePath = dataPath / engineFilename(dataPath, worldConfig, modelName);
-    auto engineBlob = loadEngine(enginePath.string());

    for (float gpuWeightsPercent : gpuWeightsPercents)
    {
-        auto rt = std::make_shared<TllmRuntime>(engineBlob.data(), engineBlob.size(), gpuWeightsPercent, *logger);
+        auto rt = std::make_shared<TllmRuntime>(RawEngine(enginePath), logger.get(), gpuWeightsPercent);
        rt->addContext(0);
        for (auto inLen : inLens)
        {