TensorRT 10.7-GA OSS Release (#4269)
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
kevinch-nv authored Dec 5, 2024
1 parent c468d67 commit 17003e4
Showing 81 changed files with 1,411 additions and 530 deletions.
2 changes: 1 addition & 1 deletion .gitmodules
@@ -9,4 +9,4 @@
[submodule "parsers/onnx"]
path = parsers/onnx
url = https://github.com/onnx/onnx-tensorrt.git
branch = release/10.6-GA
branch = release/10.7-GA
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,30 @@
# TensorRT OSS Release Changelog

## 10.7.0 GA - 2024-12-4
Key Feature and Updates:

- Demo Changes
  - demoDiffusion
    - Enabled low-vram for the Flux pipeline. Users can now run the pipelines on systems with 32GB VRAM.
    - Added support for [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) pipeline.
    - Enabled weight streaming mode for Flux pipeline.

- Plugin Changes
  - On Blackwell and later platforms, TensorRT will drop cuDNN support on the following categories of plugins:
    - User-written `IPluginV2Ext`, `IPluginV2DynamicExt`, and `IPluginV2IOExt` plugins that are dependent on cuDNN handles provided by TensorRT (via the `attachToContext()` API).
    - TensorRT standard plugins that use cuDNN, specifically:
      - `InstanceNormalization_TRT` (version: 1, 2, and 3) present in `plugin/instanceNormalizationPlugin/`.
      - `GroupNormalizationPlugin` (version: 1) present in `plugin/groupNormalizationPlugin/`.
    - Note: These normalization plugins are superseded by TensorRT’s native `INormalizationLayer` ([C++](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_normalization_layer.html), [Python](https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/Normalization.html)). TensorRT support for cuDNN-dependent plugins remains unchanged on pre-Blackwell platforms.

- Parser Changes
  - Now prioritizes using plugins over local functions when a corresponding plugin is available in the registry.
  - Added dynamic axes support for `Squeeze` and `Unsqueeze` operations.
  - Added support for parsing mixed-precision `BatchNormalization` nodes in strongly-typed mode.

- Addressed Issues
  - Fixed [4113](https://github.com/NVIDIA/TensorRT/issues/4113).

## 10.6.0 GA - 2024-11-05
Key Feature and Updates:
- Demo Changes
18 changes: 9 additions & 9 deletions README.md
@@ -26,7 +26,7 @@ You can skip the **Build** section to enjoy TensorRT with Python.
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
* TensorRT v10.6.0.26
* TensorRT v10.7.0.23
* Available from direct download links listed below

**System Packages**
@@ -73,25 +73,25 @@ To build the TensorRT-OSS components, you will first need the following software
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.

Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
- [TensorRT 10.6.0.26 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 10.6.0.26 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz)
- [TensorRT 10.6.0.26 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-11.8.zip)
- [TensorRT 10.6.0.26 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip)
- [TensorRT 10.7.0.23 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 10.7.0.23 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-12.6.tar.gz)
- [TensorRT 10.7.0.23 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/zip/TensorRT-10.7.0.23.Windows.win10.cuda-11.8.zip)
- [TensorRT 10.7.0.23 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/zip/TensorRT-10.7.0.23.Windows.win10.cuda-12.6.zip)
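
The four 10.7.0.23 links above share a regular shape, which can be useful when scripting the download step. The following is a minimal sketch, assuming the URL pattern holds for exactly the combinations listed here; the helper name `tensorrt_download_url` is hypothetical and not part of the repository, and other versions or platforms may not follow the same pattern.

```python
def tensorrt_download_url(version: str, cuda: str, platform: str) -> str:
    """Build a direct-download URL shaped like the links listed in the README.

    Assumption: the pattern is inferred only from the four links above.
    This helper is illustrative and not part of TensorRT OSS.
    """
    base = "https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt"
    # The directory uses the three-part release, e.g. "10.7.0.23" -> "10.7.0".
    release = ".".join(version.split(".")[:3])
    if platform == "linux":
        name = f"TensorRT-{version}.Linux.x86_64-gnu.cuda-{cuda}.tar.gz"
        return f"{base}/{release}/tars/{name}"
    if platform == "windows":
        name = f"TensorRT-{version}.Windows.win10.cuda-{cuda}.zip"
        return f"{base}/{release}/zip/{name}"
    raise ValueError(f"unsupported platform: {platform!r}")


if __name__ == "__main__":
    print(tensorrt_download_url("10.7.0.23", "12.6", "linux"))
```

Only `linux` and `windows` are accepted because those are the only two archive layouts that appear in the list.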


**Example: Ubuntu 20.04 on x86-64 with cuda-12.6**

```bash
cd ~/Downloads
tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-10.6.0.26
tar -xvzf TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-12.6.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-10.7.0.23
```

**Example: Windows on x86-64 with cuda-12.6**

```powershell
Expand-Archive -Path TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip
$env:TRT_LIBPATH="$pwd\TensorRT-10.6.0.26\lib"
Expand-Archive -Path TensorRT-10.7.0.23.Windows.win10.cuda-12.6.zip
$env:TRT_LIBPATH="$pwd\TensorRT-10.7.0.23\lib"
```

## Setting Up The Build Environment
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
10.6.0.26
10.7.0.23
2 changes: 1 addition & 1 deletion demo/BERT/README.md
@@ -75,7 +75,7 @@ The following software version configuration has been tested:
|Software|Version|
|--------|-------|
|Python|>=3.8|
|TensorRT|10.6.0.26|
|TensorRT|10.7.0.23|
|CUDA|12.6|

## Setup
35 changes: 28 additions & 7 deletions demo/Diffusion/README.md
@@ -7,7 +7,7 @@ This demo application ("demoDiffusion") showcases the acceleration of Stable Dif
### Clone the TensorRT OSS repository

```bash
git clone git@github.com:NVIDIA/TensorRT.git -b release/10.5 --single-branch
git clone git@github.com:NVIDIA/TensorRT.git -b release/10.7 --single-branch
cd TensorRT
```

@@ -16,7 +16,7 @@ cd TensorRT
Install nvidia-docker using [these instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).

```bash
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.07-py3 /bin/bash
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.10-py3 /bin/bash
```

NOTE: The demo supports CUDA>=11.8
@@ -43,12 +43,12 @@ pip3 install -r requirements.txt

> NOTE: demoDiffusion has been tested on systems with NVIDIA H100, A100, L40, T4, and RTX4090 GPUs, and the following software configuration.
```
diffusers 0.30.2
diffusers 0.31.0
onnx 1.15.0
onnx-graphsurgeon 0.5.2
onnxruntime 1.16.3
polygraphy 0.49.9
tensorrt 10.6.0.26
tensorrt 10.7.0.23
tokenizers 0.13.3
torch 2.2.0
transformers 4.42.2
@@ -66,6 +66,7 @@ python3 demo_img2img.py --help
python3 demo_inpaint.py --help
python3 demo_controlnet.py --help
python3 demo_txt2img_xl.py --help
python3 demo_txt2img_flux.py --help
```

### HuggingFace user access token
@@ -257,23 +258,43 @@ python3 demo_stable_cascade.py --onnx-opset=16 "Anthropomorphic cat dressed as a
### Generate an image guided by a text prompt using Flux

Run the below command to generate an image with FLUX.1 Dev in FP16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
```

Run the below command to generate an image with FLUX in BF16.
Run the below command to generate an image with FLUX.1 Dev in BF16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --bf16
```

Run the below command to generate an image with FLUX in FP8. (FP8 is only supported on Hopper.)
Run the below command to generate an image with FLUX.1 Dev in FP8. (FP8 is supported on Hopper and Ada.)

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp8
```

NOTE: Running the Flux pipeline requires 80GB of GPU memory or higher
Run the below command to generate an image with FLUX.1 Schnell in FP16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell"
```

Run the below command to generate an image with FLUX.1 Schnell in BF16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --bf16
```

Run the below command to generate an image with FLUX.1 Schnell in FP8. (FP8 is supported on Hopper and Ada.)

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --fp8
```

NOTE: Running the FLUX.1 Dev pipeline requires at least 48GB of GPU memory, and the FLUX.1 Schnell pipeline requires at least 24GB.
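
The six commands above differ only in the `--version` value and the precision flag. As a rough sketch of that pattern, the hypothetical helper below (`flux_demo_command` is not part of the demo) assembles the argument list from a variant and a precision. It assumes, as the commands above suggest, that `flux.1-dev` and FP16 are the defaults and therefore need no flag.

```python
import shlex
from typing import List

VERSIONS = ("flux.1-dev", "flux.1-schnell")
PRECISIONS = ("fp16", "bf16", "fp8")


def flux_demo_command(prompt: str, hf_token: str,
                      version: str = "flux.1-dev",
                      precision: str = "fp16") -> List[str]:
    """Assemble the argv for demo_txt2img_flux.py (illustrative helper only)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    argv = ["python3", "demo_txt2img_flux.py", prompt, f"--hf-token={hf_token}"]
    if version != "flux.1-dev":
        # flux.1-dev is the script's default, so only other variants need the flag.
        argv.append(f"--version={version}")
    if precision != "fp16":
        # The FP16 commands above pass no precision flag.
        argv.append(f"--{precision}")
    return argv


if __name__ == "__main__":
    cmd = flux_demo_command(
        "a beautiful photograph of Mt. Fuji during cherry blossom",
        hf_token="<your token>",
        version="flux.1-schnell",
        precision="bf16",
    )
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the prompt intact as one argument when the result is passed to `subprocess.run`.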

## Configuration options
- Noise scheduler can be set using `--scheduler <scheduler>`. Note: not all schedulers are available for every version.
75 changes: 63 additions & 12 deletions demo/Diffusion/demo_txt2img_flux.py
@@ -20,7 +20,12 @@
from cuda import cudart

from flux_pipeline import FluxPipeline
from utilities import PIPELINE_TYPE, add_arguments, process_pipeline_args
from utilities import (
    PIPELINE_TYPE,
    add_arguments,
    process_pipeline_args,
    VALID_OPTIMIZATION_LEVELS,
)


def parse_args():
@@ -32,7 +37,7 @@ def parse_args():
        "--version",
        type=str,
        default="flux.1-dev",
        choices=["flux.1-dev"],
        choices=("flux.1-dev", "flux.1-schnell"),
        help="Version of Flux",
    )
    parser.add_argument(
@@ -65,20 +70,48 @@ def parse_args():
    parser.add_argument(
        "--max_sequence_length",
        type=int,
        default=512,
        help="Maximum sequence length to use with the prompt",
        help="Maximum sequence length to use with the prompt. Can be up to 512 for the dev and 256 for the schnell variant.",
    )
    parser.add_argument(
        "--bf16",
        action='store_true',
        help="Run pipeline in BFloat16 precision"
        "--bf16", action="store_true", help="Run pipeline in BFloat16 precision"
    )
    parser.add_argument(
        "--low-vram",
        action="store_true",
        help="Optimize for low VRAM usage, possibly at the expense of inference performance. Disabled by default.",
    )
    parser.add_argument(
        "--optimization-level",
        type=int,
        default=3,
        help=f"Set the builder optimization level to build the engine with. A higher level allows TensorRT to spend more building time for more optimization options. Must be one of {VALID_OPTIMIZATION_LEVELS}.",
    )
    parser.add_argument(
        "--torch-fallback",
        default=None,
        type=str,
        help="Name list of models to be inferenced using torch instead of TRT. For example --torch-fallback t5,transformer. If --torch-inference set, this parameter will be ignored."
    )

    parser.add_argument(
        "--ws",
        action='store_true',
        help="Optimize for low VRAM usage, possibly at the expense of inference performance. Disabled by default."
        help="Build TensorRT engines with weight streaming enabled."
    )

    parser.add_argument(
        "--t5-ws-percentage",
        type=int,
        default=None,
        help="Set runtime weight streaming budget as the percentage of the size of streamable weights for the T5 model. This argument only takes effect when --ws is set. 0 streams the most weights and 100 or None streams no weights. "
    )

    parser.add_argument(
        "--transformer-ws-percentage",
        type=int,
        default=None,
        help="Set runtime weight streaming budget as the percentage of the size of streamable weights for the transformer model. This argument only takes effect when --ws is set. 0 streams the most weights and 100 or None streams no weights."
    )
    return parser.parse_args()


@@ -100,10 +133,24 @@ def process_demo_args(args):
    if len(prompt2) == 1:
        prompt2 = prompt2 * batch_size

    if args.max_sequence_length is not None and args.max_sequence_length > 512:
        raise ValueError(
            f"`max_sequence_length` cannot be greater than 512 but is {args.max_sequence_length}"
        )
    max_seq_supported_by_model = {
        "flux.1-schnell": 256,
        "flux.1-dev": 512,
    }[args.version]
    if args.max_sequence_length is not None:
        if args.max_sequence_length > max_seq_supported_by_model:
            raise ValueError(
                f"For {args.version}, `max_sequence_length` cannot be greater than {max_seq_supported_by_model} but is {args.max_sequence_length}"
            )
    else:
        args.max_sequence_length = max_seq_supported_by_model

    if args.torch_fallback and not args.torch_inference:
        args.torch_fallback = args.torch_fallback.split(",")

    if args.torch_fallback and args.torch_inference:
        print(f"[W] All models will run in PyTorch when --torch-inference is set. Parameter --torch-fallback will be ignored.")
        args.torch_fallback = None

    args_run_demo = (
        prompt,
@@ -131,6 +178,10 @@ def process_demo_args(args):
        max_sequence_length=args.max_sequence_length,
        bf16=args.bf16,
        low_vram=args.low_vram,
        torch_fallback=args.torch_fallback,
        weight_streaming=args.ws,
        t5_weight_streaming_budget_percentage=args.t5_ws_percentage,
        transformer_weight_streaming_budget_percentage=args.transformer_ws_percentage,
        **kwargs_init_pipeline)

    # Load TensorRT engines and pytorch modules
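
The argument handling this commit adds to `demo_txt2img_flux.py` can be tried in isolation. The sketch below restates the per-variant `--max_sequence_length` limit and the `--torch-fallback` resolution from the hunks above as standalone functions. The function names are hypothetical, and an empty `--torch-fallback` string is normalized to `None` here for simplicity. The last function is an assumption drawn only from the `--t5-ws-percentage` help text: how the pipeline actually hands the budget to TensorRT is not shown in this diff.

```python
from typing import List, Optional

# Limits taken from the max_seq_supported_by_model mapping in the diff.
MAX_SEQ_LEN = {"flux.1-schnell": 256, "flux.1-dev": 512}


def resolve_max_sequence_length(version: str, requested: Optional[int]) -> int:
    """Default to the variant's limit; reject values above it."""
    limit = MAX_SEQ_LEN[version]
    if requested is None:
        return limit
    if requested > limit:
        raise ValueError(
            f"For {version}, `max_sequence_length` cannot be greater than "
            f"{limit} but is {requested}"
        )
    return requested


def resolve_torch_fallback(torch_fallback: Optional[str],
                           torch_inference: bool) -> Optional[List[str]]:
    """Split the comma-separated model list unless --torch-inference overrides it."""
    if not torch_fallback:
        return None
    if torch_inference:
        print("[W] --torch-fallback is ignored when --torch-inference is set.")
        return None
    return torch_fallback.split(",")


def weight_streaming_budget_bytes(streamable_weights_size: int,
                                  percentage: Optional[int]) -> int:
    """Assumed reading of the help text: the budget is a percentage of the
    streamable weight size, and 100 or None means nothing is streamed."""
    if percentage is None:
        return streamable_weights_size
    if not 0 <= percentage <= 100:
        raise ValueError(f"percentage must be in [0, 100], got {percentage}")
    return streamable_weights_size * percentage // 100


if __name__ == "__main__":
    print(resolve_max_sequence_length("flux.1-schnell", None))
    print(resolve_torch_fallback("t5,transformer", torch_inference=False))
    print(weight_streaming_budget_bytes(1000, 25))
```

Keeping the limits in one mapping means a new variant only needs one new entry, which mirrors how the commit replaced the hard-coded 512 check.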