Update TensorRT-LLM Release branch (#1192)
* Update TensorRT-LLM

---------

Co-authored-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
kaiyux and Shixiaowei02 authored Feb 29, 2024
1 parent 2f169d1 commit 5955b8a
Showing 1,337 changed files with 3,804,632 additions and 2,009,981 deletions.
2 changes: 2 additions & 0 deletions .dockerignore
@@ -1,5 +1,7 @@
build
cpp/*build*
cpp/cmake-*
cpp/.ccache
cpp/tests/resources/models
tensorrt_llm/libs
**/__pycache__
116 changes: 116 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -0,0 +1,116 @@
name: "Bug Report"
description: Submit a bug report to help us improve TensorRT-LLM
labels: [ "bug" ]
body:
  - type: textarea
    id: system-info
    attributes:
      label: System Info
      description: Please share your system info with us.
      placeholder: |
        - CPU architecture (e.g., x86_64, aarch64)
        - CPU/Host memory size (if known)
        - GPU properties
          - GPU name (e.g., NVIDIA H100, NVIDIA A100, NVIDIA L40S)
          - GPU memory size (if known)
          - Clock frequencies used (if applicable)
        - Libraries
          - TensorRT-LLM branch or tag (e.g., main, v0.7.1)
          - TensorRT-LLM commit (if known)
          - Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used
          - Container used (if running TensorRT-LLM in a container)
        - NVIDIA driver version
        - OS (Ubuntu 22.04, CentOS 7, Windows 10)
        - Any other information that may be useful in reproducing the bug
    validations:
      required: true

  - type: textarea
    id: who-can-help
    attributes:
      label: Who can help?
      description: |
        To expedite the response to your issue, it would be helpful if you could identify the appropriate person
        to tag using the **@** symbol. Here is a general guideline on **whom to tag**.
        Rest assured that all issues are reviewed by the core maintainers. If you are unsure about whom to tag,
        you can leave it blank, and a core maintainer will make sure to involve the appropriate person.
        Please tag fewer than 3 people.
        Quantization: @Tracin
        Documentation: @juney-nvidia
        Feature request: @ncomly-nvidia
        Performance: @kaiyux
        Others: @byshiue
      placeholder: "@Username ..."

  - type: checkboxes
    id: information-scripts-examples
    attributes:
      label: Information
      description: 'The problem arises when using:'
      options:
        - label: "The official example scripts"
        - label: "My own modified scripts"

  - type: checkboxes
    id: information-tasks
    attributes:
      label: Tasks
      description: "The tasks I am working on are:"
      options:
        - label: "An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)"
        - label: "My own task or dataset (give details below)"

  - type: textarea
    id: reproduction
    validations:
      required: true
    attributes:
      label: Reproduction
      description: |
        Kindly share a code example that demonstrates the issue you encountered. It is recommended to provide a code snippet directly.
        Additionally, if you have any error messages, or stack traces related to the problem, please include them here.
        Remember to use code tags to properly format your code. You can refer to the
        link https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting for guidance on code formatting.
        Please refrain from using screenshots, as they can be difficult to read and prevent others from copying and pasting your code.
        It would be most helpful if we could reproduce your issue by simply copying and pasting your scripts and code.
      placeholder: |
        Steps to reproduce the behavior:
        1.
        2.
        3.
  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: "Provide a brief summary of the expected behavior of the software. Provide output files or examples if possible."

  - type: textarea
    id: actual-behavior
    validations:
      required: true
    attributes:
      label: actual behavior
      description: "Describe the actual behavior of the software and how it deviates from the expected behavior. Provide output files or examples if possible."

  - type: textarea
    id: additioanl-notes
    validations:
      required: true
    attributes:
      label: additional notes
      description: "Provide any additional context here you think might be useful for the TensorRT-LLM team to help debug this issue (such as experiments done, potential things to investigate)."
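
The "System Info" field of the template above asks for details that reporters often gather by hand. As a rough illustration only (this script is not part of the commit), the sketch below collects a few of them with the Python standard library. The helper names `run` and `collect_system_info` are hypothetical, the `nvidia-smi` query assumes the tool is on `PATH`, and the TensorRT-LLM version is read only if the package happens to be importable.

```python
import platform
import shutil
import subprocess


def run(cmd: list[str]) -> str:
    """Run a command and return its output, or a short note if it is unavailable."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]} not found"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]} failed: {exc}"
    return result.stdout.strip() or result.stderr.strip()


def collect_system_info() -> dict[str, str]:
    """Gather a subset of the fields requested by the bug report template."""
    info = {
        "CPU architecture": platform.machine(),
        "OS": platform.platform(),
        "Python": platform.python_version(),
        # Assumption: nvidia-smi is installed; otherwise a note is recorded.
        "GPU / driver": run([
            "nvidia-smi",
            "--query-gpu=name,memory.total,driver_version",
            "--format=csv,noheader",
        ]),
    }
    try:
        # Assumption: the package exposes a version string when importable.
        import tensorrt_llm
        info["TensorRT-LLM"] = str(getattr(tensorrt_llm, "__version__", "unknown"))
    except Exception:
        info["TensorRT-LLM"] = "not importable"
    return info


if __name__ == "__main__":
    # Print the result as a Markdown list that can be pasted into the issue.
    for key, value in collect_system_info().items():
        print(f"- {key}: {value}")
```

The output would still need to be completed by hand with items the script does not cover, such as the container image and the TensorRT, AMMO, CUDA, and cuBLAS versions.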
6 changes: 4 additions & 2 deletions .pre-commit-config.yaml
@@ -15,7 +15,8 @@ repos:
     rev: v4.1.0
     hooks:
       - id: check-added-large-files
-        exclude: 'cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/'
+        exclude: |
+          (?x)^(.*cubin.cpp)$
       - id: check-merge-conflict
       - id: check-symlinks
       - id: detect-private-key
@@ -45,4 +46,5 @@ repos:
         args:
           - --skip=".git,3rdparty"
           - --exclude-file=examples/whisper/tokenizer.py
-          - --ignore-words-list=rouge,inout,atleast,strat
+          - --ignore-words-list=rouge,inout,atleast,strat,nd
+        exclude: 'tests/llm-test-defs/turtle/test_input_files'
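
The new `exclude` value for `check-added-large-files` replaces a single directory path with a verbose-mode regular expression. The sketch below shows which paths that pattern would skip, assuming pre-commit applies `exclude` as a Python regex searched against the repository-relative path. The helper `is_excluded` and the sample paths are hypothetical and chosen only for illustration.

```python
import re

# Pattern copied from the hunk above. YAML's `|` block scalar leaves a
# trailing newline, which the (?x) verbose flag ignores as whitespace.
EXCLUDE = re.compile("(?x)^(.*cubin.cpp)$\n")


def is_excluded(path: str) -> bool:
    """Return True if the pattern matches `path` (assumes re.search semantics)."""
    return EXCLUDE.search(path) is not None


# Hypothetical repository-relative paths.
samples = [
    "cpp/kernels/fmha_sm80.cubin.cpp",      # ends in cubin.cpp: skipped
    "cpp/kernels/fmha_sm80.cubin.cpp.bak",  # extra suffix: still checked
    "cpp/kernels/fmha.cpp",                 # no "cubin": still checked
    "cpp/kernels/cubinXcpp",                # the unescaped "." matches any character
]

for path in samples:
    print(f"{path}: {'skipped' if is_excluded(path) else 'checked'}")
```

Because the dot in `cubin.cpp` is not escaped, the pattern is slightly broader than a literal suffix match. In practice this is unlikely to matter, but `cubin\.cpp` would express the literal suffix exactly.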
36 changes: 36 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,41 @@
# Change Log

## Versions 0.7.0 / 0.7.1

* Models
  - BART and mBART support in encoder-decoder models
  - FairSeq Neural Machine Translation (NMT) family
  - Mixtral-8x7B model
  - Support weight loading for HuggingFace Mixtral model
  - OpenAI Whisper
  - Mixture of Experts support
  - MPT - Int4 AWQ / SmoothQuant support
  - Baichuan FP8 quantization support
* Features
  - [Preview] Speculative decoding
  - Add Python binding for `GptManager`
  - Add a Python class `ModelRunnerCpp` that wraps C++ `gptSession`
  - System prompt caching
  - Enable split-k for weight-only cutlass kernels
  - FP8 KV cache support for XQA kernel
  - New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
  - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API (thanks to the contribution from @zhang-ge-hao)
  - fMHA support for chunked attention and paged KV cache
* Bug fixes
  - Fix tokenizer usage in quantize.py #288, thanks to the contribution from @0xymoro
  - Fix LLaMA with LoRA error #637
  - Fix LLaMA GPTQ failure #580
  - Fix Python binding for InferenceRequest issue #528
  - Fix CodeLlama SQ accuracy issue #453
* Performance
  - MMHA optimization for MQA and GQA
  - LoRA optimization: cutlass grouped gemm
  - Optimize Hopper warp specialized kernels
  - Optimize AllReduce for parallel attention on Falcon and GPT-J
  - Enable split-k for weight-only cutlass kernel when SM>=75
* Documentation
  - Add [documentation for new builder workflow](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/new_workflow.md)

## Versions 0.6.0 / 0.6.1

* Models