Releases: huggingface/text-generation-inference
v1.0.1
Notable changes:
- More GPTQ support
- Rope scaling (linear + dynamic), illustrated in the sketch after this list
- Bitsandbytes 4bits (both modes)
- Added more documentation
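To make the rope scaling item above concrete, here is a rough Python sketch of the difference between linear and dynamic scaling of rotary position embeddings. It illustrates the general technique only, not TGI's implementation; the function names and the dynamic (NTK-style) base formula are assumptions following the widely used convention.

```python
# Illustrative sketch only: linear vs dynamic rope scaling, not TGI's code.
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary inverse frequencies.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def linear_scaled_positions(seq_len: int, factor: float) -> torch.Tensor:
    # Linear scaling: stretch the position index by dividing it by the factor.
    return torch.arange(seq_len, dtype=torch.float32) / factor

def dynamic_ntk_base(seq_len: int, trained_ctx: int, factor: float,
                     dim: int, base: float = 10000.0) -> float:
    # Dynamic (NTK-aware) scaling: grow the rotary base once the sequence
    # exceeds the context length the model was trained with.
    if seq_len <= trained_ctx:
        return base
    return base * ((factor * seq_len / trained_ctx) - (factor - 1)) ** (dim / (dim - 2))
```

Both approaches let a model trained on a short context attend over longer prompts at inference time; linear scaling applies a fixed stretch, while dynamic scaling adapts to the current sequence length.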
What's Changed
- Local gptq support. by @Narsil in #738
- Fix typing in `Model.generate_token` by @jaywonchung in #733
- Adding Rope scaling. by @Narsil in #741
- chore: fix typo in mpt_modeling.py by @eltociear in #737
- fix(server): Failing quantize config after local read. by @Narsil in #743
- Typo fix. by @Narsil in #746
- fix typo for dynamic rotary by @flozi00 in #745
- add FastLinear import by @zspo in #750
- Automatically map deduplicated safetensors weights to their original values (#501) by @Narsil in #761
- feat(server): Add native support for PEFT Lora models by @Narsil in #762
- This should prevent the PyTorch overriding. by @Narsil in #767
- fix build tokenizer in quantize and remove duplicate import by @zspo in #768
- Merge BNB 4bit. by @Narsil in #770
- Fix dynamic rope. by @Narsil in #783
- Fixing non 4bits quantization. by @Narsil in #785
- Update `__init__.py` by @Narsil in #794
- Llama change. by @Narsil in #793
- Setup for doc-builder and docs for TGI by @merveenoyan in #740
- Use destructuring in router arguments to avoid '.0' by @ivarflakstad in #798
- Fix gated docs by @osanseviero in #805
- Minor docs style fixes by @osanseviero in #806
- Added CLI docs and rename docker launch by @merveenoyan in #799
- [docs] Build docs only when doc files change by @mishig25 in #812
- Added ChatUI Screenshot to Docs by @merveenoyan in #823
- Upgrade transformers (fix protobuf==3.20 issue) by @Narsil in #795
- Added streaming for InferenceClient by @merveenoyan in #821 (see the sketch after this list)
- Version 1.0.1 by @Narsil in #836
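For the streaming support documented in #821, below is a minimal sketch of token streaming against a running TGI server. It assumes a recent huggingface_hub release where InferenceClient accepts stream=True; the local URL, prompt, and token limit are placeholders.

```python
# Minimal streaming sketch; assumes a TGI server is already running locally.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")  # placeholder URL

# With stream=True the client yields tokens as they are generated
# instead of returning one final string.
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=64, stream=True
):
    print(token, end="", flush=True)
print()
```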
New Contributors
- @jaywonchung made their first contribution in #733
- @eltociear made their first contribution in #737
- @flozi00 made their first contribution in #745
- @zspo made their first contribution in #750
- @ivarflakstad made their first contribution in #798
- @osanseviero made their first contribution in #805
- @mishig25 made their first contribution in #812
Full Changelog: v1.0.0...v1.0.1
v1.0.0
License change
We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being version 0.9.4.
HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.
What does this mean for you?
This change in source code licensing has no impact on the overwhelming majority of our user community who use TGI for free. Additionally, both our Inference Endpoint customers and those of our commercial partners will also remain unaffected.
However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.
To elaborate further:
- If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.
- If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.
- If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.
- If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at api-enterprise@huggingface.co with information about your service.
For more information, see: #726.
Full Changelog: v0.9.4...v1.0.0
v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using `quantize_config.json` instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
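To illustrate the quantize_config.json change referenced above (#671), the sketch below shows what such a file typically contains and how the bit width and group size can be read from it. The field names follow the common AutoGPTQ convention and the parsing is a plain-Python stand-in, not the server's actual code.

```python
# Hedged example: a typical quantize_config.json and reading it in Python.
import json
from pathlib import Path

example = {
    "bits": 4,          # replaces the former GPTQ_BITS environment variable
    "group_size": 128,  # replaces the former GPTQ_GROUPSIZE environment variable
    "desc_act": False,
    "sym": True,
}
Path("quantize_config.json").write_text(json.dumps(example, indent=2))

cfg = json.loads(Path("quantize_config.json").read_text())
bits, group_size = cfg["bits"], cfg["group_size"]
print(bits, group_size)  # -> 4 128
```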
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in quantize script (`bigcode/starcoder` was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden a bit the weights choice to save on disk
- server: better errors for warmup and TP
- server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non-divisible `vocab_size` (see the sketch after this list)
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
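As a rough illustration of the non-divisible vocab_size sharding mentioned above, the sketch below gives each rank a ceil-divided slice of the embedding table so the last shard simply ends up slightly smaller. The helper is an assumption for illustration, not TGI's tensor-parallel code.

```python
# Toy sketch (not TGI's implementation): split an embedding table whose
# vocab_size is not a multiple of the number of shards.
def shard_bounds(vocab_size: int, world_size: int, rank: int) -> tuple[int, int]:
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return start, stop

print(shard_bounds(32003, 4, 0))  # (0, 8001)
print(shard_bounds(32003, 4, 3))  # (24003, 32003): the last shard is one row shorter
```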
Fix
- server: Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Highlights
- server: Non flash MPT
- server: decrease memory fragmentation
Features
- server: use latest flash attention
- router: add argument for hostname in router
- docs: Adding some help for the options in `text-generation-benchmark`
Fix
- makefile: Update server/Makefile to include Makefile-vllm
- server: Handle loading from local files for MPT
- server: avoid errors for very small top_p values
Full Changelog: v0.9.0...v0.9.1
v0.9.0
Highlights
- server: add paged attention to flash models (see the sketch after this list)
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked
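To give a sense of what paged attention changes, here is a toy Python sketch of the underlying idea: the KV cache is carved into fixed-size blocks and each sequence keeps a block table, so memory is claimed as tokens arrive instead of being reserved for the maximum length up front. The block size, class name, and allocator are illustrative assumptions, not TGI's data structures.

```python
# Toy illustration of the paged KV-cache idea behind paged attention.
BLOCK_SIZE = 16  # tokens per cache block (illustrative value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

def blocks_needed(num_tokens: int) -> int:
    # Ceil division: a 200-token sequence needs 13 blocks of 16 tokens.
    return -(-num_tokens // BLOCK_SIZE)

allocator = BlockAllocator(num_blocks=1024)
block_table = [allocator.allocate() for _ in range(blocks_needed(200))]
# Attention kernels then look up physical cache blocks through block_table.
```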
Features
- launcher: parse oom signals
- server: batch tokenization for flash causal lm
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball
- router: add arg validation
Fix
- docs: CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix warpers on CPU
- server: Fixing T5 in case the names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized
- server: Add the option to force another dtype than `f16`
- launcher: fix issue where launcher does not properly report shard failures
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0
v0.8.2
Features
- server: remove trust_remote_code requirement for falcon models
- server: load santacoder/starcoder models with safetensors
Fix
- server: fix has_position_ids
Full Changelog: v0.8.1...v0.8.2
v0.8.1
Features
- server: add retry on download
Fix
- server: fix bnb quantization for CausalLM models
Full Changelog: v0.8.0...v0.8.1
v0.8.0
Features
- router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
- proto: decrease IPC proto size
- benchmarker: add summary tables
- server: support RefinedWeb models
Fix
- server: Fix issue when loading AutoModelForSeq2SeqLM model (contributed by @CL-Shang)
New Contributors
- @CL-Shang made their first contribution in #370
- @jlamypoirier made their first contribution in #317
Full Changelog: v0.7.0...v0.8.0