
Releases: huggingface/text-generation-inference

v1.0.1

14 Aug 09:24
09eca64

Notable changes:

  • More GPTQ support
  • RoPE scaling (linear + dynamic; a sketch of both modes follows this list)
  • bitsandbytes 4-bit quantization (both modes)
  • Added more documentation
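
The RoPE scaling item covers the two strategies exposed for extending context length. As a rough illustration (not TGI's actual code; function names are made up), linear scaling interpolates positions while dynamic (NTK-aware) scaling grows the rotary base once the sequence exceeds the trained context:

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies for one attention head."""
    return [1.0 / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def linear_scaled_position(position: int, factor: float) -> float:
    """Linear scaling: positions are divided by the scaling factor."""
    return position / factor

def dynamic_ntk_base(base: float, head_dim: int, seq_len: int,
                     max_position: int, factor: float) -> float:
    """Dynamic (NTK-aware) scaling: keep positions as-is but enlarge the
    rotary base once the sequence grows past the trained context length."""
    if seq_len <= max_position:
        return base
    scale = (factor * seq_len / max_position) - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))
```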

Full Changelog: v1.0.0...v1.0.1

v1.0.0

28 Jul 15:47
3ef5ffb

License change

We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being version 0.9.4.

HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.

What does this mean for you?

This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Our Inference Endpoints customers and the customers of our commercial partners also remain unaffected.

However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.

To elaborate further:

  • If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.

  • If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.

  • If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.

  • If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at api-enterprise@huggingface.co with information about your service.

For more information, see: #726.

Full Changelog: v0.9.4...v1.0.0

v0.9.4

27 Jul 17:29
9f18f4c

Features

  • server: auto max_batch_total_tokens for flash attention models #630
  • router: ngrok edge #642
  • server: Add trust_remote_code to quantize script by @ChristophRaab #647
  • server: Add exllama GPTQ CUDA kernel support #553 #666
  • server: Directly load GPTBigCode to specified device by @Atry in #618
  • server: add cuda memory fraction #659
  • server: Use quantize_config.json instead of GPTQ_BITS env variables #671 (sketch after this list)
  • server: support new falcon config #712
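
The quantize_config.json change (#671) reads GPTQ parameters from the file shipped with the checkpoint instead of requiring environment variables. A minimal sketch of that lookup, assuming the usual AutoGPTQ field names and an env-var fallback for older setups:

```python
import json
import os
from pathlib import Path

def load_gptq_params(model_dir: str) -> tuple[int, int]:
    """Return (bits, group_size), preferring quantize_config.json."""
    config_path = Path(model_dir) / "quantize_config.json"
    if config_path.exists():
        config = json.loads(config_path.read_text())
        # "bits" and "group_size" are the usual AutoGPTQ field names.
        return config["bits"], config["group_size"]
    # Fall back to the legacy environment variables.
    return int(os.environ["GPTQ_BITS"]), int(os.environ["GPTQ_GROUPSIZE"])
```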

Fix

  • server: llama v2 GPTQ #648
  • server: Fix non-parameter tensors in the quantize script (bigcode/starcoder was an example) #661
  • server: use mem_get_info to get KV cache size #664 (sketch after this list)
  • server: fix exllama buffers #689
  • server: fix quantization python requirements #708
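
For the KV cache sizing fix (#664), torch.cuda.mem_get_info reports free and total device memory, which can then be turned into a block budget. A simplified sketch with a made-up block layout, not TGI's exact accounting:

```python
import torch

def kv_cache_blocks(block_size: int, num_layers: int, num_heads: int,
                    head_dim: int, dtype_bytes: int = 2,
                    memory_fraction: float = 0.9) -> int:
    """Estimate how many KV-cache blocks fit in the currently free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * memory_fraction)
    # Each block stores keys and values (hence the factor 2) for every layer.
    bytes_per_block = 2 * block_size * num_layers * num_heads * head_dim * dtype_bytes
    return budget // bytes_per_block
```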

Full Changelog: v0.9.3...v0.9.4

v0.9.3

18 Jul 16:53
5e6ddfd

Highlights

  • server: add support for flash attention v2 (call sketch below)
  • server: add support for Llama v2
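
For reference, a hedged sketch of what a flash-attention-v2 call looks like through the flash-attn package, with the library's (batch, seq_len, num_heads, head_dim) tensor layout; TGI's own integration wraps this differently:

```python
import torch
from flash_attn import flash_attn_func

# Dummy half-precision tensors on GPU: batch=1, seq_len=128, 16 heads of dim 64.
q = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 16, 64, device="cuda", dtype=torch.float16)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```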

Features

  • launcher: add debug logs
  • server: rework the quantization to support all models

Full Changelog: v0.9.2...v0.9.3

v0.9.2

14 Jul 14:36
c58a0c1

Features

  • server: harden the selection of which weights to save on disk
  • server: better errors for warmup and TP
  • server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
  • server: Implements sharding for non-divisible vocab_size (sketch after this list)
  • launcher: add arg validation and drop subprocess
  • router: explicit warning if revision is not set
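
The vocab_size sharding item handles embeddings whose vocabulary does not divide evenly across shards. A minimal sketch of the usual approach, pad to a multiple of the world size and let the last shard be short (names are illustrative, not TGI's API):

```python
import math

def shard_bounds(vocab_size: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of the vocabulary owned by this rank."""
    padded = math.ceil(vocab_size / world_size) * world_size
    shard_size = padded // world_size
    start = rank * shard_size
    end = min(start + shard_size, vocab_size)  # the last shard may be smaller
    return start, end
```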

Fix

  • server: Fix RW code (it's remote code, so the architecture check can't determine which weights to keep)
  • server: T5 weights names
  • server: Adding logger import to t5_modeling.py by @akowalsk
  • server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
  • server: GPTQ Env vars: catch correct type of error by @ssmi153
  • server: blacklist local files

Full Changelog: v0.9.1...v0.9.2

v0.9.1

06 Jul 14:09
31b36cc

Highlights

  • server: Non flash MPT
  • server: decrease memory fragmentation

Features

  • server: use latest flash attention
  • router: add argument for hostname in router
  • docs: Adding some help for the options in text-generation-benchmark

Fix

  • makefile: Update server/Makefile to include Makefile-vllm
  • server: Handle loading from local files for MPT
  • server: avoid errors for very small top_p values
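
The top_p fix boils down to never filtering out every candidate token. A sketch of a top-p filter with that guard, in the style of the usual logits-warper implementations (not TGI's exact code):

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, but always keep at least the most likely token."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[..., 1:] = remove[..., :-1].clone()  # shift so the threshold token is kept
    remove[..., 0] = False                      # guard: never drop every token
    remove = remove.scatter(-1, sorted_idx, remove)
    return logits.masked_fill(remove, float("-inf"))
```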

Full Changelog: v0.9.0...v0.9.1

v0.9.0

01 Jul 17:26
e28a809

Highlights

  • server: add paged attention to flash models (block-table sketch below)
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked
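
Paged attention splits the KV cache into fixed-size blocks and gives every sequence a block table mapping logical positions to physical blocks, so memory is allocated on demand instead of reserved per sequence up front. A toy sketch of the bookkeeping (not TGI's implementation):

```python
class PagedKVCache:
    """Toy block allocator illustrating the paged-KV-cache idea."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # sequence id -> physical block ids

    def slot_for_next_token(self, seq_id: int, seq_len: int) -> tuple[int, int]:
        """Return (physical_block, offset) where the next token's KV is stored."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len % self.block_size == 0:  # current block is full: grab a fresh one
            table.append(self.free_blocks.pop())
        return table[-1], seq_len % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```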

Features

  • launcher: parse oom signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball (client-side sketch after this list)
  • router: add arg validation
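
The generate_stream endpoint mentioned above streams tokens as server-sent events. A hedged client-side sketch of consuming it with requests; the payload and event field names follow TGI's documented API and should be checked against your version:

```python
import json
import requests

def stream_generate(base_url: str, prompt: str, max_new_tokens: int = 64):
    """Yield generated token texts from a running TGI server."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    with requests.post(f"{base_url}/generate_stream", json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line.startswith(b"data:"):
                event = json.loads(line[len(b"data:"):])
                yield event["token"]["text"]

# Example usage against a local server:
# print("".join(stream_generate("http://localhost:8080", "What is Deep Learning?")))
```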

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16
  • launcher: fix issue where launcher does not properly report shard failures

Full Changelog: v0.8.2...v0.9.0

v0.8.2

01 Jun 17:51

Features

  • server: remove trust_remote_code requirement for falcon models
  • server: load santacoder/starcoder models with safetensors
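
Loading santacoder/starcoder weights from safetensors relies on the safetensors library, which memory-maps the file so individual tensors can be read lazily. A minimal sketch of the library call (TGI's surrounding sharding logic is omitted):

```python
from safetensors import safe_open

def load_tensor(path: str, name: str, device: str = "cpu"):
    """Read a single tensor from a .safetensors file without loading the rest."""
    with safe_open(path, framework="pt", device=device) as f:
        return f.get_tensor(name)
```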

Fix

  • server: fix has_position_ids

Full Changelog: v0.8.1...v0.8.2

v0.8.1

31 May 10:10

Features

  • server: add retry on download

Fix

  • server: fix bnb quantization for CausalLM models

Full Changelog: v0.8.0...v0.8.1

v0.8.0

30 May 16:45

Features

  • router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
  • proto: decrease IPC proto size
  • benchmarker: add summary tables
  • server: support RefinedWeb models

Fix

  • server: Fix issue when loading AutoModelForSeq2SeqLM models (contributed by @CL-Shang)

Full Changelog: v0.7.0...v0.8.0