Releases: huggingface/text-generation-inference
v1.0.1
Notable changes:
- More GPTQ support
- Rope scaling (linear + dynamic), illustrated in the sketch after this list
- Bitsandbytes 4bits (both modes)
- Added more documentation
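To make the rope scaling item above concrete, here is a rough Python sketch of the difference between linear and dynamic scaling of rotary position embeddings. It illustrates the general technique only, not TGI's implementation; the function names and the dynamic (NTK-style) base formula are assumptions following the widely used convention.

```python
# Illustrative sketch only: linear vs dynamic rope scaling, not TGI's code.
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary inverse frequencies.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

def linear_scaled_positions(seq_len: int, factor: float) -> torch.Tensor:
    # Linear scaling: stretch the position index by dividing it by the factor.
    return torch.arange(seq_len, dtype=torch.float32) / factor

def dynamic_ntk_base(seq_len: int, trained_ctx: int, factor: float,
                     dim: int, base: float = 10000.0) -> float:
    # Dynamic (NTK-aware) scaling: grow the rotary base once the sequence
    # exceeds the context length the model was trained with.
    if seq_len <= trained_ctx:
        return base
    return base * ((factor * seq_len / trained_ctx) - (factor - 1)) ** (dim / (dim - 2))
```

Both approaches let a model trained on a short context attend over longer prompts at inference time; linear scaling applies a fixed stretch, while dynamic scaling adapts to the current sequence length.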
What's Changed
- Local gptq support. by @Narsil in #738
- Fix typing in `Model.generate_token` by @jaywonchung in #733
- Adding Rope scaling. by @Narsil in #741
- chore: fix typo in mpt_modeling.py by @eltociear in #737
- fix(server): Failing quantize config after local read. by @Narsil in #743
- Typo fix. by @Narsil in #746
- fix typo for dynamic rotary by @flozi00 in #745
- add FastLinear import by @zspo in #750
- Automatically map deduplicated safetensors weights to their original values (#501) by @Narsil in #761
- feat(server): Add native support for PEFT Lora models by @Narsil in #762
- This should prevent the PyTorch overriding. by @Narsil in #767
- fix build tokenizer in quantize and remove duplicate import by @zspo in #768
- Merge BNB 4bit. by @Narsil in #770
- Fix dynamic rope. by @Narsil in #783
- Fixing non 4bits quantization. by @Narsil in #785
- Update `__init__.py` by @Narsil in #794
- Llama change. by @Narsil in #793
- Setup for doc-builder and docs for TGI by @merveenoyan in #740
- Use destructuring in router arguments to avoid '.0' by @ivarflakstad in #798
- Fix gated docs by @osanseviero in #805
- Minor docs style fixes by @osanseviero in #806
- Added CLI docs and rename docker launch by @merveenoyan in #799
- [docs] Build docs only when doc files change by @mishig25 in #812
- Added ChatUI Screenshot to Docs by @merveenoyan in #823
- Upgrade transformers (fix protobuf==3.20 issue) by @Narsil in #795
- Added streaming for InferenceClient by @merveenoyan in #821 (see the sketch after this list)
- Version 1.0.1 by @Narsil in #836
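For the streaming support documented in #821, below is a minimal sketch of token streaming against a running TGI server. It assumes a recent huggingface_hub release where InferenceClient accepts stream=True; the local URL, prompt, and token limit are placeholders.

```python
# Minimal streaming sketch; assumes a TGI server is already running locally.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")  # placeholder URL

# With stream=True the client yields tokens as they are generated
# instead of returning one final string.
for token in client.text_generation(
    "What is Deep Learning?", max_new_tokens=64, stream=True
):
    print(token, end="", flush=True)
print()
```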
New Contributors
- @jaywonchung made their first contribution in #733
- @eltociear made their first contribution in #737
- @flozi00 made their first contribution in #745
- @zspo made their first contribution in #750
- @ivarflakstad made their first contribution in #798
- @osanseviero made their first contribution in #805
- @mishig25 made their first contribution in #812
Full Changelog: v1.0.0...v1.0.1
v1.0.0
License change
We are releasing TGI v1.0 under a new license: HFOIL 1.0.
All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 version being version 0.9.4.
HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement.
You can consult the new license here.
What does this mean for you?
This change in source code licensing has no impact on the overwhelming majority of our user community who use TGI for free. Additionally, both our Inference Endpoint customers and those of our commercial partners will also remain unaffected.
However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.
To elaborate further:
- If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.
- If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.
- If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.
- If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at api-enterprise@huggingface.co with information about your service.
For more information, see: #726.
Full Changelog: v0.9.4...v1.0.0
v0.9.4
Features
- server: auto max_batch_total_tokens for flash att models #630
- router: ngrok edge #642
- server: Add trust_remote_code to quantize script by @ChristophRaab #647
- server: Add exllama GPTQ CUDA kernel support #553 #666
- server: Directly load GPTBigCode to specified device by @Atry in #618
- server: add cuda memory fraction #659
- server: Using `quantize_config.json` instead of GPTQ_BITS env variables #671 (see the sketch after this list)
- server: support new falcon config #712
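To illustrate the quantize_config.json change referenced above (#671), the sketch below shows what such a file typically contains and how the bit width and group size can be read from it. The field names follow the common AutoGPTQ convention and the parsing is a plain-Python stand-in, not the server's actual code.

```python
# Hedged example: a typical quantize_config.json and reading it in Python.
import json
from pathlib import Path

example = {
    "bits": 4,          # replaces the former GPTQ_BITS environment variable
    "group_size": 128,  # replaces the former GPTQ_GROUPSIZE environment variable
    "desc_act": False,
    "sym": True,
}
Path("quantize_config.json").write_text(json.dumps(example, indent=2))

cfg = json.loads(Path("quantize_config.json").read_text())
bits, group_size = cfg["bits"], cfg["group_size"]
print(bits, group_size)  # -> 4 128
```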
Fix
- server: llama v2 GPTQ #648
- server: Fixing non-parameters in quantize script (`bigcode/starcoder` was an example) #661
- server: use mem_get_info to get kv cache size #664
- server: fix exllama buffers #689
- server: fix quantization python requirements #708
New Contributors
- @ChristophRaab made their first contribution in #647
- @fxmarty made their first contribution in #648
- @Atry made their first contribution in #618
Full Changelog: v0.9.3...v0.9.4
v0.9.3
Highlights
- server: add support for flash attention v2
- server: add support for llamav2
Features
- launcher: add debug logs
- server: rework the quantization to support all models
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Features
- server: harden a bit the weights choice to save on disk
- server: better errors for warmup and TP
- server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
- server: Implements sharding for non-divisible `vocab_size` (see the sketch after this list)
- launcher: add arg validation and drop subprocess
- router: explicit warning if revision is not set
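As a rough illustration of the non-divisible vocab_size sharding mentioned above, the sketch below gives each rank a ceil-divided slice of the embedding table so the last shard simply ends up slightly smaller. The helper is an assumption for illustration, not TGI's tensor-parallel code.

```python
# Toy sketch (not TGI's implementation): split an embedding table whose
# vocab_size is not a multiple of the number of shards.
def shard_bounds(vocab_size: int, world_size: int, rank: int) -> tuple[int, int]:
    block = (vocab_size + world_size - 1) // world_size  # ceil division
    start = rank * block
    stop = min(start + block, vocab_size)
    return start, stop

print(shard_bounds(32003, 4, 0))  # (0, 8001)
print(shard_bounds(32003, 4, 3))  # (24003, 32003): the last shard is one row shorter
```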
Fix
- server: Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep)
- server: T5 weights names
- server: Adding logger import to t5_modeling.py by @akowalsk
- server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
- server: GPTQ Env vars: catch correct type of error by @ssmi153
- server: blacklist local files
New Contributors
- @akowalsk made their first contribution in #585
- @ssmi153 made their first contribution in #590
- @gary149 made their first contribution in #611
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Highlights
- server: Non flash MPT
- server: decrease memory fragmentation
Features
- server: use latest flash attention
- router: add argument for hostname in router
- docs: Adding some help for the options in `text-generation-benchmark`
Fix
- makefile: Update server/Makefile to include Makefile-vllm
- server: Handle loading from local files for MPT
- server: avoid errors for very small top_p values
Full Changelog: v0.9.0...v0.9.1
v0.9.0
Highlights
- server: add paged attention to flash models (see the sketch after this list)
- server: Inference support for GPTQ (llama + falcon tested) + Quantization script
- server: only compute prefill logprobs when asked
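To give a sense of what paged attention changes, here is a toy Python sketch of the underlying idea: the KV cache is carved into fixed-size blocks and each sequence keeps a block table, so memory is claimed as tokens arrive instead of being reserved for the maximum length up front. The block size, class name, and allocator are illustrative assumptions, not TGI's data structures.

```python
# Toy illustration of the paged KV-cache idea behind paged attention.
BLOCK_SIZE = 16  # tokens per cache block (illustrative value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

def blocks_needed(num_tokens: int) -> int:
    # Ceil division: a 200-token sequence needs 13 blocks of 16 tokens.
    return -(-num_tokens // BLOCK_SIZE)

allocator = BlockAllocator(num_blocks=1024)
block_table = [allocator.allocate() for _ in range(blocks_needed(200))]
# Attention kernels then look up physical cache blocks through block_table.
```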
Features
- launcher: parse oom signals
- server: batch tokenization for flash causal lm
- server: Rework loading
- server: optimize dist ops
- router: add ngrok integration
- server: improve flash attention import errors
- server: Refactor conversion logic
- router: add header option to disable buffering for the generate_stream response by @rkimball
- router: add arg validation
Fix
- docs: CUDA_VISIBLE_DEVICES comment by @antferdom
- docs: Fix typo and use POSIX comparison in the makefile by @piratos
- server: fix warpers on CPU
- server: Fixing T5 in case the names are mixed up
- router: add timeout on flume sends
- server: Do not init process group if already initialized
- server: Add the option to force another dtype than `f16`
- launcher: fix issue where launcher does not properly report shard failures
New Contributors
- @antferdom made their first contribution in #441
- @piratos made their first contribution in #443
- @Yard1 made their first contribution in #388
- @rkimball made their first contribution in #498
Full Changelog: v0.8.2...v0.9.0
v0.8.2
Features
- server: remove trust_remote_code requirement for falcon models
- server: load santacoder/starcoder models with safetensors
Fix
- server: fix has_position_ids
Full Changelog: v0.8.1...v0.8.2
v0.8.1
Features
- server: add retry on download
Fix
- server: fix bnb quantization for CausalLM models
Full Changelog: v0.8.0...v0.8.1
v0.8.0
Features
- router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
- proto: decrease IPC proto size
- benchmarker: add summary tables
- server: support RefinedWeb models
Fix
- server: Fix issue when loading AutoModelForSeq2SeqLM model (contributed by @CL-Shang)
New Contributors
- @CL-Shang made their first contribution in #370
- @jlamypoirier made their first contribution in #317
Full Changelog: v0.7.0...v0.8.0