Changelog

0.2.0.post1 (2024-12-22)

Bug Fixes

bug fix on determine_attention_backend condition (#688) (bcf7a3e)
accelerate plan speed of fa3 template (#690) (db8f04d)

0.2.0 (2024-12-17)

Release Blog

FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving

Features

add rotary_dim argument to rope APIs for partial apply rope (#599) (eb9bc71)
add a use_softmax field in variant class (#533) (d81af97)
add an option non_blocking to plan function (#622) (560af6f)
add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
add group size 3 to GQA decode dispatch (#558) (6227562)
add JIT compilation support for FA3 templates (#672) (d4e8d79)
allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
JIT compilation (#507) (3613a5b)
modify group-gemm stage number (#497) (52dab1d)
non-contiguous query with paged kv cache (#553) (89f2c4a)
pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
simplify prefill JIT compilation (#605) (fe4f898)
specify gemm backend (#648) (0cc1a51)
support cached cos/sin in rope APIs (#585) (83e541d)
support huggingface transformer style rope interface (#568) (4f40420)
support sm90 cutlass group gemm (#509) (794bdda)
torch custom_op fix for rope (#569) (3e104bc)
torch custom_op support: norm (#552) (f6e0010)
torch.compile and custom_op support (#554) (9bf916f)
warmup for jit kernel tests (#629) (8f5f349)

Bug Fixes

AOT compiler flags on non-sm90 (#522) (0aa4726)
batch decode kernel redundant store output to gmem (#505) (90e42a7)
compatible with torch 2.2 (#478) (ac41d1b)
#452 (b53a46f)
remove redundant load (#495) (2de16b0)
update bmm fp8 test (#487) (45eac04)

Performance Improvements

accelerate JIT compilation speed (#618) (eaf73fd)
Dense and sparse customizable flashattention-3 template (#667) (51236c9)
fix prefill kernel performance degradation (step 1) (#602) (595cf60)
fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
improve parallelism in RoPE with pos_ids (#609) (ff05155)
improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
reduce total_num_tiles_q by one (#644) (553ace5)
remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
speedup jit compilation of prefill attention kernels (#632) (a059586)
use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)

0.1.6 (2024-08-27)

SM75 Support

Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000, and RTX 2080).

API Changes

`plan`/`run`

Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced by the new plan/run API.

forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's python API.
begin_forward is renamed to plan, which is consistent with the naming convention of nvmath API.
end_forward is deprecated and has no effect after this PR.

The new run API differs slightly from the old forward API:

All extra arguments, such as causal and logits_soft_cap, are now passed to the plan (previously begin_forward) API and cached until the next plan call; the run API only takes the query and KV-Cache tensors.

The old begin_forward/forward/end_forward APIs are still functional, but they will be gradually deprecated in future releases.

Check #466 for more details.
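The call pattern described above can be sketched with a toy, pure-Python wrapper. This is only an illustration of "cache settings in plan, pass tensors in run": `ToyAttentionWrapper` and its internals are hypothetical and are not FlashInfer's implementation or signatures, and the soft-cap formula `cap * tanh(s / cap)` is assumed here as the commonly used formulation.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


class ToyAttentionWrapper:
    """Hypothetical single-head wrapper; NOT FlashInfer's implementation."""

    def __init__(self) -> None:
        # (causal, logits_soft_cap) captured by the most recent plan() call.
        self._plan: Optional[Tuple[bool, float]] = None

    def plan(self, causal: bool = False, logits_soft_cap: float = 0.0) -> None:
        # Cache the extra arguments until the next plan() call.
        self._plan = (causal, logits_soft_cap)

    def run(self, q: Matrix, k: Matrix, v: Matrix) -> Matrix:
        # Only tensors are passed here; settings come from the cached plan.
        if self._plan is None:
            raise RuntimeError("plan() must be called before run()")
        causal, cap = self._plan
        scale = 1.0 / math.sqrt(len(q[0]))
        offset = len(k) - len(q)  # this toy assumes len(k) >= len(q)
        out: Matrix = []
        for i, q_row in enumerate(q):
            # With a causal mask, query i sees KV positions 0..i+offset.
            visible = range(min(len(k), i + offset + 1)) if causal else range(len(k))
            scores = []
            for j in visible:
                s = scale * sum(a * b for a, b in zip(q_row, k[j]))
                if cap > 0.0:
                    # Assumed soft-cap formulation: cap * tanh(s / cap).
                    s = cap * math.tanh(s / cap)
                scores.append(s)
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            denom = sum(weights)
            out.append([
                sum(w * v[j][c] for w, j in zip(weights, visible)) / denom
                for c in range(len(v[0]))
            ])
        return out


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True, logits_soft_cap=30.0)
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
first = wrapper.run(q, k, v)   # reuses the cached plan
second = wrapper.run(q, k, v)  # no need to repeat causal / logits_soft_cap
```

Calling `plan` again with different arguments changes the behavior of all subsequent `run` calls, which is the main difference from passing those arguments on every `forward` call.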

`MultiLevelCascadeAttentionWrapper`

Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference. It supports multi-level cascade inference in which the KV-Cache of all levels can be managed in a unified Paged KV-Cache.

See the documentation and tutorial for API usage and an explanation of the layout.

The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
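Cascade inference relies on combining attention results computed over separate KV segments (for example, a shared prefix and a per-request suffix). A minimal sketch of that general idea, using the standard log-sum-exp merge, is shown below. The helper names `attend` and `merge_state` are hypothetical, and the natural-log convention is an assumption; FlashInfer's kernels may differ in scaling and layout.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, float]:
    """Softmax attention of one query over one KV segment: (output, lse)."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    denom = sum(exps)
    out = [
        sum(e * val[c] for e, val in zip(exps, values)) / denom
        for c in range(len(values[0]))
    ]
    # lse is the log of the softmax denominator over this segment.
    return out, top + math.log(denom)


def merge_state(o_a: Vector, lse_a: float, o_b: Vector, lse_b: float) -> Tuple[Vector, float]:
    """Combine two partial attention states into the state over both segments."""
    top = max(lse_a, lse_b)
    w_a = math.exp(lse_a - top)
    w_b = math.exp(lse_b - top)
    total = w_a + w_b
    merged = [(w_a * x + w_b * y) / total for x, y in zip(o_a, o_b)]
    return merged, top + math.log(total)


query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Attention over the whole KV equals the merge of attention over its two halves.
full_out, full_lse = attend(query, keys, values)
o_a, lse_a = attend(query, keys[:2], values[:2])
o_b, lse_b = attend(query, keys[2:], values[2:])
merged_out, merged_lse = merge_state(o_a, lse_a, o_b, lse_b)
```

Because the merge is associative, the same step can be applied repeatedly across more than two levels.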

Features

sm75 support (#448, #449)
add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
support bmm fp8 (#469) (f1c0b68)

Refactor

refactor: replace begin_forward/forward/end_forward with plan/run (#466)

Misc

misc: improve error handling of sampling kernels (#456) (0dce178)

Performance Improvements

slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
slight optimization on fragment layout swizzle (#458) (7c397cb)
use persistent kernel for merging attention states (#459) (be6bf5b)

Acknowledgement

We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.

0.1.5 (2024-08-13)

Bugfix

resolve weird cu121 compile issue (#446) (5f0159e)
Fix PagedPrefill python api and some typos (#441) (3fff008)
fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)

Features

decouple float and int workspace buffer (#442) (a7ee566)

Performance Improvements

faster fp8->fp16 dequantization for pre sm_90 arch (#439) (c93f647)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.

0.1.4 (2024-08-09)

Features

append attention kernels for fp8 kv-cache (#420) (906c2f5)
support min_p sampling (#422) (d52f2da)
deterministic sampling (#417) (0dd801d)
more sampling operator options (#431) (68df9c4)
support fused add rmsnorm (#419) (b781513)
support fused silu mul (#427) (ea0ba9a)
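For readers unfamiliar with min_p sampling from the list above, here is a minimal sketch of the filtering step under its commonly used definition: tokens whose probability falls below `min_p` times the largest probability are dropped, and the rest are renormalized. `min_p_filter` is a hypothetical helper for illustration; it is not FlashInfer's API and has not been checked against its kernel.

```python
from typing import List


def min_p_filter(probs: List[float], min_p: float) -> List[float]:
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 for 0 <= min_p <= 1: the top token is kept
    return [p / total for p in kept]


probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.4)  # threshold is 0.4 * 0.5 = 0.2
```

A token would then be drawn from `filtered` instead of `probs`. Unlike a fixed top-k, the number of surviving tokens adapts to how peaked the distribution is.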

Bug Fixes

fix dispatch fp16 type when enable fp8 (#430) (daa5566)
improve numerical stability of sampling kernels (#429) (898d8ea)

Other improvements

break up _kernels into multiple modules (#428) (8e482d9)

Acknowledgement

We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.

0.1.3 (2024-07-31)

Bugfix

bugfix: Fix cudagraph mode of BatchPrefillWithRaggedKVCacheWrapper (#412) (9907bc)
fix cu118 cub usage for sampling kernels (#410) (58d359)

Misc

enhance allocator error info and add shape check for prefill begin forward functions (#413) (5e36c5)

0.1.2 (2024-07-29)

Bugfix

Fix the sampling kernel bug for cu118 (#386, #387) (0cd499, dc3f18)

Features

add llama 3.1 style rope (#401) (4c89dec)
non-inplace rope operators (#405) (74ffba1)
sliding window attention (#406) (28cffd3)
support non-contiguous (packed) input for prefill kernels (#404) (68c3719)

Performance Improvements

slight optimization on merge states (#313) (701c813)

0.1.1 (2024-07-20)

Bugfix

fix the invalid kernel configuration for architectures with small shared memory size (#385) (cdac57)

Features

expose decoupled kv-cache to pytorch api (#383) (457a0ae)

Performance Improvements

use stmatrix in epilogue for sm90+ (#380) (c6f20d1)

0.1.0 (2024-07-17)

Features

Add mask to merge_state_in_place (#372) (e14fa81)
expose pytorch api for block sparse attention (#375) (4bba6fa)
Fused GPU sampling kernel for joint top-k & top-p sampling (#374) (6e028eb)

0.0.9 (2024-07-12)

Bugfix

fix the decode kernel segfault in cudagraph mode (#368) (c69cfa)
fix decode kernels output for empty kv cache (#363) (ac72b1)
check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)

Performance Improvements

accelerate alibi (#365) (4f0a9f9)
accelerate gqa performance (#356) (e56ddad)
Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)

Acknowledgement

We thank @Yard1, @Ying1123 and @zhyncs for their contributions.

0.0.8 (2024-07-03)

Bugfix

fix prefill/append kernel behavior for empty kv-cache (#353) (7adc8c)
fix decode attention kernel with logits cap (#350) (f5f7a2)

0.0.7 (2024-06-28)

Breaking Changes

batch_decode_with_padded_kv_cache was removed; we encourage users to use BatchDecodeWithPagedKVCacheWrapper instead. (#343)

Bugfix

fix the forward_return_lse function in BatchPrefillWithRaggedKVCache class (#337)
fix the scheduler behavior of large page size (#333)

Features

customize logits_soft_cap value (#339) (a2498f5)

Performance Improvements

change minimal kv_chunk_size back to 128 (#329) (f237f5f)
more options for kv tile size (#336) (bf2a6c7)

0.0.6 (2024-06-21)

Bugfix

Fix some bugs in v0.0.5 that might lead to crashes and unstable performance.

Performance Improvements

use 1x4 warp layout for small query length (#322) (4e89b4d)

0.0.5 (2024-06-20)

Highlights

Support any GQA group size for tensor-cores kernels.
Support any page size for tensor-cores kernels.
Support CUDA-Graph for prefill/decode APIs.
Add an option to accelerate decode kernels with Tensor Cores.
Support custom attention mask. (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor)
Support logits cap in Grok-1 models.
Fused GPU-sampling kernels: top-p, top-k, speculative verification. (https://docs.flashinfer.ai/api/python/sampling.html)
PyTorch wrapper of group-gemm cutlass kernels. (https://docs.flashinfer.ai/api/python/group_gemm.html)

Acknowledgement

We thank @ibsidorenko, @LiuXiaoxuanPKU, @Yard1, @AgrawalAmey, @xuzhenqi, @mgerstgrasser, @esmeetu, @yz-tang, @HSQ79815, @Qubitium, @shreygupta2809, @sighingnow, @vinx13, @tqchen, @merrymercy, @comaniac and many others for their contributions and helpful discussions for the 0.0.5 release.

Refactor

support any GQA group size for tensor-cores kernels (#301) (c111ca)
support any page size for tensor-cores kernels (#306) (82fd8c)

Features

add use_tensor_cores option to decode kernels to accelerate GQA (#317) (3b50dd5)
add group gemm operators (#282) (e08ba42)
initial support of distributed operators (#289) (03553da)
initial support of logits hook (#298) (ab1e2ad)
Separate Q and KV dtypes for decode (#286) (5602659)
support cuda graph for batched multi-query(prefill/append) attention (#275) (83ceb67)
support cuda graph for batched multi-query(prefill/append) attention (#277) (24cc583)
support custom attention mask in prefill/append attention kernels (#266) (7304282)
fused speculative sampling kernels (#259) (cea2bb)
expose sampling APIs in pytorch (#238) (092902)

Performance Improvements

initial cuda graph support (#256) (7e9cc7f)
split kv-cache for prefill/append kernels (#310) (f0bb0a3)
use packed bit array for attention mask (#308) (3d43dc9)

0.0.4 (2024-05-01)

Features

pytorch 2.3 support
gpu sampling kernels (top-p, top-k)
more gqa group sizes
add mma instructions for fp8 (#179) (d305798)
mma rowsum for fp8 (#180) (5af935c)
support any num_heads for get_alibi_slope (#200) (b217a6f)

Bug Fixes

fix python package dispatch error message (#182) (8eed01c)

0.0.3 (2024-03-08)

Features

adding sm_scale field for all attention APIs (#145) (85d4018)
enable head_dim=256 for attention kernels (#132) (0372acc)
pytorch api of fp8 kv-cache (#156) (66ee066)
support ALiBi (#146) (383518b)

Bug Fixes

bugfix to pr 135 (#136) (3d55c71)
fix bugs introduced in #132 (#135) (9b7b0b9)
fix FindThrust.cmake (#161) (30fa584)

Misc

add stream argument in BeginForwardFunction of TVMWrapper (#164) (fabfcb5)

Performance Improvements

multiply q by sm_scale in decode kernels (#144) (660c559)

0.0.2 (2024-02-17)

Bug Fixes

add python 3.9 wheels to ci/cd (#114) (2d8807d)
version names cannot include multiple + (#118) (af6bd10)
version naming issue (#117) (c849a90)
