Releases: flashinfer-ai/flashinfer
v0.2.0.post1
0.2.0.post1 (2024-12-22)
Bug Fixes
v0.2.0
0.2.0 (2024-12-17)
Features
- add rotary_dim argument to rope APIs for partial apply rope (#599) (eb9bc71)
- add a use_softmax field in variant class (#533) (d81af97)
- add an option non_blocking to plan function (#622) (560af6f)
- add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
- add group size 3 to GQA decode dispatch (#558) (6227562)
- add JIT compilation support for FA3 templates (#672) (d4e8d79)
- allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
- CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
- fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
- improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
- JIT compilation (#507) (3613a5b)
- modify group-gemm stage number (#497) (52dab1d)
- non-contiguous query with paged kv cache (#553) (89f2c4a)
- pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
- simplify prefill JIT compilation (#605) (fe4f898)
- specify gemm backend (#648) (0cc1a51)
- support cached cos/sin in rope APIs (#585) (83e541d)
- support huggingface transformer style rope interface (#568) (4f40420)
- support sm90 cutlass group gemm (#509) (794bdda)
- torch custom_op fix for rope (#569) (3e104bc)
- torch custom_op support: norm (#552) (f6e0010)
- torch.compile and custom_op support (#554) (9bf916f)
- warmup for jit kernel tests (#629) (8f5f349)
Bug Fixes
- AOT compiler flags on non-sm90 (#522) (0aa4726)
- batch decode kernel redundant store output to gmem (#505) (90e42a7)
- compatible with torch 2.2 (#478) (ac41d1b)
- #452 (b53a46f)
- remove redundant load (#495) (2de16b0)
- update bmm fp8 test (#487) (45eac04)
Performance Improvements
- accelerate JIT compilation speed (#618) (eaf73fd)
- Dense and sparse customizable flashattention-3 template (#667) (51236c9)
- fix prefill kernel performance degradation (step 1) (#602) (595cf60)
- fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
- improve parallelism in RoPE with pos_ids (#609) (ff05155)
- improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
- reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
- reduce total_num_tiles_q by one (#644) (553ace5)
- remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
- speedup jit compilation of prefill attention kernels (#632) (a059586)
- use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
v0.1.6
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
API Changes
plan/run
Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run APIs.
- forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's python API.
- begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API.
- end_forward is deprecated and has no effect after this PR.
There is a slight difference between the old forward and the new run API:
- All extra arguments such as causal and logits_soft_cap are now provided in the plan (previously begin_forward) API and cached until the next plan call; only the query and KV-Cache tensors need to be provided in the run API.
The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
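For illustration, below is a minimal sketch of the new flow on a BatchPrefillWithPagedKVCacheWrapper. The sizes are arbitrary, and the exact plan argument order and KV-cache layout are assumptions based on the documented paged KV-cache interface; treat this as a sketch and refer to the official documentation and #466 for the authoritative signatures.

```python
import torch
import flashinfer

# Illustrative sizes only.
page_size, num_qo_heads, num_kv_heads, head_dim = 16, 8, 8, 128
q_lens = torch.tensor([7, 5], dtype=torch.int32)           # query tokens per request
kv_page_counts = torch.tensor([3, 2], dtype=torch.int32)   # KV pages per request
total_pages = int(kv_page_counts.sum())
device = "cuda"

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device=device)
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# Ragged index arrays describing the batch and the paged KV-cache.
qo_indptr = torch.cat([torch.zeros(1, dtype=torch.int32), q_lens.cumsum(0).int()]).to(device)
kv_indptr = torch.cat([torch.zeros(1, dtype=torch.int32), kv_page_counts.cumsum(0).int()]).to(device)
kv_indices = torch.arange(total_pages, dtype=torch.int32, device=device)
kv_last_page_len = torch.tensor([9, 16], dtype=torch.int32, device=device)

q = torch.randn(int(q_lens.sum()), num_qo_heads, head_dim,
                dtype=torch.float16, device=device)
kv_cache = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device=device)

# New API: extra options (causal, logits_soft_cap, ...) go into plan()
# and stay cached until the next plan() call.
wrapper.plan(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    causal=True,
)
out = wrapper.run(q, kv_cache)  # only query and KV-cache tensors here

# Deprecated flow for comparison (still functional):
#   wrapper.begin_forward(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
#                         num_qo_heads, num_kv_heads, head_dim, page_size)
#   out = wrapper.forward(q, kv_cache, causal=True)
#   wrapper.end_forward()
```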
MultiLevelCascadeAttentionWrapper
Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.
See the documentation and tutorial for API usage and a layout explanation.
The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
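As a rough two-level sketch (one shared prefix plus per-request suffixes stored in a unified paged KV-cache): the constructor arguments and the per-level lists passed to plan below are assumptions based on the description above, so consult the documentation and tutorial for the exact interface.

```python
import torch
import flashinfer

page_size, num_qo_heads, num_kv_heads, head_dim = 16, 8, 8, 128
q_lens = torch.tensor([7, 5], dtype=torch.int32)               # query tokens per request
shared_pages = 4                                                # pages holding the shared prefix
unique_page_counts = torch.tensor([3, 2], dtype=torch.int32)   # suffix pages per request
total_pages = shared_pages + int(unique_page_counts.sum())
total_q = int(q_lens.sum())
device = "cuda"

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device=device)
wrapper = flashinfer.MultiLevelCascadeAttentionWrapper(2, workspace, "NHD")

# Level 0: every query token attends to the single shared prefix.
shared_qo_indptr = torch.tensor([0, total_q], dtype=torch.int32, device=device)
shared_kv_indptr = torch.tensor([0, shared_pages], dtype=torch.int32, device=device)
shared_kv_indices = torch.arange(shared_pages, dtype=torch.int32, device=device)
shared_kv_last_page_len = torch.tensor([page_size], dtype=torch.int32, device=device)

# Level 1: per-request unique suffixes, stored after the shared pages
# in the same unified paged KV-cache.
unique_qo_indptr = torch.cat(
    [torch.zeros(1, dtype=torch.int32), q_lens.cumsum(0).int()]).to(device)
unique_kv_indptr = torch.cat(
    [torch.zeros(1, dtype=torch.int32), unique_page_counts.cumsum(0).int()]).to(device)
unique_kv_indices = shared_pages + torch.arange(
    int(unique_page_counts.sum()), dtype=torch.int32, device=device)
unique_kv_last_page_len = torch.tensor([9, 16], dtype=torch.int32, device=device)

q = torch.randn(total_q, num_qo_heads, head_dim, dtype=torch.float16, device=device)
kv_cache = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device=device)

# One indptr/indices/last_page_len array per cascade level.
wrapper.plan(
    [shared_qo_indptr, unique_qo_indptr],
    [shared_kv_indptr, unique_kv_indptr],
    [shared_kv_indices, unique_kv_indices],
    [shared_kv_last_page_len, unique_kv_last_page_len],
    num_qo_heads, num_kv_heads, head_dim, page_size,
)
out = wrapper.run(q, kv_cache)
```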
Features
- sm75 support (#448, #449)
- add MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
- add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
- support bmm fp8 (#469) (f1c0b68)
Refactor
- refactor: replace begin_forward/forward/end_forward with plan/run (#466)
Misc
Performance Improvements
- slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
- slight optimization on fragment layout swizzle (#458) (7c397cb)
- use persistent kernel for merging attention states (#459) (be6bf5b)
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for suggestions on the API change, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.
v0.1.5
0.1.5 (2024-08-13)
Bugfix
- Fix PagedPrefill python api and some typos (#441) (3fff008)
- fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)
Features
Performance Improvements
Acknowledgement
We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.
v0.1.4
0.1.4 (2024-08-09)
Features
- append attention kernels for fp8 kv-cache (#420) (906c2f5)
- support min_p sampling (#422) (d52f2da)
- deterministic sampling (#417) (0dd801d)
- more sampling operator options (#431) (68df9c4)
- support fused add rmsnorm (#419) (b781513)
- support fused silu mul (#427) (ea0ba9a)
- support fused gelu tanh mul (#434) (2c9d1c3)
Bug Fixes
- fix dispatch fp16 type when enable fp8 (#430) (daa5566)
- improve numerical stability of sampling kernels (#429) (898d8ea)
Other improvements
Acknowledgement
We thank the community for their contributions and feedback: @comaniac, @esmeetu, @LiuXiaoxuanPKU, @peng1999, @xslingcn, @Yard1, @zhyncs.
v0.1.3
v0.1.2
v0.1.1
v0.1.0
v0.0.9
0.0.9 (2024-07-12)
Bugfix
- fix decode kernels output for empty kv cache (#363) (ac72b1)
- check gpu id in PyTorch APIs and use input tensor's gpu default stream (#361) (1b84fa)
Performance Improvements
- accelerate alibi (#365) (4f0a9f9)
- accelerate gqa performance (#356) (e56ddad)
- Optimize tensor conversions in C++ code to avoid unnecessary copies (#366) (1116237)
Acknowledgement
We thank @Yard1, @Ying1123 and @zhyncs for their contributions.