v0.2.0

0.2.0 (2024-12-17)

See the accompanying release blog post for an overview of this release.

Features

  • add rotary_dim argument to rope APIs for partial RoPE application (#599) (eb9bc71); see the sketch after this list
  • add a use_softmax field in variant class (#533) (d81af97)
  • add an option non_blocking to plan function (#622) (560af6f)
  • add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
  • add group size 3 to GQA decode dispatch (#558) (6227562)
  • add JIT compilation support for FA3 templates (#672) (d4e8d79)
  • allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
  • CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
  • fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
  • improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
  • JIT compilation (#507) (3613a5b)
  • modify group-gemm stage number (#497) (52dab1d)
  • non-contiguous query with paged kv cache (#553) (89f2c4a)
  • pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
  • simplify prefill JIT compilation (#605) (fe4f898)
  • specify gemm backend (#648) (0cc1a51)
  • support cached cos/sin in rope APIs (#585) (83e541d)
  • support huggingface transformer style rope interface (#568) (4f40420)
  • support sm90 cutlass group gemm (#509) (794bdda)
  • torch custom_op fix for rope (#569) (3e104bc)
  • torch custom_op support: norm (#552) (f6e0010)
  • torch.compile and custom_op support (#554) (9bf916f)
  • warmup for jit kernel tests (#629) (8f5f349)
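
As a pointer for the entries above, here is a minimal sketch of two of the new APIs: the rotary_dim argument for partial RoPE (#599) and gemma_rmsnorm (#477). It assumes the flashinfer Python package; the call and argument names follow the entries and the documentation, but treat the exact signatures as assumptions rather than the authoritative API.

```python
# Minimal sketch only: function/argument names are assumptions based on the
# entries above and may differ from the released API; check the docs.
import torch
import flashinfer

nnz, num_heads, head_dim = 16, 8, 128
q = torch.randn(nnz, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(nnz, num_heads, head_dim, dtype=torch.float16, device="cuda")

# Partial RoPE (#599): rotate only the first `rotary_dim` dims of each head,
# leaving the remaining head_dim - rotary_dim dims untouched.
indptr = torch.tensor([0, nnz], dtype=torch.int32, device="cuda")  # one ragged sequence
offsets = torch.zeros(1, dtype=torch.int32, device="cuda")         # no past-KV offset
q_rope, k_rope = flashinfer.apply_rope(q, k, indptr, offsets, rotary_dim=64)

# Gemma-style RMSNorm (#477): like rmsnorm, but scales by (1 + weight).
x = torch.randn(nnz, 4096, dtype=torch.float16, device="cuda")
w = torch.zeros(4096, dtype=torch.float16, device="cuda")
y = flashinfer.gemma_rmsnorm(x, w, eps=1e-6)
```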

Bug Fixes

Performance Improvements

  • accelerate JIT compilation speed (#618) (eaf73fd)
  • Dense and sparse customizable flashattention-3 template (#667) (51236c9)
  • fix prefill kernel performance degradation (step 1) (#602) (595cf60)
  • fix the performance issue of append_paged_kv_cache (#588) (e15f7c9)
  • improve parallelism in RoPE with pos_ids (#609) (ff05155)
  • improve plan performance by using non-blocking memcpy (#547) (41ebe6d); see the sketch after this list
  • reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
  • reduce total_num_tiles_q by one (#644) (553ace5)
  • remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
  • speedup jit compilation of prefill attention kernels (#632) (a059586)
  • use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
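
Several entries above, e.g. the non-blocking memcpy in plan (#547), target the plan stage of the batch attention wrappers rather than a new user-facing API. Below is a minimal, hedged sketch of the plan/run flow they speed up, using the decode wrapper as a representative example; the wrapper name, argument order, and KV-cache layout follow the FlashInfer documentation as best recalled and should be treated as assumptions, not the authoritative signature.

```python
# Minimal sketch of the plan/run flow that several entries above optimize.
# Wrapper and argument names are assumptions; verify against the released API.
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

batch_size, num_qo_heads, num_kv_heads = 4, 32, 8
head_dim, page_size, pages_per_seq = 128, 16, 2

kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
kv_indices = torch.arange(batch_size * pages_per_seq, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# plan() does host-side scheduling before the kernels run; #547 overlaps its
# host-to-device copies with compute, and #622 exposes a `non_blocking` option.
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
# Paged KV cache in NHD layout: (num_pages, 2, page_size, num_kv_heads, head_dim).
kv_cache = torch.randn(
    batch_size * pages_per_seq, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)
out = wrapper.run(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)
```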