Fix scan / sm90 perf regression #3236

Merged Jan 2, 2025

@gevtushenko gevtushenko commented Jan 2, 2025

Description

Fixes a regression introduced in #3138.

We accidentally dropped the load_algorithm and store_algorithm member variables from the sm90 tuning. As a result, SFINAE always selected the default tuning on Hopper. Shijie Chen embedded the missing fields in every specialization, so the proper Hopper tunings are no longer SFINAEd out.
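
To illustrate the failure mode described above, here is a minimal sketch of member detection with a fallback. It is not the actual CUB dispatch code: the enumerators are local stand-ins, and `default_tuning`, `incomplete_sm90_tuning`, `complete_sm90_tuning`, `has_all_fields`, `select_tuning`, and the `items` values are hypothetical names and numbers chosen for the example.

```cpp
#include <type_traits>

// Local stand-ins for illustration only; these are not the CUB definitions.
enum BlockLoadAlgorithm { BLOCK_LOAD_DIRECT, BLOCK_LOAD_WARP_TRANSPOSE };
enum BlockStoreAlgorithm { BLOCK_STORE_DIRECT, BLOCK_STORE_WARP_TRANSPOSE };

// C++11-compatible void_t.
template <class...>
struct make_void { using type = void; };
template <class... Ts>
using void_t = typename make_void<Ts...>::type;

// Hypothetical fallback used when an architecture-specific tuning is unusable.
struct default_tuning
{
  static constexpr int items = 4;
  static constexpr BlockLoadAlgorithm load_algorithm   = BLOCK_LOAD_DIRECT;
  static constexpr BlockStoreAlgorithm store_algorithm = BLOCK_STORE_DIRECT;
};

// Hypothetical tuning that lost its load/store members (the regression).
struct incomplete_sm90_tuning
{
  static constexpr int items = 24;
};

// Hypothetical tuning with every expected member present (the fix).
struct complete_sm90_tuning
{
  static constexpr int items = 24;
  static constexpr BlockLoadAlgorithm load_algorithm   = BLOCK_LOAD_WARP_TRANSPOSE;
  static constexpr BlockStoreAlgorithm store_algorithm = BLOCK_STORE_WARP_TRANSPOSE;
};

// Primary template: assume the tuning is missing something.
template <class Tuning, class = void>
struct has_all_fields : std::false_type
{};

// Chosen only if every decltype below is well-formed; a missing member is a
// substitution failure, which silently falls back to the primary template.
template <class Tuning>
struct has_all_fields<Tuning,
                      void_t<decltype(Tuning::items),
                             decltype(Tuning::load_algorithm),
                             decltype(Tuning::store_algorithm)>> : std::true_type
{};

template <class Tuning>
using select_tuning =
  typename std::conditional<has_all_fields<Tuning>::value, Tuning, default_tuning>::type;

// Missing members: no compile error, just a silent fallback to the default.
static_assert(std::is_same<select_tuning<incomplete_sm90_tuning>, default_tuning>::value,
              "incomplete tuning falls back to the default");
// All members present: the architecture-specific tuning is kept.
static_assert(std::is_same<select_tuning<complete_sm90_tuning>, complete_sm90_tuning>::value,
              "complete tuning is selected");
```

The checks run at compile time, so compiling the file is enough to exercise them. The point of the sketch is that a dropped member produces no diagnostic, which is why a regression of this kind shows up only as a performance change.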

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested review from a team as code owners January 2, 2025 21:15
@gevtushenko gevtushenko enabled auto-merge (squash) January 2, 2025 21:27
github-actions bot commented Jan 2, 2025

🟩 CI finished in 1h 33m: Pass: 100%/96 | Total: 2d 15h | Avg: 39m 30s | Max: 1h 13m | Hits: 72%/12404
  • 🟩 cub: Pass: 100%/47 | Total: 1d 14h | Avg: 49m 34s | Max: 1h 13m | Hits: 60%/3144

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 12h | Avg: 49m 19s | Max:  1h 13m | Hits:  60%/3144  
      🟩 arm64              Pass: 100%/2   | Total:  1h 50m | Avg: 55m 24s | Max: 56m 46s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  5h 36m | Avg: 48m 03s | Max: 56m 47s | Hits:  60%/786   
      🟩 12.5               Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 07m
      🟩 12.6               Pass: 100%/38  | Total:  1d 07h | Avg: 49m 01s | Max:  1h 13m | Hits:  60%/2358  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 02m
      🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 36m | Avg: 48m 03s | Max: 56m 47s | Hits:  60%/786   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 07m
      🟩 nvcc12.6           Pass: 100%/36  | Total:  1d 05h | Avg: 48m 22s | Max:  1h 13m | Hits:  60%/2358  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 02m
      🟩 nvcc               Pass: 100%/45  | Total:  1d 12h | Avg: 49m 04s | Max:  1h 13m | Hits:  60%/3144  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 28m | Avg: 52m 10s | Max: 58m 20s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 08s | Avg: 54m 08s | Max: 54m 08s
      🟩 Clang11            Pass: 100%/1   | Total: 56m 16s | Avg: 56m 16s | Max: 56m 16s
      🟩 Clang12            Pass: 100%/1   | Total: 56m 58s | Avg: 56m 58s | Max: 56m 58s
      🟩 Clang13            Pass: 100%/1   | Total: 57m 37s | Avg: 57m 37s | Max: 57m 37s
      🟩 Clang14            Pass: 100%/1   | Total: 58m 39s | Avg: 58m 39s | Max: 58m 39s
      🟩 Clang15            Pass: 100%/1   | Total: 54m 16s | Avg: 54m 16s | Max: 54m 16s
      🟩 Clang16            Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m
      🟩 Clang17            Pass: 100%/1   | Total: 50m 55s | Avg: 50m 55s | Max: 50m 55s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 35m | Avg: 47m 58s | Max:  1h 02m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 29m | Avg: 44m 43s | Max: 45m 40s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 52m | Avg: 56m 07s | Max: 56m 16s
      🟩 GCC8               Pass: 100%/1   | Total: 53m 14s | Avg: 53m 14s | Max: 53m 14s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 26m | Avg: 48m 52s | Max: 51m 04s
      🟩 GCC10              Pass: 100%/1   | Total: 51m 53s | Avg: 51m 53s | Max: 51m 53s
      🟩 GCC11              Pass: 100%/1   | Total:  1h 13m | Avg:  1h 13m | Max:  1h 13m
      🟩 GCC12              Pass: 100%/3   | Total:  1h 41m | Avg: 33m 44s | Max: 58m 33s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 36m | Avg: 34m 34s | Max: 57m 48s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 55m 58s | Avg: 55m 58s | Max: 55m 58s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 56m 47s | Avg: 56m 47s | Max: 56m 47s | Hits:  60%/786   
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  60%/786   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m | Hits:  60%/1572  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 07m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 16h 35m | Avg: 52m 23s | Max:  1h 02m
      🟩 GCC                Pass: 100%/21  | Total: 15h 04m | Avg: 43m 03s | Max:  1h 13m
      🟩 Intel              Pass: 100%/1   | Total: 55m 58s | Avg: 55m 58s | Max: 55m 58s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 03m | Avg:  1h 00m | Max:  1h 03m | Hits:  60%/3144  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 07m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 42m 39s | Avg: 21m 19s | Max: 26m 51s
      🟩 v100               Pass: 100%/45  | Total:  1d 14h | Avg: 50m 50s | Max:  1h 13m | Hits:  60%/3144  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 12h | Avg: 54m 29s | Max:  1h 13m | Hits:  60%/3144  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 13s | Avg: 19m 13s | Max: 19m 13s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 22s | Avg: 15m 22s | Max: 15m 22s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 55s | Max: 27m 37s
      🟩 TestGPU            Pass: 100%/2   | Total: 50m 31s | Avg: 25m 15s | Max: 29m 01s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 42m 39s | Avg: 21m 19s | Max: 26m 51s
      🟩 90a                Pass: 100%/1   | Total: 24m 58s | Avg: 24m 58s | Max: 24m 58s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  4h 12m | Avg: 50m 29s | Max: 56m 16s
      🟩 14                 Pass: 100%/4   | Total:  3h 36m | Avg: 54m 11s | Max: 58m 20s | Hits:  60%/786   
      🟩 17                 Pass: 100%/12  | Total: 11h 12m | Avg: 56m 03s | Max:  1h 03m | Hits:  60%/1572  
      🟩 20                 Pass: 100%/26  | Total: 19h 48m | Avg: 45m 42s | Max:  1h 13m | Hits:  60%/786   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 23h 36m | Avg: 30m 48s | Max: 1h 03m | Hits: 76%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 42m 25s | Avg: 21m 12s | Max: 30m 53s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 22h 35m | Avg: 30m 49s | Max:  1h 03m | Hits:  76%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 29s | Max: 33m 04s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  3h 27m | Avg: 29m 38s | Max: 53m 21s | Hits:  71%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 38m | Avg: 49m 29s | Max: 49m 39s
      🟩 12.6               Pass: 100%/37  | Total: 18h 30m | Avg: 30m 00s | Max:  1h 03m | Hits:  78%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 55m 19s | Avg: 27m 39s | Max: 28m 51s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 27m | Avg: 29m 38s | Max: 53m 21s | Hits:  71%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 38m | Avg: 49m 29s | Max: 49m 39s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 17h 35m | Avg: 30m 08s | Max:  1h 03m | Hits:  78%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 55m 19s | Avg: 27m 39s | Max: 28m 51s
      🟩 nvcc               Pass: 100%/44  | Total: 22h 41m | Avg: 30m 56s | Max:  1h 03m | Hits:  76%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 46m | Avg: 26m 41s | Max: 33m 08s
      🟩 Clang10            Pass: 100%/1   | Total: 32m 30s | Avg: 32m 30s | Max: 32m 30s
      🟩 Clang11            Pass: 100%/1   | Total: 29m 37s | Avg: 29m 37s | Max: 29m 37s
      🟩 Clang12            Pass: 100%/1   | Total: 31m 30s | Avg: 31m 30s | Max: 31m 30s
      🟩 Clang13            Pass: 100%/1   | Total: 30m 59s | Avg: 30m 59s | Max: 30m 59s
      🟩 Clang14            Pass: 100%/1   | Total: 29m 59s | Avg: 29m 59s | Max: 29m 59s
      🟩 Clang15            Pass: 100%/1   | Total: 31m 44s | Avg: 31m 44s | Max: 31m 44s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 43s | Avg: 31m 43s | Max: 31m 43s
      🟩 Clang17            Pass: 100%/1   | Total: 29m 55s | Avg: 29m 55s | Max: 29m 55s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 44m | Avg: 23m 29s | Max: 33m 32s
      🟩 GCC6               Pass: 100%/2   | Total: 48m 23s | Avg: 24m 11s | Max: 26m 32s
      🟩 GCC7               Pass: 100%/2   | Total: 55m 20s | Avg: 27m 40s | Max: 32m 27s
      🟩 GCC8               Pass: 100%/1   | Total: 30m 36s | Avg: 30m 36s | Max: 30m 36s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 29m | Avg: 29m 48s | Max: 33m 15s
      🟩 GCC10              Pass: 100%/1   | Total: 31m 27s | Avg: 31m 27s | Max: 31m 27s
      🟩 GCC11              Pass: 100%/1   | Total: 36m 07s | Avg: 36m 07s | Max: 36m 07s
      🟩 GCC12              Pass: 100%/1   | Total: 33m 27s | Avg: 33m 27s | Max: 33m 27s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 05m | Avg: 23m 09s | Max: 34m 21s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 38m 49s | Avg: 38m 49s | Max: 38m 49s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 53m 21s | Avg: 53m 21s | Max: 53m 21s | Hits:  71%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 50m 59s | Avg: 50m 59s | Max: 50m 59s | Hits:  71%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 25m | Avg: 48m 33s | Max:  1h 03m | Hits:  80%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 38m | Avg: 49m 29s | Max: 49m 39s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  8h 39m | Avg: 27m 19s | Max: 33m 32s
      🟩 GCC                Pass: 100%/19  | Total:  8h 30m | Avg: 26m 50s | Max: 36m 07s
      🟩 Intel              Pass: 100%/1   | Total: 38m 49s | Avg: 38m 49s | Max: 38m 49s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 10m | Avg: 50m 00s | Max:  1h 03m | Hits:  76%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 38m | Avg: 49m 29s | Max: 49m 39s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 23h 36m | Avg: 30m 48s | Max:  1h 03m | Hits:  76%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 22h 24m | Avg: 33m 36s | Max:  1h 03m | Hits:  71%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 37m 09s | Avg: 12m 23s | Max: 22m 04s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 35m 16s | Avg: 11m 45s | Max: 13m 17s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 22m 44s | Avg: 22m 44s | Max: 22m 44s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 55m | Avg: 23m 06s | Max: 25m 03s
      🟩 14                 Pass: 100%/4   | Total:  2h 25m | Avg: 36m 22s | Max: 53m 21s | Hits:  71%/1852  
      🟩 17                 Pass: 100%/12  | Total:  7h 24m | Avg: 37m 04s | Max:  1h 00m | Hits:  71%/3704  
      🟩 20                 Pass: 100%/23  | Total: 11h 08m | Avg: 29m 04s | Max:  1h 03m | Hits:  85%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 59s | Avg: 4m 59s | Max: 7m 40s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 59s | Avg:  4m 59s | Max:  7m 40s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 19s | Avg:  2m 19s | Max:  2m 19s
      🟩 Test               Pass: 100%/1   | Total:  7m 40s | Avg:  7m 40s | Max:  7m 40s
    
  • 🟩 python: Pass: 100%/1 | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 36m 03s | Avg: 36m 03s | Max: 36m 03s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
71 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@gevtushenko gevtushenko merged commit b57e065 into NVIDIA:main Jan 2, 2025
118 checks passed
Comment on lines +117 to +120
static constexpr BlockLoadAlgorithm load_algorithm =
(sizeof(AccumT) > 128) ? BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED : BLOCK_LOAD_WARP_TRANSPOSE;
static constexpr BlockStoreAlgorithm store_algorithm =
(sizeof(AccumT) > 128) ? BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED : BLOCK_STORE_WARP_TRANSPOSE;
While fixing the issue at hand, this duplicates the logic from several lines below: https://github.com/NVIDIA/cccl/pull/3236/files#diff-d0a57aa3bf737e06d3f9f37bc80ea090ddf53e25f882ed3b99858ce26e785617R235-R238. I will file a refactoring.
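
One possible shape for removing the duplication mentioned in this comment is sketched below. It is only an assumption about how the shared logic could be factored out, not the refactoring that was filed: the enumerators are local stand-ins, and `warp_transpose_algorithms`, `sm90_tuning_sketch`, `big_accum`, and the `threads` and `items` values are hypothetical.

```cpp
#include <cstddef>

// Local stand-ins for illustration only; these are not the CUB definitions.
enum BlockLoadAlgorithm { BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED };
enum BlockStoreAlgorithm { BLOCK_STORE_WARP_TRANSPOSE, BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED };

// Hypothetical helper: the size-based choice lives in exactly one place.
template <class AccumT>
struct warp_transpose_algorithms
{
  static constexpr bool large_values = sizeof(AccumT) > 128;

  static constexpr BlockLoadAlgorithm load_algorithm =
    large_values ? BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED : BLOCK_LOAD_WARP_TRANSPOSE;
  static constexpr BlockStoreAlgorithm store_algorithm =
    large_values ? BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED : BLOCK_STORE_WARP_TRANSPOSE;
};

// Hypothetical tuning that inherits both members instead of repeating them.
template <class AccumT>
struct sm90_tuning_sketch : warp_transpose_algorithms<AccumT>
{
  static constexpr int threads = 128; // illustrative value
  static constexpr int items   = 24;  // illustrative value
};

// Hypothetical accumulator type larger than 128 bytes.
struct big_accum
{
  char bytes[256];
};

static_assert(sm90_tuning_sketch<int>::load_algorithm == BLOCK_LOAD_WARP_TRANSPOSE,
              "small accumulators use the plain warp-transpose load");
static_assert(sm90_tuning_sketch<big_accum>::store_algorithm == BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED,
              "large accumulators use the timesliced store");
```

Because the members are inherited, a detection check on `Tuning::load_algorithm` and `Tuning::store_algorithm` would still find them, so every specialization stays complete without repeating the conditional.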

@bernhardmgruber
This PR fixes NVBug 5022428.

5 participants