Enable PDL for DeviceMergeSortBlockSortKernel #3199

Merged 1 commit on Dec 20, 2024

@bernhardmgruber commented Dec 19, 2024

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC, but PDL (programmatic dependent launch) was not enabled when launching it. This was missed in #3114.

Running cub.bench.merge_sort.keys before and after this change shows no difference in performance. This is expected: the benchmark runs no kernel before merge sort, so there is nothing for the block sort kernel to overlap with.
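
For readers unfamiliar with PDL, here is a minimal sketch of the two halves that have to match: the dependent kernel waits on its predecessor with cudaGridDependencySynchronize(), and the host opts in at launch time through a launch attribute. This is not the CUB code touched by this PR. The kernel names producer and consumer and the helper run_pdl_demo are hypothetical, and it is an assumption that _CCCL_PDL_GRID_DEPENDENCY_SYNC wraps the same device call. The sketch assumes CUDA 11.8 or newer and only requests PDL on devices of compute capability 9.0 or higher.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical first kernel: produces the data the second kernel reads.
__global__ void producer(int* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = i;
}

// Hypothetical dependent kernel. With PDL enabled it may start before the
// producer has finished, so it must wait before touching the producer's output.
__global__ void consumer(const int* in, int* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x; // independent setup work
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  cudaGridDependencySynchronize(); // wait for the preceding grid's results
#endif
  if (i < n) out[i] = in[i] * 2;
}

// Launches producer, then consumer with the PDL attribute; returns true if
// the results are correct. All names here are illustrative only.
bool run_pdl_demo(int n)
{
  assert(n > 0);
  int device = 0;
  cudaDeviceProp prop{};
  if (cudaGetDevice(&device) != cudaSuccess) return false;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

  int* d_in = nullptr;
  int* d_out = nullptr;
  cudaStream_t stream = nullptr;
  bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess
         && cudaMalloc(&d_out, n * sizeof(int)) == cudaSuccess
         && cudaStreamCreate(&stream) == cudaSuccess;

  const int threads = 256;
  const int blocks  = (n + threads - 1) / threads;
  std::vector<int> h_out(n, -1);

  if (ok)
  {
    producer<<<blocks, threads, 0, stream>>>(d_in, n);

    // Opt in to programmatic dependent launch for the second kernel.
    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t config{};
    config.gridDim          = dim3(blocks);
    config.blockDim         = dim3(threads);
    config.dynamicSmemBytes = 0;
    config.stream           = stream;
    config.attrs            = &attr;
    config.numAttrs         = (prop.major >= 9) ? 1 : 0; // plain launch otherwise

    ok = cudaLaunchKernelEx(&config, consumer,
                            static_cast<const int*>(d_in), d_out, n) == cudaSuccess
      && cudaStreamSynchronize(stream) == cudaSuccess
      && cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                    cudaMemcpyDeviceToHost) == cudaSuccess;
  }

  for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 2 * i);

  if (stream) cudaStreamDestroy(stream);
  cudaFree(d_out);
  cudaFree(d_in);
  return ok;
}
```

Calling run_pdl_demo(1 << 16) from a main function exercises both launches. If the attribute is left out, the in-kernel synchronization call has nothing to wait on and the launch behaves like an ordinary stream-ordered one, which matches the situation this PR describes: the sync was in the kernel, but the launch never asked for PDL.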


🟩 CI finished in 1h 09m: Pass: 100%/96 | Total: 1d 15h | Avg: 24m 56s | Max: 51m 47s | Hits: 94%/12384
  • 🟩 cub: Pass: 100%/47 | Total: 1d 04h | Avg: 36m 05s | Max: 51m 47s | Hits: 92%/3124

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 02h | Avg: 35m 45s | Max: 51m 47s | Hits:  92%/3124  
      🟩 arm64              Pass: 100%/2   | Total:  1h 27m | Avg: 43m 39s | Max: 44m 34s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  4h 06m | Avg: 35m 09s | Max: 42m 56s | Hits:  92%/781   
      🟩 12.5               Pass: 100%/2   | Total:  1h 32m | Avg: 46m 25s | Max: 47m 11s
      🟩 12.6               Pass: 100%/38  | Total: 22h 37m | Avg: 35m 43s | Max: 51m 47s | Hits:  92%/2343  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 42m | Avg: 51m 29s | Max: 51m 47s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  4h 06m | Avg: 35m 09s | Max: 42m 56s | Hits:  92%/781   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 32m | Avg: 46m 25s | Max: 47m 11s
      🟩 nvcc12.6           Pass: 100%/36  | Total: 20h 54m | Avg: 34m 50s | Max: 50m 37s | Hits:  92%/2343  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 29s | Max: 51m 47s
      🟩 nvcc               Pass: 100%/45  | Total:  1d 02h | Avg: 35m 24s | Max: 50m 37s | Hits:  92%/3124  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  2h 29m | Avg: 37m 17s | Max: 42m 28s
      🟩 Clang10            Pass: 100%/1   | Total: 42m 10s | Avg: 42m 10s | Max: 42m 10s
      🟩 Clang11            Pass: 100%/1   | Total: 39m 23s | Avg: 39m 23s | Max: 39m 23s
      🟩 Clang12            Pass: 100%/1   | Total: 37m 01s | Avg: 37m 01s | Max: 37m 01s
      🟩 Clang13            Pass: 100%/1   | Total: 39m 25s | Avg: 39m 25s | Max: 39m 25s
      🟩 Clang14            Pass: 100%/1   | Total: 38m 14s | Avg: 38m 14s | Max: 38m 14s
      🟩 Clang15            Pass: 100%/1   | Total: 39m 03s | Avg: 39m 03s | Max: 39m 03s
      🟩 Clang16            Pass: 100%/1   | Total: 38m 55s | Avg: 38m 55s | Max: 38m 55s
      🟩 Clang17            Pass: 100%/1   | Total: 38m 27s | Avg: 38m 27s | Max: 38m 27s
      🟩 Clang18            Pass: 100%/7   | Total:  4h 21m | Avg: 37m 23s | Max: 51m 47s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 05m | Avg: 32m 48s | Max: 33m 01s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 17m | Avg: 38m 33s | Max: 40m 28s
      🟩 GCC8               Pass: 100%/1   | Total: 36m 26s | Avg: 36m 26s | Max: 36m 26s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 49m | Avg: 36m 31s | Max: 39m 19s
      🟩 GCC10              Pass: 100%/1   | Total: 36m 50s | Avg: 36m 50s | Max: 36m 50s
      🟩 GCC11              Pass: 100%/1   | Total: 37m 17s | Avg: 37m 17s | Max: 37m 17s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 12m | Avg: 24m 17s | Max: 41m 25s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 29m | Avg: 26m 13s | Max: 44m 34s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 44m 40s | Avg: 44m 40s | Max: 44m 40s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 42m 56s | Avg: 42m 56s | Max: 42m 56s | Hits:  92%/781   
      🟩 MSVC14.29          Pass: 100%/1   | Total: 46m 47s | Avg: 46m 47s | Max: 46m 47s | Hits:  92%/781   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 40m | Avg: 50m 08s | Max: 50m 37s | Hits:  92%/1562  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 32m | Avg: 46m 25s | Max: 47m 11s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 12h 03m | Avg: 38m 04s | Max: 51m 47s
      🟩 GCC                Pass: 100%/21  | Total: 10h 45m | Avg: 30m 44s | Max: 44m 34s
      🟩 Intel              Pass: 100%/1   | Total: 44m 40s | Avg: 44m 40s | Max: 44m 40s
      🟩 MSVC               Pass: 100%/4   | Total:  3h 10m | Avg: 47m 30s | Max: 50m 37s | Hits:  92%/3124  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 32m | Avg: 46m 25s | Max: 47m 11s
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 31m 27s | Avg: 15m 43s | Max: 16m 06s
      🟩 v100               Pass: 100%/45  | Total:  1d 03h | Avg: 37m 00s | Max: 51m 47s | Hits:  92%/3124  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 02h | Avg: 39m 16s | Max: 51m 47s | Hits:  92%/3124  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 16m 33s | Avg: 16m 33s | Max: 16m 33s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 10s | Avg: 15m 10s | Max: 15m 10s
      🟩 HostLaunch         Pass: 100%/3   | Total: 52m 34s | Avg: 17m 31s | Max: 18m 52s
      🟩 TestGPU            Pass: 100%/2   | Total: 40m 54s | Avg: 20m 27s | Max: 20m 42s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 31m 27s | Avg: 15m 43s | Max: 16m 06s
      🟩 90a                Pass: 100%/1   | Total: 15m 24s | Avg: 15m 24s | Max: 15m 24s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  3h 00m | Avg: 36m 06s | Max: 40m 28s
      🟩 14                 Pass: 100%/4   | Total:  2h 34m | Avg: 38m 39s | Max: 42m 56s | Hits:  92%/781   
      🟩 17                 Pass: 100%/12  | Total:  8h 22m | Avg: 41m 50s | Max: 51m 12s | Hits:  92%/1562  
      🟩 20                 Pass: 100%/26  | Total: 14h 19m | Avg: 33m 03s | Max: 51m 47s | Hits:  92%/781   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 11h 00m | Avg: 14m 21s | Max: 28m 35s | Hits: 95%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 26m 55s | Avg: 13m 27s | Max: 14m 17s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 10h 37m | Avg: 14m 28s | Max: 28m 35s | Hits:  95%/9260  
      🟩 arm64              Pass: 100%/2   | Total: 23m 29s | Avg: 11m 44s | Max: 12m 13s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  1h 42m | Avg: 14m 41s | Max: 28m 22s | Hits:  93%/1852  
      🟩 12.5               Pass: 100%/2   | Total: 50m 22s | Avg: 25m 11s | Max: 25m 17s
      🟩 12.6               Pass: 100%/37  | Total:  8h 27m | Avg: 13m 42s | Max: 28m 35s | Hits:  95%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 22m 22s | Avg: 11m 11s | Max: 11m 33s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  1h 42m | Avg: 14m 41s | Max: 28m 22s | Hits:  93%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 50m 22s | Avg: 25m 11s | Max: 25m 17s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  8h 04m | Avg: 13m 51s | Max: 28m 35s | Hits:  95%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 22m 22s | Avg: 11m 11s | Max: 11m 33s
      🟩 nvcc               Pass: 100%/44  | Total: 10h 38m | Avg: 14m 30s | Max: 28m 35s | Hits:  95%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 48m 13s | Avg: 12m 03s | Max: 13m 31s
      🟩 Clang10            Pass: 100%/1   | Total: 13m 12s | Avg: 13m 12s | Max: 13m 12s
      🟩 Clang11            Pass: 100%/1   | Total: 11m 44s | Avg: 11m 44s | Max: 11m 44s
      🟩 Clang12            Pass: 100%/1   | Total: 11m 37s | Avg: 11m 37s | Max: 11m 37s
      🟩 Clang13            Pass: 100%/1   | Total: 12m 16s | Avg: 12m 16s | Max: 12m 16s
      🟩 Clang14            Pass: 100%/1   | Total: 11m 36s | Avg: 11m 36s | Max: 11m 36s
      🟩 Clang15            Pass: 100%/1   | Total: 13m 19s | Avg: 13m 19s | Max: 13m 19s
      🟩 Clang16            Pass: 100%/1   | Total: 12m 02s | Avg: 12m 02s | Max: 12m 02s
      🟩 Clang17            Pass: 100%/1   | Total: 11m 55s | Avg: 11m 55s | Max: 11m 55s
      🟩 Clang18            Pass: 100%/7   | Total:  1h 15m | Avg: 10m 45s | Max: 12m 25s
      🟩 GCC6               Pass: 100%/2   | Total: 25m 19s | Avg: 12m 39s | Max: 14m 45s
      🟩 GCC7               Pass: 100%/2   | Total: 23m 26s | Avg: 11m 43s | Max: 13m 01s
      🟩 GCC8               Pass: 100%/1   | Total: 11m 50s | Avg: 11m 50s | Max: 11m 50s
      🟩 GCC9               Pass: 100%/3   | Total: 37m 58s | Avg: 12m 39s | Max: 15m 55s
      🟩 GCC10              Pass: 100%/1   | Total: 12m 05s | Avg: 12m 05s | Max: 12m 05s
      🟩 GCC11              Pass: 100%/1   | Total: 12m 17s | Avg: 12m 17s | Max: 12m 17s
      🟩 GCC12              Pass: 100%/1   | Total: 12m 57s | Avg: 12m 57s | Max: 12m 57s
      🟩 GCC13              Pass: 100%/8   | Total:  1h 45m | Avg: 13m 09s | Max: 16m 17s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 15m 18s | Avg: 15m 18s | Max: 15m 18s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 28m 22s | Avg: 28m 22s | Max: 28m 22s | Hits:  93%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 24m 55s | Avg: 24m 55s | Max: 24m 55s | Hits:  93%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  1h 19m | Avg: 26m 23s | Max: 28m 35s | Hits:  95%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 50m 22s | Avg: 25m 11s | Max: 25m 17s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  3h 41m | Avg: 11m 38s | Max: 13m 31s
      🟩 GCC                Pass: 100%/19  | Total:  4h 01m | Avg: 12m 41s | Max: 16m 17s
      🟩 Intel              Pass: 100%/1   | Total: 15m 18s | Avg: 15m 18s | Max: 15m 18s
      🟩 MSVC               Pass: 100%/5   | Total:  2h 12m | Avg: 26m 29s | Max: 28m 35s | Hits:  95%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total: 50m 22s | Avg: 25m 11s | Max: 25m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total: 11h 00m | Avg: 14m 21s | Max: 28m 35s | Hits:  95%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  9h 41m | Avg: 14m 31s | Max: 28m 35s | Hits:  93%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 37m 56s | Avg: 12m 38s | Max: 22m 41s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 41m 15s | Avg: 13m 45s | Max: 16m 17s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 14m 00s | Avg: 14m 00s | Max: 14m 00s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total: 53m 25s | Avg: 10m 41s | Max: 11m 20s
      🟩 14                 Pass: 100%/4   | Total:  1h 09m | Avg: 17m 24s | Max: 28m 22s | Hits:  93%/1852  
      🟩 17                 Pass: 100%/12  | Total:  3h 16m | Avg: 16m 21s | Max: 28m 35s | Hits:  93%/3704  
      🟩 20                 Pass: 100%/23  | Total:  5h 14m | Avg: 13m 39s | Max: 27m 54s | Hits:  96%/3704  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 52s | Avg: 4m 26s | Max: 6m 50s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 52s | Avg:  4m 26s | Max:  6m 50s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 02s | Avg:  2m 02s | Max:  2m 02s
      🟩 Test               Pass: 100%/1   | Total:  6m 50s | Avg:  6m 50s | Max:  6m 50s
    
  • 🟩 python: Pass: 100%/1 | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 28m 06s | Avg: 28m 06s | Max: 28m 06s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 96)

# Runner
71 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@bernhardmgruber added the labels 2.8.0 (target for 2.8.0 release) and cub (For all items related to CUB) on Dec 19, 2024
@elstehle merged commit f629f07 into NVIDIA:main on Dec 20, 2024
115 checks passed
@bernhardmgruber deleted the pdl branch on December 28, 2024 00:55