PTX shfl_sync #3241

Open, wants to merge 2 commits into base: main

@fbusato (Contributor) commented Jan 4, 2025

Related to #2976

Description

Provide C++ implementation of PTX shfl_sync.

Compared with the CUDA intrinsics, the function provides the following additional features:

  • Returns the "lane predicate", which can be used in subsequent operations instead of checking lane validity manually.
  • Performs basic input checks.
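
To illustrate what the lane predicate reports, here is a minimal host-side C++ model of the source-lane selection that the PTX ISA describes for shfl.sync. It is only a sketch, not the code from this PR: `ShflMode`, `ShflResult`, and `model_shfl` are hypothetical names, and the model assumes a full 32-lane warp, ignores the member mask, and expects `lane` to be in the range 0 to 31.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not the API proposed in this PR.
enum class ShflMode { up, down, bfly, idx };

struct ShflResult {
  std::uint32_t data; // value read from the selected source lane
  bool pred;          // lane predicate: true if the source lane was in range
};

// Models what a single lane observes. `b` is the lane offset or index;
// `c` packs the clamp value (bits 4:0) and the segment mask (bits 12:8).
inline ShflResult model_shfl(const std::array<std::uint32_t, 32>& warp,
                             int lane, ShflMode mode, int b, int c) {
  const int bval     = b & 0x1f;
  const int cval     = c & 0x1f;
  const int segmask  = (c >> 8) & 0x1f;
  const int max_lane = (lane & segmask) | (cval & ~segmask);
  const int min_lane = lane & segmask;
  int j = lane;
  bool pred = false;
  switch (mode) {
    case ShflMode::up:   j = lane - bval; pred = (j >= max_lane); break;
    case ShflMode::down: j = lane + bval; pred = (j <= max_lane); break;
    case ShflMode::bfly: j = lane ^ bval; pred = (j <= max_lane); break;
    case ShflMode::idx:  j = min_lane | (bval & ~segmask);
                         pred = (j <= max_lane); break;
  }
  if (!pred) {
    j = lane; // out of range: the lane reads its own value back
  }
  return ShflResult{warp[static_cast<std::size_t>(j)], pred};
}
```

In this model, a caller that shuffles down by one with `c = 0x1f` gets `pred == false` on lane 31 and can skip the combine step there, rather than comparing lane indices by hand.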

@fbusato fbusato requested review from a team as code owners January 4, 2025 00:46
@fbusato fbusato requested review from miscco and wmaxey January 4, 2025 00:46
github-actions bot (Contributor) commented Jan 4, 2025

🟩 CI finished in 1h 49m: Pass: 100%/170 | Total: 3d 02h | Avg: 26m 12s | Max: 1h 08m | Hits: 76%/22526
  • 🟩 libcudacxx: Pass: 100%/48 | Total: 7h 21m | Avg: 9m 11s | Max: 32m 13s | Hits: 84%/9822

    🟩 cpu
      🟩 amd64              Pass: 100%/46  | Total:  7h 13m | Avg:  9m 26s | Max: 32m 13s | Hits:  84%/9822  
      🟩 arm64              Pass: 100%/2   | Total:  7m 36s | Avg:  3m 48s | Max:  3m 51s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total: 38m 54s | Avg:  5m 33s | Max: 20m 16s | Hits:  96%/2241  
      🟩 12.5               Pass: 100%/2   | Total: 17m 28s | Avg:  8m 44s | Max:  9m 05s
      🟩 12.6               Pass: 100%/39  | Total:  6h 25m | Avg:  9m 52s | Max: 32m 13s | Hits:  80%/7581  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 06m | Avg: 16m 43s | Max: 21m 42s
      🟩 nvcc11.1           Pass: 100%/7   | Total: 38m 54s | Avg:  5m 33s | Max: 20m 16s | Hits:  96%/2241  
      🟩 nvcc12.5           Pass: 100%/2   | Total: 17m 28s | Avg:  8m 44s | Max:  9m 05s
      🟩 nvcc12.6           Pass: 100%/35  | Total:  5h 18m | Avg:  9m 05s | Max: 32m 13s | Hits:  80%/7581  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 06m | Avg: 16m 43s | Max: 21m 42s
      🟩 nvcc               Pass: 100%/44  | Total:  6h 14m | Avg:  8m 30s | Max: 32m 13s | Hits:  84%/9822  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total: 15m 54s | Avg:  3m 58s | Max:  4m 44s
      🟩 Clang10            Pass: 100%/1   | Total:  5m 13s | Avg:  5m 13s | Max:  5m 13s
      🟩 Clang11            Pass: 100%/1   | Total:  4m 46s | Avg:  4m 46s | Max:  4m 46s
      🟩 Clang12            Pass: 100%/1   | Total:  4m 30s | Avg:  4m 30s | Max:  4m 30s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 15s | Avg:  4m 15s | Max:  4m 15s
      🟩 Clang14            Pass: 100%/1   | Total:  4m 12s | Avg:  4m 12s | Max:  4m 12s
      🟩 Clang15            Pass: 100%/1   | Total:  4m 17s | Avg:  4m 17s | Max:  4m 17s
      🟩 Clang16            Pass: 100%/1   | Total:  4m 34s | Avg:  4m 34s | Max:  4m 34s
      🟩 Clang17            Pass: 100%/1   | Total:  4m 45s | Avg:  4m 45s | Max:  4m 45s
      🟩 Clang18            Pass: 100%/8   | Total:  1h 37m | Avg: 12m 10s | Max: 21m 42s
      🟩 GCC6               Pass: 100%/2   | Total:  5m 54s | Avg:  2m 57s | Max:  3m 04s
      🟩 GCC7               Pass: 100%/2   | Total:  6m 57s | Avg:  3m 28s | Max:  3m 29s
      🟩 GCC8               Pass: 100%/1   | Total:  3m 38s | Avg:  3m 38s | Max:  3m 38s
      🟩 GCC9               Pass: 100%/3   | Total:  9m 36s | Avg:  3m 12s | Max:  3m 46s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 55s | Avg:  3m 55s | Max:  3m 55s
      🟩 GCC11              Pass: 100%/1   | Total:  3m 55s | Avg:  3m 55s | Max:  3m 55s
      🟩 GCC12              Pass: 100%/1   | Total:  4m 09s | Avg:  4m 09s | Max:  4m 09s
      🟩 GCC13              Pass: 100%/10  | Total:  2h 37m | Avg: 15m 45s | Max: 32m 13s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  5m 46s | Avg:  5m 46s | Max:  5m 46s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 20m 16s | Avg: 20m 16s | Max: 20m 16s | Hits:  96%/2241  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 25m 52s | Avg: 25m 52s | Max: 25m 52s | Hits:  47%/2478  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 26m 42s | Avg: 13m 21s | Max: 13m 52s | Hits:  96%/5103  
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 17m 28s | Avg:  8m 44s | Max:  9m 05s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/20  | Total:  2h 29m | Avg:  7m 29s | Max: 21m 42s
      🟩 GCC                Pass: 100%/21  | Total:  3h 15m | Avg:  9m 19s | Max: 32m 13s
      🟩 Intel              Pass: 100%/1   | Total:  5m 46s | Avg:  5m 46s | Max:  5m 46s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 12m | Avg: 18m 12s | Max: 25m 52s | Hits:  84%/9822  
      🟩 NVHPC              Pass: 100%/2   | Total: 17m 28s | Avg:  8m 44s | Max:  9m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/48  | Total:  7h 21m | Avg:  9m 11s | Max: 32m 13s | Hits:  84%/9822  
    🟩 jobs
      🟩 Build              Pass: 100%/41  | Total:  4h 46m | Avg:  6m 58s | Max: 25m 52s | Hits:  84%/9822  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 52m | Avg: 28m 12s | Max: 32m 13s
      🟩 Test               Pass: 100%/2   | Total: 40m 29s | Avg: 20m 14s | Max: 22m 58s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  1m 58s | Avg:  1m 58s | Max:  1m 58s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total: 12m 22s | Avg: 12m 22s | Max: 12m 22s
      🟩 90a                Pass: 100%/2   | Total: 20m 31s | Avg: 10m 15s | Max: 12m 46s
    🟩 std
      🟩 11                 Pass: 100%/6   | Total: 37m 42s | Avg:  6m 17s | Max: 21m 11s
      🟩 14                 Pass: 100%/5   | Total:  1h 03m | Avg: 12m 45s | Max: 32m 13s | Hits:  96%/2241  
      🟩 17                 Pass: 100%/13  | Total:  2h 11m | Avg: 10m 07s | Max: 29m 55s | Hits:  72%/4956  
      🟩 20                 Pass: 100%/23  | Total:  3h 26m | Avg:  8m 58s | Max: 29m 29s | Hits:  96%/2625  
    
  • 🟩 cub: Pass: 100%/47 | Total: 1d 14h | Avg: 48m 38s | Max: 1h 03m | Hits: 66%/3132

    🟩 cpu
      🟩 amd64              Pass: 100%/45  | Total:  1d 12h | Avg: 48m 19s | Max:  1h 03m | Hits:  66%/3132  
      🟩 arm64              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 49s | Max: 56m 57s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  5h 37m | Avg: 48m 11s | Max: 53m 42s | Hits:  66%/783   
      🟩 12.5               Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m
      🟩 12.6               Pass: 100%/38  | Total:  1d 06h | Avg: 47m 59s | Max:  1h 01m | Hits:  66%/2349  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 37m | Avg: 48m 11s | Max: 53m 42s | Hits:  66%/783   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc12.6           Pass: 100%/36  | Total:  1d 04h | Avg: 47m 18s | Max:  1h 01m | Hits:  66%/2349  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 nvcc               Pass: 100%/45  | Total:  1d 12h | Avg: 48m 07s | Max:  1h 03m | Hits:  66%/3132  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 18m | Avg: 49m 36s | Max: 53m 49s
      🟩 Clang10            Pass: 100%/1   | Total: 54m 06s | Avg: 54m 06s | Max: 54m 06s
      🟩 Clang11            Pass: 100%/1   | Total: 56m 32s | Avg: 56m 32s | Max: 56m 32s
      🟩 Clang12            Pass: 100%/1   | Total: 53m 25s | Avg: 53m 25s | Max: 53m 25s
      🟩 Clang13            Pass: 100%/1   | Total: 53m 09s | Avg: 53m 09s | Max: 53m 09s
      🟩 Clang14            Pass: 100%/1   | Total: 54m 27s | Avg: 54m 27s | Max: 54m 27s
      🟩 Clang15            Pass: 100%/1   | Total: 50m 31s | Avg: 50m 31s | Max: 50m 31s
      🟩 Clang16            Pass: 100%/1   | Total: 51m 59s | Avg: 51m 59s | Max: 51m 59s
      🟩 Clang17            Pass: 100%/1   | Total: 54m 15s | Avg: 54m 15s | Max: 54m 15s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 30m | Avg: 47m 14s | Max:  1h 01m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 37m | Avg: 48m 52s | Max: 49m 15s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 48s | Max: 56m 16s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 27s | Avg: 52m 27s | Max: 52m 27s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 31m | Avg: 50m 33s | Max: 57m 25s
      🟩 GCC10              Pass: 100%/1   | Total: 57m 03s | Avg: 57m 03s | Max: 57m 03s
      🟩 GCC11              Pass: 100%/1   | Total: 55m 58s | Avg: 55m 58s | Max: 55m 58s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 36m | Avg: 32m 02s | Max: 58m 12s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 47m | Avg: 35m 59s | Max: 57m 27s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 53m 42s | Avg: 53m 42s | Max: 53m 42s | Hits:  66%/783   
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m | Hits:  66%/783   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  66%/1566  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 15h 57m | Avg: 50m 23s | Max:  1h 01m
      🟩 GCC                Pass: 100%/21  | Total: 15h 06m | Avg: 43m 10s | Max: 58m 12s
      🟩 Intel              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 56m | Avg: 59m 01s | Max:  1h 01m | Hits:  66%/3132  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 04m | Avg:  1h 02m | Max:  1h 03m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 37m 56s | Avg: 18m 58s | Max: 22m 02s
      🟩 v100               Pass: 100%/45  | Total:  1d 13h | Avg: 49m 57s | Max:  1h 03m | Hits:  66%/3132  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 11h | Avg: 53m 01s | Max:  1h 03m | Hits:  66%/3132  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 34s | Avg: 18m 34s | Max: 18m 34s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 49s | Avg: 19m 49s | Max: 19m 49s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 15m | Avg: 25m 11s | Max: 29m 59s
      🟩 TestGPU            Pass: 100%/2   | Total: 51m 21s | Avg: 25m 40s | Max: 28m 12s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 37m 56s | Avg: 18m 58s | Max: 22m 02s
      🟩 90a                Pass: 100%/1   | Total: 24m 57s | Avg: 24m 57s | Max: 24m 57s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  4h 04m | Avg: 48m 54s | Max: 52m 57s
      🟩 14                 Pass: 100%/4   | Total:  3h 33m | Avg: 53m 15s | Max: 56m 16s | Hits:  66%/783   
      🟩 17                 Pass: 100%/12  | Total: 11h 07m | Avg: 55m 39s | Max:  1h 01m | Hits:  66%/1566  
      🟩 20                 Pass: 100%/26  | Total: 19h 20m | Avg: 44m 38s | Max:  1h 03m | Hits:  66%/783   
    
  • 🟩 thrust: Pass: 100%/46 | Total: 1d 01h | Avg: 33m 32s | Max: 1h 08m | Hits: 70%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total:  1h 04m | Avg: 32m 29s | Max: 34m 41s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total:  1d 00h | Avg: 33m 27s | Max:  1h 08m | Hits:  70%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  1h 10m | Avg: 35m 27s | Max: 38m 47s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  3h 36m | Avg: 30m 57s | Max: 53m 10s | Hits:  63%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 50m | Avg: 55m 04s | Max: 55m 24s
      🟩 12.6               Pass: 100%/37  | Total: 20h 16m | Avg: 32m 52s | Max:  1h 08m | Hits:  72%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 01m | Avg: 30m 42s | Max: 32m 31s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 36m | Avg: 30m 57s | Max: 53m 10s | Hits:  63%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 50m | Avg: 55m 04s | Max: 55m 24s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 19h 14m | Avg: 32m 59s | Max:  1h 08m | Hits:  72%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 01m | Avg: 30m 42s | Max: 32m 31s
      🟩 nvcc               Pass: 100%/44  | Total:  1d 00h | Avg: 33m 40s | Max:  1h 08m | Hits:  70%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 50m | Avg: 27m 39s | Max: 31m 11s
      🟩 Clang10            Pass: 100%/1   | Total: 37m 13s | Avg: 37m 13s | Max: 37m 13s
      🟩 Clang11            Pass: 100%/1   | Total: 34m 37s | Avg: 34m 37s | Max: 34m 37s
      🟩 Clang12            Pass: 100%/1   | Total: 30m 30s | Avg: 30m 30s | Max: 30m 30s
      🟩 Clang13            Pass: 100%/1   | Total: 35m 16s | Avg: 35m 16s | Max: 35m 16s
      🟩 Clang14            Pass: 100%/1   | Total: 31m 19s | Avg: 31m 19s | Max: 31m 19s
      🟩 Clang15            Pass: 100%/1   | Total: 34m 35s | Avg: 34m 35s | Max: 34m 35s
      🟩 Clang16            Pass: 100%/1   | Total: 31m 27s | Avg: 31m 27s | Max: 31m 27s
      🟩 Clang17            Pass: 100%/1   | Total: 33m 26s | Avg: 33m 26s | Max: 33m 26s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 58m | Avg: 25m 29s | Max: 35m 30s
      🟩 GCC6               Pass: 100%/2   | Total: 55m 00s | Avg: 27m 30s | Max: 29m 50s
      🟩 GCC7               Pass: 100%/2   | Total: 58m 35s | Avg: 29m 17s | Max: 31m 18s
      🟩 GCC8               Pass: 100%/1   | Total: 36m 02s | Avg: 36m 02s | Max: 36m 02s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 30m | Avg: 30m 12s | Max: 34m 33s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 31s | Avg: 32m 31s | Max: 32m 31s
      🟩 GCC11              Pass: 100%/1   | Total: 35m 26s | Avg: 35m 26s | Max: 35m 26s
      🟩 GCC12              Pass: 100%/1   | Total: 36m 35s | Avg: 36m 35s | Max: 36m 35s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 39m | Avg: 27m 23s | Max: 38m 47s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 43m 09s | Avg: 43m 09s | Max: 43m 09s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 53m 10s | Avg: 53m 10s | Max: 53m 10s | Hits:  63%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 03m | Avg:  1h 03m | Max:  1h 03m | Hits:  63%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 31m | Avg: 50m 39s | Max:  1h 08m | Hits:  75%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 50m | Avg: 55m 04s | Max: 55m 24s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  9h 17m | Avg: 29m 20s | Max: 37m 13s
      🟩 GCC                Pass: 100%/19  | Total:  9h 23m | Avg: 29m 40s | Max: 38m 47s
      🟩 Intel              Pass: 100%/1   | Total: 43m 09s | Avg: 43m 09s | Max: 43m 09s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 28m | Avg: 53m 40s | Max:  1h 08m | Hits:  70%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 50m | Avg: 55m 04s | Max: 55m 24s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  1d 01h | Avg: 33m 32s | Max:  1h 08m | Hits:  70%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total:  1d 00h | Avg: 36m 08s | Max:  1h 08m | Hits:  63%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 38m 21s | Avg: 12m 47s | Max: 22m 40s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total: 58m 44s | Avg: 19m 34s | Max: 34m 41s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 20m 37s | Avg: 20m 37s | Max: 20m 37s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  2h 07m | Avg: 25m 30s | Max: 29m 09s
      🟩 14                 Pass: 100%/4   | Total:  2h 25m | Avg: 36m 22s | Max: 53m 10s | Hits:  63%/1852  
      🟩 17                 Pass: 100%/12  | Total:  8h 06m | Avg: 40m 31s | Max:  1h 03m | Hits:  63%/3704  
      🟩 20                 Pass: 100%/23  | Total: 11h 58m | Avg: 31m 15s | Max:  1h 08m | Hits:  81%/3704  
    
  • 🟩 cudax: Pass: 100%/26 | Total: 2h 24m | Avg: 5m 34s | Max: 21m 56s | Hits: 90%/312

    🟩 cpu
      🟩 amd64              Pass: 100%/22  | Total:  2h 10m | Avg:  5m 56s | Max: 21m 56s | Hits:  90%/312   
      🟩 arm64              Pass: 100%/4   | Total: 14m 02s | Avg:  3m 30s | Max:  3m 39s
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 18m 02s | Avg:  6m 00s | Max: 10m 59s | Hits:  90%/156   
      🟩 12.5               Pass: 100%/2   | Total: 11m 26s | Avg:  5m 43s | Max:  5m 50s
      🟩 12.6               Pass: 100%/21  | Total:  1h 55m | Avg:  5m 29s | Max: 21m 56s | Hits:  91%/156   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 18m 02s | Avg:  6m 00s | Max: 10m 59s | Hits:  90%/156   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 11m 26s | Avg:  5m 43s | Max:  5m 50s
      🟩 nvcc12.6           Pass: 100%/21  | Total:  1h 55m | Avg:  5m 29s | Max: 21m 56s | Hits:  91%/156   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/26  | Total:  2h 24m | Avg:  5m 34s | Max: 21m 56s | Hits:  90%/312   
    🟩 cxx
      🟩 Clang9             Pass: 100%/1   | Total:  3m 29s | Avg:  3m 29s | Max:  3m 29s
      🟩 Clang10            Pass: 100%/1   | Total:  4m 14s | Avg:  4m 14s | Max:  4m 14s
      🟩 Clang11            Pass: 100%/1   | Total:  3m 53s | Avg:  3m 53s | Max:  3m 53s
      🟩 Clang12            Pass: 100%/1   | Total:  3m 47s | Avg:  3m 47s | Max:  3m 47s
      🟩 Clang13            Pass: 100%/1   | Total:  4m 00s | Avg:  4m 00s | Max:  4m 00s
      🟩 Clang14            Pass: 100%/1   | Total:  3m 44s | Avg:  3m 44s | Max:  3m 44s
      🟩 Clang15            Pass: 100%/1   | Total:  3m 42s | Avg:  3m 42s | Max:  3m 42s
      🟩 Clang16            Pass: 100%/1   | Total:  3m 57s | Avg:  3m 57s | Max:  3m 57s
      🟩 Clang17            Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s
      🟩 Clang18            Pass: 100%/4   | Total: 32m 25s | Avg:  8m 06s | Max: 21m 56s
      🟩 GCC9               Pass: 100%/1   | Total:  3m 34s | Avg:  3m 34s | Max:  3m 34s
      🟩 GCC10              Pass: 100%/1   | Total:  3m 55s | Avg:  3m 55s | Max:  3m 55s
      🟩 GCC11              Pass: 100%/1   | Total:  3m 46s | Avg:  3m 46s | Max:  3m 46s
      🟩 GCC12              Pass: 100%/2   | Total: 20m 31s | Avg: 10m 15s | Max: 16m 40s
      🟩 GCC13              Pass: 100%/4   | Total: 13m 59s | Avg:  3m 29s | Max:  3m 39s
      🟩 MSVC14.36          Pass: 100%/1   | Total: 10m 59s | Avg: 10m 59s | Max: 10m 59s | Hits:  90%/156   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 29s | Avg:  9m 29s | Max:  9m 29s | Hits:  91%/156   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 11m 26s | Avg:  5m 43s | Max:  5m 50s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/13  | Total:  1h 07m | Avg:  5m 10s | Max: 21m 56s
      🟩 GCC                Pass: 100%/9   | Total: 45m 45s | Avg:  5m 05s | Max: 16m 40s
      🟩 MSVC               Pass: 100%/2   | Total: 20m 28s | Avg: 10m 14s | Max: 10m 59s | Hits:  90%/312   
      🟩 NVHPC              Pass: 100%/2   | Total: 11m 26s | Avg:  5m 43s | Max:  5m 50s
    🟩 gpu
      🟩 v100               Pass: 100%/26  | Total:  2h 24m | Avg:  5m 34s | Max: 21m 56s | Hits:  90%/312   
    🟩 jobs
      🟩 Build              Pass: 100%/24  | Total:  1h 46m | Avg:  4m 25s | Max: 10m 59s | Hits:  90%/312   
      🟩 Test               Pass: 100%/2   | Total: 38m 36s | Avg: 19m 18s | Max: 21m 56s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
      🟩 90a                Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
    🟩 std
      🟩 17                 Pass: 100%/6   | Total: 22m 57s | Avg:  3m 49s | Max:  5m 36s
      🟩 20                 Pass: 100%/20  | Total:  2h 01m | Avg:  6m 05s | Max: 21m 56s | Hits:  90%/312   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 31s | Avg: 5m 15s | Max: 8m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 31s | Avg:  5m 15s | Max:  8m 23s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 08s | Avg:  2m 08s | Max:  2m 08s
      🟩 Test               Pass: 100%/1   | Total:  8m 23s | Avg:  8m 23s | Max:  8m 23s
    
  • 🟩 python: Pass: 100%/1 | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 28m 15s | Avg: 28m 15s | Max: 28m 15s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 170)

# Runner
125 linux-amd64-cpu16
19 linux-amd64-gpu-v100-latest-1
15 windows-amd64-cpu16
10 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@gevtushenko (Collaborator) commented

@fbusato please add a sub-issue to #101 on deprecating and later dropping the shuffle facilities from CUB (util_ptx.cuh) in favor of the libcu++ ones.

@bernhardmgruber (Contributor) left a comment

I gave this a quick review. I would love to have @ahendriksen's opinion, since this touches his work on the PTX exposure, and he understands PTX far better than I do.

Comment on lines +1 to +3

shfl.sync
^^^^^^^^^

Q: Why do we need a dedicated file in a manual directory? We currently just document PTX instructions inline (with generated parts from the generated directory). I see no strong reason to deviate here.

// under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
// SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES.

Suggested change
- // SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES.
+ // SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.

Applies to other places as well.

//===----------------------------------------------------------------------===//
// UNSUPPORTED: libcpp-has-no-threads
// UNSUPPORTED: clang && !nvcc
// UNSUPPORTED: c++98, c++03, c++11, c++14

It feels a bit unsatisfactory that we provide the new feature for C++11, but can only compile and run tests in C++17 mode or higher. @miscco how do you feel about that?

@ahendriksen (Contributor) left a comment

I have sent some comments in private as well. The type of the data parameter should be a template parameter, so that any 32-bit value can be shuffled.

[[nodiscard]] __device__ static inline
shfl_return_values shfl_sync(shfl_mode_t shfl_mode,
uint32_t data,

This should be a template parameter that takes any 32-bit value.
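
One way to read this suggestion is sketched below in host-side C++11. It is not the implementation from this PR: `to_bits32`, `from_bits32`, `shuffle_any32`, and the stand-in `shuffle_bits_stub` are hypothetical names, and the stub simply returns its input where the real code would perform the 32-bit shuffle. The sketch assumes the value type is trivially copyable and default constructible.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helpers for illustration; not the code from this PR.
template <typename T>
std::uint32_t to_bits32(const T& value) {
  static_assert(sizeof(T) == 4, "T must be exactly 32 bits wide");
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits)); // well-defined bit copy
  return bits;
}

template <typename T>
T from_bits32(std::uint32_t bits) {
  static_assert(sizeof(T) == 4, "T must be exactly 32 bits wide");
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  T value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Stand-in for the 32-bit shuffle primitive (assumption): returns its input.
inline std::uint32_t shuffle_bits_stub(std::uint32_t bits) {
  return bits;
}

// Accepts any 32-bit value type and routes it through the uint32_t path.
template <typename T>
T shuffle_any32(const T& data) {
  return from_bits32<T>(shuffle_bits_stub(to_bits32(data)));
}
```

With this shape, `shuffle_any32(1.5f)` and `shuffle_any32(-7)` both compile, while passing a `double` or a `char` fails the `static_assert` at compile time instead of being silently truncated or widened.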
