-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement SYCL backend parallel_for
to improve bandwidth utilization
#1976
Open
mmichel11
wants to merge
53
commits into
main
Choose a base branch
from
dev/mmichel11/parallel_for_vectorize
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 47 commits
Commits
Show all changes
53 commits
Select commit
Hold shift + click to select a range
9764a57
Optimize memory transactions in SYCL backend parallel for
mmichel11 c836b1d
clang-format
mmichel11 55f33a4
Correct comment and error handling.
mmichel11 adadd56
__num_groups bugfix
mmichel11 71d7bcc
Introduce stride recommender for different targets and better distrib…
mmichel11 ebb3d56
Cleanup
mmichel11 2c4ecd0
Unroll loop if possible
mmichel11 dc6bd0c
Revert "Unroll loop if possible"
mmichel11 d5126b2
Use a small and large kernel in parallel for
mmichel11 6433a50
Improve __iters_per_work_item heuristic.
mmichel11 d376124
Code cleanup
mmichel11 a7c7606
Clang format
mmichel11 b8aa15c
Update comments
mmichel11 b45a7c2
Bugfix in comment
mmichel11 4f9a360
More cleanup and better handle non-full case
mmichel11 7bb1d2b
Rename __ndi to __item for consistency with codebase
mmichel11 a2ad920
Update all comments on kernel naming trick
mmichel11 47fe214
Handle non-full case in a cleaner way
mmichel11 79a18e9
Switch min tuple type utility to return size of type
mmichel11 3ab8c75
Remove unnecessary template parameter
mmichel11 4a70fe2
Make non-template function inline for ODR compliance
mmichel11 5530209
If the iters per work item is 1, then only compile the basic pfor kernel
mmichel11 90f19d4
Address several PR comments
mmichel11 1ac65b9
Remove free function __stride_recommender
mmichel11 6a5a562
Accept ranges as forwarding references in __parallel_for_large_submitter
mmichel11 357032f
Address reviewer comments
mmichel11 ca9e594
Introduce vectorized for-path for small types and parallel_backend_sy…
mmichel11 e4060f5
Improve testing and cleanup of code
mmichel11 283b053
clang format
mmichel11 75e4beb
Miscellaneous fixes identified during testing
mmichel11 7990bc1
clang-format
mmichel11 4aaa81f
Fix ordering to __vector_load call
mmichel11 65e4a68
Add support for vectorization with C++20 parallel range APIs
mmichel11 b4657a6
Add device copyable specializations for new walk patterns
mmichel11 3086dd3
Align vector_walk implementation with other vector functors
mmichel11 df17673
Add back non-spirv path
mmichel11 fd4e2c3
Further improve test coverage
mmichel11 58fd466
Restore original shift_left due to implicit implementation requiremen…
mmichel11 094124f
Fix issues in vectorized rotate
mmichel11 82135f6
Fix fpga parallel for compilation issues
mmichel11 e979118
Restore initial shift_left_right.pass.cpp
mmichel11 4bfaada
Fix test side issue when unnamed lambdas are disabled
mmichel11 8ae18db
Add a vector path specialization for std::swap_ranges
mmichel11 6cb11c7
General code cleanup
mmichel11 505bdf3
Bugfix with __pattern_swap using nanoranges
mmichel11 114924d
clang-format
mmichel11 845de21
Address applicable comments from PR #1870
mmichel11 71678d0
Refactor __lazy_ctor_storage deleter
mmichel11 8b0b18b
Address review comments
mmichel11 f7d9753
Remove intrusive test macro and adjust input sizes in test framework
mmichel11 83c5ca4
Make walk_scalar_base and walk_vector_or_scalar_base structs
mmichel11 fedd5de
Add missing max_n
mmichel11 08aa260
Add constructors for for-based bricks
mmichel11 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,8 +56,12 @@ __pattern_walk1(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt | |
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator>(); | ||
auto __buf = __keep(__first, __last); | ||
|
||
auto __view = __buf.all_view(); | ||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, __exec, unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view()) | ||
_BackendTag{}, __exec, | ||
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__view)>{ | ||
{}, __f, static_cast<std::size_t>(__n)}, | ||
__n, __view) | ||
.__deferrable_wait(); | ||
} | ||
|
||
|
@@ -101,9 +105,14 @@ __pattern_walk2(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt | |
auto __keep2 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode2, _ForwardIterator2>(); | ||
auto __buf2 = __keep2(__first2, __first2 + __n); | ||
|
||
auto __view1 = __buf1.all_view(); | ||
auto __view2 = __buf2.all_view(); | ||
|
||
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view()); | ||
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2)>{ | ||
{}, __f, static_cast<std::size_t>(__n)}, | ||
__n, __view1, __view2); | ||
|
||
// Call no wait, wait or deferrable wait depending on _WaitMode | ||
__future.wait(_WaitMode{}); | ||
|
@@ -130,10 +139,28 @@ _ForwardIterator2 | |
__pattern_swap(__hetero_tag<_BackendTag> __tag, _ExecutionPolicy&& __exec, _ForwardIterator1 __first1, | ||
_ForwardIterator1 __last1, _ForwardIterator2 __first2, _Function __f) | ||
{ | ||
return __pattern_walk2</*_WaitMode*/ __par_backend_hetero::__deferrable_mode, | ||
__par_backend_hetero::access_mode::read_write, | ||
__par_backend_hetero::access_mode::read_write>( | ||
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f); | ||
auto __n = __last1 - __first1; | ||
if (__n <= 0) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the case when There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Never if a valid sequence is passed :) I switched to |
||
return __first2; | ||
|
||
auto __keep1 = | ||
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator1>(); | ||
auto __buf1 = __keep1(__first1, __last1); | ||
|
||
auto __keep2 = | ||
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator2>(); | ||
auto __buf2 = __keep2(__first2, __first2 + __n); | ||
|
||
auto __view1 = __buf1.all_view(); | ||
auto __view2 = __buf2.all_view(); | ||
|
||
auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__brick_swap<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2)>{ | ||
{}, __f, static_cast<std::size_t>(__n)}, | ||
__n, __view1, __view2); | ||
__future.wait(__par_backend_hetero::__deferrable_mode{}); | ||
return __first2 + __n; | ||
} | ||
|
||
//------------------------------------------------------------------------ | ||
|
@@ -160,9 +187,15 @@ __pattern_walk3(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt | |
auto __keep3 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode3, _ForwardIterator3>(); | ||
auto __buf3 = __keep3(__first3, __first3 + __n); | ||
|
||
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, | ||
__buf1.all_view(), __buf2.all_view(), __buf3.all_view()) | ||
auto __view1 = __buf1.all_view(); | ||
auto __view2 = __buf2.all_view(); | ||
auto __view3 = __buf3.all_view(); | ||
|
||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2), | ||
decltype(__view3)>{{}, __f, static_cast<std::size_t>(__n)}, | ||
__n, __view1, __view2, __view3) | ||
.__deferrable_wait(); | ||
|
||
return __first3 + __n; | ||
|
@@ -1564,9 +1597,10 @@ __pattern_reverse(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterato | |
auto __keep = oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _Iterator>(); | ||
auto __buf = __keep(__first, __last); | ||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__reverse_functor<typename ::std::iterator_traits<_Iterator>::difference_type>{__n}, __n / 2, | ||
__buf.all_view()) | ||
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__reverse_functor<typename std::iterator_traits<_Iterator>::difference_type, | ||
decltype(__buf.all_view())>{{}, __n}, | ||
__n / 2, __buf.all_view()) | ||
.__deferrable_wait(); | ||
} | ||
|
||
|
@@ -1589,10 +1623,13 @@ __pattern_reverse_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bi | |
auto __keep2 = | ||
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator>(); | ||
auto __buf2 = __keep2(__result, __result + __n); | ||
auto __view1 = __buf1.all_view(); | ||
auto __view2 = __buf2.all_view(); | ||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__reverse_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n}, | ||
__n, __buf1.all_view(), __buf2.all_view()) | ||
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__reverse_copy<typename std::iterator_traits<_BidirectionalIterator>::difference_type, | ||
decltype(__view1), decltype(__view2)>{{}, __n}, | ||
__n, __view1, __view2) | ||
.__deferrable_wait(); | ||
|
||
return __result + __n; | ||
|
@@ -1626,24 +1663,27 @@ __pattern_rotate(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterator | |
auto __buf = __keep(__first, __last); | ||
auto __temp_buf = oneapi::dpl::__par_backend_hetero::__buffer<_ExecutionPolicy, _Tp>(__exec, __n); | ||
|
||
auto __view = __buf.all_view(); | ||
auto __temp_rng_w = | ||
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::write>(__temp_buf.get_buffer()); | ||
|
||
const auto __shift = __new_first - __first; | ||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__rotate_wrapper>(__exec), | ||
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_Iterator>::difference_type>{__n, __shift}, __n, | ||
__buf.all_view(), __temp_rng_w); | ||
unseq_backend::__rotate_copy<typename std::iterator_traits<_Iterator>::difference_type, decltype(__view), | ||
decltype(__temp_rng_w)>{{}, __n, __shift}, | ||
__n, __view, __temp_rng_w); | ||
|
||
//An explicit wait isn't required here because we are working with a temporary sycl::buffer and sycl accessors and | ||
//SYCL runtime makes a dependency graph to prevent the races between two __parallel_for patterns. | ||
|
||
using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>; | ||
auto __brick = unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}}; | ||
|
||
auto __temp_rng_rw = | ||
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::read_write>(__temp_buf.get_buffer()); | ||
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), __brick, | ||
auto __brick = unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__temp_rng_rw), | ||
decltype(__buf.all_view())>{ | ||
{}, _Function{}, static_cast<std::size_t>(__n)}; | ||
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), __brick, | ||
__n, __temp_rng_rw, __buf.all_view()) | ||
.__deferrable_wait(); | ||
|
||
|
@@ -1673,13 +1713,16 @@ __pattern_rotate_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bid | |
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator>(); | ||
auto __buf2 = __keep2(__result, __result + __n); | ||
|
||
auto __view1 = __buf1.all_view(); | ||
auto __view2 = __buf2.all_view(); | ||
|
||
const auto __shift = __new_first - __first; | ||
|
||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n, | ||
__shift}, | ||
__n, __buf1.all_view(), __buf2.all_view()) | ||
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type, | ||
decltype(__view1), decltype(__view2)>{{}, __n, __shift}, | ||
__n, __view1, __view2) | ||
.__deferrable_wait(); | ||
|
||
return __result + __n; | ||
|
@@ -1936,19 +1979,22 @@ __pattern_shift_left(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Rang | |
if (__n >= __mid) | ||
{ | ||
using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>; | ||
auto __brick = oneapi::dpl::unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}}; | ||
|
||
//TODO: to consider use just "read" access mode for a source range and just "write" - for a destination range. | ||
auto __src = oneapi::dpl::__ranges::drop_view_simple<_Range, _DiffType>(__rng, __n); | ||
auto __dst = oneapi::dpl::__ranges::take_view_simple<_Range, _DiffType>(__rng, __size_res); | ||
|
||
auto __brick = | ||
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__src), decltype(__dst)>{ | ||
{}, _Function{}, static_cast<std::size_t>(__size_res)}; | ||
|
||
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), | ||
__brick, __size_res, __src, __dst) | ||
.__deferrable_wait(); | ||
} | ||
else //2. n < size/2; 'n' parallel copying | ||
{ | ||
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType>{__size, __n}; | ||
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType, decltype(__rng)>{__size, __n}; | ||
oneapi::dpl::__par_backend_hetero::__parallel_for( | ||
_BackendTag{}, | ||
oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__shift_left_right>( | ||
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets fix the naming of this while were touching all its instances
__custom_brick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the historical convention within the
internal/
directory is to not use any leading underscores although it has changed a bit over time.I do not have a strong preference if we make this change or leave it as is, but maybe it fits in a broader discussion regarding the remaining implementations in this directory.