-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement SYCL backend parallel_for
to improve bandwidth utilization
#1976
base: main
Are you sure you want to change the base?
Conversation
parallel_for
to improve bandwidth utilizationparallel_for
to improve bandwidth utilization
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
128 byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…ute work for small inputs Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
This reverts commit e4cbceb. Small sizes slightly slower and for horizontal vectorization no "real" benefit is observed.
Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the the small input case in the original kernel shows the best performance. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
We now flatten the user-provided ranges and find the minimum sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers). Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…t that for pattern launches exactly n work items Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Due to the revert of the vectorization path the original test provides sufficient coverage. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
085eaf5
to
505bdf3
Compare
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
include/oneapi/dpl/pstl/utils.h
Outdated
{ | ||
template <typename _Tp> | ||
void | ||
operator()(__lazy_ctor_storage<_Tp> __storage) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you pass __storage
parameter by value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I have made this a l-value reference.
__par_backend_hetero::access_mode::read_write>( | ||
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f); | ||
auto __n = __last1 - __first1; | ||
if (__n <= 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the case when __n < 0
is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never if a valid sequence is passed :) I switched to __n == 0
.
|
||
// Path that intentionally disables vectorization for algorithms with a scattered access pattern (e.g. binary_search) | ||
template <typename... _Ranges> | ||
class walk_scalar_base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why class walk_scalar_base
declared as class
but
template <typename _ExecutionPolicy, typename _F, typename _Range>
struct walk1_vector_or_scalar : public walk_vector_or_scalar_base<_Range>
declared as struct
?
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const | ||
{ | ||
// This is needed to enable vectorization | ||
auto __raw_ptr = __rng.begin(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think that
__raw_ptr
isn't very good name becausebegin()
usually linked in mind with iterator. Butraw
usually is some pointer. - Do we really need to have here local variable
__raw_ptr
? Can we pass__rng.begin()
instead of that variable into__vector_walk
call?
So now we have 3 entity with defined
Does these constexpr-variables really has different semantic? And if the semantic of these entities are the same, may be make sense to make some re-design to have only one entity |
In some moments implementation details remind me But what if we instead of two different functions template <typename _IsFull, typename _ItemId>
void
__vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
// This is needed to enable vectorization
auto __raw_ptr = __rng.begin();
oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
__raw_ptr);
}
// _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
template <typename _IsFull, typename _ItemId>
void
__scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
{
__f(__rng[__idx]);
} we will have some two functions with the same name and the format excepting the first parameter type which will be used as some Please take a look at |
include/oneapi/dpl/pstl/utils.h
Outdated
@@ -784,6 +785,32 @@ union __lazy_ctor_storage | |||
} | |||
}; | |||
|
|||
// Utility to explicitly call the destructor of __lazy_ctor_storage as a callback functor | |||
struct __lazy_ctor_storage_deleter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably I don't understand something, but why this struct
has name lazy
?
It's looks like some kind of visitor pattern implementation, which call destroy()
for each element in container.
What is exactly the lazy
functional here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is a callable deleter for __lazy_ctor_storage
which is storage that has a delayed "lazy" constructor. Perhaps it would be better to instead add a static member function to the __lazy_ctor_storage
union, get_deleter_callable()
, which returns a lambda to delete a __lazy_ctor_storage&
passed as an argument. This would remove any confusion, and group these together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danhoeflinger's explanation is correct here. I have adopted the suggestion that introduces a static __get_callable_deleter()
member which returns a lambda that calls __destroy()
.
void | ||
operator()(std::false_type, _IdxType __start_idx, _LoadOp __load_op, _Acc... __acc) const | ||
{ | ||
std::uint8_t __elements = std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We assume here that std::min(std::size_t{}, std::size_t{})
will always fit into std::uint8_t
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense... __vec_size
is 4 or less, but __n - __start_idx
can only be assumed to fit within size_t
(and you don't want to overflow before the min). The result will be 4 or less, which fits in 8 bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, logically this will be safe. Vector sizes will always be <= 4 which is enforced through a static assert in the containing struct.
One more point: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of review. I've not gotten to all the details yet, but this is enough to be interesting.
template <template <typename...> typename _WrapperType, typename... _Ts> | ||
struct __min_nested_type_size<_WrapperType<_Ts...>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this formulation leaves us open for bugs in the future with no restrictions on what _WrapperType
could be.
What we probably want is something like tuple-like from c++23.
Would we be better off limiting this to std::tuple and and onedpl's tuple with explicit partial specializations? or limit it via some enable_if
magic?
Right now any templated type is reduced to its template arguments, which isn't always the case. Imagine a contrived user provided type for their input range which has a template argument which isn't used as a member field.
template <typename T>
struct __my_converting_type{
std::uint8_t var;
T get_conversion(){ return T{var};}
};
This would match the _WrapperType
flavor I think, and return the wrong result if I understand the intention correctly. We would want such a type to use sizeof
.
// To ensure that the large submitter gets tested on all devices, set the switch point to 10,000 only when compiling | ||
// oneDPL tests. | ||
#if TEST_FOR_ALGORITHM_LARGE_SUBMITTER | ||
return 10000; | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we try to avoid letting testing specific code seep into the main repo, though I understand the need here to gain coverage.
Can we instead perhaps add one large test to the "normal" test suite which would hit the large submitter and enable it only under the same circumstances. I understand the desire to limit the test time of the suite, but this both infects the main repo with test specifics, but also adds coverage of this code in situations it will never encounter in the wild, and doesn't cover any real sizes.
Id really prefer not to, but if we do have to have this, I'd suggest uglifying the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should try avoiding this. The reason I tested this way was due to concerns regarding our growing test / CI times. However, adding a few cases above the threshold to call the large submitter path may not be too bad for a limited set of for-algorithm cases. I will take a look and see if this can be done entirely within our test framework without artificially lowering the threshold to call this path in the implementation itself.
const std::uint32_t __sub_group_size = __sub_group.get_local_linear_range(); | ||
const std::uint32_t __sub_group_id = __sub_group.get_group_linear_id(); | ||
const std::uint32_t __sub_group_local_id = __sub_group.get_local_linear_id(); | ||
const std::size_t __work_group_id = __item.get_group().get_group_linear_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we could move this out of the branch and use in both sides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved __work_group_id
outside of the branch and used it in both sides.
static inline std::tuple<std::size_t, std::size_t, bool> | ||
__stride_recommender(const sycl::nd_item<1>& __item, std::size_t __count, std::size_t __iters_per_work_item, | ||
std::size_t __adj_elements_per_work_item, std::size_t __work_group_size) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a general utility which might have utility for other commutative operations beyond just parallel_for
or is there a reason you believe this to be specific to this algorithm / kernel?
If we think it might be useful, we could lift this to a general utility level. Obviously we don't need to incorporate it elsewhere in this PR. An alternative is to add an issue to explore this and only lift it if we find utility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SergeyKopienko and I had brief discussion regarding this in the first PR: #1870 (comment).
I think in the initial PR making it a static member was the best choice. However, in this new PR I am pushing to expose more general utilities for future use (e.g. strided loops, vectorization paths) to enforce good memory access patterns, and I think stride recommender can be a good general utility. @SergeyKopienko What are your thoughts here with adding this function as a utility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless there is a clear and easy consensus to make it a utility, lets just leave it here and add an issue to explore its use elsewhere as a utility at a later time. I don't want to derail the PR for this purpose.
void | ||
operator()(std::false_type, _IdxType __start_idx, _LoadOp __load_op, _Acc... __acc) const | ||
{ | ||
std::uint8_t __elements = std::min(std::size_t{__vec_size}, std::size_t{__n - __start_idx}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense... __vec_size
is 4 or less, but __n - __start_idx
can only be assumed to fit within size_t
(and you don't want to overflow before the min). The result will be 4 or less, which fits in 8 bits.
std::forward<_Range2>(__result)) | ||
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _CopyBrick, std::decay_t<_Range1>, | ||
std::decay_t<_Range2>>{ | ||
{}, _CopyBrick{}, static_cast<std::size_t>(__n)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dislike having to pass {}
as the first argument here. I'm not sure I even really understand why its necessary, is this for the base class?
Can we just define constructors which accepts only the brick and size to avoid this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is necessary for the base class when doing aggregate initialization. I will define constructors for the bricks so these can all be removed.
struct custom_brick | ||
#if _ONEDPL_BACKEND_SYCL | ||
template <typename Comp, typename T, typename _Range, search_algorithm func> | ||
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets fix the naming of this while were touching all its instances __custom_brick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the historical convention within the internal/
directory is to not use any leading underscores although it has changed a bit over time.
I do not have a strong preference if we make this change or leave it as is, but maybe it fits in a broader discussion regarding the remaining implementations in this directory.
auto __raw_ptr3 = __rng3.begin(); | ||
|
||
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType1> __rng1_vector[__base_t::__preferred_vector_size]; | ||
oneapi::dpl::__internal::__lazy_ctor_storage<_ValueType2> __rng2_vector[__base_t::__preferred_vector_size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to combine walk*_vectors_or_scalars
together with some complicated fold instructions, lambdas, tuples, and std::apply.
Take a look at the first answer of https://stackoverflow.com/questions/7230621/how-can-i-iterate-over-a-packed-variadic-template-argument-list. I think you should do something similar, chaining together instructions by returning tuples and then with std::apply
.
Here is an example I was playing with.
https://godbolt.org/z/vc8dK4ed6
In the end, I'm not sure if (1) its actually possible and (2) its worth the complexity to consolidate these structs, but its worth considering...
* Deleter is now a callable lambda returned from a static function in the class * Deleter accepts l-value reference to __lazy_ctor_storage Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel for pattern through two ideas:
Implementation Details
binary_search
)To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention for these to be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.
This PR will supersedes #1870. Initially, the plan was to merge this PR into 1870 but after comparing the diff, I believe the most straightforward approach will be to target this directly to main.