Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AutoBump] Merge with 4b7f07a0 (Aug 27) (12) #365

Merged
merged 167 commits into from
Dec 9, 2024

Conversation

mgehre-amd
Copy link
Collaborator

No description provided.

jurahul and others added 30 commits August 25, 2024 05:40
Comparison operations regression tests, from the original larger PR that
has been broken down: llvm#92272

---------

Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Co-authored-by: Andrzej Warzyński <andrzej.warzynski@gmail.com>
S.substr(N) is simpler than S.slice(N, StringRef::npos) and
S.slice(N, S.size()). Also, substr is probably better recognizable
than slice thanks to std::string_view::substr.
The helper can simply use VPRecipeBuilder::Plan.
I'm planning to change the inner loop to a range-based for loop.
…#105845)" (llvm#106000)" (llvm#106001)

This reverts commit 4b6c064.

Add a requirement for an amdgpu target in the test.
…/store on RV32. (llvm#105874)

In order to support -unaligned-scalar-mem properly, we need to be more
careful with immediates of global variables. We need to guarantee that
adding 4 in RISCVExpandingPseudos won't overflow simm12. Since we don't
know what the simm12 is until link time, the only way to guarantee this
is to make sure the base address is at least 8 byte aligned.

There were also several corner cases bugs in immediate folding where we
would fold an immediate in the range [2044,2047] where adding 4 would
overflow. These are not related to unaligned-scalar-mem.
…I.liveins(). NFC

MachineRegisterInfo::liveins returns std::pair<MCRegister, Register>.
Don't convert to std::pair<unsigned, unsigned>.
…5554)

The stub class for `FloatType` is present in `ir.pyi`, but it is missing
from the `__all__` export list.
This is a fix forward for the issue introduced in llvm#104523.
…en constructing the debug varaible for __coro_frame (llvm#105626)

As the title mentioned, do not search for the DILocalVariable for
__promise when constructing the debug variable for __coro_frame.

This should make sense because the debug variable of `__coro_frame`
shouldn't dependent on the debug variable of `__promise`. And actually,
it is not. Currently, we search the debug variable for `__promise` only
because we want to get the debug location and the debug scope for the
`__promise`. However, we can construct the debug location directly from
the debug scope of the being compiled function. Then it is not necessary
any more to search the `__promise` variable.

And this patch makes the codes to construct the debug variable for
`__coro_frame` to be more stable. Now we will always be able to
construct the debug variable for the coroutine frame no matter if we
found the debug variable for the __promise or not.

This patch is not strictly NFC but it is intended to not affect any end
users.
…lvm#87265)

This PR introduces new pass "amdgpu-sw-lower-lds". 

This pass lowers the local data store, LDS, uses in kernel and
non-kernel functions in module to use dynamically allocated global
memory. Packed LDS Layout is emulated in the global memory.
The lowered memory instructions from LDS to global memory are then
instrumented for address sanitizer, to catch addressing errors.
This pass only work when address sanitizer has been enabled and has
instrumented the IR. It identifies that IR has been instrumented using
"nosanitize_address" module flag.

For a kernel, LDS access can be static or dynamic which are direct
(accessed within kernel) and indirect (accessed through non-kernels).

**Replacement of Kernel LDS accesses:** 
- All the LDS accesses corresponding to kernel will be packed together,
where all static LDS accesses will be allocated first and then dynamic
LDS follows. The total size with alignment is calculated. A new LDS
global will be created for the kernel called "SW LDS" and it will have
the attribute "amdgpu-lds-size" attached with value of the size
calculated. All the LDS accesses in the module will be replaced by GEP
with offset into the "Sw LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
the dynamic LDS. This will be marked used by kernel and will have
MD_absolue_symbol metadata set to total static LDS size, Since dynamic
LDS allocation starts after all static LDS allocation.

- A device global memory equal to the total LDS size will be allocated.
At the prologue of the kernel, a single work-item from the work-group,
does a "malloc" and stores the pointer of the allocation in "SW LDS". To
store the offsets corresponding to all LDS accesses, another global
variable is created which will be called "SW LDS metadata" in this pass.

- **SW LDS:** 
It is LDS global of ptr type with name
"llvm.amdgcn.sw.lds.<kernel-name>".

- **SW LDS Metadata:** 
It is of struct type, with n members. n equals the number of LDS globals
accessed by the kernel(direct and indirect). Each member of struct is
another struct of type {i32, i32, i32}. First member corresponds to
offset, second member corresponds to size of LDS global being replaced
and third represents the total aligned size. It will have name
"llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an
intializer with static LDS related offsets and sizes initialized. But
for dynamic LDS related entries, offsets will be intialized to previous
static LDS allocation end offset. Sizes for them will be zero initially.
These dynamic LDS offset and size values will be updated with in the
kernel, since kernel can read the dynamic LDS size allocation done at
runtime with query to "hidden_dynamic_lds_size" hidden kernel argument.

- At the epilogue of kernel, allocated memory would be made free by the
same single work-item.

**Replacement of non-kernel LDS accesses:** 
- Multiple kernels can access the same non-kernel function. All the
kernels accessing LDS through non-kernels are sorted and assigned a
kernel-id. All the LDS globals accessed by non-kernels are sorted.

- This information is used to build two tables: 
- **Base table:** 
Base table will have single row, with elements of the row placed as per
kernel ID. Each element in the row corresponds to ptr of "SW LDS"
variable created for that kernel.

- **Offset table:** 
Offset table will have multiple rows and columns. Rows are assumed to be
from 0 to (n-1). n is total number of kernels accessing the LDS through
non-kernels. Each row will have m elements. m is the total number of
unique LDS globals accessed by all non-kernels. Each element in the row
correspond to the ptr of the replacement of LDS global done by that
particular kernel.

- A LDS variable in non-kernel will be replaced based on the information
from base and offset tables. Based on kernel-id query, ptr of "SW LDS"
for that corresponding kernel is obtained from base table. The Offset
into the base "SW LDS" is obtained from corresponding element in offset
table. With this information, replacement value is obtained.
/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp:260:10:
error: moving a local object in a return statement prevents copy elision [-Werror,-Wpessimizing-move]
  return std::move(OrderedKernels);
         ^
/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp:260:10: note: remove std::move call here
  return std::move(OrderedKernels);
         ^~~~~~~~~~              ~
1 error generated.
…mbdas (llvm#105999)

Fixes llvm#104722.

Missed handling `decltype(auto)` trailing return types for lambdas.
This was a mistake and regression on my part with my PR,
llvm#104722.

Added some missing unit tests to test for the various placeholder
trailing return types in lambdas.
… in compiler-rt with lit internal shell (llvm#105917)

There are several files in the compiler-rt subproject that have command
not found errors. This patch uses the `env` command to properly set the
environment variables correctly when using the lit internal shell.
fixes: llvm#102395 
[This change is relevant [RFC] Enabling the lit internal shell by
Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)
…arameter packs (llvm#102131)

We established an instantiation scope in order for constraint
equivalence checking to properly map the uninstantiated parameters.

That mechanism mapped even packs to themselves. Consequently, parameter
packs e.g. appearing in a function call, were not expanded. So they
would end up becoming `SubstTemplateTypeParmPackType`s that circularly
depend on the canonical declaration of the function template, which is
not yet determined, hence the spurious error.

No release note as I plan to backport it to 19.

Fixes llvm#101735

---------

Co-authored-by: cor3ntin <corentinjabot@gmail.com>
…6039)

Fixes linking error in llvm CI: 
"AMDGPUSwLowerLDS::run()':

AMDGPUSwLowerLDS.cpp:(.text._ZN12_GLOBAL__N_116AMDGPUSwLowerLDS3runEv+0x164):
undefined reference to `llvm::getAddressSanitizerParams(llvm::Triple
const&, int, bool, unsigned long*, int*, bool*)'"

llvm#87265 amdgpu-sw-lower-lds pass uses getAddressSanitizerParams method
from AddressSanitizer pass. It misses linking of LLVMInstrumentation to
AMDGPUCodegen. This PR adds it.
Take the intersection of the existing range attribute for the return
value and the inferred range.
…lvm#104941)

getMaskedTypeForICmpPair() tries to model non-and operands as x & -1.
However, this can end up confusing the matching logic, by picking the -1
operand as the "common" operand, resulting in a successful, but useless,
match. This is what causes commutation failures for some of the
optimizations driven by this function.

Fix this by treating a match against -1 as a non-match.
…lvm#104788)

This is a followup to llvm#104579
to remove the limitation on sinking loads/stores of allocas entirely,
even if this would introduce a phi node.

Nowadays, SROA supports speculating load/store over select/phi.
Additionally, SimplifyCFG with sinking only runs at the end of the
function simplification pipeline, after SROA. I checked that the two
tests modified here still successfully SROA after the SimplifyCFG
transform.

We should, however, keep the limitation on lifetime intrinsics. SROA
does not have speculation support for these, and I've also found that
the way these are handled in the backend is very problematic
(llvm#104776), so I think we
should leave them alone.
The legacy cost model in some parts checks if any of the operands are
constants via SCEV. Update VPlan construction to replace live-ins that
are constants via SCEV with such constants. This means VPlans (and
codegen) reflects what we computing the cost of and removes another case
where the legacy and VPlan cost model diverged.

Fixes llvm#105722.
This PR enables "amdgpu-sw-lower-lds" pass in the pipeline.
Also introduces "amdgpu-enable-sw-lower-lds" cmd line flag to
enbale/disable the pass.
yuxuanchen1997 and others added 26 commits August 26, 2024 18:46
The renamable flag is useful during MachineCopyPropagation but renamable
flag will be dropped after lowerCopy in some case.

This patch introduces extra arguments to pass the renamable flag to
copyPhysReg.
While working on a MIR unittest, I noticed that parseMIR includes an
unused argument that sets a function name. This is not only redundant
but also irrelevant, as parseMIR is designed to parse entire module, not
specific functions, even though most unittests contain a single function
per module. To streamline the API, I have removed this unnecessary
argument from parseMIR. However, if this argument was originally
included to enhance readability or for any other purpose, please let me
know.
add f8E5M2 and tests for np_to_memref

---------

Co-authored-by: Zhicheng Xiong <zhichengx@dc2-sim-c01-215.nvidia.com>
TSAN warns that `ptr` is read and write without protection in
`clearExpiredEntries` and in the destructor of `Owner`. Add an atomic
bool to synchronize these without incurring a cost when calling `get`.
…rt command in lit's internal shell (llvm#105961)

This patch fixes the incorrect usage of lit's built-in `export` command.
There is a typo in raising the error itself where the error being raised
had the wrong number of parameters passed in.

Fixes llvm#102386.
… tests (llvm#105754)

This patch rewrites tests in clang and compiler-rt that uses bash
command substitution syntax $() to execute the dirname command. This is
done so that the tests can be run using lit's internal shell.

Fixes llvm#102384.
…sts with lit internal shell (llvm#105729)

This patch addresses compatibility issues with the lit internal shell by
removing the use of subshell execution (parentheses and subshell syntax)
in the `merge-posix.test` and `vptr.cpp` tests. The lit internal shell
does not support parentheses, so the tests have been refactored to use
separate command invocations.

This change is relevant for enabling the lit internal shell by default,
as outlined in [[RFC] Enabling the Lit Internal Shell by
Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)

fixes: llvm#102401
…ompatibility (llvm#106115)

This patch addresses compatibility issues with the lit internal shell by
expanding and rewriting test scripts in the compiler-rt subproject.
These changes were prompted by the FileNotFound unresolved errors
encountered during the testing process, specifically when running the
command `LIT_USE_INTERNAL_SHELL=1 ninja check compiler-rt`.

**Why the error occurred:**
The error occurred because the original test scripts used process
substitution `(<(...))` in their diff commands. Process substitution
creates temporary files or FIFOs to hold command output, and these are
then passed to `diff`. However, the lit internal shell, which is more
limited than a typical shell like `bash`, does not support process
substitution. When lit tries to execute these commands, it is unable to
create or access the temporary files or FIFOs generated by process
substitution. As a result, lit attempts to open a file or directory that
doesn't exist, leading to the `FileNotFoundError`.

**Changes Made:**

- Instead of using process substitution, the commands now explicitly
redirect the output of `llvm-profdata show` to temporary files before
performing the `diff` comparison. This ensures that the lit internal
shell can correctly find and open these files, resolving the
`FileNotFoundError`.

[This change is relevant [RFC] Enabling the lit internal shell by
Default](https://discourse.llvm.org/t/rfc-enabling-the-lit-internal-shell-by-default/80179)
fixes: llvm#106111
…or-loop (llvm#106150)

This patch adds `REQUIRES: shell` to the `focus-function.test` because
the lit internal shell does not support the for loop syntax. This will
make the test file unsupported when running llvm-lit with its internal
shell implementation, which is enabled by turning on the
`LIT_USE_INTERNAL_SHELL=1`.

fixes: llvm#106111
By default, type legalization will try to promote the build_vector,
but that generic type legalizer doesn't support that. Bitcast to
vXi16 instead. Same as what we do for vXf16 without Zfhmin.

Fixes llvm#100846.
Add NotConstant(Null) roots for nonnull arguments and then propagate
them through nuw/inbounds GEPs.

Having this functionality in SCCP is useful because it allows reliably
eliminating null comparisons, independently of how deeply nested they
are in selects/phis. This handles cases that would hit a cutoff in
ValueTracking otherwise.

The implementation is something of a MVP, there are a number of obvious
extensions (e.g. allocas are also non-null).
…icit-bool-conversion (llvm#104882)

When readability-implicit-bool-conversion-check and
readability-uppercase-literal-suffix-check is enabled this will cause
you to apply a fix twice

from (!i) -> (i == 0u) to (i == 0U) twice instead will skip the middle
one

Adding this option allows this check to be in sync with readability-uppercase-literal-suffix, avoiding duplicate warnings.

Fixes llvm#40544
…lvm#104785)

There was some inconsistency with ConvertVectorToLLVM Pass builder,
files and option names.
This patch aims to move all occurences to ConvertVectorToLLVM.
OpenCL's vload_half builtin expects two arguments, but the current
TableGen definition expects three.
This change fixes the mismatch and adds a test to check this.
…lvm#105663)

Use SmallVectorImpl instead of SmallVector for function arguments
to give the caller greater flexibility in choice of initial size.
Base automatically changed from bump_to_b6603e1b to feature/fused-ops December 7, 2024 13:44
@mgehre-amd mgehre-amd merged commit 7b9c9c0 into feature/fused-ops Dec 9, 2024
10 checks passed
@mgehre-amd mgehre-amd deleted the bump_to_4b7f07a0 branch December 9, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.