Implement grouped conv interface #80870

This reverts commit 2ec122d.

Casting the result of `Section.getAddressWithOffset()` goes wrong if we are on a 32-bit platform whose addresses are regarded as signed; in that case, just doing `(uint64_t)Section.getAddressWithOffset(...)` or `reinterpret_cast<uint64_t>(Section.getAddressWithOffset(...))` will result in sign-extension. We use these expressions when constructing branch stubs, which is before we know the final load address, so we can just switch to the `Section.getLoadAddressWithOffset(...)` method instead. Doing that is also more consistent, since when calculating relative offsets for relocations, we use the load address anyway, so the code currently only works because `Section.Address` is equal to `Section.LoadAddress` at this point. Fixes llvm#94478.
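The sign-extension hazard can be illustrated on any host with a minimal sketch. It models a 32-bit address as an integer rather than a real pointer, and the helper names `widenSigned` and `widenUnsigned` are hypothetical, not part of the RuntimeDyld code.

```cpp
#include <cassert>
#include <cstdint>

// Models a 32-bit address that the platform treats as signed: widening it
// to 64 bits copies the sign bit into the upper 32 bits.
// (The uint32_t -> int32_t conversion is modular on two's complement
// targets, and guaranteed so from C++20 onwards.)
uint64_t widenSigned(uint32_t addr) {
  return static_cast<uint64_t>(static_cast<int32_t>(addr));
}

// Models the intended behaviour: the address is zero-extended.
uint64_t widenUnsigned(uint32_t addr) {
  assert(sizeof(addr) == 4 && "sketch assumes a 32-bit address");
  return static_cast<uint64_t>(addr);
}
```

Only addresses with the top bit set are affected, which is why the problem can go unnoticed on most inputs.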

…s on LA64 (llvm#93813) Materializing constants on LoongArch is simpler if the constant is sign extended from i32. By default i32 constant operands of phis are zero extended. This patch adds a hook to allow LoongArch to override this for i32. We have an existing isSExtCheaperThanZExt, but it operates on EVT which we don't have at these places in the code.

…fmin (llvm#91936)

Implements fmaxf16 and fminf16, which are two missing functions listed here: llvm#93566

This patch makes all errors start with a lowercase letter and removes trailing periods and newlines. This fixes inconsistencies between error messages and facilitates concatenating them.

…ath functions (llvm#94535) llvm#93566

…ges (llvm#94259) This patch changes the crashlog image loading default behaviour to load images not only from the crashed thread but also from the application-specific backtrace thread. This patch also moves the Application Specific Backtrace / Last Exception Backtrace tag from the thread queue field to the thread name. rdar://128276576 Signed-off-by: Med Ismail Bennani <[email protected]>

…n object files (llvm#94487) Follow up to llvm#92042

Follow-up to llvm#86912. The motivation of the patch series is that, for a module interface unit `X`, when the modules that `X` depends on change in ways that are not relevant to `X`, we hope the BMI of `X` won't change. This specific patch targets irrelevant declaration changes: if a dependency only changes declarations that `X` does not use, the BMI of `X` should stay the same. **However**, I found the patch itself is not very useful in practice, since adding or removing declarations changes the state of identifiers and types in most cases. For the simplest example,
```
// partA.cppm
export module m:partA;
// partA.v1.cppm
export module m:partA;
export void a() {}
// partB.cppm
export module m:partB;
export void b() {}
// m.cppm
export module m;
export import :partA;
export import :partB;
// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
    b();
}
```
the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`, since `partA.v1.cppm` introduces new identifiers and types (the function prototype). So in this patch, we have to write the tests as:
```
// partA.cppm
export module m:partA;
export int getA() { ... }
export int getA2(int) { ... }
// partA.v1.cppm
export module m:partA;
export int getA() { ... }
export int getA(int) { ... }
export int getA2(int) { ... }
// partB.cppm
export module m:partB;
export void b() {}
// m.cppm
export module m;
export import :partA;
export import :partB;
// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
    b();
}
```
so that the newly introduced declaration `int getA(int)` doesn't introduce new identifiers and types, and the BMI of `onlyUseB` can stay unchanged. While this does not look great, the patch should serve as the base for the patches that erase transitive changes for identifiers and types, since I don't know how we can introduce new types and identifiers without introducing new declarations.
Given how tightly declarations, types and identifiers are related, I think we can only reach the ideal state after we complete the series for all three entities. The design of the patch is similar to llvm#86912, which extends the 32-bit DeclID to 64 bits and uses the higher bits to store the module file index and the lower bits to store the local Decl ID. A slight difference is that we only use 48 bits to store the new DeclID, since we try to use the higher 16 bits to store the module ID in the prefix of the Decl class. Previously, we used 32 bits to store the module ID and 32 bits to store the DeclID. I didn't want to allocate additional space, so I tried to keep the total at 64 bits. A potentially interesting point here is the relationship between the module ID and the module file index. I feel we can get the module file index from the module ID, but I didn't prove it or implement it, since I want to keep the patch itself as small as possible. We can do that in the future if we want. Another change in the patch is the new concept of a Decl Index, which is the index into the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration was simply the Decl ID minus PREDEFINED_DECL_NUMs, so the two were used interchangeably in some places. This patch tries to separate these two concepts. As with llvm#86912, the change increases the on-disk PCM file sizes. As declaration IDs may be the most numerous IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change brings a 6.6% increase in the on-disk PCM size. No compile-time performance regression was observed. Given the benefits in the motivating example, I think the cost is worthwhile.
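The packing idea described above can be sketched as follows. The exact bit split is an assumption made for illustration (16 bits of module file index above 32 bits of local ID, 48 bits in total), and `packDeclID`, `moduleFileIndexOf` and `localIDOf` are hypothetical names, not the ones used in the patch.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMED layout, for illustration only:
//   bits  0..31  local declaration ID
//   bits 32..47  module file index
//   bits 48..63  left free (the message reserves them for the module ID)
constexpr unsigned LocalBits = 32;
constexpr unsigned IndexBits = 16;
constexpr uint64_t LocalMask = (uint64_t{1} << LocalBits) - 1;
constexpr uint64_t IndexMask = (uint64_t{1} << IndexBits) - 1;

// Combines a module file index and a local ID into one 48-bit value.
uint64_t packDeclID(uint32_t moduleFileIndex, uint32_t localID) {
  assert(moduleFileIndex <= IndexMask && "index must fit in 16 bits");
  return (static_cast<uint64_t>(moduleFileIndex) << LocalBits) | localID;
}

// Extracts the module file index from the higher bits.
uint32_t moduleFileIndexOf(uint64_t id) {
  return static_cast<uint32_t>((id >> LocalBits) & IndexMask);
}

// Extracts the local declaration ID from the lower bits.
uint32_t localIDOf(uint64_t id) {
  return static_cast<uint32_t>(id & LocalMask);
}
```

Keeping the top 16 bits clear is what lets the packed value share a 64-bit word with another 16-bit field.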

…ave Zvfbfmin" (llvm#94565) Reverts llvm#91936 Premerge bots are broken.

…vm#92746) This patch adds support for the GNU extension intrinsic GETCWD (llvm#84203). Usage information and an example have been added to `flang/docs/Intrinsics.md`. The patch contains both the lowering and the runtime code and works on both Windows and Linux.
| System  | Implementation |
|---------|----------------|
| Windows | _getcwd        |
| Linux   | getcwd         |

…86512) This patch implements a `__is_bitwise_cloneable` builtin in clang. The builtin is used as a guard to check that a type can be safely bitwise copied by memcpy. It's functionally similar to `__is_trivially_copyable`, but covers a wider range of types (e.g. classes with virtual functions). The compiler guarantees that after the copy, the destination object has the same object representation as the source object. It is up to the user to guarantee that program semantic constraints are satisfied. Context: https://discourse.llvm.org/t/extension-for-creating-objects-via-memcpy

…lvm#93814) Although i32 type is illegal in the backend, LA64 has pretty good support for i32 types by using W instructions. By adding n32 to the DataLayout string, middle end optimizations will consider i32 to be a native type. One known effect of this is enabling LoopStrengthReduce on loops with i32 induction variables. This can be beneficial because C/C++ code often has loops with i32 induction variables due to the use of `int` or `unsigned int`. If this patch exposes performance issues, those are better addressed by tuning LSR or other passes.

This commit enhances the docstring of `translateModuleToLLVMIR` as a follow-up to llvm#94445

…lvm#94522)

As the comment already indicates, only replacement with undef is problematic, as it introduces an additional use of undef. Use the correct ValueTracking helper.

If we're only checking for undef, then also only look for undef elements in the vector (rather than undef and poison).

…#91715)
- There is no restriction on a loop with controlled convergent operations when the relevant tokens are defined and used within the loop.
- When a token defined outside a loop is used inside (also called a loop convergence heart), unrolling is allowed only in the absence of remainder or runtime checks.
- When a token defined inside a loop is used outside, such a loop is said to be "extended". This loop can only be unrolled by also duplicating the extended part lying outside the loop. Such unrolling is disabled for now.
- Clean up loop hearts: When unrolling a loop with a heart, duplicating the heart will introduce multiple static uses of a convergence control token in a cycle that does not contain its definition. This violates the static rules for tokens, and needs to be cleaned up into a single occurrence of the intrinsic.
- Spell out the initializer for UnrollLoopOptions to improve readability.
Original implementation [D85605] by Nicolai Haehnle <[email protected]>.

…lvm#93806) The m_ZExtOrSelf() family of matchers currently incorrectly calls std::forward twice on the same value. However, just removing those causes other complications, because then template arguments get incorrectly inferred to const references instead of the underlying value types. Things become a mess. Instead, just completely remove the use of std::forward and rvalue references from SDPatternMatch. I don't think they really provide value in this context, especially as they're not used consistently in the first place.

…Y` are known signed/unsigned
Several transforms:
1) If known `Y < 0`:
    - slt -> ult: https://alive2.llvm.org/ce/z/9zt2iK
    - sle -> ule: https://alive2.llvm.org/ce/z/SPoPNF
    - sgt -> ugt: https://alive2.llvm.org/ce/z/IGNxAk
    - sge -> uge: https://alive2.llvm.org/ce/z/joqTvR
2) If known `Y >= 0`:
    - `(X & PosY) s> X --> X s< 0` - https://alive2.llvm.org/ce/z/7e-5BQ
    - `(X & PosY) s> X --> X s< 0` - https://alive2.llvm.org/ce/z/jvT4Gb
3) If known `X < 0`:
    - `(NegX & Y) s> NegX --> Y s>= 0` - https://alive2.llvm.org/ce/z/ApkaEh
    - `(NegX & Y) s<= NegX --> Y s< 0` - https://alive2.llvm.org/ce/z/oRnfHp
Closes llvm#94417
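As a sanity check independent of the alive2 proofs, the folds can be brute-forced on 8-bit values. This is only a sketch: it assumes the compared expression has the shape `icmp pred (X & Y), X` (the title is truncated, so that shape is an assumption for group 1), and `checkAndIcmpFolds` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Exhaustively checks the listed folds for every pair of 8-bit values.
// Returns true when no counterexample exists.
bool checkAndIcmpFolds() {
  for (int xi = -128; xi <= 127; ++xi) {
    for (int yi = -128; yi <= 127; ++yi) {
      const int8_t x = static_cast<int8_t>(xi);
      const int8_t y = static_cast<int8_t>(yi);
      const int8_t masked = static_cast<int8_t>(x & y);
      const uint8_t ux = static_cast<uint8_t>(x);
      const uint8_t um = static_cast<uint8_t>(masked);

      if (y < 0) {
        // 1) (X & Y) keeps X's sign bit, so signed and unsigned
        //    comparisons against X agree.
        if ((masked < x) != (um < ux)) return false;
        if ((masked <= x) != (um <= ux)) return false;
        if ((masked > x) != (um > ux)) return false;
        if ((masked >= x) != (um >= ux)) return false;
      }
      if (y >= 0) {
        // 2) (X & PosY) s> X  -->  X s< 0
        if ((masked > x) != (x < 0)) return false;
      }
      if (x < 0) {
        // 3) (NegX & Y) s> NegX  -->  Y s>= 0
        if ((masked > x) != (y >= 0)) return false;
        //    (NegX & Y) s<= NegX -->  Y s< 0
        if ((masked <= x) != (y < 0)) return false;
      }
    }
  }
  return true;
}
```

An exhaustive 8-bit check is not a proof for all bit widths, but it catches sign mistakes in a proposed fold quickly.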

Cleanup for llvm#94504

…UEs (llvm#94458) `SelectionDAGBuilder::handleDebugValue` has a parameter `Order` which represents the insert-at position for the new DBG_VALUE. Prior to this patch `SelectionDAGBuilder::SDNodeOrder` is used instead of the `Order` parameter. The only code-paths where `Order != SDNodeOrder` are the two calls to `handleDebugValue` from `salvageUnresolvedDbgValue`. `salvageUnresolvedDbgValue` is called from `resolveOrClearDbgInfo` and `dropDanglingDebugInfo`. The former is called after SelectionDAG completes one block. Some dbg.values can't be lowered to DBG_VALUEs right away. These get recorded as 'dangling' - their order-number is saved - and get salvaged later through `dropDanglingDebugInfo`, or if we've still got dangling debug info once the whole block has been emitted, through `resolveOrClearDbgInfo`. Their saved order-number is passed to `handleDebugValue`. Prior to this patch, DBG_VALUEs inserted using these functions are inserted at the "current" `SDNodeOrder` rather than the intended position that is passed to the function. Fix and add test.

… globals (llvm#94497) The 'metadata' delta pass will remove !dbg attachments from globals (which are DIGlobalVariableExpression nodes). The DIGlobalVariableExpressions don't get eliminated from the IR however if they are still referenced by the globals field in DICompileUnit. Teach the 'di-metadata' pass to try removing global variable operands from metadata tuples as well as DINodes.

When a critical construct is present inside another construct where privatizations may occur, such as a parallel construct, some privatizations are skipped if the corresponding symbols are defined inside the critical section only (see the example below). This happens because, while critical constructs have a "body", they don't have a separate scope (which makes sense, since no privatizations can occur in them). Because of this, in semantics phase, it's not possible to insert a new host association symbol, but instead the symbol from the enclosing context is used directly. This makes symbol collection in DataSharingProcessor consider the new symbol to be defined by the critical construct, instead of by the enclosing one, which causes the privatization to be skipped. Example: ``` !$omp parallel default(firstprivate) !$omp critical i = 200 !$omp end critical !$omp end parallel ``` This patch fixes this by identifying constructs where privatizations may not happen and skipping them during the collection of nested symbols. Currently, this seems to happen only with critical constructs, but others can be easily added to the skip list, if needed. Fixes llvm#75767

After O3 opt pipeline, the alignment of toc-data symbol is changed which is unexpected.

The local PTY is not available for the remotely executed lldb-server to pass the test. Also, in general, we cannot execute the local lldb-server instance because it could be compiled for the different system/cpu target.

Test coverage for llvm#94504

…nd hasOperation. Avoids the need to explicitly test both commuted variants and doesn't match custom lowering after legalization. Cleanup for llvm#94504

…lvm#94474) Use tablegen to generate the pass constructor. I removed the duplicated pass option handling. I don't understand why the manual instantiation of the pass needs its own duplicate of the pass options in the (automatically generated) base class (even with the option to ignore the pass options in the base class). This pass doesn't need changes to support other top level operations.

… ADDI* nodes (llvm#93642) Simultaneously, the `ADDItoc` machineinstr is generated in `PPCISelDAGToDAG::Select` so the pattern is not used and can be removed.

This adds a `MemberPointer` class along with a `PT_MemberPtr` primitive type. A `MemberPointer` has a `Pointer` Base as well as a `Decl*` (could be `ValueDecl*`?) decl it points to. For the actual logic, this mainly changes the way we handle `PtrMemOp`s in `VisitBinaryOperator`.

…adMode (llvm#94452) NFCI; this just preserves SI_INIT_EXEC and SI_INIT_EXEC_FROM_INPUT instructions a little longer so that we can reliably identify them in SIWholeQuadMode.

This is a mostly-target-independent variadic function optimisation and lowering pass. It is only enabled for AMDGPU in this initial commit. The purpose is to make C style variadic functions a zero cost abstraction. They are lowered to equivalent IR which is then amenable to other optimisations. This is inherently slightly target specific but much less so than one might expect - the C varargs interface heavily constrains the ABI design divergence. The pass is primarily tested from webassembly. This is because wasm has a straightforward variadic lowering strategy which coincides exactly with what this pass transforms code into and a struct passing convention with few cases to check. Adding further targets conventions is straightforward and elided from this patch primarily to simplify the review. Implemented in other branches are Linux X86, AMD64, AArch64 and NVPTX. Testing for targets that have existing lowering for va_arg from clang is most efficiently done by checking that clang | opt completely elides the variadic syntax from test cases. The lowering produces a struct for each call site which can be inspected to check the various alignment and indirections are correct. AMDGPU presently has no variadic support other than some ad hoc printf handling. Combined with the pass being inactive on all other targets landing this represents strict increase in capability with zero risk. Testing and refining will continue post commit. In addition to the compiler tests included here, a self contained x64 clang/musl toolchain was constructed using the "lowering" instead of the systemv ABI and used to build various C programs like lua and libxml2.

…ult init expression (llvm#91879)" (llvm#94597) This depends on llvm#92527 which needs to be reverted due to llvm#92527 (comment). This reverts commit 905b402. Co-authored-by: Bogdan Graur <[email protected]>

This reverts commit 97c866f. This fails on 32bit machines. See llvm#92083

…rary created by aggregate initialization using a default member initializer" (llvm#92527)" (llvm#94600) Reverting due to llvm#92527 (comment). This reverts commit f049d72. Co-authored-by: Bogdan Graur <[email protected]>

…93963) Adds a calling convention for calls to the `__arm_get_current_vg` support routine, which preserves X1-X15, X19-X29, SP, Z0-Z31 & P0-P15. See ARM-software/abi-aa#263

We need them for scalable address calculation and legal scalable addressing modes.

Remove some #includes in ExpandVariadics.cpp as it will cause layering violations.

/llvm-project/llvm/lib/Transforms/IPO/ExpandVariadics.cpp:426:14: error: unused variable 'OriginalFunctionIsDeclaration' [-Werror,-Wunused-variable]
  const bool OriginalFunctionIsDeclaration = OriginalFunction->isDeclaration();
             ^
/llvm-project/llvm/lib/Transforms/IPO/ExpandVariadics.cpp:445:13: error: unused variable 'VariadicWrapperDefine' [-Werror,-Wunused-variable]
  Function *VariadicWrapperDefine =
            ^
2 errors generated.

…4239) VPT blocks that do not produce an interesting 'output' (like a stored value or reduction result), do not need to be predicated on vctp for the whole loop to be tail-predicated. Just producing results for the valid tail predication lanes should be enough.

clangAnalysis is already being pulled in via clang_target_link_libraries(). Also listing it in LINK_LIBS means that we'll link both against the static libraries and the shared libclang-cpp.so library if CLANG_LINK_CLANG_DYLIB is enabled, and waste time on unnecessary LTO.

…g-11 bug. (llvm#94569) The conversion between _Float16 and long double will crash clang-11 on aarch64. This is fixed in clang-12: https://godbolt.org/z/8ceT9454c

…pansion Noticed while working on llvm#94601

Derived type components may use a given `Symbol` regardless of what parent objects they are a part of. Because of that, simply using a symbol address is not sufficient to determine object identity. Make the designator a part of the IdTy. To compare identities, when symbols are equal (and non-null), compare the designators.

There were a handful of scope flags that were not handled in the dump function, which would then lead to an assert.

Test cases inspired by llvm#90417.

The test case is adapted from llvm/test/CodeGen/RISCV/fp16-promote.ll, because it covers some more IR patterns that ought to be common. Fixes llvm#93894

This will be used for a new CI job that runs the static analyzer.

This fixes compile time regression after llvm#93692.

…lvm#93844) Summary: Currently, we register images into a linear table according to the logical OpenMP device identifier. We then initialize all of these images as one block. This logic requires that images are compatible with *all* devices instead of just the one that they can run on. This prevents us from running on systems with heterogeneous devices (e.g. image 1 runs on device 0 and image 0 runs on device 1). This patch reworks the logic by instead making the compatibility check a per-device query. We then scan every device to see if it's compatible and register the images as they come.

IceLakeServer was copying these from SkylakeServer, but integer HADD/SUB can now run on an extra port

IceLakeServer/SkylakeServer can only use Port01 for the FADD/FSUB stage Confirmed with uops.info + Agner

…n_global_load_lds` (llvm#94376)

Summary: This reverts commit 574ab7e.

Revamp the NVVMIntrRange pass making the following updates:
- Use range attributes over range metadata. This is what instcombine has moved to for ranges on intrinsics in llvm#88776 and it seems a bit cleaner.
- Consider the `!"maxntid{x,y,z}"` and `!"reqntid{x,y,z}"` function metadata when adding ranges for `tid` sreg intrinsics. This can allow for smaller ranges and more optimization.
- When range attributes are already present, use the intersection of the old and new range. This complements the metadata change by allowing ranges to be shrunk when an intrinsic is in a function which is inlined into a kernel with metadata. While we don't call this more than once yet, we should consider adding a second call after inlining, once this has had a chance to soak for a while and no issues have arisen.
I've also re-enabled this pass in the TM; it was disabled years ago due to "numerical discrepancies" https://reviews.llvm.org/D96166. In our testing we haven't seen any issues with adding ranges to intrinsics, and I cannot find any further info about what issues were encountered.
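The "intersection of the old and new range" step can be sketched with plain half-open integer ranges. This is not the LLVM `ConstantRange` API; the `Range` struct and `intersect` function are hypothetical stand-ins, and the bounds used in the usage note are made-up example values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Half-open range [lo, hi) of possible values for an intrinsic result.
struct Range {
  uint32_t lo;
  uint32_t hi;
};

// Returns the values allowed by both ranges, or nullopt if they are disjoint.
std::optional<Range> intersect(Range a, Range b) {
  assert(a.lo <= a.hi && b.lo <= b.hi && "ranges must be well formed");
  const Range r{std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
  if (r.lo >= r.hi)
    return std::nullopt;
  return r;
}
```

For example, an existing range of [0, 1024) on a thread-index read combined with a tighter bound of [0, 128) derived from kernel metadata yields [0, 128), so a later run of the pass can only shrink a range, never widen it.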

…et for 63cda2d

…count (llvm#84189) Adds an AArch64-specific version of isLSRCostLess, changing the relative importance of the various terms from the formulae being evaluated. This has been split out from my vscale-aware LSR work, see the RFC for reference: https://discourse.llvm.org/t/rfc-vscale-aware-loopstrengthreduce/77131

/llvm-project/llvm/lib/Target/NVPTX/NVVMIntrRange.cpp:33:12: error: private field 'SmVersion' is not used [-Werror,-Wunused-private-field]
  unsigned SmVersion;
           ^
1 error generated.

Co-authored-by: Marianne Mailhot-Sarrasin <[email protected]>

PR llvm#75125 introduced upward propagation of some OMPT-related CMake variables. For stand-alone builds this results in a warning that `SCOPE_PARENT` has no meaning in a top-level directory.

`unmerge_i64` and `unmerge_i32` were exactly the same test case. This PR fixes that, so that `unmerge_i32` actually unmerges a 32-bit value into two 16-bit values.

Extend the folding ability of the RewriteAsConstant patterns to include tensor.pad operations on constants. The new pattern will constant fold tensor.pad operations which operate on tensor constants and have statically resolvable padding sizes/values.
    %init = arith.constant dense<[[6, 7], [8, 9]]> : tensor<2x2xi32>
    %pad_value = arith.constant 0 : i32
    %0 = tensor.pad %init low[1, 1] high[1, 1] {
      ^bb0(%arg1: index, %arg2: index):
        tensor.yield %pad_value : i32
    } : tensor<2x2xi32> to tensor<4x4xi32>
becomes
    %cst = arith.constant dense<[[0, 0, 0, 0], [0, 6, 7, 0], [0, 8, 9, 0], [0, 0, 0, 0]]> : tensor<4x4xi32>
Co-authored-by: Spenser Bauman <sabauma@fastmail>
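What the folded constant contains can be reproduced with a small stand-alone sketch. It works on nested `std::vector`s rather than MLIR attributes, and `padConstant` is a hypothetical helper, not the pattern's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Pads a rectangular 2-D constant with `padValue`: `lowRows`/`lowCols` cells
// are added before the data and `highRows`/`highCols` cells after it.
Matrix padConstant(const Matrix &in, std::size_t lowRows, std::size_t lowCols,
                   std::size_t highRows, std::size_t highCols, int padValue) {
  assert(!in.empty() && "input must have at least one row");
  const std::size_t rows = in.size();
  const std::size_t cols = in[0].size();

  // Start from a result filled entirely with the padding value.
  Matrix out(lowRows + rows + highRows,
             std::vector<int>(lowCols + cols + highCols, padValue));

  // Copy the original elements into their shifted positions.
  for (std::size_t r = 0; r < rows; ++r) {
    assert(in[r].size() == cols && "input must be rectangular");
    for (std::size_t c = 0; c < cols; ++c)
      out[lowRows + r][lowCols + c] = in[r][c];
  }
  return out;
}
```

Calling it with the 2x2 input from the example, a padding of 1 on every side and a pad value of 0 gives the same 4x4 result as the folded `%cst`.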

Create a new class and file for functions that update GDB index.

Oryon is an ARM V8 AArch64 CPU from Qualcomm. --------- Co-authored-by: Wei Zhao <[email protected]>

…t_image_dim calls (llvm#94467) This PR is to add validation to the test case with get_image_array_size/get_image_dim calls (transcoding/check_ro_qualifier.ll). This test case didn't pass validation because of invalid emission of OpCompositeExtract instruction (Result Type must be the same type as Composite.). In order to fix the problem this PR improves type inference in general and partially addresses issues: * llvm#91998 * llvm#91997 A reproducer from the description of the latter issue is added as a new test case as a part of this PR.

Extra tests for llvm#94610.

We already assert that the given PC is in range and that the function has a body, so the SrcMap should generally never be empty. However, when generating destructors, we create quite a few instructions for which we have no source information, which may cause the previous assertion to fail. Return the end of the source map in this case.

There is simply way too much going on inside getNode. The complicated constant folding of vector handling works by looking for build_vector operands, and then tries to getNode the scalar element and then checks if constants were the result. As a side effect, this produces unused scalar operation nodes (previously, without flags). If the vector operation were later scalarized, it would find the flagless constant folding temporary and lose the flag. I don't think this is a reasonable way for constant folding to operate, but for now fix this by ensuring flags on the original operation are preserved in the temporary. This yields a clear code improvement for AMDGPU when f16 isn't legal. The Wasm cases switch from using a libcall to compare and select. We are evidently missing the fcmp+select to fminimum/fmaximum handling, but this would be further improved when that's handled. AArch64 also avoids the libcall, but looks worse and has a different call for some reason.

Followup to llvm#94487

…llvm#94583) When a float attribute is printed with Hex, we should not elide the type because it is parsed back as i64 otherwise.

… bugs (llvm#93055) This PR adds transpose + pack/unpack folding support for transpose ops in the form of `linalg.generic` ops. There were also some bugs with the permutation composing in the previous patterns, so this PR fixes these bugs and adds tests for them as well.

Otherwise code that depends on those targets being enabled might not get compiled correctly even if the targets are explicitly included in the configuration (in my case NVVM target for MLIR).

The size of the properties is fixed, so no need for a BitVector. Assigning small, fixed-size bitsets is faster. It's a minor performance improvement.

The pass iterates over the IR multiple times, but most code doesn't use AMX. Therefore, do a single iteration in advance to check whether a function uses AMX at all, and exit early if it doesn't. This makes the function-has-AMX path slightly more expensive, but AMX users probably care a lot less about compile time than JIT users (which tend to not use AMX). For us, it reduces the time spent in this pass from 0.62% to 0.12%. Ideally, we wouldn't even need to iterate over the function to determine that it doesn't use AMX.

2214026 didn't fix an unused variable warning correctly.

…in}f16 (llvm#94624)

…th functions (llvm#94510) llvm#93566

…m#94509) 058e445 added an XFAIL for this test on AIX because of a backend limitation. That backend limitation has been resolved by 0295c2a and will be available for clang 19, so we should update the test to limit the XFAIL to clang versions before that.

…ng dir to VFS (llvm#94461) When trying to add a file to clang's VFS via `addFile` and a directory of the same name already exists, we run into an [out-of-bound access](https://github.com/llvm/llvm-project/blob/145815c180fc82c5a55bf568d01d98d250490a55/llvm/lib/Support/Path.cpp#L244). The problem is that the file name is [recognised as an existing path](https://github.com/llvm/llvm-project/blob/145815c180fc82c5a55bf568d01d98d250490a55/llvm/lib/Support/VirtualFileSystem.cpp#L896) and thus the code continues to process the next part of the path, which doesn't exist. This patch adds a check for whether we have reached the last part of the filename and returns false in that case. Thus we refuse to add a file if a directory of the same name already exists. This is in sync with [this check](https://github.com/llvm/llvm-project/blob/145815c180fc82c5a55bf568d01d98d250490a55/llvm/lib/Support/VirtualFileSystem.cpp#L903) that rejects adding a path if a file of the same name already exists.
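The rejection rule can be modelled with a toy in-memory tree. This is a simplified sketch, not `llvm::vfs::InMemoryFileSystem`: the `Node` type and this `addFile` are hypothetical, file contents are ignored, and an already existing file of the same name is always rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One entry of the toy file system: either a directory with children
// or a file (no children).
struct Node {
  bool isDir = true;
  std::map<std::string, std::unique_ptr<Node>> children;
};

// Adds a file at `path` ('/'-separated), creating missing directories.
// Returns false if a file blocks the path or if the final component
// already exists, including as a directory of the same name.
bool addFile(Node &root, const std::string &path) {
  assert(root.isDir && "root must be a directory");
  std::vector<std::string> parts;
  std::istringstream in(path);
  for (std::string part; std::getline(in, part, '/');)
    if (!part.empty())
      parts.push_back(part);
  if (parts.empty())
    return false;

  Node *dir = &root;
  for (std::size_t i = 0; i < parts.size(); ++i) {
    const bool last = (i + 1 == parts.size());
    auto it = dir->children.find(parts[i]);
    if (it == dir->children.end()) {
      // Missing component: create a directory, or the file itself if last.
      auto node = std::make_unique<Node>();
      node->isDir = !last;
      Node *raw = node.get();
      dir->children.emplace(parts[i], std::move(node));
      if (last)
        return true;
      dir = raw;
      continue;
    }
    Node *existing = it->second.get();
    if (!existing->isDir)
      return false; // a file is in the way
    if (last)
      return false; // a directory of the same name already exists
    dir = existing;
  }
  return false; // not reached: the loop returns on the last component
}
```

The `last` check on an existing directory is the analogue of the check the patch adds: without it, the walk would step past the final component of the path.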

This is similar to what the current interpreter does.

LD_LIBRARY_PATH will become invalid when LIBPATH is also set on AIX. See the example below on AIX:
```
$ldd a.out
a.out needs:
         /usr/lib/libc.a(shr.o)
Cannot find libtest.a
         /unix
         /usr/lib/libcrypt.a(shr.o)
$./a.out
Could not load program ./a.out:
        Dependent module libtest.a could not be loaded.
Could not load module libtest.a.
System error: No such file or directory
$export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp
$./a.out ; echo $?
10
$export LIBPATH=./
$./a.out ; echo $?   >>>>>> Now LD_LIBRARY_PATH is not used by system loader
Could not load program ./a.out:
        Dependent module libtest.a could not be loaded.
Could not load module libtest.a.
System error: No such file or directory
```
This breaks many AIX LIT cases on our downstream buildbots which set LIBPATH.
---------
Co-authored-by: Anh Tuyen Tran <[email protected]>
Co-authored-by: David Tenty <[email protected]>

llvm#94188) This PR updates removeFnAttrFromReachable in AMDGPUMemoryUtils to accept an array of function attributes as an argument. This makes it possible to remove multiple attributes in one CallGraph walk.

…FC) (llvm#94556)

…lvm#94558)

… for class/union types" and two follow-up commits. The reason is the crash we've discovered when processing -gsimple-template-names binaries. I'm committing a minimal reproducer as a separate patch. This reverts the following commits: - 51dd4ea (llvm#92328) - 3d9d485 (llvm#93839) - afe6ab7 (llvm#94400)

…vm#94641) After 63cda2d. See also a97871e.

First, expandFMINIMUM_FMAXIMUM should be a never-fail API. The client wanted it expanded, and it can always be expanded. This logic was tied up with what the VectorLegalizer wanted. Prefer using the min/max opcodes, and unrolling if we don't have a vselect. This seems to produce better code in all the changed tests.

Fix a crash caused by calling getCookedLiteral on a template user-defined literal. The fix is based on the assert in the getCookedLiteral method. Closes llvm#94454

- added at_quick_exit function
- used helper file exit_handler which reuses code from atexit
- atexit now calls helper functions from exit_handler
- test cases and dependencies are added
---------
Co-authored-by: Aaryan Shukla <[email protected]>

This patch implements the following change to enable zos for death test support. google/googletest#4527

We only get a "reached end of constexpr function" diagnostic otherwise.

…Y)`; NFC

… of `Y` The fold replaces two uses of `Y`, so we should also perform the fold when `Y` has two uses (not only one use). Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D159062

…ntion (llvm#94318) By convention, we name the intrinsics by replacing "." with "_" in the instruction name, so `vcpopv_v`, whose corresponding instruction is `vcpop.v`, should be named `vcpop_v`.

…lvm#94415) The ContextIds set on the ContextNode struct is not technically needed as we can compute it from either the callee or caller edge context ids. Remove it and add a helper to recompute from the edges on demand. Also add helpers to compute the node allocation type and whether the context ids are empty from the edges without needing to first compute the node's context id set, to minimize the runtime cost increase. This yielded a 20% reduction in peak memory for a large thin link, for about a 2% time increase (which is more than offset by some other recent time efficiency improvements).

Derived from llvm#92480. This PR introduces reduction semantics into loops for DO CONCURRENT REDUCE. The `fir.do_loop` operation now invisibly has the `operandSegmentsizes` attribute and takes variable-length reduction operands with their operations given as `fir.reduce_attr`. For the sake of compatibility, `fir.do_loop`'s builder has additional arguments at the end. The `iter_args` operand should be placed in front of the declaration of result types, so the new operand for reduction variables (`reduce`) is put in the middle of arguments.

This adds: - A ctor accepting a start and end iterator - A ctor accepting a count and const T& - size() - subscript operators - begin() and end() iterators

…alizer (llvm#94511) This code uses namespaces `llvm` and `llvm::json`. However, we have both `llvm::Value` and `llvm::json::Value`. Whenever any of the headers declare or include `llvm::Value`, the lookup becomes ambiguous. Fixing this by qualifying the `Value` type.
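The ambiguity is easy to reproduce outside LLVM. In this minimal sketch the namespaces `outer` and `outer::json` are hypothetical stand-ins for `llvm` and `llvm::json`, and the two `Value` types are unrelated dummies.

```cpp
#include <cassert>
#include <string>

namespace outer {
struct Value {
  int id = 1;
};
namespace json {
struct Value {
  std::string text = "json";
};
} // namespace json
} // namespace outer

// Both using-directives are in effect, as in the code described above.
using namespace outer;
using namespace outer::json;

// Value v;  // would not compile: reference to 'Value' is ambiguous

// Qualifying the name selects the intended type.
std::string describeJsonValue() {
  outer::json::Value v;
  return v.text;
}

int plainValueId() {
  outer::Value v;
  assert(v.id == 1);
  return v.id;
}
```

The unqualified name only breaks once both declarations are visible, which is why the problem depends on which headers happen to be included.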

Changing the type of Frame::SymbolName from std::optional<std::string> to std::unique_ptr<std::string> reduces sizeof(Frame) from 64 to 32. The smaller type reduces the cycle and instruction counts by 23% and 4.4%, respectively, with "llvm-profdata show" modified to deserialize all MemProfRecords in a MemProf V2 profile. The peak memory usage is cut down nearly by half.
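The size effect can be checked with two stand-in structs. The field lists below are illustrative assumptions rather than the exact `Frame` definition, and the concrete sizes depend on the standard library and ABI, so the sketch only asserts the relative ordering.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <string>

// Stand-in for a frame that stores its optional name inline.
struct FrameWithOptional {
  uint64_t Function;
  std::optional<std::string> SymbolName; // string storage plus an engaged flag
  uint32_t LineOffset;
  uint32_t Column;
  bool IsInlineFrame;
};

// Stand-in for a frame that stores only a pointer to an out-of-line name.
struct FrameWithUniquePtr {
  uint64_t Function;
  std::unique_ptr<std::string> SymbolName; // a single pointer, null if absent
  uint32_t LineOffset;
  uint32_t Column;
  bool IsInlineFrame;
};

static_assert(sizeof(FrameWithUniquePtr) < sizeof(FrameWithOptional),
              "the pointer-based frame should be smaller");

// A null pointer plays the role of an empty optional.
bool hasSymbolName(const FrameWithUniquePtr &F) {
  return F.SymbolName != nullptr;
}
```

The trade-off is one extra heap allocation for frames that do carry a name, in exchange for a smaller footprint for every frame that does not.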

https://reviews.llvm.org/D85867 changed the way we assign file offsets (alloc sections first, then non-alloc sections). It also removed a non-alloc special case from `findOrphanPos`. Looking at the memory-nonalloc-no-warn.test change, which would be needed by llvm#93761, it makes sense to restore the previous behavior: when placing non-alloc orphan sections, keep these sections at the end so that the section index order matches the file offset order. This change is cosmetic. In sections-nonalloc.s, GNU ld places the orphan `other3` in the middle and the orphan .symtab/.shstrtab/.strtab at the end. Pull Request: llvm#94519

) This PR removes the `target-aarch64` requirement on the crashlog tests to exercise them on Intel bots and makes image loading single-threaded temporarily while implementing a fix for a deadlock issue when loading the images in parallel. Signed-off-by: Med Ismail Bennani <[email protected]>

Summary: We don't have the abs function to link against, just use the builtin.

Remove `REQUIRES: shell` from some tests that seem fine without it. Tested on Windows and with LIT_USE_INTERNAL_SHELL=1 on Linux.

Relaxes restriction that certain public utility functions only apply to the builtin ModuleOp.

The summary already includes other size information, e.g. total debug info size in bytes. The only other way I can get this information is by dumping all statistics which can be quite large. Adding it to the summary seems fair.

…lvm#94667) Summary: The old COV3 implementation of HSA used to omit the implicit arguments from the kernel argument size. For COV4 and COV5 this is no longer the case so we can simply use the size reported from the symbol information. See ROCm/ROCR-Runtime#117 (comment)

getVNInfoFromReg is expected to return a nullptr if-and-only-if the operand is undef. (This was asserted for.) Reverse the order of the checks to simplify an upcoming set of patches.

This patch removes swapToHostOrder in favor of llvm::support::endian::readNext as swapToHostOrder is too thin a wrapper around readNext. Note that there are two variants of readNext: - readNext<type, endian, align>(ptr) - readNext<type, align>(ptr, endian) swapToHostOrder uses the former, but this patch switches to the latter. While we are at it, this patch teaches readNext to default to unaligned just as I did in: commit 568368a Author: Kazu Hirata <[email protected]> Date: Mon Apr 15 19:05:30 2024 -0700

Fixed for more accurate searches of the flag `-Wsystem-headers-in-module=`.

…(NFC) (llvm#94432) This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>. llvm::SmallVector<Frame> sets aside one inline element. Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element. Using std::vector<Frame> reduces the cycle and instruction counts by 11% and 22%, respectively, with "llvm-profdata show" modified to deserialize all MemProfRecords.

- addressed llvm#94317 (comment) - added conditional in cmake file for exit_handler object library Co-authored-by: Aaryan Shukla <[email protected]>

These CHECKs are all checking indices, which must be strictly smaller than the size (otherwise they would go out of bounds).

llvm#93234) …ty with high-entropy ASLR With high-entropy ASLR (e.g., 32-bits == 16TB), the allocator base of 0x700000000000 (112TB) may collide with the placement of the libraries (e.g., on Linux, the mmap base could be 128TB - 16TB == 112TB). This results in a segfault in the test case. This patch moves the allocator base below the PIE program segment, inspired by fb77ca0. As per that patch: 1) we are leaving the old behavior for Apple 2) since ASLR cannot be set above 32-bits for x86-64 Linux, we expect this new layout to be durable. Note that this is only changing a test case, not the behavior of sanitizers. Sanitizers have their own settings for initializing the allocator base. Reproducer: 1. ninja check-sanitizer # Just to build the test binary needed below; no need to actually run the tests here 2. sudo sysctl vm.mmap_rnd_bits=32 # Increase ASLR entropy 3. for f in `seq 1 10000`; do echo $f; GTEST_FILTER=*SizeClassAllocator64Dense ./projects/compiler-rt/lib/sanitizer_common/tests/Sanitizer-x86_64-Test > /tmp/x; if [ $? -ne 0 ]; then cat /tmp/x; fi; done
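The address arithmetic quoted above can be checked at compile time. This sketch assumes 4 KiB pages and a 128 TiB top of the mmap range, as in the Linux example in the text; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kTiB = 1ULL << 40;
constexpr std::uint64_t kPageSize = 4096; // assumption: 4 KiB pages

// Address span covered by `rndBits` bits of page-granular randomization.
constexpr std::uint64_t aslrSpan(unsigned rndBits) {
  return (1ULL << rndBits) * kPageSize;
}

// Lowest possible mmap base, assuming the random offset is subtracted
// from a 128 TiB top.
constexpr std::uint64_t lowestMmapBase(unsigned rndBits) {
  return 128 * kTiB - aslrSpan(rndBits);
}

static_assert(aslrSpan(32) == 16 * kTiB);
static_assert(lowestMmapBase(32) == 0x700000000000ULL);

// True if a region [base, base + size) reaches into the range where
// libraries may be mapped.
inline bool mayCollide(std::uint64_t base, std::uint64_t size, unsigned rndBits) {
  assert(size > 0);
  return base + size > lowestMmapBase(rndBits);
}
```

With 32 bits of entropy, any region starting at 0x700000000000 overlaps the possible mmap range, while a base well below it does not.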

…lvm#84441) This function is called during very early startup and which can result in a crash on FreeBSD. The sigaction() function in libc is indirected via a table so that it can be interposed by the threading library rather than calling the syscall directly. In the crash I was observing this table had not yet been relocated, so we ended up jumping to an invalid address. To avoid this problem we can call __sys_sigaction, which calls the syscall directly and in FreeBSD 15 is part of libsys rather than libc, so does not depend on libc being fully initialized.

This patch fixes a build issue following e57308b when enabling module build. With that change, we failed to build the LLVM_IR module since GEPNoWrapFlags wasn't defined prior to using it. This patch addresses that issue by including the missing header in `llvm/IR/IRBuilderFolder.h`, which uses the `GEPNoWrapFlags` type. This should ensure that we can always build the `LLVM_IR` module. Signed-off-by: Med Ismail Bennani <[email protected]>

Call stacks are a huge portion of the MemProf profile, taking up 70+% of the profile file size. This patch implements a radix tree to compress call stacks, which are known to have long common prefixes. Specifically, CallStackRadixTreeBuilder, introduced in this patch, takes call stacks in the MemProf profile, sorts them in the dictionary order to maximize the common prefix between adjacent call stacks, and then encodes a radix tree into a single array that is ready for serialization. The resulting radix array is essentially a concatenation of call stack arrays, each encoded with its length followed by the payload, except that these arrays contain "instructions" like "skip 7 elements forward" to borrow common prefixes from other call stacks. This patch does not integrate with the MemProf serialization/deserialization infrastructure yet. Once integrated, the radix tree is expected to roughly halve the file size of the MemProf profile.
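To see why dictionary-order sorting helps, one can count how many leading frames each call stack shares with its predecessor after sorting. The sketch below only measures that sharing; it does not reproduce the radix array encoding, and `FrameId`, `CallStack` and both helper functions are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using FrameId = std::uint64_t;
using CallStack = std::vector<FrameId>;

// Number of leading frames that two call stacks have in common.
inline std::size_t commonPrefixLength(const CallStack &a, const CallStack &b) {
  auto firstDiff = std::mismatch(a.begin(), a.end(), b.begin(), b.end()).first;
  return static_cast<std::size_t>(firstDiff - a.begin());
}

// Sorts the stacks in dictionary order and sums the frames that each stack
// could borrow from the stack immediately before it.
inline std::size_t sharedFramesAfterSorting(std::vector<CallStack> stacks) {
  std::sort(stacks.begin(), stacks.end()); // vector's operator< is lexicographic
  std::size_t shared = 0;
  for (std::size_t i = 1; i < stacks.size(); ++i)
    shared += commonPrefixLength(stacks[i - 1], stacks[i]);
  return shared;
}
```

For the stacks `{1,2,3}`, `{9}`, `{1,5}`, `{1,2,4}`, sorting places the three stacks that start with frame 1 next to each other, so three frames in total can be shared.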

…lude-cycle (llvm#94636) Fixes: llvm#94634

…94561) Fixes llvm#94555.

This test case shows a limitation of DFSan's sscanf implementation (introduced in https://reviews.llvm.org/D153775): it simply ignores ordinary characters in the format string, instead of actually comparing them against the input. This may change the semantics of instrumented programs. Importantly, this also means that DFSan's release_shadow_space.c test, which relies on sscanf to scrape the RSS from /proc/maps output, will incorrectly match lines that don't contain RSS information. As a result, it adds together numbers from irrelevant output (e.g., base addresses), resulting in test flakiness (llvm#91287).

Make it easier to add CREL support.

…nstants We don't need the `noundef` check if the new simplification is a constant. This cleans up regressions from folding multiuse: `(icmp eq/ne (sub/xor x, y), 0)` -> `(icmp eq/ne x, y)`. Closes llvm#88298
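The fold mentioned above relies on the fact that, with wrapping arithmetic, `x - y` and `x ^ y` are zero exactly when `x == y`. A small sketch that checks this on sample values (plain C++ on `uint32_t`, not LLVM IR):

```cpp
#include <cstdint>

// Models `icmp eq (sub x, y), 0` with wrapping unsigned arithmetic.
constexpr bool subIsZero(std::uint32_t x, std::uint32_t y) {
  return (x - y) == 0u;
}

// Models `icmp eq (xor x, y), 0`.
constexpr bool xorIsZero(std::uint32_t x, std::uint32_t y) {
  return (x ^ y) == 0u;
}

// Compares both forms against the direct comparison on a grid of samples,
// including values around the sign and wraparound boundaries.
inline bool foldsAgreeOnSamples() {
  const std::uint32_t samples[] = {0u, 1u, 2u, 0x7fffffffu, 0x80000000u, 0xffffffffu};
  for (std::uint32_t x : samples)
    for (std::uint32_t y : samples)
      if (subIsZero(x, y) != (x == y) || xorIsZero(x, y) != (x == y))
        return false;
  return true;
}
```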

…eatures vector. (llvm#94660) Instead of having multiple places insert into the Features vector independently, check all the conditions in one place. This avoids a subtle ordering requirement that -mstrict-align processing had to be done after the others.

Short-circuit the parsing of tok::colon to label colons found within lines starting with asm as InlineASMColon. Fixes llvm#92616. --------- Co-authored-by: Owen Pan <[email protected]>

Following of llvm#86912 The motivation of the patch series is that, for a module interface unit `X`, when the dependent modules of `X` changes, if the changes is not relevant with `X`, we hope the BMI of `X` won't change. For the specific patch, we hope if the changes was about irrelevant declaration changes, we hope the BMI of `X` won't change. **However**, I found the patch itself is not very useful in practice, since the adding or removing declarations, will change the state of identifiers and types in most cases. That said, for the most simple example, ``` // partA.cppm export module m:partA; // partA.v1.cppm export module m:partA; export void a() {} // partB.cppm export module m:partB; export void b() {} // m.cppm export module m; export import :partA; export import :partB; // onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ``` the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new identifiers and types (the function prototype). So in this patch, we have to write the tests as: ``` // partA.cppm export module m:partA; export int getA() { ... } export int getA2(int) { ... } // partA.v1.cppm export module m:partA; export int getA() { ... } export int getA(int) { ... } export int getA2(int) { ... } // partB.cppm export module m:partB; export void b() {} // m.cppm export module m; export import :partA; export import :partB; // onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ``` so that the new introduced declaration `int getA(int)` doesn't introduce new identifiers and types, then the BMI of `onlyUseB` can keep unchanged. While it looks not so great, the patch should be the base of the patch to erase the transitive change for identifiers and types since I don't know how can we introduce new types and identifiers without introducing new declarations. 
Given how tightly declarations, types and identifiers are related, I think we can only reach the ideal state after we finish the series for all three entities. The design of the patch is similar to llvm#86912, which extends the 32-bit DeclID to 64 bits and uses the higher bits to store the module file index and the lower bits to store the Local Decl ID. A slight difference is that we only use 48 bits to store the new DeclID, since we try to use the higher 16 bits to store the module ID in the prefix of the Decl class. Previously, we used 32 bits to store the module ID and 32 bits to store the DeclID. I don't want to allocate additional space, so I tried to keep the total at the same 64 bits. A potentially interesting point here is the relationship between the module ID and the module file index. I feel we can get the module file index from the module ID, but I didn't prove it or implement it, since I want to keep the patch itself as small as possible. We can do that in the future if we want. Another change in the patch is the new concept of a Decl Index, which means the index into the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration was simply the Decl ID minus PREDEFINED_DECL_NUMs, so there were some places where the two were used interchangeably. This patch tries to split these two concepts. As with llvm#86912, the change will increase the on-disk PCM file sizes. As declaration IDs may be the most numerous IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change brings a 6.6% increase in the on-disk PCM size. No compile-time performance regression was observed. Given the benefits in the motivating example, I think the cost is worthwhile.
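The ID layout can be pictured as simple bit packing. The sketch below is an assumption-heavy illustration: the text only says that 48 bits are used, with the module file index in the higher bits and the local ID in the lower bits. The 16/32 split and all names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed split of the 48-bit ID; the patch text does not state the widths.
constexpr unsigned kLocalBits = 32;
constexpr unsigned kIndexBits = 16;
constexpr std::uint64_t kLocalMask = (1ULL << kLocalBits) - 1;

// Packs a module file index and a local declaration ID into one value.
inline std::uint64_t makeDeclID(std::uint64_t moduleFileIndex, std::uint64_t localID) {
  assert(moduleFileIndex < (1ULL << kIndexBits));
  assert(localID <= kLocalMask);
  return (moduleFileIndex << kLocalBits) | localID;
}

constexpr std::uint64_t moduleFileIndexOf(std::uint64_t id) { return id >> kLocalBits; }
constexpr std::uint64_t localIDOf(std::uint64_t id) { return id & kLocalMask; }

// The top 16 bits stay free, e.g. for a module ID stored alongside.
constexpr bool fitsIn48Bits(std::uint64_t id) { return (id >> 48) == 0; }
```

Because the local part is independent of the module file index, the same declaration keeps its local ID when unrelated modules change.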

This reverts commit 5c10487. The ArmV7 bot is complaining the change breaks the alignment.

Test CodeGen/AMDGPU/build_vector.ll has the lit patterns partially hand-written and the rest auto-generated. It doesn't look good when changes are required with future patches. Auto-generating the entire pattern. Moved out the R600 test into build_vector-r600.ll.

Also, converted the R600 RUN lines from some tests into standalone tests.

This pull request ports `regallocfast` to the new pass manager. It exposes the parameter `filter` to handle different register classes for AMDGPU. IIUC AMDGPU needs to allocate different register classes separately, so it needs to implement its own `--<reg-class>-regalloc`. Now users can use e.g. `-passes=regallocfast<filter=sgpr>` to allocate a specific register class. The command line option `--regalloc-npm` is still a work in progress; the plan is to reuse the syntax of passes, e.g. use `--regalloc-npm=regallocfast<filter=sgpr>,greedy<filter=vgpr>` to replace `--sgpr-regalloc` and `--vgpr-regalloc`.

The test will generate an empty `regalloc-amdgpu.s` file in test, which causes an unresolved test.

1. Merge the validity checks. 2. Use a range-based loop.

The availability attributes are stored on the function declarations. The code was looking for them in the function template declarations. This resulted in spuriously diagnosing (non-strict) availability issues in contexts that are not available. Co-authored-by: Gabor Horvath <[email protected]>

…tions (llvm#93374) For patterns where there are multiple results apart from dpsInits, this fails. E.g.: ``` %13:2 = iree_codegen.ukernel.generic "iree_uk_unpack" ins(%extracted_slice : tensor<?x1x16x16xf32>) outs(%11 : tensor<?x?xf32>) ... -> tensor<?x?xf32>, i32 ``` The above op has results apart from dpsInit and hence fails. The PR assumes that the result has dpsInits followed by nonDpsInits.

Handle lvalues pointing to declarations, unions and member pointers.

gcc patch: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1f2ca510065a2033bac408eb5a960ef0126f25cc

…#94455) The function is doing two fairly different things, depending on how it is called. While this allows for some code reuse, it also makes it hard to override it correctly. Possibly for this reason ValueObjectSynthetic overrides GetChildAtIndex instead, which forces it to reimplement some of its functionality, most notably caching of generated children. Splitting this up makes it easier to move the caching to a common place (and hopefully makes the code easier to follow in general).

…m#94557)" (llvm#94730) This reverts commit d843c02.

Since the constructor of ContextEdge takes ContextIds by value, we should move it to the corresponding member variable as suggested by clang-tidy's performance-unnecessary-value-param. While we are at it, this patch updates a couple of callers. To avoid the ambiguity in the evaluation order among the constructor arguments, I'm calling computeAllocType before calling the constructor.
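The pattern and the evaluation-order concern can be sketched as follows. `Edge`, `computeAllocType` and `makeEdge` here are simplified, hypothetical stand-ins and not the actual ContextEdge code.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical edge type with a by-value "sink" parameter.
struct Edge {
  std::uint8_t AllocTypes;
  std::set<std::uint32_t> ContextIds;

  // The caller decides whether to copy or move into `ids`; the constructor
  // then moves the parameter into the member instead of copying it again.
  Edge(std::uint8_t allocTypes, std::set<std::uint32_t> ids)
      : AllocTypes(allocTypes), ContextIds(std::move(ids)) {}
};

// Hypothetical helper that needs to read the IDs.
inline std::uint8_t computeAllocType(const std::set<std::uint32_t> &ids) {
  return ids.empty() ? 0 : 1;
}

inline Edge makeEdge(std::set<std::uint32_t> ids) {
  // Compute first. In `Edge(computeAllocType(ids), std::move(ids))` the order
  // in which the arguments are evaluated is unspecified, so the by-value
  // parameter could be move-constructed before `ids` is read.
  const std::uint8_t allocType = computeAllocType(ids);
  return Edge(allocType, std::move(ids));
}
```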

This allows the ReportError functor to hold move-only types.

…RI instructions (llvm#94552)

…rs whose return values are unused (llvm#94590) This patch adds a peephole pass `LoongArchDeadRegisterDefinitions`. It rewrites `rd` to `r0` when `rd` is marked as dead. It may improve the register allocation and reduce pipeline hazards on CPUs without register renaming and OOO.

And change the previous GetPtrField to only peek() the base pointer. We can get rid of a whole bunch of DupPtr ops this way.

In preparation for adding essentially the same visitor to StreamChecker, this patch factors this visitor out to a common header. I'll be the first to admit that the interface of these classes is not terrific, but it is rather tightly held back by its main technical debt, which is NoStoreFuncVisitor, the main descendant of NoStateChangeVisitor. Change-Id: I99d73ccd93a18dd145bbbc83afadbb432dd42b90

…ave Zvfbfmin" (llvm#94565)"

This PR fixes an incorrect line for setting scaling_governor in benchmarking tips.

It's not strictly needed and did cause some test failures.

This PR handles translation of DIStringType. Mostly mechanical changes to translate DIStringType to/from DIStringTypeAttr. The 'stringLength' field is 'DIVariable' in DIStringType. As there was no `DIVariableAttr` previously, it has been added to ease the translation. --------- Co-authored-by: Tobias Gysi <[email protected]>

Fixes llvm#94599

…lvm#94598) Use tablegen to generate the pass constructor. This pass is supposed to add function attributes so it does not need to operate on other top level operations.

As noted on llvm#94466, NEON has ABDS/ABDU instructions but only handles them via intrinsics, plus some VABDL custom patterns. This patch flags basic ABDS/ABDU for neon types as legal and updates all tablegen patterns to use abds/abdu instead. Fixes llvm#94466

This operation extracts a number of bits at a given offset and sign or zero extends them, which is done by emitting it as a left shift followed by a right shift. This is being added for use in clang for C++ structured bindings of bitfields that have offset or size that aren't a byte multiple. A new operation is being added, instead of shifts being used directly, as it makes correctly handling it in optimisations (which will be done in a later patch) much easier.
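The shift-based lowering described above can be modelled on a 64-bit integer. This is a minimal sketch in plain C++ rather than the actual operation; the function names are made up, and the signed variant relies on arithmetic right shift of negative values, which is guaranteed since C++20 and is the behaviour of mainstream compilers before that.

```cpp
#include <cassert>
#include <cstdint>

// Extracts `width` bits starting at bit `offset` (bit 0 is the least
// significant) and zero-extends the result.
inline std::uint64_t extractUnsigned(std::uint64_t value, unsigned offset, unsigned width) {
  assert(width >= 1 && offset + width <= 64);
  std::uint64_t shifted = value << (64 - offset - width); // field moves to the top
  return shifted >> (64 - width);                         // logical shift fills with zeros
}

// Same extraction, but sign-extends from the top bit of the field.
inline std::int64_t extractSigned(std::uint64_t value, unsigned offset, unsigned width) {
  assert(width >= 1 && offset + width <= 64);
  std::uint64_t shifted = value << (64 - offset - width);
  // Arithmetic shift replicates the field's sign bit into the upper bits.
  return static_cast<std::int64_t>(shifted) >> (64 - width);
}
```

For example, the 4-bit field at offset 4 of `0xF0` reads as 15 when zero-extended and as -1 when sign-extended.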

Currently, during a loop pipelining transformation, operations may be hoisted out without any checks on the loop bounds, which leads to incorrect transformations and unexpected behaviour. The following [issue](llvm#90870) describes the problem more extensively, including an example. The proposed fix adds checks on the loop bounds before hoisting and then applies the maximum hoisting.

They do not count into lambda captures, so visit them lazily.

The check lines in this test were clearly not generated by UTC.

Regenerate these with --check-globals. The manual global CHECKS get dropped during regeneration otherwise. Annoyingly UTC insists on putting the globals directly before the first function, so the first comment is a bit out of place now.

This patch implements the lowering for vector deinterleave for vector of n-dimensions. Process involves unrolling the n-d vector to a series of one-dimensional vectors. The deinterleave operation is then used on these vectors. From: ``` %0, %1 = vector.deinterleave %a : vector<2x8xi8> -> vector<2x4xi8> ``` To: ``` %cst = arith.constant dense<0> : vector<2x4xi32> %0 = vector.extract %arg0[0] : vector<8xi32> from vector<2x8xi32> %res1, %res2 = vector.deinterleave %0 : vector<8xi32> -> vector<4xi32> %1 = vector.insert %res1, %cst [0] : vector<4xi32> into vector<2x4xi32> %2 = vector.insert %res2, %cst [0] : vector<4xi32> into vector<2x4xi32> %3 = vector.extract %arg0[1] : vector<8xi32> from vector<2x8xi32> %res1_0, %res2_1 = vector.deinterleave %3 : vector<8xi32> -> vector<4xi32> %4 = vector.insert %res1_0, %1 [1] : vector<4xi32> into vector<2x4xi32> %5 = vector.insert %res2_1, %2 [1] : vector<4xi32> into vector<2x4xi32> ...etc. ```

When using the -mframe-chain=aapcs or -mframe-chain=aapcs-leaf options, we cannot use r11 as an allocatable register, even if -fomit-frame-pointer is also used. This is so that r11 will always point to a valid frame record, even if we don't create one in every function.

…#94601) Removes residual ARM handling for vXi64 ABS nodes to prevent infinite loops.

from PEP8 (https://peps.python.org/pep-0008/#programming-recommendations): > Comparisons to singletons like None should always be done with is or is not, never the equality operators. Co-authored-by: Eisuke Kawashima <[email protected]>

Cortex-R52+ is an Armv8-R AArch32 CPU. Technical Reference Manual for Cortex-R52+: https://developer.arm.com/documentation/102199/latest/

llvm#89811 caused this test to fail, somehow. I think it may not be at fault, but actually be exposing some existing undefined behaviour, see llvm#94741. Skipping this for now to get the bots green again.

This change seeks to add support for vendor flavoured SPIRV - more specifically, AMDGCN flavoured SPIRV. The aim is to generate SPIRV that carries some extra bits of information that are only usable by AMDGCN targets, forfeiting absolute genericity to obtain greater expressiveness for target features:
- AMDGCN inline ASM is allowed/supported, under the assumption that the [SPV_INTEL_inline_assembly](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_inline_assembly.asciidoc) extension is enabled/used
- AMDGCN target specific builtins are allowed/supported, under the assumption that e.g. the `--spirv-allow-unknown-intrinsics` option is enabled when using the downstream translator
- the featureset matches the union of AMDGCN targets' features
- the datalayout string is overspecified to affix both the program address space and the alloca address space, the latter under the assumption that the [SPV_INTEL_function_pointers](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_function_pointers.asciidoc) extension is enabled/used, in which case the extant SPIRV datalayout string would lead to pointers to function pointing to the private address space, which would be wrong.
Existing AMDGCN tests are extended to cover this new target. It is currently dormant / will require some additional changes, but I thought I'd rather put it up for review to get feedback as early as possible. I will note that an alternative option is to place this under AMDGPU, but that seems slightly less natural, since this is still SPIRV, albeit relaxed in terms of preconditions & constrained in terms of postconditions, and only guaranteed to be usable on AMDGCN targets (it is still possible to obtain pristine portable SPIRV through usage of the flavoured target, though).

Both `reverseBranchCondition` and `replaceBranchTarget` return a success boolean. But all-but-one caller ignores the return value, and the exception emits a fatal error on failure. Thus, just return nothing.

This "small" set grows quite large and it's more performant to store whether a node has been combined before in the node itself. As this information is only relevant for nodes that are currently not in the worklist, add a second state to the CombinerWorklistIndex (-2) to indicate that a node is currently not in a worklist, but was combined before. This brings a substantial performance improvement.
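The idea of keeping the "combined before" flag in the node itself can be sketched like this. `Node` and `Worklist` are hypothetical simplifications; the value -2 comes from the description above, while -1 for "not in the worklist, never combined" is an assumption.

```cpp
#include <vector>

constexpr int kNotInWorklist = -1;  // assumed value for the first state
constexpr int kCombinedBefore = -2; // not in the worklist, but combined before

// Hypothetical node: a non-negative index is the position in the worklist.
struct Node {
  int CombinerWorklistIndex = kNotInWorklist;
};

struct Worklist {
  std::vector<Node *> Items;

  // Queues a node unless it is already in the worklist.
  void push(Node &n) {
    if (n.CombinerWorklistIndex >= 0)
      return;
    n.CombinerWorklistIndex = static_cast<int>(Items.size());
    Items.push_back(&n);
  }

  // Pops the most recently added node and records in the node itself that
  // it has been combined, so no separate set is needed.
  Node *popAndMarkCombined() {
    if (Items.empty())
      return nullptr;
    Node *n = Items.back();
    Items.pop_back();
    n->CombinerWorklistIndex = kCombinedBefore;
    return n;
  }
};

inline bool wasCombinedBefore(const Node &n) {
  return n.CombinerWorklistIndex == kCombinedBefore;
}
```

The check becomes a single integer comparison on the node instead of a lookup in a growing set.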

They need to be fully initialized, similar to global variables.

Check this by looking at the VarDecl.

…93680) Whole quad mode requires inserting a copy of the initial EXEC mask. In a function that also uses llvm.amdgcn.init.exec, insert the COPY after initializing EXEC.

The file OMP.td is becoming tedious to update by hand due to the seemingly random ordering of various items in it. This patch brings order to it by sorting most of the contents. The clause definitions are sorted alphabetically with respect to the spelling of the clause.[1] The directive definitions are split into two groups: leaf directives and compound directives.[2] Within each, definitions are sorted alphabetically with respect to the spelling, with the exception that "end xyz" directives are placed immediately following the definition of "xyz".[3] Within each directive definition, the lists of clauses are also sorted alphabetically. [1] All spellings are made of lowercase letters, _, or space. Ordering that includes non-letters follows the order assumed by the `sort` utility. [2] Compound directives refer to the constituent leaf directives, hence the leaf definitions must come first. [3] Some of the "end xyz" directives have properties derived from the corresponding "xyz" directive. This exception guarantees that "xyz" precedes the "end xyz".

…lvm#94195) Extends delayed privatization support to `target .. private(..)`. With this PR, `private` is supported for `target` **only** in delayed privatization mode.

Summary: The NVPTX build wasn't getting the `C++20` standard necessary for a few files.

This commit adds support for lowering `tensor.unpack` with a non-identity `outer_dims_perm`. This was previously left as a not-yet-implemented case.

This PR adds fusion by collapsing and fusion by expansion patterns for `tensor.pad` ops in ElementwiseOpFusion. Pad ops can be expanded or collapsed as long as none of the padded dimensions will be expanded or collapsed.

…m#94631) After the `output_shape` field was added to `expand_shape` ops, dynamically sized expand shapes are now possible, but this was not accounted for in the folder. This PR tightens the constraints of the folder to fix this.

Change the target triple to remove some unnecessary instructions.

This change is an implementation of the llvm#87367 investigation into supporting IEEE math operations as intrinsics, which was discussed in this RFC: https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294 This PR is just for Tan. Now that the x86 tan backend has landed (llvm#90503), we can add other backends since the shared pieces are in tree now. Changes:
- `llvm/include/llvm/Analysis/VecFuncs.def` - vectorization of tan for arm64 backends.
- `llvm/lib/Target/AArch64/AArch64FastISel.cpp` - Add tan to the libcall table
- `llvm/lib/Target/AArch64/AArch64ISelLowering.cpp` - Add tan expansion for f128, f16, and vector/neon operations
- `llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp` - define `G_FTAN` as a legal arm64 instruction
Resolves llvm#94755

Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native` among other utilities in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which will lead to indefinite hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities before it kills the job and errors out.

@CharKeaney

All post-Increment load/store, register-register load/store spec: https://github.com/openhwgroup/cv32e40p/blob/master/docs/source/instruction_set_extensions.rst Contributors: @CharKeaney, @jeremybennett, @lewis-revill, @NandniJamnadas, @PaoloS02, @serkm, @simonpcook, @xingmingjie, @realqhc

This PR depends on llvm#90260 We changed the order in which functions are outlined in Machine Outliner. The formula for priority is found via a black-box Bayesian optimization toolbox. Using this formula for sorting consistently reduces the uncompressed size of large real-world mobile apps. We also ran a few benchmarks using LLVM test suites, and showed that sorting by priority consistently reduces the text segment size.
| run (CTMark/)    | baseline (1) | priority (2) | diff (1 -> 2) |
|------------------|--------------|--------------|---------------|
| lencod           | 349624       | 349264       | -0.1030%      |
| SPASS            | 219672       | 219480       | -0.0874%      |
| kc               | 271956       | 251200       | -7.6321%      |
| sqlite3          | 223920       | 223708       | -0.0947%      |
| 7zip-benchmark   | 405364       | 402624       | -0.6759%      |
| bullet           | 139820       | 139500       | -0.2289%      |
| consumer-typeset | 295684       | 290196       | -1.8560%      |
| pairlocalalign   | 72236        | 72092        | -0.1993%      |
| tramp3d-v4       | 189572       | 189292       | -0.1477%      |
This is part of an enhanced version of machine outliner -- see [RFC](https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-1-fulllto-part-2-thinlto-nolto-to-come/78732).
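The diff column appears to be the relative change of the text segment size from the baseline to the priority-sorted build. Assuming that reading, a one-line helper (the name `percentDiff` is made up here) reproduces the figures:

```cpp
#include <cassert>

// Relative size change in percent; negative values mean the priority-sorted
// build is smaller than the baseline.
inline double percentDiff(double baseline, double priority) {
  assert(baseline > 0);
  return (priority - baseline) / baseline * 100.0;
}
```

For the `kc` row, `percentDiff(271956, 251200)` comes out at roughly -7.63, which matches the table.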

Parameter "Version" is confusing in deserializeV012 and deserializeV3 because we also have member variable "Version". Fortunately, parameter "Version" and member variable "Version" always have the same value because IndexedMemProfReader::deserialize initializes the member variable and passes it to deserializeV012 and deserializeV3. This patch removes the parameter.

This patch integrates CallStackRadixTreeBuilder into the V3 format, reducing the profile size to about 27% of the V2 profile size. - Serialization: writeMemProfCallStackArray just needs to write out the radix tree array prepared by CallStackRadixTreeBuilder. Mappings from CallStackIds to LinearCallStackIds are moved by new function CallStackRadixTreeBuilder::takeCallStackPos. - Deserialization: Deserializing a call stack is the same as deserializing an array encoded in the obvious manner -- the length followed by the payload, except that we need to follow a pointer to the parent to take advantage of common prefixes once in a while. This patch teaches LinearCallStackIdConverter how to handle those pointers.

The "Emulated" sub-directories under "ArmSVE" and "ArmSME" have been removed. Associated tests have been moved up a directory and now include the "REQUIRES" constraint for the arm-emulator.

Allow KnownBits to represent "always poison" values via conflict. close: llvm#94436

…#94646) These tests pass on 64-bit. They were fixed by 5fdd094 on 32-bit. So XFAIL only for 32-bit before clang 19.

If we are extracting the even lanes and the odd lanes and adding them, we can use an addp instruction.
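The equivalence behind this can be shown with a scalar model. The sketch below uses `std::vector<int>` in place of vector registers and covers only the single-input pairing; it is not a full description of the `addp` instruction, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Extracts the even lanes and the odd lanes, then adds them lane by lane.
inline std::vector<int> addEvenAndOddLanes(const std::vector<int> &v) {
  assert(v.size() % 2 == 0);
  std::vector<int> even, odd;
  for (std::size_t i = 0; i < v.size(); i += 2) {
    even.push_back(v[i]);
    odd.push_back(v[i + 1]);
  }
  std::vector<int> sum(even.size());
  for (std::size_t i = 0; i < sum.size(); ++i)
    sum[i] = even[i] + odd[i];
  return sum;
}

// Pairwise add: each output lane is the sum of two adjacent input lanes.
inline std::vector<int> pairwiseAdd(const std::vector<int> &v) {
  assert(v.size() % 2 == 0);
  std::vector<int> sum;
  for (std::size_t i = 0; i < v.size(); i += 2)
    sum.push_back(v[i] + v[i + 1]);
  return sum;
}
```

Both functions produce the same result for any even-length input, which is what lets the two extracts and the add collapse into one pairwise operation.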

llvm#94550) For regex patterns that produce zero-length matches, there is one (imaginary) match in-between every character in the sequence being searched (as well as before the first character and after the last character). It's easiest to demonstrate using replacement: `std::regex_replace("abc"s, std::regex(""), "!")` should produce `!a!b!c!`, where each exclamation mark makes a zero-length match visible. Currently our implementation doesn't correctly set the prefix of each zero-length match, "swallowing" the characters separating the imaginary matches -- e.g. when going through zero-length matches within `abc`, the corresponding prefixes should be `{'', 'a', 'b', 'c'}`, but before this patch they will all be empty (`{'', '', '', ''}`). This happens in the implementation of `regex_iterator::operator++`. Note that the Standard spells out quite explicitly that the prefix might need to be adjusted when dealing with zero-length matches in [`re.regiter.incr`](http://eel.is/c++draft/re.regiter.incr):
> In all cases in which the call to `regex_search` returns `true`, `match.prefix().first` shall be equal to the previous value of `match[0].second`... It is unspecified how the implementation makes these adjustments.
[Reproduction example](https://godbolt.org/z/8ve6G3dav)
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string str = "abc";
  std::regex empty_matching_pattern("");

  { // The underlying problem is that `regex_iterator::operator++` doesn't update
    // the prefix correctly.
    std::sregex_iterator i(str.begin(), str.end(), empty_matching_pattern), e;
    std::cout << "\"";
    for (; i != e; ++i) {
      const std::ssub_match& prefix = i->prefix();
      std::cout << prefix.str();
    }
    std::cout << "\"\n";
    // Before the patch: ""
    // After the patch: "abc"
  }

  { // `regex_replace` makes the problem very visible.
    std::string replaced = std::regex_replace(str, empty_matching_pattern, "!");
    std::cout << "\"" << replaced << "\"\n";
    // Before the patch: "!!!!"
    // After the patch: "!a!b!c!"
  }
}
```
Fixes llvm#64451 rdar://119912002

Re-apply llvm#87550 with fixes. Details: Some tests on Fuchsia failed because of the newly added assertion. This was because `GetExceptionBreakpoint()` could be called before `g_dap.debugger` was initialized. The fix is to lazily populate the list in `GetExceptionBreakpoint()` rather than assuming it has already been initialized. (There is a complication here: we can't simply populate it in `DAP::DAP()`, which is a global ctor and runs before `SBDebugger::Initialize()` is called.)
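The lazy-population approach can be sketched roughly as follows. This is a minimal illustration, not the actual lldb-dap code: the `Session` class, the `PopulateIfNeeded` helper, the `populate_count` counter, and the filter names are all hypothetical. The point is only that the constructor does no work that depends on later initialization, and the list is built on first lookup.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a breakpoint record; not the real lldb-dap type.
struct ExceptionBreakpoint {
  std::string filter;
};

// Hypothetical session object. Its constructor is trivial, so creating it as
// a global before any debugger setup has run is harmless.
class Session {
public:
  // Look up a breakpoint by filter name, building the list on first use.
  ExceptionBreakpoint *GetExceptionBreakpoint(const std::string &filter) {
    PopulateIfNeeded();
    for (ExceptionBreakpoint &bp : *breakpoints_)
      if (bp.filter == filter)
        return &bp;
    return nullptr;
  }

  // Counts how many times the list was built (for illustration only).
  int populate_count = 0;

private:
  void PopulateIfNeeded() {
    if (breakpoints_)
      return; // already populated by an earlier call
    breakpoints_.emplace();
    // In real code this is where state that is only valid after
    // initialization would be consulted. The entries here are made up.
    breakpoints_->push_back(ExceptionBreakpoint{"cpp_throw"});
    breakpoints_->push_back(ExceptionBreakpoint{"cpp_catch"});
    ++populate_count;
  }

  std::optional<std::vector<ExceptionBreakpoint>> breakpoints_;
};
```

Repeated lookups reuse the list built by the first call, so callers never need to know whether initialization has already happened.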

This patch reverts 9b832b7 (llvm#87111):

- [libc++] Deprecated `shared_ptr` Atomic Access APIs as per P0718R2
- [libc++] Implemented P2869R3: Remove Deprecated `shared_ptr` Atomic Access APIs from C++26

As explained in [1], the suggested replacement in P2869R3 is `__cpp_lib_atomic_shared_ptr`, which libc++ does not yet implement. Let's not deprecate the old way of doing things before the new way of doing things exists.

[1]: llvm#87111 (comment)

…rep expression (and remove an unused argument)

Add SHAPE runtime API (will be used for assumed-rank; lowering generates the other cases inline). I tried to design it so that no dynamic allocation happens in the runtime and no deallocation needs to be inserted by inline code for arrays that we know are small (lowering will always stack-allocate a rank-15 array to avoid dynamic stack allocation or heap allocation).
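A minimal sketch of the allocation-free design, under stated assumptions: the `Descriptor` struct, the `Shape` signature, and `kMaxRank` are hypothetical names invented for illustration and are not the actual Flang runtime API. The caller provides fixed-size storage for the maximum rank of 15, and the callee only writes into it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Maximum array rank assumed for this sketch, matching the rank 15 buffer
// mentioned in the commit message.
constexpr std::size_t kMaxRank = 15;

// Hypothetical, simplified array descriptor; not the real runtime type.
struct Descriptor {
  int rank;
  std::array<std::int64_t, kMaxRank> extents;
};

// Hypothetical SHAPE-like entry point: copies the extents of `desc` into
// caller-provided storage and returns the rank. It never allocates, so the
// caller has nothing to free afterwards.
int Shape(const Descriptor &desc,
          std::array<std::int64_t, kMaxRank> &result) {
  for (int i = 0; i < desc.rank; ++i)
    result[static_cast<std::size_t>(i)] =
        desc.extents[static_cast<std::size_t>(i)];
  return desc.rank;
}
```

Because the result buffer is always sized for the largest possible rank, the caller can place it on the stack without knowing the actual rank in advance.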

…lag (llvm#94749)

) Summary: AMDGPU supports a `target-id` feature which is used to qualify targets with different incompatible features. These are both rules and target features. Currently, we pass `-target-cpu` twice when offloading to OpenMP, and do not pass the target-id features at all. The effect was that passing something like `--offload-arch=gfx90a:xnack+` would show up as `-target-cpu=gfx90a:xnack+ -target-cpu=gfx90a`, so the xnack setting was ignored completely and the CPU was passed twice. This patch fixes that to pass the CPU once and to separate out the features the way HIP does.
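To make the separation concrete, here is a small sketch of splitting a target-id string into a CPU name and a list of feature strings. The `TargetID` struct and `splitTargetID` function are hypothetical, and the `+name` / `-name` output spelling is an assumption about how the features are forwarded; none of this is taken from the Clang driver source.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: the bare processor name plus feature strings.
struct TargetID {
  std::string cpu;
  std::vector<std::string> features;
};

// Splits "gfx90a:xnack+" into cpu "gfx90a" and features {"+xnack"}.
// Components that do not end in '+' or '-' are ignored in this sketch.
TargetID splitTargetID(const std::string &id) {
  TargetID out;
  std::size_t pos = id.find(':');
  out.cpu = id.substr(0, pos); // whole string when there is no ':'
  while (pos != std::string::npos) {
    const std::size_t next = id.find(':', pos + 1);
    const std::size_t len =
        next == std::string::npos ? std::string::npos : next - pos - 1;
    const std::string f = id.substr(pos + 1, len);
    if (f.size() >= 2 && (f.back() == '+' || f.back() == '-'))
      out.features.push_back(f.back() + f.substr(0, f.size() - 1));
    pos = next;
  }
  return out;
}
```

With the pieces separated like this, the CPU name can be emitted once and each feature forwarded on its own, rather than passing the combined string as the CPU.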

…m#94592) As discussed in llvm#94443, this PR changes the wording to be more correct.

…lvm#94756)

Otherwise, older copies of LLD may not understand the latest bitcode versions (for example, if we increase `ModuleSummaryIndex::BitCodeSummaryVersion`). Related to llvm#90692 (comment)

…lvm#94538) It also moves the test near other similar test cases.

Commits on Jun 7, 2024