@stevesuzuki-arm stevesuzuki-arm commented Dec 11, 2025

By design, LLVM's shufflevector does not accept arbitrary shuffle masks on scalable vectors,
so we use the llvm.vector.* intrinsics where possible.
However, those are not enough to cover the wide range of shuffles used in Halide.
To handle arbitrary index patterns, we decompose a shuffle operation
into a sequence of native shuffles, which are lowered to
the Arm SVE2 TBL or TBL2 intrinsics.
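
To illustrate the idea of decomposing one wide shuffle into native-width table lookups, here is a minimal scalar sketch. It is not the code in this PR: `emulated_tbl` and `shuffle_decomposed` are hypothetical names, and the strategy shown (one single-register lookup per source slice, with the partial results ORed together) is only one possible decomposition. It relies on the TBL behaviour that an out-of-range index produces zero in that lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates one native table lookup over a single register.
// As with SVE TBL, an out-of-range index yields 0 in that lane.
// (Hypothetical helper for illustration only.)
std::vector<uint8_t> emulated_tbl(const std::vector<uint8_t> &table,
                                  const std::vector<int> &indices) {
    std::vector<uint8_t> out(indices.size(), 0);
    for (size_t i = 0; i < indices.size(); i++) {
        if (indices[i] >= 0 && indices[i] < static_cast<int>(table.size())) {
            out[i] = table[indices[i]];
        }
    }
    return out;
}

// Shuffles `src` by `indices` using only lookups of `native_lanes` lanes.
// Assumes both sizes are multiples of `native_lanes`.
std::vector<uint8_t> shuffle_decomposed(const std::vector<uint8_t> &src,
                                        const std::vector<int> &indices,
                                        size_t native_lanes) {
    assert(native_lanes > 0);
    assert(src.size() % native_lanes == 0);
    assert(indices.size() % native_lanes == 0);

    std::vector<uint8_t> result(indices.size(), 0);
    // One pass per native-width slice of the destination.
    for (size_t dst = 0; dst < indices.size(); dst += native_lanes) {
        // Look the destination slice up in every native-width slice of the source.
        for (size_t base = 0; base < src.size(); base += native_lanes) {
            std::vector<uint8_t> table(src.begin() + base,
                                       src.begin() + base + native_lanes);
            // Rebase the indices so they are relative to this source slice.
            // Indices that point into other slices fall out of range.
            std::vector<int> local(native_lanes);
            for (size_t i = 0; i < native_lanes; i++) {
                local[i] = indices[dst + i] - static_cast<int>(base);
            }
            std::vector<uint8_t> part = emulated_tbl(table, local);
            // Out-of-range lanes are 0, so OR merges the partial results.
            for (size_t i = 0; i < native_lanes; i++) {
                result[dst + i] |= part[i];
            }
        }
    }
    return result;
}
```

For example, shuffling the 8-lane vector {10, ..., 17} by the indices {7, 0, 5, 2, 3, 3, 6, 1} with 4 native lanes takes two lookups per destination slice. A two-register lookup such as TBL2 would halve the number of lookups in this scheme.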

Another approach would be to perform the shuffle on fixed-size vectors
by converting between scalable and fixed vectors.
However, that conversion seems to be possible only via a load/store through memory,
which would presumably perform poorly.

This change also includes:

  • Peephole-optimize a particular predicate pattern to emit the WHILELT instruction
  • Shuffle scalable vectors of 1-bit elements as 8-bit vectors, with type casts
  • Peephole-optimize concat_vectors used for padding to round a vector up to an aligned size
  • Fix a redundant broadcast in CodeGen_LLVM
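
As a rough illustration of the WHILELT item above: the description does not spell out which predicate pattern is matched, so the sketch below assumes it is a lane-wise comparison of the form "ramp(base, 1, lanes) < broadcast(limit)". Both function names are hypothetical. The sketch only shows why such a comparison can be replaced by a single WHILELT-style predicate, in which lane i is active while base + i < limit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lane-wise form of the assumed pattern: (base + i) < limit for every lane i.
// (Hypothetical helper for illustration only.)
std::vector<bool> ramp_less_than(int64_t base, int64_t limit, int lanes) {
    std::vector<bool> pred(lanes, false);
    for (int i = 0; i < lanes; i++) {
        pred[i] = (base + i) < limit;
    }
    return pred;
}

// Scalar emulation of a WHILELT-style predicate: lanes are activated in order
// and generation stops at the first lane where the comparison fails.
std::vector<bool> whilelt_predicate(int64_t base, int64_t limit, int lanes) {
    std::vector<bool> pred(lanes, false);
    for (int i = 0; i < lanes; i++) {
        if (!((base + i) < limit)) {
            break;  // all remaining lanes stay inactive
        }
        pred[i] = true;
    }
    return pred;
}
```

Because base + i only increases with i (ignoring overflow), once the comparison fails it fails for every later lane, so the two functions agree and the whole comparison can be expressed with one predicate-generating instruction.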

@stevesuzuki-arm
Contributor Author

With this PR and #8888, the Halide tests pass on a host machine with 128-bit SVE2 vectors. I confirmed this by running ctest --exclude-regex 'interpolate|lens_blur|unsharp|tutorial_lesson_12' --label-regex 'internal|correctness|generator|error|warning|tutorial|python' --build-config Release

@stevesuzuki-arm
Contributor Author

The CI test failure below is a known issue which should be fixed by #8888.

st2w_int32_x8                   (arm-64-linux-no_neon-sve2-vector_bits_256)
StartAssertion failed: (!isScalable() || isZero()) && "Request for a fixed element count on a scalable object", file C:\build_bot\worker\llvm-main-x86-32-windows\llvm-project\llvm\include\llvm/Support/TypeSize.h, line 202

I will rebase once #8888 is merged.

In theory, these changes are common to LLVM and not Arm-specific,
but for now they are kept Arm-only to avoid any effect on
other targets.
The workaround of checking wide_enough in get_vector_type() was
causing FixedVector and ScalableVector to be mixed
when generating an intrinsic instruction in SVE2 codegen.
With this change, we select a scalable vector in most cases.

Note that the workaround for the vscale > 1 case will be addressed in
a separate commit.
Modify the codegen of vector broadcast in SVE2 to emit
the Arm TBL intrinsic instead of llvm.vector.insert.
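
As a hedged sketch of how a table lookup can express a broadcast: the commit message does not describe the index pattern used, so the code below assumes the source pattern sits in the low lanes of a register and is repeated with the indices i % pattern_lanes. `tbl_lookup` and `broadcast_via_tbl` are hypothetical names, and the lookup is a scalar emulation rather than the real intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar emulation of a single-register table lookup.
// An out-of-range index yields 0, as with SVE TBL.
// (Hypothetical helper for illustration only.)
std::vector<uint8_t> tbl_lookup(const std::vector<uint8_t> &table,
                                const std::vector<size_t> &indices) {
    std::vector<uint8_t> out(indices.size(), 0);
    for (size_t i = 0; i < indices.size(); i++) {
        if (indices[i] < table.size()) {
            out[i] = table[indices[i]];
        }
    }
    return out;
}

// Repeats the first `pattern_lanes` lanes of `reg` across the whole register
// with one lookup. pattern_lanes == 1 is a scalar broadcast.
std::vector<uint8_t> broadcast_via_tbl(const std::vector<uint8_t> &reg,
                                       size_t pattern_lanes) {
    assert(pattern_lanes > 0 && pattern_lanes <= reg.size());
    std::vector<size_t> indices(reg.size());
    for (size_t i = 0; i < reg.size(); i++) {
        indices[i] = i % pattern_lanes;  // wrap around the source pattern
    }
    return tbl_lookup(reg, indices);
}
```

With an 8-lane register holding {1, 2, 0, 0, 0, 0, 0, 0} and pattern_lanes set to 2, the lookup repeats the pair {1, 2} four times in a single step.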

Fix performance test failure of nested_vectorization_gemm