
Conversation

@shubhamvishu (Contributor) commented Dec 17, 2025

Description

This is a continuation of the work @goankur started in #13572. Quoting the comment below from the initial PR:

Here at Amazon (customer-facing product search), we’ve been testing this native dot product implementation in our production environment (ARM: Graviton 2 and 3). We see 5-14x faster dot product computations in JMH benchmarks, and we observed semantic search latency improving from 62 msec to 28 msec (avg) for 4K embeddings (4.5 MM). Overall we saw a 10-60% improvement in end-to-end average search latencies across different scenarios (different vector sizes, vector-focused search vs. search combined with other workloads). We haven’t tested other CPU types yet. I'm working on a draft PR on top of this PR with the following changes and plan to raise it soon:

  • Removing the overhead of heap to off-heap copying by utilizing Linker.Option.critical, which eliminates unnecessary copying (see the FFM sketch after this list)
  • Runtime dispatch using IFUNC to choose the SVE vs. NEON vs. scalar implementation at runtime based on available intrinsics
  • Build-related changes to generate the binary
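
For illustration, a minimal FFM sketch of the Linker.Option.critical idea (assumes Java 22+; the class, library name, and loading strategy here are placeholders, not this PR's actual code):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class NativeDot8s {
    // Resolve the dot8s symbol from a hypothetical libdotProduct shared library.
    static final MethodHandle DOT8S = Linker.nativeLinker().downcallHandle(
        SymbolLookup.libraryLookup(System.mapLibraryName("dotProduct"), Arena.global())
            .find("dot8s").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT,
            ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
        // critical(true) allows heap segments as arguments, skipping the off-heap copy
        Linker.Option.critical(true));

    static int dot8s(byte[] a, byte[] b) {
        try {
            // ofArray wraps the heap arrays; legal as downcall args only under critical(true)
            return (int) DOT8S.invokeExact(
                MemorySegment.ofArray(a), MemorySegment.ofArray(b), a.length);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }
}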

We kept the native code isolated in the misc package rather than in the core module, where we know native code is highly discouraged. Additionally, PR #15285 would later help eliminate some code duplication and enable a cleaner implementation similar to PanamaVectorUtilSupport, potentially through a NativeVectorUtilSupport class?

Our benchmarking suggests substantial optimization potential for ARM-based deployments, and we believe this could benefit the broader Lucene community. Ideally, we hope to make it easy for any Lucene user to opt in to this alternative vector implementation. We're committed to refining this implementation based on community feedback and addressing any concerns during the review process. I'm eager to hear the community's thoughts on this change. Thank you!

To make use of faster SIMD instructions, PanamaVectorUtilSupport#{dotProduct|uint8DotProduct} will switch to the native dot product implementation when org.apache.lucene.util.Constants#NATIVE_DOT_PRODUCT_ENABLED is true (i.e. the system property -Dlucene.useNativeDotProduct=true is passed).
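
A minimal sketch of that flag, assuming it simply mirrors the system property (the PR's actual Constants code may differ):

final class Constants {
    // Assumption: reads the -Dlucene.useNativeDotProduct system property once at class init
    static final boolean NATIVE_DOT_PRODUCT_ENABLED =
        Boolean.getBoolean("lucene.useNativeDotProduct");
}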

GCC version: >= 10

@shubhamvishu (Contributor, Author) commented Dec 17, 2025

Sharing some JMH benchmark results with this PR from Graviton 2, Graviton 3, and Apple M2 Pro:

  1. dot8sNative: native dot product implementation (SVE/NEON/scalar). Graviton 2 uses NEON and Graviton 3 uses SVE intrinsics (an illustrative NEON sketch follows this list).
  2. dot8sNativeSimple: simple scalar for-loop approach (letting GCC auto-vectorize for the native arch)
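
For reference, an illustrative NEON variant of the 8-bit dot product (not necessarily the PR's exact code; assumes the ARMv8.2 dot-product extension, e.g. compiled with -march=armv8.2-a+dotprod):

#include <arm_neon.h>
#include <stdint.h>

int32_t dot8s_neon(const int8_t *a, const int8_t *b, int32_t n) {
    int32x4_t acc = vdupq_n_s32(0);
    int32_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // one vdot instruction performs 16 int8 multiply-accumulates
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    int32_t sum = vaddvq_s32(acc);          // horizontal sum of the 4 lanes
    for (; i < n; i++) sum += a[i] * b[i];  // scalar tail
    return sum;
}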

Graviton 2

Benchmark                                   (size)   Mode  Cnt    Score   Error   Units
VectorUtilBenchmark.binaryDotProductVector       1  thrpt   15  208.089 ± 0.017  ops/us
VectorUtilBenchmark.binaryDotProductVector     128  thrpt   15   15.288 ± 0.071  ops/us
VectorUtilBenchmark.binaryDotProductVector     207  thrpt   15    9.948 ± 0.063  ops/us
VectorUtilBenchmark.binaryDotProductVector     256  thrpt   15    8.326 ± 0.030  ops/us
VectorUtilBenchmark.binaryDotProductVector     300  thrpt   15    7.063 ± 0.050  ops/us
VectorUtilBenchmark.binaryDotProductVector     512  thrpt   15    4.311 ± 0.023  ops/us
VectorUtilBenchmark.binaryDotProductVector     702  thrpt   15    3.198 ± 0.026  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15    2.220 ± 0.020  ops/us
VectorUtilBenchmark.dot8sNative                  1  thrpt   15   86.070 ± 0.014  ops/us
VectorUtilBenchmark.dot8sNative                128  thrpt   15   84.463 ± 3.198  ops/us
VectorUtilBenchmark.dot8sNative                207  thrpt   15   49.372 ± 1.150  ops/us
VectorUtilBenchmark.dot8sNative                256  thrpt   15   70.491 ± 0.226  ops/us
VectorUtilBenchmark.dot8sNative                300  thrpt   15   44.338 ± 0.611  ops/us
VectorUtilBenchmark.dot8sNative                512  thrpt   15   43.895 ± 4.055  ops/us
VectorUtilBenchmark.dot8sNative                702  thrpt   15   27.977 ± 1.614  ops/us
VectorUtilBenchmark.dot8sNative               1024  thrpt   15   27.598 ± 0.102  ops/us
VectorUtilBenchmark.dot8sNativeSimple            1  thrpt   15  103.949 ± 0.216  ops/us
VectorUtilBenchmark.dot8sNativeSimple          128  thrpt   15   89.996 ± 3.834  ops/us
VectorUtilBenchmark.dot8sNativeSimple          207  thrpt   15   52.342 ± 0.909  ops/us
VectorUtilBenchmark.dot8sNativeSimple          256  thrpt   15   66.461 ± 3.691  ops/us
VectorUtilBenchmark.dot8sNativeSimple          300  thrpt   15   49.841 ± 1.630  ops/us
VectorUtilBenchmark.dot8sNativeSimple          512  thrpt   15   47.575 ± 0.230  ops/us
VectorUtilBenchmark.dot8sNativeSimple          702  thrpt   15   30.196 ± 2.045  ops/us
VectorUtilBenchmark.dot8sNativeSimple         1024  thrpt   15   26.870 ± 2.711  ops/us

Graviton 3

Benchmark                                   (size)   Mode  Cnt    Score    Error   Units
VectorUtilBenchmark.binaryDotProductVector       1  thrpt   15  449.506 ±  4.604  ops/us
VectorUtilBenchmark.binaryDotProductVector     128  thrpt   15   60.230 ±  0.007  ops/us
VectorUtilBenchmark.binaryDotProductVector     207  thrpt   15   36.598 ±  0.085  ops/us
VectorUtilBenchmark.binaryDotProductVector     256  thrpt   15   31.289 ±  0.006  ops/us
VectorUtilBenchmark.binaryDotProductVector     300  thrpt   15   26.704 ±  0.044  ops/us
VectorUtilBenchmark.binaryDotProductVector     512  thrpt   15   15.934 ±  0.001  ops/us
VectorUtilBenchmark.binaryDotProductVector     702  thrpt   15   11.607 ±  0.007  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15    8.041 ±  0.001  ops/us
VectorUtilBenchmark.dot8sNative                  1  thrpt   15  191.014 ±  2.466  ops/us
VectorUtilBenchmark.dot8sNative                128  thrpt   15  134.566 ±  2.626  ops/us
VectorUtilBenchmark.dot8sNative                207  thrpt   15  105.161 ±  6.314  ops/us
VectorUtilBenchmark.dot8sNative                256  thrpt   15   93.163 ±  3.352  ops/us
VectorUtilBenchmark.dot8sNative                300  thrpt   15   90.764 ±  7.961  ops/us
VectorUtilBenchmark.dot8sNative                512  thrpt   15   67.553 ±  1.328  ops/us
VectorUtilBenchmark.dot8sNative                702  thrpt   15   51.275 ±  3.981  ops/us
VectorUtilBenchmark.dot8sNative               1024  thrpt   15   40.886 ±  2.880  ops/us
VectorUtilBenchmark.dot8sNativeSimple            1  thrpt   15  221.399 ±  4.357  ops/us
VectorUtilBenchmark.dot8sNativeSimple          128  thrpt   15  162.158 ±  5.077  ops/us
VectorUtilBenchmark.dot8sNativeSimple          207  thrpt   15  119.323 ± 12.108  ops/us
VectorUtilBenchmark.dot8sNativeSimple          256  thrpt   15  111.288 ±  3.256  ops/us
VectorUtilBenchmark.dot8sNativeSimple          300  thrpt   15   90.587 ± 11.066  ops/us
VectorUtilBenchmark.dot8sNativeSimple          512  thrpt   15   58.725 ±  3.419  ops/us
VectorUtilBenchmark.dot8sNativeSimple          702  thrpt   15   47.595 ±  3.692  ops/us
VectorUtilBenchmark.dot8sNativeSimple         1024  thrpt   15   36.442 ±  0.377  ops/us

Apple M2 Pro

Note: dot8sNative and dot8sNativeSimple are the same here (i.e. both scalar), since dot8sNative also falls back to the auto-vectorized scalar implementation; no special handling was added for this CPU.

Benchmark                                   (size)   Mode  Cnt    Score    Error   Units
VectorUtilBenchmark.binaryDotProductVector       1  thrpt   15  418.270 ±  5.354  ops/us
VectorUtilBenchmark.binaryDotProductVector     128  thrpt   15   33.913 ±  0.136  ops/us
VectorUtilBenchmark.binaryDotProductVector     207  thrpt   15   22.309 ±  0.465  ops/us
VectorUtilBenchmark.binaryDotProductVector     256  thrpt   15   18.650 ±  0.344  ops/us
VectorUtilBenchmark.binaryDotProductVector     300  thrpt   15   16.045 ±  0.600  ops/us
VectorUtilBenchmark.binaryDotProductVector     512  thrpt   15    9.969 ±  0.142  ops/us
VectorUtilBenchmark.binaryDotProductVector     702  thrpt   15    7.172 ±  0.096  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt   15    4.888 ±  0.083  ops/us
VectorUtilBenchmark.dot8sNative                  1  thrpt   15  241.021 ±  3.690  ops/us
VectorUtilBenchmark.dot8sNative                128  thrpt   15  148.627 ±  0.538  ops/us
VectorUtilBenchmark.dot8sNative                207  thrpt   15   82.045 ±  0.433  ops/us
VectorUtilBenchmark.dot8sNative                256  thrpt   15  107.799 ±  1.698  ops/us
VectorUtilBenchmark.dot8sNative                300  thrpt   15   91.191 ±  1.086  ops/us
VectorUtilBenchmark.dot8sNative                512  thrpt   15   68.093 ±  2.800  ops/us
VectorUtilBenchmark.dot8sNative                702  thrpt   15   45.091 ±  0.321  ops/us
VectorUtilBenchmark.dot8sNative               1024  thrpt   15   41.423 ±  1.561  ops/us
VectorUtilBenchmark.dot8sNativeSimple            1  thrpt   15  257.435 ±  0.778  ops/us
VectorUtilBenchmark.dot8sNativeSimple          128  thrpt   15  158.886 ±  0.417  ops/us
VectorUtilBenchmark.dot8sNativeSimple          207  thrpt   15   85.121 ±  0.215  ops/us
VectorUtilBenchmark.dot8sNativeSimple          256  thrpt   15  108.148 ± 10.157  ops/us
VectorUtilBenchmark.dot8sNativeSimple          300  thrpt   15   92.938 ±  2.058  ops/us
VectorUtilBenchmark.dot8sNativeSimple          512  thrpt   15   67.859 ±  7.649  ops/us
VectorUtilBenchmark.dot8sNativeSimple          702  thrpt   15   44.981 ±  0.618  ops/us
VectorUtilBenchmark.dot8sNativeSimple         1024  thrpt   15   40.809 ±  3.245  ops/us

github-actions bot added this to the 11.0.0 milestone (Dec 17, 2025)
@dweiss (Contributor) commented Dec 17, 2025

I do have some minor comments, but the large one is: how do we handle native code for end users? Do we ship multiple precompiled binaries? Do we ship none? The current build modifications work "for you", but compiling for a matrix of all the possibilities is a bit of a nightmare.

@dweiss (Contributor) left a comment

I'm not sure about this. The integration with native libs here is... tight. And to get a working version, you pretty much have to recompile from scratch on the target machine.

It would be more elegant to have it somehow wrapped in a service and a separate (optional) module. I'm not sure if it's possible.

Also, as I mentioned, some thought needs to be given to what the public artifacts are going to be: are binaries going to be shipped, for which CPUs, how are they going to be compiled, and what dev requirements this entails (installing cross-compilation environments is probably not going to fly with many).

build.gradle Outdated
plugins {
    id "base"
    id "lucene.root-project.setup"
    id "c"

this can be moved to the project which actually uses the plugin (misc?).

Comment on lines 34 to 41
test {
    dependsOn ':lucene:misc:buildNative'
    systemProperty(
        "java.library.path",
        project(":lucene:misc").layout.buildDirectory.get().asFile.absolutePath + "/libs/dotProduct/shared"
    )
}


All of these changes should be pulled into a single gradle java plugin that handles the configuration across all the involved projects. If native libs are not enabled, these changes shouldn't be applied at all. A single plugin will make it easier to see the set of changes applied globally; currently they're scattered around.

@shubhamvishu (Author) replied:

Yeah, makes sense

@dweiss (Contributor) commented Dec 17, 2025

For the record, any Java library that ships with native libs has similar integration problems. You can look at jansi or any other lib that has native components for inspiration (and to get an idea of how hairy it becomes). Here's jansi's docker-based cross-compilation build:

https://github.com/fusesource/jansi/blob/master/Makefile

@shubhamvishu (Contributor, Author) commented Dec 17, 2025

Thanks @dweiss for taking a look. I completely agree that maintaining different native bindings in Lucene can get really challenging, as you mentioned, and I propose we don't ship any pre-built binaries (at least in this PR, if someone feels otherwise). While I got cross-platform binaries working for a few selected environments, supporting all environment permutations would be really painful. I added the build configuration only to test across different environments on my end, and I'm more than happy to remove the Gradle build configurations for binary generation.

One important point here is that GCC's auto-vectorization of simple C code performs comparably to (and sometimes better than) the hand-written NEON/SVE implementations (except in some cases/runs for SVE). This allows us to keep things simple on Lucene's end and let users experiment. This is in line with what @rmuir proposed in this comment.
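
For instance (illustrative), one can ask GCC to report which loops it actually vectorized:

gcc -O3 -march=native -funroll-loops -fopt-info-vec-optimized -c dotProduct.c
# GCC prints a note for each loop it vectorized, e.g. "loop vectorized using ... byte vectors"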

So here is what I propose, and what it would look like for the user:

  • We keep the simple dot product implementation in dotProduct.c in lucene/misc as-is, but don't build it.

  • Instead, we let the Lucene user optionally generate the binary on their end using gcc -shared -O3 -march=native -funroll-loops -o <output_file> <c_code_file> on the platform they wish to run on, or with whichever compiler and environment they are targeting (we can add more documentation if required). The user could either use the default dotProduct.c file with auto-vectorization, or get creative and provide a more performant implementation if they wish.

  • To enable the native dot product codepath, they pass the -Dlucene.useNativeDotProduct=true system property.

Contract: this PR provides the interface to interact with binaries for dot product computations, but providing that binary is in the user's court. Users need to provide a binary with the required signature/implementation:

int32_t dot8s(int8_t vec1[], int8_t vec2[], int32_t limit);

Let me know if this makes it clear what this would look like from the Lucene user's perspective.

@shubhamvishu (Contributor, Author) commented:

To expand on what exactly this would look like on a Lucene user's end to test/enable it:

  • Compile the C code (default impl or custom):
# Linux/Unix
gcc -shared -O3 -march=native -funroll-loops -o /home/simd/libdotProduct.so /home/apachelucene/lucene/misc/src/c/dotProduct.c

# macOS
gcc -shared -O3 -march=native -funroll-loops -o /home/simd/libdotProduct.dylib /home/apachelucene/lucene/misc/src/c/dotProduct.c

# Windows
gcc -shared -O3 -march=native -funroll-loops -o /home/simd/dotProduct.dll /home/apachelucene/lucene/misc/src/c/dotProduct.c
// dotProduct.c
#include <stdint.h>

int32_t dot8s(int8_t vec1[], int8_t vec2[], int32_t limit) {
    int32_t sum = 0;
    for (int32_t i = 0; i < limit; i++) {
        sum += vec1[i] * vec2[i];  // simple scalar loop; GCC auto-vectorizes it at -O3
    }
    return sum;
}
  • Test that everything works and the dot product implementation is correct:
./gradlew test                                 # PASSES, since by default native dot product is not tested
./gradlew test -Ptest.native.dotProduct=false  # PASSES, switched off
./gradlew test \
  -Ptest.native.dotProduct=true \
  -Ptests.jvmargs="-Djava.library.path=/home/simd" # PASSES

./gradlew test -Ptest.native.dotProduct=true   # FAILS, needs the library to be linked
  • Benchmark against the other C implementation or Panama:
java --enable-native-access=ALL-UNNAMED \
  --enable-preview \
  -Djava.library.path="/home/simd" \
  -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
  regexp "binaryDotProductVector|dot8sNative"

@dweiss (Contributor) commented Dec 17, 2025

I get that the performance improvement is nice, but it'll be difficult to maintain and test properly the way it's currently implemented. Not to mention most folks out there use Lucene without the possibility of recompiling from sources.

I don't mind adding a native implementation, but maybe this should be integrated via a more standard mechanism (like service lookup), and then everything would live in that optional "native" module? This would open up the possibility of compiling such a module independently. Sorry for my lack of enthusiasm about this, but I can tell it'll be a problem to maintain even by looking at it (properties, etc.).

@rmuir (Member) commented Dec 17, 2025

For the record, any Java library that ships with native libs has similar integration problems. You can look at jansi or any other lib that has native components for inspiration (and to get an idea of how hairy it becomes). Here's jansi's docker-based cross-compilation build:

https://github.com/fusesource/jansi/blob/master/Makefile

i'd use zig

@dweiss (Contributor) commented Dec 18, 2025

I looked again, and I still think a nicer way to plug this native impl into Lucene would be to make another implementation of VectorUtilSupport ("NativeVectorUtilSupport") and then use it instead of the default Panama implementation, if it is available. If you used service lookup for picking the implementation, you could select between them when classes are initialized. Then the native implementation could go into its own module, and there'd be no need for hacks like system properties, internals opened for benchmarks, etc.

I may be missing something of course, but I think it'd be a much easier way forward. What do you think?

@shubhamvishu (Contributor, Author) commented Dec 19, 2025

@dweiss I tried to make it more pluggable and added NativeVectorUtilSupport, and it really made the change cleaner, thanks! It lets users opt in to native binaries on the fly without doing anything extra on their end. Only if a library is present at runtime and contains the required implementation does it automatically switch to the native implementation; otherwise it falls back to the Panama-based implementation. So nothing changes for users who don't want to use or try it (they would continue using PanamaVectorUtilSupport).
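
Roughly, that detection/fallback flow could look like this (a sketch with illustrative names, not the PR's exact code):

import java.lang.foreign.Arena;
import java.lang.foreign.SymbolLookup;

final class NativeDotProductSupport {
    // True only if the shared library loads AND exports the dot8s symbol.
    static boolean available() {
        try {
            return SymbolLookup.libraryLookup(
                    System.mapLibraryName("dotProduct"), Arena.global())
                .find("dot8s").isPresent();
        } catch (IllegalArgumentException e) {
            return false; // no library found at runtime
        }
    }

    // lucene.useNativeDotProduct=true turns a missing binary into a hard error
    // instead of a silent fallback to the Panama-based implementation.
    static void enforceOptIn() {
        if (Boolean.getBoolean("lucene.useNativeDotProduct") && !available()) {
            throw new IllegalStateException(
                "lucene.useNativeDotProduct=true but no native dot8s binary was found");
        }
    }
}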

I removed all the Gradle-related changes, so we don't generate any binary at all for now. We can add support for more functions as we go, but I think this is a good start? There is no requirement to set a system property, though users can still pass lucene.useNativeDotProduct=true to enforce that the binary exists (explicitly enabling, to avoid silent fallback); lucene.useNativeDotProduct=false does not disable the native impl if it is found. To test, a user could simply run:

# Passes
./gradlew test
./gradlew test -Ptests.jvmargs="-Djava.library.path=/home/simd"
./gradlew test -Ptest.native.dotProduct=false
./gradlew test -Ptest.native.dotProduct=true -Ptests.jvmargs="-Djava.library.path=/home/simd"
./gradlew test -Ptests.jvmargs="-Dlucene.useNativeDotProduct=true -Djava.library.path=/home/simd"

# Fails (expects binary)
./gradlew test -Ptests.jvmargs="-Dlucene.useNativeDotProduct=true"
./gradlew test -Ptest.native.dotProduct=true

Note: I kept the C code as-is in misc, even though we are not building it, since it can serve as a reference or a good starting point for users.

@shubhamvishu (Contributor, Author) commented:

i'd use zig

Very interesting! I'd love to explore this path; it would be good in general to have some reliable way to provide efficient binaries to users. Maybe we could pursue this as a follow-up and keep this change focused on adding support for using a provided binary (next we could try adding actual binary support)?

@dweiss (Contributor) commented Dec 19, 2025

It's not what I had in mind, sorry for being vague. I have a strong feeling that the entire thing should be implemented using standard ServiceLoader mechanisms. So the default singleton in VectorizationProvider.Holder.INSTANCE should be instantiated using a service and the logic of this method -

 static VectorizationProvider lookup(boolean testMode) {

should be moved to individual service implementations (verification of whether a particular implementation can be used in the current environment, along with the instantiation of that implementation). The "singleton lookup" should only load all service providers, filter out what cannot be used for whatever reason, and then pick one of the remaining candidates in their preferred order (which can be controlled by a system property; for example, lucene.vectorization-provider=*,panama,default would indicate the desired ordering among available implementations).

So, by default we'd have the "default" (DefaultVectorizationProvider) fallback and "panama" (PanamaVectorizationProvider), but you could add another service implementation (native). The service would need to implement two methods: one to check if it can be used and the other to provide an instance of VectorizationProvider.
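
A rough sketch of the kind of service lookup described above (illustrative names; Lucene's actual classes differ and are package-private):

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

interface VectorizationProvider {}  // stand-in for Lucene's real class

// Each service implementation reports whether it can run in this environment
// and, if so, builds a provider instance.
interface VectorizationProviderFactory {
    String name();                   // e.g. "default", "panama", "native"
    boolean isSupported();           // JDK version, CPU features, binary present, ...
    VectorizationProvider create();
}

final class VectorizationLookup {
    static VectorizationProvider lookup() {
        // Preference order, e.g. -Dlucene.vectorization-provider=*,panama,default
        String[] preferred =
            System.getProperty("lucene.vectorization-provider", "*").split(",");
        List<VectorizationProviderFactory> usable = new ArrayList<>();
        ServiceLoader.load(VectorizationProviderFactory.class)
            .forEach(f -> { if (f.isSupported()) usable.add(f); });
        // First usable factory matching the preference list wins; "*" matches any.
        for (String name : preferred) {
            for (VectorizationProviderFactory f : usable) {
                if (name.equals("*") || name.equals(f.name())) return f.create();
            }
        }
        throw new IllegalStateException("no usable VectorizationProvider");
    }
}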

It isn't a straightforward patch because a lot of the code is currently package-private and intentionally hidden (and won't allow subclassing). But I have a gut feeling it's possible, and it would be a lot more elegant in the long run.

It's also related to PRs like this one - #15294 which would like to "know" which service implementation is being used. This could be the name (or class) of the service provider, for example.

I'm unfortunately away this week and won't be able to contribute directly. I would hold this patch until the above avenue can be explored, though (it either turns out to indeed work nicely, or it won't work for some odd reason).

@dweiss (Contributor) commented Dec 19, 2025

This is a quick-and-dirty PoC I wrote to better show what I mean:

https://github.com/apache/lucene/compare/main...dweiss:lucene:vector-prov-service?expand=1

It doesn't fully work (there are some clashes between the service loader and the module system, plus a ton of cleanups to be made), but if you take a look at the VectorizationProvider class, it'll give you an idea of how I think those loosely-pluggable implementations should be loaded.

Now... it's just an idea - I'm not really heavily advocating a switch to it, but I think it'd be better in the long term if we somehow decoupled those multiple implementations (and this includes potentially removing the need for multi-release jars, which are difficult to work with).

@rmuir (Member) commented Dec 28, 2025

Very interesting! I'd love to explore this path; it would be good in general to have some reliable way to provide efficient binaries to users. Maybe we could pursue this as a follow-up and keep this change focused on adding support for using a provided binary (next we could try adding actual binary support)?

Yes, there's no need to create binaries here. It was just a suggestion to avoid some hellacious network of Dockerfiles: for this kind of purpose, zig can be used as a better C compiler that "just works" to cross-compile all those binaries.
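
For example (illustrative, untested):

# one host cross-compiles the same dotProduct.c for several targets
zig cc -target aarch64-linux-gnu -O3 -shared -o libdotProduct-aarch64.so dotProduct.c
zig cc -target x86_64-linux-gnu -O3 -shared -o libdotProduct-x86_64.so dotProduct.c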

@shubhamvishu (Contributor, Author) commented:

Thank you @dweiss for sharing your approach! This indeed is more decoupled, and it looks better to have separate providers convey whether they can be used, based on the dynamic priority order. I've implemented your patch with an additional service implementation for the native use case. Currently, this delegates to Panama as the fastest fallback, so the native provider is only available when Panama is usable and the required binary is present.

@dweiss (Contributor) commented Dec 31, 2025

Please give me some time - I've been away on a winter break.
