Use unaligned.h loads for qb4w scalar ksum loads #9227
Summary
Update the qb4w-family scalar GEMM kernels to load ksums through the unaligned load helpers in unaligned.h. The direct loads were likely an oversight in the initial scalar implementation, as the ksums are only guaranteed to be 16-bit aligned.
The scalar kernels should only be used as a fallback, so this change should have little to no impact on ARM or x86 targets. In theory, we could pad the packed weights slightly to guarantee 32-bit alignment of the ksums, but the scalar kernel already appears to use unaligned loads for NR values that are not a multiple of 4. I'm happy to do a deeper analysis if desired.
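For reviewers less familiar with the pattern, here is a minimal sketch of what an unaligned load amounts to in portable C. It assumes the ksums are stored as 32-bit floats, and the helper names `load_f32_unaligned` and `load_ksum` are hypothetical illustrations, not the actual functions declared in unaligned.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not the unaligned.h API): reads a float from an
// address that may not be 4-byte aligned. memcpy makes no alignment
// assumption, unlike dereferencing a (const float*) cast of the pointer.
static float load_f32_unaligned(const void* address) {
  float value;
  memcpy(&value, address, sizeof(value));
  return value;
}

// Hypothetical helper: reads the index-th float from a run of ksums whose
// start address is only guaranteed to be 2-byte aligned.
// Assumption: ksums are 32-bit floats stored back to back.
static float load_ksum(const void* ksums, size_t index) {
  assert(ksums != NULL);
  const unsigned char* bytes = (const unsigned char*) ksums;
  return load_f32_unaligned(bytes + index * sizeof(float));
}
```

On targets that tolerate misaligned accesses, compilers typically lower the `memcpy` to a single load, which is consistent with the expectation that the change costs little or nothing there.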
Test Plan
I built and ran the tests locally on an M4 Mac with CMake. There are two failures, but I confirmed that both also occur on the parent commit, so they are pre-existing.