
Conversation

Contributor

@GregoryComer GregoryComer commented Dec 2, 2025

Summary

Update the qb4w-family scalar GEMM kernels to load ksums with the unaligned load helpers from unaligned.h. Not using them was likely an oversight in the initial scalar implementation, as the ksums are only guaranteed to be 16-bit aligned.

The scalar kernels should only be used as a fallback, so this change should have minimal to no impact on ARM or x86 targets. In theory, we could pad out the packed weights slightly to guarantee 32-bit alignment of the ksums, but it looks like the scalar kernel already uses unaligned loads for non-multiple-of-4 NR values. I'm happy to do a deeper analysis if desired.
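
For readers unfamiliar with the issue, here is a minimal sketch of the kind of unaligned load this change relies on. It assumes the ksums are stored as 32-bit floats inside a packed byte buffer. The helper names (load_f32_unaligned, load_f32_unaligned_indexed, sum_ksums_at_offset2) are hypothetical and written for illustration; they are not the actual unaligned.h API, and the toy buffer layout is not the real packed-weight format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper (not the real unaligned.h API): read a float from an
// address that may not be 4-byte aligned. memcpy has no alignment
// requirement, so this is well defined for any valid address.
static inline float load_f32_unaligned(const void* address) {
  float value;
  memcpy(&value, address, sizeof(value));
  return value;
}

// Hypothetical helper: read the index-th float of a possibly unaligned array.
static inline float load_f32_unaligned_indexed(const void* address, size_t index) {
  return load_f32_unaligned((const unsigned char*) address + index * sizeof(float));
}

// Toy stand-in for packed weights: two float "ksums" placed at byte offset 2
// of a 4-byte-aligned buffer, so they are only 2-byte (16-bit) aligned.
static inline float sum_ksums_at_offset2(float k0, float k1) {
  uint32_t storage[4] = {0};
  const size_t offset = 2;
  assert(offset + 2 * sizeof(float) <= sizeof(storage));

  unsigned char* w = (unsigned char*) storage + offset;
  memcpy(w, &k0, sizeof(k0));
  memcpy(w + sizeof(float), &k1, sizeof(k1));

  // Dereferencing ((const float*) w)[0] here would be undefined behavior in
  // C, because w is not suitably aligned for float. The helpers avoid that.
  return load_f32_unaligned_indexed(w, 0) + load_f32_unaligned_indexed(w, 1);
}
```

On targets that permit unaligned access, compilers typically lower the memcpy to a plain load, which is consistent with the expectation above that the change has little or no performance impact.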

Test Plan

I locally built and ran the tests on an M4 Mac with CMake. There are two failures, but I confirmed that these failures are pre-existing on the parent commit.

The following tests FAILED:
        291 - f32-vgelu-test (Failed)
        433 - subgraph-fp16-test (Failed)

@GregoryComer GregoryComer marked this pull request as ready for review December 2, 2025 00:22
@GregoryComer
Contributor Author

@fbarchard Here's the unaligned load fix.

Collaborator

@fbarchard fbarchard left a comment

Thanks for the fix!
