brgemm_matmul segfaults with multiple threads when broadcasting dims

# Summary
`brgemm_matmul` segfaults when run with more than one thread on a matmul whose batch dimensions are broadcast (for example `2x1x40x20:1x1x20x40`).

# Version
`main`: 976bf2d4eb61582c1655e69208ff8173a93d8b45

# Environment
oneDNN includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:
* CPU: `x64` and `AArch64` 
* OS version: Linux 6.14
* git hash: 976bf2d4eb61582c1655e69208ff8173a93d8b45

# Steps to reproduce
On x64:
```sh
$ ONEDNN_VERBOSE=profile_create,profile_exec OMP_NUM_THREADS=2 ./build/tests/benchdnn/benchdnn --matmul --mode=R --stag=abcd --dtag=abcd 2x1x40x20:1x1x20x40
onednn_verbose,v1,info,oneDNN v3.11.0 (commit 976bf2d4eb61582c1655e69208ff8173a93d8b45)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:2
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.2771
onednn_verbose,v1,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.00195312
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.194824
0:EXECUTED (1 ms) __REPRO: --mode=R --mode-modifier=M --matmul --stag=abcd --dtag=abcd 2x1x40x20:1x1x20x40
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_matmul:avx512_core : 1 (100%)                        |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.00s; create_pd: 0.00s (30%); create_prim: 0.00s (38%); fill: 0.00s (0%); execute: 0.00s (14%);
Segmentation fault (core dumped)
```

# Observed behavior
Segmentation fault (core dumped), printed after benchdnn has already reported the test as passed.

# Expected behavior
I would strongly prefer if it did not segfault.

# Triage

This bug appears to be common to the `x64` and `AArch64` paths. I did the triage on `AArch64`, but I suspect `x64` fails for the same reason.

Essentially, we calculate the batch address here: https://github.com/uxlfoundation/oneDNN/blob/6fd57103715166bd59bf1fd6989003e61e201bf0/src/cpu/aarch64/matmul/brgemm_matmul.cpp#L368

That calculation depends on the thread number: https://github.com/uxlfoundation/oneDNN/blob/6fd57103715166bd59bf1fd6989003e61e201bf0/src/cpu/aarch64/matmul/brgemm_matmul.cpp#L1012-L1015

This means that, when broadcasting, the following pointer points to garbage:
https://github.com/uxlfoundation/oneDNN/blob/77dfcef253f65be5403d893e947906858bf5b6bb/src/cpu/aarch64/brgemm/brgemm_types.hpp#L102

The kernel then segfaults when it later accesses that pointer (at execute time):
https://github.com/uxlfoundation/oneDNN/blob/6fd57103715166bd59bf1fd6989003e61e201bf0/src/cpu/aarch64/brgemm/jit_brgemm_kernel.cpp#L1451-L1456

On the `AArch64` path, the `acl_matmul` implementation picks up this shape, so it does not crash out of the box, but the bug is still present. `x64` crashes out of the box.

I'd greatly appreciate advice on how to approach the fix (probably just adding a broadcast branch to the batch pointer calculation?).
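
To make the "broadcast branch" idea concrete, here is a minimal sketch of a broadcast-aware batch pointer calculation. It is not oneDNN code: `operand_t` and `batch_ptr` are hypothetical names, and it assumes a dense layout with a single flattened batch dimension. The real fix may well need to live elsewhere in `brgemm_matmul`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one matmul operand (not a oneDNN type).
struct operand_t {
    const float *base;         // start of the tensor data
    std::int64_t batch;        // batch size; 1 means the operand is broadcast
    std::int64_t batch_stride; // elements between consecutive batches
};

// Returns the start of batch `b` for the operand.
// The broadcast branch maps every batch index to batch 0 when the operand
// has batch size 1, so the pointer never steps past the single matrix.
inline const float *batch_ptr(const operand_t &op, std::int64_t b) {
    assert(b >= 0 && (op.batch == 1 || b < op.batch));
    const std::int64_t idx = (op.batch == 1) ? 0 : b;
    return op.base + idx * op.batch_stride;
}
```

With the shapes from the reproducer, `src` has batch size 2 and a batch stride of 40 * 20 = 800 elements, while the weights have batch size 1. In this sketch `batch_ptr(wei, 1)` stays at the weights base instead of stepping 800 elements past the end of the only weights matrix.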

@dzarukin @vpirogov 

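Excerpt from the kernel code at the last link above (`jit_brgemm_kernel.cpp#L1451-L1456`):

```cpp
if (offset < (1 << 6)) {
    ld1rw(z1.s, P_ALL_ONE / T_z,
            ptr(reg_aux_A, (int32_t)offset));
} else {
    add_imm(X_DEFAULT_ADDR, reg_aux_A, offset, X_TMP_0);
    ld1rw(z1.s, P_ALL_ONE / T_z, ptr(X_DEFAULT_ADDR));
```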

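Excerpt from the per-thread batch element pointer calculation linked above (`brgemm_matmul.cpp#L1012-L1015`):

```cpp
brgemm_batch_element_t *get_batch_elem_ptr(int ithr) const {
    return batch_element_ptr_
            + ithr * bgmmc_.brgemm_batch_element_per_thr_sz;
}
```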
