Description
Summary
brgemm_matmul segfaults with multiple threads when broadcasting dims.
Version
main: 976bf2d
Environment
oneDNN includes hardware-specific optimizations and may behave
differently depending on the compiler and build environment. Include
the following information to help reproduce the issue:
- CPU: x64 and AArch64
- OS version: Linux 6.14
- git hash: 976bf2d
Steps to reproduce
On x64:
$ ONEDNN_VERBOSE=profile_create,profile_exec OMP_NUM_THREADS=2 ./build/tests/benchdnn/benchdnn --matmul --mode=R --stag=abcd --dtag=abcd 2x1x40x20:1x1x20x40
onednn_verbose,v1,info,oneDNN v3.11.0 (commit 976bf2d4eb61582c1655e69208ff8173a93d8b45)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:2
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.2771
onednn_verbose,v1,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.00195312
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core,undef,src:f32::blocked:abcd::f0 wei:f32:a:blocked:abcd::f0 dst:f32::blocked:abcd::f0,,,2x1x40x20:1x1x20x40,0.194824
0:EXECUTED (1 ms) __REPRO: --mode=R --mode-modifier=M --matmul --stag=abcd --dtag=abcd 2x1x40x20:1x1x20x40
============================================================
= Implementation statistics (--summary=no-impl to disable) =
============================================================
| brg_matmul:avx512_core : 1 (100%) |
============================================================
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.00s; create_pd: 0.00s (30%); create_prim: 0.00s (38%); fill: 0.00s (0%); execute: 0.00s (14%);
Segmentation fault (core dumped)
Observed behavior
Segmentation fault.
Expected behavior
I would strongly prefer if it did not segfault.
Triage
This bug appears to be common to the x64 and AArch64 paths. I triaged it on the AArch64 side, but I suspect x64 hits the same bug.
Essentially, we calculate the batch address here:
    const auto addr_batch = brgmm_ctx.get_batch_elem_ptr(ithr);
That calculation depends on the thread number:
oneDNN/src/cpu/aarch64/matmul/brgemm_matmul.cpp
Lines 1012 to 1015 in 6fd5710
    brgemm_batch_element_t *get_batch_elem_ptr(int ithr) const {
        return batch_element_ptr_
                + ithr * bgmmc_.brgemm_batch_element_per_thr_sz;
    }
This means that, when broadcasting, the following pointer points to garbage:
    const void *A;
The kernel therefore segfaults when it later dereferences that pointer (at execute time):
oneDNN/src/cpu/aarch64/brgemm/jit_brgemm_kernel.cpp
Lines 1451 to 1456 in 6fd5710
    if (offset < (1 << 6)) {
        ld1rw(z1.s, P_ALL_ONE / T_z,
                ptr(reg_aux_A, (int32_t)offset));
    } else {
        add_imm(X_DEFAULT_ADDR, reg_aux_A, offset, X_TMP_0);
        ld1rw(z1.s, P_ALL_ONE / T_z, ptr(X_DEFAULT_ADDR));
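To make the per-thread arithmetic in get_batch_elem_ptr easier to reason about, here is a minimal standalone sketch of it. The batch_elem_t struct and the slice_start helper are hypothetical stand-ins written for this illustration, not oneDNN code. The sketch only shows that thread ithr starts reading at element ithr * per_thr_sz of a shared buffer. It does not confirm why that slice ends up holding an invalid A pointer in the broadcast case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for brgemm_batch_element_t: only the two operand
// pointers that matter for this illustration.
struct batch_elem_t {
    const void *A = nullptr;
    const void *B = nullptr;
};

// Mirrors the arithmetic of get_batch_elem_ptr(): each thread owns a
// contiguous slice of per_thr_sz elements inside one shared buffer.
inline batch_elem_t *get_batch_elem_ptr(
        batch_elem_t *base, int ithr, std::size_t per_thr_sz) {
    assert(ithr >= 0);
    return base + ithr * per_thr_sz;
}

// Returns the index (relative to the buffer start) of the first element
// of a thread's slice. The buffer is sized to cover threads 0..ithr.
inline std::ptrdiff_t slice_start(int ithr, std::size_t per_thr_sz) {
    std::vector<batch_elem_t> buf((ithr + 1) * per_thr_sz);
    return get_batch_elem_ptr(buf.data(), ithr, per_thr_sz) - buf.data();
}
```

With OMP_NUM_THREADS=2, thread 1 reads its batch elements starting per_thr_sz entries past the base. If those entries were never filled in for that thread, the A field holds whatever happened to be in memory.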
On the AArch64 path, the acl_matmul implementation picks up this shape, so it does not crash by default, but the bug is still present in brgemm_matmul. x64 crashes out of the box.
I'd greatly appreciate advice on how to approach the fix (probably just adding a broadcast branch to the batch pointer calculation?).
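For reference, this is roughly what I mean by a broadcast branch. It is a sketch of the usual matmul broadcast rule only, not a patch against brgemm_matmul. The function name broadcast_batch_idx and the local dim_t alias are made up for this example. An operand whose batch dim is 1 always contributes coordinate 0 for that dim, so for 2x1x40x20:1x1x20x40 every dst batch should read weights batch 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using dim_t = std::int64_t; // local alias for this sketch

// Maps a linear batch index of dst to the linear batch index of an operand
// whose batch dims may be 1 (broadcast). Batch dims are listed outermost
// first and exclude the trailing M/K/N matrix dims.
inline dim_t broadcast_batch_idx(dim_t dst_batch_idx,
        const std::vector<dim_t> &dst_batch_dims,
        const std::vector<dim_t> &op_batch_dims) {
    assert(dst_batch_dims.size() == op_batch_dims.size());
    dim_t op_idx = 0;
    dim_t op_stride = 1;
    // Walk from the innermost batch dim to the outermost one.
    for (int d = static_cast<int>(dst_batch_dims.size()) - 1; d >= 0; --d) {
        const dim_t coord = dst_batch_idx % dst_batch_dims[d];
        dst_batch_idx /= dst_batch_dims[d];
        // A broadcast dim (size 1) always maps to coordinate 0.
        const dim_t op_coord = op_batch_dims[d] == 1 ? 0 : coord;
        op_idx += op_coord * op_stride;
        op_stride *= op_batch_dims[d];
    }
    return op_idx;
}
```

For the reproducer, the dst batch dims are {2, 1} and the weights batch dims are {1, 1}, so both dst batches map to weights batch 0, while the src batch index follows the dst batch index.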