Conversation

@deepak-pradhan

Summary

  • Fixes `RuntimeError: Bias expected in BMHK format` when using a custom attention bias with GQA models (e.g., Mistral-7B) during GRPO training
  • With GQA, xformers switches to a 5D tensor layout during gradient checkpointing (when `requires_grad=False`), but the cutlass backend does not support a custom bias with 5D tensors
  • Solution: temporarily disable xformers during training to force the SDPA path, which always uses 4D tensors and properly supports a custom attention bias (see the sketch below)
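
A minimal sketch of the workaround, assuming a Hugging Face transformers model that looks up its attention backend from `config._attn_implementation` at forward time (the actual toggle used in this PR may differ):

```python
import contextlib

@contextlib.contextmanager
def force_sdpa(model):
    """Temporarily route attention through PyTorch SDPA instead of xformers.

    Hypothetical helper: assumes the model reads its attention backend from
    `config._attn_implementation` on each forward pass, as recent Hugging
    Face transformers models do. SDPA keeps Q/K/V as 4D tensors, so a custom
    attention bias is accepted even with GQA.
    """
    previous = model.config._attn_implementation
    model.config._attn_implementation = "sdpa"
    try:
        yield model
    finally:
        # Restore the original backend (e.g., xformers) for inference.
        model.config._attn_implementation = previous
```

With a helper like this, the GRPO forward/backward pass would run inside `with force_sdpa(model): ...`, and the original backend is restored afterwards.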

Test plan

  • Verified GRPO training completes successfully with custom group/parent attention masking
  • Training output shows expected metrics: loss=-0, grad_norm=43.1, policy_loss=-3.41e-8, entropy=6.31

🤖 Generated with Claude Code

deepak-pradhan and others added 2 commits December 16, 2025 20:20
xformers with GQA (Grouped Query Attention) switches to a 5D tensor format
during gradient checkpointing when requires_grad=False. The cutlass backend
doesn't support a custom attention bias with 5D tensors, causing:
  "RuntimeError: Bias expected in BMHK format"

Solution: Temporarily disable xformers during training to force the SDPA
path, which always uses 4D tensors and properly supports custom attention
bias for trajectory group/parent masking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add type: ignore comments and explicit int() casts for model config
attributes to pass pyright type checking in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
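
An illustrative sketch of the kind of change the second commit describes; the specific attribute names and helper below are assumptions, not the actual diff:

```python
from transformers import PretrainedConfig

def gqa_dims(config: PretrainedConfig) -> tuple[int, int, int]:
    """Read GQA-related dimensions with explicit int() casts.

    Hypothetical example: config attributes are loosely typed, so the casts
    plus targeted ignores let the code pass pyright in CI.
    """
    num_heads = int(config.num_attention_heads)  # type: ignore
    num_kv_heads = int(config.num_key_value_heads)  # type: ignore
    head_dim = int(config.hidden_size) // num_heads  # type: ignore
    return num_heads, num_kv_heads, head_dim
```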
