Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17356

Fixes #17297

To use subgroup operations on Intel, we have to force full subgroups; otherwise the subgroup size and the number of threads that are actually active may vary. I don't think this change should have any negative effect on other drivers.
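
For reference, "forcing full subgroups" at the Vulkan API level presumably means setting VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT on the compute stage. A minimal sketch, assuming Vulkan 1.3 (or VK_EXT_subgroup_size_control); this is not the actual ggml-vulkan pipeline-creation code:

```cpp
#include <vulkan/vulkan.h>

// Sketch only: creating a compute pipeline that requires full subgroups.
// All handles passed in are assumed to have been created elsewhere.
VkResult create_full_subgroup_pipeline(VkDevice device,
                                       VkShaderModule shader_module,
                                       VkPipelineLayout layout,
                                       VkPipeline * out_pipeline) {
    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    // With this flag the driver must launch fully populated subgroups, so
    // the set of active invocations seen by subgroup ops is fixed. The
    // workgroup size must then be a multiple of the subgroup size.
    stage.flags  = VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = shader_module;
    stage.pName  = "main";

    VkComputePipelineCreateInfo info = {};
    info.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    info.stage  = stage;
    info.layout = layout;

    return vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info,
                                    nullptr, out_pipeline);
}
```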

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Summary of the pull request changes, based on the GitHub code review analysis:

Pull Request Summary: Vulkan Flash Attention Intel Subgroup Fix

Overview

This pull request addresses Intel GPU subgroup crashes in Vulkan flash attention by requiring full subgroups for all flash attention pipelines. The change modifies four call sites in the ggml_vk_load_shaders() function, replacing the conditional full-subgroup requirement (FAPATH==FA_COOPMAT1) with an unconditional one (true).

Code Changes Analysis

The modifications affect four flash attention pipeline variants in ggml/src/ggml-vulkan/ggml-vulkan.cpp (a reduced sketch of the edit follows the list):

  • flash_attn_f32_f16_aligned_f32acc
  • flash_attn_f32_f16_aligned_f16acc
  • flash_attn_f32_f16_f32acc
  • flash_attn_f32_f16_f16acc

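A hypothetical reduction of the edit at each call site. The real calls in ggml_vk_load_shaders() pass many more arguments to ggml_vk_create_pipeline(); only FAPATH and FA_COOPMAT1 appear in the PR, and the other enumerators below are placeholders:

```cpp
// Hypothetical reduction of the change; FA_SCALAR and FA_COOPMAT2 are
// placeholder enumerators, not names taken from the PR.
enum fa_path { FA_SCALAR, FA_COOPMAT1, FA_COOPMAT2 };

// Before: only the coopmat1 path required full subgroups.
bool require_full_subgroups_before(fa_path path) {
    return path == FA_COOPMAT1;
}

// After: every flash attention pipeline requires full subgroups, so a driver
// can no longer launch partially populated subgroups for these shaders.
bool require_full_subgroups_after(fa_path /*path*/) {
    return true;
}
```
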
Technical Impact

Correctness: Eliminates the Intel GPU crashes by guaranteeing fully populated subgroups, making subgroup behavior consistent regardless of driver. This defensive approach prioritizes stability over micro-optimization.

Performance: Requiring full subgroups constrains how the driver may launch invocations, which can add slight overhead on some hardware. In exchange, the change eliminates the Intel crashes and yields more predictable execution patterns.

Compatibility: Maintains functionality across all GPU vendors (Intel, NVIDIA, AMD) while simplifying the codebase by removing conditional subgroup logic.
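
One related detail, noted here as an assumption rather than something stated in the PR: the Vulkan flag for requiring full compute subgroups is gated on the computeFullSubgroups device feature, so a backend would typically verify support before using it. A sketch, not the backend's actual feature-detection code:

```cpp
#include <vulkan/vulkan.h>

// Check whether the device supports forcing fully populated compute
// subgroups (VK_EXT_subgroup_size_control, promoted to core in Vulkan 1.3).
bool supports_full_subgroups(VkPhysicalDevice physical_device) {
    VkPhysicalDeviceSubgroupSizeControlFeatures subgroup_features = {};
    subgroup_features.sType =
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_FEATURES;

    VkPhysicalDeviceFeatures2 features2 = {};
    features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features2.pNext = &subgroup_features;

    vkGetPhysicalDeviceFeatures2(physical_device, &features2);
    return subgroup_features.computeFullSubgroups == VK_TRUE;
}
```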

Assessment

This is a conservative fix for a critical compatibility issue. It also simplifies maintenance by giving all flash attention variants uniform behavior. Any minor performance cost on some hardware is justified by the crash prevention.

The pull request demonstrates sound engineering judgment by choosing reliability over potential micro-optimizations in GPU compute operations.

@loci-dev force-pushed the main branch 27 times, most recently from 819d372 to ab559ce on November 24, 2025 at 20:10
@loci-dev force-pushed the main branch 9 times, most recently from 53eeb3f to 2531f8a on November 26, 2025 at 08:11