Your current environment
Description: I'm trying to serve GLM-4.7-Flash with KVarN's kvarn_k4v2_g128 KV cache dtype, but it fails with a head_dim validation error. The model has head_dim=576, which is incompatible with the KVarN dtype requirement.
Question: Is kvarn_k4v2_g128 designed to work only with models having head_dim=128? Or am I missing a configuration step?
Command
python -m vllm.entrypoints.openai.api_server \
--model /home/mani/workspace/hfModels/GLM-4.7-Flash \
--host 0.0.0.0 \
--port 8001 \
--tensor-parallel-size 2 \
--block-size 128 \
--kv-cache-dtype kvarn_k4v2_g128
Error
(APIServer pid=406685) pydantic_core._pydantic_core.ValidationError: 1 validation error for VllmConfig
(APIServer pid=406685) Value error, kvarn_k4v2_g128 requires head_dim=128, but this model has head_dim=576. KVarN currently supports head_dim=128 only; use a different --kv-cache-dtype for this model. [type=value_error, input_value=ArgsKwargs((), {'model_co... 'shutdown_timeout': 0}), input_type=ArgsKwargs]
(APIServer pid=406685) For further information visit https://errors.pydantic.dev/2.12/v/value_error
Model Info
- Model: GLM-4.7-Flash
- Architecture: Glm4MoeLiteForCausalLM
- head_dim: 576
Additional Question: Is there a workaround, or should KVarN dtypes be documented as head_dim-specific?
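To make the workaround question concrete, here is a minimal sketch of a pre-launch check that could go in a wrapper script. It assumes the constraint is exactly what the error message states (KVarN dtypes require head_dim=128) and that falling back to `--kv-cache-dtype auto` is acceptable. The helper name `pick_kv_cache_dtype` and the `kvarn` prefix test are hypothetical and not part of vLLM or KVarN.

```python
# Assumption: taken from the error message above, not from KVarN documentation.
KVARN_REQUIRED_HEAD_DIM = 128


def pick_kv_cache_dtype(
    head_dim: int,
    preferred: str = "kvarn_k4v2_g128",
    fallback: str = "auto",
) -> str:
    """Return a --kv-cache-dtype value that should pass the head_dim check.

    Hypothetical helper: treats any dtype whose name starts with "kvarn"
    as requiring head_dim == 128, and falls back otherwise.
    """
    if preferred.startswith("kvarn") and head_dim != KVARN_REQUIRED_HEAD_DIM:
        return fallback
    return preferred


if __name__ == "__main__":
    # GLM-4.7-Flash reports head_dim=576 in the error above.
    print(pick_kv_cache_dtype(576))
    # A model with head_dim=128 would keep the KVarN dtype.
    print(pick_kv_cache_dtype(128))
```

This only avoids the startup failure by not using KVarN for this model; it does not make kvarn_k4v2_g128 work with head_dim=576.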