
【BUG】When prefix caching is not triggered, the sequence's blocks are not deallocated #62

@77z-zhou

Description


When a sequence finishes and its blocks have not been cache-hit by another sequence's prefix caching, the sequence's blocks are not deallocated back into free_block_ids.

When I tested with multiple sequences, many of them got stuck in the waiting queue during the scheduler's waiting-queue traversal phase due to insufficient blocks. Meanwhile, many sequences had already completed, but their blocks were not correctly deallocated into free_block_ids. Eventually, because there were still tasks in the waiting queue, the LLMEngine kept scheduling in a loop; scheduling then produced an empty batch, which caused an error.
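A minimal sketch of the missing cleanup, assuming a hypothetical `BlockManager` with reference-counted blocks (the class and method names here are illustrative, not the project's actual API): when a sequence finishes, each of its blocks should have its reference count decremented and, if no other sequence still references it, be returned to `free_block_ids` — regardless of whether prefix caching was ever triggered.

```python
class BlockManager:
    """Illustrative reference-counted KV-cache block pool."""

    def __init__(self, num_blocks: int):
        self.free_block_ids = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        # Take a block from the free pool for a running sequence.
        block_id = self.free_block_ids.pop()
        self.ref_counts[block_id] = 1
        return block_id

    def deallocate(self, block_table: list[int]) -> None:
        # Called when a sequence finishes: decrement each block's
        # ref count and return unreferenced blocks to the free pool,
        # even if the block was never a prefix-cache hit.
        for block_id in reversed(block_table):
            self.ref_counts[block_id] -= 1
            if self.ref_counts[block_id] == 0:
                self.free_block_ids.append(block_id)
        block_table.clear()
```

Without a `deallocate` call on sequence completion, blocks leak and the waiting queue can never drain, which matches the behavior described above.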

Test Code

import sys
from pathlib import Path

from transformers import AutoTokenizer

# Add src to Python path
sys.path.insert(0, str(Path(__file__).parent / "src"))

from myvllm.engine.llm_engine import LLMEngine as LLM
from myvllm.sampling_parameters import SamplingParams

config = {
    'max_num_sequences': 16,
    'max_num_batched_tokens': 1024,
    'max_cached_blocks': 1024,
    'block_size': 256,
    'world_size': 1,
    'model_name_or_path': '/home/models/Qwen/Qwen3-0.6B',
    'enforce_eager': True,
    'vocab_size': 151936,  # Fixed: was 151643, HF model uses 151936
    'hidden_size': 1024,
    'num_heads': 16,
    'head_dim': 128,  # Fixed: was 64, should be 128 (hidden_size / num_heads for GQA output)
    'num_kv_heads': 8,
    'intermediate_size': 3072,
    'num_layers': 28,
    'tie_word_embeddings': True,
    'base': 1000000,  # Fixed: was 10000, HF uses rope_theta=1000000
    'rms_norm_epsilon': 1e-6,
    'qkv_bias': False,
    'scale': 1,
    'max_position': 32768, # should be >= max_model_length, max position index allowed in rotary embedding
    'ffn_bias': False,  # Fixed: HF Qwen3 doesn't use MLP bias
    'max_num_batch_tokens': 4096,
    'max_model_length': 128,
    'gpu_memory_utilization': 0.9,
    'eos': 151645,  # Fixed: should match tokenizer.eos_token_id
}

def test_multiple_sequence():
    model_name = config.get('model_name_or_path', 'Qwen/Qwen3-0.6B')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    llm = LLM(config=config)
    
    sampling_params = SamplingParams(temperature=0.6, max_tokens=256, max_model_length=128)
    prompts = [
        "introduce yourself",
        "list all prime numbers within 100",
        "give me your opinion on the impact of artificial intelligence on society",
    ] * 30

    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for prompt in prompts
    ]
    outputs, metric = llm.generate(prompts, sampling_params)
    print(metric)

if __name__ == "__main__":
    test_multiple_sequence()

Error message

================================================================================
Weight Loading Summary:

Successfully loaded: 283 parameter groups
Skipped (merged into other weights): 28

[Rank 0] Global max_cached_blocks (min): 64
267 number of processed tokens 362.5880076125535 tokens/sec during prefilling
273 number of processed tokens 3383.3927464709272 tokens/sec during prefilling
276 number of processed tokens 3571.9885292494237 tokens/sec during prefilling
267 number of processed tokens 4700.772630275205 tokens/sec during prefilling
16 number of processed tokens 113.07630370637804 tokens/sec during decoding
16 number of processed tokens 387.6839310775934 tokens/sec during decoding
16 number of processed tokens 260.18448516977884 tokens/sec during decoding
16 number of processed tokens 303.9529676679917 tokens/sec during decoding
16 number of processed tokens 394.70694283356215 tokens/sec during decoding
16 number of processed tokens 383.68092848281697 tokens/sec during decoding
16 number of processed tokens 404.6578580102278 tokens/sec during decoding
16 number of processed tokens 303.85663063680704 tokens/sec during decoding
16 number of processed tokens 384.48988105152546 tokens/sec during decoding
16 number of processed tokens 408.1302914846302 tokens/sec during decoding
16 number of processed tokens 375.0900647387897 tokens/sec during decoding
16 number of processed tokens 307.1456340331627 tokens/sec during decoding
16 number of processed tokens 401.4264238385268 tokens/sec during decoding
16 number of processed tokens 335.68026980529856 tokens/sec during decoding
16 number of processed tokens 371.98385794409705 tokens/sec during decoding
16 number of processed tokens 404.8165224773649 tokens/sec during decoding
16 number of processed tokens 369.03619951286504 tokens/sec during decoding
16 number of processed tokens 412.80234132496446 tokens/sec during decoding
16 number of processed tokens 403.6137837896921 tokens/sec during decoding
16 number of processed tokens 406.6710933791648 tokens/sec during decoding
16 number of processed tokens 407.70134097797524 tokens/sec during decoding
16 number of processed tokens 410.6778277201471 tokens/sec during decoding
16 number of processed tokens 347.2914526578281 tokens/sec during decoding
16 number of processed tokens 398.7383696829944 tokens/sec during decoding
16 number of processed tokens 414.8925120617092 tokens/sec during decoding
16 number of processed tokens 342.8557174557227 tokens/sec during decoding
16 number of processed tokens 382.6963345811741 tokens/sec during decoding
16 number of processed tokens 325.23754162386706 tokens/sec during decoding
16 number of processed tokens 326.16860282769284 tokens/sec during decoding
16 number of processed tokens 393.420393221796 tokens/sec during decoding
16 number of processed tokens 371.69541531393116 tokens/sec during decoding
16 number of processed tokens 405.03886187714266 tokens/sec during decoding
16 number of processed tokens 422.9781279392789 tokens/sec during decoding
16 number of processed tokens 405.46222527633535 tokens/sec during decoding
16 number of processed tokens 329.24423095029096 tokens/sec during decoding
16 number of processed tokens 398.3431105414779 tokens/sec during decoding
16 number of processed tokens 390.15193472567057 tokens/sec during decoding
16 number of processed tokens 413.4279419835498 tokens/sec during decoding
16 number of processed tokens 406.3583584900782 tokens/sec during decoding
16 number of processed tokens 337.0357022757551 tokens/sec during decoding
16 number of processed tokens 410.35890243726493 tokens/sec during decoding
16 number of processed tokens 356.6471299300158 tokens/sec during decoding
16 number of processed tokens 404.0171448643195 tokens/sec during decoding
16 number of processed tokens 422.3100253784892 tokens/sec during decoding
16 number of processed tokens 393.92383091556917 tokens/sec during decoding
16 number of processed tokens 353.9011731068687 tokens/sec during decoding
16 number of processed tokens 401.59938619995336 tokens/sec during decoding
16 number of processed tokens 301.73085149449406 tokens/sec during decoding
16 number of processed tokens 383.00211073837886 tokens/sec during decoding
16 number of processed tokens 368.81310539906553 tokens/sec during decoding
16 number of processed tokens 399.87405828857834 tokens/sec during decoding
16 number of processed tokens 405.9773253180286 tokens/sec during decoding
16 number of processed tokens 417.395595378353 tokens/sec during decoding
16 number of processed tokens 421.302562156162 tokens/sec during decoding
16 number of processed tokens 413.23446466220173 tokens/sec during decoding
16 number of processed tokens 400.6284070226929 tokens/sec during decoding
16 number of processed tokens 407.1126954413317 tokens/sec during decoding
16 number of processed tokens 367.12033964276725 tokens/sec during decoding
16 number of processed tokens 412.16597363422835 tokens/sec during decoding
16 number of processed tokens 325.2028681120374 tokens/sec during decoding
16 number of processed tokens 407.42906649209147 tokens/sec during decoding
16 number of processed tokens 391.29163898569374 tokens/sec during decoding
16 number of processed tokens 364.8906497401178 tokens/sec during decoding
16 number of processed tokens 385.20936226957355 tokens/sec during decoding
16 number of processed tokens 330.2487801182224 tokens/sec during decoding
16 number of processed tokens 256.31190257803877 tokens/sec during decoding
16 number of processed tokens 345.9505106343711 tokens/sec during decoding
16 number of processed tokens 265.9493609278574 tokens/sec during decoding
16 number of processed tokens 273.16003144559653 tokens/sec during decoding
16 number of processed tokens 279.66804280189 tokens/sec during decoding
16 number of processed tokens 315.053255595363 tokens/sec during decoding
16 number of processed tokens 347.5810510696103 tokens/sec during decoding
16 number of processed tokens 274.0837739538037 tokens/sec during decoding
16 number of processed tokens 307.3383429328065 tokens/sec during decoding
16 number of processed tokens 299.3726222846297 tokens/sec during decoding
16 number of processed tokens 293.5735734552953 tokens/sec during decoding
16 number of processed tokens 296.76066771409296 tokens/sec during decoding
16 number of processed tokens 370.59112159174543 tokens/sec during decoding
16 number of processed tokens 359.51305459595903 tokens/sec during decoding
16 number of processed tokens 391.87657714239447 tokens/sec during decoding
16 number of processed tokens 395.0275707795521 tokens/sec during decoding
16 number of processed tokens 392.8124456236904 tokens/sec during decoding
16 number of processed tokens 313.78197075989874 tokens/sec during decoding
16 number of processed tokens 235.96729911507475 tokens/sec during decoding
16 number of processed tokens 306.88859664470834 tokens/sec during decoding
16 number of processed tokens 397.06567482461946 tokens/sec during decoding
16 number of processed tokens 367.37356489178615 tokens/sec during decoding
16 number of processed tokens 342.2210520069663 tokens/sec during decoding
16 number of processed tokens 354.01505473856105 tokens/sec during decoding
16 number of processed tokens 361.8586826429205 tokens/sec during decoding
16 number of processed tokens 376.6563610148671 tokens/sec during decoding
16 number of processed tokens 410.0580105220626 tokens/sec during decoding
16 number of processed tokens 416.8692584009017 tokens/sec during decoding
16 number of processed tokens 382.50656239599374 tokens/sec during decoding
16 number of processed tokens 306.6095738708052 tokens/sec during decoding
16 number of processed tokens 325.3699994353076 tokens/sec during decoding
16 number of processed tokens 328.0115344240614 tokens/sec during decoding
16 number of processed tokens 379.4333780424395 tokens/sec during decoding
16 number of processed tokens 268.41076175959955 tokens/sec during decoding
16 number of processed tokens 389.262551257143 tokens/sec during decoding
16 number of processed tokens 423.3917580261359 tokens/sec during decoding
16 number of processed tokens 383.05457857766396 tokens/sec during decoding
16 number of processed tokens 365.78764142598845 tokens/sec during decoding
16 number of processed tokens 404.4091274135005 tokens/sec during decoding
16 number of processed tokens 288.6551730795966 tokens/sec during decoding
16 number of processed tokens 377.1410964410426 tokens/sec during decoding
16 number of processed tokens 380.9994597459804 tokens/sec during decoding
11 number of processed tokens 177.38378050511477 tokens/sec during decoding
11 number of processed tokens 233.99660143659418 tokens/sec during decoding
11 number of processed tokens 204.60654616410173 tokens/sec during decoding
6 number of processed tokens 151.49548470019857 tokens/sec during decoding
6 number of processed tokens 146.09802986724205 tokens/sec during decoding
6 number of processed tokens 133.5382849076697 tokens/sec during decoding
6 number of processed tokens 137.06427863202222 tokens/sec during decoding
6 number of processed tokens 129.3752966087255 tokens/sec during decoding
6 number of processed tokens 152.6330009824299 tokens/sec during decoding
False
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/myworkspace/MinivLLM/tests/test_mixed_length_sequence.py", line 67, in
[rank0]: test_mixed_length_sequence()
[rank0]: File "/root/myworkspace/MinivLLM/tests/test_mixed_length_sequence.py", line 63, in test_mixed_length_sequence
[rank0]: outputs, metric = llm.generate(prompts, sampling_params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/myworkspace/MinivLLM/src/myvllm/engine/llm_engine.py", line 108, in generate
[rank0]: outputs, num_processed_tokens, is_prefill = self.step(metric)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: not enough values to unpack (expected 3, got 2)
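The final `ValueError` suggests that `step()` returns a 2-tuple on the empty-batch path but a 3-tuple otherwise, while `generate()` always unpacks three values. A hedged sketch of the defensive pattern (the function below is a standalone illustration, not the engine's actual code): keep the return arity identical on every path, including the empty-schedule case.

```python
def step(scheduled_seq_ids):
    # Illustrative engine step: always return the same 3-tuple
    # (outputs, num_processed_tokens, is_prefill), even when the
    # scheduler produced an empty batch, so a caller unpacking
    # three values never hits "not enough values to unpack".
    if not scheduled_seq_ids:
        return [], 0, False
    outputs = [f"token-for-seq-{i}" for i in scheduled_seq_ids]
    return outputs, len(scheduled_seq_ids), True
```

The underlying fix is still to free finished sequences' blocks so the scheduler does not produce empty batches in the first place; the uniform return shape just turns the crash into a recoverable condition.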
