add vllm benchmark for spec decode #125
Merged
This pull request introduces a new vLLM backend for benchmarking speculative decoding in the AngelSlim project, alongside several related improvements. The main changes include implementing a modular benchmarking engine for vLLM, updating the engine interface to support multiple backends, and extending the CLI to allow backend selection and configuration. These updates make the benchmarking framework more flexible and extensible for different model serving backends.
vLLM Benchmark Integration:
- Added a new `vllm` benchmarking module with `BenchmarkEngine`, `BenchmarkConfig`, and `BenchmarkMode` classes, supporting Eagle speculative decoding and baseline evaluation, metrics calculation, and output management (see the sketch after this list).
- Added an `__init__.py` for the vLLM benchmark module to expose the new benchmarking classes, with appropriate copyright/license headers.
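A minimal sketch of how such a module might be organized. The class names (`BenchmarkMode`, `BenchmarkConfig`, `BenchmarkEngine`) come from the PR description; all fields, method names, defaults, and the exact vLLM speculative-decoding keyword arguments below are illustrative assumptions, not the actual AngelSlim code.

```python
# Illustrative sketch only: class names come from the PR; fields, methods,
# and the speculative-decoding kwargs are assumptions and depend on the
# installed vLLM version.
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional

from vllm import LLM, SamplingParams


class BenchmarkMode(Enum):
    BASELINE = "baseline"   # plain autoregressive decoding
    EAGLE = "eagle"         # Eagle speculative decoding


@dataclass
class BenchmarkConfig:
    model_path: str
    draft_model_path: Optional[str] = None   # Eagle draft model (hypothetical field)
    mode: BenchmarkMode = BenchmarkMode.BASELINE
    batch_size: int = 1
    max_new_tokens: int = 256
    temperature: float = 0.0
    top_p: float = 1.0
    num_speculative_tokens: int = 5


class BenchmarkEngine:
    def __init__(self, config: BenchmarkConfig):
        self.config = config
        kwargs = {}
        if config.mode is BenchmarkMode.EAGLE:
            # Exact speculative-decoding arguments vary across vLLM releases;
            # recent versions accept a `speculative_config` dict like this.
            kwargs["speculative_config"] = {
                "method": "eagle",
                "model": config.draft_model_path,
                "num_speculative_tokens": config.num_speculative_tokens,
            }
        self.llm = LLM(model=config.model_path, **kwargs)

    def run(self, prompts: list) -> dict:
        params = SamplingParams(
            temperature=self.config.temperature,
            top_p=self.config.top_p,
            max_tokens=self.config.max_new_tokens,
        )
        start = time.perf_counter()
        outputs = self.llm.generate(prompts, params)
        elapsed = time.perf_counter() - start
        # Aggregate simple throughput metrics from the generated tokens.
        generated = sum(len(o.outputs[0].token_ids) for o in outputs)
        return {
            "tokens": generated,
            "seconds": elapsed,
            "tokens_per_second": generated / elapsed,
        }
```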
Engine Abstraction and Backend Selection:
- Updated the `SpecEngine` class in `engine.py` to support backend selection via the `deploy_backend` parameter, dynamically importing and using either the PyTorch or vLLM benchmarking classes (a sketch of this pattern follows the list). [1] [2] [3]
- Updated `SpecEngine` methods to use the selected backend's `BenchmarkConfig`, `BenchmarkEngine`, and `BenchmarkMode` throughout the workflow. [1] [2] [3]
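A rough sketch of how `deploy_backend`-based selection could be wired with a lazy import. The module paths in `_BACKEND_MODULES` and the `SpecEngine` internals shown here are assumptions for illustration; the real layout lives in AngelSlim's `engine.py`.

```python
# Illustrative sketch: module paths and method bodies are assumptions,
# not the actual AngelSlim implementation.
import importlib


class SpecEngine:
    # Hypothetical mapping from backend name to benchmark module path.
    _BACKEND_MODULES = {
        "torch": "benchmark.torch",
        "vllm": "benchmark.vllm",
    }

    def __init__(self, deploy_backend: str = "torch"):
        if deploy_backend not in self._BACKEND_MODULES:
            raise ValueError(f"Unsupported deploy_backend: {deploy_backend!r}")
        # Import the backend lazily so that, e.g., vLLM only needs to be
        # installed when it is actually requested.
        module = importlib.import_module(self._BACKEND_MODULES[deploy_backend])
        self.BenchmarkConfig = module.BenchmarkConfig
        self.BenchmarkEngine = module.BenchmarkEngine
        self.BenchmarkMode = module.BenchmarkMode

    def benchmark(self, prompts, **config_kwargs):
        # All downstream steps use the selected backend's classes.
        config = self.BenchmarkConfig(**config_kwargs)
        engine = self.BenchmarkEngine(config)
        return engine.run(prompts)
```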
CLI and Argument Improvements:
- Updated the CLI (`tools/spec_benchmark.py`) to accept a `--deploy-backend` argument for choosing between PyTorch and vLLM, and added vLLM-specific and general benchmarking parameters (e.g., `--batch-size`, `--top-p`, `--speculative-draft-tensor-parallel-size`); a sketch of the parser wiring follows. [1] [2] [3] [4]
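The flag names below are taken from the PR description; the parser wiring, defaults, and help text are an illustrative sketch rather than the actual `tools/spec_benchmark.py`.

```python
# Illustrative sketch of the CLI additions; flag names come from the PR,
# defaults and help strings are assumptions.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Speculative decoding benchmark")
    parser.add_argument("--deploy-backend", choices=["torch", "vllm"],
                        default="torch",
                        help="Serving backend to benchmark against.")
    # General benchmarking parameters.
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--top-p", type=float, default=1.0)
    # vLLM-specific speculative-decoding parameter.
    parser.add_argument("--speculative-draft-tensor-parallel-size",
                        type=int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With a layout like this, a vLLM run might be launched with something like `python tools/spec_benchmark.py --deploy-backend vllm --batch-size 8 --top-p 0.95` (hypothetical values).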
Path Handling and Output Consistency:

These changes collectively enhance the benchmarking framework's flexibility, maintainability, and usability for both PyTorch and vLLM backends.