A tool to determine the maximum usable context size for Ollama models while monitoring performance and VRAM usage. This tool helps you find the optimal balance between context size and performance for your specific hardware setup.
It works by testing increasing context sizes with your chosen Ollama model to find the largest size that still maintains acceptable performance. It monitors:
- Token processing speed (tokens per second; a measurement sketch follows this list)
- VRAM usage
- Response times
- Model behavior at different context lengths
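For reference, the tokens-per-second figure can be derived from the `eval_count` and `eval_duration` fields returned by Ollama's `/api/generate` endpoint. The snippet below is a minimal sketch of that measurement (the function name, prompt handling, and timeout are illustrative assumptions, not necessarily how `main.py` does it):

```python
import requests

def measure_tokens_per_second(model: str, prompt: str, num_ctx: int) -> float:
    """Run one non-streaming generation against a local Ollama server and
    compute tokens/sec from its eval_count and eval_duration fields."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx},  # context size under test
        },
        timeout=600,  # generous per-request timeout (assumed value)
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)
```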
- Windows (10/11):
  - Python 3.8+
  - Ollama (https://ollama.com/) installed and running
  - For VRAM monitoring:
    - NVIDIA GPU: No additional setup needed (uses nvidia-smi, included with the NVIDIA drivers)
    - AMD GPU: ROCm for Windows (if available for your GPU)
- Linux:
  - Python 3.8+
  - Ollama installed and running
  - For VRAM monitoring:
    - NVIDIA GPU: No additional setup needed (uses nvidia-smi; see the query sketch after this list)
    - AMD GPU: Either ROCm or radeontop (`sudo apt install radeontop` on Ubuntu/Debian)
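On NVIDIA systems, VRAM usage can be polled through nvidia-smi's query flags. A minimal sketch, assuming a local helper like the one below (not the tool's actual code):

```python
import subprocess

def nvidia_vram_percent():
    """Return current VRAM usage as a percentage, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # no NVIDIA GPU or driver tools on this system
    # One "used, total" line per GPU; use the first GPU for simplicity
    used, total = map(float, out.strip().splitlines()[0].split(","))
    return 100.0 * used / total
```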
- Clone this repository:

  ```bash
  git clone https://github.com/scionero/maxcontextfinder
  cd maxcontextfinder
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
Basic usage:

```bash
python main.py MODEL_NAME
```

Example:

```bash
python main.py codellama:latest
```
- `model`: (Required) The Ollama model name (e.g., 'codellama:latest', 'llama2:13b')
- `--min_token_rate`: Minimum acceptable tokens per second (default: 10)
- `--start`: Starting context size (default: 1024)
- `--step`: Step size for context increments (default: 1024)
- `--tests`: Number of tests per context size (default: 3)
Example with all options:

```bash
python main.py mistral:7b --min_token_rate 15 --start 2048 --step 2048 --tests 5
```
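The documented options map naturally onto a standard argparse interface. This is a sketch of the CLI's shape under that assumption, not the actual contents of main.py:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Find the maximum usable context size for an Ollama model.")
    parser.add_argument("model",
                        help="Ollama model name, e.g. 'codellama:latest'")
    parser.add_argument("--min_token_rate", type=float, default=10,
                        help="Minimum acceptable tokens per second")
    parser.add_argument("--start", type=int, default=1024,
                        help="Starting context size")
    parser.add_argument("--step", type=int, default=1024,
                        help="Step size for context increments")
    parser.add_argument("--tests", type=int, default=3,
                        help="Number of tests per context size")
    return parser.parse_args()
```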
The tool generates detailed logs including:
- Test parameters and configuration
- Performance metrics for each test
- VRAM usage statistics (when available)
- Token processing speeds
- Final recommended context size
Logs are saved to the `logs` directory with names of the form `context_test_MODEL_TIMESTAMP.log`.
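As a rough sketch, per-run log files of that form can be set up with the standard logging module (the timestamp format and model-name sanitisation below are assumptions):

```python
import logging
from datetime import datetime
from pathlib import Path

def make_logger(model: str) -> logging.Logger:
    """Create a log file named logs/context_test_MODEL_TIMESTAMP.log."""
    Path("logs").mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    safe_model = model.replace(":", "_").replace("/", "_")  # keep filenames portable
    logging.basicConfig(
        filename=Path("logs") / f"context_test_{safe_model}_{timestamp}.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return logging.getLogger("maxcontextfinder")
```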
- Windows:
  - NVIDIA GPUs: Fully supported through nvidia-smi
  - AMD GPUs: Supported through ROCm when available
  - Intel GPUs: Not currently supported
- Linux:
  - NVIDIA GPUs: Fully supported through nvidia-smi
  - AMD GPUs: Supported through either:
    - ROCm (preferred when available)
    - radeontop (fallback option)
  - Intel GPUs: Not currently supported
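One straightforward way to choose between these backends at runtime is to check which monitoring tool is on the PATH. The helper below is illustrative only and may not match the tool's actual detection logic:

```python
import shutil

def detect_vram_backend():
    """Pick a VRAM monitoring backend based on which tools are installed."""
    if shutil.which("nvidia-smi"):
        return "nvidia-smi"
    if shutil.which("rocm-smi"):   # ROCm's command-line monitor
        return "rocm-smi"
    if shutil.which("radeontop"):  # Linux fallback for AMD GPUs
        return "radeontop"
    return None  # no supported GPU tooling; run without VRAM monitoring
```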
- GPU Support:
  - NVIDIA GPUs are fully supported on both Windows and Linux
  - AMD GPU support varies by platform and available tools
  - Systems without supported GPUs will run without VRAM monitoring
- Framework Specificity: Results are specific to Ollama and may differ under other frameworks or configurations, such as:
  - Pure llama.cpp
  - vLLM
  - Different quantization methods
  - Other serving frameworks
- Hardware Dependence: Results depend on your hardware:
  - GPU memory and performance
  - CPU capabilities
  - System memory
  - Storage speed
The tool stops testing larger context sizes when:
- Token processing speed drops below the minimum threshold
- VRAM usage approaches 100%
- Model encounters errors or timeouts
The "maximum recommended context size" is the largest size that maintained acceptable performance across all metrics.
Contributions are welcome! Areas for improvement:
- Additional GPU support
- More performance metrics
- Support for other frameworks
- Better token counting accuracy
- Alternative testing methodologies
Please feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Share your testing results
- Suggest better testing methodologies
Results should be considered approximate. Real-world performance may vary based on:
- Specific prompt content
- Model implementation details
- System load and conditions
- Hardware configuration