This project implements a high-performance HTTP server with an in-memory key-value cache, built using Rust and the Axum framework. The server is designed for low-latency operations, leveraging lock-free data structures, multithreading, and HTTP/2 optimizations.
This is a low-latency cache built for high load and optimized for concurrent access. See the performance data for nuance around low-throughput usage, where HTTP/1.1 is more performant.
- Lock-Free Data Structures: Utilizes the `papaya` crate's lock-free `HashMap` to enable concurrent access without traditional locking mechanisms.
- Multithreading: Employs a custom Tokio runtime with multiple worker threads to handle high load efficiently.
- HTTP/2 Streaming Optimizations: Leverages Axum's support for HTTP/2 to improve throughput and reduce latency under concurrent requests.
Note: this has been optimized for a 4-core r5.xlarge instance on AWS. Future work involves dynamically tuning the async runtime as a learned function of the available cores.
The server supports basic CRUD operations:
- `GET /` - Returns a "Hello, World!" message.
- `POST /store/{key}` - Stores a value for the given key.
- `GET /get/{key}` - Retrieves the value associated with a key.
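
The sketch below shows roughly how these endpoints could be wired up with Axum; the handler names and shared-state type are illustrative rather than the project's actual code, and the `{key}` path syntax assumes Axum 0.8.

```rust
use std::sync::Arc;
use axum::{
    extract::{Path, State},
    routing::{get, post},
    Router,
};
use papaya::HashMap;

type Store = Arc<HashMap<String, String>>;

async fn hello() -> &'static str {
    "Hello, World!"
}

// Stores the raw request body under the given key.
async fn store_value(State(store): State<Store>, Path(key): Path<String>, body: String) {
    store.pin().insert(key, body);
}

// Returns the stored value, or an empty string if the key is missing
// (the real handler may return a 404 instead).
async fn get_value(State(store): State<Store>, Path(key): Path<String>) -> String {
    store.pin().get(&key).cloned().unwrap_or_default()
}

fn router(store: Store) -> Router {
    Router::new()
        .route("/", get(hello))
        .route("/store/{key}", post(store_value))
        .route("/get/{key}", get(get_value))
        .with_state(store)
}
```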
### Flags
```bash
# Run with default settings (port 3000, 250 concurrent connections)
cargo run --release

# Run with custom settings
cargo run --release -- --port 8080 --concurrency-limit 100
```
### Usage
```bash
# Store a value
curl -X POST http://localhost:3000/store/mykey -d "my value"

# Retrieve a value
curl http://localhost:3000/get/mykey

# Health check
curl http://localhost:3000/
```
The server has been benchmarked with a workload of 80% reads and 20% writes, simulating a typical caching scenario. The results below compare different configurations:
The best-performing hashmap benchmarked for fine-grained locking is `dashmap`; a newer map, `papaya`, is a truer lock-free implementation. Here, both are benchmarked on the 80/20 workload.
Direct comparison: above roughly 3,000 TPS, H2C becomes more efficient as HTTP/1.1's I/O overhead starts to dominate.

Take a look at the results folder for flamegraphs and numerical data.
The server incorporates several optimizations to achieve its low-latency goals:
- Lock-Free HashMap:
  - Uses `papaya::HashMap` for thread-safe, lock-free key-value storage.
  - Pre-allocates capacity for 1,000,000 entries to reduce resizing overhead.
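
  A minimal sketch of the pattern (types and capacity mirror the bullets above, not necessarily the project's exact code):

  ```rust
  use std::sync::Arc;
  use papaya::HashMap;

  fn main() {
      // Pre-allocate so the table does not need to resize under load.
      let store: Arc<HashMap<String, String>> = Arc::new(HashMap::with_capacity(1_000_000));

      // All access goes through a pinned guard; readers and writers never block each other.
      let map = store.pin();
      map.insert("mykey".to_string(), "my value".to_string());
      assert_eq!(map.get("mykey"), Some(&"my value".to_string()));
  }
  ```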
- Custom Tokio Runtime:
  - Configured with 8 worker threads to maximize CPU utilization.
  - Sets a 4MB stack size per thread to handle deep call stacks.
  - Uses a global queue interval of 10 for efficient task distribution.
  - Implements a 100ms thread keep-alive to reduce thread churn.
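
  A sketch of how such a runtime might be built; the values mirror the settings above, and the exact wiring in the project is an assumption:

  ```rust
  use std::time::Duration;
  use tokio::runtime::Builder;

  fn main() {
      let runtime = Builder::new_multi_thread()
          .worker_threads(8)                              // one worker per target logical core
          .thread_stack_size(4 * 1024 * 1024)             // 4MB stacks for deep call chains
          .global_queue_interval(10)                      // check the global queue every 10 scheduler ticks
          .thread_keep_alive(Duration::from_millis(100))  // recycle idle blocking threads quickly
          .enable_all()
          .build()
          .expect("failed to build Tokio runtime");

      runtime.block_on(async {
          // start the Axum server here
      });
  }
  ```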
- Optimized TCP Listener:
  - Enables `SO_REUSEADDR` (and `SO_REUSEPORT` on Unix) for faster restarts.
  - Sets `TCP_NODELAY` to disable Nagle's algorithm, reducing latency.
  - Configures 4MB send/receive buffers for high-throughput scenarios.
  - Supports a backlog of 4096 connections to handle bursts of traffic.
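
  A sketch of this setup using the `socket2` crate (assumed here; the project may configure the socket differently). The same snippet also covers the socket-configuration item below: `Domain::for_address` picks IPv4 or IPv6, and the socket is switched to non-blocking mode for Tokio.

  ```rust
  use std::net::{SocketAddr, TcpListener};
  use socket2::{Domain, Protocol, Socket, Type};

  fn build_listener(addr: SocketAddr) -> std::io::Result<TcpListener> {
      // Pick AF_INET or AF_INET6 based on the bind address.
      let socket = Socket::new(Domain::for_address(addr), Type::STREAM, Some(Protocol::TCP))?;

      socket.set_reuse_address(true)?; // SO_REUSEADDR: faster restarts
      #[cfg(unix)]
      socket.set_reuse_port(true)?;    // SO_REUSEPORT (requires socket2's "all" feature)
      socket.set_nodelay(true)?;       // TCP_NODELAY: disable Nagle's algorithm
      socket.set_send_buffer_size(4 * 1024 * 1024)?;
      socket.set_recv_buffer_size(4 * 1024 * 1024)?;
      socket.set_nonblocking(true)?;   // required before handing the listener to Tokio

      socket.bind(&addr.into())?;
      socket.listen(4096)?;            // large backlog absorbs connection bursts
      Ok(socket.into())                // convert to std, then tokio::net::TcpListener::from_std
  }
  ```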
- Concurrency Control:
  - Applies a `ConcurrencyLimitLayer` with a configurable limit (default 250, capped at 500) to prevent overload.
  - Uses `Arc` for shared state to ensure safe, efficient access across threads.
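
  Roughly what this looks like with `tower`'s `ConcurrencyLimitLayer`; the route set is trimmed for brevity and the cap of 500 mirrors the bullet above:

  ```rust
  use std::sync::Arc;
  use axum::{routing::get, Router};
  use papaya::HashMap;
  use tower::limit::ConcurrencyLimitLayer;

  fn app(concurrency_limit: usize) -> Router {
      // Shared state is an Arc around the lock-free map; cloning only bumps a refcount.
      let store: Arc<HashMap<String, String>> = Arc::new(HashMap::new());

      Router::new()
          .route("/", get(|| async { "Hello, World!" }))
          // Bound the number of in-flight requests instead of queuing without limit.
          .layer(ConcurrencyLimitLayer::new(concurrency_limit.min(500)))
          .with_state(store)
  }
  ```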
- Socket Configuration:
  - Dynamically selects IPv4 or IPv6 based on the bind address (see the listener sketch above).
  - Operates in non-blocking mode for asynchronous I/O.
- Command-Line Arguments:
  - Allows customization of port and concurrency limit via CLI flags, with sensible defaults (port 3000, concurrency 250).
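
  A sketch of the flag parsing, assuming `clap`'s derive API (the actual crate and field names may differ):

  ```rust
  use clap::Parser;

  /// CLI flags with the defaults described above.
  #[derive(Parser, Debug)]
  struct Args {
      /// Port to listen on.
      #[arg(long, default_value_t = 3000)]
      port: u16,

      /// Maximum number of in-flight requests (capped at 500).
      #[arg(long, default_value_t = 250)]
      concurrency_limit: usize,
  }

  fn main() {
      let args = Args::parse();
      let limit = args.concurrency_limit.min(500);
      println!("port={} concurrency_limit={}", args.port, limit);
  }
  ```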
- Graceful Shutdown:
  - Implements a `shutdown_signal` handler for clean termination on CTRL+C.
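
  A sketch of the handler and how it hooks into `axum::serve` (Axum 0.7+ API; the exact wiring in the project may differ):

  ```rust
  use tokio::signal;

  // Resolves once CTRL+C is received; Axum then stops accepting new
  // connections and drains in-flight requests before the serve future completes.
  async fn shutdown_signal() {
      signal::ctrl_c()
          .await
          .expect("failed to install CTRL+C handler");
  }

  // Usage:
  // axum::serve(listener, app)
  //     .with_graceful_shutdown(shutdown_signal())
  //     .await?;
  ```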
These optimizations collectively ensure the server can handle high concurrency with minimal latency, making it ideal for caching use cases.
Several areas are planned for further improvement:
- Zero-Copy Handling: Investigate zero-copy techniques for request/response processing to reduce memory overhead and improve throughput.
- Runtime Experimentation: Explore additional Tokio runtime configurations (e.g., thread pool size, queue strategies) to fine-tune performance.
- Hash Function Experimentation: Test alternative hash functions in `papaya::HashMap` to optimize for different workload patterns (e.g., collision resistance, speed).
- NUMA Awareness:
  - Implement NUMA (Non-Uniform Memory Access) awareness to optimize memory access patterns on multi-socket systems.
  - Pin worker threads to specific CPU cores to reduce cross-socket memory traffic. Not implemented here due to time constraints; otherwise this would be the first-choice solution, but it requires a custom Hyper build.


