Description
The LLM Guard plugin (plugins/external/llmguard/llmguardplugin) applies ML-based guardrails using the LLMGuard library to scan prompts for security threats, injections, and other risks. This analysis identifies blocking I/O operations, CPU-intensive computations in the async path, and inefficient algorithms that significantly impact request latency.
Critical Performance Issues
The LLM Guard plugin has severe performance bottlenecks primarily caused by:
- Synchronous Redis operations blocking on every cache access
- Blocking ML inference in LLMGuard scanner library
- Sequential scanner execution instead of parallel
- CPU-intensive operations (pickle, Levenshtein) in async path
The plugin executes ML inference and I/O synchronously in the request path, which is fundamentally incompatible with an async/await architecture.
Without these fixes, the plugin will:
- Block the event loop on every request
- Prevent concurrent request processing
- Create severe latency spikes (500ms+)
- Limit throughput to sequential processing
1. Blocking Redis Operations in Hot Path
Location: cache.py:62, 67, 83, 98
Severity: CRITICAL
Issue: All Redis operations use the synchronous redis.Redis client, blocking the async event loop on every cache operation:
self.cache = redis.Redis(host=redis_host, port=redis_port)  # Sync client

def update_cache(self, key: int = None, value: tuple = None) -> tuple[bool]:
    serialized_obj = pickle.dumps(value)  # Blocking serialization
    success_set = self.cache.set(key, serialized_obj)  # Blocking network I/O
    success_expiry = self.cache.expire(key, self.cache_ttl)  # Blocking network I/O
    return success_set, success_expiry

def retrieve_cache(self, key: int = None) -> tuple:
    value = self.cache.get(key)  # Blocking network I/O
    if value:
        retrieved_obj = pickle.loads(value)  # Blocking deserialization
        return retrieved_obj

Impact:
- Every cache operation blocks the event loop (2 operations per update: set + expire)
- Network latency to Redis directly adds to request latency
- Under load, creates severe bottleneck as all requests serialize on Redis operations
- Used in both pre-hook (lines 177, 225) and post-hook paths
Recommendation:
- Use redis.asyncio.Redis for async Redis operations
- Implement connection pooling with redis.asyncio.ConnectionPool
- Use pipelining to batch set + expire operations into a single round-trip
- Install redis[hiredis] for a faster protocol parser (the standalone aioredis project has been merged into redis-py as redis.asyncio)
- Make cache operations optional/configurable for low-latency scenarios
Example Fix:
import pickle

import redis.asyncio as aioredis

class CacheTTLDict:
    def __init__(self, ttl: int = 0, redis_host: str = "localhost", redis_port: int = 6379):
        self.cache_ttl = ttl
        self.cache = aioredis.from_url(f"redis://{redis_host}:{redis_port}")

    async def update_cache(self, key: int, value: tuple) -> tuple[bool, bool]:
        serialized_obj = pickle.dumps(value)  # Still sync, but fast for small objects
        async with self.cache.pipeline() as pipe:
            pipe.set(key, serialized_obj)  # Queued locally, not sent yet
            pipe.expire(key, self.cache_ttl)  # Queued in the same batch
            results = await pipe.execute()  # Single network round-trip for both commands
        return results[0], results[1]

2. Blocking Pickle Serialization/Deserialization
Location: cache.py:60, 85
Severity: HIGH
Issue: pickle.dumps() and pickle.loads() are synchronous CPU-intensive operations in the async path:
serialized_obj = pickle.dumps(value)  # Blocking CPU work
retrieved_obj = pickle.loads(value)  # Blocking CPU work

Impact:
- Serialization blocks event loop proportional to vault tuple size
- Large vaults (many anonymized entities) cause significant blocking
- No alternative fast path for small objects
Recommendation:
- Use asyncio.to_thread() for CPU-intensive pickle operations on large objects
- Consider faster serialization formats (msgpack, orjson) for structured data
- Implement a size threshold: small objects serialize inline, large objects offload to a thread (see the sketch after this list)
- Cache serialized representations if vault doesn't change
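A minimal sketch of the threshold approach; the cutoffs and helper names are illustrative, not from the plugin:
import asyncio
import pickle

OFFLOAD_THRESHOLD_BYTES = 64 * 1024  # Assumed cutoff; tune from profiling real vault sizes

async def loads_async(data: bytes) -> tuple:
    if len(data) < OFFLOAD_THRESHOLD_BYTES:
        return pickle.loads(data)  # Small payload: inline beats the thread-hop overhead
    return await asyncio.to_thread(pickle.loads, data)  # Large payload: off the event loop

async def dumps_async(value: tuple, size_hint: int) -> bytes:
    # pickle.dumps cannot know its output size in advance, so use a cheap
    # caller-supplied hint (e.g. number of vault entries) to decide whether to offload
    if size_hint < 100:  # Assumed entry-count threshold
        return pickle.dumps(value)
    return await asyncio.to_thread(pickle.dumps, value)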
3. LLMGuard Scanner Calls Block Event Loop
Location: llmguard.py:234, 259, 264, 285, 305
Severity: CRITICAL
Issue: All LLMGuard library scanner calls are synchronous and potentially CPU-intensive (ML model inference):
# Input filters - synchronous ML inference
for scanner in self.scanners["input"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)  # BLOCKING

# Input sanitizers - synchronous transformation
result = scan_prompt(self.scanners["input"]["sanitizers"], input_prompt)  # BLOCKING

# Vault leak detection - multiple synchronous operations
sanitized_output_de, _, _ = scanner.scan(result[0], input_prompt)  # BLOCKING
input_anonymize_score = word_wise_levenshtein_distance(input_prompt, result[0])  # BLOCKING
input_deanonymize_score = word_wise_levenshtein_distance(result[0], sanitized_output_de)  # BLOCKING

# Output operations
for scanner in self.scanners["output"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(original_input, model_response)  # BLOCKING

Impact:
- ML model inference can take 10-500ms per scanner
- Multiple scanners compound the latency (N scanners = N × inference time)
- Blocks entire event loop preventing other requests from processing
- CPU utilization spikes block other async tasks
- No parallelization of independent scanners
Recommendation:
- Run scanner operations in a thread pool using asyncio.to_thread(scanner.scan, ...)
- Execute independent scanners in parallel using asyncio.gather()
- Consider batching multiple inputs to scanners for better GPU utilization
- Add timeout protection to prevent runaway scanner operations (see the sketch after this list)
- Implement result caching for identical inputs (hash-based lookup)
- Profile which scanners are most expensive and prioritize optimization
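A sketch of the thread-pool offload with timeout protection; the 200 ms budget and the fail-closed fallback are assumptions, not plugin behavior:
import asyncio

SCANNER_TIMEOUT_S = 0.2  # Assumed per-scanner latency budget

async def scan_with_timeout(scanner, prompt: str):
    try:
        # Run the synchronous ML inference off the event loop, bounded by a timeout
        return await asyncio.wait_for(
            asyncio.to_thread(scanner.scan, prompt), timeout=SCANNER_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        # Note: the worker thread keeps running; the timeout only unblocks the caller
        return "", False, 1.0  # Fail closed: treat a timeout as a failed scan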
4. Expensive Levenshtein Distance Calculation
Location: policy.py:67-95, used in llmguard.py:265-266
Severity: MEDIUM-HIGH
Issue: Word-wise Levenshtein distance has O(n×m) complexity and runs synchronously in hot path:
def word_wise_levenshtein_distance(sentence1, sentence2):
    words1 = sentence1.split()
    words2 = sentence2.split()
    n, m = len(words1), len(words2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # O(n×m) memory
    for i in range(n + 1):
        dp[i][0] = i  # Base cases
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):  # O(n×m) computation
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1
    return dp[n][m]

Called twice per input when vault leak detection is enabled (lines 265-266).
Impact:
- For 100-word prompts: 100×100 = 10,000 operations × 2 calls = 20,000 operations
- Blocks event loop during computation
- Creates large temporary 2D arrays
- Called on every anonymized input with vault leak detection enabled
Recommendation:
- Run in a thread pool using asyncio.to_thread() for long prompts
- Use optimized C-extension libraries like python-Levenshtein or rapidfuzz
- Implement quick-reject heuristics (length difference threshold)
- Cache results for identical prompt pairs
- Consider approximate string matching for acceptable accuracy
- Only enable for critical security scenarios
Performance Comparison:
# Current: Pure Python O(n×m)
distance = word_wise_levenshtein_distance(s1, s2) # ~10ms for 100 words
# Optimized: rapidfuzz C extension
from rapidfuzz.distance import Levenshtein
distance = Levenshtein.distance(s1.split(), s2.split())  # ~0.1ms for 100 words

5. Policy Evaluation Using eval()
Location: policy.py:62
Severity: MEDIUM (Security: HIGH)
Issue: Uses eval() to evaluate policy expressions on every scan:
return eval(compile(tree, "<string>", "eval"), {}, policy_variables)

Impact:
- AST parsing and compilation on every evaluation
- eval() overhead even with sanitized expressions
- Executed on every filter result (input and output)
Security Concern: While AST validation helps, eval() is inherently risky
Recommendation:
- Pre-compile policy expressions during initialization and cache compiled code
- Use a dedicated expression evaluator (e.g., the simpleeval library)
- For common policies (AND/OR of all filters), use a fast path without eval
- Consider declarative policy format that compiles to Python functions
Example Optimization:
import ast

class GuardrailPolicy:
    def __init__(self):
        self._compiled_cache = {}

    def evaluate(self, policy: str, scan_result: dict):
        if policy not in self._compiled_cache:
            tree = ast.parse(policy, mode="eval")
            # ... validation ...
            self._compiled_cache[policy] = compile(tree, "<string>", "eval")
        policy_variables = {key: value["is_valid"] for key, value in scan_result.items()}
        return eval(self._compiled_cache[policy], {}, policy_variables)

6. Sequential Scanner Execution (No Parallelization)
Location: llmguard.py:233-241, 283-291
Severity: HIGH
Issue: Scanners execute sequentially even though they're independent:
result = {}
for scanner in self.scanners["input"]["filters"]:  # Sequential execution
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)
    scanner_name = type(scanner).__name__
    result[scanner_name] = {...}

Impact:
- Total latency = sum of all scanner latencies
- 5 scanners × 50ms each = 250ms total (could be 50ms if parallel)
- Wastes CPU cores and GPU resources
- No benefit from async execution
Recommendation:
- Execute independent scanners in parallel using asyncio.gather()
- Run each scanner in the thread pool concurrently
- Aggregate results after all complete
- Handle failures gracefully (some scanners may fail)
Example Fix:
async def _apply_input_filters(self, input_prompt):
    async def scan_async(scanner):
        return await asyncio.to_thread(scanner.scan, input_prompt)

    tasks = [scan_async(scanner) for scanner in self.scanners["input"]["filters"]]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    result = {}
    for scanner, scan_result in zip(self.scanners["input"]["filters"], results):
        if isinstance(scan_result, Exception):
            logger.error(f"Scanner {scanner} failed: {scan_result}")
            continue
        sanitized_prompt, is_valid, risk_score = scan_result
        result[type(scanner).__name__] = {...}
    return result

7. Inefficient Context Updates with Nested Loops
Location: plugin.py:75-113
Severity: MEDIUM
Issue: Complex nested conditional logic and loops for context updates:
def update_context(context):
    plugin_name = self.__class__.__name__
    if plugin_name not in context.state[self.guardrails_context_key]:
        context.state[self.guardrails_context_key][plugin_name] = {}
    if key not in context.state[self.guardrails_context_key][plugin_name]:
        context.state[self.guardrails_context_key][plugin_name][key] = value
    else:
        if isinstance(value, dict):
            for k, v in value.items():  # Nested loop 1
                if k not in context.state[self.guardrails_context_key][plugin_name][key]:
                    context.state[self.guardrails_context_key][plugin_name][key][k] = v
                else:
                    if isinstance(v, dict):
                        for k_sub, v_sub in v.items():  # Nested loop 2
                            context.state[self.guardrails_context_key][plugin_name][key][k][k_sub] = v_sub

Impact:
- Multiple dictionary lookups on hot path
- Nested loops for complex values
- Called multiple times per request (lines 133, 145, 164, 181, 231, 244)
- Creates/updates context even when set_guardrails_context=False (the flag is checked after the update)
Recommendation:
- Check self.lgconfig.set_guardrails_context early and return if disabled
- Use setdefault() to reduce lookups (see the sketch after this list)
- Consider a flattened key structure instead of nested dicts
- Cache plugin_name and guardrails_context_key access
- Only update context when actually needed
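A sketch of the early-return plus setdefault() pattern; the merge semantics mirror the nested loops above and assume nested values are dicts:
def update_context(self, context, key, value):
    if not self.lgconfig.set_guardrails_context:
        return  # Skip all dictionary work when context output is disabled
    plugin_state = context.state[self.guardrails_context_key].setdefault(
        self.__class__.__name__, {}
    )
    if not isinstance(value, dict):
        plugin_state.setdefault(key, value)
        return
    slot = plugin_state.setdefault(key, {})
    for k, v in value.items():
        if isinstance(v, dict):
            slot.setdefault(k, {}).update(v)  # Overwrites sub-keys, as the original does
        else:
            slot.setdefault(k, v)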
8. Vault Recreation Overhead
Location: llmguard.py:49-69, 254-258
Severity: MEDIUM
Issue: Checks vault expiry on every sanitizer call and recreates vault:
def _create_new_vault_on_expiry(self, vault) -> bool:
    logger.info(f"Vault creation time {vault.creation_time}")  # Unnecessary logging
    logger.info(f"Vault ttl {self.vault_ttl}")
    if datetime.now() - vault.creation_time > timedelta(seconds=self.vault_ttl):  # On every call
        del vault  # Manual deletion
        logger.info("Vault successfully deleted after expiry")
        self._update_input_sanitizers()  # Reinitializes scanners
        return True
    return False

# Called on every sanitizer application:
vault, _, _ = self._retreive_vault()
vault_update_status = self._create_new_vault_on_expiry(vault)  # Every time!

Impact:
- datetime.now() and timedelta operations on every call
- Unnecessary vault retrieval when TTL=0 (disabled)
- Scanner reinitialization overhead when vault expires
- Excessive logging in hot path
Recommendation:
- Check TTL once during init; skip expiry check if TTL=0
- Cache an absolute expiry timestamp instead of recalculating each time (see the sketch below)
- Use Redis TTL for vault expiry instead of application logic
- Reduce logging verbosity (debug level, not info)
- Don't reinitialize scanners, just update vault reference
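A sketch of the cached-deadline check using a monotonic clock; the class and attribute names are illustrative:
import time

class VaultExpiry:
    def __init__(self, vault_ttl: int):
        self.vault_ttl = vault_ttl
        # Compute the absolute deadline once instead of doing datetime math per call
        self._deadline = time.monotonic() + vault_ttl if vault_ttl > 0 else None

    def expired(self) -> bool:
        if self._deadline is None:
            return False  # TTL disabled: zero per-call overhead
        if time.monotonic() < self._deadline:
            return False
        self._deadline = time.monotonic() + self.vault_ttl  # Arm for the new vault
        return True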
9. Repeated Scanner Initialization
Location: llmguard.py:162-220
Severity: MEDIUM
Issue: Scanner initialization in __init_scanners() uses get_scanner_by_name() which may be expensive:
for filter_name in policy_filter_names:
    self.scanners["input"]["filters"].append(
        input_scanners.get_scanner_by_name(filter_name, self.lgconfig.input.filters[filter_name])
    )  # May load ML models, download resources

Impact:
- Happens during plugin initialization (not hot path)
- ML model loading can take seconds
- Blocks plugin startup
- No lazy loading or async initialization
Recommendation:
- Make initialize() async and await scanner initialization
- Load scanners lazily on first use
- Warm up scanners concurrently during startup (see the sketch below)
- Cache loaded models across plugin instances if possible
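A sketch of concurrent warm-up from an async initialize(), reusing the plugin's existing input_scanners import and assuming get_scanner_by_name is safe to call from worker threads and lgconfig.input.filters maps scanner names to configs:
import asyncio

async def initialize(self) -> None:
    names = list(self.lgconfig.input.filters)
    # Each get_scanner_by_name call may download or load a model; run them concurrently
    loaded = await asyncio.gather(
        *(
            asyncio.to_thread(
                input_scanners.get_scanner_by_name,
                name,
                self.lgconfig.input.filters[name],
            )
            for name in names
        )
    )
    self.scanners["input"]["filters"] = list(loaded)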
Moderate Performance Issues
10. Excessive Logging in Hot Path
Location: Throughout plugin.py and llmguard.py
Severity: LOW-MEDIUM
Issue: Info-level logging on every operation:
logger.info(f"Processing payload {payload}") # Line 125
logger.info(f"Applying input guardrail filters on {payload.args[key]}") # Line 138
logger.info(f"Result of input guardrail filters: {result}") # Line 141
logger.info(f"Result of policy decision: {decision}") # Line 143
# ... many more

Impact:
- String formatting overhead even if logging is disabled
- I/O operations if logging to file
- Potential sensitive data leakage in logs
Recommendation:
- Use debug level for detailed operational logs
- Use lazy evaluation: logger.debug("Result: %s", result) instead of f-strings
- Guard expensive logging with if logger.isEnabledFor(logging.DEBUG):
- Avoid logging payload/result contents in production (see the examples below)
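For example (the guard pattern itself; the message contents are illustrative):
import logging

logger = logging.getLogger(__name__)

def log_results(result: dict, payload) -> None:
    # %s arguments are only formatted if DEBUG is actually enabled
    logger.debug("Result of input guardrail filters: %s", result)
    # Guard genuinely expensive work (large repr, serialization) explicitly
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Processing payload %s", repr(payload))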
11. No Result Caching for Repeated Inputs
Location: All scanner methods in llmguard.py
Severity: MEDIUM
Issue: Same prompts scanned repeatedly without caching:
result = self.llmguard_instance._apply_input_filters(payload.args[key])
# No cache lookup before expensive scanning

Impact:
- Identical prompts scanned multiple times
- Wastes CPU/GPU resources
- Increases latency unnecessarily
- Common in batch processing or similar queries
Recommendation:
- Implement LRU cache for scan results keyed by (prompt_hash, scanner_config)
- Use TTL-based invalidation (e.g., 5 minutes)
- Consider cache size limits (e.g., 10,000 entries)
- Make caching configurable
Example:
import hashlib
from functools import lru_cache

def _cache_key(self, prompt: str, scanner_type: str) -> str:
    return f"{scanner_type}:{hashlib.sha256(prompt.encode()).hexdigest()}"

# Caveat: lru_cache on an instance method also keys on (and retains) self;
# a per-instance or external TTL cache is safer in production.
@lru_cache(maxsize=10000)
def _apply_input_filters_cached(self, prompt: str) -> dict:
    return self._apply_input_filters(prompt)

12. Repeated Context State Checks
Location: plugin.py:206-219
Severity: LOW-MEDIUM
Issue: Multiple redundant checks for context keys:
if self.guardrails_context_key in context.state:
    original_prompt = context.state[self.guardrails_context_key]["original_prompt"] if "original_prompt" in context.state[self.guardrails_context_key] else ""
    vault_id = context.state[self.guardrails_context_key]["vault_cache_id"] if "vault_cache_id" in context.state[self.guardrails_context_key] else None
else:
    context.state[self.guardrails_context_key] = {}

if self.guardrails_context_key in context.global_context.state:  # Duplicate check
    # Same logic repeated for global context

Impact:
- Multiple dictionary lookups for same keys
- Repeated checks in both local and global context
- Code duplication
Recommendation:
- Use .get() with defaults: context.state.get(key, {}).get("original_prompt", "")
- Cache the context dictionary reference
- Extract common logic to a helper method (see the sketch below)
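A sketch of the shared helper; the name _read_guardrails_state is illustrative:
def _read_guardrails_state(self, state: dict) -> tuple:
    guardrails = state.setdefault(self.guardrails_context_key, {})
    return guardrails.get("original_prompt", ""), guardrails.get("vault_cache_id")

# Usage: the same helper serves both contexts, removing the duplicated checks
original_prompt, vault_id = self._read_guardrails_state(context.state)
g_prompt, g_vault_id = self._read_guardrails_state(context.global_context.state)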
13. Inefficient Vault Retrieval
Location: llmguard.py:83-104
Severity: LOW
Issue: Loops through all sanitizers to find vault:
length = len(self.scanners["input"]["sanitizers"])
for i in range(length):  # Linear search
    scanner_name = type(self.scanners["input"]["sanitizers"][i]).__name__
    if scanner_name in sanitizer_names:
        # ... access vault
- Linear search through scanners on every call
- Repeated type introspection: type(scanner).__name__
- Variable i used outside the loop (line 104) - a potential bug
Recommendation:
- Build scanner name → scanner index mapping during initialization
- Cache vault reference instead of retrieving each time
- Fix the potential bug where i may be undefined if the loop body never executes (see the sketch below)
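A sketch of the name-to-scanner mapping built once at initialization; the "Anonymize" key and the _vault attribute are assumptions, and the real accessor may differ:
# During __init__: one-time introspection instead of a per-call linear scan
self._sanitizers_by_name = {
    type(s).__name__: s for s in self.scanners["input"]["sanitizers"]
}

def _retrieve_vault(self):
    anonymize = self._sanitizers_by_name.get("Anonymize")
    if anonymize is None:
        return None  # No vault-bearing sanitizer configured; avoids the undefined-i path
    return anonymize._vault  # Assumed attribute name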
14. Exception Handling Without Specificity
Location: llmguard.py:102, 120, 138, 168, 187, 197, 209
Severity: LOW
Issue: Broad exception catching that swallows errors:
except Exception as e:
    logger.error(f"Error retrieving scanner {scanner_name}: {e}")

Impact:
- Hides bugs and makes debugging difficult
- Continues execution with partial failures
- No error propagation to caller
Recommendation:
- Catch specific exceptions (ValueError, KeyError, etc.)
- Propagate critical errors to caller
- Use try-except only for expected error conditions (see the example below)
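For example, narrowing the catch and re-raising what the caller must see (the exception choices here are illustrative):
try:
    scanner = input_scanners.get_scanner_by_name(scanner_name, config)
except (KeyError, ValueError) as e:
    # Expected: unknown scanner name or invalid config; log and skip this scanner
    logger.error("Error retrieving scanner %s: %s", scanner_name, e)
except Exception:
    # Unexpected failure: log with traceback and propagate instead of swallowing
    logger.exception("Unexpected error retrieving scanner %s", scanner_name)
    raise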
Minor Performance Considerations
15. String Operations in Hot Path
Location: plugin.py:149, 168, 248
Issue: String formatting for error messages created unconditionally:
description="{threat} detected in the prompt".format(threat=list(decision[2].keys())[0])

Impact: Minimal, but creates temporary strings and lists
Recommendation: Pre-format common error messages
16. Type Introspection for Scanner Names
Location: llmguard.py:235, 286
Issue: type(scanner).__name__ called for every scanner
Impact: Small overhead, but could be cached
Recommendation: Store scanner name as attribute during initialization
17. Manual Deletion with del
Location: llmguard.py:64, 116
Issue: Explicit del doesn't guarantee immediate memory release
Recommendation: Rely on Python's garbage collection; remove del statements
Architectural Recommendations
1. Async-First Design
Convert entire plugin to async/await pattern:
- Redis operations → async
- Scanner calls → run in thread pool
- Pickle operations → offload for large objects
- Context updates → async if needed
2. Scanner Execution Pipeline
Implement efficient scanner pipeline:
- Cache lookup (fast path)
- Parallel independent scanner execution
- Result aggregation
- Policy evaluation (pre-compiled)
- Cache storage
3. Lazy Initialization
- Load scanners on first use, not during __init__
- Defer vault creation until needed
- Initialize Redis connection pool asynchronously
4. Observability & Monitoring
- Add metrics for scanner execution time
- Track cache hit rates
- Monitor vault expiry and recreation
- Profile ML inference latency
- Add timeout alerts
Performance Testing Recommendations
Load Testing Scenarios
- Baseline: Single scanner, simple prompts
- Multiple Scanners: 5+ scanners, typical prompts
- Large Prompts: 1000+ word prompts with vault leak detection
- Cache Behavior: Repeated vs unique prompts
- Vault Expiry: Performance during vault recreation
Metrics to Track
- End-to-end latency: Pre-hook and post-hook separately
- Scanner latency: Per-scanner breakdown
- Redis latency: Get/set/pipeline operations
- Cache hit rate: For result caching
- CPU utilization: During ML inference
- Memory usage: Vault size, scanner model memory
Profiling Tools
- cProfile: Identify expensive functions
- py-spy: Low-overhead async profiling
- memory_profiler: Track memory allocations
- Redis SLOWLOG: Identify slow cache operations
- asyncio debug mode: Detect blocking operations
Implementation Priority
Phase 1 - Critical
- Async Redis operations - Single biggest bottleneck
- Parallelize scanner execution - 3-5x speedup potential
- Offload ML inference to threads - Prevent event loop blocking
- Fix pickle blocking - Use async for large objects
Phase 2 - High Impact
- Optimize Levenshtein calculation (use C extension)
- Implement scan result caching
- Pre-compile policy expressions
- Reduce context update overhead
- Optimize vault expiry checking
Phase 3 - Incremental (Future)
- Lazy scanner initialization
- Reduce logging overhead
- Optimize vault retrieval
- Improve error handling specificity
- Add comprehensive metrics
Related Files
- plugins/external/llmguard/llmguardplugin/plugin.py - Main plugin implementation
- plugins/external/llmguard/llmguardplugin/llmguard.py - LLMGuard wrapper and scanner logic
- plugins/external/llmguard/llmguardplugin/cache.py - Redis caching layer
- plugins/external/llmguard/llmguardplugin/policy.py - Policy evaluation and utilities
- plugins/external/llmguard/llmguardplugin/schema.py - Configuration schema