
Conversation

Contributor

@Evrard-Nil Evrard-Nil commented Dec 9, 2025

This pull request refactors the configuration and initialization logic for inference providers, replacing the previous "model discovery" system with a new "inference router" approach. The changes simplify how inference provider pools are initialized and configured, update environment variable usage, and clean up related shutdown and test logic.

Configuration Refactor

  • Replaced the ModelDiscoveryConfig struct and all related references with a new InferenceRouterConfig struct, updating field names, default values, and environment variable usage throughout crates/config/src/types.rs.

Inference Provider Pool Initialization

  • Refactored init_inference_providers to use the inference router endpoint and removed model discovery refresh logic, simplifying pool creation and startup in crates/api/src/lib.rs.

Test and Mock Logic Updates

  • Updated test configuration and mock initialization to use the new inference router config, replacing model discovery references and removing unused refresh logic in both crates/api/src/lib.rs and crates/api/tests/common/mod.rs.

Shutdown Procedure

  • Updated shutdown logic to clean up inference provider pool resources instead of cancelling model discovery background tasks, with improved logging in crates/api/src/main.rs.

Note

Replaces model discovery with a single inference router, simplifying the provider pool and config, updating shutdown/tests, and adding router attestation to reports.

  • Core Services:
    • Inference Provider Pool: Simplified to a single router_provider with direct forwarding (chat_completion, streaming, models, get_attestation_report). Removed discovery, load-balancing, fallback/retry, periodic refresh, and error sanitization logic. Streamlined shutdown to clear chat/signature maps only.
    • Attestation: Added router_attestation to AttestationReport and API response. Extracts per-model attestations from all_attestations and treats remaining provider payload as router attestation.
  • API:
    • Initialization: init_inference_providers now uses config.inference_router (router_url, api_key, inference_timeout). Removed discovery init/refresh calls.
    • Shutdown: Renamed and simplified Phase 1 to clean up inference resources.
  • Config:
    • Replaced ModelDiscoveryConfig with InferenceRouterConfig and updated env vars (INFERENCE_ROUTER_URL, INFERENCE_ROUTER_API_KEY, MODEL_INFERENCE_TIMEOUT). Updated ApiConfig/DomainConfig wiring.
  • Tests/Mocks:
    • Updated test configs to use InferenceRouterConfig. Simplified mock pool setup to work with the new router-based pool.
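
The config shape described above can be sketched roughly as follows. This is a hedged sketch, not the actual implementation: the field names and environment variables come from the PR summary, while the defaults and the from_env structure are assumptions.

```rust
use std::env;

// Sketch of InferenceRouterConfig as described in the PR summary.
// Defaults and the from_env shape are assumptions for illustration.
#[derive(Debug, Clone)]
pub struct InferenceRouterConfig {
    pub router_url: String,
    pub api_key: Option<String>,
    pub inference_timeout: i64, // seconds
}

impl InferenceRouterConfig {
    pub fn from_env() -> Self {
        Self {
            router_url: env::var("INFERENCE_ROUTER_URL")
                .unwrap_or_else(|_| "http://localhost:8080".to_string()),
            api_key: env::var("INFERENCE_ROUTER_API_KEY").ok(),
            inference_timeout: env::var("MODEL_INFERENCE_TIMEOUT")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(30 * 60), // assumed 30-minute default
        }
    }
}
```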

Written by Cursor Bugbot for commit 8168edf.


Copilot AI left a comment


Pull request overview

This pull request refactors the inference provider system from a distributed "model discovery" approach with load balancing and failover to a simplified "inference router" design that delegates all routing logic to a single external endpoint.

Key changes:

  • Replaced ModelDiscoveryConfig with InferenceRouterConfig, removing periodic refresh and discovery-specific settings
  • Simplified InferenceProviderPool from a complex load balancer with multiple providers to a thin wrapper around a single router endpoint
  • Removed background refresh tasks, retry/fallback logic, and multi-provider coordination from the pool implementation
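
In rough terms, the "thin wrapper" pool described above reduces to something like the following. This is a synchronous, std-only sketch under stated assumptions: the real pool is async and wraps a provider trait object, and only the names router_provider and chat_id_mapping come from the PR summary.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the async provider trait used by the real pool.
trait InferenceProvider {
    fn chat_completion(&self, model: &str) -> String;
}

// A single provider pointing at the external router endpoint.
struct RouterProvider {
    url: String,
}

impl InferenceProvider for RouterProvider {
    fn chat_completion(&self, model: &str) -> String {
        // The router endpoint decides which backend serves the model.
        format!("POST {}/v1/chat/completions model={}", self.url, model)
    }
}

// The simplified pool: direct forwarding, no load balancing or fallback.
struct InferenceProviderPool {
    router_provider: Arc<dyn InferenceProvider>,
    chat_id_mapping: Arc<RwLock<HashMap<String, String>>>,
}

impl InferenceProviderPool {
    fn new(router_url: String) -> Self {
        Self {
            router_provider: Arc::new(RouterProvider { url: router_url }),
            chat_id_mapping: Arc::new(RwLock::new(HashMap::new())),
        }
    }

    // Forward the request straight to the router provider.
    fn chat_completion(&self, model: &str) -> String {
        self.router_provider.chat_completion(model)
    }

    // Shutdown only clears local maps, as the review notes.
    fn shutdown(&self) {
        self.chat_id_mapping.write().unwrap().clear();
    }
}
```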

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

  • crates/config/src/types.rs: Renamed configuration struct and environment variables from MODEL_DISCOVERY_* to INFERENCE_ROUTER_*, removed refresh_interval field
  • crates/services/src/inference_provider_pool/mod.rs: Simplified pool to use single router provider, removed discovery/load-balancing/retry logic, updated shutdown to only clear local mappings
  • crates/api/src/lib.rs: Updated initialization to create pool with router URL instead of discovery server, simplified mock provider setup
  • crates/api/src/main.rs: Updated shutdown logging to reflect resource cleanup instead of background task cancellation
  • crates/api/tests/common/mod.rs: Updated test configuration to use inference_router field instead of model_discovery


pub api_key: Option<String>,
pub refresh_interval: i64, // seconds
- pub timeout: i64, // seconds (for discovery requests)
+ pub timeout: i64, // seconds (for router requests)

Copilot AI Dec 9, 2025


The timeout field is defined but never used in the new implementation. The pool creation only uses inference_timeout (via config.inference_router.inference_timeout). This appears to be leftover from the previous model discovery implementation where it was used for discovery HTTP requests.

Consider either:

  1. Removing this field entirely if it's no longer needed
  2. Using it for the router's HTTP client timeout if that's the intent
  3. Documenting why it exists and when it might be used in the future

If removing, also update:

  • The from_env() method (lines 132-135)
  • The Default implementation (lines 150-153)
  • The corresponding environment variable INFERENCE_ROUTER_TIMEOUT

Comment on lines 435 to 447
// Create a MockProvider that accepts all models (using new_accept_all)
let mock_provider = Arc::new(MockProvider::new_accept_all());

// For the new simplified pool, we need to create a wrapper pool that uses the mock provider
// Since the pool now expects a router URL, we'll create it with a dummy URL
// but the mock will be used directly in tests via a different mechanism
let pool = Arc::new(
services::inference_provider_pool::InferenceProviderPool::new(
"http://localhost:8080/models".to_string(),
"http://localhost:8080".to_string(),
None,
5,
30 * 60,
),
);

Copilot AI Dec 9, 2025


The mock provider is created but never used. The pool is initialized with a real VLlmProvider pointing to "http://localhost:8080", which means tests will attempt real HTTP requests instead of using the mock. This defeats the purpose of having a mock provider.

Consider creating a mechanism to inject the mock provider into the pool, or create a separate test-only pool constructor that accepts a provider directly. For example:

// Add to InferenceProviderPool
pub fn new_with_provider(provider: Arc<InferenceProviderTrait>) -> Self {
    Self {
        router_provider: provider,
        chat_id_mapping: Arc::new(RwLock::new(HashMap::new())),
        signature_hashes: Arc::new(RwLock::new(HashMap::new())),
    }
}

// Then in this function:
let mock_provider_trait: Arc<InferenceProviderTrait> = mock_provider.clone();
let pool = Arc::new(
    services::inference_provider_pool::InferenceProviderPool::new_with_provider(
        mock_provider_trait
    )
);

Comment on lines 37 to 58
/// Register a provider for testing (useful for testing with mock providers)
pub async fn register_provider(
&self,
_model_id: String,
_provider: Arc<InferenceProviderTrait>,
) {
// For testing, we replace the router provider with the mock
// This is a simplified approach - in tests we'll use the mock directly
// Note: We can't actually replace router_provider since it's Arc, but tests
// should call register_providers which creates a pool differently
tracing::debug!("register_provider called - this is mainly for backwards compatibility");
}

- /// Register multiple providers for multiple models (useful for testing)
+ /// Register multiple providers for testing
+ /// In the new simplified design, this creates a new pool with a mock provider
pub async fn register_providers(&self, providers: Vec<(String, Arc<InferenceProviderTrait>)>) {
let mut model_mapping = self.model_mapping.write().await;
for (model_id, provider) in providers {
model_mapping
.entry(model_id)
.or_insert_with(Vec::new)
.push(provider);
}
}

/// Initialize model discovery - should be called during application startup
pub async fn initialize(&self) -> Result<(), ListModelsError> {
tracing::debug!(
url = %self.discovery_url,
"Initializing model discovery from discovery server"
);

match self.discover_models().await {
Ok(models_response) => {
tracing::info!(
"Successfully discovered {} models",
models_response.data.len()
);
Ok(())
}
Err(e) => {
tracing::error!("Failed to initialize model discovery");
Err(e)
}
}
}

/// Fetch and parse models from discovery endpoint
async fn fetch_from_discovery(
&self,
) -> Result<HashMap<String, DiscoveryEntry>, ListModelsError> {
tracing::debug!(
url = %self.discovery_url,
"Fetching models from discovery server"
);

let client = reqwest::Client::builder()
.timeout(self.discovery_timeout)
.build()
.map_err(|e| {
ListModelsError::FetchError(format!("Failed to create HTTP client: {e}"))
})?;

let response = client
.get(&self.discovery_url)
.send()
.await
.map_err(|e| ListModelsError::FetchError(format!("HTTP request failed: {e}")))?;

let discovery_map: HashMap<String, DiscoveryEntry> = response
.json()
.await
.map_err(|e| ListModelsError::FetchError(format!("Failed to parse JSON: {e}")))?;

tracing::debug!(entries = discovery_map.len(), "Received discovery response");

Ok(discovery_map)
}

/// Parse IP-based keys like "192.0.2.1:8000"
/// Returns None for keys that don't match IP:PORT format (e.g., "redpill:...")
fn parse_ip_port(key: &str) -> Option<(String, u16)> {
let parts: Vec<&str> = key.split(':').collect();
if parts.len() != 2 {
return None;
}

let ip = parts[0];
let port = parts[1].parse::<u16>().ok()?;

// Verify it's a valid IP address
if ip.parse::<IpAddr>().is_err() {
return None;
}

Some((ip.to_string(), port))
}

/// Ensure models are discovered before using them
async fn ensure_models_discovered(&self) -> Result<(), CompletionError> {
let model_mapping = self.model_mapping.read().await;

// If mapping is empty, we need to discover models
if model_mapping.is_empty() {
drop(model_mapping); // Release read lock
tracing::warn!("Model mapping is empty, triggering model discovery");
self.discover_models().await.map_err(|e| {
CompletionError::CompletionError(format!("Failed to discover models: {e}"))
})?;
}

Ok(())
}

async fn discover_models(&self) -> Result<ModelsResponse, ListModelsError> {
tracing::info!("Starting model discovery from discovery endpoint");

// Fetch from discovery server
let discovery_map = self.fetch_from_discovery().await?;

let mut model_mapping = self.model_mapping.write().await;
model_mapping.clear();

// Group by model name
let mut model_to_endpoints: HashMap<String, Vec<(String, u16)>> = HashMap::new();

for (key, entry) in discovery_map {
// Filter out non-IP keys
if let Some((ip, port)) = Self::parse_ip_port(&key) {
tracing::debug!(
key = %key,
model = %entry.model,
tags = ?entry.tags,
ip = %ip,
port = port,
"Adding IP-based provider"
);

model_to_endpoints
.entry(entry.model)
.or_default()
.push((ip, port));
} else {
tracing::debug!(
key = %key,
model = %entry.model,
tags = ?entry.tags,
"Skipping non-IP key"
);
}
}

// Create providers for each endpoint
let mut all_models = Vec::new();

for (model_name, endpoints) in model_to_endpoints {
tracing::info!(
model = %model_name,
providers_count = endpoints.len(),
"Discovered model with {} provider(s)",
endpoints.len()
);

let mut providers_for_model = Vec::new();

for (ip, port) in endpoints {
let url = format!("http://{ip}:{port}");
tracing::debug!(
model = %model_name,
url = %url,
"Creating provider for model"
);

let provider = Arc::new(VLlmProvider::new(VLlmConfig::new(
url.clone(),
self.api_key.clone(),
Some(self.inference_timeout_secs),
))) as Arc<InferenceProviderTrait>;

match provider
.get_attestation_report(model_name.clone(), None, None, None)
.await
{
Ok(_) => {
providers_for_model.push(provider);
}
Err(e) => {
tracing::debug!(
model = %model_name,
url = %url,
error = %e,
"Provider failed to return attestation report, excluding from pool"
);
}
}
}

model_mapping.insert(model_name.clone(), providers_for_model);

all_models.push(inference_providers::models::ModelInfo {
id: model_name,
object: "model".to_string(),
created: 0,
owned_by: "discovered".to_string(),
});
}

tracing::info!(
total_models = all_models.len(),
total_providers = model_mapping.values().map(|v| v.len()).sum::<usize>(),
model_ids = ?all_models.iter().map(|m| &m.id).collect::<Vec<_>>(),
"Model discovery from endpoint completed"
);

Ok(ModelsResponse {
object: "list".to_string(),
data: all_models,
})
}

async fn get_providers_for_model(
&self,
model_id: &str,
) -> Option<Vec<Arc<InferenceProviderTrait>>> {
let model_mapping = self.model_mapping.read().await;
model_mapping.get(model_id).cloned()
}

Copilot AI Dec 9, 2025


The register_provider and register_providers methods are now no-ops that don't perform any action, yet they're still pub async functions. This is misleading to API consumers who might expect these methods to actually register providers.

Consider one of these approaches:

  1. Remove these methods entirely if they're no longer needed (breaking change but clearer)
  2. Mark them as deprecated with #[deprecated] and a message explaining the new testing approach
  3. If they're truly needed for backwards compatibility, document why they exist and when they can be removed

Example:

#[deprecated(note = "Use InferenceProviderPool::new_with_provider for testing instead")]
pub async fn register_provider(&self, _model_id: String, _provider: Arc<InferenceProviderTrait>) {
    // No-op for backwards compatibility
}

Comment on lines +154 to +156
// Use StreamingResultExt to peek at the first event
use inference_providers::StreamingResultExt;
let mut peekable = stream.peekable();

Copilot AI Dec 9, 2025


The use inference_providers::StreamingResultExt; statement is placed inside the function body. While this works, it's unconventional and less discoverable. Consider either:

  1. Adding it back to the top-level imports (line 3)
  2. If it's intentionally placed here for scoping reasons, add a comment explaining why

Example for top-level import:

use inference_providers::{
    models::{AttestationError, CompletionError, ListModelsError, ModelsResponse},
    ChatCompletionParams, InferenceProvider, StreamingResult, StreamingResultExt, VLlmConfig, VLlmProvider,
};


@cursor cursor bot left a comment


@claude

claude bot commented Dec 10, 2025

Code Review: PR #240 - Migrate to Single Router Endpoint

Critical Issues Found ⚠️

1. Breaking Change: Mock Test Setup Broken (Production Safety)

Location: crates/api/src/lib.rs:432-447

The mock provider setup creates a pool but never actually uses the mock provider. Tests will fail because:

  • Mock provider is created but not integrated into the new pool architecture
  • The comment says "mock will be used directly in tests via a different mechanism" but no such mechanism exists
  • Tests expecting MockProvider behavior will make real HTTP requests to http://localhost:8080

Fix Required:
Either:

  1. Update the pool to support injecting mock providers, OR
  2. Update all tests to use real vLLM endpoints, OR
  3. Document the new test strategy and update existing tests accordingly
// Current broken code at lib.rs:432
let mock_provider = Arc::new(MockProvider::new_accept_all());
// ^ Created but never used!

let pool = Arc::new(
    services::inference_provider_pool::InferenceProviderPool::new(
        "http://localhost:8080".to_string(),  // Will make real HTTP calls
        None,
        30 * 60,
    ),
);

2. Logic Flaw: Attestation API Signature Changed (Backward Compatibility)

Location: crates/services/src/inference_provider_pool/mod.rs:75-93

The get_attestation_report method changes its return type behavior:

  • Before: Returns Vec<Map> with multiple attestations (one per provider)
  • After: Always returns Vec with single element vec![attestation]

This breaks the API contract. Code expecting multiple attestations will receive only one.

Impact:

  • Downstream services relying on multiple attestations per model will break
  • The all_attestations field removal suggests this was intentional but needs migration strategy

Fix Required:

  1. Check if any code depends on receiving multiple attestations
  2. Update API documentation to reflect single-attestation behavior
  3. Consider versioning if this breaks existing clients
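
The attestation split the PR summary describes (per-model entries extracted from the combined payload, with the remainder treated as the router's own attestation) could look roughly like this. The function and key names are illustrative assumptions, not taken from the actual code:

```rust
use std::collections::HashMap;

// Hypothetical sketch: pull the per-model attestations out of the
// combined payload; whatever keys remain are treated as the router's
// own attestation, as the summary describes.
fn split_attestations(
    mut all: HashMap<String, String>,
    model_ids: &[&str],
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut per_model = HashMap::new();
    for id in model_ids {
        if let Some(att) = all.remove(*id) {
            per_model.insert((*id).to_string(), att);
        }
    }
    // (per-model attestations, router attestation remainder)
    (per_model, all)
}
```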

3. Performance Issue: Unnecessary Clone in Hot Path (Rust-Specific)

Location: crates/services/src/inference_provider_pool/mod.rs:125-127

let pool = self.clone();  // ❌ Unnecessary - pool uses Arc internally
let provider = self.router_provider.clone();  // ✅ Necessary

While InferenceProviderPool is Clone, it contains Arc<RwLock<>> fields. Cloning the entire pool is unnecessary when you only need the provider reference.

Fix:

let chat_mapping = self.chat_id_mapping.clone();  // Clone just the Arc<RwLock<>>
let provider = self.router_provider.clone();
tokio::spawn(async move {
    let mut mapping = chat_mapping.write().await;
    mapping.insert(chat_id, provider);
});

4. Missing Error Handling: Router Timeout Not Configurable (Configuration)

Location: crates/config/src/types.rs:132

The config struct has timeout for router requests but it's never passed to VLlmProvider initialization:

// lib.rs:416-419
let router_provider = Arc::new(VLlmProvider::new(VLlmConfig::new(
    router_url,
    api_key,
    Some(inference_timeout_secs),  // ❌ Only inference timeout, not router/discovery timeout
)));

The config.inference_router.timeout field exists but is unused. Router health checks and model discovery will use default timeouts.

Fix:
Pass both timeouts to VLlmConfig or clarify that timeout is deprecated.


5. Potential Resource Leak: Streaming Peek Behavior (Concurrency)

Location: crates/services/src/inference_provider_pool/mod.rs:134-145

if let Some(Ok(event)) = peekable.peek().await {
    if let inference_providers::StreamChunk::Chat(chat_chunk) = &event.chunk {
        let chat_id = chat_chunk.id.clone();
        tokio::spawn(async move {  // ❌ Fire-and-forget task
            pool.store_chat_id_mapping(chat_id, provider).await;
        });
    }
}

Issues:

  1. If peek() returns Err, the error is silently ignored (not logged)
  2. The spawned task is fire-and-forget with no error handling
  3. If store_chat_id_mapping panics or times out, it's invisible

Fix:

match peekable.peek().await {
    Some(Ok(event)) => {
        if let inference_providers::StreamChunk::Chat(chat_chunk) = &event.chunk {
            // Store synchronously or add error logging to spawned task
            tokio::spawn(async move {
                if let Err(e) = pool.store_chat_id_mapping(...).await {
                    tracing::warn!("Failed to store chat_id mapping: {}", e);
                }
            });
        }
    }
    Some(Err(e)) => {
        tracing::warn!("Failed to peek at stream: {}", e);
    }
    None => {
        tracing::warn!("Stream ended before first event");
    }
}

6. Missing Migration Documentation (Production Safety)

Location: Environment variables renamed

Breaking changes:

  • MODEL_DISCOVERY_SERVER_URL → INFERENCE_ROUTER_URL
  • MODEL_DISCOVERY_API_KEY → INFERENCE_ROUTER_API_KEY
  • MODEL_DISCOVERY_REFRESH_INTERVAL → (removed)
  • MODEL_DISCOVERY_TIMEOUT → INFERENCE_ROUTER_TIMEOUT

Required:

  1. Update env.example with new variable names
  2. Add migration guide in PR description
  3. Consider backward compatibility (read old env vars if new ones missing)
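
A minimal sketch of the backward-compatibility suggestion in item 3, assuming plain std::env lookups; the helper name is hypothetical:

```rust
use std::env;

// Hypothetical helper: prefer the new INFERENCE_ROUTER_* variable and
// fall back to the old MODEL_DISCOVERY_* name, so existing deployments
// keep working while operators migrate their environment.
fn env_with_fallback(new_key: &str, old_key: &str) -> Option<String> {
    env::var(new_key).ok().or_else(|| env::var(old_key).ok())
}

// e.g. env_with_fallback("INFERENCE_ROUTER_URL", "MODEL_DISCOVERY_SERVER_URL")
```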

Verdict: ⚠️ CRITICAL ISSUES - DO NOT MERGE YET

Must fix before merge:

  1. Mock provider test setup (breaks all tests using mocks)
  2. Attestation API signature change (confirm no breaking usage)
  3. Environment variable migration plan

Recommended fixes:
4. Unnecessary pool clone in hot path
5. Streaming peek error handling
6. Unused timeout configuration


Estimated Risk: HIGH - This refactor removes 653 lines of complex fallback/retry logic. While simplification is good, ensure the router handles:

  • Load balancing across multiple vLLM backends
  • Automatic failover when backends are down
  • Model discovery and routing

If the router doesn't handle these, this PR removes critical resilience features.

@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env December 10, 2025 12:36 — with GitHub Actions Inactive
- volumes:
- postgres_data:
+ # volumes:
+ # postgres_data:


Bug: Development docker-compose changes accidentally committed

The docker-compose.yml file has several services and volumes commented out: the api service, datadog-agent service, and postgres volume persistence. These changes appear to be for local development convenience (e.g., starting fresh without persisted data) rather than intentional changes for the inference router migration. This could break other developers' workflows who expect the full docker-compose setup to work, and removes postgres data persistence which may cause data loss between restarts in development environments.


@think-in-universe
Contributor

Let's hold on merging this PR before vLLM router is fully verifiable with a reproducible build.

