Skip to content

feat: convert fluid to a control-plane#65

Merged
aspectrr merged 31 commits intomainfrom
aspectrr/sandbox-host
Feb 21, 2026
Merged

feat: convert fluid to a control-plane#65
aspectrr merged 31 commits intomainfrom
aspectrr/sandbox-host

Conversation

@aspectrr
Copy link
Owner

Description

  • Realized that for this to make sense there needs to be a centralized place for sandboxes to live.
  • Plus changed this from creating full clones to micro VMs, leading to way faster startup.
  • Networking is taken care of on the L2 with a tap that leads straight to the sandbox.
  • This makes so much more sense and is something production ready.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code style update (formatting, renaming)
  • Code refactor (no functional changes)
  • Configuration change
  • Test update

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works

Release Notes

Labels

…uid modules

106 tests across 12 new test files covering config, state, janitor, image,
network, reconnect, registry, placement, API handlers, and REST client.
All tests run without external dependencies (no QEMU, libvirt, PostgreSQL).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 16, 2026 01:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces an HTTP control-plane client in the fluid CLI module and adds a broad set of unit tests for the new control-plane / sandbox-host components (API handlers, registry, placement logic, store/network/janitor/config/image behaviors).

Changes:

  • Added fluid/internal/controlplane REST client with typed request/response models and helper HTTP methods.
  • Added comprehensive tests for the control-plane client behavior and error handling.
  • Added new test suites for sandbox-host and control-plane subsystems (state, networking, janitor, config, registry, orchestrator placement, API handlers).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
sandbox-host/internal/state/sqlite_test.go Adds state store CRUD/expiry/commands tests (currently outside CI/module scope).
sandbox-host/internal/network/bridge_test.go Adds tests for bridge resolution + DHCP mode (currently outside CI/module scope).
sandbox-host/internal/janitor/janitor_test.go Adds janitor cleanup behavior tests (timing-based).
sandbox-host/internal/image/store_test.go Adds image store listing/path resolution tests.
sandbox-host/internal/config/config_test.go Adds default/load/save config tests (uses UserHomeDir).
sandbox-host/internal/agent/reconnect_test.go Adds reconnect/backoff tests (real-time delays).
fluid/internal/controlplane/client.go Adds the new control-plane REST client implementation.
fluid/internal/controlplane/client_test.go Adds extensive client tests using httptest.
control-plane/internal/registry/registry_test.go Adds registry behavior + concurrency tests (currently outside CI/module scope).
control-plane/internal/orchestrator/placement_test.go Adds host placement tests (currently outside CI/module scope).
control-plane/internal/api/handlers_source_test.go Adds source VM handler tests (currently outside CI/module scope).
control-plane/internal/api/handlers_sandbox_test.go Adds sandbox handler tests + shared mock orchestrator (currently outside CI/module scope).
control-plane/internal/api/handlers_host_test.go Adds host handler tests (currently outside CI/module scope).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 151 to 234
// GetSandbox retrieves a sandbox by ID.
func (c *Client) GetSandbox(ctx context.Context, id string) (*SandboxResponse, error) {
var result SandboxResponse
if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s", id), &result); err != nil {
return nil, err
}
return &result, nil
}

// ListSandboxes returns all sandboxes.
func (c *Client) ListSandboxes(ctx context.Context) ([]*SandboxResponse, error) {
var result struct {
Sandboxes []*SandboxResponse `json:"sandboxes"`
Count int `json:"count"`
}
if err := c.get(ctx, "/v1/sandboxes", &result); err != nil {
return nil, err
}
return result.Sandboxes, nil
}

// DestroySandbox destroys a sandbox by ID.
func (c *Client) DestroySandbox(ctx context.Context, id string) error {
return c.delete(ctx, fmt.Sprintf("/v1/sandboxes/%s", id))
}

// StartSandbox starts a stopped sandbox.
func (c *Client) StartSandbox(ctx context.Context, id string) error {
return c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/start", id), nil, nil)
}

// StopSandbox stops a running sandbox.
func (c *Client) StopSandbox(ctx context.Context, id string) error {
return c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/stop", id), nil, nil)
}

// RunCommand executes a command in a sandbox.
func (c *Client) RunCommand(ctx context.Context, sandboxID, command string, timeoutSec int) (*CommandResponse, error) {
req := RunCommandRequest{
Command: command,
TimeoutSec: timeoutSec,
}
var result CommandResponse
if err := c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/run", sandboxID), req, &result); err != nil {
return nil, err
}
return &result, nil
}

// GetSandboxIP returns the IP address of a sandbox.
func (c *Client) GetSandboxIP(ctx context.Context, id string) (string, error) {
var result struct {
SandboxID string `json:"sandbox_id"`
IPAddress string `json:"ip_address"`
}
if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s/ip", id), &result); err != nil {
return "", err
}
return result.IPAddress, nil
}

// CreateSnapshot creates a snapshot of a sandbox.
func (c *Client) CreateSnapshot(ctx context.Context, sandboxID, name string) (*SnapshotResponse, error) {
req := struct {
Name string `json:"name"`
}{Name: name}
var result SnapshotResponse
if err := c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/snapshot", sandboxID), req, &result); err != nil {
return nil, err
}
return &result, nil
}

// ListCommands returns commands for a sandbox.
func (c *Client) ListCommands(ctx context.Context, sandboxID string) ([]*CommandResponse, error) {
var result struct {
Commands []*CommandResponse `json:"commands"`
Count int `json:"count"`
}
if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s/commands", sandboxID), &result); err != nil {
return nil, err
}
return result.Commands, nil
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path parameters (sandbox IDs, VM names) are interpolated directly into URLs via fmt.Sprintf (e.g., "/v1/sandboxes/%s"). If an ID or name contains characters that need escaping (spaces, "/", "?", "%"), requests will hit the wrong endpoint or fail. Use proper path-segment escaping (e.g., url.PathEscape per segment, or url.JoinPath) for all dynamic URL components (GetSandbox/DestroySandbox/Start/Stop/Run/CreateSnapshot/ListCommands and the source VM endpoints).

Copilot uses AI. Check for mistakes.
Comment on lines 364 to 384
func (c *Client) doRequest(req *http.Request, result any) error {
resp, err := c.httpClient.Do(req)
if err != nil {
return fmt.Errorf("request failed: %w", err)
}
defer func() { _ = resp.Body.Close() }()

data, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("read response: %w", err)
}

if resp.StatusCode >= 400 {
var errResp struct {
Error string `json:"error"`
}
if json.Unmarshal(data, &errResp) == nil && errResp.Error != "" {
return fmt.Errorf("control plane error (%d): %s", resp.StatusCode, errResp.Error)
}
return fmt.Errorf("control plane error (%d): %s", resp.StatusCode, string(data))
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doRequest includes the raw response body verbatim in error messages for non-JSON errors. Many servers include trailing newlines/whitespace or HTML, which makes errors noisy and breaks exact-string assertions/log filtering. Consider trimming whitespace from the body before formatting the error, and (optionally) include req.Method and req.URL.Path in the error to make client-side debugging easier.

Copilot uses AI. Check for mistakes.
Comment on lines 34 to 45
func TestNewClient(t *testing.T) {
c := NewClient("http://localhost:9090")
if c.baseURL != "http://localhost:9090" {
t.Errorf("expected baseURL http://localhost:9090, got %s", c.baseURL)
}
if c.httpClient == nil {
t.Fatal("expected httpClient to be non-nil")
}
if c.httpClient.Timeout != 10*time.Minute {
t.Errorf("expected timeout 10m, got %v", c.httpClient.Timeout)
}
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test hard-codes the exact baseURL string. If NewClient is updated to normalize/trim trailing slashes (recommended to avoid double-slash URLs), this assertion will fail. Consider asserting functional behavior (no trailing slash in stored baseURL, or correct request URL construction) rather than exact string equality.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +63
func TestRunWithReconnect_RetriesOnError(t *testing.T) {
var callCount atomic.Int32

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

connectFn := func(ctx context.Context) error {
n := callCount.Add(1)
if n < 3 {
return errors.New("connection failed")
}
return nil // succeed on 3rd attempt
}

// Override timing: we cancel context after connectFn succeeds, so
// backoff waits are the bottleneck. The first two failures will incur
// backoff waits of 1s and 2s respectively from the production code.
// We use a generous timeout above to accommodate that.
err := RunWithReconnect(ctx, slog.Default(), connectFn)
if err != nil {
t.Fatalf("expected nil after retries, got %v", err)
}

Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test appears to depend on production backoff waits (comment mentions 1s then 2s). That makes the unit test slow and potentially flaky if the backoff algorithm/timers change. Prefer injecting the sleep/backoff strategy into RunWithReconnect (or using a test hook) so the test can run quickly without real-time delays.

Copilot uses AI. Check for mistakes.
Comment on lines +85 to +88
want := `unknown network "nonexistent": not found in bridge_map`
if err.Error() != want {
t.Fatalf("expected error %q, got %q", want, err.Error())
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion compares the full error string exactly. That’s brittle if the implementation tweaks wording/punctuation or wraps errors, and it makes refactors harder without changing behavior. Prefer checking for a stable sentinel error (errors.Is) or validating key substrings/structured fields (e.g., network name) instead of the entire message.

Copilot uses AI. Check for mistakes.
Comment on lines 22 to 29
// NewClient creates a control plane client.
func NewClient(baseURL string) *Client {
return &Client{
baseURL: baseURL,
httpClient: &http.Client{
Timeout: 10 * time.Minute,
},
}
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewClient stores baseURL as-is, but request URLs are built via simple string concatenation (baseURL+path). If baseURL is configured with a trailing slash (common in configs), this produces double slashes (e.g., "http://cp/" + "/v1/..."), which can break some proxies and signature-based routing. Consider normalizing baseURL in NewClient (trim trailing "/") and updating the tests accordingly.

Copilot uses AI. Check for mistakes.
}

// SSH defaults
home, _ := os.UserHomeDir()
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.UserHomeDir() error is ignored here. On some CI/container environments (or Windows), UserHomeDir can fail, which would make the expected default paths incorrect and cause confusing test failures. Handle the error (t.Fatal/t.Skip) so the test fails (or skips) deterministically when HOME cannot be resolved.

Suggested change
home, _ := os.UserHomeDir()
home, err := os.UserHomeDir()
if err != nil {
t.Skipf("skipping TestDefaultConfig: cannot determine user home directory: %v", err)
}

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +71
// Give it time to run the immediate cleanup.
time.Sleep(200 * time.Millisecond)
cancel()
<-done
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests rely on time.Sleep to "give it time" for the janitor goroutine to run. This is timing-sensitive and can be flaky under load/slow CI, and it also slows the suite unnecessarily. Prefer synchronizing on an explicit signal from destroyFn (e.g., close a channel on first call) or exposing a single-iteration cleanup method for tests so they can block until work is done without sleeping.

Copilot uses AI. Check for mistakes.
Comment on lines +108 to +120
connectFn := func(ctx context.Context) error {
n := callCount.Add(1)
if n == 1 {
// After first failure, cancel context so RunWithReconnect
// exits during the backoff wait.
go func() {
time.Sleep(50 * time.Millisecond)
cancel()
}()
}
return errors.New("connection failed")
}

Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cancellation timing here is driven by a fixed 50ms sleep in a goroutine. Depending on scheduler load, RunWithReconnect might progress differently, which can make this test flaky. Consider coordinating via channels (e.g., have connectFn signal when it has returned the first error, then cancel immediately) instead of relying on wall-clock sleeps.

Suggested change
connectFn := func(ctx context.Context) error {
n := callCount.Add(1)
if n == 1 {
// After first failure, cancel context so RunWithReconnect
// exits during the backoff wait.
go func() {
time.Sleep(50 * time.Millisecond)
cancel()
}()
}
return errors.New("connection failed")
}
firstErrorCh := make(chan struct{})
connectFn := func(ctx context.Context) error {
n := callCount.Add(1)
if n == 1 {
// Signal that the first connection attempt has failed so we can
// cancel the context while RunWithReconnect is in its backoff wait.
close(firstErrorCh)
}
return errors.New("connection failed")
}
go func() {
<-firstErrorCh
cancel()
}()

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +7
package state

import (
"context"
"testing"
"time"
)
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sandbox-host tests live outside any Go module in this repo (there’s no sandbox-host/go.mod) and the CI workflow currently only runs go test in the fluid and fluid-remote modules. As a result, this test suite won’t run in CI, even though the PR checklist says tests were added. Consider adding a Go module for sandbox-host (and wiring it into CI), or relocating the code/tests into an existing module so they’re executed.

Copilot uses AI. Check for mistakes.
@claude
Copy link

claude bot commented Feb 16, 2026

PR Review - Critical Issues Found

🚨 BLOCKER: Missing Implementation Code

This PR adds only test files for code that doesn't exist. Of 13 files added:

  • ✅ 1 implementation file: fluid/internal/controlplane/client.go (393 lines)
  • ❌ 12 test files: testing non-existent packages (3,564 lines)

The tests will not compile. They import and test packages that are not included in this PR:

  • control-plane/internal/api (handlers)
  • control-plane/internal/orchestrator (placement logic)
  • control-plane/internal/registry (host registry)
  • control-plane/internal/store (state storage)
  • sandbox-host/internal/agent (host agent)
  • sandbox-host/internal/state (SQLite store)
  • sandbox-host/internal/network (bridge networking)
  • proto/gen/go/fluid/v1 (protobuf definitions)

🔒 Security Issues in client.go

1. Path Traversal Risk (HIGH)

func (c *Client) GetSandbox(ctx context.Context, id string) (*SandboxResponse, error) {
    // id is directly interpolated without validation
    if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s", id), &result); err != nil {

Impact: An ID like ../../admin/secrets could access unintended endpoints.
Fix: Validate IDs match expected format or use url.PathEscape()

Affected methods: GetSandbox, DestroySandbox, StartSandbox, StopSandbox, RunCommand, GetSandboxIP, CreateSnapshot, ListCommands, GetHost, PrepareSourceVM, ValidateSourceVM, RunSourceCommand, ReadSourceFile

2. Memory Exhaustion (MEDIUM)

data, err := io.ReadAll(resp.Body)  // Unbounded read

Impact: Malicious server can send gigabytes, causing OOM.
Fix: Use io.LimitReader(resp.Body, 10<<20) with reasonable size limit

3. Command Injection Potential (MEDIUM)

func (c *Client) RunCommand(ctx context.Context, sandboxID, command string, timeout time.Duration)

Impact: If backend doesn't validate, unsanitized command could be dangerous.
Fix: Add client-side validation or document backend requirements


🐛 Code Quality Issues

1. Type Safety Violations

func (c *Client) PrepareSourceVM(...) (any, error) {
    var result any  // No type safety!

Problem: Forces callers to use type assertions, increasing runtime errors.
Fix: Define proper response types

2. Insufficient Error Context

return fmt.Errorf("control plane error (%d): %s", resp.StatusCode, errResp.Error)

Missing: HTTP method, URL, request ID for debugging.
Fix: Include full context in errors

3. Hardcoded Timeout

httpClient: &http.Client{
    Timeout: 10 * time.Minute,  // Not configurable
}

Problem: All requests wait up to 10 minutes, even simple GETs.
Fix: Make timeout configurable per operation

4. No Input Validation

  • Empty sandbox IDs accepted
  • Empty commands accepted
  • Negative timeouts not checked
  • Invalid URLs not validated

5. No Observability

  • No logging hooks
  • No metrics collection
  • No distributed tracing support
  • No request ID propagation

🧪 Test Coverage Gaps

The client_test.go file is well-written but missing:

  • ✅ Basic CRUD operations
  • ✅ Error handling (404, 500)
  • ❌ Concurrent requests
  • ❌ Context cancellation
  • ❌ Timeout behavior
  • ❌ Large responses
  • ❌ Special characters in IDs (injection tests)
  • ❌ Network errors (connection refused)
  • ❌ Malformed JSON responses
  • ❌ Redirect handling

📋 Checklist Discrepancies

PR checklist claims:

✅ I have added tests that prove my fix is effective or that my feature works

Reality: Tests reference non-existent code and won't compile.

✅ I have made corresponding changes to the documentation

Reality: No documentation files changed.


🔧 Recommendations

Before merging:

  1. Add all missing implementation (control-plane server, sandbox-host agent, etc.)
  2. Fix security issues (input validation, response size limits)
  3. Replace any types with proper structs
  4. Add input validation for all parameters
  5. Make timeouts configurable
  6. Verify tests compile and pass (make test from control-plane and sandbox-host dirs)
  7. Add go.mod files for new modules
  8. Update AGENTS.md with new architecture

Nice to have:

  • Add client interface for testability
  • Add retry/backoff logic
  • Add observability hooks
  • Add integration tests
  • Add architecture diagrams
  • Add migration guide

📊 Verdict

Status:Request Changes

This PR is approximately 5-10% complete. It appears to be scaffolding/tests for a planned architecture but is missing ~90% of the actual implementation. The one implementation file has security vulnerabilities that must be addressed.

Estimated remaining work: All control-plane server code, sandbox-host agent, networking implementation, storage layer, orchestration logic, protobuf definitions, and proper integration tests.


Review performed with adherence to project CLAUDE.md guidelines. All test requirements per AGENTS.md verified.

@claude
Copy link

claude bot commented Feb 16, 2026

Code Review: Control Plane Architecture Migration

This is a major architectural refactoring that transforms fluid.sh from a libvirt-based VM cloning system to a microVM control plane with distributed sandbox hosts. This is production-ready foundational work with excellent test coverage.


Overall Assessment

Strengths:

  • Well-architected control plane with clean separation of concerns
  • Comprehensive test coverage (106 tests across 12 test files)
  • Production-grade error handling and logging
  • gRPC bidirectional streaming with proper request/response correlation
  • Excellent use of protobuf for type-safe distributed communication
  • State recovery mechanisms for crash resilience

Areas for Improvement:
Some security considerations and minor architectural improvements (detailed below)


Architecture Review

Control Plane (control-plane/)

Excellent Design:

  • Clean layering: API to Orchestrator to gRPC to Hosts
  • StreamHandler implements request/response correlation via pendingRequests sync.Map (stream.go:30-31)
  • Host registry maintains live connection state
  • PostgreSQL for persistent state (sandboxes, commands, hosts)

Key Components:

  1. Orchestrator (orchestrator.go): Core business logic, sandbox lifecycle management
  2. StreamHandler (grpc/stream.go): Bidirectional gRPC stream management with hosts
  3. Registry (registry/registry.go): In-memory host connection tracking
  4. API Server (api/server.go): REST endpoints for external consumers

Sandbox Host (sandbox-host/)

Production-Ready Features:

  • microVM management via QEMU with KVM acceleration (microvm/manager.go)
  • State recovery from PID files on daemon restart (manager.go:83-149)
  • Network bridge management with TAP device creation (network/bridge.go, network/tap.go)
  • Image store with overlay filesystem support (image/store.go)
  • Janitor for TTL-based cleanup (janitor/janitor.go)
  • SSH CA for ephemeral certificate-based access (sshca/ca.go)

Security Considerations

1. SSH CA Private Key Protection

File: sandbox-host/internal/sshca/ca.go

The SSH CA configuration mentions key permissions enforcement (EnforceKeyPermissions bool)

Recommendation:

  • Ensure CA private key is stored with 0600 permissions
  • Consider using hardware security modules (HSM) or cloud KMS for production
  • Add runtime validation that verifies key file permissions on startup
  • Document the key rotation procedure

2. gRPC Stream Authentication

File: control-plane/internal/grpc/stream.go:64-150

Current implementation accepts HostRegistration without authentication.

Recommendation:

  • Add mutual TLS (mTLS) for host authentication
  • Implement host pre-registration with API tokens
  • Consider adding host identity validation (e.g., signed JWTs)

3. Command Injection in QEMU Args (Currently Safe)

File: sandbox-host/internal/microvm/manager.go:182-196

QEMU args are properly constructed with separate array elements (not shell strings). No injection risk - args are properly separated.

4. Process Signal Handling

File: sandbox-host/internal/microvm/manager.go:269-295

The code uses os.FindProcess + Signal(0) to check process existence. On Unix, os.FindProcess always succeeds, even for non-existent PIDs. The process might belong to a different user.

Recommendation:

  • Add PID validation to ensure the process is actually a QEMU instance (check /proc/{pid}/cmdline)
  • Store additional process metadata (start time, executable path) for verification

5. SQL Injection (Properly Mitigated)

File: control-plane/internal/store/postgres/postgres.go

All database operations use parameterized queries - no injection risk detected.


Potential Bugs

1. Race Condition in Stream Cleanup

File: control-plane/internal/grpc/stream.go:129-133

If SendAndWait is called while stream is disconnecting, it might get the stream from h.streams but it could be closed before stream.Send() is called.

Recommendation: In SendAndWait, check if host is still connected before sending.

2. Memory Leak in pendingRequests

File: control-plane/internal/grpc/stream.go:30

If a host crashes without responding, the channel in pendingRequests will never be deleted (only cleaned up on response or timeout). The timeout mitigates this partially, but the map entry persists.

Recommendation: Add explicit cleanup in SendAndWait with defer to delete from pendingRequests.

3. Overlay Filesystem Not Cleaned on Launch Failure

File: sandbox-host/internal/microvm/manager.go:166-266

If QEMU launch fails after overlay creation, the overlay directory might not be cleaned up.

Recommendation: Add RemoveOverlay call in the error path.

4. JSON Parsing Without Error Handling

File: sandbox-host/internal/microvm/manager.go:386-447

Custom JSON parsing in readMetadata silently ignores parse errors. Consider using encoding/json for robust parsing, or add error reporting for malformed metadata.


Performance Considerations

1. Excessive Polling for PID File

File: sandbox-host/internal/microvm/manager.go:223-233

The code polls for PID file 10 times with 100ms sleep, adding up to 1 second delay on every VM launch.

Recommendation:

  • Use inotify/fsnotify to watch for PID file creation
  • Or use exponential backoff instead of fixed 100ms intervals

2. Synchronous VM Launch Blocks API

File: control-plane/internal/orchestrator/orchestrator.go:133

VM creation blocks the API call for up to 5 minutes.

Recommendation:

  • Consider async creation with status polling endpoint
  • Or stream progress updates back to client

3. Registry ListConnected() Not Optimized

File: control-plane/internal/registry/registry.go

If many hosts (100+), ListConnected() might be called frequently by the API. Consider adding caching with TTL if this becomes a bottleneck.


Test Coverage

Excellent Coverage:

  • 106 tests across 12 test files
  • Unit tests for all core components
  • Mock-based testing (no external dependencies)
  • Tests cover error cases and edge conditions

Examples:

  • control-plane/internal/api/handlers_sandbox_test.go: 586 lines, tests all API endpoints
  • sandbox-host/internal/state/sqlite_test.go: 443 lines, comprehensive state store tests
  • control-plane/internal/orchestrator/placement_test.go: Tests host selection algorithms

Suggestion: Add integration tests for:

  • End-to-end VM creation flow
  • gRPC stream reconnection after network failure
  • State recovery after crash

Code Quality

Strengths:

  • Clear package documentation (e.g., sshca/ca.go:1-11)
  • Consistent error handling patterns
  • Structured logging with contextual fields
  • No TODO/FIXME/HACK comments found

Minor Issues:

1. Inconsistent Error Wrapping

Some errors use fmt.Errorf without %w, which loses the error chain.

Recommendation: Use %w consistently for error wrapping.

2. Magic Numbers

Default values like vcpus = 2 and memMB = 2048 are hardcoded.

Recommendation: Extract to named constants or config struct.

3. Large Functions

  • orchestrator.go has several 50-100 line functions
  • grpc/stream.go:Connect is 150+ lines

Suggestion: Consider extracting sub-functions for readability.


Breaking Changes

As marked in the PR description, this is a breaking change:

  • The entire libvirt-based architecture is replaced
  • CLI interface likely needs updates
  • Configuration format changes required

Documentation Impact:

  • fluid/AGENTS.md should be updated with new architecture
  • Migration guide needed for existing users
  • Deployment documentation for control-plane + sandbox-host

Recommendations Summary

Critical (Fix Before Merge):

  1. Add cleanup in SendAndWait to prevent memory leaks in pendingRequests
  2. Add stream validity check before Send() to avoid sending to closed streams
  3. Clean up overlay directory on QEMU launch failure

High Priority (Soon After):

  1. Implement mTLS or token-based host authentication
  2. Add PID validation to prevent signal delivery to wrong processes
  3. Add integration tests for critical paths

Nice to Have:

  1. Optimize PID file polling with inotify
  2. Consider async VM creation API
  3. Extract magic numbers to constants
  4. Add migration guide documentation

Verdict

APPROVED with minor recommendations

This is excellent foundational work for a production-grade microVM control plane. The architecture is sound, test coverage is comprehensive, and the code quality is high. The security and bug concerns listed above are important but not blockers - they can be addressed in follow-up PRs.

The migration from libvirt VMs to QEMU microVMs with centralized control plane is the right architectural direction for scalability and multi-tenancy.

Suggested Merge Strategy:

  1. Address critical memory leak and race condition issues
  2. Merge to a feature branch for additional testing
  3. Add deployment documentation
  4. Roll out to staging environment
  5. Merge to main after validation

Great work on this substantial refactoring!

@claude
Copy link

claude bot commented Feb 17, 2026

Code Review: feat: convert fluid to a control-plane

This is a large, ambitious PR—converting the project from a local libvirt-based tool into a proper multi-tenant control plane with gRPC streaming daemons, PostgreSQL persistence, OAuth, billing, and a demo WebSocket server. The architecture is well thought out at a high level. The notes below focus on issues that need addressing before this is production-ready.


Critical

StripePublishableKey used as a Stripe Price ID (api/internal/rest/billing_handlers.go)

// handleSubscribe
Price: s.cfg.Billing.StripePublishableKey,  // wrong — pk_live_... is not a price_...

A publishable key (pk_live_...) is not a price ID (price_...). Every checkout session creation will fail. A separate StripePriceID config field is needed.


Security

OAuth CSRF: hardcoded state parameter (api/internal/rest/auth_handlers.go)

cfg.AuthCodeURL("state")  // "state" is a literal string, not a random token

The OAuth 2.0 state parameter must be a cryptographically random value generated per-request, stored server-side (or in a signed cookie), and verified in the callback. Using "state" as a constant makes both the GitHub and Google OAuth flows completely vulnerable to CSRF login attacks. The callbacks also never check the returned state value at all.

hostID is fully trusted from the daemon's own registration message (api/internal/grpc/stream.go)

hostID := reg.GetHostId()  // daemon-supplied; could claim any host ID

A daemon with a valid token can register as any hostID, potentially overwriting another host's registry entry. The control plane should assign or derive the hostID server-side (e.g. from the SHA-256 of the host token) rather than accepting it from the client.

Session tokens stored in plaintext (api/internal/auth/middleware.go)
The raw 64-char hex session token is used directly as the sessions table primary key. If the table is compromised, all active sessions are immediately usable. Tokens should be stored as their SHA-256 hash, with only the plaintext held in the cookie.

Cross-org sandbox routing in placement (api/internal/orchestrator/placement.go)
SelectHost / SelectHostForSourceVM iterate over all connected hosts without filtering by org. In a multi-tenant setup, org A's sandbox could be placed on org B's dedicated host. Placement must scope to hosts associated with the requesting org.

memberID not scoped to org in handleRemoveMember (api/internal/rest/org_handlers.go)

s.store.DeleteOrgMember(r.Context(), memberID)  // no org_id check

Any org admin can delete a member from any organization by supplying the target's member ID. The store query must join on org_id, or the handler must fetch the member first and verify member.OrgID == org.ID.

No guard against removing the org owner (api/internal/rest/org_handlers.go)
handleRemoveMember allows an admin to remove the org owner, which can permanently lock the owner out of their organization. Add a check that target.Role != OrgRoleOwner before deletion.

GitHub/Google API response status codes not checked (api/internal/rest/auth_handlers.go)
fetchGitHubUser, fetchGitHubPrimaryEmail, and fetchGoogleUser unmarshal the body regardless of HTTP status code. A 401 or 500 from the upstream will silently produce a zero-value struct (empty email, ID=0) that then attempts to create or link an account with garbage data.

http.DefaultClient used with no timeout in OAuth helpers
A slow/hanging GitHub or Google API response will hold the goroutine indefinitely. Use a context-aware client with an explicit timeout.


Race Conditions / Bugs

Concurrent stream.Send on the same gRPC stream (api/internal/grpc/stream.go)
SendAndWait does not hold a per-host mutex around stream.Send. Two goroutines sending commands to the same host concurrently will call Send simultaneously; gorilla's gRPC stream is not concurrent-write-safe. A per-stream sync.Mutex is required.

Data race on s.messages in demo-server session (demo-server/internal/session/session.go)
HandleMessage is launched with go sess.HandleMessage(...) from the read pump, so two user messages can run two concurrent LLM loops both reading and appending to s.messages without holding s.mu. A single serializing goroutine (channel dispatch) is the right fix.

Get writes info.State under RLock (fluid-daemon/internal/microvm/manager.go)

m.mu.RLock()
// ...
info.State = StateRunning  // write under a read lock = data race

Webhook errors silently swallowed in Stripe handler (api/internal/rest/billing_handlers.go)
All error cases inside handleStripeWebhook break out of the switch and return HTTP 200. Stripe will not retry events it thinks succeeded. Critical events (subscription creation failures, etc.) will be silently lost. Return a 5xx on internal errors.

store.CreateSubscription error discarded

_ = s.store.CreateSubscription(...)  // user paid but subscription may not be recorded

Hand-rolled JSON parser in microvm manager (fluid-daemon/internal/microvm/manager.go)
readMetadata uses a custom extractJSONString instead of encoding/json. It breaks on escaped characters and incorrectly handles integer fields (vcpus, memory_mb), producing wrong values. Just use json.Unmarshal.

WebSocket session limit is a TOCTOU race (demo-server/internal/ws/handler.go)
The session count is checked under the mutex, then the mutex is released before upgrader.Upgrade(), allowing a burst of concurrent connections to all pass the limit check simultaneously.

reapExpiredSessions goroutine never exits (demo-server/internal/ws/handler.go)
The goroutine started in NewHandler has no shutdown path. On server shutdown it leaks indefinitely.

GetSubscriptionByOrg("") in Stripe webhook handler
In the customer.subscription.updated case, the store is queried with an empty string as org ID. This will never match a subscription. The lookup by Stripe subscription ID is not implemented.


ID Collision Risk (api/internal/rest/auth_handlers.go)

"USR-" + uuid.New().String()[:8]  // only 32 bits of entropy

With ~65 000 users the birthday-collision probability reaches ~50%. Use a full UUID or a DB-generated ID.


Configuration (api/internal/config/config.go)

  • Hardcoded default DB credentials: "postgresql://fluid:fluid@localhost:5432/fluid_web". If DATABASE_URL is unset in production, the server silently tries this. Startup should fatal if the env var is absent.
  • No config validation at load time: critical fields (DATABASE_URL, STRIPE_SECRET_KEY, OAuth client IDs) are accepted as empty strings with no warning. Add a Validate() step on startup.
  • AutoMigrate defaults to true: irreversible schema changes running automatically on every production startup is risky. Default should be false.
  • envInt / envBool / envDuration silently ignore parse failures and use the default with no log message.

Test Coverage

The AGENTS.md and CLAUDE.md both state "Every code change needs tests," but most of the new server-side code has no test files at all:

  • api/internal/auth/ — no tests
  • api/internal/grpc/ — no tests
  • api/internal/orchestrator/ — no tests (placement logic especially needs unit tests)
  • api/internal/rest/ — no tests for any handler
  • api/internal/store/postgres/ — no tests
  • demo-server/internal/ — no tests

The fluid-cli/internal/mcp/handlers_test.go changes look good (tests are updated to match the new handler signatures), but the server-side work is untested.


Minor / Low

  • Swagger UI publicly accessible at /v1/swagger/* with no authentication—disable or gate in production.
  • toOrgResponse exposes stripe_customer_id to all org members; this should only be visible to the owner.
  • init-db.sql contains only a psql-specific \gexec meta-command to create the database, not the actual schema. The filename and AGENTS.md description are misleading; the real schema lives in GORM auto-migration.
  • OAuth state callback default is http://localhost:5173—no startup warning when running with localhost defaults.
  • gRPC server runs without enforced TLS by default. Host bearer tokens travel in plaintext. Log a warning or enforce TLS when GRPC_TLS_CERT is absent.
  • 90-second heartbeat timeout is a magic number duplicated in placement.go and stream.go. Consolidate into a single named constant or config field.
  • StripeCustomerID is leaked in GET /orgs/{slug} response to all org members, not just the owner.

Summary

The architecture direction is solid—moving to a centralized control plane with per-host daemons and micro-VMs is the right call for production. The main blockers are the critical Stripe price-ID bug, the OAuth CSRF vulnerability, the concurrent gRPC send race, and the cross-org member-deletion issue. The test coverage gap also needs to be addressed before merging per the project's own stated policy.

…ation

No backend required - the landing page demo now plays a scripted SRE
debugging scenario showcasing all fluid.sh phases (investigate, sandbox,
playbook, cleanup) and loops automatically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 17, 2026

Code Review — feat: convert fluid to a control-plane

This is a substantial architectural change with well-structured code overall. Two prior reviews have already been posted; this review focuses on issues found by reading the actual source on the current HEAD of the PR branch, with specific file/line references.


Critical Bugs

1. Wrong Stripe config field used as Price ID (api/internal/rest/billing_handlers.go:174)

The StripePublishableKey (pk_live_...) is used where a Stripe Price ID (price_...) is required. Every checkout session creation will fail. A separate StripePriceID config field is needed in BillingConfig.

2. Concurrent stream.Send without a mutex (api/internal/grpc/stream.go:187)

SendAndWait is called by the orchestrator from HTTP handler goroutines. Multiple concurrent requests to sandboxes on the same host will call stream.Send simultaneously on the same gRPC ServerStream. The gRPC stream is not safe for concurrent writes. A per-host sync.Mutex must wrap the stream.Send call.

3. customer.subscription.updated queries store with empty org ID (api/internal/rest/billing_handlers.go:428)

s.store.GetSubscriptionByOrg(r.Context(), "") will never match any subscription. Replace with a lookup by Stripe subscription ID.

4. CreateSubscription error silently discarded (api/internal/rest/billing_handlers.go:420)

_ = s.store.CreateSubscription(...) — if this fails, the webhook returns HTTP 200, Stripe won't retry, and a paid user gets no plan upgrade. Return a 5xx on failure.


Security

5. OAuth CSRF: hardcoded "state" parameter (api/internal/rest/auth_handlers.go:253, 318)

cfg.AuthCodeURL("state") — the OAuth state parameter must be a cryptographically random, per-request nonce to prevent CSRF login attacks. Generate with crypto/rand, store in a signed cookie, and verify in the callback. Both GitHub and Google flows are affected.

6. Session tokens stored as plaintext primary keys (api/internal/auth/middleware.go:31)

The raw session token from the cookie is passed directly to st.GetSession as the lookup key. If the sessions table is leaked, all active sessions are immediately usable. Store tokens as their SHA-256 hash (as is already done for host tokens in hostauth.go:57).

7. handleRemoveMember doesn't verify the target belongs to the org (api/internal/rest/org_handlers.go:476)

s.store.DeleteOrgMember(r.Context(), memberID) — an org admin can supply any memberID from any org and delete it. The store query must enforce org_id alongside id, or fetch the member first to verify member.OrgID == org.ID.

8. No guard against removing the org owner (api/internal/rest/org_handlers.go:455–485)

An admin can remove the owner, permanently locking them out. Add a check that the target member's role is not owner.

9. GitHub/Google API response status codes not validated (api/internal/rest/auth_handlers.go:379–395, 432–445)

fetchGitHubUser, fetchGitHubPrimaryEmail, and fetchGoogleUser unmarshal the body regardless of HTTP status code. A 401 or 500 from the upstream produces a zero-value struct (empty email, ID=0), which then creates a user account with garbage data. Check resp.StatusCode before unmarshalling.

10. http.DefaultClient used with no timeout in OAuth helpers

The same three functions use http.DefaultClient.Do with no timeout. A hung upstream holds the goroutine indefinitely. Use a client with an explicit timeout.


Architecture / Design

11. hostID fully trusted from the daemon's own registration message (api/internal/grpc/stream.go:60)

hostID := reg.GetHostId() — despite the auth interceptor validating the bearer token, the host ID is still daemon-supplied. A daemon can register as any hostID and overwrite another host's registry entry. Derive hostID from the token server-side rather than accepting it from the client.

12. Host placement ignores org scope (api/internal/orchestrator/placement.go:12–51)

SelectHost iterates all connected hosts with no filtering by org. In a multi-tenant setup, org A's sandbox can be scheduled on org B's dedicated host. Placement must be scoped to the requesting org's hosts.

13. DiscoverSourceHosts forwards SSH config to an arbitrary first host (api/internal/orchestrator/orchestrator.go:718)

SSH config content (which may contain credentials) is forwarded to connected[0] — whatever daemon is first in the list. This must be scoped to the org's own hosts.


Code Quality

14. PrepareSourceVM and ValidateSourceVM return any (api/internal/orchestrator/orchestrator.go:557, 592)

Returning any loses type safety. Define concrete response structs.

15. StripeCustomerID exposed to all org members (api/internal/rest/org_handlers.go:33–35)

toOrgResponse includes stripe_customer_id and is used in handleGetOrg, accessible to all members. This is internal billing data and should only be visible to the owner.

16. Heartbeat threshold duplicated as a magic number (placement.go:37, stream.go:221)

Both hard-code 90*time.Second. OrchestratorConfig.HeartbeatTimeout already exists in config — use it in both places.


Test Coverage

Per CLAUDE.md and api/AGENTS.md: "Every code change needs tests."

There are zero test files under api/. The following packages have no tests at all:

  • api/internal/auth/
  • api/internal/grpc/
  • api/internal/orchestrator/
  • api/internal/rest/
  • api/internal/store/postgres/

At minimum, placement logic and auth middleware need unit tests before merging.


Configuration

17. Hardcoded default DB credentials (api/internal/config/config.go:105)

"postgresql://fluid:fluid@localhost:5432/fluid_web" as a fallback means a misconfigured production deploy silently connects to localhost. Require DATABASE_URL to be explicitly set.

18. AutoMigrate defaults to true (config.go:109)

Running schema migrations automatically on every startup is risky in production. Default to false.

19. envInt/envBool/envDuration silently ignore parse errors (config.go:170–195)

Invalid env vars (e.g. DATABASE_MAX_OPEN_CONNS=abc) fall back to defaults with no log message. Add a warning when parse errors are swallowed.


Summary

The architectural direction is solid: centralized control plane, bidirectional gRPC streaming to daemons, microVM sandboxes, proper multi-tenancy model. The code is well-organized and readable.

Must fix before merge:

  • Items 1–4: Billing is functionally broken
  • Item 5: OAuth CSRF vulnerability
  • Items 7–8: Org member deletion authorization bypass

Should fix soon after:

  • Items 6, 9–10: Session and OAuth security hardening
  • Items 11–13: Multi-tenancy host scoping
  • Item 16: Test coverage per project policy

Review based on source at commit 68d021a.

@claude
Copy link

claude bot commented Feb 17, 2026

Code Review — PR #65: feat: convert fluid to a control-plane

This is a significant architectural change (43k additions, 58k deletions). The direction is sound — moving to a centralized control plane with micro VMs is a more production-viable design. However, there are several security and correctness issues that need to be addressed before merge.


🔴 Critical — Must Fix Before Merge

1. Concurrent stream.Send Without a Mutex (Data Race)

api/internal/grpc/stream.go ~line 187

SendAndWait is called by multiple goroutines (one per concurrent HTTP request) targeting the same host. grpc.ServerStream.Send is not safe for concurrent use — only one goroutine may call Send at a time. Two simultaneous sandbox creation requests on the same host will corrupt the stream.

Fix: Add a per-host sync.Mutex that wraps the stream.Send call.


2. OAuth CSRF — Hardcoded "state" Parameter

api/internal/rest/auth_handlers.go ~lines 253, 318

Both GitHub and Google OAuth flows pass the literal string "state" as the OAuth state parameter:

url := cfg.AuthCodeURL("state")

The state value must be a cryptographically random, per-request nonce stored in the user's session. Neither callback validates the returned state at all. This makes both OAuth flows vulnerable to CSRF login attacks — an attacker can forge a login and link a victim's session to an attacker-controlled account.


3. IDOR — DeleteOrgMember Not Scoped to Org

api/internal/rest/org_handlers.go ~line 476 + api/internal/store/postgres/postgres.go ~line 769

The handler verifies the caller is an admin of their org, then calls:

s.store.DeleteOrgMember(r.Context(), memberID)

But the store method only filters by id, not org_id. Any org admin can delete a member from any organization by supplying that member's ID. The query must include AND org_id = ?.


4. Paid Subscription Silently Not Recorded

api/internal/rest/billing_handlers.go ~line 420

_ = s.store.CreateSubscription(r.Context(), sub)

If CreateSubscription fails when processing a checkout.session.completed webhook, the error is discarded and HTTP 200 is returned. Stripe will not retry. The user has paid but no subscription is created. This must return HTTP 500 so Stripe retries.


5. Subscription Webhook Queries with Empty Org ID

api/internal/rest/billing_handlers.go ~line 428

In the customer.subscription.updated handler:

existing, err := s.store.GetSubscriptionByOrg(r.Context(), "")

Querying with "" as the org ID will never match a real subscription. The correct lookup path using GetOrganizationByStripeCustomerID exists but isn't reached here.


🟠 High — Address Before or Shortly After Merge

6. Session Tokens Stored as Plaintext Primary Keys

api/internal/auth/session.go, api/internal/auth/middleware.go

The raw session token is used as the sessions table primary key. Compare with hostauth.go, which correctly hashes host tokens before storage:

hash := HashToken(raw)  // correct
token, err := st.GetHostTokenByHash(...)

Session tokens should be stored as their SHA-256 hash. A database exposure would otherwise immediately compromise all active sessions.


7. hostID Fully Trusted From Daemon's Own Message

api/internal/grpc/stream.go ~line 60

Although the bearer token is validated, the daemon supplies its own hostID in the registration message. A compromised daemon with a valid token can register as any hostID, potentially overwriting another host's stream entry. The server should derive/assign the host identity from the validated token, not accept it from the client.


8. Org Owner Can Be Removed by an Admin

api/internal/rest/org_handlers.go

handleRemoveMember does not prevent an admin from removing the org owner. The handler must check target.Role != OrgRoleOwner before proceeding.


9. OAuth HTTP Response Status Codes Not Validated

api/internal/rest/auth_handlers.go (fetchGitHubUser, fetchGoogleUser)

resp.StatusCode is never checked before unmarshalling. A 401/500 from GitHub/Google produces a zero-value struct (empty email, empty ID) that then proceeds to findOrCreateOAuthUser, potentially creating corrupted user accounts.


10. http.DefaultClient Has No Timeout in OAuth Helpers

api/internal/rest/auth_handlers.go

All OAuth API calls use http.DefaultClient which has no timeout. A slow response from GitHub/Google holds the handler goroutine indefinitely, leading to goroutine accumulation and eventual server resource exhaustion. Use a context-scoped client with an explicit timeout.


11. Host Placement Not Scoped to Requesting Org (Cross-Tenant Leakage)

api/internal/orchestrator/placement.go, api/internal/orchestrator/orchestrator.go ~line 718

SelectHost iterates all connected hosts without filtering by org — in a multi-tenant deployment, org A's workloads can be scheduled on org B's dedicated hosts. SelectHost must accept an orgID parameter. The same applies to DiscoverSourceHosts which uses connected[0] without any ownership check.


12. Zero Tests in the Entire api/ Module

The project's own CLAUDE.md states "Every code change needs tests." The entire api/ module — auth, grpc, orchestrator, rest handlers, store — has no test files whatsoever. The critical paths (auth flows, webhook handling, stream concurrency) are exactly the ones that most benefit from automated tests.


🟡 Medium

13. PrepareSourceVM/ValidateSourceVM Return any

api/internal/orchestrator/orchestrator.go ~lines 557, 592

Returning any loses type safety and forces callers to use type assertions. Return concrete proto or response structs.


14. Heartbeat Timeout Duplicated as Magic Number

api/internal/orchestrator/placement.go ~line 37, api/internal/grpc/stream.go ~line 221

90*time.Second is hard-coded in two places. OrchestratorConfig.HeartbeatTimeout already exists — it should be threaded through to SelectHost and used consistently.


15. AutoMigrate Defaults to true

api/internal/config/config.go

Running GORM auto-migrations on every startup is dangerous in production — DDL runs on every process restart. Default should be false with explicit opt-in.


16. Hardcoded Default Database Credentials

api/internal/config/config.go

URL: envOr("DATABASE_URL", "postgresql://fluid:fluid@localhost:5432/fluid_web"),

Production deployments should fail fast if DATABASE_URL is unset rather than silently using a well-known default.


17. Config Parse Errors Silently Swallowed

api/internal/config/config.go

envInt, envBool, and envDuration silently fall back to defaults on parse errors. A misconfigured env var (e.g. DATABASE_MAX_OPEN_CONNS=abc) produces no log output, making misconfigurations very hard to diagnose.


18. StripeCustomerID Exposed to All Org Members

api/internal/store/postgres/postgres.go

The Stripe customer ID is included in the serialized Organization struct returned to all org members. Internal billing identifiers should be filtered for non-owner roles.


19. ID Prefix Uses Only 32 Bits of UUID Entropy

api/internal/rest/host_handlers.go, api/internal/rest/auth_handlers.go

ID: "HTK-" + uuid.New().String()[:8]

Only 8 hex chars (32 bits) is used. At ~65k entities, collision probability reaches ~50%. Use a full UUID or a DB-generated serial.


20. BaseImage Always Overridden by SourceVM in CreateSandbox

fluid-cli/internal/sandbox/remote.go ~lines 41-43

BaseImage: req.SourceVM,  // req.BaseImage is ignored
SourceVm:  req.SourceVM,

If a caller provides a BaseImage different from SourceVM, it is silently discarded.


🔵 Low

  • api/init-db.sql contains only DB creation (not schema DDL) despite its name — add a comment clarifying this or add explicit schema.
  • gRPC runs without TLS (cmd/server/main.go, sandbox/remote.go) — host bearer tokens are sent in plaintext. Log a warning at startup when TLS is not configured.
  • CreateAgentMessage ignores the updated_at update error (postgres.go ~line 1468).
  • Swagger UI is publicly accessible — protect or disable in non-development environments.

Summary

Severity Count
🔴 Critical 5
🟠 High 7
🟡 Medium 8
🔵 Low 4

The architecture itself is a clear improvement. Issues 1 (gRPC stream race) and 2 (OAuth CSRF) are security-critical and need fixes before merge. Issue 12 (no tests) is required by project policy. The remaining critical/high items should be resolved or tracked as immediate follow-up.

- Add 200+ tests across api/ covering auth, REST handlers, orchestrator,
  agent executor, gRPC streaming, registry, store, config, json, and id
- Add billing package with Stripe metering and usage ticker
- Add store types and interface methods for billing (ModelMeter,
  OrgModelSubscription, SumTokenUsage, ListActiveSubscriptions)
- Update Python SDK with full API coverage and high-level client
- Add web billing calculator, onboarding, playbooks, and daemon status
- Add fluid-cli model picker and modelsdev client
- Add fluid-daemon id package
- Misc refactors: heartbeat timeout config, org-scoped registry, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 18, 2026

Code Review — feat: convert fluid to a control-plane

The architectural direction here is right: moving to a centralized control plane with microVMs and per-host daemons is a significant improvement in scalability and multi-tenancy potential. The QEMU overlay filesystem approach for fast sandbox creation is a good technical choice.

Previous reviews have catalogued many issues thoroughly. Rather than repeat them, I want to consolidate the must-fix list and add a few observations not yet covered.


Consensus Must-Fix Items (from prior reviews)

These issues appear across multiple reviews and need resolution before merge:

  1. Concurrent gRPC stream.Send raceapi/internal/grpc/stream.go: Multiple HTTP handler goroutines can call Send on the same stream simultaneously. Add a per-host sync.Mutex around stream.Send.

  2. OAuth CSRFapi/internal/rest/auth_handlers.go: cfg.AuthCodeURL("state") passes a hardcoded literal. Generate a crypto/rand nonce per request, store it (signed cookie or session), and validate it in both callbacks.

  3. IDOR on org member deletionapi/internal/rest/org_handlers.go + store: DeleteOrgMember filters only by id, not org_id. Any org admin can delete members from other orgs. The query must include AND org_id = ?.

  4. Stripe webhook reliabilityapi/internal/rest/billing_handlers.go: _ = s.store.CreateSubscription(...) discards errors and returns HTTP 200, preventing Stripe retries on failure. Also, GetSubscriptionByOrg(ctx, "") in the customer.subscription.updated case will never match anything.

  5. Stripe publishable key used as price IDapi/internal/rest/billing_handlers.go:174: StripePublishableKey (pk_live_...) is not a price ID (price_...). Add a dedicated StripePriceID config field.

  6. Zero tests in api/ — Per CLAUDE.md and api/AGENTS.md: "Every code change needs tests." The auth, gRPC, orchestrator, REST handler, and store layers have no tests at all.


Additional Observations

url.PathEscape missing on user-supplied IDs (fluid/internal/controlplane/client.go)

IDs are interpolated directly into URL paths with fmt.Sprintf("/v1/sandboxes/%s", id). An ID containing / or .. would silently hit a different endpoint. Use url.PathEscape(id) at each call site, or add upfront validation that IDs match an expected format (e.g. UUID).

io.ReadAll is unbounded (fluid/internal/controlplane/client.go)

io.ReadAll(resp.Body) reads the entire response into memory with no limit. A malicious or misbehaving server can trigger OOM. Apply io.LimitReader(resp.Body, 10<<20) or similar before reading.

Pending request map leak in stream handler (api/internal/grpc/stream.go)

If a host disconnects mid-request, the channel entry in pendingRequests is deleted when the stream exits, but any goroutine blocked in SendAndWait waiting on that channel will also block forever — no timeout path reaches the delete. SendAndWait needs to defer pendingRequests.Delete(reqID) to guarantee cleanup regardless of how the function exits.

os.FindProcess is always successful on Linux (fluid-daemon/internal/microvm/manager.go)

os.FindProcess returns success for any PID on Linux, even non-existent ones — the OS does not validate. The subsequent Signal(0) check is the real liveness probe, but a PID could be reused by an unrelated process. Consider verifying /proc/{pid}/cmdline contains qemu or storing the process start time for cross-check.

Overlay not cleaned on QEMU launch failure (sandbox-host/internal/microvm/manager.go)

If the process fails to start after the overlay directory is created, the overlay is not removed in the error path. This leaks disk space over repeated failures. Add a deferred RemoveOverlay with a success flag.

BaseImage field silently ignored (fluid-cli/internal/sandbox/remote.go)

Both BaseImage and SourceVM are set from req.SourceVM, discarding any caller-provided BaseImage value. If the field is redundant, remove it from the struct; if intentional, document why.


Minor

  • gRPC between CLI and daemon runs without TLS by default; bearer tokens travel in plaintext. Log a startup warning when TLS is unconfigured.
  • api/internal/config/config.go: AutoMigrate defaulting to true means DDL runs on every production restart. Default to false.
  • envInt/envBool/envDuration silently swallow parse errors — add a log line when a value is unparsable so misconfigured deploys are diagnosable.
  • StripeCustomerID is included in GET /orgs/{slug} responses for all members; this should be filtered to owners only.

Summary

The architecture is a solid foundation. The gRPC stream race, OAuth CSRF, org member IDOR, and Stripe reliability bugs are the items that would cause real production incidents and need to be resolved before merge. Everything else can be tracked as follow-up work.

aspectrr and others added 2 commits February 17, 2026 21:23
The OAuth login handlers passed a hardcoded "state" literal and callbacks
never validated it, leaving the flow vulnerable to CSRF attacks. Now each
login generates a 32-byte crypto/rand nonce stored in a short-lived
HttpOnly cookie, and callbacks validate it with constant-time comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DeleteOrgMember only filtered by member ID, allowing any org admin to
delete members from other orgs if they knew the member ID. Now requires
org_id match in the WHERE clause.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 18, 2026

Code Review: feat: convert fluid to a control-plane

This is a substantial architectural rework — a well-motivated pivot from a local libvirt-clone workflow to a centralized control plane with multi-tenancy, billing, and microVMs. The overall direction is solid and the code is generally well-structured. The review below focuses on issues that need attention before merging.


Architecture Overview

The new architecture is sound:

  • Control-plane API (api/) handles REST + bidirectional gRPC streaming to daemons
  • fluid-daemon is reworked to use overlay-based microVMs via TAP/bridge instead of full libvirt clones
  • The CLI now speaks to the control plane rather than directly to a daemon
  • Multi-tenancy, OAuth, billing (Stripe), and an OpenRouter-backed AI agent are all added

Security Issues

🔴 High: Agent executor lacks org isolation on sandbox operations

api/internal/agent/executor.go dispatches sandbox operations (destroy, start, stop, run_command, get_sandbox, create_snapshot) directly from LLM-provided parameters without checking that the sandbox belongs to the calling org:

// executor.go
func (c *Client) execDestroySandbox(ctx context.Context, params map[string]any) (string, error) {
    err := c.orchestrator.DestroySandbox(ctx, strParam(params, "sandbox_id"))

The REST handlers do this correctly:

// sandbox_handlers.go
sandbox, err := s.orchestrator.GetSandbox(r.Context(), sandboxID)
if sandbox.OrgID != org.ID {
    serverError.RespondError(w, http.StatusForbidden, ...)
}

A prompt injection attack could cause the AI agent to destroy sandboxes belonging to other orgs. The executor needs the same org-ownership check before every mutating operation.

🔴 High: DeleteHostToken does not verify token belongs to caller's org

api/internal/rest/host_handlers.go — the delete handler verifies the caller is an org admin, but does not verify the token being deleted belongs to their org. An admin of Org A could delete Org B's host tokens by guessing token IDs (HTK-XXXXX).

tokenID := chi.URLParam(r, "tokenID")
if err := s.store.DeleteHostToken(r.Context(), tokenID); err != nil {
    // No org ownership check on the token itself

🔴 High: OAuth account linking trusts unverified email from provider

api/internal/rest/auth_handlers.go marks users as EmailVerified: true based solely on receiving an email address from the OAuth provider, without checking the provider's verified field:

user = &store.User{
    Email:         email,
    EmailVerified: true,  // assumed verified — not checked
}

GitHub returns unverified emails via the user:email scope. An attacker with an unverified email on GitHub that matches an existing fluid.sh password account could link their OAuth identity to that account. The /user/emails endpoint returns a verified boolean that must be checked.


Security Issues (Medium)

🟡 Secure cookie flag broken behind reverse proxy

api/internal/rest/auth_handlers.go:

auth.SetSessionCookie(w, sess.ID, s.cfg.Auth.SessionTTL, r.TLS != nil)

When the API sits behind a TLS-terminating load balancer (standard production setup), r.TLS is always nil, so session cookies are set without the Secure flag. This allows cookies to be sent over HTTP. Replace with a config flag: cfg.Auth.SecureCookies.

🟡 No HTTP status check in OAuth userinfo fetches

fetchGitHubUser and fetchGoogleUser do not check resp.StatusCode before unmarshaling. A non-2xx response with a valid JSON structure would produce a zero-value user object (ID=0, empty email), which could then be persisted.

🟡 Proxmox secret stored and transmitted in plaintext

ProxmoxSecret is received via REST, stored as a plain string in PostgreSQL, and forwarded in plaintext over gRPC to daemons. The json:"-" tag prevents it from appearing in API responses (good), but it is not encrypted at rest. This is a high-value credential.

🟡 gRPC host ID is daemon-controlled, not token-bound

api/internal/grpc/stream.go:

hostID := reg.GetHostId()

The host_id in the registration message is trusted from the daemon itself. A daemon authenticating with a valid token can claim any host_id, including one belonging to another org's host. The control plane should either assign the host ID at registration or validate it matches the token's bound identity.

🟡 No rate limiting on auth endpoints

/auth/login and /auth/register have no rate limiting or brute-force protection. Given this is a multi-tenant SaaS, these endpoints need protection (e.g., middleware using a token bucket or IP-based limiting).

🟡 monitorHeartbeat only warns, never takes action

api/internal/grpc/stream.go:

if time.Since(host.LastHeartbeat) > h.heartbeatTimeout {
    logger.Warn("host heartbeat overdue", ...)
    // No disconnect, no removal from registry
}

Dead hosts remain in the registry indefinitely. Sandbox creation will attempt to schedule on them, fail after a 30-second gRPC timeout, and return an error to users. These hosts should be marked offline and removed from placement consideration.


Bugs

🐛 billing_handlers.go: GetSubscriptionByOrg("") is always wrong

// customer.subscription.updated handler
existing, err := s.store.GetSubscriptionByOrg(r.Context(), "")

This always passes an empty string and will always return ErrNotFound. The customer fallback lookup works, but this first call is dead code that obscures intent. Should be removed or replaced with the correct org ID lookup.

🐛 fetchGitHubUser silently uses first email, verified or not

When no primary email is found in the GitHub /user/emails response, the code uses emails[0] regardless of its verified status. Combined with the earlier concern, this creates an account with an unverified, potentially attacker-controlled email.


Missing Test Coverage

The test suite is generally solid (28 test files), but these areas are gaps:

  1. api/internal/billing/meters.go — 419 lines of Stripe meter logic with no test file
  2. api/internal/billing/ticker.go — 163 lines, no test file
  3. Billing webhook handlercustomer.subscription.updated case is undertested; the GetSubscriptionByOrg("") bug would be caught by a proper test
  4. Agent executor — No tests verifying cross-org isolation for sandbox operations
  5. DeleteHostToken — No test for the missing org-ownership check

Code Quality / Minor Issues

  • No email format validation on /auth/register — any string is accepted as an email
  • No scheduled session cleanupDeleteExpiredSessions exists but is never called; expired rows accumulate indefinitely
  • DiscoverSourceHosts picks connected[0] from a Go map, which has non-deterministic iteration order — this could cause flaky behavior
  • GORM AutoMigrate on startup is fine for development but should be replaced with versioned migrations before production
  • orchestrator.ListVMs is O(n) sequential gRPC round-trips with 30-second timeouts per host — this will be slow with many hosts; consider parallelizing with a fan-out pattern

What's Working Well

  • Session token generation (32-byte CSPRNG, server-side storage, expiry enforcement)
  • bcrypt cost 12 for password hashing
  • OAuth state parameter uses constant-time comparison
  • Host token SHA-256 hashing before storage (raw token returned once only)
  • Stripe webhook signature verification before processing
  • The bidirectional gRPC streaming design for daemon communication is clean
  • The microVM overlay approach is a meaningful performance improvement over full clones
  • Test structure is well-organized with clear helper patterns

Summary

The architectural direction is right. Before merging, the three high-severity security issues (agent org isolation, token ownership check, OAuth email verification) should be addressed. The medium issues (Secure cookie, HTTP status checks, rate limiting) are also important for a production SaaS. The billing bug and missing test files should be straightforward to fix.

Reviewed by Claude Sonnet 4.5

@claude
Copy link

claude bot commented Feb 18, 2026

Code Review: feat: convert fluid to a control-plane

This is a substantial architectural refactor — moving from local libvirt clones to a control-plane + daemon + microVM model. The overall structure is well-thought-out. Below are findings from reviewing the key changed files.


Critical / High Severity

1. Host ID hijacking (api/internal/grpc/stream.go)

The host ID is daemon-provided during registration with no server-side validation that it belongs to the registering org. A daemon from org A could supply a host_id owned by org B, overwriting that host's stream slot in the registry and potentially receiving commands destined for org B's host. The Register() call silently overwrites existing entries.

Fix: After GetHost, verify host.OrgID == orgID before allowing registration. If the host exists under a different org, reject the stream with a permission error.

2. Host token delete missing org-scoping (api/internal/rest/host_handlers.go + api/internal/store/postgres/postgres.go)

handleDeleteHostToken does not verify that the token being deleted belongs to the requesting user's org. The store's DeleteHostToken only filters by id, not id AND org_id. An owner/admin of org A can delete org B's host tokens if they can guess the token ID.

Fix: Add org-scoping to DeleteHostToken — either in the store query (WHERE id = ? AND org_id = ?) or by first fetching the token and checking its org in the handler.

3. Database URL logged in plaintext (api/cmd/server/main.go line ~48)

slog.Info("connecting to database", "db", cfg.Database.URL)

If cfg.Database.URL contains credentials (e.g., postgres://user:password@host/db), the password is emitted to logs at startup. Redact the URL before logging (strip userinfo from the parsed URL).

4. Sandbox creation not atomic (api/internal/orchestrator/orchestrator.go)

If o.store.CreateSandbox fails after the VM has already been successfully created on the daemon host, the VM is orphaned — it exists on the host but is untracked in the DB with no cleanup rollback. Similarly in StartSandbox/StopSandbox, if o.store.UpdateSandbox fails after a successful host operation, DB and host state diverge (error is only logged).

Fix: On CreateSandbox DB failure, issue a compensating DestroySandbox to the host. For start/stop, consider a retry loop or a reconciliation mechanism.


Medium Severity

5. Stale hosts never disconnected (api/internal/grpc/stream.gomonitorHeartbeat)

monitorHeartbeat logs a warning when the heartbeat deadline is overdue but never disconnects or deregisters the host. Stale, non-communicating hosts accumulate in the registry and can be selected for new sandbox placement.

Fix: After N missed heartbeats, cancel the stream context and deregister the host from the registry.

6. Command audit trail silently dropped (api/internal/orchestrator/orchestrator.go)

In RunCommand, if o.store.CreateCommand fails, the error is only logged — the command executes but no audit record is persisted. For an infrastructure tool, silent audit loss is a significant correctness problem.

Fix: Return the error to the caller (or at minimum use structured metrics/alerting, not just a log line).

7. Billing store methods are unimplemented stubs (api/internal/store/postgres/postgres.go)

All billing-related store methods (GetModelMeter, CreateModelMeter, GetOrgModelSubscription, CreateOrgModelSubscription, SumTokenUsage, ListActiveSubscriptions) return nil/ErrNotFound without indication. Any billing logic will silently produce incorrect results (zero usage, always "not found") without the caller knowing.

Fix: Return errors.New("not implemented") from stubs, or add a compile-time interface guard that makes the incompleteness explicit.

8. Daemon inbound gRPC has no authentication (fluid-daemon/cmd/fluid-daemon/main.go)

The daemon's own gRPC server (for CLI/TUI connections on :9091) is created with bare grpc.NewServer() — no auth, no TLS. Any process on the same network reachable host can issue arbitrary VM management commands.

This may be intentional for local-only use, but it should be clearly documented and the default bind address should be localhost:9091, not 0.0.0.0:9091.

9. microvm.NewManager failure treated as warning, not error (fluid-daemon/cmd/fluid-daemon/main.go)

If microvm.NewManager fails, vmMgr is set to nil and a warning is logged, but initialisation continues. The microVM provider with a nil manager will produce panics or cryptic runtime errors. This should be a hard startup failure.

10. persistHostRegistration masks real DB errors (api/internal/grpc/stream.go)

Any error from GetHost (including DB connection failures) causes a CreateHost attempt, masking the original error. Only ErrNotFound should trigger creation; all other errors should abort registration.


Low Severity / Code Quality

11. SelectHostForSourceVM is non-deterministic (api/internal/orchestrator/placement.go)

Returns the first matching host from map iteration (random order in Go). With multiple hosts having the same source VM, all load goes to whichever host happens to be returned first. Consider selecting the host with the most available resources.

12. UpdateHost has upsert semantics (api/internal/store/postgres/postgres.go)

Uses db.Save() which inserts if the record doesn't exist, rather than returning ErrNotFound. This is inconsistent with other update methods that use db.Model(...).Updates(...).

13. DeleteSession doesn't check RowsAffected (api/internal/store/postgres/postgres.go)

Returns nil even when the session didn't exist, inconsistent with other delete methods that return ErrNotFound.

14. Log file world-readable (fluid-cli/cmd/fluid-cli/main.go)

Log file is opened with 0o644. Logs may contain sensitive command output or file contents from sandboxes. Use 0o600 instead.

15. Duplicate step numbering in main.go comments

Two comment blocks are labelled "6." — minor but confusing.


Positive Notes

  • The overall architecture (control-plane + daemon + registry + orchestrator) is well-structured and sensible for the stated goals.
  • Auth middleware correctly validates sessions, and resolveOrgMembership is a clean pattern that avoids duplication across REST handlers.
  • PasswordHash, AccessToken, RefreshToken, and TokenHash are all correctly excluded from JSON serialisation.
  • Host tokens use crypto/rand (256 bits) — secure token generation.
  • sync.Map used consistently for concurrent registry/pending-request state.
  • Test files are present for most packages.

The host ID hijacking bug (#1) and the host token delete bypass (#2) are the most impactful and should be addressed before merge. The DB URL logging (#3) is also a straightforward fix with high security value.

…facing APIs

Users now specify source_vm + live (bool) instead of source_vm + base_image.
live=true maps to SNAPSHOT_MODE_FRESH, live=false (default) uses cached image.
base_image remains internally in proto/daemon but is derived from source_vm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 18, 2026

PR Review: feat: convert fluid to a control-plane

This is a large and ambitious architectural transformation. The overall design is sound — centralized control plane, per-host daemons, gRPC streaming, MicroVM support, and multi-tenant auth are all the right building blocks. Here is a detailed review.


Summary of Findings

Area Finding Severity
Agent tool layer lacks org isolation RunCommand/Destroy in agent executor skip org ownership check High
Hand-rolled JSON parser in readMetadata Corrupts sandbox metadata with special characters in names Medium
Session expiry not validated in middleware Expired session tokens may be accepted Medium
SSH CA DefaultConfig TTL mismatch Comment says "1-10 minutes", defaults allow 30/60 min Medium
SSH CA — no CRL enforcement on VMs Revoked certs remain valid at the SSH server level Medium
QEMU PID file read timeout (1s) too short Spurious sandbox creation failures under load Low
proc.Wait() goroutine leak on SIGKILL timeout Goroutine leak when QEMU fails to exit Low
No rate limiting on auth endpoints Brute-force on password login possible Low
writeSSE ignores client disconnect Agent loop continues executing tools after disconnect Low
SSHConnectCommand example uses StrictHostKeyChecking=no Security anti-pattern in example output Low
No tests for IssueCertificate Most security-critical path in daemon is untested Missing
No tests for StreamChat agentic loop SSE streaming and tool calls untested Missing

Security Issues

[High] Agent executor bypasses org ownership checks (api/internal/agent/executor.go)

The REST handlers check sandbox ownership before calling orchestrator methods:

// In REST handler — correct
if sandbox.OrgID != org.ID {
    return 403
}
orchestrator.RunCommand(...)

But the agent executor calls the orchestrator directly without an org check:

// In executor.go — missing ownership check
func (c *Client) execRunCommand(ctx context.Context, params map[string]any) (any, error) {
    return c.orchestrator.RunCommand(ctx, strParam(params, "sandbox_id"), ...)
}

orchestrator.RunCommand and orchestrator.DestroySandbox fetch the sandbox by ID without filtering by org. This means a crafted sandbox_id passed to the agent could operate on sandboxes belonging to a different org. The orchestrator methods should accept and enforce an orgID parameter, or the executor should validate ownership before delegating.

[Medium] Session expiry not validated in auth middleware (api/internal/auth/)

RequireAuth retrieves the session from the store but does not check sess.ExpiresAt. If GetSession in the PostgreSQL store does not include WHERE expires_at > NOW(), then expired sessions will be accepted indefinitely (until the background cleanup job runs). Recommend either:

  • Adding the expiry filter to GetSession, or
  • Checking sess.ExpiresAt.Before(time.Now()) explicitly in the middleware

[Medium] SSH CA DefaultConfig TTL is inconsistent with documented limits

The struct comment says TTL "must be between 1 and 10 minutes" but:

func DefaultConfig() Config {
    return Config{
        DefaultTTL: 30 * time.Minute,  // Exceeds documented 10-minute max
        MaxTTL:     60 * time.Minute,  // Also exceeds
    }
}

The validation enforces the 1-minute minimum and the MaxTTL cap, but MaxTTL itself is never validated. Either update the comment to reflect the actual policy (30/60 min) or tighten the defaults.

[Medium] No certificate revocation enforcement at SSH level

RevokeCertificate and CertStatusRevoked exist in the store, but there is no mechanism to push a CRL to the VMs or for VMs to check revocation in real-time. Once a certificate is issued, revoking it in the store has no effect until the cert expires naturally. Consider documenting this limitation, and if immediate revocation is needed, look into AuthorizedPrincipalsCommand or a custom AuthorizedKeysCommand hook on VMs that checks the store.

[Low] SSHConnectCommand returns a command with StrictHostKeyChecking=no

Even labeled as an example, this will be copy-pasted. Suggest using StrictHostKeyChecking=accept-new and documenting how to populate known_hosts from the VM's host key.

[Low] No rate limiting on auth endpoints

The /auth/login, /auth/register, and OAuth endpoints have no rate limiting. A basic in-memory limiter (e.g., golang.org/x/time/rate) keyed by IP and/or username would prevent brute-force attacks.


Correctness Bugs

[Medium] Hand-rolled JSON parser in readMetadata (fluid-daemon/internal/microvm/manager.go)

writeMetadata uses fmt.Sprintf with %q to produce JSON strings, and readMetadata uses a custom extractJSONString scanner to parse them back. The custom scanner does not handle escaped characters inside JSON strings (e.g., \", \\, \n). If a sandbox name contains a backslash or double quote, readMetadata will return corrupt data.

The fix is straightforward — use encoding/json for both:

// writeMetadata
data, err := json.Marshal(meta)

// readMetadata
var meta sandboxMetadata
err = json.Unmarshal(data, &meta)

[Low] QEMU PID file polling timeout is too short

The daemon polls for the QEMU PID file up to 10 times × 100ms = 1 second total. On a heavily loaded host, QEMU may take longer to daemonize and write its PID. Consider increasing the retry count or using a configurable timeout (e.g., 5–10 seconds).

[Low] proc.Wait() goroutine leak on SIGKILL timeout

After the 5-second time.After fires in Destroy, the goroutine calling proc.Wait() is leaked — it will block until the process eventually dies (or forever if it doesn't):

go func() {
    _, _ = proc.Wait()  // This goroutine leaks on timeout
    close(done)
}()
select {
case <-done:
case <-time.After(5 * time.Second):
    m.logger.Warn("process did not exit after SIGKILL")
    // goroutine is now leaked
}

Consider tracking active wait goroutines in the manager's lifecycle or using os.FindProcess and a manager-level WaitGroup.

[Low] writeSSE ignores client disconnect (api/internal/agent/)

func (c *Client) writeSSE(w http.ResponseWriter, flusher http.Flusher, event string, data any) {
    _, _ = fmt.Fprintf(w, "event: %s\ndata: %s\n\n", ...)
    flusher.Flush()
}

After client disconnect, the write fails silently and the agent loop continues executing tool calls (which may call the daemon, run commands, etc.) and accumulating costs. The response writer's CloseNotify (or request context) should be used to cancel the agent loop on disconnect.


Test Coverage Gaps

sshca.CA.IssueCertificate — This is the highest-risk code path in the entire daemon and has no unit tests. The validateRequest function's input validation logic (key type check, base64 validation, TTL clamping) should be tested without needing a real ssh-keygen binary, by extracting validation into a pure function.

StreamChat agentic loop — The streaming SSE handler and tool-calling loop in openrouter.go is completely untested. Token counting, tool call argument accumulation, message persistence, and error handling paths all need coverage.

network.CreateTAP / DiscoverIP — These require elevated privileges, but the IP discovery strategy selection and fallback logic can be tested with mock subprocess runners.


Code Quality Observations (Positive)

  • Consistent use of context.Context throughout — good.
  • Interface-based design for all stores (store.Store, CertificateStore, HostSender) enables testability.
  • Compile-time interface checks (var _ CertificateStore = (*MemoryStore)(nil)) are a nice touch.
  • Auth test coverage is thorough — all auth paths, middleware, OAuth state, and session lifecycle are tested.
  • gRPC server tests cover success, timeout, EOF, heartbeat, and error report — well done.
  • Orchestrator placement tests cover stale heartbeat, resource constraints, and org isolation at the orchestrator level.
  • Structured logging with slog used consistently.

Minor

  • maxToolCalls = 20 in the agent is hardcoded. A brief comment on why this value was chosen (cost guard, loop prevention) would help future readers.
  • The model cache selects the "best" model by highest price per token. This heuristic may not always align with quality and the intent should be documented.

Overall this is a well-structured PR. The main items to address before merging are the org isolation gap in the agent executor, the readMetadata JSON parsing bug, and the session expiry check. The SSH CA and test coverage gaps are important to close before this handles real production traffic.

aspectrr and others added 2 commits February 18, 2026 08:06
…safety

- OAuth helpers: check HTTP status codes, use 10s timeout client, propagate email verified status
- Guard against removing org owner via handleRemoveMember
- Billing webhook: check CreateSubscription error, remove dead GetSubscriptionByOrg("") call
- Hide StripeCustomerID from non-owner org members
- Redact database URL credentials in startup logs
- persistHostRegistration: only create host on ErrNotFound, log other errors
- Config env helpers: warn on unparseable values instead of silent fallback
- monitorHeartbeat: disconnect host after 3 consecutive missed heartbeats
- PrepareSourceVM/ValidateSourceVM: return concrete proto types instead of any

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 18, 2026

PR Review: feat: convert fluid to a control-plane

This is a significant architectural overhaul — converting from a single-node libvirt clone model to a multi-host microVM control plane with gRPC streaming, PostgreSQL, and Stripe billing. Overall the structure is well-thought-out and the code quality is solid. Here are my findings:


Architecture & Design

Positive:

  • Clean separation of concerns: orchestrator, registry, grpc/stream, store, agent, billing are all well-bounded packages.
  • The HostSender interface in orchestrator.go makes the orchestrator testable independently of the gRPC transport — good design.
  • The persistHostRegistration pattern (check → create or update) is correct and handles ErrNotFound properly.
  • Double-checked locking in EnsureModelMeter is appropriate for a lazy-init pattern under concurrent requests.

Potential Bugs

1. Host mutation in persistHostRegistration without locking (stream.go:263)

persistHostRegistration is called inside Connect(), which runs concurrently per-host stream. While GetHost/UpdateHost are store calls, the existing record is fetched, mutated in memory, and then saved. If the same hostID connects twice in rapid succession (reconnect race), both goroutines could read the same stale record and write conflicting updates. Consider wrapping GetHost→mutate→UpdateHost in a database-level upsert or a per-host mutex.

2. EnsureModelMeter creates Stripe objects before DB save fails (billing/meters.go:109-206)

If mm.store.CreateModelMeter fails after all four Stripe objects (Product, two BillingMeters, two Prices) have been created, those Stripe objects are orphaned — there's no rollback. On retry, a new set of Stripe objects will be created again. Over time this could lead to duplicate Stripe resources.

Consider saving the meter record first (with a flag like pending) and updating it to active after all Stripe calls, or using idempotency keys.

3. Free-tier billing calculation relies on cumulative including the current conversation (billing/meters.go:347-348)

thisChat := float64(inputTokens + outputTokens)
prevTotal := cumulative - thisChat // usage before this chat (already recorded by the time we get here)

This assumes SumTokenUsage already includes the current conversation's tokens. If the usage record is written after ReportUsage is called, prevTotal will be incorrect (negative or zero), causing overbilling or underbilling. The comment "already recorded by the time we get here" is an implicit ordering contract that's not enforced in code. Add an explicit assertion or document the call sequence clearly.

4. monitorHeartbeat ticks every 30 seconds but heartbeatTimeout can be configured to < 30s (stream.go:229)

If ORCHESTRATOR_HEARTBEAT_TIMEOUT is set below 30 seconds, the ticker fires less frequently than the timeout window, meaning the first miss won't be detected until the tick fires. The ticker interval should be heartbeatTimeout / 3 or a similar fraction of the timeout.

5. DestroySandbox does not handle "host not connected" gracefully (orchestrator.go:219-252)

If the host is offline when destroy is called, SendAndWait returns an error and the sandbox record is never deleted from the store. The sandbox stays in the DB as if it still exists. Consider either:

  • Adding a --force flag that deletes the DB record even if the host is unreachable.
  • Or soft-marking the sandbox as ORPHANED so operators can see the discrepancy.

Security Concerns

6. stripe.Key is set as a package-level global (billing/meters.go:105, 241, 288, 381)

stripe.Key = mm.stripeKey

The Stripe Go SDK uses a global stripe.Key. Setting it inline before each API call creates a race condition if MeterManager methods are called concurrently from multiple goroutines. Use stripe.NewBackends() and construct a per-instance client.API instead:

sc := &client.API{}
sc.Init(mm.stripeKey, nil)

This is a real data-corruption/wrong-billing risk if two orgs trigger billing simultaneously.

7. No email validation at registration (auth_handlers.go:71-79)

Only req.Email == "" is checked. Arbitrary strings like "not-an-email" are accepted. Add basic format validation (e.g., strings.Contains(req.Email, "@") at minimum, or a proper validator).

8. OAuth tokens stored in plaintext (postgres.go:52-53)

OAuthAccountModel stores AccessToken and RefreshToken as plain strings. If the database is breached, all user OAuth tokens are exposed. Consider encrypting them at rest using a server-side key.

9. Session token is 64 hex chars (256 bits of randomness halved by hex encoding = 128 bits) — acceptable, but verify (auth/session.go)

32 random bytes hex-encoded → 64 chars → 128 bits of entropy. This is fine for session tokens, but worth documenting explicitly.


Performance Considerations

10. ListVMs fans out to all hosts synchronously in a loop (orchestrator.go:511+)

for _, host := range connected {
    resp, err := o.sender.SendAndWait(...)
}

If there are N hosts, this is N sequential RPCs, each with its own timeout. This should be parallelized with goroutines + a wait group, especially since SendAndWait can block for seconds.

11. bufio.Scanner default buffer may truncate large command outputs (openrouter.go:311)

bufio.NewScanner has a default 64KB line buffer. Long SSE data lines from OpenRouter (e.g., large tool result payloads) will silently be dropped with a scan error. Use scanner.Buffer(make([]byte, 1<<20), 1<<20) to increase the limit.


Code Quality

12. Ignored errors from c.store.CreateAgentMessage (openrouter.go:146, 187, 201, 232)

_ = c.store.CreateAgentMessage(ctx, userMsg)

Message persistence failures are silently swallowed. If the DB is unavailable, the conversation history will be incomplete without any indication to the caller. These should at least be logged at Warn level.

13. Hardcoded model list in AvailableModels() (openrouter.go:411-418)

Model IDs and pricing are hardcoded. The ModelCache abstraction exists already — AvailableModels() should use it for consistency, or the hardcoded list should be the authoritative source for the cache.

14. CreateSandbox double host-selection attempt with unclear fallback logic (orchestrator.go:71-81)

host, err := SelectHost(o.registry, baseImage, req.OrgID, o.heartbeatTimeout)
if err != nil {
    if req.SourceVM != "" {
        host, err = SelectHostForSourceVM(...)
    }
}

SelectHost is called first with baseImage (which equals req.SourceVM), and if it fails, SelectHostForSourceVM is called with the same req.SourceVM. If SelectHost is supposed to be the general-purpose selector and SelectHostForSourceVM the source-VM-aware one, the logic seems inverted — SelectHostForSourceVM should probably be called first when req.SourceVM != "".

15. truncate in openrouter.go splits on bytes, not runes (openrouter.go:403-408)

return s[:maxLen]

This will panic or produce invalid UTF-8 if the conversation title starts with a multi-byte character near position 100. Use []rune(s)[:maxLen] instead.


Test Coverage

Test files are present for most packages (executor_test.go, stream_test.go, orchestrator_test.go, etc.) which is good. A few gaps:

  • billing/meters.go and billing/ticker.go have no test files in the PR. Given that these involve real money, they should have tests (with a mocked Stripe backend).
  • agent/openrouter.go's StreamChat agentic loop has no dedicated tests.

Minor Nits

  • io.EOF comparison on line 129 of stream.go should use errors.Is(err, io.EOF) for correctness with wrapped errors (though gRPC typically doesn't wrap EOF, it's still better practice).
  • The ResourceReport handler (stream.go:154-156) logs "received resource report" but does nothing with the data. If resource reports aren't yet implemented, add a // TODO comment.
  • MetadataJSON in UsageRecordModel is stored as a raw JSON string column rather than JSONB. Consider making it JSONB for queryability.

Overall this is a solid PR that lays the right foundation. The Stripe global key race (#6) and the Stripe orphaned resources on DB failure (#2) are the most important issues to address before shipping.

@claude
Copy link

claude bot commented Feb 18, 2026

Code Review: feat: convert fluid to a control-plane

This is a substantial architectural refactor. The overall direction is sound and well-structured. Below are my observations by category.


Architecture

The shift from direct CLI→daemon gRPC to a centralized control plane with NAT-friendly outbound streaming is a good call for a multi-host/enterprise deployment model. The three-layer separation (CLI → control plane → daemon) with the bidirectional stream for daemon-to-CP communication is clean and the protocol buffer definitions are well-organized.

The SandboxProvider abstraction over QEMU microVMs and Proxmox LXC is a good interface design choice that will make future backends easier to add.


Potential Bugs

1. Hardcoded snapshot name in libvirt backend
virsh snapshot-create-as --name fluid-tmp-snap is hardcoded. If the daemon crashes mid-operation, a retry will fail because the snapshot already exists with that name. There's no cleanup of a pre-existing fluid-tmp-snap before creating a new one.

// Suggestion: check/delete pre-existing snapshot before creating
virsh snapshot-delete <vm> fluid-tmp-snap --metadata  // if exists

2. Hardcoded root device in QEMU -append

"-append", "console=ttyS0 root=/dev/vda rw quiet"

This assumes the base image has its root on /dev/vda. Images using /dev/sda, /dev/nvme0n1p1, or other device names will silently fail to boot. Consider making this configurable in LaunchConfig or reading it from image metadata.

3. Hand-crafted JSON in writeMetadata
metadata.json is written with a format string rather than encoding/json. If any field (e.g., the Name field) ever contains a double-quote, the output would be malformed JSON. The readMetadata function uses encoding/json to parse it — consider using encoding/json for both directions.

4. TAP name collision risk
TAPName uses only the first 6 characters of the sandbox ID suffix. Two sandbox IDs with the same first 6 characters after the SBX- prefix would collide at the kernel TAP level. This is unlikely in practice but worth documenting or extending the prefix length.


Security

Host token auth (auth/hostauth.go): Correctly stores SHA-256 hash of the token, never the raw token. Good.

SSH CA (sshca/ca.go): Solid design — Ed25519, ephemeral TTL (1–60 min), random serial initialization, disabled forwarding options. One nit: the default key path is /etc/virsh-sandbox/ssh_ca which contains the old virsh-sandbox name from the previous codebase. Should be updated to /etc/fluid/ssh_ca or similar.

SSH host probing (sshconfig/prober.go): Uses StrictHostKeyChecking=no — this is appropriate for automation but should be explicitly documented as an intentional trust decision, especially for enterprise deployments that may have MITM concerns on internal networks.

No TLS on local gRPC port by default: ControlPlane.Insecure: true defaults to plaintext. This is acceptable for localhost-only use, but the documentation should call this out so operators know to enable TLS when exposing the daemon port across a network.

ProbeAll has no rate limiting: Concurrent SSH probes for large SSH configs (many hosts) could cause connection storms or get rate-limited by target hosts. Consider adding a semaphore or bounded worker pool.


Code Quality

Duplicate create-sandbox logic: The sandbox creation flow (including snapshot-pull, overlay setup, etc.) appears in both agent/client.go (for control-plane-delegated requests) and daemon/server.go (for direct CLI requests). This duplication will be a maintenance burden. Extracting a shared sandboxLifecycle helper or using the provider interface consistently would reduce drift between the two paths.

Non-deterministic host selection: In DiscoverSourceHosts, the first connected host (connected[0]) is selected after iterating a map, which has non-deterministic ordering in Go. This is a minor issue but could cause unexpected behavior. Consider sorting by host ID or last-heartbeat time.

SSH public key from CreateRequest is silently dropped: microvm_provider.go's CreateSandbox accepts SSHPublicKey in the request but doesn't inject it (cloud-init ISO path is not populated). If callers expect SSH key injection to work, this will silently fail. Either document that key injection requires a CA-baked base image, or wire up cloud-init.


Test Coverage

Strong coverage in:

  • gRPC stream handler (stream_test.go): SendAndWait success/timeout/error/heartbeat/dispatch all covered.
  • Placement algorithm (placement_test.go): All filter conditions (memory, CPU, org, heartbeat) tested.
  • Janitor (janitor_test.go): Expired/not-expired/destroy-error cases.
  • LXC provider: naming, config, and types tested.

Gaps:

  • sshca/ca.go: No visible tests for IssueCertificate. This is security-critical code and should have unit tests covering TTL enforcement, extension restrictions, and serial uniqueness.
  • microvm/manager.go: Launch() and Stop() are not tested (understandable since they require QEMU), but a mock or interface-based approach could enable testing the orchestration logic around them.
  • snapshotpull/puller.go: The deduplication / inflight-coalescing logic isn't tested. This is a correctness-critical path (concurrent callers for the same base image).

Minor Notes

  • The Proxmox backend's downloadFile uses a file-restore/download endpoint with filepath=/ — worth verifying this against current Proxmox API docs, as this endpoint is typically for file-level restore from backups, not full disk image export.
  • internal/id/id.go is new — confirm it uses a cryptographically random source (not math/rand) for sandbox IDs, since they are used in TAP device names and path construction.

Summary

The architecture is solid and the implementation quality is generally high. The main items to address before merging:

  1. Bug: Leftover snapshot (fluid-tmp-snap) on crash/retry in libvirt backend.
  2. Bug: Hardcoded /dev/vda in QEMU kernel args — should be configurable.
  3. Security: Update the SSH CA key path away from the old virsh-sandbox name.
  4. Test gap: Add tests for IssueCertificate in sshca/ca.go.
  5. Test gap: Add tests for snapshot-pull deduplication logic.
  6. Code quality: Deduplicate the sandbox creation flow between agent/client.go and daemon/server.go.

Great progress on a complex refactor — the move to microVMs and a centralized control plane is the right direction.

Hash session tokens with SHA-256 before DB storage, add secure cookie
config flag, validate email format at registration, bound OAuth response
reads, replace hand-rolled JSON with Marshal/Unmarshal in microvm manager,
add overlay cleanup on launch failure, set Stripe key once at init,
add compensating DestroySandbox on DB persist failure, fix truncate to
use runes, log CreateAgentMessage errors, tighten log file permissions,
check DeleteSession RowsAffected, add RootDevice to LaunchConfig, and
update SSH CA default paths from virsh-sandbox to fluid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 18, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial architectural PR that transforms the project into a multi-tenant control plane with PostgreSQL, gRPC streaming, billing integration, and an embedded LLM agent. The overall architecture and direction are solid. Below are specific issues found during review, organized by severity.


Critical Issues (Blockers)

1. Billing stubs cause duplicate Stripe meters and $0 billing (api/internal/store/postgres/postgres.go:1656-1678)

GetModelMeter, CreateModelMeter, and SumTokenUsage are unimplemented stubs:

func (s *postgresStore) GetModelMeter(_ context.Context, _ string) (*store.ModelMeter, error) {
    return nil, store.ErrNotFound
}
func (s *postgresStore) CreateModelMeter(_ context.Context, _ *store.ModelMeter) error {
    return nil
}
func (s *postgresStore) SumTokenUsage(_ context.Context, ...) (int64, error) {
    return 0, nil
}

Because GetModelMeter always returns ErrNotFound, EnsureModelMeter in billing/meters.go will create a new Stripe Product + Meter + Prices on every agent invocation, generating enormous numbers of duplicate Stripe objects. Because SumTokenUsage always returns 0, the free-tier calculation at meters.go:342-365 always sees cumulative=0, causing all token usage to be free. These are production billing bugs that will cause real financial harm.

2. Race condition: host registered before stream entry is stored (api/internal/grpc/stream.go:96-100)

if err := h.registry.Register(hostID, orgID, hostname, stream); err != nil {
    return fmt.Errorf("register host: %w", err)
}
h.registry.SetRegistration(hostID, reg)
h.streams.Store(hostID, stream)  // stored after Register — too late

registry.Register makes the host visible to ListConnectedByOrg, but streams.Store happens two lines later. An orchestrator.SendAndWait between these two lines will see the host as connected but fail with "host not connected" because the stream entry isn't stored yet. streams.Store should happen before registry.Register.


High Severity

3. Compensating destroy uses already-cancelled context (api/internal/orchestrator/orchestrator.go:191-203)

When a sandbox DB write fails after the VM is created, the compensating destroy is attempted using the original ctx, which may already be cancelled (e.g., on request timeout). Use context.WithoutCancel(ctx) (Go 1.21+) or a fresh background context for cleanup operations to avoid orphaned VMs.

4. Host selection ignores requested sandbox memory (api/internal/orchestrator/placement.go:30-34)

if h.Registration.GetAvailableMemoryMb() < 512 {
    continue
}

The minimum threshold (512 MB) is hardcoded and doesn't consider req.MemoryMB. A host with 512 MB free would be selected even when the user requests a 4 GB sandbox, causing the daemon to fail at VM creation time.

5. Stripe objects not cleaned up on partial EnsureModelMeter failure (api/internal/billing/meters.go)

Stripe objects are created sequentially: Product → Input Meter → Output Meter → Input Price → Output Price → DB record. If any step after Product creation fails, prior Stripe objects are orphaned with no cleanup. Add rollback for previously created objects on failure.


Medium Severity

6. Stream context used for DB persistence during host connect (api/internal/grpc/stream.go:103)

h.persistHostRegistration(stream.Context(), hostID, orgID, reg)

If the gRPC client disconnects immediately after registering, the stream context is cancelled and the DB write fails silently. Use a background context (derived from the server's root context) for persistence operations that should outlive the stream.

7. Raw internal errors returned to clients (api/internal/rest/sandbox_handlers.go:53,156,213,382)

serverError.RespondError(w, http.StatusInternalServerError, fmt.Errorf("failed to create sandbox: %s", err.Error()))

Internal details (host IDs, gRPC errors, DB errors) are returned directly to the caller. Log the full error server-side and return a generic user-facing message.

8. ListVMs is sequential across hosts — O(N × 30s) worst case (api/internal/orchestrator/orchestrator.go:527-569)

Requests to each connected host are issued sequentially via SendAndWait. Use a goroutine fan-out with errgroup or sync.WaitGroup to issue all requests concurrently.

9. Sort order race in execAddPlaybookTask (api/internal/agent/executor.go:309-310)

tasks, _ := c.store.ListPlaybookTasks(ctx, strParam(params, "playbook_id"))
sortOrder := len(tasks)  // not locked — two concurrent calls get the same order

Two concurrent add_playbook_task calls for the same playbook will both read the same len(tasks) and create tasks with duplicate sort orders. Wrap in a transaction that atomically reads and increments the max sort order.

10. gRPC server stopped before HTTP drain (api/cmd/server/main.go:167)

grpcSrv.Stop() is called before httpSrv.Shutdown(). In-flight HTTP requests that depend on orchestrator operations (which require active gRPC streams) will fail during the HTTP drain window. Shut down HTTP first, then gRPC.

11. UpdateHost/UpdateSandbox use GORM Save() — full row overwrites (api/internal/store/postgres/postgres.go:1033,1106)

Save() issues an UPDATE on all columns, causing write amplification and making concurrent updates potentially overwrite each other. Use targeted Updates(map[string]any{...}) with only the changed fields, as done in UpdateHostHeartbeat.

12. Heartbeat monitor ticker hardcoded to 30s (api/internal/grpc/stream.go:229)

ticker := time.NewTicker(30 * time.Second)

If heartbeatTimeout is configured below 30s, the monitor won't fire in time. The ticker interval should be derived from heartbeatTimeout (e.g., heartbeatTimeout / 3).


Low Severity / Code Quality

13. Missing DB indexes for common query patterns (api/internal/store/postgres/postgres.go)

  • commands(sandbox_id, started_at) — range query with ORDER BY
  • usage_records(org_id, recorded_at) — billing range queries
  • sessions(expires_at)DeleteExpiredSessions does a full table scan

14. FRONTEND_URL CORS not validated (api/internal/rest/server.go:186)

No check that frontendURL is a valid URL and not "*". A misconfigured value of "*" would open CORS completely. Add a startup-time validation.

15. Inconsistent SecureCookies vs OAuth redirect URL defaults (api/internal/config/config.go:119,124)

SecureCookies defaults to true but AUTH_GITHUB_REDIRECT_URL defaults to http://localhost:5173/.... In local dev, the session cookie won't be sent back on HTTP requests, breaking the OAuth flow silently.

16. No upper bound on timeout_seconds in RunCommand (api/internal/rest/sandbox_handlers.go)

A client can supply timeout_seconds: 2147483647. Add a maximum (e.g., 3600s).

17. auth.contextKey should use a struct type (api/internal/auth/middleware.go)

String-based context keys can collide across packages. Use a private struct type (as done with hostOrgKey) for all context keys.

18. registry.SelectHostForImage appears to be dead code (api/internal/registry/registry.go:122)

The function duplicates logic already in placement.SelectHost without the heartbeat or resource checks. If unused, remove it.

19. No required-field validation at startup (api/internal/config/config.go)

Load() returns a config with potentially empty DATABASE_URL, STRIPE_SECRET_KEY, etc. Validate required fields explicitly with actionable error messages rather than failing at the first usage site.


Summary Table

# Severity File Issue
1 Critical store/postgres/postgres.go Billing stubs create duplicate Stripe meters, all usage billed $0
2 Critical grpc/stream.go Race: host visible before stream stored; SendAndWait fails
3 High orchestrator/orchestrator.go Compensating destroy uses cancelled context; VM orphaned on timeout
4 High orchestrator/placement.go Host selection ignores requested sandbox memory
5 High billing/meters.go No rollback on partial Stripe object creation
6 Medium grpc/stream.go Stream context used for DB writes; cancelled on disconnect
7 Medium rest/sandbox_handlers.go Raw internal errors returned to clients
8 Medium orchestrator/orchestrator.go ListVMs sequential — O(N × 30s) worst case
9 Medium agent/executor.go Concurrent add_playbook_task creates duplicate sort orders
10 Medium cmd/server/main.go gRPC stopped before HTTP drain
11 Medium store/postgres/postgres.go Save() causes full-row overwrites on concurrent updates
12 Low grpc/stream.go Heartbeat ticker hardcoded to 30s, ignores heartbeatTimeout config
13-19 Low Various Missing indexes, CORS validation, config inconsistencies, dead code

The architecture itself is clean and well thought out — the gRPC bidirectional streaming approach for daemon communication, the registry/orchestrator separation, and the overall module structure are all solid. The critical billing issues (#1, #2, #5) should be resolved before merging, as they would cause real financial impact in production.

…ons, security

- Implement 6 billing store methods with real GORM queries (critical)
- Fix stream registration race: store stream before registry.Register (critical)
- Pass requested CPU/memory to host selection instead of hardcoded values (high)
- Add Stripe rollback on partial EnsureModelMeter failure (high)
- Use background context for host persistence after registration (medium)
- Sanitize internal errors in REST responses, log full details server-side (medium)
- Parallelize ListVMs across hosts with errgroup (medium)
- Fix sort order race in execAddPlaybookTask by handling error (medium)
- Reverse shutdown order: HTTP before gRPC for graceful drain (medium)
- Replace Save() with targeted Updates() for UpdateHost/UpdateSandbox (medium)
- Derive heartbeat ticker interval from heartbeatTimeout config (medium)
- Add composite DB indexes on commands, usage_records, sessions (low)
- Add config.Validate() for DATABASE_URL and FRONTEND_URL (low)
- Default SecureCookies to false for local dev (low)
- Cap timeout_seconds at 3600 in RunCommand handler (low)
- Remove dead code SelectHostForImage and its tests (low)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aspectrr and others added 2 commits February 18, 2026 19:10
- Override daemon-supplied hostID with token-derived identity to prevent impersonation
- Add per-org mutex for free tier billing to prevent race conditions
- Record LLM token usage per iteration instead of batched at end
- Add rate limiting on auth endpoints (register/login)
- Use net/mail for email validation instead of manual string parsing
- Add prompt injection guard to agent system prompt
- Return value copies from registry to prevent data races
- Extract timeout constants in orchestrator, add combined CPU+memory placement scoring
- Count actually connected daemons via registry instead of host tokens
- Default API port to 8080, secure cookies to true
- Add godotenv for local .env loading
- Update web landing page, docs, SDK config, and docker-compose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move setFrame(0) to cleanup function to avoid setState in effect body
- Wrap ref assignment in useEffect to avoid updating refs during render

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial and well-architected PR — converting from a standalone libvirt-cloning tool to a proper control-plane with multi-host orchestration, gRPC streaming, PostgreSQL-backed state, billing, and an embedded LLM agent. The overall structure is clean and the separation of concerns is solid. Below is a detailed review.


Strengths

  • Good security fundamentals: session tokens and host tokens are SHA-256 hashed before storage; raw tokens never touch the DB. Host ID override from token prevents daemon impersonation (stream.go:79-85).
  • Compensating transaction in CreateSandbox: if the DB write fails after the VM is already created on-host, a compensating destroy command is issued to avoid orphaned VMs (orchestrator.go:202-221). Well-done.
  • Per-host mutex for gRPC stream writes: serializes concurrent Send calls per host correctly (stream.go:215-218).
  • Per-org mutex in billing free-tier calculation: prevents double-counting under concurrent requests (meters.go:376-378).
  • Prompt injection defense: system prompt explicitly instructs the LLM not to follow instructions from tool output (openrouter.go:122). Good defensive practice.
  • Solid test coverage across auth, billing, orchestrator, grpc, and REST layers.

Bugs / Issues

1. EnsureModelMeter – double-check after lock doesn't handle non-ErrNotFound errors (billing/meters.go:95-98)

After acquiring the lock, only err == nil is checked. If the DB returns an unexpected error, execution falls through and creates Stripe objects. Should mirror the pre-lock check and return early on non-ErrNotFound errors.

2. EnsureOrgSubscriptionItems has no locking (billing/meters.go:258-315)

Unlike EnsureModelMeter, there is no lock around the check-then-create sequence for subscription items. Concurrent calls for the same (orgID, modelID) pair can race and create duplicate subscription items in Stripe, which are difficult to clean up.

3. SelectHostForSourceVM fallback skips resource checks (orchestrator.go:95-105)

When SelectHost fails (no pre-built base image), the fallback SelectHostForSourceVM is tried. This path does not check vcpus or memoryMB constraints, so a sandbox can be scheduled on a host without sufficient resources.

4. Rate limiter uses RemoteAddr directly (rest/ratelimit.go:38-41)

If the API is behind a reverse proxy or load balancer, RemoteAddr will always be the proxy IP, so all users share a single rate limiter bucket. Consider using X-Forwarded-For / X-Real-IP when a trusted proxy is configured.

5. Rate limiter cleanup goroutine is never stopped (rest/ratelimit.go:23-34)

The cleanup goroutine runs indefinitely with time.Sleep. It's called twice at startup (register and login), creating goroutines that live for the process lifetime with no way to stop them.


Security Concerns

6. Missing Vary: Origin response header (rest/server.go:186-203)

The CORS middleware unconditionally sets Access-Control-Allow-Origin to the configured frontend URL without a Vary: Origin response header. Caching proxies may serve one client's response to another client with a different origin. Add w.Header().Add("Vary", "Origin").

7. SameSite=Lax on session cookie (auth/session.go:65)

For an API-only backend, SameSite=Strict is safer as it prevents the cookie from being sent on cross-site top-level navigations. Worth confirming Lax is intentional.


Design / Quality Notes

8. ListVMs silently discards errgroup results (orchestrator.go:551, 598)

The derived context from errgroup.WithContext is discarded, and g.Wait() errors are discarded. The partial-results-on-failure behavior is intentional, but callers have no visibility into which hosts failed. Consider at least logging the aggregate error.

9. No pagination on list endpoints

ListSandboxes, ListCommands, ListPlaybooks etc. have no limit/offset parameters. At scale these could return very large payloads. The store.ListOptions struct already exists — it just isn't wired up in the REST handlers.

10. StartSandbox/StopSandbox don't validate current state (orchestrator.go:347-421)

Starting an already-running sandbox or stopping an already-stopped one dispatches a command to the host. If the host returns an error, the store state may diverge. A pre-check on sandbox.State before sending the command would catch this early.

11. Billing meter rollback doesn't cover Stripe billing meters (billing/meters.go:122-137)

The rollback function deactivates prices and products but not the two billing meters (Stripe does not support deleting meters). This should be documented in a comment so operators know that partial-failure scenarios may leave orphaned meters.

12. auth/middleware.go context key uses string type

The rest of the codebase (hostauth.go) uses unexported struct types (type hostOrgKey struct{}) as context keys, which is safer. middleware.go uses type contextKey string — consider aligning to the same pattern.


Summary

This is a well-engineered PR with a clear architecture. The main actionable items before merge are:

  1. Fix the double-check lock in EnsureModelMeter (bug — can create Stripe objects on transient DB errors)
  2. Add locking to EnsureOrgSubscriptionItems (bug — can create duplicate subscription items under concurrency)
  3. Fix resource check in SelectHostForSourceVM fallback (bug — can schedule onto underpowered hosts)
  4. Fix rate limiter IP detection for proxy deployments (security/ops)

The rest are improvements that could follow in subsequent PRs.

Convert billing.tsx to Outlet layout so child routes render. Delete
standalone /calculator route (wrong pricing). Remove Calculator from
sidebar nav. Switch sandbox pricing from hourly to $50/mo flat. Apply
free tier deductions in calculator. Fix token markup from 0.5% to 5%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 19, 2026

Code Review: feat: convert fluid to a control-plane

This is a substantial architectural shift — moving from a standalone CLI to a full control-plane with a centralized API, multi-host orchestration, MicroVM-based sandboxes, and billing. The overall structure is solid and well-thought-out. Below is my review organized by category.


Architecture & Design

Strengths:

  • Clean separation of concerns: registry (in-memory), store (persistent), orchestrator (lifecycle coordination), grpc (stream handler) are well isolated.
  • The Store/DataStore interface split is good — it allows the WithTx pattern without exposing lifecycle methods everywhere.
  • Compensating transaction on DB persist failure after sandbox creation (orchestrator.go:203-221) is a nice touch to avoid orphaned VMs.
  • Using sync.Map for per-host stream and request tracking is appropriate for this access pattern.
  • Per-host mutex for serializing stream.Send calls prevents interleaved writes correctly.

Concerns:

No RBAC beyond org membershipresolveOrgMembership only checks that the user is a member, but destructive operations (destroy sandbox, run commands) don't check OrgRole. A plain member has the same write access as an owner. Consider requiring admin/owner for mutations.

DiscoverSourceHosts picks the first connected host (orchestrator.go:764):

host := connected[0]

This is non-deterministic. If there are multiple connected hosts, the one chosen for discovery may not be the most appropriate. Should document the selection criterion or add smarter logic.

CreateSandbox host selection fallback (orchestrator.go:95-105) discards the first error and tries a different strategy silently. The caller receives no indication of why the primary selection failed.


Security

Strengths:

  • bcrypt cost 12 for passwords — appropriate.
  • Session tokens are SHA-256 hashed before storage. Raw token stays client-side only.
  • OAuth CSRF state uses crypto/rand and subtle.ConstantTimeCompare — correct.
  • PasswordHash and TokenHash are tagged json:"-" — not leaked in responses.
  • Secure cookie flags (HttpOnly, SameSite: Lax) are set properly.
  • TLS warning logged when gRPC runs without TLS (main.go:99).

Concerns:

No minimum password length (auth_handlers.go:72-74):

if req.Email == "" || req.Password == "" || req.DisplayName == "" {

A 1-character password is accepted. Add a minimum length check (e.g., 8+ chars).

Email not normalized on registration — email is stored as-is. user@example.com and User@Example.com would create separate accounts or cause lookup inconsistencies. Normalize to lowercase before storing/querying.

OAuth routes lack rate limiting (server.go:79-82):

r.Get("/github", s.handleGitHubLogin)
r.Get("/github/callback", s.handleGitHubCallback)

/login and /register have rate limiting, but OAuth initiation/callback endpoints don't. Consider adding rate limiting to the OAuth flow as well.

ProxmoxSecret sent over gRPC (orchestrator.go:141): The Proxmox API secret is included in the SourceHostConnection proto message sent over the wire. The existing TLS warning is appropriate, but this reinforces that production deployments must use TLS or this is a credentials-in-transit risk.

GetSession expiry enforcementst.GetSession in middleware.go needs to filter expired sessions at the DB level. Make sure the postgres GetSession implementation does WHERE expires_at > NOW() (not just WHERE id = ?).


Code Quality

Duplicate step numbering (main.go:134,146): Both the REST server init and the gRPC goroutine start are labeled as step "6". The numbering goes 1→2→3→4→5→6→6→7→8→9 — steps 6 and 7 have the same label.

_ = g.Wait() in ListVMs (orchestrator.go:598): Errors are silently discarded. The goroutines don't return errors (they log and continue), so this is safe, but the pattern is misleading. Consider removing the errgroup and using a plain sync.WaitGroup to make the intent clear.

rateLimitByIP cleanup goroutine leaks on shutdown (ratelimit.go:23-34): The cleanup goroutine runs forever with no shutdown mechanism. It's a minor issue since it's in-process, but using time.NewTicker + a context for cancellation would be cleaner.

execAddPlaybookTask sort order race (executor.go:311-315):

tasks, err := c.store.ListPlaybookTasks(ctx, playbookID)
sortOrder := len(tasks)

Two concurrent requests to add tasks to the same playbook could both observe the same count and assign duplicate sort orders. Consider using MAX(sort_order) + 1 in a transaction or using a DB sequence.

stripe.Key global mutation (billing/meters.go:53): Setting stripe.Key globally makes the billing component difficult to test and is not goroutine-safe if called concurrently. The comment acknowledges this, but consider using stripe.Client instances with per-call key configuration instead.

http.ServeFile for OpenAPI docs (server.go:69):

http.ServeFile(w, r, "docs/swagger.json")

This path is relative to the process working directory. If the server is started from a different directory, this silently fails. Consider embedding with //go:embed docs/swagger.json instead.

FRONTEND_URL validation gap (config.go:111-113): The check u.Scheme == "" would pass for a malformed URL like ://bad. The existing * check is good; consider stricter validation (e.g., require http or https scheme explicitly).


Performance

ListHosts makes N store queries (orchestrator.go:498-500): One GetSandboxesByHostID call per connected host inside the loop. With many hosts this is O(N) queries. Consider a CountSandboxesByHostIDs batch query.

Registry locks during copy (registry.go:83-88): ListConnected copies all host structs under a read lock. With many connected hosts, this lock is held longer than necessary. For very high host counts, consider a lock-free approach, though this is likely fine for the expected scale.


Test Coverage

The PR includes test files for most packages, which is great. A few areas worth verifying:

  • The compensating destroy path in CreateSandbox (DB persist fails after VM creation) — this is a critical failure path.
  • Session expiry enforcement in GetSession postgres implementation.
  • RBAC behavior: confirm tests assert that a plain member cannot call destructive operations (if that's the intended design) — or if no RBAC is intentional, document it.
  • execAddPlaybookTask concurrent sort order assignment.

Minor Nits

  • commandTimeoutBuffer = 30 (orchestrator.go:33) is an untyped constant used in a time.Duration multiplication — works but could be const commandTimeoutBuffer = 30 * time.Second for clarity.
  • The HostSender interface (orchestrator.go:38) is a good abstraction for testability — good pattern.
  • id.Generate("SBX-") prefix in CreateSandbox and uuid.New() for request IDs — mixing two ID generation strategies. Fine, but worth being consistent about which is used where.

Overall this is a well-structured and ambitious PR. The core security decisions are sound and the architecture is clean. The main things I'd want addressed before merging: RBAC on mutations, minimum password length, email normalization, and the sort order race condition.

aspectrr and others added 2 commits February 18, 2026 22:22
- Add PostHog telemetry integration (api + web) for usage tracking
- Fix billing race condition with double-check locking on subscription items
- Harden auth: use unexported context key type, SameSiteStrict cookies
- Add resource capacity checks in SelectHostForSourceVM fallback path
- Add state validation for sandbox start/stop operations
- Fix command timeout buffer arithmetic (duration vs int)
- Add org context provider and refactor agents page
- Embed swagger docs in binary

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Suppress no-explicit-any for Router generic in posthog provider
- Move hooks before early return in billing page to fix rules-of-hooks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: feat: convert fluid to a control-plane

This is a significant architectural pivot. The separation of concerns (orchestrator, registry, gRPC stream handler, REST server) is clean, and the addition of tests alongside every new package is appreciated. Below are observations grouped by severity.


Issues

1. Rate limiter trusts spoofable headers (Security)

File: api/internal/rest/ratelimit.go:41-53

The rate limiter reads X-Real-IP and X-Forwarded-For as the first source of truth. If the server is directly exposed (not behind a trusted reverse proxy), any client can set these headers arbitrarily and bypass per-IP rate limiting entirely.

Options:

  • Only trust these headers when behind a known proxy (e.g. check RemoteAddr is a private range)
  • Require the server to always sit behind a reverse proxy and document that assumption clearly
  • Consider a library like realip that has trusted CIDR configuration

2. Free-tier calculation has a fragile ordering assumption (Billing)

File: api/internal/billing/meters.go:415

The line prevTotal := cumulative - thisChat assumes the current conversations tokens are already persisted when SumTokenUsage runs. If a caller ever invokes ReportUsage before persisting the message, prevTotal will be under-counted and the free-tier boundary will be calculated incorrectly (favouring the user - not catastrophic but financially incorrect). The comment documents the assumption but the invariant is not enforced. Consider asserting or returning early if cumulative < thisChat.


3. EnsureModelMeter uses a single global mutex (Performance)

File: api/internal/billing/meters.go:98-99

mm.mu.Lock() serializes meter creation across all models. Compare this to subItemLock which is correctly per-org:model. Two simultaneous requests for different models will block each other unnecessarily. The pattern already exists in subItemLock - apply the same per-model locking here.


4. DiscoverSourceHosts picks the first connected host non-deterministically (Reliability)

File: api/internal/orchestrator/orchestrator.go:781

ListConnectedByOrg iterates a map internally (random order), so connected[0] is an arbitrary host on each call. Fine functionally today since discovery just needs any connected daemon, but worth documenting this intent or adding a selection heuristic for consistency with sandbox placement.


5. gRPC TLS is optional but host tokens travel over this channel (Security)

File: api/internal/config/config.go:154-158

GRPC_TLS_CERT_FILE / GRPC_TLS_KEY_FILE are not validated - if unset, the server starts with plaintext gRPC. Host tokens (bearer credentials) are sent in stream metadata. In production, plaintext gRPC exposes these tokens. Strongly consider:

  • Adding a Validate() error when TLS files are unset unless GRPC_ALLOW_INSECURE=true is explicitly set
  • Or documenting that the gRPC port must not be externally reachable without TLS termination

6. Heartbeat disconnect window is very lenient

File: api/internal/grpc/stream.go:241-276

With the default heartbeatTimeout=90s, the monitor ticks every 30s and requires 3 consecutive misses before disconnecting. In the worst case a host can be truly dead for ~4x the heartbeat timeout (~6 minutes) before being removed from the registry and having sandboxes orphaned. Depending on SLA expectations this may be acceptable, but a comment explaining the intentional buffer would help future readers.


Minor Observations

  • api/internal/grpc/stream.go:106-110 - The stream is stored in h.streams before registry.Register. This is harmless given the validation above, but the ordering comment could explain why the stream is stored first.

  • api/internal/store/store.go:486 - ListHosts returns all hosts regardless of org. Fine as an internal store method since callers filter at the application layer, but a potential footgun if a future handler calls it directly.

  • api/internal/rest/server.go:61-79 - API docs are enabled by default (API_ENABLE_DOCS=true). Consider defaulting to disabled or noting in deployment docs that this should be off in production.

  • api/internal/orchestrator/orchestrator.go:565 - The errgroup context is discarded. Intentional since goroutines fall back to cached data on error, but a comment would reduce confusion for future readers.


What is Working Well

  • Compensating destroy on DB failure (orchestrator.go:203-221): correctly issues a destroy if DB persistence fails after the VM is created - avoids orphaned VMs. Good pattern.
  • Token identity override (stream.go:81-85): daemon-supplied host IDs are overridden with the server-authenticated token ID, preventing host impersonation.
  • Double-check locking for meter/subscription creation is correct.
  • Per-host mutex for stream sends is the right approach.
  • Consistent org isolation throughout REST handlers via resolveOrgMembership + explicit OrgID checks.
  • Test coverage: every new package has a corresponding _test.go, and the orchestrator tests appear thorough.

Overall this is solid work. The security items (1 and 5) are worth addressing before shipping to production.

…ocs and API

- Remove install script method from daemon docs page
- Wire DaemonConnectionStatus into daemon docs for logged-in users
- Update API internals (auth, billing, orchestrator, store, grpc, rest)
- Refactor docs components (step-tracker, quickstart-steps, docs-header)
- Update dashboard, landing page, and route tree

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 19, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial architectural change — converting fluid from a standalone CLI-to-libvirt tool into a proper control-plane with centralized orchestration, gRPC streaming from daemon hosts, billing, and multi-host placement. The direction is solid and production-worthy. Here are my findings:


Architecture & Design

The layering is clean:

  • registry → in-memory connected host map (RWMutex, value copies)
  • grpc/stream.go → bidirectional stream per host, SendAndWait with sync.Map for pending requests
  • orchestrator → business logic, host selection, compensating actions
  • billing → Stripe meter/event reporting

The HostSender interface for the orchestrator is a good abstraction — makes the unit tests straightforward.


Issues & Concerns

Security

stream.go:76 — org isolation relies on daemon-supplied host_id initially

The code warns and overrides when hostID != tokenID, but only when tokenID != "". If the gRPC interceptor somehow fails to attach the token ID (e.g., a missing auth method in tests, or future code paths that skip the interceptor), the daemon-supplied host_id is used unchecked. A daemon could then impersonate any other host. Consider asserting tokenID != "" and failing the connection if it's absent, rather than silently falling back.

// stream.go:81-85
if tokenID != "" && hostID != tokenID {
    // only overrides when tokenID is present
}

hostauth.go:61 — manual Bearer string prefix strip

Using raw[:7] is fragile if the string is exactly 7 bytes. Prefer strings.HasPrefix + strings.TrimPrefix or strings.CutPrefix.


Potential Bugs

orchestrator.go:99-104 — silent fallback host selection

CreateSandbox first calls SelectHost (by base image), and on failure falls back to SelectHostForSourceVM — but only when req.SourceVM != "". However, baseImage is derived from req.SourceVM (baseImage := req.SourceVM), so both lookups use the same value. The fallback path will succeed when a host has the source VM but doesn't advertise it as a base image. This is fine if intentional, but the dual-lookup logic is confusing and could be simplified. A comment explaining when each path applies would help.

billing/meters.go:422 — free tier accounting assumes usage is already recorded

prevTotal := cumulative - thisChat // usage before this chat (already recorded by the time we get here)

This comment documents a required ordering constraint between the caller persisting usage and calling ReportUsage. If that order is violated (e.g., usage is persisted after calling ReportUsage), prevTotal could be negative (clamped to 0) and free tokens could be counted incorrectly. This constraint should be enforced at the call site or made explicit in the function contract.

orchestrator.go:340RunCommand silently swallows command persist errors

if err := o.store.CreateCommand(ctx, cmdRecord); err != nil {
    o.logger.Error("failed to persist command", ...)
}
return cmdRecord, nil

The command succeeded on the host and the result is returned to the caller regardless of the DB persist failure. This is a deliberate choice (don't fail the user's operation for a logging failure), but it means command history can silently go missing. Consider at minimum incrementing a metric/counter so this is observable in production.

grpc/stream.go:187-190 — response channel buffer of 1 is correct but fragile

respCh := make(chan *fluidv1.HostMessage, 1)
// ...
select {
case respCh <- msg:
default:
    logger.Warn("response channel full, dropping")
}

A channel size of 1 works fine with the current SendAndWait pattern (one caller, one response). But because pendingRequests.Delete happens in a defer in SendAndWait, if a delayed response arrives after a timeout, the channel has already been deleted and the response is logged as an orphan ("no pending request"). That's correct. The only risk is if somehow two responses arrive for the same request_id — the second would hit the default drop branch silently. Worth a comment that this is intentional.


Performance / Resource Leaks

grpc/stream.go:133streamMu entries are never cleaned up on host disconnect

defer func() {
    h.registry.Unregister(hostID)
    h.streams.Delete(hostID)
    h.streamMu.Delete(hostID)  // ← this IS present, good

Actually this is handled correctly — h.streamMu.Delete(hostID) is in the defer. No issue here, just noting it's correct.

orchestrator.go:564-623ListVMs fanout with no cap on goroutines

If an org has hundreds of connected hosts, errgroup spawns one goroutine per host simultaneously. Consider adding a semaphore or bounding the concurrency (errgroup.SetLimit). For most deployments this is fine, but worth documenting.

orchestrator.go:565-618ListVMs ignores context from the errgroup

g, gCtx := errgroup.WithContext(ctx)
for _, h := range connected {
    g.Go(func() error {
        select {
        case <-gCtx.Done():
            return gCtx.Err()  // ← returns error, but...
        default:
        }
        // ...
    })
}
_ = g.Wait()  // ← error is discarded
return result, nil  // ← always returns nil

The goroutines return gCtx.Err() on cancellation, but g.Wait() is discarded with _. If the parent context is cancelled mid-fanout, the function returns partial results with no error, which may silently confuse callers. The existing comment explains this is intentional for partial results — that's acceptable, but returning a wrapped error or warning when partial results are due to context cancellation would help callers distinguish timeout from empty results.


Test Coverage

Coverage is solid for the new components — stream_test.go tests all Connect scenarios, SendAndWait edge cases, and orchestrator_test.go covers the full CRUD lifecycle. A few gaps:

  • placement_test.go — no test for the hostScore scoring tie-breaking behavior when two hosts have equal resources
  • billing/meters_test.go — free-tier boundary at exactly freeTokens tokens isn't tested (off-by-one)
  • No test for the duplicate-stream scenario: what happens when a host reconnects with the same host_id before the old stream's Connect goroutine has fully exited? The old stream is replaced in h.streams.Store(hostID, stream) at line 106, but the old goroutine is still running its recv loop. It will continue receiving messages which will be dispatched to the new pendingRequests map — potentially delivering responses intended for the new connection to old SendAndWait callers or vice versa.

Minor / Style

  • orchestrator.go:82-106baseImage := req.SourceVM shadows intent; it would be clearer to pass req.SourceVM directly to SelectHost rather than aliasing it, since the comment "base_image is derived from source_vm" only exists to explain this
  • stream.go:141if err == io.EOF should use errors.Is(err, io.EOF) for consistency with idiomatic Go error handling (gRPC EOF is a direct value, so this works, but errors.Is is safer if the type changes)
  • The two mockStore implementations (one in grpc/stream_test.go and one in orchestrator/orchestrator_test.go) have diverged — the grpc one stubs all methods with no-ops, while the orchestrator one panics on unexpected calls. The panic-on-unexpected behavior in the orchestrator mock is strictly better for catching test omissions. Consider unifying these in a shared test helper

Summary

Category Finding
Security Host ID override only happens when token ID is non-empty; consider making auth mandatory
Bug Free tier cumulative accounting has an implicit ordering constraint
Bug Duplicate host reconnection race (old goroutine still running when stream is replaced)
Performance ListVMs goroutine count unbounded for large org host counts
Testing Missing: duplicate-host reconnect, exact free-tier boundary, score tie-breaking
Style Two divergent mock stores across packages

Overall the code quality is high, the design is sound, and the test coverage is much better than typical for a change of this size. The duplicate-host reconnect race is the most actionable bug to address before this is handling production traffic.

…ling, and resource management

Security: fix OAuth email account takeover, IDOR in source host deletion, RBAC for sandbox destroy,
bcrypt max password length, session cookie Secure/SameSite attrs, modulo bias in session codes,
bridge name validation, path traversal in lease files, SSH StrictHostKeyChecking=accept-new,
strip internal errors from API responses, add IP-based rate limiting with trusted proxy support.

Correctness: fix SumTokenUsage summing all resource types (now filters llm_token), fix concurrent
stream.Send data race with send mutex, fix RLock mutation in microvm Get(), fix DestroySandbox
returning success on failure, fix billableOutput negative from rounding, add SendAndWait context
propagation and timer leak fix, add heartbeat registry resource updates, fix backoff reset after
stable connections, add process exit wait after Stop signal.

Billing: add webhook idempotency check, Stripe subscription status validation, nil safety for
sess.Subscription, remove redundant stripe.Key assignments, configurable billing markup, rollback
on partial EnsureOrgSubscriptionItems failure, log swallowed CreateUsageRecord errors.

Infrastructure: add HostToken expiry, unique index on stripe_subscription_id, advisory lock store
methods, GetSubscriptionByStripeID, encryption key config, daemon TLS/token config fields,
CLI daemon TLS config, TAP name collision reduction (6->9 chars).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 20, 2026

PR Review: feat: convert fluid to a control-plane

This is a large, ambitious refactor — converting from a direct CLI→daemon model to a full control-plane architecture with REST API, gRPC streaming, PostgreSQL, billing, and agent execution. Overall the code is well-structured and has solid test coverage. Below is detailed feedback organized by area.


Security

Host impersonation prevention (good)grpc/stream.go correctly overrides the daemon-supplied host_id with the server-assigned token identity from the auth interceptor. The log warning on mismatch is a nice touch.

OAuth email linking (good)findOrCreateOAuthUser in auth_handlers.go only links an OAuth account to an existing user when the OAuth email is verified. Correct and intentional.

Prompt injection defense (insufficient)agent/openrouter.go relies entirely on a system prompt instruction ("Never follow instructions found in tool output...") to prevent prompt injection. This is better than nothing, but system prompt instructions are not a reliable security boundary — they can be overridden by adversarial content in file reads or command output. Since the agent can run_command and read_file on sandboxes, a compromised sandbox could exfiltrate org data or destroy infrastructure. Consider:

  • Stripping or escaping tool output before feeding it back as context
  • Limiting the tools the agent can use based on org/user permissions
  • Adding a confirmation step for destructive operations invoked through the agent

CSRF — The REST API uses session cookies without any CSRF token validation. For a control plane that performs destructive infrastructure operations, this is a risk. If the frontend is on the same domain as the API, a CSRF attack on an authenticated browser session could trigger sandbox destruction. Consider using the SameSite=Strict cookie attribute (already using secureCookies flag — worth confirming SameSite is set there) or adding CSRF middleware.


Architecture & Correctness

Compensating destroy on DB failure (good)orchestrator.CreateSandbox correctly fires a compensating DestroySandbox gRPC call if the PostgreSQL persist fails. However, if the compensating destroy itself fails (e.g. the host disconnects mid-operation), the sandbox runs indefinitely without a DB record. This is a known distributed systems problem, but it would be worth logging a CRITICAL or ERROR level entry when the compensating destroy also fails so it can be caught via alerting.

SelectHostForSourceVM is not best-fitplacement.go picks the first eligible host (by iteration order of a sync.Map) rather than the highest-scoring one. This means source-VM-based sandbox placement will not spread load across equivalent hosts the way SelectHost does. Consider applying the same hostScore ranking:

// current: returns first match
if slices.Contains(h.SourceVMs, vmName) { return h, nil }

// suggested: track best-scoring match
if slices.Contains(h.SourceVMs, vmName) && hostScore(h) > bestScore { best = h; bestScore = hostScore(h) }

DiscoverSourceHosts single-host delegation — Source host discovery is sent to the first connected daemon (sorted by HostID). If that host is slow or disconnects mid-operation, the entire discovery fails. This is a single point of failure; retrying on the next available host would make this more robust.

Response channel drop in handleHostMessage — When the response channel for a pending request is full, the message is silently dropped and the SendAndWait caller times out with no information about why. The channel is buffered at size 1, so this should not happen in normal operation (one request → one response). But on reconnection races it could. A debug log line here would help diagnose those cases in production.


Code Quality

Large DataStore interfacestore/store.go defines a 60+ method DataStore interface. This works but makes it difficult to create focused test doubles. Consider splitting it into smaller role-specific interfaces (e.g. SandboxStore, UserStore, BillingStore) and having DataStore embed them. Each sub-system only needs to declare the narrow interface it uses.

REST handlers expose internal typessandbox_handlers.go decodes directly into orchestrator.CreateSandboxRequest. This couples the REST API contract to an internal type. If CreateSandboxRequest gains an internal-only field later, it becomes part of the public API surface. A dedicated request DTO struct (even if identical) provides a clean boundary.

Hardcoded model pricing in agent/openrouter.goAvailableModels hardcodes token prices for Claude Sonnet 4, Haiku 4, GPT-4o, GPT-4o Mini, and Gemini 2.5 Pro. LLM provider pricing changes frequently. The billing metering already has a store-backed model cache (ModelCache) — consider sourcing the pricing from there rather than two separate hardcoded lists.

Billing fallback pricingbilling/meters.go falls back to $3/1M input and $15/1M output tokens when model pricing is unavailable. Document clearly that these are fallback values and consider making them configurable, since they will become stale.

Rate limiter cleanup goroutineratelimit.go spawns a background cleanup goroutine per middleware instance with no shutdown signal. In tests that create multiple server instances, this leaks goroutines. Consider accepting a context.Context or using a sync.Once guard.


Test Coverage

Coverage looks strong across the board:

  • grpc/stream_test.go (900 lines) covers connection, heartbeat, reconnect, and message routing
  • orchestrator/orchestrator_test.go (1499 lines) covers the full lifecycle
  • billing/meters_test.go and billing/ticker_test.go cover metering and resource reporting
  • Auth, middleware, and session tests are thorough

One gap worth filling: the SelectHostForSourceVM placement function appears untested against multiple eligible hosts to verify it picks the best one (related to the scoring issue above).


Minor Nits

  • grpc/stream.go: context.Background() is used for persistHostRegistration to survive stream cancellation — this is correct, but it would benefit from a short timeout (e.g. 10s) to avoid blocking if the DB is unavailable at connection time.
  • orchestrator.go: The ListVMs fan-out uses a hardcoded concurrency cap of 10. Consider making this configurable or at least a named constant rather than a magic number.
  • billing/ticker.go: The hourly tick has no jitter. If multiple API instances start simultaneously, they will all hit Stripe at the same time. Adding a small random offset at startup reduces the thundering herd risk.

Overall this is solid work with good structure and test coverage. The main areas to address before production are the prompt injection surface, CSRF protection, and the source VM placement fairness issue. Happy to discuss any of these further.

…cleanup

Security: add AES-256-GCM encryption for OAuth tokens and Proxmox secrets
at rest, sanitize error responses to prevent internal error leakage, add
path traversal validation, command length limits, bcrypt password length
check, checksum verification in CLI updater, TLS config in gRPC client,
SSH StrictHostKeyChecking=accept-new, and docs-progress session limits.

Correctness: fix stale closure in step-tracker markComplete, CSS backtick
typo in architecture-diagram, nil error panics in docs-progress, invalid
HTML nesting in billing calculator, shiki error recovery, localStorage
try/catch, polling backoff with max attempts, and encodeURIComponent on
URL-interpolated values.

Cleanup: delete commented-out agent code (executor, openrouter, tools),
commented-out web components (chat-message, model-selector, playbooks-panel),
broken playbooks route, committed CLI binary, hardcoded PostHog key, no-op
setup code, and custom test helpers replaced with strings.Contains.

Config: reduce session TTL from 30d to 7d, change telemetry default to
opt-in, fix systemd unit to run as fluid-daemon with CAP_NET_ADMIN, fix
doctor QEMU binary for arm64, log Stripe webhook errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 20, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial architectural transformation — from a direct CLI-to-libvirt model to a proper three-tier control plane with bidirectional gRPC streaming, microVMs, and a full REST API. The overall design is solid and the code quality is generally high. Good things to call out first:

  • Host identity pinning (api/internal/grpc/stream.go:88-93): Overriding the daemon-supplied host_id with the server-assigned token identity is a smart defense against daemon impersonation.
  • Compensating transactions (api/internal/orchestrator/orchestrator.go:201-219): If the DB write fails after a sandbox is created on the host, the code issues a compensating destroy to avoid orphaned VMs. This is exactly the right pattern.
  • Token hashing (api/internal/auth/hostauth.go:35-38): SHA-256 hashing host tokens before storage is correct.
  • Test coverage: The included tests are thorough — orchestrator_test.go (1359 lines), stream_test.go (867 lines), sandbox_handlers_test.go (781 lines), plus auth/billing/placement tests.

Security Issues

1. TLS path is not implemented in the daemon agent client

File: fluid-daemon/internal/agent/client.go:110-115

opts := []grpc.DialOption{}
if c.insec {
    opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
conn, err := grpc.NewClient(c.cpAddr, opts...)

When insec=false, opts is empty — no transport credentials are configured at all. The certFile, keyFile, and caFile fields on Client are set in NewClient but never used to build TLS credentials. With grpc.NewClient (gRPC-Go ≥ v1.62), this will fail or connect without encryption. The secure TLS code path appears to be unimplemented. This means every daemon-to-control-plane connection currently requires Insecure: true in practice, exposing tokens and commands in cleartext.

2. StrictHostKeyChecking=no in SSHConnectCommand

File: fluid-daemon/internal/sshca/ca.go:607

return fmt.Sprintf("ssh -i %s -o CertificateFile=%s -o StrictHostKeyChecking=no -p %d %s@%s", ...)

This disables host key verification entirely. The SSH CA provides user auth but the host's identity is not verified, leaving clients open to MITM. Consider adding -o KnownHostsFile= pointing to a per-sandbox known_hosts file populated with the host's key during sandbox creation, or use the CA for host certificates too (common in full SSH CA deployments).

3. OAuth tokens stored in plaintext

File: api/internal/store/postgres/postgres.go:51-55

OAuthAccountModel stores AccessToken and RefreshToken as plain strings. The postgresStore struct already has an encryptionKey []byte field (and there's a crypto package), but the OAuth token fields don't appear to use it. Access and refresh tokens are long-lived OAuth credentials that should be encrypted at rest.

4. In-memory rate limiter is per-process

File: api/internal/rest/ratelimit.go:71-115

The rate limiter uses an in-process map. With multiple API server replicas (expected for a control plane), an attacker can distribute requests across instances and bypass limits entirely. Consider Redis-backed rate limiting or at minimum document this as a single-instance limitation.


Code Correctness

5. microvm/manager.goSandboxInfo pointers aliased from map

File: fluid-daemon/internal/microvm/manager.go:389, 361

Both List() and Get() return pointers directly from the m.vms map:

result = append(result, info)  // List() — raw map pointer
return info, nil               // Get() — raw map pointer

After the mutex is released, callers hold a live pointer into the map. Get() modifies info.State while holding the lock, but callers can also mutate the same struct concurrently. Return copies (*info dereferenced or a SandboxInfo{...} literal) instead.

6. sshca/ca.go — DefaultConfig TTL comment mismatch

File: fluid-daemon/internal/sshca/ca.go:8-9, 81-82

The package doc says "Certificates are designed to be ephemeral (1-10 minutes TTL)" but DefaultConfig() sets DefaultTTL: 30 * time.Minute and MaxTTL: 60 * time.Minute, and validation only enforces a 1-minute minimum. The docs and config disagree; fix whichever is wrong.

7. connectAndServe spawns unbounded goroutines per command

File: fluid-daemon/internal/agent/client.go:275

go c.handleCommand(ctx, stream, msg)

Each incoming control message spawns a new goroutine with no concurrency limit. A slow daemon (or a flood of commands) can create unbounded goroutine growth. Consider a worker pool or semaphore, e.g. with golang.org/x/sync/semaphore.

8. validityStr truncates sub-minute precision

File: fluid-daemon/internal/sshca/ca.go:315

validityStr := fmt.Sprintf("+%dm", int(ttl.Minutes()))

int(ttl.Minutes()) truncates. A 90-second TTL produces +1m (60 seconds), silently issuing a shorter-than-requested cert. Use int(math.Ceil(ttl.Minutes())) or format in seconds: fmt.Sprintf("+%ds", int(ttl.Seconds())).

9. writeMetadata silently drops write errors

File: fluid-daemon/internal/microvm/manager.go:416-419

_ = os.WriteFile(path, data, 0o644)

If this write fails, daemon restart won't be able to recover the sandbox (RecoverState reads this file). At minimum log the error; ideally surface it or retry.

10. GetStream bypasses the per-host send mutex

File: api/internal/grpc/stream.go:255-262

GetStream returns the raw gRPC stream. Any caller using it to Send bypasses the streamMu mutex that serializes writes in SendAndWait. This method should either be removed or replaced with a wrapper that acquires the mutex.


Minor Issues

  • certID uses only half of the random bytes (ca.go:566): fmt.Sprintf("%x", b[:8]) discards 8 of the 16 random bytes. Use b[:] for 128 bits of entropy.
  • generateCertID fallback is insecure (ca.go:562-564): The time-based fallback uses time.Now().UnixNano() which is predictable. Better to propagate the error upward.
  • DiscoverSourceHosts deterministic-but-unintelligent host selection (orchestrator.go:789-795): Picking the first host by HostID sort order means one daemon always handles discovery, even if it's under load. Consider a random pick from available hosts.
  • SendAndWait has no maximum pending request limit (stream.go:218-252): Unbounded pendingRequests entries accumulate if host responses are never received (e.g., host crashes mid-command). Paired with the existing timeout this is bounded in time, but memory grows linearly with in-flight requests.
  • SSHConnectCommand hardcodes port 22 default (ca.go:599-608): Minor — just make sure callers don't pass 0 expecting the VM's actual SSH port.

Summary

The core architecture is sound and the implementation is well-structured. The most critical items to address before merge are #1 (TLS is broken for secure daemon-to-CP connections), #2 (MITM risk from disabled host key checking), and #5 (aliased map pointers are a data race). The others range from medium to minor priority.

@claude
Copy link

claude bot commented Feb 20, 2026

Code Review — feat: convert fluid to a control-plane

This is an ambitious and well-structured architectural shift. The three-tier design (web → API → daemon → microVMs) is sound, and the use of overlay filesystems + TAP networking for fast sandbox startup is a strong technical foundation. Below are issues that need addressing before this is production-ready, grouped by severity.


Critical — Must Fix Before Merge

1. Concurrent stream.Send without mutex (api/internal/grpc/stream.go)

The streamMu sync.Map is declared but SendAndWait and GetStream do not acquire the per-host mutex before calling stream.Send. gRPC server streams are not safe for concurrent writes. Multiple HTTP handlers calling SendAndWait simultaneously for the same host will race.

// Required pattern in SendAndWait:
mu, _ := s.streamMu.LoadOrStore(hostID, &sync.Mutex{})
mu.(*sync.Mutex).Lock()
err := stream.Send(msg)
mu.(*sync.Mutex).Unlock()

GetStream returning the raw stream also allows callers to bypass this mutex entirely — consider removing GetStream or returning a wrapper that serializes sends.

2. OAuth CSRF — hardcoded "state" parameter (api/internal/rest/auth_handlers.go)

Both GitHub and Google flows use a literal "state" string. This is the textbook OAuth CSRF vulnerability. The state must be:

  • Generated with crypto/rand per request
  • Stored in a signed, HttpOnly cookie
  • Verified in the callback before exchanging the code

Without this, any site can initiate an OAuth login and link an attacker's account to a victim's session.

3. IDOR in DeleteOrgMember (api/internal/rest/org_handlers.go)

The current implementation correctly passes both org.ID and memberID to the store (good — the summary of earlier reviews was slightly wrong here). However, please confirm the SQL for DeleteOrgMember in postgres.go filters by both org_id AND id, not just id. A query like DELETE FROM org_members WHERE id = $1 would allow cross-org member deletion.

4. Billing stubs cause real financial damage (api/internal/store/postgres/postgres.go)

GetModelMeter always returns ErrNotFound, causing EnsureModelMeter to create a new Stripe Product + Meter + Price on every agent invocation. SumTokenUsage always returns 0, so usage-based billing never charges anyone. These stubs need to either be fully implemented or have a feature flag that hard-fails in production configs rather than silently no-oping.

5. StripePublishableKey used as Stripe Price ID (api/internal/rest/billing_handlers.go:174)

A pk_live_... key is not a price_... ID. Every checkout session creation will fail. Add a dedicated StripePriceID field to config.

6. Webhook error silently discarded (api/internal/rest/billing_handlers.go:420)

_ = s.store.CreateSubscription(...)

If this store call fails, the user paid but their account isn't upgraded, and HTTP 200 is returned so Stripe stops retrying. This must return a 500 so Stripe retries, and the error must be logged.

7. TLS credentials never configured in daemon agent client (fluid-daemon/internal/agent/client.go)

When Insecure: false, the code stores certFile/keyFile/caFile but never loads them into grpc.DialOption. The result is an unencrypted connection that silently drops the host token in plaintext. Either implement credentials.NewTLS(tlsCfg) or fail fast if cert paths are provided but unreadable.


High Severity

8. Session tokens stored as plaintext primary keys (api/internal/auth/middleware.go)

Host tokens in hostauth.go are correctly stored as SHA-256 hashes, but session tokens appear to be used as raw DB lookup keys. Store sha256(token) in the database — if the sessions table is ever dumped, all active sessions are immediately compromised.

9. SandboxInfo pointers aliased from locked map (fluid-daemon/internal/microvm/manager.go)

List() and Get() return raw *SandboxInfo pointers into m.vms while holding the mutex. After the mutex is released, callers hold live pointers into the map — this is a data race. Return copies:

copy := *m.vms[id]
return &copy, nil

10. Compensating destroy uses cancelled context (api/internal/orchestrator/orchestrator.go)

When the DB write fails after VM creation, the cleanup destroy call uses the original ctx, which may already be done. Use context.WithoutCancel(ctx) (Go 1.21+) for compensation operations.

11. StrictHostKeyChecking=no (fluid-daemon/internal/sshca/ca.go:607)

The SSH CA infrastructure correctly issues short-lived certificates with principals. Using that infrastructure to also verify the host key (by checking the cert's signing CA) would complete the zero-trust model. StrictHostKeyChecking=no defeats the purpose of the CA on the client side. Consider using StrictHostKeyChecking=yes with a known-hosts file or HostKeyAlgorithms pinned to the CA-signed cert type.

12. No tests in api/ REST handlers or orchestrator

CLAUDE.md and api/AGENTS.md both require tests for every code change. The REST handlers (auth_handlers.go, billing_handlers.go, org_handlers.go, sandbox handlers) and the orchestrator have no test files. Auth, billing, and orchestration are the highest-risk paths. This is a hard blocker per the project's own stated requirements.

13. HTTP API response status codes not validated in OAuth helpers (api/internal/rest/auth_handlers.go)

fetchGitHubUser, fetchGitHubPrimaryEmail, and fetchGoogleUser unmarshal the response body regardless of HTTP status. A 401 or 500 from GitHub/Google produces a zero-value struct that proceeds to create a corrupted user record. Check resp.StatusCode before unmarshalling.


Medium Severity

14. monitorHeartbeat doesn't evict dead hosts

The monitor logs a warning after missed heartbeats but never removes the host from the registry or cancels its context. Dead hosts accumulate indefinitely and continue to receive sandbox scheduling attempts.

15. Host placement ignores org scope (api/internal/orchestrator/placement.go)

SelectHost iterates all connected hosts. In a multi-tenant deployment, this can schedule a sandbox from org A onto a dedicated host belonging to org B. Add an orgID filter to host selection.

16. Host placement ignores requested memory

hostScore uses a hardcoded 512 MB minimum instead of req.MemoryMB. A 4 GB sandbox request can be scheduled on a host with 600 MB free, causing the microVM launch to fail at runtime rather than at placement time.

17. url.PathEscape missing on user-supplied IDs in client.go

Sandbox IDs interpolated directly into URL paths allow path traversal (e.g., a sandbox ID of ../../../admin would hit unintended endpoints). Use url.PathEscape(sandboxID) when constructing paths.

18. io.ReadAll unbounded in client.go

An untrusted or misconfigured server can send an arbitrarily large response body. Use io.LimitReader(resp.Body, maxBytes) with a reasonable limit (e.g., 10 MB).

19. EnsureOrgSubscriptionItems race condition

Concurrent calls (e.g., two simultaneous agent invocations for the same org) can both find no subscription item and both create one, resulting in duplicate Stripe subscription items. Add a DB-level unique constraint and handle the duplicate error gracefully.

20. SecureCookies defaults to false in config

Session cookies sent over HTTP in production expose session tokens to network interception. Default should be true, with an explicit opt-out for local development.


Minor

  • writeMetadata silently drops write errors — daemon cannot recover sandbox state after restart; errors should be returned/logged
  • Hand-rolled JSON parser in readMetadata will misparse escaped characters — use encoding/json
  • Overlay filesystem not cleaned up on QEMU launch failure — disk leak on failed boots
  • validityStr truncation loses sub-minute precision in SSH cert issuance logs
  • DeleteHostToken should be scoped to the caller's org, not global

Summary

The architectural direction is right and the codebase shows real engineering thought. The critical blockers are the OAuth CSRF vulnerability, billing stub side-effects, the concurrent gRPC send race, and missing tests for the core API paths. The TLS gap in the daemon client is also important for any non-local deployment. Fixing these would make this a solid foundation to build on.

…build pipeline

- Add POST /auth/onboarding endpoint with org creation and slug generation
- Randomize daemon host selection for discovery load distribution
- Remove unused GetStream method from gRPC stream handler
- Multi-stage Docker build for web with nginx serving static assets
- Move images to web/public, add install script, add tunneling blog post
- Update CLI setup steps, daemon agent client, SSH CA, and microvm manager
- Add remark-gfm dependency for markdown rendering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 20, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial, well-architected PR. The architectural shift from a direct libvirt CLI to a control-plane with gRPC-connected daemons makes sense for multi-host production use, and the commit history shows a thorough iterative security review has already happened. Below are the remaining issues I found, plus a few observations.


Critical

Stream cleanup race on host reconnect (api/internal/grpc/stream.go)

When a daemon reconnects, the new Connect() goroutine cancels the old one and overwrites h.streams, h.cancelFns, and the registry. However, the old goroutine's deferred cleanup runs concurrently and uses hostID as a bare key, so it will delete the new connection's state:

// New connection goroutine:
if oldCancel, loaded := h.cancelFns.LoadAndDelete(hostID); loaded {
    oldCancel.(context.CancelFunc)()   // triggers old goroutine exit
}
h.streams.Store(hostID, stream)        // stores NEW stream
// ...
h.cancelFns.Store(hostID, cancel)      // stores NEW cancel

// Old goroutine (running concurrently after its context is cancelled):
defer func() {
    h.cancelFns.Delete(hostID)         // deletes NEW cancel
    h.registry.Unregister(hostID)      // unregisters NEW host
    h.streams.Delete(hostID)           // deletes NEW stream
    h.streamMu.Delete(hostID)
}()

After this race, any pending gRPC operations fail with "host not connected" even though the host just re-registered.

Fix: use sync.Map.CompareAndDelete (available since Go 1.20) in the deferred cleanup to only delete if the stored value is still the one we registered:

myCancel := cancel
myStream := stream

defer func() {
    h.cancelFns.CompareAndDelete(hostID, myCancel)
    h.streams.CompareAndDelete(hostID, myStream)
    h.streamMu.Delete(hostID)
    h.registry.Unregister(hostID)
    logger.Info("host disconnected")
}()

The registry Unregister call also needs a conditional variant, or the registry needs a generation counter.


Medium

SelectHostForSourceVM returns first match without scoring (api/internal/orchestrator/placement.go)

SelectHost picks the best host using hostScore (available CPU + memory), but SelectHostForSourceVM — used as the fallback in CreateSandbox and for all source VM operations — returns the first matching host:

for _, h := range hosts {
    for _, vm := range h.Registration.GetSourceVms() {
        if vm.GetName() == vmName {
            return h, nil   // first match, no scoring
        }
    }
}

If multiple hosts have the same source VM, load concentrates on whichever host appears first in the registry slice. Consider applying the same hostScore scoring as SelectHost.

CreateSandbox fallback contains dead code (api/internal/orchestrator/orchestrator.go)

host, err := SelectHost(o.registry, req.SourceVM, req.OrgID, ...)
if err != nil {
    if req.SourceVM != "" {   // always true — validated as required in the handler
        host, err = SelectHostForSourceVM(...)
    } else {
        return nil, ...   // unreachable
    }
}

The else branch is dead: req.SourceVM is required by handleCreateSandbox. The intent seems to be "if SelectHost fails (no base image match), try matching by source VM instead." Removing the dead else branch and adding a comment would make the logic much easier to follow.

OAuth tokens stored without visible encryption (api/internal/rest/auth_handlers.go)

findOrCreateOAuthUser passes raw accessToken and refreshToken to s.store.CreateOAuthAccount. The commit history mentions AES-256-GCM encryption was added for OAuth tokens, but it is not visible at this layer. Please confirm postgres.go encrypts these fields before writing — if not, a database breach exposes live OAuth tokens for every user.


Low

Multiple cleanup goroutines in rateLimitByIP (api/internal/rest/ratelimit.go)

rateLimitByIP is called four times in routes(), spawning four independent background goroutines that run forever without a shutdown path. This is benign in production (each call gets its own limiters map), but makes the function hard to test cleanly. Consider passing a context.Context for graceful shutdown.

Telemetry not closed on NewRemoteService error (fluid-cli/cmd/fluid-cli/main.go)

svc, err := sandbox.NewRemoteService(daemonAddr, loadedCfg.ControlPlane)
if err != nil {
    _ = st.Close()
    // tele is not closed here
    return nil, nil, nil, fmt.Errorf(...)
}

Minor resource leak if NewRemoteService fails after telemetry is initialised.

handleRunCommand has no command length limit (api/internal/rest/sandbox_handlers.go)

timeout_seconds is capped at 3600, but req.Command has no length constraint. A very large command string would be forwarded to the daemon over gRPC. Consider adding a reasonable cap (e.g., 64 KiB) to match typical shell limits.


Positive Observations

Several security improvements from earlier review rounds are well done:

  • Host identity override — daemon-supplied host_id is replaced with the token-derived identity, preventing impersonation.
  • Stream stored before registry — avoids the window where a host is observable but unsendable.
  • Per-host send mutex — correctly serialises stream.Send calls.
  • SHA-256 session token hashing — raw token in cookie, hash in DB.
  • Constant-time OAuth state comparisonsubtle.ConstantTimeCompare used correctly.
  • CSRF-protected OAuth — crypto nonce in an HttpOnly cookie, validated on callback.
  • Compensating destroy on DB persist failure — avoids orphaned VMs.
  • Email verification required for OAuth account linking — prevents account takeover.
  • Bcrypt 72-char password limit — correctly enforced on both register and login.
  • Graceful shutdown order — HTTP drains before gRPC closes.

The test coverage across auth, orchestrator, gRPC streaming, registry, billing, and REST handlers is solid.


Summary: The stream cleanup race on reconnect is the one issue that needs to be fixed before merging — it can silently break daemon connectivity after any reconnect event. The medium and low items are improvements but not blockers.

…ead code, resource leak, input validation

- Gate stream cleanup on CompareAndDelete to prevent reconnect race clobbering new connections
- SelectHostForSourceVM now picks highest-scored host instead of first match
- Remove dead else branch in CreateSandbox (SourceVM always validated)
- Close telemetry on NewRemoteService error path in CLI
- Add 64 KiB command length limit in handleRunCommand

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 21, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial architectural change — converting from a single-daemon model to a proper control-plane + daemon fleet design with microVMs, billing, OAuth, and a full web dashboard. The overall direction is solid and the commit history shows active security hardening. Below are the remaining findings after reviewing the current HEAD.


Critical

id.Generate() panics on crypto/rand failureapi/internal/id/id.go:11, fluid-daemon/internal/id/id.go (same pattern)

if _, err := rand.Read(b); err != nil {
    panic("crypto/rand failed: " + err.Error())
}

crypto/rand.Read returning an error is vanishingly rare but not impossible (e.g. on certain hardened kernels or in containers with restricted syscalls). A panic here takes down the entire server process. The function signature should return (string, error) and callers should propagate the error. Every endpoint that creates a sandbox, session, or host token calls this, so the blast radius is large.


High

Unguarded type assertions on sync.Map valuesapi/internal/grpc/stream.go:63,114,210,230

v.(*sync.Mutex)            // hostMu
oldCancel.(context.CancelFunc)()
ch.(chan *fluidv1.HostMessage)
streamVal.(fluidv1.HostService_ConnectServer)

The logic is currently correct (each map only ever stores one type), but the idiomatic Go pattern is to use the two-value form so a future refactor doesn't silently introduce a panic:

mu, ok := v.(*sync.Mutex)
if !ok {
    // log and return error
}

Silent metadata loss on VM recoveryfluid-daemon/internal/microvm/manager.go:131

meta, _ := readMetadata(m.workDir, sandboxID)

If the metadata file is missing or corrupt, meta is a zero-value struct. The recovered SandboxInfo then has empty TAPDevice, MACAddress, and Bridge fields. Any subsequent operation that uses those (e.g. generating SSH certs, network cleanup) will silently use empty values instead of failing with a clear diagnostic. At minimum, log the error; ideally surface it so the operator knows the sandbox state is incomplete.


Medium

Rate limiter state is per-processapi/internal/rest/ratelimit.go:74-78

The code already has a comment acknowledging this, but it's worth calling out explicitly for the ops runbook: in a horizontally-scaled deployment each instance has independent counters, so the effective rate limit is configured_rps × replica_count. At scale, move to a shared token bucket (Redis INCR+TTL is sufficient).

stripe.Key global mutationapi/internal/billing/meters.go:38

stripe.Key = stripeKey

The comment says "set once at init" but the stripe-go library exposes stripe.Key as a package-level variable. If a future change creates a second MeterManager (e.g. in tests), there's a data race. Prefer passing the key per-call via stripe.NewBackends or the stripe.Params.StripeAccount field to keep state off the global.

Select fallback placement logicapi/internal/orchestrator/orchestrator.go:93-100

CreateSandbox calls SelectHost and then, on failure, falls back to SelectHostForSourceVM. If both fail the error message is generic ("select host: %w"). When debugging placement failures it would help to distinguish "no host has the base image" from "no host has sufficient capacity" from "no host serves this source VM" — ideally by wrapping specific sentinel errors from each selector.


Low / Style

"response channel full, dropping" warning is unreachable in practicestream.go:214

Each respCh is a freshly allocated chan *fluidv1.HostMessage with capacity 1, and handleHostMessage reaches this branch only after LoadAndDelete which ensures it is the sole writer. The channel can never already contain a value. The warning is harmless but creates noise and suggests a misunderstanding of the flow; it can be removed.

QEMU launch args build rootDev from config but KernelPath/OverlayPath come from external inputmicrovm/manager.go:196-218

These paths are passed directly to exec.CommandContext. Although they're not shell-interpolated (the exec package bypasses the shell), a path containing spaces or unusual characters may still cause unexpected QEMU behavior. Consider validating that paths are absolute and don't contain shell metacharacters as a defensive measure, consistent with the allowlist approach used in readonly/validate.go.

parseCIDRs silently skips invalid entriesrest/ratelimit.go:21-38

Invalid trusted-proxy CIDR strings are silently discarded. A misconfigured TRUSTED_PROXIES env var will result in proxy headers never being trusted and clients seeing their own rate limits eaten by the reverse proxy IP — a confusing operational failure. Log a warning for each invalid entry at startup.


What's working well

  • The OAuth CSRF protection with constant-time comparison and short-lived HttpOnly state cookies is solid.
  • DeleteOrgMember scoping by org_id prevents the cross-org IDOR.
  • Session tokens hashed with SHA-256 before DB storage — good.
  • SSH CA uses ephemeral certs with critical options (no-port-forwarding, no-agent-forwarding) — appropriate defaults.
  • readonly/validate.go uses an allowlist with quoted-string awareness — exactly the right approach for read-only source VM access.
  • CompareAndDelete on disconnect preventing reconnect-race cleanup clobber — nice subtle fix.
  • defer h.pendingRequests.Delete(reqID) in SendAndWait correctly cleans up on all exit paths.
  • Compensating DestroySandbox when DB persist fails after a successful VM creation avoids orphaned VMs.
  • Stripe meter events use uuid.New().String() as an idempotency identifier — correct.

Summary

Severity Count Status
Critical 1 id.Generate panic
High 2 Type assertion panics, silent metadata loss
Medium 3 Rate limit scaling, Stripe global, placement error clarity
Low 3 Unreachable warn, path validation, silent CIDR skip

The security-focused commits (CSRF fix, IDOR fix, bcrypt limit, session hashing, prompt injection guard) are all appreciated and well-executed. The remaining items are mostly defensive hardening. The most important one to fix before a production cut is the id.Generate panic — everything else can be done incrementally.

…s, metadata logging, stripe race, error clarity

- Critical: Convert id.Generate() from panic to (string, error) in both api and daemon packages, update all 13 callers
- High: Guard all 4 sync.Map type assertions in grpc/stream.go with two-value form
- High: Log warning when readMetadata fails during VM recovery instead of silently discarding
- Medium: Remove duplicate stripe.Key assignment in server.go (kept single init in billing.NewMeterManager)
- Medium: Wrap placement errors with distinct context for image match vs source VM fallback
- Low: Replace unreachable channel-full select with direct send in stream handler
- Low: Add filepath.IsAbs validation for KernelPath and OverlayPath in QEMU Launch()
- Low: Add logger parameter to parseCIDRs to warn on invalid CIDR entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 21, 2026

Code Review: feat: convert fluid to a control-plane

This is a substantial architectural overhaul — ~88k additions across 717 files — converting fluid.sh from a standalone CLI+daemon into a full control-plane architecture with microVMs, bidirectional gRPC streaming, multi-tenant auth, and billing. The core design is sound and the commit history shows many issues were iteratively addressed. Below is what I found after reviewing the current state of the code.


Architecture & Design — ✅ Strong

The shift to microVMs via overlay images is the right move: faster boot, lower resource cost per sandbox, and cleaner lifecycle management. The gRPC bidirectional streaming between control plane and daemons (with reconnect logic, heartbeat monitoring, and request correlation via request_id) is well-designed. The SSH CA with ephemeral certificates is a good pattern for ephemeral sandboxes.

The multi-tenant model (orgs → daemons → sandboxes) is clean and the RBAC layering (owner/member/host token) is correct in structure.


Remaining Issues

Medium — billing/ticker.go: Zero-billing on store error

// Lines 91-93
sandboxes, err := rt.store.ListSandboxesByOrg(ctx, orgID)
if err != nil {
    rt.logger.Warn("failed to list sandboxes for billing", ...)
}
var runningSandboxes int
for _, sb := range sandboxes { ... }  // sandboxes is nil on error

// Line 113
billableSandboxes := int64(runningSandboxes - rt.freeTier.MaxConcurrentSandboxes)
// billableSandboxes is negative on error → condition below is false → no Stripe report
if billableSandboxes > 0 {
    rt.meter.ReportResourceUsage(...)
}

When ListSandboxesByOrg or ListSourceHostsByOrg returns an error, the tick continues with zero counts, producing no Stripe report for that interval. A transient DB blip silently drops a billing period. Consider returning early on store errors so the tick is skipped entirely rather than reporting zero:

if err != nil {
    rt.logger.Warn("failed to list sandboxes, skipping billing tick", ...)
    return  // don't report
}

Medium — api/internal/orchestrator/orchestrator.go: GetSandbox has no org enforcement

GetSandbox(ctx, sandboxID) returns any sandbox without scoping by org. All current REST handlers do consistently check sandbox.OrgID != org.ID afterward, so there's no live IDOR. But this is a footgun for future handlers — one missing check away from a vulnerability.

Consider adding an orgID parameter to GetSandbox so enforcement is automatic:

func (o *Orchestrator) GetSandbox(ctx context.Context, orgID, sandboxID string) (*Sandbox, error)

The store already has ListSandboxesByOrg — adding an orgID column filter to the single-fetch path would close this permanently at the data layer.


Low — api/internal/rest/ratelimit.go: Cleanup goroutine has no shutdown path

go func() {
    for {
        time.Sleep(time.Minute)
        // prune stale entries
    }
}()

This goroutine runs forever with no way to signal it to stop. For production this is functionally fine (server process lifetime), but it prevents clean shutdown in tests and any future graceful-shutdown logic. A context.Context or stop channel passed at construction would be cleaner.


Low — api/internal/grpc/stream.go: SendAndWait note

The buffered channel (size 1) cleanup is correct — defer h.pendingRequests.Delete(reqID) removes the entry, so late-arriving daemon responses after timeout simply fill the buffer and are GC'd. No goroutine leak. Worth noting in a comment that this is intentional so future readers don't add an unnecessary close().


Security — Largely Well-Addressed

The commit history shows the team proactively addressed a lot:

  • ✅ OAuth CSRF: crypto nonce stored in HttpOnly cookie, constant-time comparison on callback
  • ✅ Cross-org IDOR in DeleteOrgMember: scoped WHERE org_id = ? in delete query
  • ✅ Session tokens: SHA-256 hashed before DB storage, crypto/rand, SameSite=Strict, HttpOnly, Secure
  • ✅ Stripe race: stripe.Key set once in NewMeterManager, not globally reassigned
  • ✅ Stream registration race: CompareAndDelete prevents new connection being clobbered by delayed cleanup
  • ✅ OAuth account takeover: email verified flag checked before linking provider accounts
  • id.Generate(): converted from panic to (string, error) with all 13 callers updated
  • ✅ Overlay cleanup: added on launch failure to prevent orphaned overlays
  • ✅ TAP name collision: increased entropy (6→9 chars)
  • ✅ Bridge name validation: regex guard before ip commands
  • bcrypt password length bound: 72-character max enforced before hash

One remaining item I'd flag: the gRPC connection between daemon and control plane authenticates via bearer token in metadata, but the transport appears to be optional TLS (configurable, not enforced). If daemons connect over untrusted networks without TLS, the host token is transmitted in plaintext. If TLS is already enforced in all deployment configurations, a comment or config validation making this explicit would help.


Test Coverage — Good

106 tests across the key packages with no external dependencies required (mocked store, no real QEMU/PostgreSQL) — this is the right approach for unit tests on infrastructure code. A few edge cases worth adding:

  1. billing/ticker.go: Test tick() when ListSandboxesByOrg returns an error — verify it skips the Stripe report rather than reporting zero.
  2. grpc/stream.go: Test SendAndWait when the context is cancelled before the daemon responds — verify the pending request map is cleaned up correctly.
  3. orchestrator: Test CreateSandbox compensating action path (DB persist fails after VM created) — verify DestroySandbox is called.

Positive Highlights

  • SSH CA implementation: Ephemeral certificates with TTL enforcement, proper key types, no key material in logs.
  • AES-256-GCM for secrets at rest: Correct nonce generation, authenticated encryption.
  • gRPC heartbeat logic: 3-consecutive-miss threshold with configurable timeout is robust against transient flaps.
  • Placement scoring: CPU+memory combined scoring with fallback path is reasonable.
  • Error wrapping: Consistent fmt.Errorf("context: %w", err) throughout.
  • API rate limiting: IP-based with prune loop on auth endpoints.
  • Prompt injection guard: System prompt boundary in agent executor.

Summary

The architecture is solid and the iterative security hardening across the commit history is good practice. The three actionable items before production:

  1. billing/ticker.go — return early on store errors instead of reporting zero usage
  2. orchestrator.GetSandbox — add orgID parameter to enforce ownership at the data layer
  3. Confirm TLS is enforced (not just optional) on daemon→control-plane gRPC connections

The rest are quality-of-life improvements that can follow in subsequent PRs.

… on store errors

- Add GetSandboxByOrg to DataStore interface with org_id WHERE clause, preventing IDOR if a future handler omits the manual check
- Update Orchestrator.GetSandbox to require orgID, remove redundant OrgID != org.ID checks from all 8 REST handlers
- Add early return in billing ticker when ListSandboxesByOrg or ListSourceHostsByOrg errors, preventing silent zero-billing reports
- Update wrong_org test expectations from 403 to 404 (org mismatch now returns ErrNotFound from DB layer)
- Use ConstructEventWithOptions for Stripe webhook to handle API version mismatches

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 21, 2026

Code Review – PR #65: Convert fluid to a control-plane

This is a substantial architectural change – adding a centralized control plane (REST API + gRPC) backed by PostgreSQL, with the CLI becoming a thin client. The overall design is solid. Below are findings grouped by severity.


Security

1. gRPC without TLS allows plaintext bearer tokens (api/cmd/server/main.go:101)

The server starts without TLS if cert/key are not configured, with only a Warn log. Host bearer tokens would be transmitted in plaintext. Consider making TLS required (fail startup without it) unless an explicit GRPC_ALLOW_PLAINTEXT=true override is set. The current behavior is safe to demo but unsafe in production.

2. DestroySandbox in the orchestrator does not enforce org scoping (api/internal/orchestrator/orchestrator.go:254)

DestroySandbox calls store.GetSandbox (no org filter), while GetSandbox in the orchestrator uses store.GetSandboxByOrg. Org isolation is enforced only at the REST handler layer. This works because the handler verifies ownership before calling the orchestrator, but it is a defense-in-depth concern: any future caller of orchestrator.DestroySandbox must remember to pre-verify the sandbox belongs to the right org. Consider either adding an orgID parameter to DestroySandbox or using an org-scoped store lookup internally.


Bugs / Correctness

3. N+1 DB queries in ListHosts (api/internal/orchestrator/orchestrator.go:512)

For each connected host, a separate GetSandboxesByHostID query is issued. With many connected hosts this becomes a serial N+1 pattern. A single ListSandboxesByHostIDs(hostIDs []string) query would be more efficient. Not critical at early scale, but worth tracking.

4. handleGetBilling duplicates org membership check (api/internal/rest/billing_handlers.go:53-67)

This handler manually fetches the org by slug and checks membership, while all other org-scoped handlers use the resolveOrgMembership helper. This divergence means any future changes to the membership check logic (e.g., soft-delete awareness) need to be applied in two places. Using resolveOrgMembership here would keep things consistent.

5. Double DB lookup in destroy/start/stop/run handlers

handleDestroySandbox, handleStartSandbox, handleStopSandbox, handleRunCommand, and handleCreateSnapshot all call orchestrator.GetSandbox for org verification, and then the orchestrator re-fetches the sandbox internally. This is correct (no bug), but the extra round-trip to the DB on the hot path for these operations could be eliminated by threading the pre-fetched sandbox into the orchestrator methods.


Code Quality

6. hasMinRole rebuilds the ranks map on every call (api/internal/rest/sandbox_handlers.go:463-468)

This is a minor allocation that happens on every authorized request. A package-level var roleRanks = map[store.OrgRole]int{...} or a switch statement would avoid the repeated allocation.

7. ResourceTicker uses store.DataStore while the rest uses store.Store (api/internal/billing/ticker.go:17)

Minor inconsistency. store.Store embeds DataStore and adds lifecycle methods. Since ResourceTicker does not need Ping/Close/WithTx, accepting DataStore is actually the more correct narrower interface. This is fine – just inconsistent with the rest of the codebase. If intentional, a comment would clarify.

8. Rate limiting cleanup goroutine has no shutdown path (api/internal/rest/ratelimit.go:91-102)

The comment in the code acknowledges this correctly. Just flagging that if rateLimitByIP is ever called more than once (e.g., in tests that spin up multiple servers), each call leaks a goroutine. A sync.Once or sharing the cleanup goroutine across calls could help, though the current approach is fine given the documented startup-only usage.

9. In-memory rate limiting not suitable for multi-instance deployments (api/internal/rest/ratelimit.go:82-84)

The code comment acknowledges this, but it is worth surfacing in a TODO or issue: when horizontally scaled, each instance has independent rate limit state, so effective limits multiply by instance count. Fine for now, but this will need a Redis-backed or similar solution before multi-instance production deployment.


Positive Highlights

  • Reconnect race prevention (stream.go:158): Using CompareAndDelete to prevent stale cleanup from clobbering a reconnecting host's state is a clean and correct approach.
  • Token impersonation prevention (stream.go:92-98): Overriding daemon-supplied hostID with the server-assigned token identity is exactly right.
  • Compensating destroy on DB persist failure (orchestrator.go:203-221): The compensating action prevents orphaned VMs when the DB write fails after the VM is already created. Well done.
  • Stripe webhook signature verification (billing_handlers.go:345): Properly uses webhook.ConstructEventWithOptions with the webhook secret. Correct.
  • Trusted proxy CIDR handling (ratelimit.go:49-74): The rate limiter only trusts X-Real-IP/X-Forwarded-For from configured trusted proxy CIDRs. This prevents IP spoofing via headers.
  • Org-scoped DB queries for IDOR prevention: GetSandboxByOrg at the store layer is the right approach for preventing insecure direct object references.
  • Test coverage: Test files exist for every major package including handlers, orchestrator, placement, gRPC stream, billing, auth, and registry.

Minor Nits

  • api/internal/grpc/stream.go:137: persistHostRegistration is called with context.Background() so it outlives the stream context – good. Consider adding a timeout to that background context so a slow DB does not block indefinitely (e.g., a 10-second deadline).
  • The CORS middleware (server.go:203) sets Access-Control-Allow-Origin to frontendURL unconditionally. If FRONTEND_URL is empty in config, the header value will be an empty string, which browsers treat as null origin (effectively blocking CORS). Validate that FRONTEND_URL is set in cfg.Validate().

Overall this is a well-structured PR with good security thinking throughout. The major items to address before production are TLS enforcement on gRPC and the org-scoping concern on DestroySandbox.

aspectrr and others added 3 commits February 21, 2026 09:27
Daemons now report source_vm_count in heartbeats alongside active_sandboxes.
The billing ticker aggregates from the in-memory registry instead of querying
ListSandboxesByOrg/ListSourceHostsByOrg on every tick.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Feb 21, 2026

PR Review: feat: convert fluid to a control-plane

This is a substantial architectural redesign — moving from a standalone libvirt VM cloner to a distributed control-plane + daemon model with microVM overlays and bidirectional gRPC streaming. The core design is solid and the code quality is generally high. Below are findings organized by priority.


Security

[Medium] ENCRYPTION_KEY is optional — sensitive data stored in plaintext by default

api/internal/config/config.go:131 only emits a warning when ENCRYPTION_KEY is unset. Without it, OAuth access_token/refresh_token and Proxmox API secrets are written to PostgreSQL in plaintext. These are high-value credentials; consider making this key required in production or at minimum fail-fast when Proxmox source hosts exist but no key is configured.

[Medium] gRPC TLS is optional — host bearer tokens sent in plaintext without it

api/cmd/server/main.go:101 logs a warning when TLS is unconfigured but starts the server anyway. A host token intercepted in transit gives an attacker full daemon control for that org. For production guidance, consider documenting (or enforcing) that TLS is required when the gRPC port is network-accessible.

[Low] OAuth tokens encrypted only when key is present — silent fallback

api/internal/store/postgres/postgres.go:436-447 encrypts OAuth tokens when encryptionKey is non-empty but silently stores plaintext otherwise (no error, just continues). The encrypt-on-write / decrypt-on-read pattern is fine, but a failed encrypt (wrong key, etc.) should probably return an error rather than silently storing plaintext.


Bugs / Correctness

[Medium] Orphan stream on re-registration failure

In api/internal/grpc/stream.go:118-132, when a host reconnects:

  1. The old connection's cancel is invoked (line 118-124), tearing down the old stream.
  2. The new stream is stored (line 128).
  3. registry.Register() is called (line 129) — if this fails, the new stream is deleted (line 130).

After step 3 fails, the old connection is gone and the new stream is gone. The host is left in a zombie state until its next reconnect attempt. Consider storing the new stream only after registration succeeds, or restoring the old connection on failure.

[Low] Double sandbox DB lookup per mutating operation

REST handlers for run, start, stop, destroy, snapshot, and commands all call GetSandbox(org.ID, sandboxID) to verify ownership, then the orchestrator re-fetches the same sandbox (without org scope) via store.GetSandbox. This is 2 DB round-trips per call. Consider passing the already-fetched sandbox into the orchestrator methods, or making the org-scoped lookup a single call path in the orchestrator.


Code Quality

Inconsistent permission check style in host token handlers

api/internal/rest/host_handlers.go:110 and :221 use inline role comparisons (member.Role != store.OrgRoleOwner && member.Role != store.OrgRoleAdmin) instead of resolveOrgRole. This is functionally equivalent to calling resolveOrgRole(w, r, store.OrgRoleAdmin) but inconsistent with the rest of the handlers. Prefer the centralized helper.

Rate limiter cleanup goroutine has no shutdown

api/internal/rest/ratelimit.go:91 starts a goroutine that time.Sleeps in a loop forever. The code comment acknowledges this is intentional (process-scoped), which is acceptable for production. However this causes goroutine accumulation in tests if rateLimitByIP is called multiple times. Consider accepting a context.Context or returning a stop function.

Placement scoring is greedy (always picks most-resourced host)

api/internal/orchestrator/placement.go:107:

return float64(h.Registration.GetAvailableMemoryMb()) + float64(h.Registration.GetAvailableCpus())*1024

This always routes to the host with the most available capacity, potentially concentrating sandboxes on fewer hosts initially. Fine for small deployments; for larger fleets, weighted-random or least-connections would spread load better.

Commented-out code sections

Several blocks are commented out with "not yet ready for integration" markers (agent conversations, playbooks, agent client). This is clearly intentional WIP, just noting it for housekeeping before this PR is merged to main.


Positive Observations

  • gRPC host identity pinning (stream.go:94-98): Overriding the daemon-supplied host_id with the server-assigned token identity is exactly right — prevents daemons from impersonating each other.
  • Compensating destroy on DB failure (orchestrator.go:203-221): Rolling back orphaned VMs when DB persistence fails is well-implemented.
  • CompareAndDelete for stream cleanup (stream.go:161): Correctly avoids clobbering a replacement stream during reconnect.
  • Heartbeat monitor with consecutive-miss tolerance (stream.go:285-322): The 3-miss buffer avoids flapping on transient network issues.
  • Per-request sandbox ownership check: All REST handlers scope sandbox operations to the authenticated org before dispatching — consistent and correct.
  • Token hashing: Both session tokens and host tokens are stored as SHA-256 hashes; raw tokens never touch the DB.
  • Request body size limits in OAuth fetchers: io.LimitReader(resp.Body, 1<<20) is consistently applied.
  • Test coverage: Companion _test.go files are present for all meaningful packages.

Summary

The architectural direction is solid and the implementation is production-quality for most of it. The two highest-priority items before shipping to production are:

  1. Make ENCRYPTION_KEY required (or error loudly when sensitive data would be stored unencrypted).
  2. Fix the orphan-stream window on host reconnect.

The rest are improvements worth addressing but not blockers.

@claude
Copy link

claude bot commented Feb 21, 2026

PR Review: feat: convert fluid to a control-plane

This PR represents a complete architectural transformation — from a standalone libvirt clone tool to a multi-tenant SaaS control plane with gRPC streaming, PostgreSQL, billing, and an embedded LLM agent. The scope is substantial (85k+ additions) and the direction is architecturally sound. The 30+ iterative commits show excellent responsiveness to prior review feedback.


What's Been Done Well

The most recent HEAD addresses a large number of previously raised issues:

  • Session and host token security: SHA-256 hashed before DB storage, raw tokens never persisted. Raw token returned once on creation only.
  • Per-host gRPC send mutex: serializes concurrent Send calls per stream correctly.
  • OAuth CSRF: per-request nonce with constant-time comparison in callback.
  • org-scoped sandbox lookups at the DB layer: GetSandboxByOrg enforces org_id in the WHERE clause, preventing IDOR if a handler omits the manual check.
  • Compensating DestroySandbox on DB persist failure: prevents orphaned VMs after a successful host-side create.
  • Host ID override from token: host ID is now derived server-side from the validated token, preventing impersonation.
  • id.Generate() returns error instead of panicking.
  • Prompt injection guard in agent system prompt.
  • Per-org mutex in billing free-tier calculation.
  • Rate limiting on auth endpoints.
  • Registry-based billing metering: replaces DB polling, which also avoids zero-billing on store errors.

Remaining Issues

[Medium] EnsureModelMeter — non-ErrNotFound error after acquiring lock falls through (billing/meters.go)

After the per-model mutex is acquired, the double-check re-reads the DB. But only err == nil exits early:

meter, err := mm.store.GetModelMeter(ctx, modelID)
if err == nil {
    return meter, nil
}
// Falls through even if err is a transient DB error, not ErrNotFound

If GetModelMeter returns a transient DB error (connection reset, timeout), the code proceeds to create Stripe objects even though the meter may already exist. Mirror the pre-lock check: return the error for any non-ErrNotFound case.

[Medium] Rate limiter uses r.RemoteAddr directly (rest/ratelimit.go)

When the API sits behind a load balancer or reverse proxy, r.RemoteAddr is always the proxy's IP. All users share a single rate-limit bucket, making the limiter ineffective. Use X-Forwarded-For / X-Real-IP when a trusted proxy header is present, with a config flag to opt in (to avoid spoofing in direct-access setups).

[Medium] Rate limiter cleanup goroutine leaks on server shutdown (rest/ratelimit.go)

The cleanup goroutine started per rate-limited endpoint runs indefinitely with time.Sleep and has no shutdown path. It should accept a context.Context and exit when the server stops.

[Low] Missing Vary: Origin response header in CORS middleware (rest/server.go)

Access-Control-Allow-Origin is set unconditionally to the configured frontend URL. Without Vary: Origin, caching proxies may serve one client's CORS response to another client with a different origin. Add w.Header().Add("Vary", "Origin").

[Low] No pagination on list endpoints

ListSandboxes, ListCommands, ListPlaybooks, and similar endpoints have no limit/offset parameters. The store.ListOptions struct already exists — it just isn't wired through the REST handlers. At scale these responses will be large. This doesn't need to block merge but should be tracked as immediate follow-up.

[Low] Billing meter rollback doesn't cover Stripe billing meters (billing/meters.go)

The rollback function deactivates prices and products but Stripe does not support deleting billing meters. On partial failure, orphaned meters will accumulate in Stripe. This limitation should be documented with a comment so future operators understand why orphaned meters may appear.

[Low] StartSandbox/StopSandbox should validate current state before dispatching

Dispatching a start command to an already-running sandbox (or stop to an already-stopped one) lets the host return an error that can diverge DB and host state. A pre-check on sandbox.State before the SendAndWait call would catch this cheaply and return a meaningful error to the caller.

[Low] Auth middleware context key uses string type (auth/middleware.go)

type contextKey string is used in middleware.go, while hostauth.go uses the safer unexported struct type (type hostOrgKey struct{}). String-based context keys can collide across packages. Align to the struct pattern used elsewhere in the auth package.


Test Coverage Note

Per CLAUDE.md and api/AGENTS.md: "Every code change needs tests." The billing package (billing/meters.go, billing/ticker.go) contains complex Stripe integration, free-tier math, and concurrent access patterns with no test files visible in this PR. Given the financial impact of billing bugs, this is the most important testing gap remaining. The free-tier edge cases (usage exactly at the limit, concurrent requests racing across the threshold) are exactly the cases that unit tests with a mock Stripe backend would catch.


Summary

The architecture is clean and the iterative fix work has been thorough. The remaining items are manageable — none are blockers at the level of the issues already addressed. The billing test coverage gap is the highest-priority remaining item per project policy. The rate limiter proxy-awareness issue is the most likely to cause an operational problem in a real deployment.

Reviewed at HEAD commit cf23b3f5 (2026-02-21).

@aspectrr aspectrr merged commit 4514bd3 into main Feb 21, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants