Conversation
…uid modules 106 tests across 12 new test files covering config, state, janitor, image, network, reconnect, registry, placement, API handlers, and REST client. All tests run without external dependencies (no QEMU, libvirt, PostgreSQL). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Introduces an HTTP control-plane client in the fluid CLI module and adds a broad set of unit tests for the new control-plane / sandbox-host components (API handlers, registry, placement logic, store/network/janitor/config/image behaviors).
Changes:
- Added `fluid/internal/controlplane` REST client with typed request/response models and helper HTTP methods.
- Added comprehensive tests for the control-plane client behavior and error handling.
- Added new test suites for sandbox-host and control-plane subsystems (state, networking, janitor, config, registry, orchestrator placement, API handlers).
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| sandbox-host/internal/state/sqlite_test.go | Adds state store CRUD/expiry/commands tests (currently outside CI/module scope). |
| sandbox-host/internal/network/bridge_test.go | Adds tests for bridge resolution + DHCP mode (currently outside CI/module scope). |
| sandbox-host/internal/janitor/janitor_test.go | Adds janitor cleanup behavior tests (timing-based). |
| sandbox-host/internal/image/store_test.go | Adds image store listing/path resolution tests. |
| sandbox-host/internal/config/config_test.go | Adds default/load/save config tests (uses UserHomeDir). |
| sandbox-host/internal/agent/reconnect_test.go | Adds reconnect/backoff tests (real-time delays). |
| fluid/internal/controlplane/client.go | Adds the new control-plane REST client implementation. |
| fluid/internal/controlplane/client_test.go | Adds extensive client tests using httptest. |
| control-plane/internal/registry/registry_test.go | Adds registry behavior + concurrency tests (currently outside CI/module scope). |
| control-plane/internal/orchestrator/placement_test.go | Adds host placement tests (currently outside CI/module scope). |
| control-plane/internal/api/handlers_source_test.go | Adds source VM handler tests (currently outside CI/module scope). |
| control-plane/internal/api/handlers_sandbox_test.go | Adds sandbox handler tests + shared mock orchestrator (currently outside CI/module scope). |
| control-plane/internal/api/handlers_host_test.go | Adds host handler tests (currently outside CI/module scope). |
```go
// GetSandbox retrieves a sandbox by ID.
func (c *Client) GetSandbox(ctx context.Context, id string) (*SandboxResponse, error) {
	var result SandboxResponse
	if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s", id), &result); err != nil {
		return nil, err
	}
	return &result, nil
}

// ListSandboxes returns all sandboxes.
func (c *Client) ListSandboxes(ctx context.Context) ([]*SandboxResponse, error) {
	var result struct {
		Sandboxes []*SandboxResponse `json:"sandboxes"`
		Count     int                `json:"count"`
	}
	if err := c.get(ctx, "/v1/sandboxes", &result); err != nil {
		return nil, err
	}
	return result.Sandboxes, nil
}

// DestroySandbox destroys a sandbox by ID.
func (c *Client) DestroySandbox(ctx context.Context, id string) error {
	return c.delete(ctx, fmt.Sprintf("/v1/sandboxes/%s", id))
}

// StartSandbox starts a stopped sandbox.
func (c *Client) StartSandbox(ctx context.Context, id string) error {
	return c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/start", id), nil, nil)
}

// StopSandbox stops a running sandbox.
func (c *Client) StopSandbox(ctx context.Context, id string) error {
	return c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/stop", id), nil, nil)
}

// RunCommand executes a command in a sandbox.
func (c *Client) RunCommand(ctx context.Context, sandboxID, command string, timeoutSec int) (*CommandResponse, error) {
	req := RunCommandRequest{
		Command:    command,
		TimeoutSec: timeoutSec,
	}
	var result CommandResponse
	if err := c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/run", sandboxID), req, &result); err != nil {
		return nil, err
	}
	return &result, nil
}

// GetSandboxIP returns the IP address of a sandbox.
func (c *Client) GetSandboxIP(ctx context.Context, id string) (string, error) {
	var result struct {
		SandboxID string `json:"sandbox_id"`
		IPAddress string `json:"ip_address"`
	}
	if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s/ip", id), &result); err != nil {
		return "", err
	}
	return result.IPAddress, nil
}

// CreateSnapshot creates a snapshot of a sandbox.
func (c *Client) CreateSnapshot(ctx context.Context, sandboxID, name string) (*SnapshotResponse, error) {
	req := struct {
		Name string `json:"name"`
	}{Name: name}
	var result SnapshotResponse
	if err := c.post(ctx, fmt.Sprintf("/v1/sandboxes/%s/snapshot", sandboxID), req, &result); err != nil {
		return nil, err
	}
	return &result, nil
}

// ListCommands returns commands for a sandbox.
func (c *Client) ListCommands(ctx context.Context, sandboxID string) ([]*CommandResponse, error) {
	var result struct {
		Commands []*CommandResponse `json:"commands"`
		Count    int                `json:"count"`
	}
	if err := c.get(ctx, fmt.Sprintf("/v1/sandboxes/%s/commands", sandboxID), &result); err != nil {
		return nil, err
	}
	return result.Commands, nil
}
```
Path parameters (sandbox IDs, VM names) are interpolated directly into URLs via fmt.Sprintf (e.g., "/v1/sandboxes/%s"). If an ID or name contains characters that need escaping (spaces, "/", "?", "%"), requests will hit the wrong endpoint or fail. Use proper path-segment escaping (e.g., url.PathEscape per segment, or url.JoinPath) for all dynamic URL components (GetSandbox/DestroySandbox/Start/Stop/Run/CreateSnapshot/ListCommands and the source VM endpoints).
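The fix suggested here can be sketched with the standard library alone; `sandboxPath` is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	"net/url"
)

// sandboxPath builds a request path with each dynamic segment escaped via
// url.PathEscape, so IDs containing "/", "?", "%", or spaces cannot change
// the target endpoint. The helper name is illustrative only.
func sandboxPath(id, action string) string {
	p := "/v1/sandboxes/" + url.PathEscape(id)
	if action != "" {
		p += "/" + url.PathEscape(action)
	}
	return p
}

func main() {
	fmt.Println(sandboxPath("sb-123", ""))     // /v1/sandboxes/sb-123
	fmt.Println(sandboxPath("a/b?x", "start")) // /v1/sandboxes/a%2Fb%3Fx/start
}
```

On Go 1.19+, `url.JoinPath` is another option for assembling the full URL from the base.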
```go
func (c *Client) doRequest(req *http.Request, result any) error {
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer func() { _ = resp.Body.Close() }()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("read response: %w", err)
	}

	if resp.StatusCode >= 400 {
		var errResp struct {
			Error string `json:"error"`
		}
		if json.Unmarshal(data, &errResp) == nil && errResp.Error != "" {
			return fmt.Errorf("control plane error (%d): %s", resp.StatusCode, errResp.Error)
		}
		return fmt.Errorf("control plane error (%d): %s", resp.StatusCode, string(data))
	}
```
doRequest includes the raw response body verbatim in error messages for non-JSON errors. Many servers include trailing newlines/whitespace or HTML, which makes errors noisy and breaks exact-string assertions/log filtering. Consider trimming whitespace from the body before formatting the error, and (optionally) include req.Method and req.URL.Path in the error to make client-side debugging easier.
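A minimal sketch of the suggested formatting, using a hypothetical `formatAPIError` helper rather than the PR's actual `doRequest`:

```go
package main

import (
	"fmt"
	"strings"
)

// formatAPIError trims whitespace from the raw error body and includes the
// request method and path for easier client-side debugging. Hypothetical
// helper name; the PR builds this error inline in doRequest.
func formatAPIError(method, path string, status int, body []byte) error {
	return fmt.Errorf("control plane error (%d): %s %s: %s",
		status, method, path, strings.TrimSpace(string(body)))
}

func main() {
	// A trailing newline from the server no longer leaks into the message.
	err := formatAPIError("GET", "/v1/sandboxes/x", 404, []byte("not found\n"))
	fmt.Println(err) // control plane error (404): GET /v1/sandboxes/x: not found
}
```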
```go
func TestNewClient(t *testing.T) {
	c := NewClient("http://localhost:9090")
	if c.baseURL != "http://localhost:9090" {
		t.Errorf("expected baseURL http://localhost:9090, got %s", c.baseURL)
	}
	if c.httpClient == nil {
		t.Fatal("expected httpClient to be non-nil")
	}
	if c.httpClient.Timeout != 10*time.Minute {
		t.Errorf("expected timeout 10m, got %v", c.httpClient.Timeout)
	}
}
```
This test hard-codes the exact baseURL string. If NewClient is updated to normalize/trim trailing slashes (recommended to avoid double-slash URLs), this assertion will fail. Consider asserting functional behavior (no trailing slash in stored baseURL, or correct request URL construction) rather than exact string equality.
```go
func TestRunWithReconnect_RetriesOnError(t *testing.T) {
	var callCount atomic.Int32

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	connectFn := func(ctx context.Context) error {
		n := callCount.Add(1)
		if n < 3 {
			return errors.New("connection failed")
		}
		return nil // succeed on 3rd attempt
	}

	// Override timing: we cancel context after connectFn succeeds, so
	// backoff waits are the bottleneck. The first two failures will incur
	// backoff waits of 1s and 2s respectively from the production code.
	// We use a generous timeout above to accommodate that.
	err := RunWithReconnect(ctx, slog.Default(), connectFn)
	if err != nil {
		t.Fatalf("expected nil after retries, got %v", err)
	}
```
This test appears to depend on production backoff waits (comment mentions 1s then 2s). That makes the unit test slow and potentially flaky if the backoff algorithm/timers change. Prefer injecting the sleep/backoff strategy into RunWithReconnect (or using a test hook) so the test can run quickly without real-time delays.
```go
	want := `unknown network "nonexistent": not found in bridge_map`
	if err.Error() != want {
		t.Fatalf("expected error %q, got %q", want, err.Error())
	}
```
This assertion compares the full error string exactly. That’s brittle if the implementation tweaks wording/punctuation or wraps errors, and it makes refactors harder without changing behavior. Prefer checking for a stable sentinel error (errors.Is) or validating key substrings/structured fields (e.g., network name) instead of the entire message.
```go
// NewClient creates a control plane client.
func NewClient(baseURL string) *Client {
	return &Client{
		baseURL: baseURL,
		httpClient: &http.Client{
			Timeout: 10 * time.Minute,
		},
	}
```
NewClient stores baseURL as-is, but request URLs are built via simple string concatenation (baseURL+path). If baseURL is configured with a trailing slash (common in configs), this produces double slashes (e.g., "http://cp/" + "/v1/..."), which can break some proxies and signature-based routing. Consider normalizing baseURL in NewClient (trim trailing "/") and updating the tests accordingly.
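The suggested normalization is a one-liner; `newBaseURL` is an illustrative helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// newBaseURL trims trailing slashes so baseURL+path never produces "//".
// In the PR this would run inside NewClient before storing baseURL.
func newBaseURL(raw string) string {
	return strings.TrimRight(raw, "/")
}

func main() {
	// Both configurations yield the same request URL.
	fmt.Println(newBaseURL("http://cp/") + "/v1/sandboxes") // http://cp/v1/sandboxes
	fmt.Println(newBaseURL("http://cp") + "/v1/sandboxes")  // http://cp/v1/sandboxes
}
```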
```go
	}

	// SSH defaults
	home, _ := os.UserHomeDir()
```
os.UserHomeDir() error is ignored here. On some CI/container environments (or Windows), UserHomeDir can fail, which would make the expected default paths incorrect and cause confusing test failures. Handle the error (t.Fatal/t.Skip) so the test fails (or skips) deterministically when HOME cannot be resolved.
Current:

```go
	home, _ := os.UserHomeDir()
```

Suggested change:

```go
	home, err := os.UserHomeDir()
	if err != nil {
		t.Skipf("skipping TestDefaultConfig: cannot determine user home directory: %v", err)
	}
```
```go
	// Give it time to run the immediate cleanup.
	time.Sleep(200 * time.Millisecond)
	cancel()
	<-done
```
These tests rely on time.Sleep to "give it time" for the janitor goroutine to run. This is timing-sensitive and can be flaky under load/slow CI, and it also slows the suite unnecessarily. Prefer synchronizing on an explicit signal from destroyFn (e.g., close a channel on first call) or exposing a single-iteration cleanup method for tests so they can block until work is done without sleeping.
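A sketch of the channel-based synchronization this comment describes; `firstCleanup` and the janitor stand-in are illustrative, not the PR's API:

```go
package main

import "fmt"

// firstCleanup blocks until the injected destroy callback has run once,
// replacing the test's time.Sleep with an explicit signal. Names are
// illustrative only.
func firstCleanup(start func(destroy func(id string))) string {
	done := make(chan string, 1)
	start(func(id string) {
		select {
		case done <- id:
		default: // only the first cleanup matters to the test
		}
	})
	return <-done // no wall-clock dependency
}

func main() {
	id := firstCleanup(func(destroy func(string)) {
		go destroy("sb-expired") // stands in for the janitor goroutine
	})
	fmt.Println(id) // sb-expired
}
```

The test passes a `destroyFn` that sends on the channel, so it synchronizes on actual work completing instead of guessing a sleep duration.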
```go
	connectFn := func(ctx context.Context) error {
		n := callCount.Add(1)
		if n == 1 {
			// After first failure, cancel context so RunWithReconnect
			// exits during the backoff wait.
			go func() {
				time.Sleep(50 * time.Millisecond)
				cancel()
			}()
		}
		return errors.New("connection failed")
	}
```
The cancellation timing here is driven by a fixed 50ms sleep in a goroutine. Depending on scheduler load, RunWithReconnect might progress differently, which can make this test flaky. Consider coordinating via channels (e.g., have connectFn signal when it has returned the first error, then cancel immediately) instead of relying on wall-clock sleeps.
Current:

```go
	connectFn := func(ctx context.Context) error {
		n := callCount.Add(1)
		if n == 1 {
			// After first failure, cancel context so RunWithReconnect
			// exits during the backoff wait.
			go func() {
				time.Sleep(50 * time.Millisecond)
				cancel()
			}()
		}
		return errors.New("connection failed")
	}
```

Suggested change:

```go
	firstErrorCh := make(chan struct{})
	connectFn := func(ctx context.Context) error {
		n := callCount.Add(1)
		if n == 1 {
			// Signal that the first connection attempt has failed so we can
			// cancel the context while RunWithReconnect is in its backoff wait.
			close(firstErrorCh)
		}
		return errors.New("connection failed")
	}
	go func() {
		<-firstErrorCh
		cancel()
	}()
```
```go
package state

import (
	"context"
	"testing"
	"time"
)
```
These sandbox-host tests live outside any Go module in this repo (there’s no sandbox-host/go.mod) and the CI workflow currently only runs go test in the fluid and fluid-remote modules. As a result, this test suite won’t run in CI, even though the PR checklist says tests were added. Consider adding a Go module for sandbox-host (and wiring it into CI), or relocating the code/tests into an existing module so they’re executed.
## PR Review - Critical Issues Found

### 🚨 BLOCKER: Missing Implementation Code

This PR adds only test files for code that doesn't exist. Of 13 files added:

The tests will not compile. They import and test packages that are not included in this PR:

### 🔒 Security Issues in
## Code Review: Control Plane Architecture Migration

This is a major architectural refactoring that transforms fluid.sh from a libvirt-based VM cloning system to a microVM control plane with distributed sandbox hosts. This is production-ready foundational work with excellent test coverage.

### Overall Assessment

**Strengths:**

**Areas for Improvement:**

### Architecture Review

#### Control Plane (control-plane/)

**Excellent Design:**

**Key Components:**

#### Sandbox Host (sandbox-host/)

**Production-Ready Features:**
### Security Considerations

#### 1. SSH CA Private Key Protection

File: `sandbox-host/internal/sshca/ca.go`

The SSH CA configuration mentions key permissions enforcement (`EnforceKeyPermissions bool`).

**Recommendation:**

#### 2. gRPC Stream Authentication

File: `control-plane/internal/grpc/stream.go:64-150`

The current implementation accepts HostRegistration without authentication.

**Recommendation:**

#### 3. Command Injection in QEMU Args (Currently Safe)

File: `sandbox-host/internal/microvm/manager.go:182-196`

QEMU args are properly constructed as separate array elements (not shell strings). No injection risk - args are properly separated.

#### 4. Process Signal Handling

File: `sandbox-host/internal/microvm/manager.go:269-295`

The code uses `os.FindProcess` + `Signal(0)` to check process existence. On Unix, `os.FindProcess` always succeeds, even for non-existent PIDs. The process might also belong to a different user.

**Recommendation:**

#### 5. SQL Injection (Properly Mitigated)

File: `control-plane/internal/store/postgres/postgres.go`

All database operations use parameterized queries - no injection risk detected.

### Potential Bugs

#### 1. Race Condition in Stream Cleanup

File: `control-plane/internal/grpc/stream.go:129-133`

If SendAndWait is called while a stream is disconnecting, it might get the stream from `h.streams`, but the stream could be closed before `stream.Send()` is called.

**Recommendation:** In SendAndWait, check whether the host is still connected before sending.

#### 2. Memory Leak in pendingRequests

File: `control-plane/internal/grpc/stream.go:30`

If a host crashes without responding, the channel in `pendingRequests` will never be deleted (it is only cleaned up on response or timeout). The timeout mitigates this partially, but the map entry persists.

**Recommendation:** Add explicit cleanup in SendAndWait with a `defer` that deletes the entry from `pendingRequests`.

#### 3. Overlay Filesystem Not Cleaned on Launch Failure

File: `sandbox-host/internal/microvm/manager.go:166-266`

If the QEMU launch fails after overlay creation, the overlay directory might not be cleaned up.

**Recommendation:** Add a `RemoveOverlay` call in the error path.

#### 4. JSON Parsing Without Error Handling

File: `sandbox-host/internal/microvm/manager.go:386-447`

Custom JSON parsing in `readMetadata` silently ignores parse errors. Consider using `encoding/json` for robust parsing, or add error reporting for malformed metadata.

### Performance Considerations

#### 1. Excessive Polling for PID File

File: `sandbox-host/internal/microvm/manager.go:223-233`

The code polls for the PID file 10 times with a 100ms sleep, adding up to a 1-second delay on every VM launch.

**Recommendation:**
#### 2. Synchronous VM Launch Blocks API

File: `control-plane/internal/orchestrator/orchestrator.go:133`

VM creation blocks the API call for up to 5 minutes.

**Recommendation:**

#### 3. Registry ListConnected() Not Optimized

File: `control-plane/internal/registry/registry.go`

With many hosts (100+), ListConnected() might be called frequently by the API. Consider adding caching with a TTL if this becomes a bottleneck.

### Test Coverage

**Excellent Coverage:**

**Examples:**

**Suggestion:** Add integration tests for:

### Code Quality

**Strengths:**

**Minor Issues:**

#### 1. Inconsistent Error Wrapping

Some errors use `fmt.Errorf` without `%w`, which loses the error chain.

**Recommendation:** Use `%w` consistently for error wrapping.

#### 2. Magic Numbers

Default values like `vcpus = 2` and `memMB = 2048` are hardcoded.

**Recommendation:** Extract them to named constants or a config struct.

#### 3. Large Functions

**Suggestion:** Consider extracting sub-functions for readability.

### Breaking Changes

As marked in the PR description, this is a breaking change:
**Documentation Impact:**

### Recommendations Summary

**Critical (Fix Before Merge):**

**High Priority (Soon After):**

**Nice to Have:**

### Verdict

**APPROVED with minor recommendations**

This is excellent foundational work for a production-grade microVM control plane. The architecture is sound, test coverage is comprehensive, and the code quality is high. The security and bug concerns listed above are important but not blockers - they can be addressed in follow-up PRs. The migration from libvirt VMs to QEMU microVMs with a centralized control plane is the right architectural direction for scalability and multi-tenancy.

**Suggested Merge Strategy:**

Great work on this substantial refactoring!
## Code Review: feat: convert fluid to a control-plane

This is a large, ambitious PR—converting the project from a local libvirt-based tool into a proper multi-tenant control plane with gRPC streaming daemons, PostgreSQL persistence, OAuth, billing, and a demo WebSocket server. The architecture is well thought out at a high level. The notes below focus on issues that need addressing before this is production-ready.

### Critical

```go
// handleSubscribe
Price: s.cfg.Billing.StripePublishableKey, // wrong — pk_live_... is not a price_...
```

A publishable key (

### Security

**OAuth CSRF: hardcoded state**

```go
cfg.AuthCodeURL("state") // "state" is a literal string, not a random token
```

The OAuth 2.0

```go
hostID := reg.GetHostId() // daemon-supplied; could claim any host ID
```

A daemon with a valid token can register as any

**Session tokens stored in plaintext** (

**Cross-org sandbox routing in placement** (

```go
s.store.DeleteOrgMember(r.Context(), memberID) // no org_id check
```

Any org admin can delete a member from any organization by supplying the target's member ID. The store query must join on

**No guard against removing the org owner** (

**GitHub/Google API response status codes not checked** (

### Race Conditions / Bugs

**Concurrent**

**Data race on**

```go
m.mu.RLock()
// ...
info.State = StateRunning // write under a read lock = data race
```

**Webhook errors silently swallowed in Stripe handler** (

```go
_ = s.store.CreateSubscription(...) // user paid but subscription may not be recorded
```

**Hand-rolled JSON parser in microvm manager** (

**WebSocket session limit is a TOCTOU race** (

**ID Collision Risk** (
…ation No backend required - the landing page demo now plays a scripted SRE debugging scenario showcasing all fluid.sh phases (investigate, sandbox, playbook, cleanup) and loops automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Code Review — feat: convert fluid to a control-plane

This is a substantial architectural change with well-structured code overall. Two prior reviews have already been posted; this review focuses on issues found by reading the actual source on the current HEAD of the PR branch, with specific file/line references.

### Critical Bugs

1. Wrong Stripe config field used as Price ID ( The

2. Concurrent

3.

4.

### Security

5. OAuth CSRF: hardcoded

6. Session tokens stored as plaintext primary keys ( The raw session token from the cookie is passed directly to

7.

8. No guard against removing the org owner ( An admin can remove the owner, permanently locking them out. Add a check that the target member's role is not

9. GitHub/Google API response status codes not validated (

10. The same three functions use

### Architecture / Design

11.

12. Host placement ignores org scope (

13. SSH config content (which may contain credentials) is forwarded to

### Code Quality

14. Returning

15.

16. Heartbeat threshold duplicated as a magic number ( Both hard-code

### Test Coverage

Per There are zero test files under

At minimum, placement logic and auth middleware need unit tests before merging.

### Configuration

17. Hardcoded default DB credentials (

18. Running schema migrations automatically on every startup is risky in production. Default to

19. Invalid env vars (e.g.

### Summary

The architectural direction is solid: centralized control plane, bidirectional gRPC streaming to daemons, microVM sandboxes, proper multi-tenancy model. The code is well-organized and readable.

**Must fix before merge:**

**Should fix soon after:**

Review based on source at commit 68d021a.
## Code Review — PR #65: feat: convert fluid to a control-plane

This is a significant architectural change (43k additions, 58k deletions). The direction is sound — moving to a centralized control plane with micro VMs is a more production-viable design. However, there are several security and correctness issues that need to be addressed before merge.

### 🔴 Critical — Must Fix Before Merge

1. Concurrent
| Severity | Count |
|---|---|
| 🔴 Critical | 5 |
| 🟠 High | 7 |
| 🟡 Medium | 8 |
| 🔵 Low | 4 |
The architecture itself is a clear improvement. Issues 1 (gRPC stream race) and 2 (OAuth CSRF) are security-critical and need fixes before merge. Issue 12 (no tests) is required by project policy. The remaining critical/high items should be resolved or tracked as immediate follow-up.
- Add 200+ tests across api/ covering auth, REST handlers, orchestrator, agent executor, gRPC streaming, registry, store, config, json, and id - Add billing package with Stripe metering and usage ticker - Add store types and interface methods for billing (ModelMeter, OrgModelSubscription, SumTokenUsage, ListActiveSubscriptions) - Update Python SDK with full API coverage and high-level client - Add web billing calculator, onboarding, playbooks, and daemon status - Add fluid-cli model picker and modelsdev client - Add fluid-daemon id package - Misc refactors: heartbeat timeout config, org-scoped registry, etc. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Code Review — feat: convert fluid to a control-plane

The architectural direction here is right: moving to a centralized control plane with microVMs and per-host daemons is a significant improvement in scalability and multi-tenancy potential. The QEMU overlay filesystem approach for fast sandbox creation is a good technical choice. Previous reviews have catalogued many issues thoroughly; rather than repeat them, I want to consolidate the must-fix list and add a few observations not yet covered.

### Consensus Must-Fix Items (from prior reviews)

These issues appear across multiple reviews and need resolution before merge:

### Additional Observations

IDs are interpolated directly into URL paths with

**Pending request map leak in stream handler** (

If a host disconnects mid-request, the channel entry in

**Overlay not cleaned on QEMU launch failure** (

If the process fails to start after the overlay directory is created, the overlay is not removed in the error path. This leaks disk space over repeated failures. Add a deferred

Both

### Minor

### Summary

The architecture is a solid foundation. The gRPC stream race, OAuth CSRF, org member IDOR, and Stripe reliability bugs are the items that would cause real production incidents and need to be resolved before merge. Everything else can be tracked as follow-up work.
The OAuth login handlers passed a hardcoded "state" literal and callbacks never validated it, leaving the flow vulnerable to CSRF attacks. Now each login generates a 32-byte crypto/rand nonce stored in a short-lived HttpOnly cookie, and callbacks validate it with constant-time comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
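The flow this commit describes can be sketched as follows; `newState` and `validState` are hypothetical helper names, but the primitives (`crypto/rand`, `crypto/subtle.ConstantTimeCompare`) are the standard ones for this pattern:

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// newState generates a 32-byte crypto/rand nonce for the OAuth state
// parameter; the caller stores it in a short-lived HttpOnly cookie.
func newState() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// validState compares the cookie value against the callback's state
// parameter in constant time.
func validState(fromCookie, fromCallback string) bool {
	return subtle.ConstantTimeCompare([]byte(fromCookie), []byte(fromCallback)) == 1
}

func main() {
	s, err := newState()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(s), validState(s, s), validState(s, "forged")) // 43 true false
}
```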
DeleteOrgMember only filtered by member ID, allowing any org admin to delete members from other orgs if they knew the member ID. Now requires org_id match in the WHERE clause. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Code Review: feat: convert fluid to a control-plane

This is a substantial architectural rework — a well-motivated pivot from a local libvirt-clone workflow to a centralized control plane with multi-tenancy, billing, and microVMs. The overall direction is solid and the code is generally well-structured. The review below focuses on issues that need attention before merging.

### Architecture Overview

The new architecture is sound:

### Security Issues

**🔴 High: Agent executor lacks org isolation on sandbox operations**

```go
// executor.go
func (c *Client) execDestroySandbox(ctx context.Context, params map[string]any) (string, error) {
	err := c.orchestrator.DestroySandbox(ctx, strParam(params, "sandbox_id"))
```

The REST handlers do this correctly:

```go
// sandbox_handlers.go
sandbox, err := s.orchestrator.GetSandbox(r.Context(), sandboxID)
if sandbox.OrgID != org.ID {
	serverError.RespondError(w, http.StatusForbidden, ...)
}
```

A prompt injection attack could cause the AI agent to destroy sandboxes belonging to other orgs. The executor needs the same org-ownership check before every mutating operation.

**🔴 High:**
## Code Review: feat: convert fluid to a control-plane

This is a substantial architectural refactor — moving from local libvirt clones to a control-plane + daemon + microVM model. The overall structure is well-thought-out. Below are findings from reviewing the key changed files.

### Critical / High Severity

**1. Host ID hijacking** (

The host ID is daemon-provided during registration with no server-side validation that it belongs to the registering org. A daemon from org A could supply a

Fix: After

**2. Host token delete missing org-scoping** (

Fix: Add org-scoping to

**3. Database URL logged in plaintext** (

```go
slog.Info("connecting to database", "db", cfg.Database.URL)
```

If

**4. Sandbox creation not atomic** ( If

Fix: On

### Medium Severity

**5. Stale hosts never disconnected** (

Fix: After N missed heartbeats, cancel the stream context and deregister the host from the registry.

**6. Command audit trail silently dropped** ( In

Fix: Return the error to the caller (or at minimum use structured metrics/alerting, not just a log line).

**7. Billing store methods are unimplemented stubs** ( All billing-related store methods (

Fix: Return

**8. Daemon inbound gRPC has no authentication** ( The daemon's own gRPC server (for CLI/TUI connections on

This may be intentional for local-only use, but it should be clearly documented and the default bind address should be

**9.** If

**10.** Any error from

### Low Severity / Code Quality

**11.** Returns the first matching host from map iteration (random order in Go). With multiple hosts having the same source VM, all load goes to whichever host happens to be returned first. Consider selecting the host with the most available resources.

**12.** Uses

**13.** Returns

**14. Log file world-readable** ( The log file is opened with

**15. Duplicate step numbering in** Two comment blocks are labelled "6." — minor but confusing.

### Positive Notes

The host ID hijacking bug (#1) and the host token delete bypass (#2) are the most impactful and should be addressed before merge. The DB URL logging (#3) is also a straightforward fix with high security value.
…facing APIs Users now specify source_vm + live (bool) instead of source_vm + base_image. live=true maps to SNAPSHOT_MODE_FRESH, live=false (default) uses cached image. base_image remains internally in proto/daemon but is derived from source_vm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## PR Review: feat: convert fluid to a control-plane

This is a large and ambitious architectural transformation. The overall design is sound — centralized control plane, per-host daemons, gRPC streaming, MicroVM support, and multi-tenant auth are all the right building blocks. Here is a detailed review.

### Summary of Findings

### Security Issues

**[High] Agent executor bypasses org ownership checks** (
…safety
- OAuth helpers: check HTTP status codes, use 10s timeout client, propagate email verified status
- Guard against removing org owner via handleRemoveMember
- Billing webhook: check CreateSubscription error, remove dead GetSubscriptionByOrg("") call
- Hide StripeCustomerID from non-owner org members
- Redact database URL credentials in startup logs
- persistHostRegistration: only create host on ErrNotFound, log other errors
- Config env helpers: warn on unparseable values instead of silent fallback
- monitorHeartbeat: disconnect host after 3 consecutive missed heartbeats
- PrepareSourceVM/ValidateSourceVM: return concrete proto types instead of any
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-planeThis is a significant architectural overhaul — converting from a single-node libvirt clone model to a multi-host microVM control plane with gRPC streaming, PostgreSQL, and Stripe billing. Overall the structure is well-thought-out and the code quality is solid. Here are my findings: Architecture & DesignPositive:
Potential Bugs1. Host mutation in
2. If Consider saving the meter record first (with a flag like 3. Free-tier billing calculation relies on thisChat := float64(inputTokens + outputTokens)
prevTotal := cumulative - thisChat // usage before this chat (already recorded by the time we get here)This assumes 4. If 5. If the host is offline when destroy is called,
Security Concerns6. stripe.Key = mm.stripeKeyThe Stripe Go SDK uses a global sc := &client.API{}
sc.Init(mm.stripeKey, nil)This is a real data-corruption/wrong-billing risk if two orgs trigger billing simultaneously. 7. No email validation at registration (auth_handlers.go:71-79) Only 8. OAuth tokens stored in plaintext (postgres.go:52-53)
9. Session token is 64 hex chars (256 bits of randomness halved by hex encoding = 128 bits) — acceptable, but verify (auth/session.go) 32 random bytes hex-encoded → 64 chars → 128 bits of entropy. This is fine for session tokens, but worth documenting explicitly. Performance Considerations10. for _, host := range connected {
resp, err := o.sender.SendAndWait(...)
}If there are N hosts, this is N sequential RPCs, each with its own timeout. This should be parallelized with goroutines + a wait group, especially since 11.
Code Quality12. Ignored errors from _ = c.store.CreateAgentMessage(ctx, userMsg)Message persistence failures are silently swallowed. If the DB is unavailable, the conversation history will be incomplete without any indication to the caller. These should at least be logged at 13. Hardcoded model list in Model IDs and pricing are hardcoded. The 14. host, err := SelectHost(o.registry, baseImage, req.OrgID, o.heartbeatTimeout)
if err != nil {
if req.SourceVM != "" {
host, err = SelectHostForSourceVM(...)
}
}
15.

return s[:maxLen]

This will panic or produce invalid UTF-8 if the conversation title starts with a multi-byte character near position 100. Use rune-aware truncation.

Test Coverage

Test files are present for most packages (
Minor Nits
Overall this is a solid PR that lays the right foundation. The Stripe global key race (#6) and the Stripe orphaned resources on DB failure (#2) are the most important issues to address before shipping.
Code Review: feat: convert fluid to a control-plane

This is a substantial architectural refactor. The overall direction is sound and well-structured. Below are my observations by category.

Architecture

The shift from direct CLI→daemon gRPC to a centralized control plane with NAT-friendly outbound streaming is a good call for a multi-host/enterprise deployment model. The three-layer separation (CLI → control plane → daemon) with the bidirectional stream for daemon-to-CP communication is clean and the protocol buffer definitions are well-organized.

Potential Bugs

1. Hardcoded snapshot name in libvirt backend

// Suggestion: check/delete pre-existing snapshot before creating
virsh snapshot-delete <vm> fluid-tmp-snap --metadata // if exists

2. Hardcoded root device in QEMU

"-append", "console=ttyS0 root=/dev/vda rw quiet"

This assumes the base image has its root on /dev/vda.

3. Hand-crafted JSON in the microvm manager

4. TAP name collision risk

Security

Host token auth (

SSH CA (

SSH host probing (

No TLS on local gRPC port by default:
Code Quality

Duplicate create-sandbox logic: The sandbox creation flow (including snapshot-pull, overlay setup, etc.) appears in both

Non-deterministic host selection: In

SSH public key from

Test Coverage

Strong coverage in:
Gaps:
Minor Notes
Summary

The architecture is solid and the implementation quality is generally high. The main items to address before merging:
Great progress on a complex refactor — the move to microVMs and a centralized control plane is the right direction.
Hash session tokens with SHA-256 before DB storage, add secure cookie config flag, validate email format at registration, bound OAuth response reads, replace hand-rolled JSON with Marshal/Unmarshal in microvm manager, add overlay cleanup on launch failure, set Stripe key once at init, add compensating DestroySandbox on DB persist failure, fix truncate to use runes, log CreateAgentMessage errors, tighten log file permissions, check DeleteSession RowsAffected, add RootDevice to LaunchConfig, and update SSH CA default paths from virsh-sandbox to fluid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial architectural PR that transforms the project into a multi-tenant control plane with PostgreSQL, gRPC streaming, billing integration, and an embedded LLM agent. The overall architecture and direction are solid. Below are specific issues found during review, organized by severity.

Critical Issues (Blockers)

1. Billing stubs cause duplicate Stripe meters and $0 billing (
func (s *postgresStore) GetModelMeter(_ context.Context, _ string) (*store.ModelMeter, error) {
return nil, store.ErrNotFound
}
func (s *postgresStore) CreateModelMeter(_ context.Context, _ *store.ModelMeter) error {
return nil
}
func (s *postgresStore) SumTokenUsage(_ context.Context, ...) (int64, error) {
return 0, nil
}

Because these store methods are stubs, meters are re-created on every call and usage always sums to zero.

2. Race condition: host registered before stream entry is stored

if err := h.registry.Register(hostID, orgID, hostname, stream); err != nil {
return fmt.Errorf("register host: %w", err)
}
h.registry.SetRegistration(hostID, reg)
h.streams.Store(hostID, stream) // stored after Register — too late
High Severity

3. Compensating destroy uses already-cancelled context

When a sandbox DB write fails after the VM is created, the compensating destroy is attempted using the original (already-cancelled) request context.

4. Host selection ignores requested sandbox memory

if h.Registration.GetAvailableMemoryMb() < 512 {
continue
}

The minimum threshold (512 MB) is hardcoded and doesn't consider the requested memory.

5. Stripe objects not cleaned up on partial failure

Stripe objects are created sequentially: Product → Input Meter → Output Meter → Input Price → Output Price → DB record. If any step after Product creation fails, prior Stripe objects are orphaned with no cleanup. Add rollback for previously created objects on failure.

Medium Severity

6. Stream context used for DB persistence during host connect

h.persistHostRegistration(stream.Context(), hostID, orgID, reg)

If the gRPC client disconnects immediately after registering, the stream context is cancelled and the DB write fails silently. Use a background context (derived from the server's root context) for persistence operations that should outlive the stream.

7. Raw internal errors returned to clients

serverError.RespondError(w, http.StatusInternalServerError, fmt.Errorf("failed to create sandbox: %s", err.Error()))

Internal details (host IDs, gRPC errors, DB errors) are returned directly to the caller. Log the full error server-side and return a generic user-facing message.

8. Requests to each connected host are issued sequentially via

9. Sort order race in

tasks, _ := c.store.ListPlaybookTasks(ctx, strParam(params, "playbook_id"))
sortOrder := len(tasks) // not locked — two concurrent calls get the same order

Two concurrent requests can observe the same count and assign duplicate sort orders.

10. gRPC server stopped before HTTP drain (
11.
12. Heartbeat monitor ticker hardcoded to 30s

ticker := time.NewTicker(30 * time.Second)

If heartbeatTimeout is configured differently, the 30s tick does not scale with it.

Low Severity / Code Quality

13. Missing DB indexes for common query patterns (
14. No check that 15. Inconsistent
16. No upper bound on timeout_seconds — a client can supply an arbitrarily large value.

17. String-based context keys can collide across packages. Use a private struct type (as done with

18. The function duplicates logic already in

19. No required-field validation at startup (
Summary Table
The architecture itself is clean and well thought out — the gRPC bidirectional streaming approach for daemon communication, the registry/orchestrator separation, and the overall module structure are all solid. The critical billing issues (#1, #2, #5) should be resolved before merging, as they would cause real financial impact in production.
…ons, security - Implement 6 billing store methods with real GORM queries (critical) - Fix stream registration race: store stream before registry.Register (critical) - Pass requested CPU/memory to host selection instead of hardcoded values (high) - Add Stripe rollback on partial EnsureModelMeter failure (high) - Use background context for host persistence after registration (medium) - Sanitize internal errors in REST responses, log full details server-side (medium) - Parallelize ListVMs across hosts with errgroup (medium) - Fix sort order race in execAddPlaybookTask by handling error (medium) - Reverse shutdown order: HTTP before gRPC for graceful drain (medium) - Replace Save() with targeted Updates() for UpdateHost/UpdateSandbox (medium) - Derive heartbeat ticker interval from heartbeatTimeout config (medium) - Add composite DB indexes on commands, usage_records, sessions (low) - Add config.Validate() for DATABASE_URL and FRONTEND_URL (low) - Default SecureCookies to false for local dev (low) - Cap timeout_seconds at 3600 in RunCommand handler (low) - Remove dead code SelectHostForImage and its tests (low) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Override daemon-supplied hostID with token-derived identity to prevent impersonation - Add per-org mutex for free tier billing to prevent race conditions - Record LLM token usage per iteration instead of batched at end - Add rate limiting on auth endpoints (register/login) - Use net/mail for email validation instead of manual string parsing - Add prompt injection guard to agent system prompt - Return value copies from registry to prevent data races - Extract timeout constants in orchestrator, add combined CPU+memory placement scoring - Count actually connected daemons via registry instead of host tokens - Default API port to 8080, secure cookies to true - Add godotenv for local .env loading - Update web landing page, docs, SDK config, and docker-compose Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move setFrame(0) to cleanup function to avoid setState in effect body - Wrap ref assignment in useEffect to avoid updating refs during render Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial, well-architected PR — converting from a standalone libvirt-cloning tool to a proper control-plane with multi-host orchestration, gRPC streaming, PostgreSQL-backed state, billing, and an embedded LLM agent. The overall structure is clean and the separation of concerns is solid. Below is a detailed review.

Strengths
Bugs / Issues

1.
Convert billing.tsx to Outlet layout so child routes render. Delete standalone /calculator route (wrong pricing). Remove Calculator from sidebar nav. Switch sandbox pricing from hourly to $50/mo flat. Apply free tier deductions in calculator. Fix token markup from 0.5% to 5%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review: feat: convert fluid to a control-plane

This is a substantial architectural shift — moving from a standalone CLI to a full control-plane with a centralized API, multi-host orchestration, MicroVM-based sandboxes, and billing. The overall structure is solid and well-thought-out. Below is my review organized by category.

Architecture & Design

Strengths:
Concerns: No RBAC beyond org membership —
host := connected[0]

This is non-deterministic. If there are multiple connected hosts, the one chosen for discovery may not be the most appropriate. Should document the selection criterion or add smarter logic.
Security

Strengths:
Concerns:

No minimum password length

if req.Email == "" || req.Password == "" || req.DisplayName == "" {

A 1-character password is accepted. Add a minimum length check (e.g., 8+ chars).

Email not normalized on registration — email is stored as-is.

OAuth routes lack rate limiting

r.Get("/github", s.handleGitHubLogin)
r.Get("/github/callback", s.handleGitHubCallback)
Code Quality

Duplicate step numbering (
tasks, err := c.store.ListPlaybookTasks(ctx, playbookID)
sortOrder := len(tasks)

Two concurrent requests to add tasks to the same playbook could both observe the same count and assign duplicate sort orders. Consider using
http.ServeFile(w, r, "docs/swagger.json")

This path is relative to the process working directory. If the server is started from a different directory, this silently fails. Consider embedding with go:embed.

FRONTEND_URL validation gap (

Performance
Registry locks during copy (

Test Coverage

The PR includes test files for most packages, which is great. A few areas worth verifying:
Minor Nits
Overall this is a well-structured and ambitious PR. The core security decisions are sound and the architecture is clean. The main things I'd want addressed before merging: RBAC on mutations, minimum password length, email normalization, and the sort order race condition.
- Add PostHog telemetry integration (api + web) for usage tracking - Fix billing race condition with double-check locking on subscription items - Harden auth: use unexported context key type, SameSiteStrict cookies - Add resource capacity checks in SelectHostForSourceVM fallback path - Add state validation for sandbox start/stop operations - Fix command timeout buffer arithmetic (duration vs int) - Add org context provider and refactor agents page - Embed swagger docs in binary Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Suppress no-explicit-any for Router generic in posthog provider - Move hooks before early return in billing page to fix rules-of-hooks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a significant architectural pivot. The separation of concerns (orchestrator, registry, gRPC stream handler, REST server) is clean, and the addition of tests alongside every new package is appreciated. Below are observations grouped by severity.

Issues

1. Rate limiter trusts spoofable headers (Security)

File: api/internal/rest/ratelimit.go:41-53

The rate limiter reads X-Real-IP and X-Forwarded-For as the first source of truth. If the server is directly exposed (not behind a trusted reverse proxy), any client can set these headers arbitrarily and bypass per-IP rate limiting entirely. Options:
2. Free-tier calculation has a fragile ordering assumption (Billing)

File: api/internal/billing/meters.go:415

The line prevTotal := cumulative - thisChat assumes the current conversation's tokens are already persisted when SumTokenUsage runs. If a caller ever invokes ReportUsage before persisting the message, prevTotal will be under-counted and the free-tier boundary will be calculated incorrectly (favouring the user - not catastrophic but financially incorrect). The comment documents the assumption but the invariant is not enforced. Consider asserting or returning early if cumulative < thisChat.

3. EnsureModelMeter uses a single global mutex (Performance)

File: api/internal/billing/meters.go:98-99

mm.mu.Lock() serializes meter creation across all models. Compare this to subItemLock which is correctly per-org:model. Two simultaneous requests for different models will block each other unnecessarily. The pattern already exists in subItemLock - apply the same per-model locking here.

4. DiscoverSourceHosts picks the first connected host non-deterministically (Reliability)

File: api/internal/orchestrator/orchestrator.go:781

ListConnectedByOrg iterates a map internally (random order), so connected[0] is an arbitrary host on each call. Fine functionally today since discovery just needs any connected daemon, but worth documenting this intent or adding a selection heuristic for consistency with sandbox placement.

5. gRPC TLS is optional but host tokens travel over this channel (Security)

File: api/internal/config/config.go:154-158

GRPC_TLS_CERT_FILE / GRPC_TLS_KEY_FILE are not validated - if unset, the server starts with plaintext gRPC. Host tokens (bearer credentials) are sent in stream metadata. In production, plaintext gRPC exposes these tokens. Strongly consider:
6. Heartbeat disconnect window is very lenient

File: api/internal/grpc/stream.go:241-276

With the default heartbeatTimeout=90s, the monitor ticks every 30s and requires 3 consecutive misses before disconnecting. In the worst case a host can be truly dead for ~4x the heartbeat timeout (~6 minutes) before being removed from the registry and having sandboxes orphaned. Depending on SLA expectations this may be acceptable, but a comment explaining the intentional buffer would help future readers.

Minor Observations
What is Working Well
Overall this is solid work. The security items (1 and 5) are worth addressing before shipping to production.
…ocs and API - Remove install script method from daemon docs page - Wire DaemonConnectionStatus into daemon docs for logged-in users - Update API internals (auth, billing, orchestrator, store, grpc, rest) - Refactor docs components (step-tracker, quickstart-steps, docs-header) - Update dashboard, landing page, and route tree Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial architectural change — converting fluid from a standalone CLI-to-libvirt tool into a proper control-plane with centralized orchestration, gRPC streaming from daemon hosts, billing, and multi-host placement. The direction is solid and production-worthy. Here are my findings:

Architecture & Design

The layering is clean:
Issues & Concerns

Security
The code warns and overrides when // stream.go:81-85
if tokenID != "" && hostID != tokenID {
// only overrides when tokenID is present
}
Using

Potential Bugs
prevTotal := cumulative - thisChat // usage before this chat (already recorded by the time we get here)

This comment documents a required ordering constraint between the caller persisting usage and calling
if err := o.store.CreateCommand(ctx, cmdRecord); err != nil {
o.logger.Error("failed to persist command", ...)
}
return cmdRecord, nil

The command succeeded on the host and the result is returned to the caller regardless of the DB persist failure. This is a deliberate choice (don't fail the user's operation for a logging failure), but it means command history can silently go missing. Consider at minimum incrementing a metric/counter so this is observable in production.
respCh := make(chan *fluidv1.HostMessage, 1)
// ...
select {
case respCh <- msg:
default:
logger.Warn("response channel full, dropping")
}

A channel size of 1 works fine with the current

Performance / Resource Leaks
defer func() {
h.registry.Unregister(hostID)
h.streams.Delete(hostID)
h.streamMu.Delete(hostID) // ← this IS present, goodActually this is handled correctly —
If an org has hundreds of connected hosts,
g, gCtx := errgroup.WithContext(ctx)
for _, h := range connected {
g.Go(func() error {
select {
case <-gCtx.Done():
return gCtx.Err() // ← returns error, but...
default:
}
// ...
})
}
_ = g.Wait() // ← error is discarded
return result, nil // ← always returns nil

The goroutines return

Test Coverage

Coverage is solid for the new components —
Minor / Style
Summary
Overall the code quality is high, the design is sound, and the test coverage is much better than typical for a change of this size. The duplicate-host reconnect race is the most actionable bug to address before this is handling production traffic.
…ling, and resource management Security: fix OAuth email account takeover, IDOR in source host deletion, RBAC for sandbox destroy, bcrypt max password length, session cookie Secure/SameSite attrs, modulo bias in session codes, bridge name validation, path traversal in lease files, SSH StrictHostKeyChecking=accept-new, strip internal errors from API responses, add IP-based rate limiting with trusted proxy support. Correctness: fix SumTokenUsage summing all resource types (now filters llm_token), fix concurrent stream.Send data race with send mutex, fix RLock mutation in microvm Get(), fix DestroySandbox returning success on failure, fix billableOutput negative from rounding, add SendAndWait context propagation and timer leak fix, add heartbeat registry resource updates, fix backoff reset after stable connections, add process exit wait after Stop signal. Billing: add webhook idempotency check, Stripe subscription status validation, nil safety for sess.Subscription, remove redundant stripe.Key assignments, configurable billing markup, rollback on partial EnsureOrgSubscriptionItems failure, log swallowed CreateUsageRecord errors. Infrastructure: add HostToken expiry, unique index on stripe_subscription_id, advisory lock store methods, GetSubscriptionByStripeID, encryption key config, daemon TLS/token config fields, CLI daemon TLS config, TAP name collision reduction (6->9 chars). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a large, ambitious refactor — converting from a direct CLI→daemon model to a full control-plane architecture with REST API, gRPC streaming, PostgreSQL, billing, and agent execution. Overall the code is well-structured and has solid test coverage. Below is detailed feedback organized by area.

Security

Host impersonation prevention (good) —

OAuth email linking (good) —

Prompt injection defense (insufficient) —
CSRF — The REST API uses session cookies without any CSRF token validation. For a control plane that performs destructive infrastructure operations, this is a risk. If the frontend is on the same domain as the API, a CSRF attack on an authenticated browser session could trigger sandbox destruction. Consider using the

Architecture & Correctness

Compensating destroy on DB failure (good) —
// current: returns first match
if slices.Contains(h.SourceVMs, vmName) { return h, nil }
// suggested: track best-scoring match
if slices.Contains(h.SourceVMs, vmName) && hostScore(h) > bestScore { best = h; bestScore = hostScore(h) }
Response channel drop in

Code Quality

Large REST handlers expose internal types —

Hardcoded model pricing in

Billing fallback pricing —

Rate limiter cleanup goroutine —

Test Coverage

Coverage looks strong across the board:
One gap worth filling: the

Minor Nits
Overall this is solid work with good structure and test coverage. The main areas to address before production are the prompt injection surface, CSRF protection, and the source VM placement fairness issue. Happy to discuss any of these further.
…cleanup Security: add AES-256-GCM encryption for OAuth tokens and Proxmox secrets at rest, sanitize error responses to prevent internal error leakage, add path traversal validation, command length limits, bcrypt password length check, checksum verification in CLI updater, TLS config in gRPC client, SSH StrictHostKeyChecking=accept-new, and docs-progress session limits. Correctness: fix stale closure in step-tracker markComplete, CSS backtick typo in architecture-diagram, nil error panics in docs-progress, invalid HTML nesting in billing calculator, shiki error recovery, localStorage try/catch, polling backoff with max attempts, and encodeURIComponent on URL-interpolated values. Cleanup: delete commented-out agent code (executor, openrouter, tools), commented-out web components (chat-message, model-selector, playbooks-panel), broken playbooks route, committed CLI binary, hardcoded PostHog key, no-op setup code, and custom test helpers replaced with strings.Contains. Config: reduce session TTL from 30d to 7d, change telemetry default to opt-in, fix systemd unit to run as fluid-daemon with CAP_NET_ADMIN, fix doctor QEMU binary for arm64, log Stripe webhook errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial architectural transformation — from a direct CLI-to-libvirt model to a proper three-tier control plane with bidirectional gRPC streaming, microVMs, and a full REST API. The overall design is solid and the code quality is generally high. Good things to call out first:
Security Issues

1. TLS path is not implemented in the daemon agent client

File:

opts := []grpc.DialOption{}
if c.insec {
opts = append(opts, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
conn, err := grpc.NewClient(c.cpAddr, opts...)

When

2.
Code Review — feat: convert fluid to a control-plane

This is an ambitious and well-structured architectural shift. The three-tier design (web → API → daemon → microVMs) is sound, and the use of overlay filesystems + TAP networking for fast sandbox startup is a strong technical foundation. Below are issues that need addressing before this is production-ready, grouped by severity.

Critical — Must Fix Before Merge

1. Concurrent

The

// Required pattern in SendAndWait:
mu, _ := s.streamMu.LoadOrStore(hostID, &sync.Mutex{})
mu.(*sync.Mutex).Lock()
err := stream.Send(msg)
mu.(*sync.Mutex).Unlock()
2. OAuth CSRF — hardcoded state

Both GitHub and Google flows use a literal
Without this, any site can initiate an OAuth login and link an attacker's account to a victim's session.

3. IDOR in

The current implementation correctly passes both

4. Billing stubs cause real financial damage (
5. A

6. Webhook error silently discarded (

_ = s.store.CreateSubscription(...)

If this store call fails, the user paid but their account isn't upgraded, and HTTP 200 is returned so Stripe stops retrying. This must return a 500 so Stripe retries, and the error must be logged.

7. TLS credentials never configured in daemon agent client (

When

High Severity

8. Session tokens stored as plaintext primary keys (

Host tokens in

9.
copy := *m.vms[id]
return &copy, nil

10. Compensating destroy uses cancelled context

When the DB write fails after VM creation, the cleanup destroy call uses the original (cancelled) context.

11. The SSH CA infrastructure correctly issues short-lived certificates with principals. Using that infrastructure to also verify the host key (by checking the cert's signing CA) would complete the zero-trust model.

12. No tests in
13. HTTP API response status codes not validated in OAuth helpers (
Medium Severity

14. The monitor logs a warning after missed heartbeats but never removes the host from the registry or cancels its context. Dead hosts accumulate indefinitely and continue to receive sandbox scheduling attempts.

15. Host placement ignores org scope (
16. Host placement ignores requested memory
17. Sandbox IDs interpolated directly into URL paths allow path traversal (e.g., a sandbox ID of

18. An untrusted or misconfigured server can send an arbitrarily large response body. Use io.LimitReader.

19. Concurrent calls (e.g., two simultaneous agent invocations for the same org) can both find no subscription item and both create one, resulting in duplicate Stripe subscription items. Add a DB-level unique constraint and handle the duplicate error gracefully.

20. Session cookies sent over HTTP in production expose session tokens to network interception. Default should be Secure.

Minor
Summary

The architectural direction is right and the codebase shows real engineering thought. The critical blockers are the OAuth CSRF vulnerability, billing stub side-effects, the concurrent gRPC send race, and missing tests for the core API paths. The TLS gap in the daemon client is also important for any non-local deployment. Fixing these would make this a solid foundation to build on.
…build pipeline - Add POST /auth/onboarding endpoint with org creation and slug generation - Randomize daemon host selection for discovery load distribution - Remove unused GetStream method from gRPC stream handler - Multi-stage Docker build for web with nginx serving static assets - Move images to web/public, add install script, add tunneling blog post - Update CLI setup steps, daemon agent client, SSH CA, and microvm manager - Add remark-gfm dependency for markdown rendering Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial, well-architected PR. The architectural shift from a direct libvirt CLI to a control-plane with gRPC-connected daemons makes sense for multi-host production use, and the commit history shows a thorough iterative security review has already happened. Below are the remaining issues I found, plus a few observations.

Critical

Stream cleanup race on host reconnect (
…ead code, resource leak, input validation - Gate stream cleanup on CompareAndDelete to prevent reconnect race clobbering new connections - SelectHostForSourceVM now picks highest-scored host instead of first match - Remove dead else branch in CreateSandbox (SourceVM always validated) - Close telemetry on NewRemoteService error path in CLI - Add 64 KiB command length limit in handleRunCommand Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: feat: convert fluid to a control-plane

This is a substantial architectural change — converting from a single-daemon model to a proper control-plane + daemon fleet design with microVMs, billing, OAuth, and a full web dashboard. The overall direction is solid and the commit history shows active security hardening. Below are the remaining findings after reviewing the current HEAD.

Critical
if _, err := rand.Read(b); err != nil {
panic("crypto/rand failed: " + err.Error())
}
High

Unguarded type assertions on

v.(*sync.Mutex) // hostMu
oldCancel.(context.CancelFunc)()
ch.(chan *fluidv1.HostMessage)
streamVal.(fluidv1.HostService_ConnectServer)

The logic is currently correct (each map only ever stores one type), but the idiomatic Go pattern is to use the two-value form so a future refactor doesn't silently introduce a panic:

mu, ok := v.(*sync.Mutex)
if !ok {
// log and return error
}

Silent metadata loss on VM recovery —

meta, _ := readMetadata(m.workDir, sandboxID)

If the metadata file is missing or corrupt,

Medium

Rate limiter state is per-process — The code already has a comment acknowledging this, but it's worth calling out explicitly for the ops runbook: in a horizontally-scaled deployment each instance has independent counters, so the effective rate limit is multiplied by the instance count.
stripe.Key = stripeKey

The comment says "set once at init" but the stripe-go library exposes
Low / Style
Each QEMU launch args build These paths are passed directly to
Invalid trusted-proxy CIDR strings are silently discarded. A misconfigured

What's working well
Summary
The security-focused commits (CSRF fix, IDOR fix, bcrypt limit, session hashing, prompt injection guard) are all appreciated and well-executed. The remaining items are mostly defensive hardening. The most important one to fix before a production cut is the
…s, metadata logging, stripe race, error clarity - Critical: Convert id.Generate() from panic to (string, error) in both api and daemon packages, update all 13 callers - High: Guard all 4 sync.Map type assertions in grpc/stream.go with two-value form - High: Log warning when readMetadata fails during VM recovery instead of silently discarding - Medium: Remove duplicate stripe.Key assignment in server.go (kept single init in billing.NewMeterManager) - Medium: Wrap placement errors with distinct context for image match vs source VM fallback - Low: Replace unreachable channel-full select with direct send in stream handler - Low: Add filepath.IsAbs validation for KernelPath and OverlayPath in QEMU Launch() - Low: Add logger parameter to parseCIDRs to warn on invalid CIDR entries Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review: feat: convert fluid to a control-plane

This is a substantial architectural overhaul — ~88k additions across 717 files — converting fluid.sh from a standalone CLI+daemon into a full control-plane architecture with microVMs, bidirectional gRPC streaming, multi-tenant auth, and billing. The core design is sound and the commit history shows many issues were iteratively addressed. Below is what I found after reviewing the current state of the code.

Architecture & Design — ✅ Strong

The shift to microVMs via overlay images is the right move: faster boot, lower resource cost per sandbox, and cleaner lifecycle management. The gRPC bidirectional streaming between control plane and daemons (with reconnect logic, heartbeat monitoring, and request correlation via

The multi-tenant model (orgs → daemons → sandboxes) is clean and the RBAC layering (owner/member/host token) is correct in structure.

Remaining Issues

Medium —
… on store errors - Add GetSandboxByOrg to DataStore interface with org_id WHERE clause, preventing IDOR if a future handler omits the manual check - Update Orchestrator.GetSandbox to require orgID, remove redundant OrgID != org.ID checks from all 8 REST handlers - Add early return in billing ticker when ListSandboxesByOrg or ListSourceHostsByOrg errors, preventing silent zero-billing reports - Update wrong_org test expectations from 403 to 404 (org mismatch now returns ErrNotFound from DB layer) - Use ConstructEventWithOptions for Stripe webhook to handle API version mismatches Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review – PR #65: Convert fluid to a control-plane

This is a substantial architectural change – adding a centralized control plane (REST API + gRPC) backed by PostgreSQL, with the CLI becoming a thin client. The overall design is solid. Below are findings grouped by severity.

Security

1. gRPC without TLS allows plaintext bearer tokens (

The server starts without TLS if cert/key are not configured, with only a

2.
Bugs / Correctness

3. N+1 DB queries in

For each connected host, a separate

4. This handler manually fetches the org by slug and checks membership, while all other org-scoped handlers use the

5. Double DB lookup in destroy/start/stop/run handlers
Code Quality

6. This is a minor allocation that happens on every authorized request. A package-level

7. Minor inconsistency.

8. Rate limiting cleanup goroutine has no shutdown path (

The comment in the code acknowledges this correctly. Just flagging that if

9. In-memory rate limiting not suitable for multi-instance deployments (

The code comment acknowledges this, but it is worth surfacing in a TODO or issue: when horizontally scaled, each instance has independent rate limit state, so effective limits multiply by instance count. Fine for now, but this will need a Redis-backed or similar solution before multi-instance production deployment.

Positive Highlights
### Minor Nits

…
Overall this is a well-structured PR with good security thinking throughout. The major items to address before production are TLS enforcement on gRPC and the org-scoping concern on …

---
Daemons now report source_vm_count in heartbeats alongside active_sandboxes. The billing ticker aggregates from the in-memory registry instead of querying ListSandboxesByOrg/ListSourceHostsByOrg on every tick.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
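The aggregation this commit describes — summing heartbeat-reported counts per org from the registry snapshot rather than hitting the database each tick — can be sketched as follows. The `hostReport` struct and `aggregate` helper are illustrative stand-ins for the real registry types:

```go
package main

import "fmt"

// hostReport mirrors the heartbeat fields named in the commit message:
// active_sandboxes plus the new source_vm_count, tagged with the owning org.
type hostReport struct {
	OrgID           string
	ActiveSandboxes int
	SourceVMCount   int
}

// aggregate sums per-org usage from an in-memory registry snapshot, so the
// billing ticker avoids ListSandboxesByOrg/ListSourceHostsByOrg queries.
func aggregate(hosts []hostReport) map[string][2]int {
	totals := make(map[string][2]int)
	for _, h := range hosts {
		t := totals[h.OrgID]
		t[0] += h.ActiveSandboxes // index 0: sandboxes
		t[1] += h.SourceVMCount   // index 1: source VMs
		totals[h.OrgID] = t
	}
	return totals
}

func main() {
	hosts := []hostReport{
		{"org-a", 3, 1},
		{"org-a", 2, 0},
		{"org-b", 1, 1},
	}
	fmt.Println(aggregate(hosts)["org-a"]) // [5 1]
}
```

One trade-off worth noting: the registry only knows about currently connected hosts, so counts from a host that disconnected mid-period come from its last heartbeat or drop out entirely, whereas the DB queries saw all persisted rows.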
## PR Review: feat: convert fluid to a control-plane

This is a substantial architectural redesign — moving from a standalone libvirt VM cloner to a distributed control-plane + daemon model with microVM overlays and bidirectional gRPC streaming. The core design is solid and the code quality is generally high. Below are findings organized by priority.

### Security

**[Medium]** …
**[Medium]** gRPC TLS is optional — host bearer tokens sent in plaintext without it
**[Low]** OAuth tokens encrypted only when key is present — silent fallback
### Bugs / Correctness

**[Medium]** Orphan stream on re-registration failure

In …
After step 3 fails, the old connection is gone and the new stream is gone. The host is left in a zombie state until its next reconnect attempt. Consider storing the new stream only after registration succeeds, or restoring the old connection on failure.

**[Low]** Double sandbox DB lookup per mutating operation

REST handlers for …

### Code Quality

**Inconsistent permission check style in host token handlers**
**Rate limiter cleanup goroutine has no shutdown**
**Placement scoring is greedy (always picks most-resourced host)**
```go
return float64(h.Registration.GetAvailableMemoryMb()) + float64(h.Registration.GetAvailableCpus())*1024
```

This always routes to the host with the most available capacity, potentially concentrating sandboxes on fewer hosts initially. Fine for small deployments; for larger fleets, weighted-random or least-connections would spread load better.

**Commented-out code sections**

Several blocks are commented out with "not yet ready for integration" markers (agent conversations, playbooks, agent client). This is clearly intentional WIP, just noting it for housekeeping before this PR is merged to main.

### Positive Observations

…
### Summary

The architectural direction is solid and the implementation is production-quality for most of it. The two highest-priority items before shipping to production are:

…
The rest are improvements worth addressing but not blockers.

---
## PR Review: feat: convert fluid to a control-plane

This PR represents a complete architectural transformation — from a standalone libvirt clone tool to a multi-tenant SaaS control plane with gRPC streaming, PostgreSQL, billing, and an embedded LLM agent. The scope is substantial (85k+ additions) and the direction is architecturally sound. The 30+ iterative commits show excellent responsiveness to prior review feedback.

### What's Been Done Well

The most recent HEAD addresses a large number of previously raised issues:

…
### Remaining Issues

**[Medium]** …
---
## Description

## Type of Change

## Checklist

## Release Notes

## Labels