
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 19, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:52pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:52pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:52pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:52pm

Member Author

NathanFlurry commented Nov 19, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 19, 2025

Code Review

This PR introduces a significant architectural improvement to the tunnel messaging system by implementing a more efficient message ID format and removing the need for explicit tunnel acknowledgments. Overall, the changes are well-structured and maintain backward compatibility. Here are my findings:

✅ Strengths

  1. Smart ID Design: The new message ID format (gateway_id + request_id + message_index) is efficient at 12 bytes total and provides all necessary routing information in a single identifier.

  2. Backward Compatibility: The DeprecatedTunnelAck approach with version_needs_tunnel_ack() is a clean way to maintain compatibility with older protocol versions while removing unnecessary overhead.

  3. Improved Efficiency: Removing the explicit tunnel acknowledgment and GC cycles for non-WebSocket messages reduces message overhead and simplifies the state tracking logic.

  4. Type Safety: Using distinct type aliases (GatewayId, RequestId, MessageIndex, MessageId) provides better type safety across Rust and TypeScript implementations.

  5. Consistent Implementation: The Rust (tunnel/id.rs) and TypeScript (tunnel-id.ts) implementations are well-aligned with proper validation.

⚠️ Issues & Concerns

1. Resource Leak: Request ID Not Released ⚠️ CRITICAL

Location: engine/packages/guard-core/src/proxy_service.rs:1889

The request ID generated in the WebSocket custom serve handler is never released. The generate_unique_request_id() adds the ID to the inflight_requests HashSet, but there's no corresponding release_request_id() call after the WebSocket loop completes.

Impact: This will cause a slow memory leak where request IDs accumulate in the HashSet indefinitely. Over time with many connections, this could lead to:

  • Increased collision probability when generating new IDs
  • Memory growth (though minimal at 4 bytes per ID + HashMap overhead)
  • Eventually exhausting the retry limit (100 attempts) when the HashSet becomes too full

Fix: Add cleanup after line 2093:

anyhow::Ok(())
}
.instrument(tracing::info_span!("handle_ws_task_custom_serve"))
// Add RAII guard or explicit cleanup
.then(|result| async move {
    state.release_request_id(request_id).await;
    result
})

Or better, implement a RAII guard that releases on drop.
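
A minimal sketch of such a guard, assuming the Arc'd proxy state and the async release_request_id shown above (the SharedState name and field layout here are illustrative, not the PR's actual types):

struct RequestIdGuard {
    state: SharedState,    // e.g. Arc<ProxyState>, whatever owns inflight_requests
    request_id: RequestId, // [u8; 4] per the new tunnel id module
}

impl Drop for RequestIdGuard {
    fn drop(&mut self) {
        let state = self.state.clone();
        let request_id = self.request_id;
        // release_request_id is async, so spawn the cleanup; Drop runs on every
        // exit path of the owning scope, including early returns and unwinds.
        tokio::spawn(async move {
            state.release_request_id(request_id).await;
        });
    }
}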

2. Missing Cleanup for HTTP Custom Serve

Location: engine/packages/guard-core/src/proxy_service.rs:1086

Similar to the WebSocket issue, the HTTP custom serve path generates a request ID but doesn't appear to clean it up after the request completes. The scope ends at line ~1130, but there's no explicit release_request_id() call.

Fix: Ensure cleanup after the request completes, possibly with a scope guard or defer-like pattern.

3. Potential Panic in TupleUnpack

Location: engine/packages/pegboard/src/keys/hibernating_request.rs:56-64

let gateway_id: GatewayId = gateway_id_bytes
    .as_slice()
    .try_into()
    .expect("invalid gateway_id length");  // ⚠️ Panic

Using .expect() in deserialization code can panic if database corruption or version mismatches occur. This is inside TupleUnpack, so corruption could cause runtime panics.

Recommendation: Convert to proper error handling:

let gateway_id: GatewayId = gateway_id_bytes
    .as_slice()
    .try_into()
    .map_err(|_| PackError::InvalidFormat)?;

4. Serialization Size Verification

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

The build_message_id() function verifies that exactly 12 bytes were written, which is good. However, this check happens at runtime. Consider:

  • Is the 12-byte size guaranteed by the BARE schema? Summing the struct fields gives GatewayId (4 bytes) + RequestId (4 bytes) + MessageIndex (u16 = 2 bytes) = 10 bytes, not 12

Potential Issue: The serialization may be adding 2 bytes of padding or metadata. Verify this is intentional and documented.

5. Wrapping Arithmetic Documentation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:332

hs.pending_ws_msgs.retain(|msg| {
    wrapping_gt(msg.message_index, ack_index)
});

The use of wrapping_gt for u16 indices is correct for handling wraparound, but there's no comment explaining this design choice. Add a comment about why wrapping comparison is needed (message indices can wrap around after 65535).
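
For reference, a plausible shape for wrapping_gt, assuming it uses the standard half-range comparison for wrapping sequence numbers (a sketch, not necessarily the PR's exact implementation):

fn wrapping_gt(a: u16, b: u16) -> bool {
    // a is "after" b if the wrapping distance from b to a is in the lower half
    // of the u16 range; this stays correct across the 65535 -> 0 wrap.
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

#[test]
fn wrapping_gt_handles_the_wrap_boundary() {
    assert!(wrapping_gt(1, 0));
    assert!(wrapping_gt(0, 65535)); // 0 comes "after" 65535 once wrapped
    assert!(!wrapping_gt(65535, 0));
}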

6. Missing Test Coverage

I don't see test files in the diff for:

  • Request ID collision handling (the 100-attempt retry logic)
  • Message ID serialization/deserialization edge cases
  • WebSocket hibernation with message index wraparound
  • Backward compatibility with older protocol versions

Recommendation: Add unit tests, especially for the collision handling logic in generate_unique_request_id().

📝 Minor Observations

  1. Logging: Good use of structured logging with ?request_id and proper formatting functions like tunnel_id::request_id_to_string().

  2. Protocol Schema: The v3.bare schema is well-documented with comments explaining deprecated types.

  3. Error Messages: Error messages are descriptive and include expected vs. actual values (e.g., byte length mismatches).

  4. Code Consistency: The removal of index field from ToClientWebSocketMessage and moving it to the message ID level is architecturally cleaner.

🎯 Recommendations

Before Merging:

  1. ✅ Fix the request ID resource leak (critical)
  2. ✅ Add error handling instead of .expect() in TupleUnpack
  3. ✅ Verify the 12-byte message ID size (10 bytes expected from types)
  4. ✅ Add test coverage for edge cases

Future Improvements:

  • Consider implementing a RAII guard for request ID management
  • Add integration tests for protocol version compatibility
  • Document the message index wrapping behavior

Summary

This is a solid architectural improvement that reduces message overhead and simplifies the tunnel messaging protocol. The main blocker is the request ID leak which must be fixed before merge. The other issues are important for robustness but less critical.

Great work on maintaining backward compatibility and providing clear migration paths! 🚀

@claude

claude bot commented Nov 19, 2025

Code Review

I've reviewed this PR which adds a new message ID format and deprecates the old tunnel ack mechanism. Here's my feedback:

Positive Aspects

Well-structured message ID system: The new MessageId format combining gateway_id (4 bytes) + request_id (4 bytes) + message_index (2 bytes) is compact and efficient at 12 bytes total. The use of serde_bare for serialization is appropriate.

Backwards compatibility: The PR thoughtfully handles protocol versioning with version_needs_tunnel_ack() (engine/sdks/rust/runner-protocol/src/compat.rs:5-7) to support older clients. The deprecated DeprecatedTunnelAck is properly ignored by newer code.

Clean architecture: The new pegboard::tunnel::id module (engine/packages/pegboard/src/tunnel/id.rs) provides well-organized utility functions with proper error handling. The TypeScript implementation mirrors the Rust implementation well.

Improved state management: Removing the pending_msgs tracking and MESSAGE_ACK_TIMEOUT logic simplifies the codebase significantly. The new approach using message_index for WebSocket message acknowledgment is cleaner.

Request ID collision handling: Good defensive programming in generate_unique_request_id() (engine/packages/guard-core/src/proxy_service.rs:657-683) with collision detection, though collisions should be extremely rare with 4 billion possibilities.

Issues & Concerns

🔴 Critical: Request ID Not Released After Completion

Location: engine/packages/guard-core/src/proxy_service.rs:1889

The generate_unique_request_id() adds request IDs to inflight_requests, but I don't see a corresponding release_request_id() call in the WebSocket handling task. This will cause a memory leak where request IDs are never removed from the HashSet.

// At line 1889, request_id is generated
let request_id = state.generate_unique_request_id().await?;

// ... but I don't see state.release_request_id(request_id) anywhere in the task

Recommendation: Add a release_request_id() call in the cleanup section of the WebSocket task, similar to how it should be done after HTTP request completion.

🟡 Medium: Missing Request ID Release in HTTP Path

Location: engine/packages/guard-core/src/proxy_service.rs:1086

Similar to above, the HTTP request path generates a unique request ID but doesn't appear to release it when the request completes. The request ID should be released after the response is sent.

Recommendation: Wrap the request handling in a scope guard or defer pattern to ensure release_request_id() is called on all code paths (success, error, panic).

🟡 Medium: Serialization Size Not Verified at Compile Time

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

The code uses a runtime assertion to verify the serialized size is exactly 12 bytes:

ensure!(
    written == 12,
    "message id serialization produced wrong size: expected 12 bytes, got {}",
    written
);

While this is good defensive programming, if the BARE schema changes, this will only be caught at runtime.

Recommendation: Consider adding a compile-time test or const assertion to verify the size. At minimum, add a unit test that exercises this code path.
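
Since serde_bare serialization cannot run in a const context, the unit-test route is the practical one. A minimal sketch, assuming build_message_id has the signature described in this review (a Result over a 12-byte array):

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn message_id_serializes_to_exactly_12_bytes() {
        let id = build_message_id([1, 2, 3, 4], [5, 6, 7, 8], u16::MAX)
            .expect("message id serialization should not fail");
        assert_eq!(id.len(), 12);
    }
}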

🟡 Medium: Wrapping Arithmetic Without Overflow Documentation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:136

The message index uses wrapping arithmetic:

req.message_index = req.message_index.wrapping_add(1);

This is correct for a u16 counter, but there's no documentation about what happens when the index wraps from 65535 to 0, especially for WebSocket message acknowledgments.

Recommendation: Add a comment explaining the wrapping behavior and confirming that the wrapping_gt comparison (line 327) handles this correctly across the wrap boundary.

🟢 Minor: Inconsistent Error Message Formatting

Location: engine/packages/pegboard/src/tunnel/id.rs:80-81 vs TypeScript implementation

The Rust error messages use lowercase ("invalid request id length") while some TypeScript errors might use different casing. While minor, consistency helps with debugging.

Recommendation: Standardize error message formats across Rust and TypeScript implementations.

🟢 Minor: Missing Documentation on MessageId Format

Location: engine/sdks/schemas/runner-protocol/v3.bare:217

The BARE schema defines MessageId as data[12] but doesn't document the internal structure. Future maintainers would benefit from a comment explaining the layout.

Recommendation: Add a comment in the BARE schema:

# MessageId is a 12-byte array containing:
# - 4 bytes: gateway_id (identifies the gateway instance)
# - 4 bytes: request_id (unique to this gateway)  
# - 2 bytes: message_index (u16, wraps at 65535)
# - 2 bytes: reserved/padding from BARE encoding

Performance Considerations

✅ The new format is actually more efficient - the old approach with UUID-based IDs used 16 bytes per message ID, while the new format uses 12 bytes. Over millions of messages, this saves bandwidth.

✅ Removing the MESSAGE_ACK_TIMEOUT GC check eliminates unnecessary processing overhead.

⚠️ The inflight_requests HashSet lock is held briefly during insert/remove operations, which is acceptable. However, with the memory leak mentioned above, this HashSet will grow unbounded.

Security Considerations

✅ Using random u32 for both gateway_id and request_id provides sufficient entropy for non-cryptographic uniqueness.

⚠️ The collision retry mechanism uses a hardcoded MAX_TRIES = 100. While collisions are astronomically rare (1 in 4 billion), in theory an attacker could try to DoS by triggering many requests simultaneously. However, this is extremely unlikely to be exploitable in practice.

Test Coverage

Missing tests: I found no test files for the new tunnel ID functionality. Given this is a protocol-level change, unit tests are strongly recommended.

Recommended tests:

  1. Test build_message_id and parse_message_id round-trip (see the sketch after this list)
  2. Test message index wrapping from 65535 to 0
  3. Test collision handling in generate_unique_request_id
  4. Test wrapping_gt comparison across wrap boundary
  5. Integration test verifying old protocol clients still work with DeprecatedTunnelAck
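
Sketch of test 1 from the list above, assuming build_message_id returns a Result and parse_message_id yields a struct with gateway_id, request_id, and message_index fields (as described elsewhere in this review):

#[test]
fn message_id_round_trip() {
    let gateway_id = [0xAA, 0xBB, 0xCC, 0xDD];
    let request_id = [0x01, 0x02, 0x03, 0x04];
    let message_index: u16 = u16::MAX; // exercise the largest index

    let id = build_message_id(gateway_id, request_id, message_index).unwrap();
    let parts = parse_message_id(id).unwrap();

    assert_eq!(parts.gateway_id, gateway_id);
    assert_eq!(parts.request_id, request_id);
    assert_eq!(parts.message_index, message_index);
}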

Summary

This is a solid refactoring that simplifies the protocol and improves efficiency. The main concern is the request ID memory leak which must be fixed before merging. The other issues are lower priority but should still be addressed.

Required before merge:

  • Fix request ID memory leak (add release_request_id() calls)

Recommended before merge:

  • Add unit tests for tunnel ID module
  • Document message index wrapping behavior
  • Add schema documentation

Nice to have:

  • Standardize error messages
  • Add compile-time size verification test

Great work on the backwards compatibility handling! The deprecated tunnel ack approach is clean and will make rolling this out much safer.


Review generated with Claude Code

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e on November 19, 2025 21:54
@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from e47241f to 6521b98 on November 19, 2025 21:54
@claude

claude bot commented Nov 19, 2025

PR Review: New Message ID Format & Deprecated Tunnel Ack

I have completed a thorough review of this PR. Overall, this is a solid refactoring that improves the message ID system and maintains backwards compatibility. Here are my findings:

Strengths

  1. Well-designed Message ID Format: The new message ID structure (gateway_id + request_id + message_index) is well thought out and uses efficient 4-byte IDs with proper serialization via serde_bare.

  2. Backwards Compatibility: Excellent handling of protocol version compatibility with version_needs_tunnel_ack() and DeprecatedTunnelAck messages for older clients.

  3. Improved WebSocket Hibernation: The message tracking logic is now simpler and more maintainable - storing message_index directly in PendingWebsocketMessage is much cleaner than the previous last_ws_msg_index approach.

  4. Request ID Collision Detection: The generate_unique_request_id() implementation properly prevents collisions with retry logic and warning logs.

  5. Database Schema Updates: The hibernating request keys properly incorporate both gateway_id and request_id for better data modeling.

Critical Issues

1. Memory Leak: Request IDs Never Released

The release_request_id() method is defined in proxy_service.rs:687-689 but never called. This means the inflight_requests HashSet will grow indefinitely, memory usage will increase over time, and eventually could lead to performance degradation.

Fix needed: Add cleanup calls after custom serve request completes (both success and error paths) and after WebSocket task completion.

2. Unsafe .expect() Calls in Database Code

In engine/packages/pegboard/src/keys/actor.rs there are .expect() calls that will panic if the database contains invalid data. Per CLAUDE.md error handling guidelines, these should return proper Result<> with context instead of panicking in production code paths.

Moderate Issues

3. Missing Test Coverage - No tests added for the new message ID format, collision detection logic, or backwards compatibility handling.

4. Message Index Overflow - proxy_service.rs uses wrapping_add(1) for message_index (u16). After 65,535 messages it wraps to 0. Consider adding a comment or warning log.

5. Import Organization - Some files have imports that should be moved to the top of the file per CLAUDE.md style.

Performance & Security

  • Positive: Smaller message IDs, removed unnecessary timeout tracking
  • Security: Random ID generation is appropriate for non-cryptographic use
  • No security vulnerabilities introduced

Summary

This is a strong refactoring with good architectural decisions, but has two critical issues that must be fixed before merge:

  1. MUST FIX: Implement release_request_id() calls to prevent memory leak
  2. SHOULD FIX: Replace .expect() with proper error handling in database code

Once these are addressed, this PR will be ready to merge. The new message ID format is well-designed and the backwards compatibility approach is solid.

@claude

claude bot commented Nov 19, 2025

PR Review: New Message ID Format & Deprecated Tunnel Ack

This is a well-structured refactoring that modernizes the tunnel message ID system. The changes introduce a more efficient message identification scheme while maintaining backward compatibility. Here's my detailed review:

✅ Strengths

1. Improved Message ID Design

  • The new compact 12-byte message ID format (gateway_id: 4 bytes + request_id: 4 bytes + message_index: 2 bytes) is much more efficient than the previous UUID-based approach
  • Message IDs are now deterministically constructed from components rather than randomly generated, which improves debuggability
  • The serialization using serde_bare ensures consistent encoding across Rust and TypeScript implementations

2. Strong Backward Compatibility

  • Protocol version converters (v2_to_v3, v3_to_v2) properly handle the migration between message formats
  • The DeprecatedTunnelAck mechanism ensures older runners continue to work
  • The version_needs_tunnel_ack() check in engine/sdks/rust/runner-protocol/src/compat.rs:5 cleanly segregates legacy behavior

3. Good Type Safety

  • Type aliases (GatewayId, RequestId, MessageIndex) provide semantic clarity
  • Fixed-size arrays prevent size-related bugs at compile time
  • The TypeScript implementation mirrors the Rust types well

4. Hibernation Request Improvements

  • The updated HibernatingRequest structure now properly tracks both gateway_id and request_id (engine/sdks/schemas/runner-protocol/v3.bare:175-178)
  • Database operations correctly use composite keys for hibernating request tracking

⚠️ Issues & Suggestions

1. Request ID Collision Handling ⚠️ Medium Priority

In engine/packages/guard-core/src/proxy_service.rs:657-684, the collision detection loop only tries 100 times before giving up. While collisions are extremely unlikely with 4 billion possibilities:

const MAX_TRIES: u32 = 100;

Issue: If the system has sustained high load with many concurrent requests, there's a non-zero chance of exhausting attempts, which would cause request failures.

Suggestions:

  • Consider logging metrics when collisions occur to monitor if this becomes an issue in production
  • Alternatively, increase MAX_TRIES to a higher value (e.g., 1000) for better safety margin
  • The collision warning at line 674 is good but could include the current inflight count for debugging

2. Message Index Wrapping Logic ℹ️ Low Priority

The message index uses u16::wrapping_add for incrementing (engine/packages/pegboard-gateway/src/shared_state.rs:150):

req.message_index = req.message_index.wrapping_add(1);

While wrapping is intentional, there's no explicit comment explaining this design choice. After ~65k messages, the index wraps to 0.

Suggestion: Add a comment explaining the wrapping behavior is intentional and acceptable for the use case.

3. WebSocket Message Index Extraction 🐛 Potential Bug

In engine/sdks/rust/runner-protocol/src/versioned.rs:1295-1296, when converting v3 to v2, the code extracts the message index using:

let index = u16::from_be_bytes([message_id[8], message_id[9]]);

Issue: This hardcodes big-endian extraction, but the MessageId is serialized using serde_bare. You should verify that serde_bare serializes u16 as big-endian. If it uses little-endian or a variable-length encoding, this extraction will produce incorrect values.

Recommendation:

  • Verify the byte order used by serde_bare for u16 serialization
  • Consider using the parseMessageId function instead of manual byte extraction for consistency (see the sketch after this list)
  • Add a test case to ensure round-trip conversion preserves the message index correctly
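
Sketch of the second suggestion, assuming a parse_message_id helper equivalent to the one in pegboard::tunnel::id is reachable from the converter and that the converter can propagate an anyhow error:

// Instead of hardcoding the offset and endianness
// (needs `use anyhow::Context;` in scope):
let parts = parse_message_id(message_id)
    .context("failed to parse message id during v3 -> v2 conversion")?;
let index = parts.message_index;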

4. Cleanup in GC Logic ✅ Good Fix

The removal of MESSAGE_ACK_TIMEOUT and associated pending_msgs tracking in shared_state.rs:377-467 is correct since v3 no longer uses tunnel acks. However, I noticed a potential typo at line 410:

if now.duration_since(earliest_pending_ws_msg.send_instant)
    > HWS_MESSAGE_ACK_TIMEOUT  // Changed from <= to >
{

This is actually correct - the comment just needs to clarify this was intentional (garbage collect if timeout exceeded).

5. Missing Error Context ℹ️ Low Priority

In engine/packages/pegboard-gateway/src/shared_state.rs:212, when parsing fails:

Err(err) => {
    tracing::error!(?err, "failed to parse message id");
    continue;
}

Suggestion: Include the raw message_id bytes in the error log for debugging:

tracing::error!(?err, message_id=?msg.message_id, "failed to parse message id");

6. Request ID Cleanup ⚠️ Medium Priority

The release_request_id function is added but I don't see it being called in the diff. The inflight_requests HashSet will grow unbounded if IDs aren't released.

Action Required: Verify that release_request_id is called when:

  • A request completes successfully
  • A request fails/times out
  • A WebSocket connection closes

If this is missing, it's a memory leak.

📝 Minor Code Quality Notes

  1. Logging consistency (engine/packages/pegboard-gateway/src/lib.rs:216, 352, 418): Using tunnel_id::request_id_to_string for debug formatting is good for human readability

  2. Empty PR body: The PR has no description explaining the motivation and changes. Consider adding context for future reference.

  3. Hard tabs: Code follows the rustfmt.toml convention correctly ✅

🧪 Testing Recommendations

  1. Add a test verifying the buildMessageId → parseMessageId round-trip in both Rust and TypeScript
  2. Test protocol version conversion with actual WebSocket messages containing indices
  3. Verify request ID collision handling under high concurrent load
  4. Test hibernation request tracking with the new composite keys

Summary

This is a solid refactoring with thoughtful backward compatibility. The main concerns are:

  • Critical: Verify release_request_id() is being called (potential memory leak)
  • Important: Verify the u16 byte order assumption in the v3→v2 converter
  • Nice-to-have: Improve collision handling metrics and documentation

Overall: ✅ Approve with requested verification of the critical issues above

@claude

claude bot commented Nov 19, 2025

PR Review: Add New Message ID Format & Deprecated Tunnel Ack

Overview

This PR introduces a significant refactor to the message ID system in the Pegboard gateway, moving from UUID-based IDs to a more compact and structured format. The changes touch critical paths in the proxy service and gateway communication.

Code Quality & Best Practices ✅

Strengths:

  • Well-structured message ID system with clear component separation (GatewayId, RequestId, MessageIndex)
  • Good use of type aliases for clarity (type GatewayId = [u8; 4], etc.)
  • Proper error handling with context using anyhow
  • Clean separation of concerns between ID generation and parsing

Areas for Improvement:

1. Inconsistent Whitespace (engine/packages/guard-core/src/proxy_service.rs:641)

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);

Mixed tabs and spaces. The leading whitespace appears to be spaces instead of tabs. Per CLAUDE.md, this project uses hard tabs for Rust formatting.

2. Missing Import Consolidation (engine/packages/guard-core/src/proxy_service.rs:16)

The import for pegboard::tunnel::id is added separately:

use pegboard::tunnel::id::{RequestId, generate_request_id};

Consider grouping this with other pegboard imports if any exist nearby.

Potential Bugs & Issues ⚠️

1. Critical: Resource Leak in Error Paths (engine/packages/guard-core/src/proxy_service.rs:1094-1173)

In the CustomServe HTTP request handler, there's a potential resource leak:

let request_id = match self
    .state
    .acquire_in_flight(client_ip, &actor_id, &req_headers)
    .await?
{
    Some(id) => id,
    None => {
        return Err(errors::RateLimit { /* ... */ }.build());
    }
};

// ... request processing loop ...
while attempts < max_attempts {
    let res = handler.handle_request(req_collected.clone(), request_context, request_id).await;
    if should_retry_request(&res) {
        // ... retry logic ...
        continue; // ⚠️ Resource still held during retries!
    }
    
    // Release only happens on non-retry path
    self.state.release_in_flight(client_ip, &actor_id, request_id).await;
    return res;
}

// Release after all retries
self.state.release_in_flight(client_ip, &actor_id, request_id).await;

Issue: The acquire_in_flight call increments both the in-flight counter AND inserts the request_id into inflight_requests HashSet. However, during the retry loop, these resources remain held. If the retry loop iterates many times (especially with backoff delays), this could:

  • Artificially inflate in-flight counts
  • Prevent request ID reuse unnecessarily
  • Cause downstream systems to see inflated connection counts

Recommendation: Consider whether the in-flight tracking should be per-attempt or per-request. If per-request is correct (current design), this is fine. If per-attempt is desired, you'll need to release/reacquire between attempts.

2. Duplicate acquire_in_flight Call (engine/packages/guard-core/src/proxy_service.rs:809 & 1100)

The acquire_in_flight is called at line 809 for all requests:

let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, req.headers()).await? {
    Some(id) => id,
    None => { return Err(errors::RateLimit { /* ... */ }.build()); }
};

Then for CustomServe routes, it's called AGAIN at line 1100:

let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, &req_headers).await? {
    // ...
}

Issue: This will:

  1. Increment the in-flight counter twice
  2. Generate two different request IDs (the first one is overwritten)
  3. Insert two request IDs into inflight_requests but only release one

Impact: This is a resource leak that will eventually exhaust the request ID space and cause the generate_unique_request_id function to fail after MAX_TRIES attempts.

Recommendation: Remove the first acquire_in_flight call (line 809) for CustomServe routes, or ensure it's only called once per request lifecycle.

3. Collision Handling May Cause Performance Issues (engine/packages/pegboard/src/tunnel/id.rs:67-69 & engine/packages/guard-core/src/proxy_service.rs:667-694)

pub fn generate_request_id() -> RequestId {
    rand::random::<u32>().to_le_bytes()
}

async fn generate_unique_request_id(&self) -> anyhow::Result<RequestId> {
    const MAX_TRIES: u32 = 100;
    let mut requests = self.inflight_requests.lock().await;
    
    for attempt in 0..MAX_TRIES {
        let request_id = generate_request_id();
        if !requests.contains(&request_id) {
            requests.insert(request_id);
            return Ok(request_id);
        }
        // Collision logging...
    }
    bail!("failed to generate unique request id after {} attempts", MAX_TRIES);
}

Issue: With 4-byte request IDs (2^32 ≈ 4 billion possibilities), collisions become likely when you have ~65,536 concurrent requests (birthday paradox). The mutex lock is held during the entire retry loop, which could become a bottleneck under high concurrency.

Recommendations:

  • Monitor collision rates in production (good that you're logging them)
  • Consider 8-byte request IDs if collision rates become problematic
  • Consider lock-free alternatives or reducing lock contention (acquire lock per attempt instead of holding across all attempts)

4. WebSocket Resource Leak in Error Path (engine/packages/guard-core/src/proxy_service.rs:1920-2145)

Similar to issue #1, the WebSocket handling in CustomServe acquires resources but may not release them properly in all error paths. Specifically, the spawned task acquires request_id but the release only happens at the very end (line 2142). If the task panics or is cancelled before reaching line 2142, the resources leak.

Recommendation: Use a Drop guard or ensure release happens in a finally-equivalent pattern (e.g., using scopeguard crate or manual defer pattern).
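
A sketch of the finally-equivalent option, wrapping the existing task body so the release runs on both success and error exits (note this still does not cover panics or task cancellation, which is what a Drop guard would add):

tokio::spawn(
    async move {
        let result = async {
            // ... existing WebSocket handling from the spawned task ...
            anyhow::Ok(())
        }
        .await;

        // Runs whether the inner block returned Ok or Err.
        state.release_request_id(request_id).await;
        result
    }
    .instrument(tracing::info_span!("handle_ws_task_custom_serve")),
);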

Performance Considerations 🔍

1. Serialization on Hot Path (engine/packages/pegboard/src/tunnel/id.rs:28-31)

let mut message_id = [0u8; 12];
let mut cursor = std::io::Cursor::new(&mut message_id[..]);
serde_bare::to_writer(&mut cursor, &parts).context("failed to serialize message id parts")?;

Observation: Message IDs are built on every request/response message, making this a hot path. Using serde_bare serialization adds overhead compared to manual byte packing.

Measurement Needed: Profile whether this is a bottleneck. If message throughput is high, consider:

pub fn build_message_id(gateway_id: GatewayId, request_id: RequestId, message_index: MessageIndex) -> MessageId {
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&gateway_id);
    id[4..8].copy_from_slice(&request_id);
    id[8..12].copy_from_slice(&message_index.to_le_bytes());
    id
}

This would be ~10x faster but loses the benefit of having serde maintain the format contract.

2. Lock Contention on inflight_requests (engine/packages/guard-core/src/proxy_service.rs:662-663, 669-678)

The inflight_requests HashSet is protected by a single Mutex, which is acquired for every request start and end. Under high request rates, this could become a bottleneck.

Recommendations:

  • Consider sharding the HashSet (e.g., 16 shards based on request_id hash)
  • Use dashmap::DashSet for lock-free concurrent access (sketched below)
  • Profile to confirm this is actually a bottleneck before optimizing
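
Sketch of the dashmap option, assuming the dashmap crate is added as a dependency; DashSet shards its buckets internally, so insert/remove no longer serializes on one Mutex:

use dashmap::DashSet;

struct InflightRequests {
    ids: DashSet<[u8; 4]>,
}

impl InflightRequests {
    /// Returns false if the id was already in flight (i.e. a collision).
    fn try_insert(&self, request_id: [u8; 4]) -> bool {
        self.ids.insert(request_id)
    }

    fn release(&self, request_id: &[u8; 4]) {
        self.ids.remove(request_id);
    }
}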

Security Concerns 🔒

1. Request ID Predictability

Using rand::random::<u32>() without specifying a cryptographically secure RNG could make request IDs predictable if the RNG state is compromised.

Impact: Low to Medium - An attacker who can predict request IDs could potentially:

  • Inject messages into other users' request streams
  • Cause targeted collisions to DoS specific requests

Recommendation: Use rand::thread_rng().gen::<u32>() or explicitly use a CSPRNG like getrandom if security is a concern for request routing.

2. No Validation of Message Index

The MessageIndex (u16) can wrap around after 65,536 messages. While wrapping_gt is used for comparison, there's no validation that prevents reuse of the same (gateway_id, request_id, message_index) tuple if a request lives long enough.

Impact: Low - Most requests won't send >65k messages, but long-lived WebSocket connections could.

Recommendation: Document the wrap-around behavior and ensure it's tested.

Test Coverage 📊

Observations:

  • The PR modifies 50 files with significant logic changes
  • Test file engine/packages/guard-core/tests/proxy.rs was not modified
  • No new tests added for:
    • Request ID generation and collision handling
    • Message ID serialization/deserialization
    • Resource cleanup in error paths
    • WebSocket hibernation with new message format

Recommendations:

  1. Add unit tests for pegboard/src/tunnel/id.rs:

    • Test build_message_id and parse_message_id round-trip
    • Test edge cases (max values for each component)
    • Test error handling for malformed message IDs
  2. Add integration tests for resource management:

    • Verify acquire_in_flight / release_in_flight pairing
    • Test request ID uniqueness under concurrent load
    • Verify no leaks when requests fail/timeout
  3. Add WebSocket hibernation tests:

    • Test message replay after hibernation with new message format
    • Test message acknowledgment with wrapping indices
    • Test cleanup when hibernation fails

Summary

Overall Assessment: This is a well-architected refactor that improves message ID structure and reduces overhead. However, there are critical resource leaks that need to be addressed before merging.

Must Fix Before Merge:

  • Remove the duplicate acquire_in_flight call for CustomServe routes (issue 2)
  • Ensure request IDs and in-flight slots are released on all error and panic paths, including the spawned WebSocket task (issues 1 and 4)

Recommended Improvements:

  • Add test coverage for new message ID format
  • Consider performance optimization for message ID serialization if profiling shows it's a bottleneck
  • Review cryptographic security requirements for request ID generation
  • Fix whitespace inconsistencies per CLAUDE.md formatting rules

Nice to Have:

  • Monitor collision rates in production
  • Add metrics for request ID generation attempts
  • Document wrap-around behavior for MessageIndex

Great work on the refactor! The new message ID format is much more efficient than UUIDs. Just needs the resource leak issues addressed.

🤖 Generated with Claude Code

@claude

claude bot commented Nov 19, 2025

Pull Request Review

Summary

This PR introduces a new message ID format for the tunnel protocol and adds backward compatibility for deprecated tunnel acknowledgments. The changes migrate from a 16-byte UUID-based message ID to a more efficient 12-byte structured format.

Code Quality - Strengths

  • Well-structured protocol versioning with clean v2 to v3 migration
  • Clear separation of concerns with new tunnel/id.rs module
  • Backward compatibility maintained with DeprecatedTunnelAck
  • Consistent error handling using anyhow::Result

CRITICAL BUG - Double In-Flight Acquisition

Location: engine/packages/guard-core/src/proxy_service.rs:1097-1114

The CustomServe path acquires in-flight limits TWICE:

  1. First at line 812-828 in handle_request()
  2. Again at line 1097-1114 inside handle_http_request()

Impact: CustomServe HTTP requests count twice against the in-flight limit, reducing effective capacity by 50%.

Fix: Remove the second acquire_in_flight call in the CustomServe branch.

Other Issues

1. Request ID Collision Detection (proxy_service.rs:465-495)

  • Uses 4-byte random space with 100 max retries
  • Birthday paradox suggests collisions likely around 65K concurrent requests
  • Recommendation: Add metrics to track collision rates

2. Message Index Wrap-Around (pegboard-gateway/src/shared_state.rs:136-151)

  • u16 wraps after 65,535 messages
  • Uses wrapping_gt for comparison (correct)
  • Recommendation: Add comment and test case for wrap-around behavior

3. Lock Contention (proxy_service.rs:381, 465)

  • Global mutex on inflight_requests HashSet
  • Potential bottleneck at high request rates
  • Recommendation: Consider DashSet or RwLock

4. Test Coverage
Missing edge case tests for:

  • Collision handling
  • Message index wrap-around
  • Concurrent request ID generation
  • In-flight limit edge cases

Recommendations

High Priority:

  1. Fix double in-flight acquisition (BLOCKING)
  2. Evaluate lock contention under load
  3. Add test coverage for edge cases

Medium Priority:
4. Document wrap-around behavior
5. Add collision rate metrics

Low Priority:
6. Clean up unused imports
7. Document RNG security properties

Conclusion

Solid refactoring with good protocol versioning. However, the double in-flight acquisition bug should be fixed before merging. The code follows Rivet conventions well. Great work on backward compatibility!

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from d6a2e79 to adeb05d on November 20, 2025 03:07
@claude

claude bot commented Nov 20, 2025

PR Review: New Message ID Format & Deprecated Tunnel Ack

I've reviewed this PR and have several observations about code quality, potential issues, and areas for improvement.

Summary

This PR introduces a new compact message ID format (12 bytes instead of 16) and refactors in-flight request management. The changes touch critical path code in the gateway and proxy services.


Positive Aspects

1. Well-Designed Message ID Structure

The new message ID format (gateway_id (4) + request_id (4) + message_index (2)) is efficient and provides good collision resistance:

  • Uses BARE serialization for consistent encoding
  • Helper functions for parsing/stringifying are clean
  • Type aliases improve code readability

2. Proper Version Compatibility

The backward compatibility handling in versioned.rs is thorough:

  • Proper conversion between v2 and v3 protocols
  • DeprecatedTunnelAck preserves compatibility
  • Message ID extraction/construction handles padding correctly

3. Collision Detection

The generate_unique_request_id() function properly handles the (extremely rare) case of ID collisions with retry logic and warning logs.


Issues & Concerns

🔴 Critical: Potential Resource Leak

Location: engine/packages/guard-core/src/proxy_service.rs:809-870

There's a problematic pattern where acquire_in_flight() is called twice for the same request in some code paths:

  1. Line 812: First acquisition in handle_request()
  2. Line 1100: Second acquisition in CustomServe path

Problem: The second acquisition will succeed and generate a different request ID, but the first request ID is never released. This creates:

  • Request ID leak in the HashSet
  • In-flight counter imbalance
  • Memory leak over time

Affected code:

// First acquire - line 812
let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, req.headers()).await? {
    Some(id) => id,
    None => { return Err(errors::RateLimit{...}.build()); }
};

// ... later in CustomServe path - line 1100
let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, &req_headers).await? {
    Some(id) => id,  // NEW ID generated, old one leaked!
    None => { return Err(errors::RateLimit{...}.build()); }
};

Recommendation:

  • Pass the request_id through to custom handlers instead of re-acquiring
  • Add a #[must_use] attribute to the RequestId type to catch unused IDs
  • Consider RAII pattern with a guard that auto-releases on drop

🟡 Medium: Race Condition Risk

Location: engine/packages/guard-core/src/proxy_service.rs:666-694

async fn generate_unique_request_id(&self) -> anyhow::Result<RequestId> {
    const MAX_TRIES: u32 = 100;
    let mut requests = self.inflight_requests.lock().await;  // Lock held during loop
    
    for attempt in 0..MAX_TRIES {
        let request_id = generate_request_id();
        if !requests.contains(&request_id) {
            requests.insert(request_id);
            return Ok(request_id);
        }
        // Lock still held while logging collision!
        tracing::warn!(?request_id, attempt, "request id collision, generating new id");
    }
}

Issues:

  1. Mutex held during entire retry loop (up to 100 iterations)
  2. Logging while holding lock can cause contention
  3. With 4 bytes = ~4 billion possibilities, collisions should be vanishingly rare under normal conditions
  4. If collisions are frequent (100+ attempts failing), something is seriously wrong and should fail fast

Recommendation:

  • Release lock before logging, re-acquire after (see the sketch after this list)
  • Reduce MAX_TRIES to 10-20 (if you hit this, you have bigger problems)
  • Add metrics to track collision frequency
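
A sketch combining the first two points, assuming the same Mutex<HashSet<RequestId>> field named in this review; the lock is scoped per attempt and dropped before any logging:

async fn generate_unique_request_id(&self) -> anyhow::Result<RequestId> {
    const MAX_TRIES: u32 = 16;

    for attempt in 0..MAX_TRIES {
        let request_id = generate_request_id();
        {
            let mut requests = self.inflight_requests.lock().await;
            // HashSet::insert returns true when the id was not already present.
            if requests.insert(request_id) {
                return Ok(request_id);
            }
        } // lock dropped here, before logging
        tracing::warn!(?request_id, attempt, "request id collision, retrying");
    }

    anyhow::bail!("failed to generate unique request id after {MAX_TRIES} attempts")
}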

🟡 Medium: Indentation Issue

Location: engine/packages/guard-core/src/proxy_service.rs:641

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);  // Mixed spaces/tabs
    Ok(request_id)

This line uses spaces instead of hard tabs (violates rustfmt.toml config). Will be caught by cargo fmt but should be fixed.

🟢 Minor: Unnecessary Code Duplication

Location: engine/packages/pegboard-gateway/src/lib.rs:213, 349

The error logging pattern is duplicated:

tracing::warn!(
    request_id=?tunnel_id::request_id_to_string(&request_id),
    "received no message response during request init",
);

Recommendation: Extract to a helper function or use a macro.
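
A sketch of such a helper; the function name and phase field are illustrative, and the request_id_to_string signature is assumed from its other call sites in this PR:

fn warn_missing_response(request_id: &RequestId, phase: &str) {
    tracing::warn!(
        request_id = ?tunnel_id::request_id_to_string(request_id),
        phase,
        "received no message response",
    );
}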

🟢 Minor: Missing Error Context

Location: engine/packages/pegboard/src/tunnel/id.rs:45-47

pub fn parse_message_id(message_id: MessageId) -> Result<protocol::MessageIdParts> {
    serde_bare::from_slice(&message_id).context("failed to deserialize message id")
}

Could include the actual message_id bytes in the error context for better debugging:

serde_bare::from_slice(&message_id)
    .with_context(|| format!("failed to deserialize message id: {:?}", message_id))

Security Considerations

Randomness Source

Uses rand::random() for ID generation - secure for this use case since these aren't cryptographic tokens.

Collision Resistance

4 bytes (2^32 = ~4 billion IDs) provides good collision resistance given:

  • IDs are short-lived (request duration)
  • Active tracking prevents actual collisions
  • Birthday paradox: ~77k concurrent requests for a ~50% collision probability (roughly 9k for 1%)

⚠️ DoS Potential

The in-flight tracking uses a HashSet that grows unbounded. Under extreme load or attack:

  • HashSet could grow very large
  • Memory exhaustion possible
  • Consider adding max in-flight limit per gateway

Testing Recommendations

  1. Unit test the collision retry logic with a mock that forces collisions
  2. Integration test for the request ID lifecycle (acquire → use → release)
  3. Stress test with many concurrent requests to verify no leaks
  4. Test the v2→v3 protocol conversion with hibernation scenarios
  5. Test the double-acquisition bug path (CustomServe requests)

Performance Considerations

Positive

  • Smaller message IDs (12 vs 16 bytes) reduce bandwidth
  • Removal of unused TunnelAck reduces message overhead
  • Message index wrapping allows compact u16 storage

Concerns

  • Global mutex on inflight_requests could become a bottleneck under high concurrency
  • Consider using a sharded lock (e.g., DashMap) if this becomes an issue

Code Style Adherence

Follows Conventions

  • Hard tabs used (mostly)
  • Structured logging with tracing
  • Proper error types with RivetError derive
  • Lowercase log messages

Violations

  • One line with space indentation (line 641)
  • Some anyhow::bail! instead of structured errors (acceptable for internal functions)

Recommendation

Before merging:

  1. 🔴 Fix the double-acquisition bug - this is a resource leak
  2. 🟡 Fix the indentation issue
  3. 🟡 Consider adding metrics for request ID collision rate
  4. Add test coverage for the request lifecycle

After merging:

  • Monitor for any increase in memory usage (potential ID leak indicator)
  • Watch for collision warnings in logs (should be extremely rare)

Overall, this is a solid refactor with good protocol design. The main blocker is the resource leak issue that needs to be addressed.

@claude

claude bot commented Nov 20, 2025

Code Review: New Message ID Format & Deprecated Tunnel Ack

I've reviewed this PR and found the implementation to be well-structured with good attention to backward compatibility. Here are my findings:

✅ Strengths

  1. Well-designed ID structure: The new message ID format (gateway_id + request_id + message_index = 12 bytes) is compact and provides good uniqueness guarantees across distributed gateways.

  2. Proper backward compatibility: The protocol version converters handle v2↔v3 transitions correctly, including:

    • Extracting/packing message IDs between formats
    • Converting TunnelAck to DeprecatedTunnelAck
    • Preserving WebSocket message indices during conversion
  3. Collision handling: The generate_unique_request_id() function includes retry logic with proper logging for the rare case of collisions.

  4. Clean separation of concerns: The tunnel::id module provides well-organized utilities for ID generation, parsing, and string conversion.

  5. Hibernation improvements: The new system properly tracks pending WebSocket messages using message indices rather than UUIDs.

🔍 Issues & Concerns

1. Potential Double In-Flight Acquisition in proxy_service.rs

In proxy_service.rs:809-826 and again at proxy_service.rs:1094-1116, the code acquires in-flight limits for CustomServe requests. However, there's already an acquisition at the beginning of handle_request() (line ~806-824). This means CustomServe requests may be acquiring the in-flight limit twice, which could:

  • Consume double the in-flight slots
  • Potentially create incorrect rate limiting behavior

Location: engine/packages/guard-core/src/proxy_service.rs:1094-1116

Recommendation: Review if CustomServe needs its own in-flight tracking or if it should reuse the parent's acquisition.

2. Whitespace Issue (Line 641)

There's inconsistent whitespace indentation on line 641:

        let request_id = Some(self.generate_unique_request_id().await?);

Should use hard tabs consistently with the rest of the codebase (per CLAUDE.md).

Location: engine/packages/guard-core/src/proxy_service.rs:641

3. Missing Import Organization

Several files have inline imports or could benefit from reorganization:

  • pegboard-gateway/src/lib.rs has imports mixed with usage
  • Following the project's style guide, imports should be at the top of files

Locations:

  • engine/packages/pegboard-gateway/src/lib.rs:8
  • engine/packages/guard-core/src/proxy_service.rs:13

4. Error Handling: Serialization Size Verification

In tunnel::id::build_message_id() (line 34-39), there's a good verification that exactly 12 bytes were written. However, this could theoretically panic in production if the BARE serialization changes unexpectedly.

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

Recommendation: Consider if this should be a more graceful error or if a panic is acceptable (current approach seems fine for catching protocol bugs early).

5. WebSocket Index Extraction in v3→v2 Conversion

The conversion extracts message index from bytes 8-9 of the message_id:

let index = u16::from_le_bytes([message_id[8], message_id[9]]);

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1296

This assumes the BARE serialization format for MessageIdParts is stable. While this is likely correct (4 bytes + 4 bytes + 2 bytes = 10 bytes), it would be safer to:

  • Add a comment documenting the expected layout
  • Consider parsing the message_id properly instead of raw byte access

6. Request ID HashSet Management

The inflight_requests HashSet in ProxyState properly prevents collisions, but:

  • If release_in_flight() is never called (e.g., panic or early return), the request ID leaks
  • No cleanup mechanism for leaked IDs

Location: engine/packages/guard-core/src/proxy_service.rs:662-664

Recommendation: Consider adding a timeout-based cleanup or using a time-based cache like the existing rate limiters.
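
A sketch of the timeout-based option, assuming the in-flight ids were tracked with the Instant they were registered (a HashMap rather than the current HashSet) and swept by a periodic task:

use std::collections::HashMap;
use std::time::{Duration, Instant};

const REQUEST_ID_TTL: Duration = Duration::from_secs(60 * 60);

fn sweep_stale_request_ids(inflight: &mut HashMap<[u8; 4], Instant>) {
    let now = Instant::now();
    // Drop entries registered longer than the TTL ago; these were presumably
    // leaked by a code path that never called release.
    inflight.retain(|_, registered_at| now.duration_since(*registered_at) < REQUEST_ID_TTL);
}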

7. Wrapping Arithmetic for Message Indices

The code uses wrapping_add(1) for message index increment and has a wrapping_gt() helper for comparison. This is good, but:

  • No documentation on what happens when indices wrap around
  • Potential edge case if a single request sends >65k messages

Location: engine/packages/pegboard-gateway/src/shared_state.rs:150 and shared_state.rs:468

Recommendation: Add comments documenting the wrapping behavior and confirm this is acceptable for the use case.

🧪 Testing Recommendations

  1. Test collision handling: Verify the retry logic works under high load
  2. Test request ID cleanup: Ensure IDs are properly released in all code paths
  3. Test protocol version conversion: Verify v2↔v3 conversions preserve message indices correctly
  4. Test wrapping behavior: Verify message indices work correctly when wrapping at u16::MAX
  5. Test hibernation with new IDs: Ensure hibernating requests properly resume with the new ID format

📝 Minor Suggestions

  1. Logging consistency: Some logs use ?request_id format while others use request_id=?. Consider standardizing.

  2. Magic number: MAX_TRIES: u32 = 100 in collision handling seems reasonable but could be configurable.

  3. Documentation: Consider adding module-level docs to tunnel::id explaining the ID structure and design decisions.

Summary

This is a solid refactoring that improves the message ID system while maintaining backward compatibility. The main concern is the potential double in-flight acquisition for CustomServe requests. The other issues are minor and mostly about code polish.

Overall Assessment: ✅ Approve with minor concerns

The code is well-structured and follows good practices. Address the in-flight acquisition issue and the whitespace problem, and this will be ready to merge.

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from 9d2a169 to e98d4d0 on November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from bdf8c0e to bb070a3 on November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed this PR which adds a new message ID format and deprecated tunnel ack functionality to the pegboard-gateway. Overall, the changes are well-structured and improve the tunnel protocol architecture. Here are my findings:

✅ Strengths

1. Improved Message ID Architecture

  • The new composite message ID format (GatewayId + RequestId + MessageIndex) in engine/packages/pegboard/src/tunnel/id.rs is a significant improvement
  • Using 4-byte IDs (u32) for gateway and request provides good collision resistance (4 billion possibilities each)
  • The serialization using serde_bare is appropriate and includes proper validation (12-byte size check)
  • Helper functions for base64 encoding/decoding are well-designed

2. Request ID Uniqueness

  • The generate_unique_request_id() method with collision detection and retry logic is excellent (proxy_service.rs:667-694)
  • Using a HashSet to track in-flight requests prevents collisions
  • Warning logs on collision with attempt counter aids debugging

3. Better Resource Management

  • Proper in-flight counter tracking with request ID release on completion
  • Hibernation state management improved with message_index tracking instead of last_ws_msg_index
  • Removed deprecated pending_msgs tracking in favor of cleaner message indexing

4. Protocol Versioning

  • Adding DeprecatedTunnelAck maintains backward compatibility while signaling deprecation
  • Protocol schema properly updated in v3.bare

⚠️ Issues & Concerns

1. Critical: Double Acquire Bug in CustomServe Path ⚠️
In proxy_service.rs, the CustomServe handler acquires in-flight limit twice:

  • Once at line 809-825 in handle_service()
  • Again at lines 1100-1115 in the CustomServe match arm

This means custom serve requests acquire two slots instead of one, causing:

  • Rate limiting triggers at 50% of intended capacity
  • Request ID set contains duplicate entries
  • Potential deadlocks if limit equals concurrent requests

Fix: Remove the duplicate acquire in the CustomServe match arm (lines 1099-1115) since it's already handled at the top of handle_service().

2. Indentation Issue
Line 641 in proxy_service.rs has incorrect indentation (spaces instead of hard tabs):

        let request_id = Some(self.generate_unique_request_id().await?);  // <- uses spaces

Should use hard tabs per rustfmt.toml configuration.

3. Missing Error Context
In shared_state.rs:211-214, when message ID parsing fails, the error log doesn't include the raw message data which could help debugging:

Err(err) => {
    tracing::error!(?err, message_id=?msg.message_id, "failed to parse message id");
    continue;
}

Consider adding context about the expected format.

4. Potential Memory Leak in Request Tracking
The inflight_requests HashSet in ProxyState (line 382) is only cleaned up when:

  • Request completes normally
  • Release is explicitly called

However, if a request handler panics or the release task fails, the request ID remains forever. Consider:

  • Adding the request ID cleanup to a drop guard
  • Implementing a periodic GC for stale request IDs (similar to the hibernation GC)

5. WebSocket Hibernation: Inconsistent acquire/release
In the websocket upgrade path (proxy_service.rs:1922-1938), request ID is acquired inside the spawned task. If the spawn fails or task panics before the final release (line 2141), the ID leaks. Consider:

  • Acquiring before spawn
  • Using a guard pattern for cleanup

📝 Minor Observations

6. Magic Constants

  • MAX_TRIES: u32 = 100 (line 668) - This seems quite high for a 4-byte ID space. Even 10 retries would be extremely safe. Consider reducing to 10-20.
  • The collision probability is ~2.3e-6 per request assuming 10k concurrent requests, making 100 retries overkill.

7. Missing Test Coverage
No new tests were added for:

  • Message ID generation and parsing
  • Request ID collision handling
  • The new hibernation request upsert flow with gateway_id

Consider adding unit tests for pegboard::tunnel::id module.

8. Documentation
The new ID format is well-structured but lacks module-level documentation explaining:

  • Why 4 bytes for each component
  • The endianness choice (little-endian)
  • The lifetime of each ID type

🔒 Security Considerations

9. Random Number Generation
Using rand::random::<u32>() (lines 13, 67) without specifying the RNG could use a weak source in some contexts. However, for IDs (not cryptographic secrets), this is acceptable.

10. No Timing Attacks
The collision detection is constant-time per attempt (HashSet lookup), no security concerns here.

🎯 Recommendations

High Priority:

  1. Fix the double-acquire bug in CustomServe path
  2. Fix indentation on line 641
  3. Add drop guard or periodic GC for request ID cleanup

Medium Priority:
4. Add unit tests for tunnel::id module
5. Reduce MAX_TRIES to 10-20
6. Add module-level documentation

Low Priority:
7. Consider adding message format context to parsing error logs
8. Document the ID lifetime and reuse policy

Summary

This is a solid refactoring that improves the protocol architecture. The new message ID format is well-designed and the hibernation improvements are good. However, the critical double-acquire bug needs to be fixed before merging as it will cause unexpected rate limiting behavior in production.

The code follows Rivet conventions well (proper error handling, structured logging, workspace dependencies). Once the double-acquire issue is resolved and indentation fixed, this will be a good addition to the codebase.


Review generated with Claude Code

@claude

claude bot commented Nov 20, 2025

Pull Request Review: This PR introduces a significant refactoring of the tunnel messaging system. Key findings: 1) Improved 12-byte message ID format vs previous 16-byte UUIDs. 2) Good backward compatibility. 3) CRITICAL BUG: Double in-flight acquisition in CustomServe path (proxy_service.rs:815 and 1100) causing incorrect rate limiting. 4) Formatting issue at line 641 (mixed indentation). 5) Consider adding tests for ID collision, message wraparound, and protocol conversion. Recommendation: Fix double acquisition bug before merge.

@claude

claude bot commented Nov 20, 2025

Detailed Review Points:

POSITIVES:

  • New 12-byte message ID (gateway_id+request_id+message_index) is more efficient than 16-byte UUIDs
  • Excellent backward compatibility with v1/v2/v3 protocol versions
  • Collision detection with retry logic is well-implemented
  • Proper cleanup of request IDs prevents memory leaks

ISSUES FOUND:

  1. DOUBLE IN-FLIGHT ACQUISITION BUG (proxy_service.rs)

    • Line ~815: First acquire_in_flight in handle_request
    • Line ~1100: Second acquire_in_flight in CustomServe path
    • Impact: CustomServe routes consume 2 rate limit slots instead of 1
    • Fix: Remove the second acquisition since the first already handles it
  2. FORMATTING (proxy_service.rs:641)

    • Mixed tabs/spaces indentation violates CLAUDE.md convention
  3. SECURITY:

    • Ensure rand::random uses CSRNG for production
    • Consider upper bound on inflight_requests HashSet to prevent DoS
  4. MISSING DOCS:

    • Message index wraparound behavior (u16 wraps at 65535)
    • BARE encoding wire format assumptions in protocol conversion
    • Collision probability for 4-byte request IDs
  5. TEST COVERAGE GAPS:

    • Request ID collision handling
    • Message index wraparound
    • Protocol version conversion edge cases

@claude
Copy link

claude bot commented Nov 21, 2025

Pull Request Review

I've reviewed PR #3492 which introduces a new message ID format and adds deprecated tunnel acknowledgment functionality. Here's my detailed feedback:

Overview

This PR refactors the tunnel message identification system from UUID-based IDs to a composite ID structure (gateway_id + request_id + message_index) and simplifies message acknowledgment by removing the deprecated TunnelAck system in favor of WebSocket message acknowledgment only.


Strengths

  1. Well-structured ID system: The new pegboard/tunnel/id.rs module provides a clean abstraction with proper type aliases and conversion utilities
  2. Collision detection: Good defensive programming with the collision retry logic in generate_unique_request_id() (proxy_service.rs:668-691)
  3. Proper error handling: Uses anyhow::Context throughout for actionable error messages
  4. Backward compatibility: Version conversion logic preserves compatibility with older protocol versions

🔴 Critical Issues

1. Race Condition in In-Flight Request Management (HIGH PRIORITY)

Location: guard-core/src/proxy_service.rs:809-830, 1094-1164, 1912-2138

There's a serious bug where in-flight limits are acquired twice for the same request in some code paths:

// First acquisition at line 809
let request_id = match self.state.acquire_in_flight(...).await? { ... };

// Then later for CustomServe routes, it acquires AGAIN at line 1100
let request_id = match self.state.acquire_in_flight(...).await? { ... };

Problem:

  • For CustomServe routes, the code acquires the in-flight slot twice but only releases once
  • This will cause the in-flight counter to leak, eventually blocking all requests even when capacity is available
  • Similar issue exists for WebSocket upgrades (line 1921)

Fix: Remove the duplicate acquire_in_flight calls inside the match arms for ResolveRouteOutput::CustomServe and WebSocket handling, since acquisition already happened at the top of handle_request.

2. Request ID Not Released on Early Returns (HIGH PRIORITY)

Location: guard-core/src/proxy_service.rs:876-889

The code spawns a task to release in-flight counters, but this only happens on normal returns:

let state_clone = self.state.clone();
tokio::spawn(async move {
    state_clone.release_in_flight(client_ip, &actor_id, request_id).await;
});

Problem: If handle_http_request or handle_websocket_upgrade returns an error before this spawn, the request ID is never released from inflight_requests HashSet.

Fix: Use RAII pattern with a guard struct that releases on drop, or move the release logic into a defer-like pattern.
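
As a minimal sketch of the guard idea (the Inflight type and field names here are stand-ins, not the actual guard-core state; since the real release_in_flight is async, a production guard would likely spawn the release from Drop, as the existing cleanup code already does):

use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type RequestId = [u8; 4];

struct Inflight {
    requests: Mutex<HashSet<RequestId>>,
}

struct InflightGuard {
    state: Arc<Inflight>,
    request_id: RequestId,
}

impl Drop for InflightGuard {
    fn drop(&mut self) {
        // Runs on every exit path: normal return, early `?`, or panic unwind.
        // With an async release, this would tokio::spawn the cleanup instead.
        self.state
            .requests
            .lock()
            .expect("inflight lock poisoned")
            .remove(&self.request_id);
    }
}

fn acquire(state: &Arc<Inflight>, request_id: RequestId) -> Option<InflightGuard> {
    let mut set = state.requests.lock().expect("inflight lock poisoned");
    // insert() returns false when the id is already in flight (collision).
    set.insert(request_id).then(|| InflightGuard {
        state: Arc::clone(state),
        request_id,
    })
}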

3. Integer Overflow Vulnerability (MEDIUM PRIORITY)

Location: pegboard/src/tunnel/id.rs:8, shared_state.rs:40,150

pub type MessageIndex = u16;
// ...
req.message_index = req.message_index.wrapping_add(1);

Problem: While wrapping arithmetic is intentional, there's no guarantee that wrapping_gt comparison logic is correct when overflow occurs. If a long-lived connection sends 65,536+ messages, the index wraps and might cause message acknowledgment bugs.

Recommendation:

  • Add integration tests specifically for the wraparound case (see the sketch after this list)
  • Document the wraparound behavior clearly
  • Consider using u32 or u64 instead (only costs 2-6 more bytes per message)
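
For reference, a standalone sketch of the wraparound cases worth testing; the comparison below is a plausible serial-number-style reconstruction (mirroring the commented-out wrapping_lt shown later), not necessarily the exact wrapping_gt used in shared_state.rs:

// Plausible reconstruction of a wrapping u16 comparison; the real
// wrapping_gt in shared_state.rs may differ in detail.
fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < u16::MAX / 2
}

#[cfg(test)]
mod wraparound_tests {
    use super::*;

    #[test]
    fn small_gaps_survive_wraparound() {
        // Index 3 is "after" 65_533 once the counter wraps past u16::MAX.
        assert!(wrapping_gt(3, 65_533));
        assert!(!wrapping_gt(65_533, 3));
    }

    #[test]
    fn ordering_inverts_past_half_the_space() {
        // With more than ~32k unacknowledged messages the comparison flips,
        // which is exactly the failure mode described above.
        assert!(!wrapping_gt(40_000, 0));
    }
}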

⚠️ High Priority Issues

4. Hardcoded Retry Limit Without Justification (MEDIUM)

Location: proxy_service.rs:668

const MAX_TRIES: u32 = 100;

With 4 billion possible IDs (2^32), the probability of a collision on the first try is approximately n/2^32 where n is the number of in-flight requests. Even with 10,000 concurrent requests, collision probability is ~0.0002%. 100 retries seems excessive and could mask underlying issues.

Recommendation: Reduce to 10 retries and add metrics to track collision frequency.
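
A rough sketch of what that could look like, assuming the rand crate the code already uses; the metric and types are illustrative only, not the PR's implementation:

use std::collections::HashSet;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

static REQUEST_ID_COLLISIONS: AtomicU64 = AtomicU64::new(0);
const MAX_TRIES: u32 = 10;

fn generate_unique_request_id(inflight: &Mutex<HashSet<[u8; 4]>>) -> Option<[u8; 4]> {
    for _ in 0..MAX_TRIES {
        let candidate = rand::random::<u32>().to_le_bytes();
        // Take the lock once per attempt instead of holding it across the loop.
        if inflight
            .lock()
            .expect("inflight lock poisoned")
            .insert(candidate)
        {
            return Some(candidate);
        }
        REQUEST_ID_COLLISIONS.fetch_add(1, Ordering::Relaxed);
    }
    None
}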

5. Missing Validation in build_message_id (MEDIUM)

Location: pegboard/src/tunnel/id.rs:17-42

The function verifies the serialization produces exactly 12 bytes, but doesn't validate that the components themselves are sensible. For example, there's no check that message_index hasn't overflowed beyond a safe threshold.

Recommendation: Add validation or at least logging when message_index is approaching wraparound (e.g., > 60000).

6. Potential Memory Leak in WebSocket Hibernation (MEDIUM)

Location: shared_state.rs:174-190

if hs.total_pending_ws_msgs_size > HWS_MAX_PENDING_MSGS_SIZE_PER_REQ
    || hs.pending_ws_msgs.len() >= u16::MAX as usize
{
    return Err(WebsocketPendingLimitReached {}.build());
}

Problem: When this limit is reached, the request fails but the HibernationState might remain in memory until GC runs. The error doesn't trigger immediate cleanup.

Recommendation: Explicitly mark the request as stopping when this error occurs to ensure timely cleanup.


📋 Code Quality Issues

7. Inconsistent Logging (LOW)

  • Some logs use ?request_id formatting (pegboard-gateway/lib.rs:216), others don't
  • Mix of tracing::warn! and tracing::debug! for similar conditions
  • Recommendation: Standardize on structured logging with ?request_id everywhere

8. Commented-Out Code (LOW)

Location: shared_state.rs:472-474

// fn wrapping_lt(a: u16, b: u16) -> bool {
//     b.wrapping_sub(a) < u16::MAX / 2
// }

Recommendation: Remove if truly unused, or document why it's kept for future use.

9. Magic Numbers (LOW)

Location: shared_state.rs:18-19

const HWS_MESSAGE_ACK_TIMEOUT: Duration = Duration::from_secs(30);
const HWS_MAX_PENDING_MSGS_SIZE_PER_REQ: u64 = util::size::mebibytes(1);

These timeouts and limits should be configurable or at least documented with justification.


🔒 Security Concerns

10. Predictable Request IDs (LOW-MEDIUM)

Location: pegboard/src/tunnel/id.rs:67

pub fn generate_request_id() -> RequestId {
    rand::random::<u32>().to_le_bytes()
}

Issue: Uses default rand::random() which might use a predictable RNG depending on the crate features. If request IDs are predictable, an attacker could potentially send malicious messages to hijack requests.

Recommendation:

  • Explicitly use rand::thread_rng() with Rng::gen() for cryptographically secure randomness (see the sketch below)
  • Or document that request IDs are not security-critical
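
A one-line illustration of the first option, assuming the rand 0.8 API:

use rand::Rng;

// Explicit thread-local CSPRNG instead of the free-standing rand::random();
// behaviorally equivalent, but the RNG source is spelled out.
fn generate_request_id() -> [u8; 4] {
    rand::thread_rng().gen::<u32>().to_le_bytes()
}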

11. No Rate Limiting on Request ID Generation (LOW)

If the collision retry loop triggers frequently, an attacker could potentially cause DoS by forcing the system to spend CPU time generating IDs.

Recommendation: Add metrics and alerts for high collision rates.


🧪 Test Coverage

12. Missing Tests (HIGH PRIORITY)

No tests found for:

  • Request ID collision handling
  • Message index wraparound behavior
  • In-flight request limit enforcement with new ID system
  • Hibernation state cleanup edge cases
  • Protocol version conversion with new message ID format

Recommendation: Add integration tests covering:

#[test]
fn test_request_id_collision_retry() { ... }

#[test]  
fn test_message_index_wraparound() { ... }

#[test]
fn test_inflight_request_cleanup_on_error() { ... }

🚀 Performance Considerations

13. Serialization in Hot Path (LOW-MEDIUM)

Location: tunnel/id.rs:29-31

let mut cursor = std::io::Cursor::new(&mut message_id[..]);
serde_bare::to_writer(&mut cursor, &parts).context(...)?;

Message ID construction happens for every single message sent. Bare encoding is efficient, but profiling would confirm whether manual bit packing would be faster.

Recommendation: Benchmark if this becomes a bottleneck. Consider manual packing:

pub fn build_message_id_fast(gateway_id: GatewayId, request_id: RequestId, index: u16) -> MessageId {
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&gateway_id);
    id[4..8].copy_from_slice(&request_id);
    id[8..10].copy_from_slice(&index.to_le_bytes());
    id
}
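
If profiling is ever warranted, a criterion harness along these lines would settle it; this assumes criterion as a dev-dependency and the build_message_id_fast sketch above in scope:

use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_message_id(c: &mut Criterion) {
    let gateway_id = [1u8, 2, 3, 4];
    let request_id = [5u8, 6, 7, 8];
    c.bench_function("build_message_id_fast", |b| {
        // Measures only the packing itself; black_box prevents constant folding.
        b.iter(|| build_message_id_fast(black_box(gateway_id), black_box(request_id), black_box(42u16)))
    });
}

criterion_group!(benches, bench_message_id);
criterion_main!(benches);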

📝 Additional Recommendations

  1. Documentation: Add module-level docs to tunnel/id.rs explaining the ID format and why it was chosen over UUIDs
  2. Metrics: Add counters for:
    • Request ID collisions
    • Message index wraparounds
    • In-flight request limit hits
  3. Type Safety: Consider newtype wrappers to prevent mixing up gateway_id and request_id:
    #[derive(Copy, Clone, Debug)]
    pub struct GatewayId([u8; 4]);

🎯 Summary

Must Fix Before Merge:

  1. ❌ Fix double acquisition of in-flight limits (Issue #1)
  2. ❌ Fix request ID leak on early returns (Issue #2)
  3. ❌ Add basic test coverage for new ID system (Issue #12)

Should Fix Soon:
4. ⚠️ Add wraparound tests for message index (Issue #3)
5. ⚠️ Use cryptographically secure RNG for request IDs (Issue #10)

Nice to Have:
6. 📊 Add metrics for collision detection
7. 📚 Improve documentation

Overall, this is a well-intentioned refactoring with good architectural improvements, but it has critical bugs that would cause production issues. The in-flight management bugs (#1, #2) are showstoppers.


Let me know if you'd like me to provide code snippets for the fixes!

@claude
Copy link

claude bot commented Nov 21, 2025

Code Review

This PR introduces a new message ID format for the pegboard gateway tunnel system, transitioning from UUID-based IDs to a more compact format. Overall, the implementation is well-structured with good backwards compatibility handling. Here are my findings:

Positive Aspects

Excellent backwards compatibility: The protocol version conversion logic (v1→v2→v3) is comprehensive and maintains compatibility with older clients.

Good error handling: Proper error handling throughout, especially in message ID parsing and serialization.

Structured logging: Uses tracing with structured fields instead of string interpolation (following CLAUDE.md guidelines).

Memory efficiency: The new 12-byte message ID format (4 bytes gateway + 4 bytes request + 2 bytes index + 2 bytes padding) is more compact than the previous UUID approach.

Issues and Concerns

🔴 Critical: Indentation in proxy_service.rs

Location: engine/packages/guard-core/src/proxy_service.rs:641

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);

This line has incorrect indentation (spaces instead of tabs). Per CLAUDE.md, "Hard tabs for Rust formatting". This should be:

	// Generate unique request ID
	let request_id = Some(self.generate_unique_request_id().await?);

🟡 Medium: Request ID Collision Handling

Location: engine/packages/guard-core/src/proxy_service.rs:667-694

The collision detection uses only 4 bytes (2^32 possibilities) for request IDs. While the code handles collisions with retry logic, there are concerns:

  1. Birthday Paradox: with roughly 77,000 concurrent requests there is about a 50% chance that at least two request IDs collide
  2. MAX_TRIES=100: This seems low if you're under heavy load with many concurrent requests
  3. Lock held during retries: The inflight_requests mutex is held during the entire loop, which could cause contention

Suggestion: Consider using 8 bytes for request IDs to drastically reduce collision probability, or implement a more sophisticated algorithm (e.g., atomically incrementing counter + timestamp + random bits).
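
A rough sketch of the counter-plus-random variant (illustrative only; the names and the 8-byte width are assumptions, not the PR's API):

use std::sync::atomic::{AtomicU32, Ordering};

static NEXT_SEQ: AtomicU32 = AtomicU32::new(0);

// 8-byte request id: the process-local monotonic counter rules out in-process
// collisions, while the random half keeps ids unpredictable across restarts.
fn generate_request_id_v8() -> [u8; 8] {
    let seq = NEXT_SEQ.fetch_add(1, Ordering::Relaxed);
    let noise: u32 = rand::random();
    let mut id = [0u8; 8];
    id[0..4].copy_from_slice(&seq.to_le_bytes());
    id[4..8].copy_from_slice(&noise.to_le_bytes());
    id
}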

🟡 Medium: Message Index Wrapping

Location: engine/packages/pegboard-gateway/src/shared_state.rs:48

req.message_index = req.message_index.wrapping_add(1);

The message index is a u16 (max 65,535). For long-lived connections with high message volume, this will wrap. While wrapping arithmetic is used correctly, the ack logic at line 327 uses wrapping_gt which assumes message ordering.

Potential Issue: If a connection sends >65k messages and wraps around while some messages are still pending ack, the comparison logic may incorrectly retain/discard messages.

Suggestion: Add a comment explaining the wrapping behavior and its limitations, or consider using a larger type (u32) for message indices.

🟡 Medium: Duplicate In-Flight Acquisition

Location: engine/packages/guard-core/src/proxy_service.rs:1097-1113

In the CustomServe path, the code acquires in-flight limit twice:

  1. First at line 807 (for all requests)
  2. Again at line 1100 (inside the CustomServe handler)

This means:

  • The in-flight counter is incremented twice for CustomServe requests
  • The request ID is generated twice
  • If the second acquisition fails (rate limited), the first acquisition is never released

Suggestion: Remove the duplicate acquisition inside the CustomServe handler since it's already handled at the top level. Or clearly document why CustomServe needs different rate limiting.

🟢 Minor: BARE Serialization Assumption

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

The code assumes BARE serialization of MessageIdParts will always produce exactly 12 bytes:

ensure!(
    written == 12,
    "message id serialization produced wrong size: expected 12 bytes, got {}",
    written
);

While this check currently passes, the raw fields only sum to 4 + 4 + 2 = 10 bytes, so the remaining 2 bytes depend on how BARE encodes the struct, and the assertion is fragile if the struct changes. Consider adding a compile-time assertion or unit test to verify this invariant.

🟢 Minor: Missing Import Organization

Location: engine/packages/guard-core/src/proxy_service.rs:16

Per CLAUDE.md: "Always add imports at the top of the file instead of inline within the function."

The new import for pegboard::tunnel::id is correctly placed at the top, good!

🟢 Minor: Inconsistent Error Messages

Location: Various logging statements

Some log messages use base64 representation (tunnel_id::request_id_to_string()) while error messages use the raw format. For consistency, consider standardizing on one format for all user-facing messages.

Security Considerations

No SQL injection: Uses parameterized queries throughout.
No obvious DOS vectors: Rate limiting is preserved and enhanced.
⚠️ Collision DoS: An attacker could theoretically trigger many collisions by sending requests that generate colliding request IDs, though this is mitigated by the retry limit.

Performance Considerations

Reduced memory footprint: 12-byte message IDs vs 16-byte UUIDs saves memory at scale.
Stack allocation: Message IDs use stack arrays instead of heap allocation.
⚠️ Mutex contention: The inflight_requests HashSet is protected by a single mutex, which could become a bottleneck under very high concurrency.

Test Coverage

The PR doesn't include new tests for:

  • Request ID collision handling
  • Message index wrapping behavior
  • Hibernation state with the new message format
  • Protocol version conversions (v2↔v3)

Recommendation: Add unit tests for these scenarios.

Recommendations

  1. Fix the indentation issue (critical for merge)
  2. Remove duplicate in-flight acquisition in CustomServe path
  3. Add unit tests for collision handling and message index wrapping
  4. Consider increasing request ID size to 8 bytes to reduce collision probability
  5. Document wrapping behavior for message indices
  6. Add integration test verifying v2↔v3 protocol compatibility

Summary

This is a solid refactoring that improves the tunnel message format. The main concerns are around the collision handling for 4-byte request IDs and duplicate resource acquisition. The indentation issue needs to be fixed before merge.

Overall Assessment: ✅ Approve with changes requested


🤖 Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from 9d2a169 to e98d4d0 Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from bdf8c0e to bb070a3 Compare November 21, 2025 01:27
@claude
Copy link

claude bot commented Nov 21, 2025

Pull Request Review: Message ID Format & Deprecated Tunnel Ack

I've reviewed this PR which implements a new message ID format and adds backwards compatibility for deprecated tunnel acknowledgments. Here's my comprehensive feedback:

Strengths

1. Well-Designed ID System

  • The new tunnel ID module (pegboard/src/tunnel/id.rs) is clean and well-structured
  • Using fixed-size byte arrays (GatewayId: [u8; 4], RequestId: [u8; 4], MessageIndex: u16) provides predictable memory layout
  • Collision detection with retry logic is a good safety measure (lines 667-694 in proxy_service.rs)
  • Base64 encoding/decoding utilities are properly implemented

2. Request Lifecycle Management
The refactoring of request tracking in proxy_service.rs improves correctness:

  • Request IDs are now generated once and tracked consistently throughout their lifecycle
  • The inflight_requests HashSet prevents ID reuse
  • Proper cleanup via release_in_flight ensures IDs are freed

3. Protocol Versioning
The version conversion logic in versioned.rs handles backwards compatibility well:

  • V2 ↔ V3 conversions properly handle the message ID format changes
  • DeprecatedTunnelAck maintains compatibility with older protocol versions
  • Message index extraction from message_id for v2 compatibility (line 1296)

4. Database Key Updates
The hibernating request keys now properly include both gateway_id and request_id, making them more precise and avoiding potential collisions.


⚠️ Issues & Concerns

1. CRITICAL: Duplicate In-Flight Acquisition

In proxy_service.rs, the handle_http_request function acquires in-flight limits twice for CustomServe routes:

// Line 809-821: First acquisition
let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, req.headers()).await? {
    Some(id) => id,
    None => { /* return rate limit error */ }
};

// Line 1094-1113: Second acquisition inside CustomServe handling
let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, &req_headers).await? {
    Some(id) => id,
    None => { /* return rate limit error */ }
};

Impact: This double-acquires the in-flight limit for CustomServe requests, effectively halving the available capacity.

Fix: The second acquisition at line 1094 should be removed since the first one already obtained a valid request_id. The request_id from line 809 should be passed through to the CustomServe handler.

2. Missing Release on Early Return Paths

In handle_http_request around lines 1094-1113, if the second (incorrect) acquire_in_flight fails, the function returns an error WITHOUT releasing the first acquired slot from line 809.

Impact: This leaks in-flight capacity on every rate limit hit for CustomServe routes.

Fix: Ensure the first acquired request ID is released before any error return.

3. Inconsistent Error Handling Pattern

In handle_websocket_upgrade (line 1915-1938), the spawned task acquires in-flight limits, but if it fails, the error is only logged inside the spawned task. The outer function has already returned success to the caller.

Concern: This could lead to silent failures where the WebSocket upgrade appears successful but the connection is never established.

Recommendation: Consider whether acquisition should happen before spawning the task, or add better error propagation.

4. Serialization Size Verification

In build_message_id (tunnel/id.rs:34-39), the code verifies that exactly 12 bytes were written:

ensure!(
    written == 12,
    "message id serialization produced wrong size: expected 12 bytes, got {}",
    written
);

Concern: While this is good defensive programming, if the BARE serialization format changes or has unexpected behavior, this will panic at runtime.

Recommendation: Consider adding a compile-time test to verify the serialization format produces exactly 12 bytes for the expected structure.

5. Collision Probability Analysis

The collision detection uses 100 attempts (line 668) for a 4-byte request ID space (~4.3 billion possibilities).

Math: With a good RNG and uniform distribution:

  • Birthday paradox suggests ~50% collision probability at ~77k concurrent requests
  • With 100 retries, the failure probability is extremely low but non-zero

Question: What's the expected concurrent request load? Is 4 bytes sufficient, or should this be 8 bytes for long-term scalability?

6. Message Index Wrapping

The message_index uses u16 and wraps with wrapping_add(1) (line 150 in shared_state.rs).

Concern: For long-lived connections (especially WebSockets), after 65,536 messages, indices wrap. If old messages are still pending acknowledgment, this could cause confusion.

Recommendation: Document this limitation or consider upgrading to u32 for the message index if long-lived connections are expected.

7. Missing Import Organization

In proxy_service.rs line 16, there's a standalone import that should be organized with other imports at the top:

use pegboard::tunnel::id::{RequestId, generate_request_id};

Per CLAUDE.md: "Always add imports at the top of the file instead of inline within the function."


🔍 Minor Issues & Style

1. Logging Conventions

  • Most logging follows structured logging well, but ensure all logs are lowercase per CLAUDE.md
  • Example: Lines 213 & 349 in lib.rs use proper structured logging with request_id=?tunnel_id::request_id_to_string(&request_id)

2. Deprecated Field Handling
In the protocol conversion, last_msg_index is set to -1 for backwards compatibility (line 1390). Consider documenting why -1 was chosen as the sentinel value.

3. Test Coverage
No tests are included for:

  • Request ID collision handling
  • Message ID serialization/deserialization round-trip
  • Wrapping behavior for message indices

📋 Security Considerations

Good:

  • Request IDs are generated using rand::random, which uses a CSPRNG
  • No user-controlled input affects ID generation
  • The HashSet prevents reuse of active IDs

⚠️ Consider:

  • If gateway IDs are predictable (sequential or low entropy), combined with predictable message indices, an attacker might be able to guess message IDs
  • Currently using rand::random which should be fine, but document that this requires a good system RNG

🎯 Recommendations

High Priority:

  1. Fix the double in-flight acquisition bug in handle_http_request for CustomServe routes
  2. Add proper cleanup for leaked in-flight slots on error paths
  3. Add tests for the collision detection logic

Medium Priority:
4. Consider increasing request ID size from 4 to 8 bytes for better collision resistance at scale
5. Document the 65k message limit for long-lived connections (or increase to u32)
6. Move inline imports to the top of files per CLAUDE.md

Low Priority:
7. Add compile-time verification of message ID serialization size
8. Document the -1 sentinel value for last_msg_index


📊 Overall Assessment

This is a solid architectural improvement that migrates to a more structured message ID format and improves request tracking. The protocol versioning is well-handled and maintains backwards compatibility.

However, the critical double-acquisition bug needs to be fixed before merging, as it will significantly impact production performance and could lead to resource leaks.

Recommendation: Request changes for the in-flight acquisition issues, then approve once fixed.


Let me know if you'd like me to help fix any of these issues!

@claude
Copy link

claude bot commented Nov 21, 2025

PR Review: Add New Message ID Format

This PR introduces a new message ID format for improved tunnel message tracking. Changes span 50 files with significant refactoring in gateway and protocol handling.

Critical Issues (Fix before merge)

1. Indentation Error (proxy_service.rs:641) - Line uses spaces instead of tabs, violates CLAUDE.md hard tabs requirement

2. Double In-Flight Acquisition Bug - handle_http_request calls acquire_in_flight again for custom serve routes, but request ID was already acquired in handle_request. This causes double-counting and potential resource leaks. Pass request_id from outer scope instead.

3. Resource Leak Risk - In handle_websocket_upgrade, spawned task only releases in-flight at task end. If task panics or cancels, request ID leaks from HashSet. Implement RAII guard pattern for cleanup.

Performance Concerns

  • Lock Contention: generate_unique_request_id holds mutex for entire loop (100 iterations). Release between attempts.
  • Unbounded Growth: inflight_requests HashSet can grow if releases fail. Add monitoring.

Security

  • Request ID Space: 4 bytes has birthday paradox issues at ~65k concurrent requests. Consider 8 bytes or add collision metrics.

Test Coverage

Missing tests for:

  • tunnel/id.rs unit tests (round-trip, collisions)
  • Integration tests (hibernation, ID lifecycle)
  • Load tests (concurrent generation)

Verdict

Solid architectural improvement but has critical bugs that must be fixed before merge. Test coverage needed for such significant changes.

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch November 21, 2025 02:43