Contributor

@ecordell ecordell commented Oct 23, 2025

Description

This introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.

This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.

The middleware is installed in both the main API and dispatch, but with different thresholds. Memory usage is polled in the background, and if in-flight memory rises above the threshold, backpressure is placed on incoming requests.

The dispatch threshold is higher than the API threshold to preserve already admitted traffic as much as possible.
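
To make the two-threshold idea concrete, here is a small sketch. The function `admit` and the constant names are hypothetical; 90 is the API default visible in the reviewed diff, while 95 for dispatch is an assumed value chosen only because the description says the dispatch threshold is higher.

```go
package main

import "errors"

const (
	// apiThresholdPercent mirrors the default shown in the reviewed diff.
	apiThresholdPercent = 90.0
	// dispatchThresholdPercent is an assumed value; the description only
	// states that it is higher than the API threshold.
	dispatchThresholdPercent = 95.0
)

// errMemoryPressure stands in for the gRPC ResourceExhausted status that
// appears in the log excerpt of this PR.
var errMemoryPressure = errors.New("server rejected the request because memory usage is above configured threshold")

// admit rejects a request once the sampled usage rises above the threshold.
func admit(usagePercent, thresholdPercent float64) error {
	if usagePercent > thresholdPercent {
		return errMemoryPressure
	}
	return nil
}
```

At 92% usage, for example, `admit(92, apiThresholdPercent)` rejects new API requests while `admit(92, dispatchThresholdPercent)` still lets dispatched sub-requests of already admitted calls finish.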

Testing

  • Unit tests included
  • Manual E2E test:
    • Modify [docker-compose.yaml] to set mem_limit: "200mb"
    • Run docker-compose up --build
    • Run the following:
zed context set example localhost:50051 foobar --insecure
zed import development/schema.yaml
{
    echo '{"items":['
    for i in $(seq 1 200); do
      d=$(( (RANDOM % 9999) + 1 ))
      echo -n "{\"resource\":{\"objectType\":\"document\",\"objectId\": \"${d}\"}, \"permission\":\"view\",\"subject\":{ \"object\": {\"objectType\": \"user\", \"objectId\": \"1\"}}}"
      [ $i -lt 200 ] && echo -n ","
    done
    echo "], \"with_tracing\": true}"
} > payload.json
ab -n 100000 -c 200 -T 'application/json' -H 'Authorization: Bearer foobar' -p payload.json http://localhost:8443/v1/permissions/checkbulk

You should see logs such as:

{
  "level": "warn",
  "traceID": "125b7b37c5775af1f2d9ebf253dcf3d1",
  "protocol": "grpc",
  "grpc.component": "server",
  "grpc.service": "authzed.api.v1.PermissionsService",
  "grpc.method": "CheckBulkPermissions",
  "grpc.method_type": "unary",
  "requestID": "d3vurmuoqu8s73bsdom0",
  "peer.address": "127.0.0.1:60376",
  "grpc.start_time": "2025-10-27T22:10:35Z",
  "grpc.code": "ResourceExhausted",
  "grpc.error": "rpc error: code = ResourceExhausted desc = server rejected the request because memory usage is above configured threshold",
  "grpc.time_ms": 0,
  "time": "2025-10-27T22:10:35Z",
  "message": "finished call"
}

and this graph in Grafana:

image

@github-actions github-actions bot added the area/cli (Affects the command line), area/dependencies (Affects dependencies), and area/tooling (Affects the dev or user toolchain, e.g. tests, ci, build tools) labels Oct 23, 2025
codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 11.97605% with 1470 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.12%. Comparing base (e55404b) to head (4a317d3).

Files with missing lines | Patch % | Lines
internal/mocks/mock_datastore.go | 0.32% | 1277 Missing ⚠️
internal/mocks/mock_dispatcher.go | 2.57% | 152 Missing ⚠️
...ware/memoryprotection/mocks/mock_memory_sampler.go | 38.89% | 22 Missing ⚠️
...rnal/middleware/memoryprotection/memory_sampler.go | 83.08% | 9 Missing and 2 partials ⚠️
pkg/cmd/server/server.go | 87.24% | 3 Missing and 3 partials ⚠️
...l/middleware/memoryprotection/memory_protection.go | 96.50% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2646      +/-   ##
==========================================
- Coverage   79.46%   77.12%   -2.33%     
==========================================
  Files         455      460       +5     
  Lines       47161    48779    +1618     
==========================================
+ Hits        37470    37618     +148     
- Misses       6945     8408    +1463     
- Partials     2746     2753       +7     


@miparnisari miparnisari force-pushed the oomprotect branch 6 times, most recently from d65a7c0 to 7afdbc9 Compare October 27, 2025 23:51
- job_name: "spicedb"
  static_configs:
    - targets: ["spicedb:9090"]
    - targets: ["spicedb-1:9090"]
Contributor

FYI so that we can verify the new metrics in Grafana

@github-actions github-actions bot added the area/dispatch (Affects dispatching of requests) label Oct 27, 2025
@miparnisari miparnisari force-pushed the oomprotect branch 5 times, most recently from 7d7fb92 to dd062d7 Compare October 28, 2025 01:06
@miparnisari miparnisari marked this pull request as ready for review October 28, 2025 05:01
@miparnisari miparnisari requested a review from a team as a code owner October 28, 2025 05:01
Contributor

FYI, I can move the mock generation to a different PR.

Contributor

This is fine.

@@ -0,0 +1,4 @@
internal/mocks/*.go linguist-generated=true
Contributor

FYI: this is so that when these files change, GitHub's pull request view marks them as automatically generated and reviewers can simply mark them as viewed.

pkg/cmd/serve.go Outdated

// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
Contributor

@miparnisari miparnisari Oct 28, 2025

I don't like that the default of 90 is defined both here and in the struct itself 😭 I'd love a unified approach in the future.

pkg/cmd/serve.go Outdated
apiFlags.StringVar(&config.MismatchZedTokenBehavior, "mismatch-zed-token-behavior", "full-consistency", "behavior to enforce when an API call receives a zedtoken that was originally intended for a different kind of datastore. One of: full-consistency (treat as a full-consistency call, ignoring the zedtoken), min-latency (treat as a min-latency call, ignoring the zedtoken), error (return an error). defaults to full-consistency for safety.")

// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
Contributor

I understand the appeal of having this enabled by default (it's for the better!), but playing devil's advocate, this behavior may be surprising for folks as they update to the next release.

Contributor

I agree that it would be surprising, but the other side of the coin is that right now, if they set a limit on their memory usage and go above it, they get an OOM, so this approach is better IMO.

I don't have a strong opinion, and I would love other people's opinions on this.

Member

+1 to enabling it by default - the node dying helps no one, and it's better to get some requests through than none, IMO

WithInterceptor(grpcMetricsUnaryInterceptor).
Done(),

NewUnaryMiddleware().
Contributor

I understand why we would want to add it here, and I think it's a realistic and practical place for it, but I'd be remiss if I didn't mention that requests are unprotected while the earlier middleware layers are traversed.

But again, this is not meant to be perfect, but good enough ™️

Contributor

Looking at the middlewares that run before this one (logging, metrics), the order seems right to me; I wouldn't want this new middleware to run earlier than the others 🤔 Do you have a suggestion for a different order?

pkg/cmd/serve.go Outdated

// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
Contributor

We have some percent-based flags that are defined as [0...1] floats. Worth having a look and deciding which approach to commit to.

Contributor

I made it similar to this existing flag:

 --datastore-revision-quantization-max-staleness-percent float           float percentage (where 1 = 100%)

I don't love it, but I prefer being consistent.

Contributor

Funny, I just found that you can pass a large number to that flag and the server accepts it 😆

- "SPICEDB_DATASTORE_REVISION_QUANTIZATION_MAX_STALENESS_PERCENT=10000"
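
A range check at startup would catch values like the one above. The helper below is a hypothetical sketch, not existing SpiceDB code; it assumes the "[0...1] float, where 1 = 100%" convention mentioned in this thread.

```go
package main

import (
	"fmt"
	"math"
)

// validateFraction rejects values outside [0, 1] for flags that express a
// percentage as a fraction (where 1 = 100%). The name is hypothetical.
func validateFraction(flagName string, v float64) error {
	if math.IsNaN(v) || v < 0 || v > 1 {
		return fmt.Errorf("--%s must be between 0 and 1 (where 1 = 100%%), got %v", flagName, v)
	}
	return nil
}
```

Calling it for each fraction-style flag before the server starts would turn an accidental `90` or `10000` into a clear configuration error instead of a silently accepted value.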

@miparnisari miparnisari force-pushed the oomprotect branch 4 times, most recently from 77c4458 to e73f76c Compare October 31, 2025 03:56
ecordell and others added 2 commits October 31, 2025 09:22
The commit introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.

This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.

The middleware is installed both in the main api and in dispatch, but at
different thresholds. Memory usage is polled in the background, and if
in-flight memory rises above the threshold, backpressure is placed on
incoming requests.

The dispatch threshold is higher than the API threshold to preserve
already admitted traffic as much as possible.
Contributor

@tstirrat15 tstirrat15 left a comment

See comments

Comment on lines 74 to 78
_ MemoryLimitProvider = (*DefaultMemoryLimitProvider)(nil)
_ MemoryLimitProvider = (*HardCodedMemoryLimitProvider)(nil)
Contributor

Why this syntax as opposed to constructing one?

Contributor

What do you mean? This is the standard Go idiom for asserting at compile time that a type satisfies an interface.

Contributor

This is fine.
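
For readers unfamiliar with the idiom discussed in this thread, here is a self-contained sketch. The type names come from the diff, but the interface's method set is not shown in this conversation, so `LimitBytes` and both implementations are assumptions made purely for illustration.

```go
package main

import "runtime/debug"

// MemoryLimitProvider's real method set is not visible in this thread;
// LimitBytes is a hypothetical stand-in.
type MemoryLimitProvider interface {
	LimitBytes() int64
}

// DefaultMemoryLimitProvider reports the Go runtime's soft memory limit.
// A negative argument to SetMemoryLimit reads the limit without changing it.
type DefaultMemoryLimitProvider struct{}

func (*DefaultMemoryLimitProvider) LimitBytes() int64 {
	return debug.SetMemoryLimit(-1)
}

// HardCodedMemoryLimitProvider returns a fixed value, which is handy in tests.
type HardCodedMemoryLimitProvider struct {
	Bytes int64
}

func (p *HardCodedMemoryLimitProvider) LimitBytes() int64 { return p.Bytes }

// Compile-time assertions: the build fails if either type stops satisfying
// the interface. The blank identifier discards the value, and the typed nil
// pointer means nothing is allocated or constructed.
var (
	_ MemoryLimitProvider = (*DefaultMemoryLimitProvider)(nil)
	_ MemoryLimitProvider = (*HardCodedMemoryLimitProvider)(nil)
)
```

Compared with constructing a value, the typed-nil form needs no constructor arguments and has no runtime cost, which is why it is commonly preferred for this kind of check.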

synctest.Wait()
t.Log("When we get here, the sampling is guaranteed to have run")

require.True(t, sampler.GetTimestampLastMemorySample().After(now))
Contributor

This is in a goroutine :P


Labels

area/cli (Affects the command line), area/dependencies (Affects dependencies), area/dispatch (Affects dispatching of requests), area/tooling (Affects the dev or user toolchain, e.g. tests, ci, build tools)


5 participants