Add Background Reconciliation for Entities #2262
/assign
Hi @Vyom-Yadav, I was thinking about this feature some time ago (though I hadn't gotten far, so thank you very much for picking it up). Here are my notes from that time; feel free to take anything you like from them, or leave them aside. A lot of it is very hand-wavy, as I said, I didn't get too deep.
Good points, @jhrozek! I'm also thinking about having this as a separate process. With multiple replicas, the same entity could be processed repeatedly, leading to rate-limiting errors. The singleton client binary would own the logic for scheduling reconciliation events. The approach I had in mind is:
To scale up, we may shard the database so that each reconciler is attached to a unique subset of the data.
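The scheduling loop described above might look like the following minimal sketch, assuming a singleton process that walks the entity list in batches on a timer. All names here are illustrative, not from the Minder codebase:

```go
package main

import (
	"fmt"
	"time"
)

// nextBatch returns the next slice of entity IDs to reconcile and the
// updated cursor, wrapping around when the end of the list is reached.
func nextBatch(entities []string, cursor, batchSize int) ([]string, int) {
	if len(entities) == 0 {
		return nil, 0
	}
	cursor = cursor % len(entities)
	end := cursor + batchSize
	if end > len(entities) {
		end = len(entities)
	}
	return entities[cursor:end], end % len(entities)
}

// runScheduler is the singleton loop: on every tick it picks a batch and
// fires a reconciliation trigger (an RPC or event publish in practice).
func runScheduler(entities []string, batchSize int, interval time.Duration,
	trigger func([]string), ticks int) {
	cursor := 0
	t := time.NewTicker(interval)
	defer t.Stop()
	for i := 0; i < ticks; i++ {
		<-t.C
		var batch []string
		batch, cursor = nextBatch(entities, cursor, batchSize)
		trigger(batch)
	}
}

func main() {
	entities := []string{"repo-a", "repo-b", "repo-c", "repo-d", "repo-e"}
	runScheduler(entities, 2, time.Millisecond, func(b []string) {
		fmt.Println(b)
	}, 3)
}
```

The interval and batch size would be configurable, per the discussion above; the wraparound cursor ensures every entity is eventually revisited.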
Yes, this is what I tried to say with "a project"; your description is more precise.
I think it's hard to know how often to run the reconciler without more experiments and testing, and I strongly agree with making it configurable. At the very least for testing, I would use something on the order of seconds to see the reconciler triggered repeatedly.
Interesting idea, I like that it's decoupled from the watermill implementation we have now.
I have literally zero experience with database sharding, so I don't feel qualified to answer. I know Evan has a lot; maybe he'd have an opinion?
cc @evankanderson, what do you think about the above-mentioned approach (having the reconciler as a separate process)?
FWIW, I think having a separate process for the reconciliation trigger makes sense and is ideal. I'm not super sure about doing it with RPCs and would prefer events instead, but I don't have a strong opinion that would block it.
It would be done using events only: the RPCs would trigger events on the server, and our new process would invoke those RPCs, similar to the webhook handler.
I agree that this is probably worth a quick design. Trying to hit the design points:
Oh, and the first point of scaling, if we needed it, would be a read replica, which should be able to scale to 2M repos under management at a scan rate of less than 300 rows/s. If we're reading in an index-aligned way, that should be trivial, and even if not, it should be pretty easy unless we do some sort of ORDER BY/LIMIT/OFFSET hijinks.
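As a concrete contrast to the LIMIT/OFFSET hijinks mentioned above, a keyset (cursor) read keeps each batch index-aligned: rather than re-scanning all skipped rows on every page, the reconciler resumes from the last id it saw. A minimal sketch with hypothetical table and column names (real code should use parameterized queries; values are inlined here only to keep the sketch self-contained):

```go
package main

import "fmt"

// keysetQuery builds an index-aligned pagination query. Each page is a
// cheap index range scan starting after the last id seen, so cost does
// not grow with how deep into the table the reconciler is.
func keysetQuery(lastID int64, batchSize int) string {
	return fmt.Sprintf(
		"SELECT id, entity FROM entities WHERE id > %d ORDER BY id LIMIT %d",
		lastID, batchSize)
}

func main() {
	// First page starts from id 0; subsequent pages pass the max id seen.
	fmt.Println(keysetQuery(0, 500))
	fmt.Println(keysetQuery(500, 500))
}
```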
@evankanderson Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for this particular use case? (How about NATS?)
Now that you've mentioned message queues, let's implement it that way.
(Full disclosure: I don't have much working experience with message queues, so this is more of a gut feeling than anything backed by hard data.) I would prefer not to tie Minder to a specific message queue implementation. Wasn't that abstraction a reason we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.
One reason a message queue might be nicer in this respect than an RPC call is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task had finished. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still.
Watermill is a wrapper over the queue SDKs (with additional features). We would still use Watermill, but we have to decide whether we want Kafka, NATS, RabbitMQ, etc. underneath. Each has a unique set of features, so we should decide what's best for the use case. Honestly, I also haven't worked much with message queues, so I'm unsure which would be best suited.
Agreed, using queues would be significantly better.
Ah, then we just didn't understand each other; I read your earlier comment as wanting to go directly for the message queue.
Sorry, missed this question -- I would suggest using Watermill for now, and making the "extract message queue" work a separate epic that would apply to all our existing message queueing in Minder. I'm partial to tools like SQS, EventBridge, or Google's Cloud Tasks, which manage delivery as a weakly ordered set of deliveries rather than a strongly ordered log like Kafka. However, I think that's a separate "plumbing" task, so let's not conflate the two.
Please don't! Feel free to put the RPC-to-Minder call behind an interface that we could replace with a one-way delivery mechanism, but switching to an external message queueing system would be major surgery that should be separate from the revisit mechanism. (Basically, we'd be taking the vertical stack from the Watermill router diagram and externalizing it into configuration managed by a tool like Terraform.)
To explain the externalized config further: the Watermill Router concept supports a simple "named topic" fan-out, but SNS, EventBridge, and Google Pub/Sub support filtering messages based on message attributes. Using one of those providers would also allow us to absorb some of the existing middlewares; some of the telemetry and the Poison/Retry management would move into the external service. This is also a pretty big commitment that requires a design document and consideration of tradeoffs (cost, scale, difficulty of use, credentials), and we would probably end up putting it behind an interface to allow some agility later.
That's exactly why I wanted to do it with queues.
Makes sense. Let's keep that as a separate issue; later we can replace the RPC call with a message publish. Thanks for the detailed explanation.
I was suggesting the following pseudocode:
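(The original pseudocode did not survive extraction. Based on the surrounding discussion, a hedged Go sketch of the intended shape: publish the event synchronously and return the publish error. All names are illustrative, not from the Minder codebase.)

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher abstracts the event bus (Watermill in Minder's case); it is
// defined locally so the sketch is self-contained.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// ReconcileEntity is a hypothetical RPC handler: it synchronously publishes
// a reconciliation event and returns the publish error, so a failed enqueue
// surfaces to the caller instead of the task being silently lost.
func ReconcileEntity(pub Publisher, entityID string) error {
	if entityID == "" {
		return errors.New("entity id required")
	}
	// Once Publish returns nil, the queue owns delivery; the handler does
	// not learn when (or by how many subscribers) the message is consumed.
	return pub.Publish("entity.reconcile", []byte(entityID))
}

// logPublisher is a stand-in backend that records what was published.
type logPublisher struct{ published []string }

func (l *logPublisher) Publish(topic string, payload []byte) error {
	l.published = append(l.published, topic+":"+string(payload))
	return nil
}

func main() {
	pub := &logPublisher{}
	if err := ReconcileEntity(pub, "repo-42"); err != nil {
		fmt.Println("enqueue failed:", err)
		return
	}
	fmt.Println(pub.published)
}
```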
With the pattern of synchronously publishing the event and then returning the error from the publish, you shouldn't have to worry about the task being lost. In general, a message queue will not notify you when the message has been delivered: once the queue takes over, delivery to all destinations is eventually guaranteed, but you don't get to know when it happened (it's asynchronous and may be plural; one publish may result in three subscribers getting the message, so "done" is ambiguous in those circumstances).
Yeah, we'll use something like that, with some authn/z mechanism that works both for the reminder process and for users trying to do one-shot operations.
Please describe the enhancement
Minder relies purely on event-based processing triggered by webhooks/user actions. Apart from edge-based triggering, Minder should also reconcile the entities in the background. Reconcilers under the scope of this issue are:
These reconcilers are required to ensure all existing rule evaluations in Minder are updated according to any newly added logic. See #1609 for an example.
Solution Proposal
Continuously running the reconcilers is not feasible due to resource constraints and the providers' API limits. Instead, reconcilers would run from a cron job with a batch size small enough to avoid rate-limiting errors for any provider.
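The batch-size constraint could be made explicit with a small helper, sketched here under assumed numbers. In practice the remaining quota would be read from the provider's rate-limit headers at runtime; the function and its parameters are illustrative, not part of Minder:

```go
package main

import "fmt"

// batchSizeFor estimates how many entities one cron run can reconcile
// without exhausting a provider's API quota, keeping a safety margin in
// reserve for interactive (webhook/user-triggered) traffic.
func batchSizeFor(remainingQuota, callsPerEntity, safetyMargin int) int {
	if callsPerEntity <= 0 {
		return 0
	}
	usable := remainingQuota - safetyMargin
	if usable < 0 {
		usable = 0
	}
	return usable / callsPerEntity
}

func main() {
	// e.g. 5000 calls left, ~5 calls per entity, keep 500 in reserve.
	fmt.Println(batchSizeFor(5000, 5, 500)) // 900
}
```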
Describe alternatives you've considered
No response
Additional context
For more details see: https://discord.com/channels/1184987096302239844/1201083912521252885
Acceptance Criteria
No response