Add Background Reconciliation for Entities #2262

Vyom-Yadav opened this issue Feb 1, 2024 · 18 comments · May be fixed by #3630
@Vyom-Yadav (Contributor) commented Feb 1, 2024

Please describe the enhancement

Minder currently relies purely on event-based processing triggered by webhooks and user actions. Apart from edge-based triggering, Minder should also reconcile entities in the background. The reconcilers within the scope of this issue are:

  • Continuously update the evaluation status for entities against registered rules. (Task I)
  • Continuously evaluate the state of the webhook created by Minder on GitHub, as the user may accidentally delete it. (Task II)

These reconcilers are required to ensure all existing rule evaluations in Minder are updated according to any newly added logic. See #1609 for an example.

Solution Proposal

Continuously running the reconcilers is not feasible due to resource constraints and the providers' API rate limits. Instead, reconcilers would run as a cron job with a batch size small enough to avoid rate-limiting errors for any provider.
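To make the batching concrete, here is a minimal sketch of what one interval-driven pass could look like. Everything in it is illustrative: the store interface, the names, and the limits are assumptions for this issue, not Minder's actual code.

package reminder

import (
  "context"
  "log"
  "time"
)

// Store is a hypothetical read-only view over Minder's database.
type Store interface {
  // ListStaleEntities returns up to limit entities not reconciled since cutoff.
  ListStaleEntities(ctx context.Context, cutoff time.Time, limit int) ([]string, error)
}

// reconcilePass processes one batch, sized to stay under provider rate limits.
func reconcilePass(ctx context.Context, store Store, batchSize int, maxAge time.Duration) error {
  entities, err := store.ListStaleEntities(ctx, time.Now().Add(-maxAge), batchSize)
  if err != nil {
    return err
  }
  for _, id := range entities {
    // enqueueReconciliation is hypothetical: it would publish an event
    // equivalent to the one a webhook delivery produces today.
    if err := enqueueReconciliation(ctx, id); err != nil {
      log.Printf("failed to enqueue %s: %v", id, err)
    }
  }
  return nil
}

func enqueueReconciliation(ctx context.Context, entityID string) error {
  log.Printf("would reconcile entity %s", entityID) // placeholder
  return nil
}

The batch size and the interval between passes are exactly the knobs the proposal would make configurable.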

Describe alternatives you've considered

No response

Additional context

For more details see: https://discord.com/channels/1184987096302239844/1201083912521252885

Acceptance Criteria

No response

@Vyom-Yadav (Contributor, Author)

/assign

@jhrozek (Contributor) commented Feb 1, 2024

Hi @Vyom-Yadav, I was thinking about this feature some time ago (though I hadn't gotten far, so thank you very much for picking it up); here are my notes from that time. Feel free to take anything you like from them, or not :-). A lot of it is very hand-wavy because, as I said, I didn't get too deep.

  • I was thinking about having a separate process, a singleton, as opposed to being part of the potentially replicated Minder instances. The idea behind this was that presumably we'd use a queue of entities to be updated, and having a singleton writing to the queue just seemed simpler.
  • This new component (please name it reminder :-)) would periodically (say, once an hour?) traverse the database, picking up projects that contain entities that haven't been updated in $TIME_PERIOD.
  • Each project would get a time slot and could update up to N entities within that slot. Individual updates within that interval would be spaced apart with a jitter to avoid spikes. An entry to update an entity would be written to a work queue. We use Watermill with an SQL backend; without thinking through the details too deeply, I was wondering if we could make this reminder a topic publisher and the Minder workers subscribers to the periodic update topics (see the sketch after this list).
  • The Minder processes, as subscribers to the periodic update topic, would then issue an ExecuteEntityEventTopic event.
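A minimal sketch of the publisher side of this idea, using Watermill's generic Publisher interface; the topic name and payload shape are assumptions for the example, not anything Minder defines today:

package reminder

import (
  "encoding/json"
  "time"

  "github.com/ThreeDotsLabs/watermill"
  "github.com/ThreeDotsLabs/watermill/message"
)

// updateTopic is a hypothetical topic name for periodic update events.
const updateTopic = "entity.periodic.update"

// EntityReminder is a hypothetical payload identifying an entity to revisit.
type EntityReminder struct {
  ProjectID string `json:"project_id"`
  EntityID  string `json:"entity_id"`
}

// publishReminders writes one message per stale entity to the work queue,
// spacing publishes apart to avoid spikes.
func publishReminders(pub message.Publisher, reminders []EntityReminder, spacing time.Duration) error {
  for _, r := range reminders {
    payload, err := json.Marshal(r)
    if err != nil {
      return err
    }
    msg := message.NewMessage(watermill.NewUUID(), payload)
    if err := pub.Publish(updateTopic, msg); err != nil {
      return err
    }
    time.Sleep(spacing) // crude stand-in for jitter; a real scheduler would randomize
  }
  return nil
}

Because message.Publisher is an interface, the same code would work with the SQL backend mentioned above or any other Watermill backend.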

@Vyom-Yadav (Contributor, Author) commented Feb 2, 2024

Good points, @jhrozek! I'm also thinking about having this as a separate process. With multiple replicas, the same entity could be processed repeatedly, leading to rate-limiting errors.

The singleton client binary would hold the logic for scheduling reconciliation events. The approach I have in mind is:

  • Pick entities from different owners/tokens so we can process multiple entities with larger batch sizes without hitting rate-limiting errors.
  • Evan suggested running the reconciler once every 24 hours, but all those parameters would be configurable so we can tune them accordingly.
  • Send a request to the server using RPCs; the server would then process those events internally. It would be similar to receiving webhook events, except this time they would come from the reminder process (see the client-side sketch below).

To scale up, we may shard the database so that each reconciler is attached to a unique set of data.
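A minimal sketch of the reminder-to-server call, assuming a hypothetical RevisitClient; Minder's actual proto definitions may differ entirely:

package reminder

import (
  "context"
  "log"
  "time"
)

// RevisitClient is a stand-in for a generated gRPC client; the real service
// and message definitions would live in Minder's proto files.
type RevisitClient interface {
  EntityRevisit(ctx context.Context, projectID, entityID string) error
}

// kick asks the Minder server to enqueue a reconciliation event for one
// entity, much like an incoming webhook delivery would.
func kick(client RevisitClient, projectID, entityID string) {
  ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
  defer cancel()
  if err := client.EntityRevisit(ctx, projectID, entityID); err != nil {
    // Losing a kick is tolerable: the next scheduled pass picks the
    // entity up again because it is still stale.
    log.Printf("revisit kick failed for %s/%s: %v", projectID, entityID, err)
  }
}

Keeping the call fire-and-forget works because the periodic pass is effectively idempotent: a lost kick is simply retried on the next tick.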

@jhrozek (Contributor) commented Feb 2, 2024

> Good points, @jhrozek! I'm also thinking about having this as a separate process. With multiple replicas, the same entity could be processed repeatedly, leading to rate-limiting errors.
>
> The singleton client binary would hold the logic for scheduling reconciliation events. The approach I have in mind is:
>
>   • Pick entities from different owners/tokens so we can process multiple entities with larger batch sizes without hitting rate-limiting errors.

Yes, this is what I tried to say with "a project", but yes, your description is more precise.

>   • Evan suggested running the reconciler once every 24 hours, but all those parameters would be configurable so we can tune them accordingly.

I think it's hard to know how often to run the reconciler without more experimentation and testing, so I strongly agree with making it configurable. At the very least for testing, I would use an interval on the order of seconds to see the reconciler trigger repeatedly.

>   • Send a request to the server using RPCs; the server would then process those events internally. It would be similar to receiving webhook events, except this time they would come from the reminder process.

Interesting idea; I like that it's decoupled from the Watermill implementation we have now.

> To scale up, we may shard the database so that each reconciler is attached to a unique set of data.

I have literally zero experience with database sharding, so I don't feel qualified to answer. I know Evan has a lot; maybe he'd have an opinion?

@Vyom-Yadav (Contributor, Author)

cc @evankanderson, what do you think about the above-mentioned approach (having the reconciler as a separate process)?

@JAORMX (Contributor) commented Feb 2, 2024

FWIW, I think having a separate process for the reconciliation trigger makes sense and is ideal. I'm not entirely sure about doing it with RPCs and would prefer events instead, but I don't have a strong opinion that would block it.

@Vyom-Yadav (Contributor, Author) commented Feb 2, 2024

> doing it with RPCs and would prefer events instead

It would still be done using events: the RPCs would trigger events on the server, and our new process would invoke those RPCs, similar to the webhook handler.

Edit:
I didn't mention it explicitly, but we won't directly expose the evaluation engine. The server would also have some form of authorization mechanism to verify the reconciler.

@evankanderson (Member)

I agree that this is probably worth a quick design. Trying to hit the design points:

  1. Using a singleton to avoid needing coordination/locking across multiple Minder processes. This should work for now. It adds a small amount of deployment complexity, but that seems worth it. You may also want to consider having the singleton support a "one-shot" mode to address @jhrozek's comment about wanting second-level revisits in development; you probably don't actually want constant revisits at the second level, but you do want them on demand (see the sketch after this list).

  2. I think we'll need to be nimble on the pacing. I'd start with something simple, but we may need to spread out on some of the axes @Vyom-Yadav mentions. Right now, we're looking at scales of maybe 10k repositories, which is... one every ~8.6 seconds at a 24-hour revisit. This will change if Minder takes off, of course.

  3. My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently. At the same time, that would be a major undertaking that doesn't provide much direct value, so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events. It's not clear that the event respondent needs to know whether they received a webhook or a reminder kick -- I'd aim to keep it that way; the reconciliation logic should be idempotent.
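For the "one-shot" idea in point 1, a minimal sketch of how a hypothetical reminder binary could expose it; the flag names and the pass function are illustrative assumptions:

package main

import (
  "flag"
  "log"
  "time"
)

var (
  oneShot  = flag.Bool("one-shot", false, "run a single reconciliation pass and exit")
  interval = flag.Duration("interval", 24*time.Hour, "delay between passes")
)

func main() {
  flag.Parse()
  for {
    // runPass is hypothetical: it would enqueue revisit kicks for stale entities.
    if err := runPass(); err != nil {
      log.Printf("pass failed: %v", err)
    }
    if *oneShot {
      return // on-demand revisit for development and testing
    }
    time.Sleep(*interval)
  }
}

func runPass() error {
  log.Println("would scan for stale entities here") // placeholder
  return nil
}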

@evankanderson (Member)

Oh, and the first point of scaling, if we needed it, would be a read replica, which should be able to handle 2M repos under management at a scan rate of less than 300 rows/s (2M rows spread over 24 hours is only ~23 rows/s). If we're reading in an index-aligned way, that should be trivial, and even if not, it should be pretty easy unless we do some sort of order by/limit/offset hijinks.

@Vyom-Yadav (Contributor, Author) commented Feb 6, 2024

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.

@evankanderson Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case? (How about NATS?)

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.

Now that you've mentioned message queues, let's just implement that directly.

@jhrozek (Contributor) commented Feb 7, 2024

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.
>
> Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case?

(Full disclosure: I don't have much working experience with message queues, so this is more of a gut feeling than anything backed by hard data.)

I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.
>
> Now that you've mentioned message queues, let's just implement that directly.

One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...

@Vyom-Yadav (Contributor, Author)

> I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.

Watermill is a wrapper over the queue SDKs (with additional features). We would still use Watermill, but we have to decide whether we want Kafka, NATS, RabbitMQ, etc. Each has a unique set of features, so we should decide what's best for the use case. Honestly, I also haven't worked much with message queues, so I'm unsure which one would be best suited.

> One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...

Agreed, using queues would be significantly better.

@jhrozek (Contributor) commented Feb 7, 2024

> I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.
>
> Watermill is a wrapper over the queue SDKs (with additional features). We would still use Watermill, but we have to decide

Ah, then we just misunderstood each other; I read your earlier comment as wanting to go directly for the message queue.

@evankanderson (Member)

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.
>
> @evankanderson Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case? (How about NATS?)

Sorry, missed this question -- I would suggest using Watermill for now, and making "extract the message queue" a separate epic that would apply to all our existing message queueing in Minder.

I'm partial to tools like SQS, EventBridge, or something like Google's Cloud Tasks which manage delivery as a weakly ordered set of deliveries, rather than a strongly-ordered log like Kafka. However, I think that's a separate "plumbing" task, so let's not conflate the two.

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.
>
> Now that you've mentioned message queues, let's just implement that directly.

Please don't! Feel free to put the RPC-to-Minder behind an interface that we could replace with a one-way delivery mechanism, but switching to an external message queueing system would be a major surgery that should be separate from the revisit mechanism.

(Basically, we'd be taking the vertical stack from the Watermill router diagram and externalizing it into configuration managed by a tool like Terraform)

[Image: Watermill router diagram]

@evankanderson (Member)

To explain the externalized config further: the Watermill Router concept supports a simple "named topic" fan-out, but SNS, EventBridge, and Google Pub/Sub support filtering messages based on message attributes (e.g. attributes.tier=0 and attributes.type = "repo" in Google syntax, or {"tier": [{"numeric": ["=", 0]}], "type": ["repo"]} in AWS syntax). A sketch of such a filtered subscription follows below.
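As a concrete illustration, a minimal sketch of creating a filtered subscription with the Google Cloud Pub/Sub Go client; the project, topic, and subscription names are made up for the example (note that Pub/Sub attribute values are strings, so the tier is quoted in the filter):

package main

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

func main() {
  ctx := context.Background()
  client, err := pubsub.NewClient(ctx, "example-project") // hypothetical project
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // Only messages tagged as tier-0 repo events reach this subscription;
  // the fan-out and filtering live in the external service, not in Minder.
  _, err = client.CreateSubscription(ctx, "repo-workers", pubsub.SubscriptionConfig{
    Topic:  client.Topic("minder-events"), // hypothetical topic
    Filter: `attributes.tier = "0" AND attributes.type = "repo"`,
  })
  if err != nil {
    log.Fatal(err)
  }
}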

Using one of the above providers would also allow us to absorb some of the existing middlewares -- some of the telemetry and the Poison / Retry management would move into the external service.

This is also a pretty big commitment that requires a design document and consideration of tradeoffs (cost, scale, difficulty-of-use, credentials), and we probably will end up putting it behind an interface to allow some agility later.

@Vyom-Yadav (Contributor, Author)

> Using one of the above providers would also allow us to absorb some of the existing middlewares -- some of the telemetry and the Poison / Retry management would move into the external service.

That's exactly why I wanted to do it with queues.

> Please don't! Feel free to put the RPC-to-Minder behind an interface that we could replace with a one-way delivery mechanism, but switching to an external message queueing system would be a major surgery that should be separate from the revisit mechanism.

Makes sense. Let's keep that as a separate issue. Later we can replace the RPC call with a message publish. Thanks for the detailed explanation.

@evankanderson (Member)

> One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...
>
> Agreed, using queues would be significantly better.

I was suggesting the following pseudocode:

func (s *Server) MinderRevisit(ctx context.Context, req *RevisitRequest) (*EmptyResp, error) {
  // Convert the RPC request into a queue message.
  msg := ConvertRevisitToMsg(req)

  // Publish synchronously so a failed publish surfaces as an RPC error to the caller.
  if err := s.Router.Publish(msg); err != nil {
    return nil, err
  }
  return &EmptyResp{}, nil
}

With the pattern of synchronously publishing the event and then returning the error from the publish, you shouldn't have to worry about the task being lost. In general, a message queue will not notify you when the message has been delivered -- once the queue takes over, delivery to all destinations is eventually guaranteed, but you don't get to know when it happened (it's asynchronous and may be plural -- one publish may result in three subscribers getting the message, so "done" is ambiguous in those circumstances).

@Vyom-Yadav (Contributor, Author)

Yeah, we'll use something like that, with some authn/z mechanism that works for both the reminder process and users trying to do a one-shot operation.
