Add Background Reconciliation for Entities #2262

Vyom-Yadav opened this issue Feb 1, 2024 · 18 comments · May be fixed by #3630
@Vyom-Yadav (Contributor) commented Feb 1, 2024

Please describe the enhancement

Minder currently relies purely on event-based processing triggered by webhooks and user actions. Apart from edge-based triggering, Minder should also reconcile entities in the background. The reconcilers within the scope of this issue are:

  • Continuously update the evaluation status for entities against registered rules. (Task I)
  • Continuously evaluate the state of the webhook created by Minder on GitHub, as the user may accidentally delete it. (Task II)

These reconcilers are required to ensure all existing rule evaluations in Minder are updated according to any newly added logic. See #1609 for an example.

Solution Proposal

Continuously running the reconcilers is not feasible due to resource constraints and the providers' API rate limits. Instead, reconcilers would run as a cron job with a batch size small enough to avoid rate-limiting errors for any provider.
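To make the batching concrete, here is a minimal sketch of what one interval-driven pass could look like. Everything in it is illustrative: the store interface, the names, and the limits are assumptions for this issue, not Minder's actual code.

package reminder

import (
  "context"
  "log"
  "time"
)

// Store is a hypothetical read-only view over Minder's database.
type Store interface {
  // ListStaleEntities returns up to limit entities not reconciled since cutoff.
  ListStaleEntities(ctx context.Context, cutoff time.Time, limit int) ([]string, error)
}

// reconcilePass processes one batch, sized to stay under provider rate limits.
func reconcilePass(ctx context.Context, store Store, batchSize int, maxAge time.Duration) error {
  entities, err := store.ListStaleEntities(ctx, time.Now().Add(-maxAge), batchSize)
  if err != nil {
    return err
  }
  for _, id := range entities {
    // enqueueReconciliation is hypothetical: it would publish an event
    // equivalent to the one a webhook delivery produces today.
    if err := enqueueReconciliation(ctx, id); err != nil {
      log.Printf("failed to enqueue %s: %v", id, err)
    }
  }
  return nil
}

func enqueueReconciliation(ctx context.Context, entityID string) error {
  log.Printf("would reconcile entity %s", entityID) // placeholder
  return nil
}

The batch size and the interval between passes are exactly the knobs the proposal would make configurable.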

Describe alternatives you've considered

No response

Additional context

For more details see: https://discord.com/channels/1184987096302239844/1201083912521252885

Acceptance Criteria

No response

@Vyom-Yadav (Contributor, Author)

/assign

@jhrozek (Contributor) commented Feb 1, 2024

Hi @Vyom-Yadav, I was thinking about this feature some time ago (though I hadn't gotten far, so thank you very much for picking it up); here are my notes from that time. Feel free to take anything you like from them, or not :-). A lot of it is very hand-wavy because, as I said, I didn't get too deep.

  • I was thinking about having a separate process, a singleton, as opposed to being part of the potentially replicated Minder instances. The idea behind this was that presumably we'd use a queue of entities to be updated, and having a singleton writing to the queue just seemed simpler.
  • This new component (please name it reminder :-)) would periodically (say, once an hour?) traverse the database, picking up projects that contain entities that haven't been updated in $TIME_PERIOD.
  • Each project would get a time slot and could update up to N entities within that slot. Individual updates within that interval would be spaced apart with a jitter to avoid spikes. An entry to update an entity would be written to a work queue. We use Watermill with an SQL backend; without thinking through the details too deeply, I was wondering if we could make this reminder a topic publisher and the Minder workers subscribers to the periodic update topics (see the sketch after this list).
  • The Minder processes, as subscribers to the periodic update topic, would then issue an ExecuteEntityEventTopic event.
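A minimal sketch of the publisher side of this idea, using Watermill's generic Publisher interface; the topic name and payload shape are assumptions for the example, not anything Minder defines today:

package reminder

import (
  "encoding/json"
  "time"

  "github.com/ThreeDotsLabs/watermill"
  "github.com/ThreeDotsLabs/watermill/message"
)

// updateTopic is a hypothetical topic name for periodic update events.
const updateTopic = "entity.periodic.update"

// EntityReminder is a hypothetical payload identifying an entity to revisit.
type EntityReminder struct {
  ProjectID string `json:"project_id"`
  EntityID  string `json:"entity_id"`
}

// publishReminders writes one message per stale entity to the work queue,
// spacing publishes apart to avoid spikes.
func publishReminders(pub message.Publisher, reminders []EntityReminder, spacing time.Duration) error {
  for _, r := range reminders {
    payload, err := json.Marshal(r)
    if err != nil {
      return err
    }
    msg := message.NewMessage(watermill.NewUUID(), payload)
    if err := pub.Publish(updateTopic, msg); err != nil {
      return err
    }
    time.Sleep(spacing) // crude stand-in for jitter; a real scheduler would randomize
  }
  return nil
}

Because message.Publisher is an interface, the same code would work with the SQL backend mentioned above or any other Watermill backend.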

@Vyom-Yadav (Contributor, Author) commented Feb 2, 2024

Good points, @jhrozek! I'm also thinking about having this as a separate process. With multiple replicas, the same entity could be processed repeatedly, leading to rate-limiting errors.

The singleton client binary would hold the logic for scheduling reconciliation events. The approach I have in mind is:

  • Pick entities from different owners/tokens so we can process multiple entities with larger batch sizes without hitting rate-limiting errors.
  • Evan suggested running the reconciler once every 24 hours, but all those parameters would be configurable so we can tune them accordingly.
  • Send a request to the server using RPCs; the server would then process those events internally. It would be similar to receiving webhook events, except this time they would come from the reminder process (see the client-side sketch below).

To scale up, we may shard the database so that each reconciler is attached to a unique set of data.
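A minimal sketch of the reminder-to-server call, assuming a hypothetical RevisitClient; Minder's actual proto definitions may differ entirely:

package reminder

import (
  "context"
  "log"
  "time"
)

// RevisitClient is a stand-in for a generated gRPC client; the real service
// and message definitions would live in Minder's proto files.
type RevisitClient interface {
  EntityRevisit(ctx context.Context, projectID, entityID string) error
}

// kick asks the Minder server to enqueue a reconciliation event for one
// entity, much like an incoming webhook delivery would.
func kick(client RevisitClient, projectID, entityID string) {
  ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
  defer cancel()
  if err := client.EntityRevisit(ctx, projectID, entityID); err != nil {
    // Losing a kick is tolerable: the next scheduled pass picks the
    // entity up again because it is still stale.
    log.Printf("revisit kick failed for %s/%s: %v", projectID, entityID, err)
  }
}

Keeping the call fire-and-forget works because the periodic pass is effectively idempotent: a lost kick is simply retried on the next tick.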

@jhrozek (Contributor) commented Feb 2, 2024

> Good points, @jhrozek! I'm also thinking about having this as a separate process. With multiple replicas, the same entity could be processed repeatedly, leading to rate-limiting errors.
>
> The singleton client binary would hold the logic for scheduling reconciliation events. The approach I have in mind is:
>
>   • Pick entities from different owners/tokens so we can process multiple entities with larger batch sizes without hitting rate-limiting errors.

Yes, this is what I tried to say with "a project", but yes, your description is more precise.

>   • Evan suggested running the reconciler once every 24 hours, but all those parameters would be configurable so we can tune them accordingly.

I think it's hard to know how often to run the reconciler without more experimentation and testing, so I strongly agree with making it configurable. At the very least for testing, I would use an interval on the order of seconds to see the reconciler trigger repeatedly.

>   • Send a request to the server using RPCs; the server would then process those events internally. It would be similar to receiving webhook events, except this time they would come from the reminder process.

Interesting idea; I like that it's decoupled from the Watermill implementation we have now.

> To scale up, we may shard the database so that each reconciler is attached to a unique set of data.

I have literally zero experience with database sharding, so I don't feel qualified to answer. I know Evan has a lot; maybe he'd have an opinion?

@Vyom-Yadav (Contributor, Author)

cc @evankanderson, what do you think about the above-mentioned approach (having the reconciler as a separate process)?

@JAORMX (Contributor) commented Feb 2, 2024

FWIW, I think having a separate process for the reconciliation trigger makes sense and is ideal. I'm not entirely sure about doing it with RPCs and would prefer events instead, but I don't have a strong opinion that would block it.

@Vyom-Yadav (Contributor, Author) commented Feb 2, 2024

> doing it with RPCs and would prefer events instead

It would still be done using events: the RPCs would trigger events on the server, and our new process would invoke those RPCs, similar to the webhook handler.

Edit:
I didn't mention it explicitly, but we won't directly expose the evaluation engine. The server would also have some form of authorization mechanism to verify the reconciler.

@evankanderson (Member)

I agree that this is probably worth a quick design. Trying to hit the design points:

  1. Using a singleton to avoid needing coordination/locking across multiple Minder processes. This should work for now. It adds a small amount of deployment complexity, but that seems worth it. You may also want to consider having the singleton support a "one-shot" mode to address @jhrozek's comment about wanting second-level revisits in development; you probably don't actually want constant revisits at the second level, but you do want them on demand (see the sketch after this list).

  2. I think we'll need to be nimble on the pacing. I'd start with something simple, but we may need to spread out on some of the axes @Vyom-Yadav mentions. Right now, we're looking at scales of maybe 10k repositories, which is... one every ~8.6 seconds at a 24-hour revisit. This will change if Minder takes off, of course.

  3. My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently. At the same time, that would be a major undertaking that doesn't provide much direct value, so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events. It's not clear that the event respondent needs to know whether they received a webhook or a reminder kick -- I'd aim to keep it that way; the reconciliation logic should be idempotent.
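For the "one-shot" idea in point 1, a minimal sketch of how a hypothetical reminder binary could expose it; the flag names and the pass function are illustrative assumptions:

package main

import (
  "flag"
  "log"
  "time"
)

var (
  oneShot  = flag.Bool("one-shot", false, "run a single reconciliation pass and exit")
  interval = flag.Duration("interval", 24*time.Hour, "delay between passes")
)

func main() {
  flag.Parse()
  for {
    // runPass is hypothetical: it would enqueue revisit kicks for stale entities.
    if err := runPass(); err != nil {
      log.Printf("pass failed: %v", err)
    }
    if *oneShot {
      return // on-demand revisit for development and testing
    }
    time.Sleep(*interval)
  }
}

func runPass() error {
  log.Println("would scan for stale entities here") // placeholder
  return nil
}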

@evankanderson (Member)

Oh, and the first point of scaling, if we needed it, would be a read replica, which should be able to handle 2M repos under management at a scan rate of less than 300 rows/s (2M rows spread over 24 hours is only ~23 rows/s). If we're reading in an index-aligned way, that should be trivial, and even if not, it should be pretty easy unless we do some sort of order by/limit/offset hijinks.

@Vyom-Yadav (Contributor, Author) commented Feb 6, 2024

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.

@evankanderson Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case? (How about NATS?)

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.

Now that you've mentioned message queues, let's just implement that directly.

@jhrozek (Contributor) commented Feb 7, 2024

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.
>
> Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case?

(Full disclosure: I don't have much working experience with message queues, so this is more of a gut feeling than anything backed by hard data.)

I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.
>
> Now that you've mentioned message queues, let's just implement that directly.

One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...

@Vyom-Yadav (Contributor, Author)

> I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.

Watermill is a wrapper over the queue SDKs (with additional features). We would still use Watermill, but we have to decide whether we want Kafka, NATS, RabbitMQ, etc. Each has a unique set of features, so we should decide what's best for the use case. Honestly, I also haven't worked much with message queues, so I'm unsure which one would be best suited.

> One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...

Agreed, using queues would be significantly better.

@jhrozek (Contributor) commented Feb 7, 2024

> I would prefer not to tie Minder to a specific message queue implementation. Wasn't abstraction one of the reasons we chose Watermill in the first place? IIRC Watermill also has SQS and JetStream backends.
>
> Watermill is a wrapper over the queue SDKs (with additional features). We would still use Watermill, but we have to decide

Ah, then we just misunderstood each other; I read your earlier comment as wanting to go directly for the message queue.

@evankanderson (Member)

> My preference would be to use events on an externally-configured queue (i.e. rather than the in-binary configuration we have today with Watermill). Using an external queue allows us to "freeze" the contract between the two components so each can evolve independently.
>
> @evankanderson Which message queue would you suggest? I know you're keen on Amazon SQS, but what would you suggest for the current use case? (How about NATS?)

Sorry, missed this question -- I would suggest using Watermill for now, and making "extract the message queue" a separate epic that would apply to all our existing message queueing in Minder.

I'm partial to tools like SQS, EventBridge, or something like Google's Cloud Tasks which manage delivery as a weakly ordered set of deliveries, rather than a strongly-ordered log like Kafka. However, I think that's a separate "plumbing" task, so let's not conflate the two.

> so I think the short-term answer is to have a "kick" RPC that triggers the Minder server to enqueue an event similar to the current webhook events.
>
> Now that you've mentioned message queues, let's just implement that directly.

Please don't! Feel free to put the RPC-to-Minder behind an interface that we could replace with a one-way delivery mechanism, but switching to an external message queueing system would be a major surgery that should be separate from the revisit mechanism.

(Basically, we'd be taking the vertical stack from the Watermill router diagram and externalizing it into configuration managed by a tool like Terraform)

[Image: Watermill router diagram]

@evankanderson (Member)

To explain the externalized config further: the Watermill Router concept supports a simple "named topic" fan-out, but SNS, EventBridge, and Google Pub/Sub support filtering messages based on message attributes (e.g. attributes.tier=0 and attributes.type = "repo" in Google syntax, or {"tier": [{"numeric": ["=", 0]}], "type": ["repo"]} in AWS syntax). A sketch of such a filtered subscription follows below.
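As a concrete illustration, a minimal sketch of creating a filtered subscription with the Google Cloud Pub/Sub Go client; the project, topic, and subscription names are made up for the example (note that Pub/Sub attribute values are strings, so the tier is quoted in the filter):

package main

import (
  "context"
  "log"

  "cloud.google.com/go/pubsub"
)

func main() {
  ctx := context.Background()
  client, err := pubsub.NewClient(ctx, "example-project") // hypothetical project
  if err != nil {
    log.Fatal(err)
  }
  defer client.Close()

  // Only messages tagged as tier-0 repo events reach this subscription;
  // the fan-out and filtering live in the external service, not in Minder.
  _, err = client.CreateSubscription(ctx, "repo-workers", pubsub.SubscriptionConfig{
    Topic:  client.Topic("minder-events"), // hypothetical topic
    Filter: `attributes.tier = "0" AND attributes.type = "repo"`,
  })
  if err != nil {
    log.Fatal(err)
  }
}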

Using one of the above providers would also allow us to absorb some of the existing middlewares -- some of the telemetry and the Poison / Retry management would move into the external service.

This is also a pretty big commitment that requires a design document and consideration of tradeoffs (cost, scale, difficulty-of-use, credentials), and we probably will end up putting it behind an interface to allow some agility later.

@Vyom-Yadav (Contributor, Author)

> Using one of the above providers would also allow us to absorb some of the existing middlewares -- some of the telemetry and the Poison / Retry management would move into the external service.

That's exactly why I wanted to do it with queues.

> Please don't! Feel free to put the RPC-to-Minder behind an interface that we could replace with a one-way delivery mechanism, but switching to an external message queueing system would be a major surgery that should be separate from the revisit mechanism.

Makes sense. Let's keep that as a separate issue. Later we can replace the RPC call with a message publish. Thanks for the detailed explanation.

@evankanderson (Member)

> One reason a message queue might be nicer than an RPC call in this respect is that you might get a notion of when a task in the queue was finished, as opposed to merely claimed by a worker that might disappear before the task finishes. I would hope that in this case it wouldn't have a practical effect, since the next "tick" would just reinsert the same refresh, but still...
>
> Agreed, using queues would be significantly better.

I was suggesting the following pseudocode:

func (s *Server) MinderRevisit(ctx context.Context, req *RevisitRequest) (*EmptyResp, error) {
  // Convert the RPC request into a queue message.
  msg := ConvertRevisitToMsg(req)

  // Publish synchronously so a failed publish surfaces as an RPC error to the caller.
  if err := s.Router.Publish(msg); err != nil {
    return nil, err
  }
  return &EmptyResp{}, nil
}

With the pattern of synchronously publishing the event and then returning the error from the publish, you shouldn't have to worry about the task being lost. In general, a message queue will not notify you when the message has been delivered -- once the queue takes over, delivery to all destinations is eventually guaranteed, but you don't get to know when it happened (it's asynchronous and may be plural -- one publish may result in three subscribers getting the message, so "done" is ambiguous in those circumstances).

@Vyom-Yadav (Contributor, Author)

Yeah, we'll use something like that, with some authn/z mechanism that works for both the reminder process and users trying to do a one-shot operation.
