
Connect reminder service to minder server to dispatch reminders #3630

Open
Vyom-Yadav wants to merge 1 commit into stacklok:main from Vyom-Yadav:send-reminder-events

Conversation

Contributor

@Vyom-Yadav Vyom-Yadav commented Jun 16, 2024

Summary

Provide a brief overview of the changes and the issue being addressed.
Explain the rationale and any background necessary for understanding the changes.
List dependencies required by this change, if any.

Issue #2262 (Task - I)

Adds logic for sending reminders to the minder server. The minder server processes these reminders and triggers a TopicQueueReconcileRepoInit event to reconcile the repository.
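
For orientation, here is a minimal sketch of the sending side, using the event shape and topic name that appear elsewhere in this PR (the publishReminder helper and its wiring are illustrative, not the PR's actual code):

package reminder

import (
    "encoding/json"

    "github.com/ThreeDotsLabs/watermill"
    "github.com/ThreeDotsLabs/watermill/message"
    "github.com/google/uuid"
)

// RepoReminderEvent mirrors the struct reviewed further down in this PR.
type RepoReminderEvent struct {
    Project      uuid.UUID `json:"project"`
    RepositoryID int64     `json:"repository" validate:"gte=0"`
    ProviderID   uuid.UUID `json:"provider"`
}

// publishReminder sends one reminder for the minder server to pick up
// and turn into a TopicQueueReconcileRepoInit event.
func publishReminder(pub message.Publisher, project, provider uuid.UUID, repoID int64) error {
    payload, err := json.Marshal(RepoReminderEvent{
        Project:      project,
        RepositoryID: repoID,
        ProviderID:   provider,
    })
    if err != nil {
        return err
    }
    return pub.Publish("repo.reminder.event", message.NewMessage(watermill.NewUUID(), payload))
}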

Change Type

Mark the type of change your PR introduces:

  • Bug fix (resolves an issue without affecting existing features)
  • Feature (adds new functionality without breaking changes)
  • Breaking change (may impact existing functionalities or require documentation updates)
  • Documentation (updates or additions to documentation)
  • Refactoring or test improvements (no bug fixes or new functionality)

Testing

Outline how the changes were tested, including steps to reproduce and any relevant configurations.
Attach screenshots if helpful.

Tested locally (no unit tests)

Review Checklist:

  • Reviewed my own code for quality and clarity.
  • Added comments to complex or tricky code sections.
  • Updated any affected documentation.
  • Included tests that validate the fix or feature.
  • Checked that related changes are merged.

cc @evankanderson

@Vyom-Yadav Vyom-Yadav requested a review from a team as a code owner June 16, 2024 16:09
Comment on lines +198 to +203
// Commit the transaction i.e update reminder_last_sent
// only if the messages were sent successfully
if err = tx.Commit(); err != nil {
logger.Error().Err(err).Msg("unable to commit transaction")
errorSlice = append(errorSlice, err)
}
Contributor Author

UpdateReminderLastSentById queries are part of a transaction and are only committed if the messages are published successfully. Still, the commit could fail, which would result in reminders being sent but the reminder_last_sent field not being updated (better than a false positive, i.e. reminder_last_sent updated even though reminders weren't sent).
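
For readers skimming the thread, the pattern has roughly the following shape (a sketch; the store and qtx helper names are assumed from minder's store API rather than copied from the diff):

tx, err := store.BeginTransaction()
if err != nil {
    return []error{err}
}
defer tx.Rollback()

qtx := store.GetQuerierWithTransaction(tx)
for _, repo := range repos {
    // Record reminder_last_sent inside the transaction...
    if err := qtx.UpdateReminderLastSentById(ctx, repo.ID); err != nil {
        errorSlice = append(errorSlice, err)
    }
}

// ...but only commit after the batch is published, so a failed publish
// never marks reminders as sent.
if err := publisher.Publish(topic, messages...); err != nil {
    errorSlice = append(errorSlice, fmt.Errorf("error publishing messages: %w", err))
} else if err := tx.Commit(); err != nil {
    errorSlice = append(errorSlice, err)
}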

Contributor

I think this is fine. Did you consider splitting sending the messages into a separate function, or returning early if len(messages) == 0? This might be a personal preference, but I find code with fewer indentation levels easier to read.

Something like:

if len(messages) == 0 {
   return errorSlice
}

// the rest of the code

Member

It seems like the reminder_last_sent field is mostly diagnostic, because we're actually ticking forward on the (in-memory) repository cursor, rather than using reminder_last_sent to actually dictate when repos are reconsidered. I think this is correct, but that suggests to me that strict update correctness is less necessary.

(I'm not arguing against keeping this data, but it seems like the actual mechanism that prevents re-evaluation is when the reminder is received and a rule evaluation is complete, which could be minutes after the reminder is sent on a particularly bad day.)

Comment on lines +162 to +166
// TODO: Collect Metrics
// Potential metrics:
// - Gauge: Number of reminders in the current batch
// - UpDownCounter: Average reminders sent per batch
// - Histogram: reminder_last_sent time distribution
Contributor Author

How do we want to export the metrics? And do you have any thoughts on adding these metrics?

Contributor

Hmm, I wonder what metrics would be useful to a minder operator.

As for the histogram, I think one that shows how out of date (how far past the cutoff) a repo was when it was selected would be useful.

I think the metrics would be interesting for fine-tuning the algorithm; did you see some other value in them?

Since reminder is a separate service, I guess implementation-wise it should just have its own metrics server that can be scraped separately.

Member

My $0.02 would be:

  1. Create a MeterProvider à la https://github.com/stacklok/minder/blob/main/internal/controlplane/server.go#L188.
  2. Wire the metrics into prometheus. Since this has no external interface, you can simply set up an internal http.Server to serve the metrics -- we could put other debug information there later.
  3. If you call otel.SetMeterProvider in the server setup, you don't need to worry about plumbing directly from the code recording the metric to the server; OTel will handle this in the backend (with some globals, but that's fine in this case). That means that the metrics code can do the simple thing from the OTel docs:
    // Assumes imports: "fmt", "go.opentelemetry.io/otel",
    // and "go.opentelemetry.io/otel/metric".
    var timeShift metric.Float64Histogram

    func init() {
      meter := otel.GetMeterProvider().Meter("reminder") // Or could be ""
      var err error
      timeShift, err = meter.Float64Histogram("send_delay",
        metric.WithDescription("Delay in revisit after eligible"),
        metric.WithUnit("s"))
      if err != nil {
        panic(fmt.Sprintf("couldn't register meter: %v", err))
      }
    }

    // At call site:
    timeShift.Record(ctx, sendDelay) // No need for attributes
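
(For points 1 and 2, the wiring could look roughly like the sketch below, using the OTel Prometheus exporter; the function name and port are illustrative:)

package reminder

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func setupMetrics() error {
    // The Prometheus exporter doubles as a Reader for the metrics SDK.
    exporter, err := prometheus.New()
    if err != nil {
        return err
    }
    // Set globally so recording code can use otel.GetMeterProvider().
    otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

    // Internal-only HTTP server; other debug endpoints could live here later.
    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.Handler())
    go func() {
        _ = http.ListenAndServe("localhost:9090", mux)
    }()
    return nil
}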

@coveralls

Coverage Status

coverage: 53.167% (-0.2%) from 53.4%
when pulling d480686 on Vyom-Yadav:send-reminder-events
into cd7a0f2 on stacklok:main.

@coveralls

Coverage Status

coverage: 53.172% (-0.2%) from 53.4%
when pulling d480686 on Vyom-Yadav:send-reminder-events
into cd7a0f2 on stacklok:main.

@coveralls

Coverage Status

coverage: 53.177% (-0.2%) from 53.4%
when pulling d480686 on Vyom-Yadav:send-reminder-events
into cd7a0f2 on stacklok:main.

@evankanderson evankanderson self-requested a review June 20, 2024 13:02
@evankanderson evankanderson linked an issue Jun 20, 2024 that may be closed by this pull request
Contributor

@jhrozek jhrozek left a comment

The code looks quite good to me. I left replies to your comments; neither of them is blocking. It wasn't clear to me whether you wanted to tackle the metrics as part of this PR, but I normally prefer smaller PRs.

Member

@evankanderson evankanderson left a comment

It looks like the current architecture re-uses the existing events configuration for Minder, since it's assuming that the repo.reminder.event events are in a Watermill instance which Minder is configured to see. I think that's fine, but I didn't recall for sure where we'd landed on that earlier. This setup is 👍 for me from a simplicity PoV.

Comment on lines +29 to +34
// Project is the project that the event is relevant to
Project uuid.UUID `json:"project"`
// RepositoryID is id of the repository to be reconciled
RepositoryID int64 `json:"repository" validate:"gte=0"`
// ProviderID is the provider of the repository
ProviderID uuid.UUID `json:"provider"`
Member

My preference would be to order these Project > Provider > Repo, which is the ownership order.
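
That is, roughly (a sketch; the struct name is inferred from RepoReminderEventFromMessage used elsewhere in this PR):

// Ownership order: the project owns the provider, which owns the repository.
type RepoReminderEvent struct {
    // Project is the project that the event is relevant to
    Project uuid.UUID `json:"project"`
    // ProviderID is the provider of the repository
    ProviderID uuid.UUID `json:"provider"`
    // RepositoryID is the id of the repository to be reconciled
    RepositoryID int64 `json:"repository" validate:"gte=0"`
}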

// Project is the project that the event is relevant to
Project uuid.UUID `json:"project"`
// RepositoryID is id of the repository to be reconciled
RepositoryID int64 `json:"repository" validate:"gte=0"`
Member

IMO, this should use the id field from the repositories table, not the repo_id from GitHub. That will help insulate this from GitHub-specific fields that are different in (for example) GitLab or BitBucket in the future.

Member

Additionally, id is the primary key. It looks like our current indexes are set up off repo_id; sorry for not noticing this previously...

Member

Looking further, it looks like we use the GitHub-matching repo_id in NewRepoReconcilerMessage already, so that argues towards using it here as well.

Comment on lines +61 to +64
validate := validator.New()
if err := validate.Struct(&evt); err != nil {
return nil, err
}
Member

What's the benefit of calling validate to check that the numeric repo_id is non-negative? The caller of this method will still need to check whether repo_id actually exists in the database, so this seems to add extra code without accomplishing the necessary validation; that validation requires a database connection, and would presumably return the actual row if needed.
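
(The database-backed check described above might look like the sketch below; GetRepositoryByRepoID is a hypothetical query name, and the errors, database/sql, and fmt imports are assumed:)

// Existence in the database is the real validation, and the lookup
// returns the row for the caller to use.
repo, err := qtx.GetRepositoryByRepoID(ctx, evt.RepositoryID)
if err != nil {
    if errors.Is(err, sql.ErrNoRows) {
        return nil, fmt.Errorf("repository %d not found: %w", evt.RepositoryID, err)
    }
    return nil, err
}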


return []error{err}
}

defer tx.Rollback()
Member

Do we need all of the following database reads and writes in a single transaction, given that neither the getRepositoryBatch call above nor the r.updateRepositoryCursor in the outer getRepositoryBatch call is part of it?

Time("previously", repo.ReminderLastSent.Time).
Msg("updating reminder_last_sent")

err = qtx.UpdateReminderLastSentById(ctx, repo.ID)
Member

Is it intentional that you record the reminder as sent before you actually send the message?

... except that with transactions, this actually sort-of happens after the message was sent, too -- you may be taking out a lock on the repo row for the duration of the iteration. Transactions can make before/after behavior hard to reason about in some cases, so I'd be cautious about introducing them when they aren't needed / aren't producing specific guarantees.

Comment on lines 183 to +184
logger.Error().Err(err).Str("repo", repo.ID.String()).Msg("unable to update reminder_last_sent")
return []error{err}
errorSlice = append(errorSlice, err)
Member

In general:

Either handle an error (by logging, recovering, etc) OR propagate an error, but don't do both.

Logging the error and also propagating it often means that you'll end up with errors being double- or even triple-logged, as different parts of the stack log the error but then pass it up to be logged in the next part of the stack.
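
(A minimal illustration of the two alternatives, in generic Go rather than the PR's code:)

// Option A: handle the error here (log it) and swallow it; the caller never sees it.
if err := doWork(); err != nil {
    logger.Error().Err(err).Msg("doWork failed")
}

// Option B: propagate it with added context, and let exactly one caller log it.
if err := doWork(); err != nil {
    return fmt.Errorf("doing work: %w", err)
}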

Comment on lines +193 to +195
if err != nil {
errorSlice = append(errorSlice, fmt.Errorf("error publishing messages: %w", err))
} else {
Member

Rather than an if/else here, generally prefer to use early-return, particularly since this seems like a terminal case.

More generally, I'm not convinced that there's a large benefit to batching the Publish messages, rather than sending each in the previous part of the loop.
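
(That is, something like the sketch below, with the publisher and transaction names assumed from the surrounding diff:)

if err := publisher.Publish(topic, messages...); err != nil {
    errorSlice = append(errorSlice, fmt.Errorf("error publishing messages: %w", err))
    return errorSlice
}

// The success path continues without an extra indentation level.
if err := tx.Commit(); err != nil {
    errorSlice = append(errorSlice, err)
}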


Comment on lines +45 to +63
func (rp *ReminderProcessor) reminderMessageHandler(msg *message.Message) error {
evt, err := remindermessages.RepoReminderEventFromMessage(msg)
if err != nil {
return fmt.Errorf("error unmarshalling reminder event: %w", err)
}

log.Info().Msgf("Received reminder event: %v", evt)

repoReconcileMsg, err := reconcilermessages.NewRepoReconcilerMessage(evt.ProviderID, evt.RepositoryID, evt.Project)
if err != nil {
return fmt.Errorf("error creating repo reconcile event: %w", err)
}

// This is a non-fatal error, so we'll just log it and continue with the next ones
if err := rp.evt.Publish(events.TopicQueueReconcileRepoInit, repoReconcileMsg); err != nil {
log.Printf("error publishing reconciler event: %v", err)
}
return nil
}
Member

It feels like this is simply forwarding messages from one queue to another; why not send the TopicQueueReconcileRepoInit message directly from reminder.go and skip this extra intermediary? (There might be a good reason, but if there is, we should capture it in a comment.)
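
(The suggested shortcut would be roughly the sketch below, assuming reminder.go has access to the same events publisher as the processor:)

// In reminder.go: publish the reconcile message directly instead of
// round-tripping through repo.reminder.event.
msg, err := reconcilermessages.NewRepoReconcilerMessage(providerID, repoID, projectID)
if err != nil {
    return fmt.Errorf("error creating repo reconcile event: %w", err)
}
if err := evt.Publish(events.TopicQueueReconcileRepoInit, msg); err != nil {
    return fmt.Errorf("error publishing reconciler event: %w", err)
}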

Development

Successfully merging this pull request may close these issues.

Add Background Reconciliation for Entities
4 participants