Scheduler status visibility #478

Open
sihil opened this issue Jan 18, 2018 · 4 comments

Comments

@sihil
Contributor

sihil commented Jan 18, 2018

A scheduled job (see #476) can fail to kick off for a few reasons, and it should be really obvious through the interface when this has happened. Similarly, if a scheduled job fails, someone probably wants to hear about it.

We might:

  • Add a status dashboard that shows all failed scheduled jobs.
  • Add a topic or other notification mechanism for letting people know about failed jobs.
@alexduf
Contributor

alexduf commented Jan 22, 2018

Could adding an email address field when scheduling the deploy be a simple way to get feedback in case of an issue?

@sihil
Contributor Author

sihil commented Feb 9, 2018

@nicl Now that we've had some use of this, shall we pick it up again on Monday to figure out what we need?

@nicl
Contributor

nicl commented Feb 12, 2018

Talking with @sihil, @adamnfish has also suggested sending emails on any failed deploy, which would also solve this issue (in a basic sense).

@sihil
Contributor Author

sihil commented Feb 12, 2018

In order to implement this we need some way of getting an e-mail address to notify. I suggest that we use Prism Owners for this (https://github.com/guardian/prism/blob/master/app/data/Owners.scala).

To use Prism Owners we'd need to gather the set of SSAs being deployed, look them up in Prism and then actually fire off the e-mails. This means finding a place in the code where we can detect failure and also have access to the set of SSAs (or data that allows us to derive it). I suspect that this place is the DeployGroupRunner, which has the DeployContext (containing the parameters and the task graph) and also sees the failure events. Having said that, the task graph has lost easy access to the app and stack data, so this will likely be non-trivial.
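To make the three steps above concrete (gather the SSAs, look up their owners, send the e-mails), here is a rough Python sketch. Everything in it is hypothetical and for illustration only: the `SSA` shape (a stack with an optional stage and app), the `owners` mapping standing in for a Prism lookup, the fallback to a stack-wide owner, and the `send_email` callback are all assumptions, not the actual Prism or DeployGroupRunner API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class SSA:
    # Assumed shape: a stack with an optional stage and app.
    stack: str
    stage: Optional[str] = None
    app: Optional[str] = None


def owner_emails(ssas: Iterable[SSA], owners: Dict[SSA, str]) -> Set[str]:
    """Resolve each deployed SSA to an owner address, skipping unknown ones."""
    emails: Set[str] = set()
    for ssa in ssas:
        # Try the exact SSA first, then fall back to a stack-wide owner
        # (the fallback rule is an assumption made for this sketch).
        email = owners.get(ssa) or owners.get(SSA(ssa.stack))
        if email is not None:
            emails.add(email)
    return emails


def notify_failure(
    deploy_id: str,
    ssas: Iterable[SSA],
    owners: Dict[SSA, str],
    send_email: Callable[[str, str], None],
) -> Set[str]:
    """Send one failure e-mail per distinct owner; return who was notified."""
    recipients = owner_emails(ssas, owners)
    for address in sorted(recipients):
        send_email(address, f"Deploy {deploy_id} failed")
    return recipients
```

Returning the recipient set keeps the function easy to check without a real mail transport. The hard part noted above, deriving the SSAs from the task graph at the point where failure is detected, is outside this sketch: it simply takes the SSAs as an input.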
