Weekly reports on performance #2

Open
MickaelBergem opened this issue Sep 10, 2018 · 15 comments
Labels: feature (New feature request), need-more-details (Extra specification is needed)

Comments

@MickaelBergem
Contributor

MickaelBergem commented Sep 10, 2018

Need: to get a periodic overview of the general performance and/or major events on a given set of monitors.

This may include:

  • mean response time
  • number and nature of downtime
  • uptime % or downtime minutes
  • major events (response time dropped by 20% compared with last week)
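
A "major event" like the one in the last bullet could be detected with a simple week-over-week comparison. This is only a sketch: the function names are hypothetical, and the 20% threshold is the example figure from the list above, not a settled value.

```python
from typing import Optional


def relative_change(current_ms: float, previous_ms: float) -> Optional[float]:
    """Relative change of this week's mean response time vs. last week's.

    Returns None when there is no usable baseline (hypothetical convention).
    """
    if previous_ms <= 0:
        return None
    return (current_ms - previous_ms) / previous_ms


def is_major_event(current_ms: float, previous_ms: float, threshold: float = 0.20) -> bool:
    """True when the response time moved by at least `threshold` in either direction."""
    change = relative_change(current_ms, previous_ms)
    return change is not None and abs(change) >= threshold


# Mean response time went from 100 ms last week to 80 ms this week: a 20% drop.
print(is_major_event(80.0, 100.0))
# A 5% drop stays below the threshold.
print(is_major_event(95.0, 100.0))
```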

Display channels:

Any comment from the community is welcome! 📣

MickaelBergem added the feature (New feature request) and need-more-details (Extra specification is needed) labels on Sep 10, 2018
MickaelBergem self-assigned this on Sep 10, 2018
@MickaelBergem
Contributor Author

MickaelBergem commented Jul 9, 2020

Here is the draft visual of this weekly/monthly report. @jpcaruana @pascalandy would the information in this report be useful to you? (you had upvoted this feature)
image

Now is the best time to add / remove / improve the information that goes into this report :)

@pascalandy

pascalandy commented Jul 9, 2020

I would add titles like:

Weekly stats

bla bla ...

monthly stats

bla bla ...

@pascalandy

Not sure the little green and red triangles add value.

@jpcaruana

I like it (I like the green/red triangles: if everything is green, I don't have to read).

It would be great if you could also add a list of the worst routes (from the APM part), the top 5 by impact.

@jpcaruana

As a reference, I like the weekly email from Sentry:

image

It gives me a good sense of "everything is good/bad" at a glance, and I can delve into the details (not shown in my screenshot).

@MickaelBergem
Contributor Author

Thank you both for the quick answer!

> I would add titles like:
>
> Weekly stats
>
> bla bla ...
>
> monthly stats
>
> bla bla ...

@pascalandy Just to make sure I understand what you mean, you would like to receive the weekly+monthly report every week? So that you can somehow have some longer-term stats on the performance?

@jpcaruana thanks! The Sentry report is indeed quite useful. Adding APM data in the performance report will come next, I can definitely see the value.

@pascalandy

Correct, it's a two for one :-P

> you would like to receive the weekly+monthly report every week

@MickaelBergem
Contributor Author

Update: I've been thinking a bit about how to best design this email (in terms of content more than in terms of UI), and here are my thoughts.

User experience

As a user of HowFast, I have limited time, so I will just trash the email if I don't care. To make it easier for the user to know if they should care or not, the email subject has cues such as:

  • Weekly performance report for GéoSchool: no incident this week 💪
  • Weekly performance report for HowFast: 7 monitors currently down 😮
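
To make the idea concrete, here is a minimal sketch of how such a subject line could be assembled. The function name and parameters are hypothetical, and the wording for the "all up, but incidents happened" case is a placeholder, since that case is still undecided.

```python
def report_subject(team: str, monitors_down: int, incidents: int) -> str:
    """Build the email subject so the reader can triage without opening the email."""
    prefix = f"Weekly performance report for {team}: "
    if monitors_down > 0:
        # Monitors currently down are the most important cue.
        noun = "monitor" if monitors_down == 1 else "monitors"
        return prefix + f"{monitors_down} {noun} currently down 😮"
    if incidents == 0:
        return prefix + "no incident this week 💪"
    # Placeholder wording: which metric to show in this case is an open question.
    noun = "incident" if incidents == 1 else "incidents"
    return prefix + f"{incidents} {noun} this week"


print(report_subject("GéoSchool", monitors_down=0, incidents=0))
print(report_subject("HowFast", monitors_down=7, incidents=3))
```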

If all monitors are up but there were incidents in the past week, I'm still unsure what is most useful. We could go with the total number of minutes spent down ("45 minutes of cumulative downtime this week"), the maximum ("longest incident lasted 6h"), or something else. Cumulative downtime becomes much less meaningful as soon as you have 3+ monitors: maybe all three went down at the same time, and you end up reporting 2h15 of downtime although the outage only lasted 45 minutes.
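
The three-monitor problem can be illustrated with a short sketch that compares the plain sum of incident durations, the time covered once overlapping incidents are merged, and the longest single incident. The function names and the `(start, end)` tuple representation are assumptions made for the example.

```python
from datetime import datetime, timedelta


def summed_downtime(incidents):
    """Naive total: overlapping incidents on different monitors are counted several times."""
    return sum((end - start for start, end in incidents), timedelta())


def merged_downtime(incidents):
    """Wall-clock time during which at least one monitor was down (union of intervals)."""
    total = timedelta()
    current_start = current_end = None
    for start, end in sorted(incidents):
        if current_end is None or start > current_end:
            # Gap before this incident: close the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current merged interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def longest_incident(incidents):
    """Duration of the longest single incident."""
    return max((end - start for start, end in incidents), default=timedelta())


# Three monitors going down together for the same 45 minutes.
start = datetime(2020, 7, 6, 10, 0)
incidents = [(start, start + timedelta(minutes=45))] * 3
print(summed_downtime(incidents))   # 2:15:00
print(merged_downtime(incidents))   # 0:45:00
print(longest_incident(incidents))  # 0:45:00
```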

Based on this information, the user can decide to archive/trash the email (especially if there were no incidents), or to read it.

If the user only has to spend 10 seconds scanning through the email, what are the most important metrics? See "head metrics" below. I'm not sure about this part for now.

image

The current design highlights the monitors that are currently down (the most important information) along with a short explanation of what is happening. If the user needs more information, the rows are clickable and open the monitor in HowFast.

Head metrics

This is the "Sentry-like" report. I'm in favor of adding those big numbers at the top of the email to provide a synthetic view of what happened, but I need to figure out the details. What makes sense in the context of HowFast?

For now I can only think of the number of monitors currently down, the number of incidents last week, and maybe the slowest average response time. Which metrics would you be interested in seeing?

Weekly AND monthly metrics in a single mail

Given the number of monitors for some of the teams using HowFast, having two tables will make the email super long and harder to read, so I'm not convinced this will add value. I will study the possibility of adding an extra "uptime over the last 30 days" column to the report, while making it clear that the other one is the "uptime over the last 7 days". Those numbers might well be easier to show directly inside HowFast than in an email.
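
As an illustration of the two columns, the sketch below derives both uptime figures from the same list of downtime intervals by clipping each incident to the reporting window. The function name is hypothetical, and it assumes the incidents of a single monitor never overlap each other.

```python
from datetime import datetime, timedelta


def uptime_percent(incidents, now, days):
    """Uptime over the last `days` days, given (start, end) downtime intervals of one monitor."""
    window = timedelta(days=days)
    window_start = now - window
    down = timedelta()
    for start, end in incidents:
        # Only count the part of the incident that falls inside the window.
        overlap_start = max(start, window_start)
        overlap_end = min(end, now)
        if overlap_end > overlap_start:
            down += overlap_end - overlap_start
    return 100.0 * (1 - down / window)


now = datetime(2020, 7, 27)
# One 3-day outage that ended 17 days ago: invisible in the 7-day column.
incidents = [(now - timedelta(days=20), now - timedelta(days=17))]
print(round(uptime_percent(incidents, now, days=7), 2))
print(round(uptime_percent(incidents, now, days=30), 2))
```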

Next up

Currently the implementation is almost ready, and will be rolled out in the next few days. If you are interested, you can opt in and start receiving the reports in your mailbox, so that you can see what it will look like with your numbers. I would love to hear your feedback!

@MickaelBergem
Contributor Author

MickaelBergem commented Jul 30, 2020

The first batch of weekly reports was sent this Monday, with some very good results:

  • several teams removed old monitors that had been down for months
  • a few teams tweaked the monitoring configuration
  • a few teams added new monitors

Overall, several teams were able to get more value out of HowFast thanks to this report.

The next batch will get extra data included, related to certificates expiring soon (in less than two weeks). This will help make sure that even if no notification is configured for the affected monitors, the team can still learn about it.
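
A minimal sketch of the "expiring soon" selection, assuming the expiry date of each monitor's certificate is already known. The function name and the mapping format are hypothetical; the two-week window comes from the description above, and already-expired certificates are deliberately left out of this list.

```python
from datetime import datetime, timedelta


def certificates_expiring_soon(expiry_by_monitor, now, within=timedelta(days=14)):
    """Names of monitors whose certificate expires in less than `within`, soonest first."""
    deadline = now + within
    soon = [
        (expires_at, name)
        for name, expires_at in expiry_by_monitor.items()
        if now <= expires_at < deadline
    ]
    return [name for _expires_at, name in sorted(soon)]


now = datetime(2020, 7, 30)
expiry_by_monitor = {
    "blog": now + timedelta(days=40),
    "shop": now + timedelta(days=13),
    "api": now + timedelta(days=3),
}
print(certificates_expiring_soon(expiry_by_monitor, now))
```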

Feel free to share your feedback :)

@jpcaruana

jpcaruana commented Jul 30, 2020

Hi,

I like these emails and I'm looking forward to seeing them get better and better.

Would it be possible to choose the order of the monitors in the email? I have a lot of monitors, and production monitors are my main focus for this kind of weekly digest (the rest is more informational for me).

Thanks!

@MickaelBergem
Contributor Author

Thank you for your feedback @jpcaruana! Currently, the monitors are ordered by:

  1. status (monitors that are down first)
  2. increasing uptime (so that you can see the problematic monitors first)
  3. decreasing response time (if all your monitors have 100%, you probably want to focus on the slower ones first)
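
The three criteria above map naturally onto a composite sort key. The sketch below is an illustration only: `MonitorRow` and its field names are hypothetical, not HowFast's actual data model.

```python
from dataclasses import dataclass


@dataclass
class MonitorRow:
    name: str
    is_down: bool
    uptime_percent: float
    response_time_ms: float


def report_order(rows):
    """Down monitors first, then increasing uptime, then decreasing response time."""
    return sorted(
        rows,
        # False sorts before True, so `not is_down` puts down monitors first;
        # negating the response time turns ascending order into descending.
        key=lambda row: (not row.is_down, row.uptime_percent, -row.response_time_ms),
    )


rows = [
    MonitorRow("a", is_down=False, uptime_percent=100.0, response_time_ms=120.0),
    MonitorRow("b", is_down=True, uptime_percent=90.0, response_time_ms=300.0),
    MonitorRow("c", is_down=False, uptime_percent=99.5, response_time_ms=80.0),
    MonitorRow("d", is_down=False, uptime_percent=100.0, response_time_ms=450.0),
]
print([row.name for row in report_order(rows)])  # ['b', 'c', 'd', 'a']
```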

I'm trying to think about a way to make it work for you in this context. I assume we could somehow add a flag for "production" monitors and display those first. Do you see another way to make it work in your case? I will try to think about it.
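
One possible way to layer such a flag on top of the existing ordering is a second, stable sort on the flag alone, so that the current order is preserved inside each group. This is only a sketch under assumptions: the `production` key and the dictionary rows are hypothetical, and whether a down non-production monitor should still come before a healthy production one is a separate design question.

```python
def production_first(ordered_rows):
    """Move flagged monitors to the top while keeping the existing order within each group."""
    # sorted() is stable: rows with equal keys keep their relative order.
    return sorted(ordered_rows, key=lambda row: not row.get("production", False))


ordered_rows = [
    {"name": "staging-api", "production": False},
    {"name": "prod-api", "production": True},
    {"name": "dev"},
    {"name": "prod-web", "production": True},
]
print([row["name"] for row in production_first(ordered_rows)])
# ['prod-api', 'prod-web', 'staging-api', 'dev']
```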

@jpcaruana

> I assume we could somehow add a flag for "production" monitors, and display those first, do you see another way to make it work in your case?

This seems like a perfect solution for my use case. You could also use this flag in the web UI, I guess.

@MickaelBergem
Contributor Author

@jpcaruana you mentioned having the most impactful endpoints listed in the email. Would that work if it's based on all the APMs in your team, including the non-prod ones, or would that make the result mostly useless? I started working on this and might very well be able to send you the results for your team so that you can double-check, but maybe you already know.

@MickaelBergem
Contributor Author

Here is a draft of the APM summary:

v0.1
image

v0.2
application-performance-monitoring-howfast

I think the impact measured in ms per minute makes sense (i.e. the number of milliseconds a worker spends working on this endpoint during an average minute) and adds value.
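
Read that way, the impact of an endpoint is the total time spent serving it divided by the length of the observation window in minutes. The sketch below ranks endpoints on that basis; the function name, the `(endpoint, duration_ms)` input format and the default limit of 5 are assumptions for illustration, not the actual HowFast computation.

```python
from collections import defaultdict


def top_endpoints_by_impact(requests, window_minutes, limit=5):
    """Rank endpoints by impact in ms per minute over the observation window.

    `requests` is an iterable of (endpoint, duration_ms) pairs.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    total_ms = defaultdict(float)
    for endpoint, duration_ms in requests:
        total_ms[endpoint] += duration_ms
    impacts = [(endpoint, total / window_minutes) for endpoint, total in total_ms.items()]
    # Highest impact first.
    return sorted(impacts, key=lambda item: item[1], reverse=True)[:limit]


requests = [
    ("/home", 50.0), ("/home", 70.0), ("/home", 60.0),
    ("/search", 300.0), ("/search", 500.0),
]
# Over a 10-minute window: /search costs 80 ms per minute, /home 18 ms per minute.
print(top_endpoints_by_impact(requests, window_minutes=10))
```

Note that a rarely called but slow endpoint can outrank a fast one that is called often, which is what makes this a measure of impact rather than of latency alone.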

@jpcaruana

The current weekly report works great: I think you can close this issue, @MickaelBergem :)

3 participants