Monitoring & alerting #78

Open
cben opened this issue Feb 20, 2015 · 11 comments

@cben
Owner

cben commented Feb 20, 2015

Currently I won't be notified if the app on RHcloud/Heroku is overloaded or returning errors
(unless it's bad enough for a pingdom/uptimerobot check to fail).
What's worse, I don't have any good way to observe current or recent status!

@cben
Owner Author

cben commented Feb 20, 2015

Also consider client-side performance and JS error logging.
It can't catch the server not responding, though.

E.g. http://www.lognormal.com/boomerang/doc/ & https://github.com/nature/boomcatch, https://github.com/getsentry/sentry
(I'm tempted to write the data to Firebase instead of running a separate server, but realistically I won't ever get around to coding that. Also, at some point I'll want an open backend alternative to Firebase.)

@cben
Owner Author

cben commented Feb 20, 2015

https://blog.openshift.com/openshift-logs-metrics-management-logstash-graphite/ lists open logging/monitoring tools [for OpenShift Enterprise, not sure what applies to OpenShift Online]

@cben
Owner Author

cben commented May 27, 2015

Collecting client-side errors would be nice, e.g. I would have learnt of #85 much earlier.
More tools at https://github.com/cjbarber/ToolsOfTheTrade#errorexception-handling
Reminder to self: if I implement that, I should scrub the console log of fragments of document content (notably CM-MJ spews all formulas).
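
A minimal sketch of the scrubbing rule, written in Python only to illustrate the idea; a real version would run in the browser before anything is reported. It assumes formulas show up in log messages between the usual math delimiters (`$...$`, `$$...$$`, `\(...\)`, `\[...\]`), which I haven't verified against actual CM-MJ output. `scrub` and `max_len` are made-up names.

```python
import re

# Assumption: formulas appear between standard math delimiters.
# The $$...$$ alternative is listed first so it wins over the single-$ form.
_MATH = re.compile(
    r"\$\$.*?\$\$|\$.*?\$|\\\(.*?\\\)|\\\[.*?\\\]",
    re.DOTALL,
)


def scrub(message: str, max_len: int = 200) -> str:
    """Replace delimited formulas with a placeholder and cap the length."""
    redacted = _MATH.sub("[formula]", message)
    if len(redacted) > max_len:
        # Long messages are more likely to carry document text, so cut them.
        redacted = redacted[:max_len] + "...[truncated]"
    return redacted
```

Delimiter matching alone won't catch plain prose from the document, so the length cap is a crude second line of defence, not a guarantee.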

@cben
Owner Author

cben commented Jun 17, 2015

The main lessons from #100 (40 min downtime on 2015-06-12, few leads as to why) are:

  • improve my understanding of haproxy (and in particular learn to read or change its log time(?) format).
  • on-server resource metrics (e.g. was it running out of cpu or ram to the point it became unresponsive?)

Unrelated: looking at pingdom and uptimerobot, I see occasional latency spikes to 1.5–2sec.
Why? Are these something negligible, like a single random lost packet, or are they periods of slowness when loading the whole page would take tens of seconds?
Also, pingdom believes the baseline latency is >300ms while uptimerobot has <100ms. Probably because they're pinging from different geographic locations.
Would be nice to have server-side latency metrics. Better yet (in case server is too loaded to even accept connections quickly), haproxy-side metrics.
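
Until there are server-side or haproxy-side numbers, a rough external probe can at least split one request into phases, which helps tell a slow connect (lost packet, DNS, distance) from a slow response. A sketch using only the Python standard library; `probe` and its result keys are made-up names, and "headers" here means the moment the status line and headers have been read, which only approximates first byte.

```python
import http.client
import time
from urllib.parse import urlsplit


def probe(url: str, timeout: float = 10.0) -> dict:
    """Time one GET: TCP(/TLS) connect, response headers, full body."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    start = time.monotonic()
    try:
        conn.connect()                 # includes DNS lookup and TLS handshake
        connected = time.monotonic()
        conn.request("GET", path)
        resp = conn.getresponse()      # returns once status line + headers are read
        headers_done = time.monotonic()
        body = resp.read()             # drain the whole body
        finished = time.monotonic()
        return {
            "status": resp.status,
            "bytes": len(body),
            "connect_s": connected - start,
            "headers_s": headers_done - start,
            "total_s": finished - start,
        }
    finally:
        conn.close()
```

Running something like `probe("https://www.mathdown.net/")` in a loop from one fixed location and logging the three timings would show whether the spikes sit in `connect_s` or in the response itself. It measures a single resource, not a whole page load.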

@cben
Owner Author

cben commented Aug 3, 2015

  • log responses: at least non-200 responses, and perhaps latency (ideally first and last byte?) for all requests.
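
To make the bullet above concrete, here is a sketch of that kind of response logging as a WSGI middleware. Python/WSGI is used purely for illustration and is not a claim about this app's server stack; `logging_middleware`, `log_all` and the record fields are made-up names. "First byte" here is when the app hands over its first non-empty chunk, not when it hits the wire.

```python
import time
from typing import Callable


def logging_middleware(app, log: Callable[[dict], None], log_all: bool = True):
    """Wrap a WSGI app; record status plus first/last byte latency per request."""

    def wrapped(environ, start_response):
        started = time.monotonic()
        record = {
            "path": environ.get("PATH_INFO", ""),
            "status": None,
            "first_byte_s": None,
            "last_byte_s": None,
        }

        def recording_start_response(status, headers, exc_info=None):
            # status looks like "200 OK"; keep only the numeric code.
            record["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        def body():
            result = app(environ, recording_start_response)
            try:
                for chunk in result:
                    if chunk and record["first_byte_s"] is None:
                        record["first_byte_s"] = time.monotonic() - started
                    yield chunk
            finally:
                if hasattr(result, "close"):
                    result.close()
                record["last_byte_s"] = time.monotonic() - started
                # Always log non-200; log everything when log_all is set.
                if log_all or record["status"] != 200:
                    log(record)

        return body()

    return wrapped
```

With `log_all=False` only non-200 responses are recorded, which is the "at least" part of the bullet. This still can't see requests that never reach the app, so it complements rather than replaces haproxy-side metrics.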

@cben
Owner Author

cben commented Aug 3, 2015

As a lesson from #104 (huge TTL on mathdown.{net,com} but OK on www.mathdown.{net,com}),
I've added all domain variations (and the underlying mathdown-cben.rhcloud.com) to Uptimerobot (the less used ones at 30min freq).

Another lesson is that I learned to use Pingdom, and it's indeed more informative on failures (shows the IP tried, runs traceroute, shows the full HTTP exchange). I should probably upgrade to a paid plan and add all domain variations there.

@cben
Owner Author

cben commented Aug 3, 2015

FOSS alternatives to check out:
http://cabotapp.com/
https://github.com/fzaninotto/uptime

@cben
Owner Author

cben commented Aug 3, 2015

Found the haproxy config: haproxy/conf/haproxy.cfg under the home dir on the first gear (apparently generated from https://github.com/openshift/origin-server/blob/master/cartridges/openshift-origin-cartridge-haproxy/versions/1.4/configuration/haproxy.cfg.erb)
Still don't understand the haproxy.log format — it's not the "httplog" configured there.

@cben
Copy link
Owner Author

cben commented Sep 20, 2015

Lesson from #117: it's hard to understand how real users, with DNS caching along the way, experience DNS flips.
In this instance Pingdom saw the flip immediately, while Uptimerobot apparently used outdated DNS for an hour or two (inferred, there is no info).

@cben
Copy link
Owner Author

cben commented Dec 16, 2015

Pingdom is reducing features on the free plan: https://www.pingdom.com/planfree
Notably, I'll lose: Public static page, 1min->5min freq, Root cause analysis (extra probing when down).

I'm getting a free Starter trial till January 28, and can upgrade until Dec 29 for $7/mo for the first year.
That's somewhat tempting, but I'm more interested in reducing expenditure now.

@cben
Owner Author

cben commented Dec 17, 2022

I've been on https://updown.io/ for a while, pretty happy with it.

  • update deployment/README.md
  • write down where I've set up TLS cert monitoring
