cylc server monitor #72

Open
oliver-sanders opened this issue Dec 18, 2019 · 7 comments


@oliver-sanders
Member

oliver-sanders commented Dec 18, 2019

Writing this up here as it's uncertain where this code would live.

At the Met Office we have a pool of 16 servers which suites can run on. To help us keep track of the health of these servers and the usage of Cylc on them, we wrote a tool which provides a web dashboard with:

  • Plots of the number of running suites, server load, memory usage and CPU for the suite servers.
  • Plot of the number of suites running under each Cylc version.
  • Graph of the number of running suites and active users.
  • Listing of all suites running on the suite servers with info such as run time and memory usage.
  • Listing of all processes running on the suite servers which aren't Cylc processes.

This is important functionality for larger sites, and there are lots of ways in which it can be improved (e.g. daily job counts).

This code is written in Python2.6 and has to run in a bare environment so is kinda ugly and not especially portable. It needs a re-write!

We should be able to re-implement this functionality within the Cylc UI/UI-Server infrastructure to provide an admin dashboard. That way this functionality would ship with Cylc and be available to all.

This would involve creating a dashboard for Cylc admins (we could make it accessible to all users). It would require an always-running UI Server under a specified account which, depending on site specifics, may require certain privileges to be effective. It will need to maintain a database; sqlite3 is more than sufficient.

Infrastructure aside the actual code component is pretty simple:

  • The current implementation is a few thousand lines of code; with modern Python and packaging, a direct re-implementation could be much lighter.
  • Server info can come from psutil.
  • Suite listings can come from the UI Server suite discovery service.
  • The rest is data management and plotting.
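
As a rough illustration of the "server info" point, here is a minimal sketch of one sample taken with psutil. The function name `sample_server` and the dictionary keys are hypothetical and not taken from the existing tool; psutil is a third-party package and `os.getloadavg` is only available on Unix-like hosts.

```python
import os
import time


def sample_server():
    """Return one snapshot of load, memory and CPU usage for this host."""
    # Third-party dependency (pip install psutil); imported here so the
    # module can still be loaded on hosts where it is not installed.
    import psutil

    load_1min, _, _ = os.getloadavg()  # Unix only
    return {
        "time": time.time(),
        "load_1min": load_1min,
        "memory_percent": psutil.virtual_memory().percent,
        # A short blocking interval so the first call returns a real value.
        "cpu_percent": psutil.cpu_percent(interval=0.1),
    }
```

Calling `sample_server()` on a timer and appending the result to a store would give the time series needed for the load, memory and CPU plots.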

Infrastructure wise:

  • The server side component of this code could live in:
    • The UI Server
    • Another repo which provides plugin extensions to the UI Server.
  • The UI component of the code could live in:
    • The UI
      • Would inflate dependencies but thanks to webpack this wouldn't impact load times.
    • Another Vue project.
@kinow
Member

kinow commented Dec 18, 2019

I think for the UI we can actually leverage existing tools.

Graphite, Prometheus, Grafana, and so many other tools are able to digest this sort of information.

Our dashboard could then have dummy components that simply use these other libraries - or we could even simply use the tools in the UI.

These tools are also common in cloud deployments, so if the server side is able to produce a JSON document in the format for Prometheus (for example) users would be able to choose their monitoring and even alerting solution.
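
For reference, Prometheus scrapes a plain-text exposition format rather than JSON, so the server side would only need to render lines like the ones produced below. This is a minimal sketch, not a proposed implementation: the metric name `cylc_running_suites`, the `server` label and the host names are made up for illustration, and label-value escaping is deliberately left out.

```python
def render_prometheus(metric, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Label values are assumed to need no escaping (no quotes, backslashes
    or newlines).
    """
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{name}="{val}"' for name, val in labels)
        lines.append(f"{metric}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical data: number of running suites per server.
print(render_prometheus(
    "cylc_running_suites",
    "Number of running suites.",
    {
        (("server", "server01"),): 12,
        (("server", "server02"),): 7,
    },
))
```

Serving this text from an HTTP endpoint would let Prometheus (and therefore Grafana and its alerting) consume the data without any Cylc-specific frontend code.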

Just my 0.02 cents, but great idea and should be fun to implement.

@oliver-sanders
Member Author

Lots of fun plotting libraries we could use, interesting point on alerting, the old system does this in the Python backend.

@hjoliver
Member

hjoliver commented Dec 18, 2019

Grafana etc. are really nice; we should certainly look at using something like that (in due course) since you say a rewrite is needed anyway.

@oliver-sanders
Member Author

oliver-sanders commented Dec 18, 2019

We could potentially keep the old frontend, but it wouldn't take long to re-write so let's do it properly!

Some screenshots of the old frontend for reference:

[Screenshots: exvcylc01, exvcylc02, exvcylc03]

Some issues hanging over from the old system transcribed from the old issue tracker (sticky notes on my desk):

  • Better colour scheme to make managing larger numbers of servers easier.
  • Ability to view all metrics on the same plot (at least for a single server).
  • Bar chart of the number of suites running under each Cylc version should be a line plot.
  • Bar chart of the number of suites running on each Cylc server should be a line plot.
  • Integrate CLI utility.

Some screenshots of a Python3 CLI utility which works with the JSON dump files produced by the old system:

$ suitetool3 --latest
# 732 rows in dataset

0  Add field     Add derived field to the data set.               
1  Filter        Filter by field value.                           
2  View          Print all data                                   
3  Summary       Print the first few rows of data.                
4  Count         Count unique values for a given field.           
5  Debug         Insert pdb breakpoint                            
6  Export Data   Export the current dataset as a CSV file.        
7  Email Users   Send an email to all users present in the dataset
8  Stack Action  (undo, export, import)                           
9  Exit                                                           

Choose an action (int): 0

0  suite_dir       The FS location of the suite directory.               
1  root_dir        The FS mount which the suite is installed on.         
2  shared_account  True if the account is *likely* to be a shared account
3  suite_grep      Grep *.rc files against a pattern.                    
4  diff            Diff suites present at another checkpoint.            
5  cylc_tags       Tuple of taggs for the cylc_version                   
Choose a field (int): 1
[=============================================================================]
[=============================================================================]

# 732 rows in dataset

0  Add field     Add derived field to the data set.               
1  Filter        Filter by field value.                           
2  View          Print all data                                   
3  Summary       Print the first few rows of data.                
4  Count         Count unique values for a given field.           
5  Debug         Insert pdb breakpoint                            
6  Export Data   Export the current dataset as a CSV file.        
7  Email Users   Send an email to all users present in the dataset
8  Stack Action  (undo, export, import)                           
9  Exit                                                           

Choose an action (int): 4
Available fields "server, suite_id, user_name, user_id, cylc_version, memory, cpu, run_days, last_activity, suite_dir, root_dir"
Choose field: root_dir

field: root_dir
unique items: 4

frequency
---------
/net/home           413
/net/data           289
/net/spice/scratch  29 
/net/spice/project  1  

items
-----
/net/data|/net/home|/net/spice/scratch|/net/spice/project
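
The "Count" action shown above amounts to a frequency table over one field. A minimal sketch of that idea follows; it is not the actual suitetool3 code, and `count_field` and the sample rows are hypothetical.

```python
from collections import Counter


def count_field(rows, field):
    """Return (value, frequency) pairs for one field, most common first."""
    return Counter(row[field] for row in rows).most_common()


# Hypothetical rows; the real dataset comes from the JSON dump files.
rows = [
    {"user_name": "alice", "root_dir": "/net/home"},
    {"user_name": "bob", "root_dir": "/net/data"},
    {"user_name": "carol", "root_dir": "/net/home"},
]

for value, frequency in count_field(rows, "root_dir"):
    print(f"{value:<20}{frequency}")
```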

@oliver-sanders
Member Author

The neatest way to implement this is likely as a JupyterHub service.

This will allow us to run the extension with the hub account privileges if necessary and provide integration with cylc hub. The most obvious place for the code to live is the cylc-uiserver repository (we can omit it from the standard installation by making it an optional dependency if desired).

The service would scrape Cylc processes from ps listings (e.g. via psutil) and store the results in a housekept sqlite3 db or in raw data files. It would register endpoints exposing this data for a lightweight web app.
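
A minimal sketch of the scrape, store and housekeep cycle described above, under several assumptions: the table layout and function names are hypothetical, a "Cylc process" is naively taken to be any process with "cylc" in its command line, and psutil is a third-party package imported lazily so the storage code works without it.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS cylc_processes (
    sampled_at REAL NOT NULL,
    pid INTEGER NOT NULL,
    user_name TEXT,
    command TEXT,
    rss_bytes INTEGER
)
"""


def scrape_cylc_processes():
    """Yield one row per process whose command line mentions "cylc"."""
    import psutil  # third-party; only needed for the scrape itself

    now = time.time()
    attrs = ["pid", "username", "cmdline", "memory_info"]
    for proc in psutil.process_iter(attrs):
        info = proc.info
        # Attributes we are not permitted to read come back as None.
        cmdline = info["cmdline"] or []
        if any("cylc" in arg for arg in cmdline):
            mem = info["memory_info"]
            yield (now, info["pid"], info["username"],
                   " ".join(cmdline), mem.rss if mem else None)


def store(conn, rows):
    """Create the table if needed and insert the sampled rows."""
    conn.execute(SCHEMA)
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO cylc_processes VALUES (?, ?, ?, ?, ?)", rows)


def housekeep(conn, max_age_seconds, now=None):
    """Delete samples older than max_age_seconds."""
    cutoff = (time.time() if now is None else now) - max_age_seconds
    with conn:
        conn.execute(
            "DELETE FROM cylc_processes WHERE sampled_at < ?", (cutoff,))
```

A service loop would then call something like `store(conn, scrape_cylc_processes())` on a timer, followed by `housekeep(conn, max_age_seconds=...)` with whatever retention period the site wants.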

oliver-sanders changed the title from "suite server monitor" to "cylc server monitor" on Mar 27, 2023
@oliver-sanders
Member Author

This is worth a look: someone worked out how to "proxy" Grafana as a JupyterHub service - https://github.com/rcthomas/jupyterhub-prometheus-grafana

@hjoliver
Member

(For the record, now using the original "exvcylc" monitor at NIWA, it's super helpful).
