The Alps Job Visualization Tool is designed to record and visualize hardware metrics of nodes used in a SLURM job. It provides various graphs to help detect anomalies and identify potential issues in node performance.
- Collects performance data from nodes in a running SLURM job.
- Processes and organizes the data for easy analysis.
- Generates visualizations to highlight anomalies and performance trends.
Follow these steps to set up the tool:
- Clone the repository:

  ```shell
  git clone https://github.com/swiss-ai/alps-job-reporting-tool
  cd alps-job-reporting-tool
  ```

- Create a Python environment and install the dependencies from the `requirements.txt` file.
To collect and process data for a given SLURM job, follow these steps:
- Ensure the SLURM job is running.
- Run the following command from the main directory:

  ```shell
  ./metrics_downloader.sh <job_id> [duration]
  ```

  - `<job_id>`: the SLURM job ID for which data will be collected.
  - `[duration]` (optional): the logging period in seconds for node data collection. The default is 300 seconds (5 minutes).
- The collected data and an HTML report are saved in the `outputs/<job_id>_<date>` folder.
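Because each run writes to its own timestamped folder, a small helper can locate the newest report for a given job. The sketch below assumes only the `outputs/<job_id>_<date>` naming pattern described above, with an ISO-style date so that lexicographic order matches chronological order; the demonstration builds a throwaway directory tree rather than touching real output:

```python
import pathlib
import tempfile

def latest_output_dir(outputs, job_id):
    """Return the most recent outputs/<job_id>_<date> folder, or None."""
    candidates = sorted(pathlib.Path(outputs).glob(f"{job_id}_*"))
    return candidates[-1] if candidates else None

# Demonstration against a temporary directory standing in for outputs/.
root = pathlib.Path(tempfile.mkdtemp())
for name in ["123456_2024-05-01", "123456_2024-06-01", "999999_2024-06-01"]:
    (root / name).mkdir()
print(latest_output_dir(root, "123456").name)  # 123456_2024-06-01
```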
Both CSV and Parquet files are saved, allowing users to perform additional analysis and visualize the data using their preferred tools.
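The CSV output lends itself to quick scripted checks. As a minimal sketch of such follow-up analysis — the `node` and `gpu_util` column names are invented for illustration, since the real schema depends on which metrics were collected — this flags nodes whose mean utilization lags well behind their peers:

```python
import csv
import io
import statistics

# Invented excerpt standing in for one of the saved CSV files; the real
# columns depend on which hardware metrics were collected.
SAMPLE = """node,gpu_util
nid001,91.0
nid001,93.0
nid001,92.0
nid002,90.5
nid002,12.0
nid002,91.5
"""

def mean_util_per_node(text):
    """Group rows by node and compute each node's mean utilization."""
    by_node = {}
    for row in csv.DictReader(io.StringIO(text)):
        by_node.setdefault(row["node"], []).append(float(row["gpu_util"]))
    return {node: statistics.mean(vals) for node, vals in by_node.items()}

means = mean_util_per_node(SAMPLE)
# Flag nodes whose mean utilization falls well below the best node's;
# such stragglers are exactly what the generated graphs help spot.
stragglers = [n for n, m in means.items() if m < 0.8 * max(means.values())]
print(stragglers)  # ['nid002']
```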
If there are any metrics or visualizations that you think would be useful, let us know so that we can add them.