feat(job_monitor): log job pod errors, disruptions to warning (#468) #468

Merged
merged 3 commits into from
Oct 21, 2024

Conversation

@jlemesh (Member) commented Sep 4, 2024

Related to reanahub/reana#824

Fix multiline log formatting. Log job pod errors, evictions, and other problematic events at the WARNING level. Cover the new functionality with tests.

Example logs (with FluentBit configured to collect job-controller logs):

OOM killed:

2024-09-04 11:08:11,795 | root | MainThread | INFO | Publishing step:0, cmd: python "code/helloworld.py" --inputfile "data/names.txt" --outputfile "results/greetings.txt" --sleeptime 1, total steps 1 to MQ
2024-09-04 11:08:13,734 | root | kubernetes_job_monitor | WARNING | Job reana-run-job-ad805df8-6d8c-4b00-9942-25ca994af5ad-f62zc was terminated, reason: StartError, message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
2024-09-04 11:08:17,805 | root | MainThread | INFO | Workflow 43250308-f42b-4506-bee8-556be0fbe754 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/43250308-f42b-4506-bee8-556be0fbe754.

DeadlineExceeded:

2024-09-04 11:08:57,032 | root | MainThread | INFO | Publishing step:0, cmd: python "code/helloworld.py" --inputfile "data/names.txt" --outputfile "results/greetings.txt" --sleeptime 1, total steps 1 to MQ
2024-09-04 11:09:04,959 | root | kubernetes_job_monitor | WARNING | Job reana-run-job-9ee1aba5-73fd-4ac2-a655-a8cf6937712a-clzzk was terminated, reason: Error, message: None
2024-09-04 11:09:04,959 | root | kubernetes_job_monitor | WARNING | DeadlineExceeded: The job was killed due to exceeding timeout of 3 seconds.
2024-09-04 11:09:09,059 | root | MainThread | INFO | Workflow c80f79cc-1acc-47c7-bb03-6b6ce57cc60c finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/c80f79cc-1acc-47c7-bb03-6b6ce57cc60c.

Eviction:

2024-09-04 11:52:45,493 | root | MainThread | INFO | Publishing step:0, cmd: python "code/helloworld.py" --inputfile "data/names.txt" --outputfile "results/greetings.txt" --sleeptime 1, total steps 1 to MQ
2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING | Job reana-run-job-e8494ddb-cc11-40b3-94b9-caa39ad74003-tbstk was terminated, reason: Error, message: None
2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING | Job reana-run-job-e8494ddb-cc11-40b3-94b9-caa39ad74003 was disrupted: Eviction API: evicting, reason: EvictionByEvictionAPI
2024-09-04 11:53:09,563 | root | MainThread | INFO | Workflow 7ed08030-414e-4389-9ee3-a1c1ecbe400e finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/7ed08030-414e-4389-9ee3-a1c1ecbe400e.
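
The kind of check that produces warnings like the ones above can be sketched as follows. This is only an illustration, not the code from this PR: the function name `log_pod_problems` is hypothetical, the pod is a plain dict in the shape of the Kubernetes Pod JSON (`status.containerStatuses[].state.terminated` and `status.conditions[]`), and treating every termination reason other than `Completed` as a problem is an assumption.

```python
import logging


def log_pod_problems(pod: dict, logger: logging.Logger) -> list:
    """Log terminated containers and disruptions of a pod at WARNING level.

    Hypothetical helper: `pod` is a dict shaped like the Kubernetes Pod JSON.
    Returns the list of messages that were logged.
    """
    messages = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    status = pod.get("status", {})

    # Containers that terminated for any reason other than normal completion.
    for container in status.get("containerStatuses") or []:
        terminated = (container.get("state") or {}).get("terminated")
        if terminated and terminated.get("reason") != "Completed":
            messages.append(
                f"Pod {name} was terminated, "
                f"reason: {terminated.get('reason')}, "
                f"message: {terminated.get('message')}"
            )

    # Disruptions (for example evictions) are reported as a pod condition.
    for condition in status.get("conditions") or []:
        if (
            condition.get("type") == "DisruptionTarget"
            and condition.get("status") == "True"
        ):
            messages.append(
                f"Pod {name} was disrupted: {condition.get('message')}, "
                f"reason: {condition.get('reason')}"
            )

    for message in messages:
        logger.warning(message)
    return messages
```

A job monitor could call such a helper for each pod event it receives and let the configured log handler forward the warnings to the log collector.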

Without a multiline formatter, multiline errors cannot be displayed properly to REANA users via FluentBit.

Old multiline format:

2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING | line 1
    line 2
        line 3

New format:

2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING | line 1
2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING |     line 2
2024-09-04 11:53:03,719 | root | kubernetes_job_monitor | WARNING |         line 3

With the old format, FluentBit drops the line 2 and line 3 entries, so the user sees only line 1. With the new format, every line carries the full log prefix, so FluentBit keeps all of them.
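
A formatter that produces the new format can be sketched with the standard `logging` module. This is a minimal sketch, not the formatter added in this PR: the class name `MultilineFormatter` and the `LOG_FORMAT` string are assumptions chosen to match the example output above. The prefix is obtained by formatting a copy of the record with an empty message, and is then prepended to every continuation line.

```python
import logging

# Assumed format string, chosen to match the example log lines above.
LOG_FORMAT = "%(asctime)s | %(name)s | %(threadName)s | %(levelname)s | %(message)s"


class MultilineFormatter(logging.Formatter):
    """Repeat the log prefix on every line of a multiline message (sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        formatted = super().format(record)
        lines = formatted.splitlines()
        if len(lines) <= 1:
            return formatted

        # Format a copy of the record with an empty message to get the prefix.
        header = logging.makeLogRecord(record.__dict__)
        header.msg = ""
        header.args = None
        header.exc_info = None
        header.exc_text = None
        header.stack_info = None
        prefix = super().format(header)

        # The first line already has the prefix; add it to the others.
        return "\n".join([lines[0]] + [prefix + line for line in lines[1:]])


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(MultilineFormatter(LOG_FORMAT))
    logger = logging.getLogger("multiline_demo")
    logger.addHandler(handler)
    logger.warning("line 1\n    line 2\n        line 3")
```

Because tracebacks are appended to the formatted text before it is split, exception lines get the same prefix, which is what a line-oriented collector needs.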

feat(utils): add multiline log formatter (#468)
feat(job_manager): log pod errors to warning (#468)
feat(job_monitor): log pod errors, disruptions to warning (#468)

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 81.08108% with 7 lines in your changes missing coverage. Please review.

Project coverage is 47.95%. Comparing base (891aeab) to head (db9c258).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
reana_job_controller/kubernetes_job_manager.py 60.00% 4 Missing ⚠️
reana_job_controller/job_monitor.py 70.00% 3 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #468      +/-   ##
==========================================
+ Coverage   47.04%   47.95%   +0.91%     
==========================================
  Files          17       17              
  Lines        1271     1299      +28     
==========================================
+ Hits          598      623      +25     
- Misses        673      676       +3     
Files with missing lines Coverage Δ
reana_job_controller/factory.py 90.90% <100.00%> (+0.90%) ⬆️
reana_job_controller/utils.py 50.00% <100.00%> (+9.70%) ⬆️
reana_job_controller/job_monitor.py 47.67% <70.00%> (+0.49%) ⬆️
reana_job_controller/kubernetes_job_manager.py 63.41% <60.00%> (+0.49%) ⬆️

jlemesh added a commit to jlemesh/reana-job-controller that referenced this pull request Sep 4, 2024
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from 5bd1f9f to 59ef4d0 Compare September 4, 2024 14:51
@jlemesh changed the title from "feat(job_monitor): log pod errors, disruptions to warning (#468)" to "feat(job_monitor): log job pod errors, disruptions to warning (#468)" Sep 4, 2024
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from 59ef4d0 to 561287e Compare September 6, 2024 07:21
jlemesh added a commit to jlemesh/reana-job-controller that referenced this pull request Sep 6, 2024
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from 561287e to 1a4b97a Compare September 6, 2024 07:23
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from 61acfe7 to e0f375d Compare September 6, 2024 09:45
jlemesh added a commit to jlemesh/reana-job-controller that referenced this pull request Sep 16, 2024
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from e0f375d to 891049c Compare September 16, 2024 12:25
jlemesh added a commit to jlemesh/reana-job-controller that referenced this pull request Sep 19, 2024
@jlemesh jlemesh force-pushed the feature_log_pod_info_to_warnings branch from 891049c to e685b6b Compare September 19, 2024 11:19
@tiborsimko tiborsimko force-pushed the feature_log_pod_info_to_warnings branch from e685b6b to db9c258 Compare October 21, 2024 17:08
@tiborsimko (Member) left a comment

Works nicely 👍

Squashed the doc commit into the first one and slightly rephrased the commit headlines for the release news.

@tiborsimko tiborsimko merged commit db9c258 into reanahub:master Oct 21, 2024
14 checks passed