
Feature/multi loader logs collection #598

Open: wants to merge 15 commits into base branch main

nosnelmil (Contributor)

Summary

Extends the multi-loader to collect key logs from cluster nodes when running on the Knative platform. Users can optionally collect any of the following:

  • TOP (resource usage metrics)
  • Prometheus snapshots
  • Logs from the Activator Pod
  • Logs from the Autoscaler Pod
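These log types are selected through the Metrics config field described under Implementation Notes. The Go sketch that follows shows one way such a field might be parsed and validated. It assumes a JSON config with a top-level Metrics array. ExperimentConfigSketch, allowedMetrics, and parseMetrics are hypothetical names and are not the multi-loader's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExperimentConfigSketch is a hypothetical subset of the multi-loader config;
// only the Metrics field described in this PR is modelled.
type ExperimentConfigSketch struct {
	Metrics []string `json:"Metrics"`
}

// allowedMetrics holds the values the PR says the Metrics field accepts.
var allowedMetrics = map[string]bool{
	"top":        true,
	"prometheus": true,
	"activator":  true,
	"autoscaler": true,
}

// parseMetrics decodes a JSON config and rejects any unknown metric name.
func parseMetrics(raw []byte) ([]string, error) {
	var cfg ExperimentConfigSketch
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("invalid config: %w", err)
	}
	for _, m := range cfg.Metrics {
		if !allowedMetrics[m] {
			return nil, fmt.Errorf("unsupported metric %q", m)
		}
	}
	return cfg.Metrics, nil
}
```

In this sketch, omitting Metrics yields an empty slice, meaning no extra logs are collected.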

Implementation Notes ⚒️

  • Added a Metrics field to the multi-loader config that accepts an array containing any of the following values: top, prometheus, activator, autoscaler.
  • Introduced optional fields MasterNode, ActivatorNode, AutoscalerNode, and WorkerNodes so users can specify node IPs manually instead of relying on the multi-loader to discover them (rarely needed in typical setups).
  • Uses kubectl to automatically determine node IPs and classify them based on their roles.
  • Resets TOP metrics for all nodes before starting any experiment.
  • Collects Activator Pod logs from:
    /var/log/pods/knative-serving_activator-*/activator/*
  • Collects Autoscaler Pod logs from:
    /var/log/pods/knative-serving_autoscaler-*/autoscaler/*
  • Copies Prometheus snapshots by first triggering a snapshot via the Prometheus API on the master node and then retrieving the generated snapshot.
  • Additionally, the log collection logic also runs during a multi-loader dry run to:
    • Validate that identified IPs are reachable.
    • Ensure SSH access and necessary permissions.
    • Execute log retrieval commands to detect potential errors.
    • Delete any collected logs after validation, as no experiments have been executed yet.
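The following is a rough Go sketch of node discovery and role classification. It rests on several assumptions. First, node and pod data has already been fetched with kubectl (for example, kubectl get nodes -o json and kubectl get pods -n knative-serving -o json) and decoded into the hypothetical NodeInfo and PodInfo types. Second, the control-plane node carries the standard node-role.kubernetes.io/control-plane label (or its legacy master form). Third, the Knative pods carry app=activator and app=autoscaler labels. None of these names come from the PR.

```go
package main

import "sort"

// NodeInfo and PodInfo are hypothetical, trimmed-down views of the data
// returned by kubectl; they are not types from the multi-loader.
type NodeInfo struct {
	Name       string
	InternalIP string
	Labels     map[string]string
}

type PodInfo struct {
	Name     string
	NodeName string
	Labels   map[string]string
}

// NodeRoles mirrors the optional MasterNode, ActivatorNode, AutoscalerNode
// and WorkerNodes config fields named in the PR.
type NodeRoles struct {
	MasterNode     string
	ActivatorNode  string
	AutoscalerNode string
	WorkerNodes    []string
}

// isControlPlane checks the standard role label and its legacy "master" form.
func isControlPlane(n NodeInfo) bool {
	_, cp := n.Labels["node-role.kubernetes.io/control-plane"]
	_, legacy := n.Labels["node-role.kubernetes.io/master"]
	return cp || legacy
}

// classifyNodes assigns the control-plane node as master, every other node as
// a worker, and the nodes hosting the activator and autoscaler pods to those roles.
func classifyNodes(nodes []NodeInfo, knativePods []PodInfo) NodeRoles {
	var roles NodeRoles
	ipByName := make(map[string]string, len(nodes))
	for _, n := range nodes {
		ipByName[n.Name] = n.InternalIP
		if isControlPlane(n) {
			roles.MasterNode = n.InternalIP
		} else {
			roles.WorkerNodes = append(roles.WorkerNodes, n.InternalIP)
		}
	}
	for _, p := range knativePods {
		// Matching on the app label skips the autoscaler-hpa pod.
		switch p.Labels["app"] {
		case "activator":
			roles.ActivatorNode = ipByName[p.NodeName]
		case "autoscaler":
			roles.AutoscalerNode = ipByName[p.NodeName]
		}
	}
	sort.Strings(roles.WorkerNodes) // deterministic order for callers and tests
	return roles
}
```

Matching on the app label rather than a pod-name prefix avoids picking up the autoscaler-hpa pod, whose name also begins with "autoscaler-".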
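The Prometheus step relies on the TSDB admin API. POST /api/v1/admin/tsdb/snapshot writes a snapshot under <data-dir>/snapshots/<name> and replies with that name, provided Prometheus was started with --web.enable-admin-api. What follows is a hedged Go sketch of the trigger step only. The helper names are hypothetical, and copying the resulting directory off the master node is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"path"
)

// snapshotResponse matches the reply documented for
// POST /api/v1/admin/tsdb/snapshot.
type snapshotResponse struct {
	Status string `json:"status"`
	Data   struct {
		Name string `json:"name"`
	} `json:"data"`
}

// parseSnapshotResponse extracts the snapshot directory name from the API reply.
func parseSnapshotResponse(body []byte) (string, error) {
	var sr snapshotResponse
	if err := json.Unmarshal(body, &sr); err != nil {
		return "", fmt.Errorf("decoding snapshot response: %w", err)
	}
	if sr.Status != "success" || sr.Data.Name == "" {
		return "", fmt.Errorf("unexpected snapshot response: %s", body)
	}
	return sr.Data.Name, nil
}

// triggerPrometheusSnapshot asks Prometheus to write a TSDB snapshot and
// returns its name; baseURL is e.g. the Prometheus address on the master node.
func triggerPrometheusSnapshot(client *http.Client, baseURL string) (string, error) {
	resp, err := client.Post(baseURL+"/api/v1/admin/tsdb/snapshot", "application/json", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("snapshot request failed: %s: %s", resp.Status, body)
	}
	return parseSnapshotResponse(body)
}

// snapshotDir is where Prometheus places the snapshot inside its data
// directory, assuming that directory is reachable from the master node.
func snapshotDir(dataDir, name string) string {
	return path.Join(dataDir, "snapshots", name)
}
```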

External Dependencies 🍀

  • N/A

Breaking API Changes ⚠️

  • N/A

@nosnelmil force-pushed the feature/multi-loader-adv-logs branch 2 times, most recently from 3dfaaaf to d5d74ac on February 4, 2025 03:16
@nosnelmil marked this pull request as draft on February 4, 2025 03:17
@nosnelmil force-pushed the feature/multi-loader-adv-logs branch 9 times, most recently from 44717ff to c4798ba on February 13, 2025 06:59
@nosnelmil marked this pull request as ready for review on February 13, 2025 07:08
@nosnelmil force-pushed the feature/multi-loader-adv-logs branch 2 times, most recently from 535398d to 0e11bd8 on February 19, 2025 16:18
@leokondrashov (Contributor) left a comment:


Looks nice. I left some comments.

@nosnelmil force-pushed the feature/multi-loader-adv-logs branch 2 times, most recently from b32c581 to 02a9edf on March 3, 2025 13:08
Signed-off-by: Lenson <[email protected]>

add node discovery validators

Signed-off-by: Lenson <[email protected]>

add collect TOP metric functions

Signed-off-by: Lenson <[email protected]>

add multi-loader metric_manager

Signed-off-by: Lenson <[email protected]>

add autoscaler log collection

Signed-off-by: Lenson <[email protected]>

add activator log collection

Signed-off-by: Lenson <[email protected]>

add prometh log collection

Signed-off-by: Lenson <[email protected]>

refactor metric manager contants

Signed-off-by: Lenson <[email protected]>

minor fix for node discovery

Signed-off-by: Lenson <[email protected]>

fix node discovery

Signed-off-by: Lenson <[email protected]>

minor fix

Signed-off-by: Lenson <[email protected]>

minor fix

Signed-off-by: Lenson <[email protected]>

add logs for prometh

Signed-off-by: Lenson <[email protected]>

add pause between prometh collection

Signed-off-by: Lenson <[email protected]>

update wait time

Signed-off-by: Lenson <[email protected]>

update condition for node discovery

Signed-off-by: Lenson <[email protected]>

update logging

Signed-off-by: Lenson <[email protected]>
Signed-off-by: Lenson <[email protected]>

update kind ssh update script

Signed-off-by: Lenson <[email protected]>

fix setup kind ssh

Signed-off-by: Lenson <[email protected]>
Signed-off-by: Lenson <[email protected]>

update setup metrics script

Signed-off-by: Lenson <[email protected]>
Signed-off-by: Lenson <[email protected]>

fix log collection test

commit a05990d
Author: Lenson <[email protected]>
Date:   Mon Feb 3 15:39:39 2025 +0800

    update test trigger

    Signed-off-by: Lenson <[email protected]>

commit 3edb3b4
Author: Lenson <[email protected]>
Date:   Mon Feb 3 15:33:06 2025 +0800

    update test

    Signed-off-by: Lenson <[email protected]>

commit 56a0f7d
Author: Lenson <[email protected]>
Date:   Mon Feb 3 15:18:40 2025 +0800

    fix

    Signed-off-by: Lenson <[email protected]>

commit 67c520d
Author: Lenson <[email protected]>
Date:   Mon Feb 3 15:06:20 2025 +0800

    fix

    Signed-off-by: Lenson <[email protected]>

commit 48ff845
Author: Lenson <[email protected]>
Date:   Mon Feb 3 14:46:29 2025 +0800

    test'

    Signed-off-by: Lenson <[email protected]>

commit 295c761
Author: Lenson <[email protected]>
Date:   Mon Feb 3 14:45:35 2025 +0800

    add adv log collection tests

    Signed-off-by: Lenson <[email protected]>

commit 8469bdb
Author: Lenson <[email protected]>
Date:   Mon Feb 3 14:45:05 2025 +0800

    update logging

    Signed-off-by: Lenson <[email protected]>

commit 10e295a
Author: Lenson <[email protected]>
Date:   Mon Feb 3 14:44:42 2025 +0800

    update kind ssh update script

    Signed-off-by: Lenson <[email protected]>

commit c56a9d8
Author: Lenson <[email protected]>
Date:   Mon Feb 3 13:19:27 2025 +0800

    add KinD ssh setup script

    Signed-off-by: Lenson <[email protected]>

commit bf9a804
Author: Lenson <[email protected]>
Date:   Mon Feb 3 10:31:55 2025 +0800

    update condition for node discovery

    Signed-off-by: Lenson <[email protected]>

commit b3f078b
Author: Lenson <[email protected]>
Date:   Fri Jan 31 18:35:03 2025 +0800

    add multi loader log collection

    Signed-off-by: Lenson <[email protected]>

    add node discovery validators

    Signed-off-by: Lenson <[email protected]>

    add collect TOP metric functions

    Signed-off-by: Lenson <[email protected]>

    add multi-loader metric_manager

    Signed-off-by: Lenson <[email protected]>

    add autoscaler log collection

    Signed-off-by: Lenson <[email protected]>

    add activator log collection

    Signed-off-by: Lenson <[email protected]>

    add prometh log collection

    Signed-off-by: Lenson <[email protected]>

    refactor metric manager contants

    Signed-off-by: Lenson <[email protected]>

    minor fix for node discovery

    Signed-off-by: Lenson <[email protected]>

    fix node discovery

    Signed-off-by: Lenson <[email protected]>

    minor fix

    Signed-off-by: Lenson <[email protected]>

    minor fix

    Signed-off-by: Lenson <[email protected]>

    add logs for prometh

    Signed-off-by: Lenson <[email protected]>

    add pause between prometh collection

    Signed-off-by: Lenson <[email protected]>

    update wait time

    Signed-off-by: Lenson <[email protected]>

commit 9bac3c4
Author: Lenson <[email protected]>
Date:   Tue Jan 21 13:00:50 2025 +0800

    update multi loader docs

    Signed-off-by: Lenson <[email protected]>

    update multi-loader docs

    Signed-off-by: Lenson <[email protected]>

commit bfd17be
Author: Lenson <[email protected]>
Date:   Mon Jan 20 16:30:13 2025 +0800

    minor multi loader fix

    Signed-off-by: Lenson <[email protected]>

    fix incorrect retry logging

    Signed-off-by: Lenson <[email protected]>

    remove iat and generated cli args

    Signed-off-by: Lenson <[email protected]>

    remove make clean from clean up

    Signed-off-by: Lenson <[email protected]>

commit 91042aa
Author: Lenson <[email protected]>
Date:   Thu Jan 16 15:53:19 2025 +0800

    update tests

    Signed-off-by: Lenson <[email protected]>

    update multi loader e2e tests

    Signed-off-by: Lenson <[email protected]>

    revert setup.cfg

    Signed-off-by: Lenson <[email protected]>

    chmod script

    Signed-off-by: Lenson <[email protected]>

    update unit tests

    Signed-off-by: Lenson <[email protected]>

    fix e2e test

    Signed-off-by: Lenson <[email protected]>

    update tests

    Signed-off-by: Lenson <[email protected]>

commit 69c3c3a
Author: Lenson <[email protected]>
Date:   Tue Dec 31 11:49:55 2024 +0800

    add failfast flag

    Signed-off-by: Lenson <[email protected]>

    update failfast flag description

    Signed-off-by: Lenson <[email protected]>

    update comments

    Signed-off-by: Lenson <[email protected]>

    update wordlist with multiloader specific words

    Signed-off-by: Lenson <[email protected]>

    simplify run experiment logic

    Signed-off-by: Lenson <[email protected]>

    refactor partial experiment naming

    Signed-off-by: Lenson <[email protected]>

    fix wrong indexing

    Signed-off-by: Lenson <[email protected]>

    add progress in logging

    Signed-off-by: Lenson <[email protected]>

commit fc3ad98
Author: Lenson <[email protected]>
Date:   Sun Nov 17 14:07:35 2024 +0800

    refactor multi loader

    Signed-off-by: Lenson <[email protected]>

    add multi-loader tests

    Signed-off-by: Lenson <[email protected]>

    update test

    Signed-off-by: Lenson <[email protected]>

    refactor multi-loader tests

    Signed-off-by: Lenson <[email protected]>

    add loader experiment

    Signed-off-by: Lenson <[email protected]>

    update logs

    Signed-off-by: Lenson <[email protected]>

    update log verbosity

    Signed-off-by: Lenson <[email protected]>

    update logs

    Signed-off-by: Lenson <[email protected]>

    update logs

    Signed-off-by: Lenson <[email protected]>

    rename multiloader driver to runner

    Signed-off-by: Lenson <[email protected]>

    refactor common files to multiloader folder

    Signed-off-by: Lenson <[email protected]>

    refactor multiloader functions

    Signed-off-by: Lenson <[email protected]>

    rename createNewStudy function name

    Signed-off-by: Lenson <[email protected]>

    fix formatting

    Signed-off-by: Lenson <[email protected]>

    remove extra features

    Signed-off-by: Lenson <[email protected]>

    remove extra features

    Signed-off-by: Lenson <[email protected]>

    add validation for platform

    Signed-off-by: Lenson <[email protected]>

commit ca5e2ad
Author: Lenson <[email protected]>
Date:   Sat Nov 16 18:49:35 2024 +0800

    add multi loader documentation

    Signed-off-by: Lenson <[email protected]>

    update docs

    Signed-off-by: Lenson <[email protected]>

    fix docs

    Signed-off-by: Lenson <[email protected]>

    update documentation

    Signed-off-by: Lenson <[email protected]>

commit 3c7e6b5
Author: Lenson <[email protected]>
Date:   Sat Nov 16 12:36:43 2024 +0800

    add multi-loader

    Signed-off-by: Lenson <[email protected]>

    add multi-loader config reader

    Signed-off-by: Lenson <[email protected]>

    add multi loader base

    Signed-off-by: Lenson <[email protected]>

    add multi loader base

    Signed-off-by: Lenson <[email protected]>

    add node group struct

    Signed-off-by: Lenson <[email protected]>

    add multi loader runner

    Signed-off-by: Lenson <[email protected]>

    refactor multi loader config

    Signed-off-by: Lenson <[email protected]>

    add multi loader config validators

    Signed-off-by: Lenson <[email protected]>

    add knative specific config enricher

    Signed-off-by: Lenson <[email protected]>

    add additional knative platform type

    Signed-off-by: Lenson <[email protected]>

    add base runner entry point

    Signed-off-by: Lenson <[email protected]>

    refactor multi loader config

    Signed-off-by: Lenson <[email protected]>

    update multi loader config struct

    Signed-off-by: Lenson <[email protected]>

    update unpack study doc

    Signed-off-by: Lenson <[email protected]>

    add unpack study

    Signed-off-by: Lenson <[email protected]>

    add prepare experiment

    Signed-off-by: Lenson <[email protected]>

    update experiment config temp path

    Signed-off-by: Lenson <[email protected]>

    add run loader function

    Signed-off-by: Lenson <[email protected]>

    update log parser

    Signed-off-by: Lenson <[email protected]>

    update log parser

    Signed-off-by: Lenson <[email protected]>

    update log parser

    Signed-off-by: Lenson <[email protected]>

    add clean up function

    Signed-off-by: Lenson <[email protected]>

    add logs to indicate run status

    Signed-off-by: Lenson <[email protected]>

    expose entry points for multi loader runner

    Signed-off-by: Lenson <[email protected]>

    add multi loader runner execution

    Signed-off-by: Lenson <[email protected]>

    update default multi loader config path

    Signed-off-by: Lenson <[email protected]>

    add cpu limit validator

    Signed-off-by: Lenson <[email protected]>

    remove extra knative feature

    Signed-off-by: Lenson <[email protected]>

    remove knative extra features

    Signed-off-by: Lenson <[email protected]>

    add multi loader tests

    Signed-off-by: Lenson <[email protected]>

    add basic config

    Signed-off-by: Lenson <[email protected]>

    update basic config

    Signed-off-by: Lenson <[email protected]>

    update basic config

    Signed-off-by: Lenson <[email protected]>

    add basic configs

    Signed-off-by: Lenson <[email protected]>

    update base config

    Signed-off-by: Lenson <[email protected]>

Signed-off-by: Lenson <[email protected]>

update e2e test

Signed-off-by: Lenson <[email protected]>
@nosnelmil force-pushed the feature/multi-loader-adv-logs branch from 02a9edf to ba47317 on March 3, 2025 13:12
@nosnelmil force-pushed the feature/multi-loader-adv-logs branch from ba47317 to bfb51ab on March 4, 2025 03:59
@nosnelmil (Contributor, Author)

@leokondrashov as discussed, added the log consolidation logic in 0dc0950

@nosnelmil nosnelmil requested a review from leokondrashov March 4, 2025 04:36

2 participants