
Conversation

@hweawer (Collaborator) commented Jul 11, 2025

🔧 Key Changes

1. Enhanced Agent Server Logging
   • Improved readiness check error reporting: Error messages now include component identifiers (scheduler:, build-index:, tracker:) for better debugging (see the sketch after this list)
   • Better error message formatting: Standardized error output format across all health check components
   • Enhanced request logging middleware: Added detailed request context logging for better observability
2. Code Quality & Testing Improvements
   • Test robustness: Updated test mocks to use gomock.Any() instead of context.Background() for more flexible testing
   • Field naming consistency: Fixed struct field naming (e.g., readinessCacheTTL → ReadinessCacheTTL)
   • Improved error assertions: Updated test expectations to match new error message formats
3. Robust Docker Image Building Process
   • Enhanced Dockerfile patterns: Implemented more reliable apt-get commands across all service Dockerfiles
   • Better package management: Added retry mechanisms and cleanup for package installations
4. Configuration & Error Handling
   • Removed puller dependencies: Cleaned up unused puller-related code
   • Fixed configuration issues: Resolved various configuration parsing and validation errors
   • Better timeout handling: Improved timeout context management for container operations
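
A minimal sketch of how the component-prefixed readiness errors might be assembled (the helper name is an assumption — the PR only shows the resulting scheduler:/build-index:/tracker: prefixes; needs import "fmt"):

// Sketch only: prefix each health-check failure with its component
// identifier so the aggregated readiness error is easy to attribute.
func readinessError(component string, err error) error {
	return fmt.Errorf("%s: %w", component, err) // e.g. "tracker: <underlying error>"
}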

@hweawer self-assigned this Jul 11, 2025
@hweawer force-pushed the improve-kraken-agent-logging branch from 159aea2 to 7247d25 on July 16, 2025 at 12:36
@hweawer force-pushed the improve-kraken-agent-logging branch from 7247d25 to f4bd0d9 on July 17, 2025 at 08:14
// applyDefaults sets default values for configuration.
func (c *Config) applyDefaults() {
	if c.DownloadTimeout == 0 {
		c.DownloadTimeout = 5 * time.Minute
	}
}
Collaborator

Please increase this default download timeout to at least 15 minutes.

Collaborator Author

Increased
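
For reference, a sketch of what the bumped default might look like (the exact merged value isn't visible in this thread; 15 minutes is the reviewer's requested floor):

// applyDefaults sets default values for configuration.
func (c *Config) applyDefaults() {
	if c.DownloadTimeout == 0 {
		c.DownloadTimeout = 15 * time.Minute // raised from 5m per review; value assumed
	}
}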

@gkeesh7 (Collaborator) left a comment

Please address the comments. Let's keep the changeset size limited to 150 lines per PR as much as possible.

@@ -1,6 +1,25 @@
-FROM debian:10
+FROM debian:11
Collaborator

Let us use debian:12 instead? It is the current stable release: https://www.debian.org/releases/

Collaborator Author

Done


// Validate tag format
if strings.TrimSpace(tag) == "" {
	return handler.ErrorStatus(http.StatusBadRequest)
}
Collaborator

Why do we need to do this check?

Collaborator Author

This check is just a slight optimisation to avoid triggering build-index for invalid tags.
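
In other words, the check fails fast before build-index is ever contacted. A hedged sketch of where it sits (the handler shape and tag extraction are assumptions; needs net/http, strings, and the handler package):

func (s *Server) putTagHandler(w http.ResponseWriter, r *http.Request) error {
	tag := r.URL.Query().Get("tag") // however the tag is actually extracted
	// Reject obviously invalid tags up front: no build-index round trip.
	if strings.TrimSpace(tag) == "" {
		return handler.ErrorStatus(http.StatusBadRequest)
	}
	// ...only a non-empty tag reaches build-index...
	return nil
}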

logger = logger.With("namespace", namespace, "digest", d.String())
logger.Debugw("downloading blob")

// Try to get file from cache first
Collaborator

Can you explain what we are changing in the logic here?

Collaborator Author

f, err := s.cads.Cache().GetFileReader(d.Hex())
if err != nil {
	if os.IsNotExist(err) || s.cads.InDownloadError(err) {
		if err := s.sched.Download(namespace, d); err != nil {
			if err == scheduler.ErrTorrentNotFound {
				return handler.ErrorStatus(http.StatusNotFound)
			}
			return handler.Errorf("download torrent: %s", err)
		}
		f, err = s.cads.Cache().GetFileReader(d.Hex())
		if err != nil {
			return handler.Errorf("store: %s", err)
		}
	} else {
		return handler.Errorf("store: %s", err)
	}
}
if _, err := io.Copy(w, f); err != nil {
	return fmt.Errorf("copy file: %s", err)
}
return nil

The only logic change is that the blob is now downloaded on any error returned from the cache; previously that happened only on os.IsNotExist(err) || s.cads.InDownloadError(err). The new code also removes four levels of if nesting.
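
So the refactored flow presumably looks roughly like this (a sketch reconstructed from the description above, not the PR's exact diff):

f, err := s.cads.Cache().GetFileReader(d.Hex())
if err != nil {
	// Any cache error now triggers a download attempt, not only
	// os.IsNotExist(err) || s.cads.InDownloadError(err).
	if err := s.sched.Download(namespace, d); err != nil {
		if err == scheduler.ErrTorrentNotFound {
			return handler.ErrorStatus(http.StatusNotFound)
		}
		return handler.Errorf("download torrent: %s", err)
	}
	f, err = s.cads.Cache().GetFileReader(d.Hex())
	if err != nil {
		return handler.Errorf("store: %s", err)
	}
}
if _, err := io.Copy(w, f); err != nil {
	return fmt.Errorf("copy file: %s", err)
}
return nil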

f, err := s.cads.Cache().GetFileReader(d.Hex())

// Get the downloaded file
reader, err = s.getCachedBlob(d)
Collaborator

Why are we getting the blob from cache twice?

Collaborator Author

1. Get from cache:
   • If cache hit -> return.
   • If cache miss, fall through to step 2.
2. Download, then get from cache again (the blob is in the cache after the download).

case <-checkCtx.Done():
}
}()
wg.Wait()
Collaborator

Why aren't we waiting for all 3 checks to finish for the readiness check?

Collaborator Author

We are waiting in the select inside the loop. With just a WaitGroup it was possible to hang indefinitely:

  • 3 goroutines start
  • the nginx call times out
  • the server goroutine stays stuck in wg.Wait

Now all the calls have timeout protection.
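
A minimal sketch of the pattern being described (names are illustrative; the PR's actual code also uses a WaitGroup, while this sketch leans on the buffered channel and the collection loop; needs context, fmt, time):

type checkResult struct {
	name string
	err  error
}

type check struct {
	name string
	run  func(context.Context) error
}

func readiness(ctx context.Context, timeout time.Duration, checks []check) error {
	checkCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	results := make(chan checkResult, len(checks)) // buffered: senders never block
	for _, c := range checks {
		go func(c check) {
			// Each run shares checkCtx, so it is expected to honour the deadline.
			results <- checkResult{c.name, c.run(checkCtx)}
		}(c)
	}

	for range checks {
		select {
		case r := <-results:
			if r.err != nil {
				return fmt.Errorf("%s: %w", r.name, r.err) // component-prefixed error
			}
		case <-checkCtx.Done():
			// Hard stop: the handler can never hang waiting on a stuck check.
			return fmt.Errorf("readiness: %w", checkCtx.Err())
		}
	}
	return nil
}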

var wg sync.WaitGroup
results := make(chan checkResult, 3)

wg.Add(3)
Collaborator

I believe we should not modify the behaviour of the readiness check if the purpose of the PR is to improve logging in the Kraken agent.

Collaborator

+1, can we split into 2 PRs?

@hweawer requested a review from gkeesh7 on July 29, 2025 at 11:05
@Anton-Kalpakchiev (Collaborator) left a comment

Overall feedback - is it possible to split into smaller PRs to improve reviewability?

ReadinessTimeout time.Duration `yaml:"readiness_timeout"`

// Enable detailed request logging
EnableRequestLogging bool `yaml:"enable_request_logging"`
Collaborator

How are we planning to use this flag internally? Are we going to enable it globally or only on some agents?

Are there any performance limitations as to how many logs or what log cardinality kraken-agent can have when deployed on Job Controller?
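
For context, a hedged sketch of how such a flag might gate the middleware, so agents with the flag off pay nothing on the hot path (names are assumptions — the PR only shows the config field; needs net/http, time, go.uber.org/zap):

func requestLogging(enabled bool, logger *zap.SugaredLogger, next http.Handler) http.Handler {
	if !enabled {
		return next // flag off: hand back the handler untouched
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		logger.Debugw("request",
			"method", r.Method,
			"path", r.URL.Path,
			"duration", time.Since(start))
	})
}

Enabling it per agent rather than globally would also bound the log volume and cardinality raised above.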


( \
  unset http_proxy https_proxy no_proxy && \
- apt-get update -o Acquire::Retries=3 -o Acquire::http::No-Cache=true \
+ apt-get update --allow-releaseinfo-change -o Acquire::Retries=10 -o Acquire::http::No-Cache=true -o Acquire::http::timeout=60 -o Acquire::ForceIPv4=true \
Collaborator

How does the retry logic work here? Is this relevant to logging? Could we please split the functional changes into another PR?
