-
Notifications
You must be signed in to change notification settings - Fork 464
Improve kraken agent logging #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
159aea2 to
7247d25
Compare
7247d25 to
f4bd0d9
Compare
agent/agentserver/server.go
Outdated
| // applyDefaults sets default values for configuration. | ||
| func (c *Config) applyDefaults() { | ||
| if c.DownloadTimeout == 0 { | ||
| c.DownloadTimeout = 5 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please increase this default download timeout to 15 minutes atleast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Increased
gkeesh7
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address the comments, Let's keep the Changeset size limited to 150 Lines per PR as much as possible.
docker/agent/Dockerfile
Outdated
| @@ -1,6 +1,25 @@ | |||
| FROM debian:10 | |||
| FROM debian:11 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us use debian:12 instead ? It is the current stable release https://www.debian.org/releases/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| // Validate tag format | ||
| if strings.TrimSpace(tag) == "" { | ||
| return handler.ErrorStatus(http.StatusBadRequest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to do this check ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is just a slight optimisation for not triggering build-index for invalid tags
| logger = logger.With("namespace", namespace, "digest", d.String()) | ||
| logger.Debugw("downloading blob") | ||
|
|
||
| // Try to get file from cache first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what are we changing in the logic here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f, err := s.cads.Cache().GetFileReader(d.Hex())
if err != nil {
if os.IsNotExist(err) || s.cads.InDownloadError(err) {
if err := s.sched.Download(namespace, d); err != nil {
if err == scheduler.ErrTorrentNotFound {
return handler.ErrorStatus(http.StatusNotFound)
}
return handler.Errorf("download torrent: %s", err)
}
f, err = s.cads.Cache().GetFileReader(d.Hex())
if err != nil {
return handler.Errorf("store: %s", err)
}
} else {
return handler.Errorf("store: %s", err)
}
}
if _, err := io.Copy(w, f); err != nil {
return fmt.Errorf("copy file: %s", err)
}
return nil
The only thing which was changed in the logic is that in case of any error returned from cache blob will be downloaded, previously it was only on os.IsNotExist(err) || s.cads.InDownloadError(err). New code removes 4 ifs nesting
| f, err := s.cads.Cache().GetFileReader(d.Hex()) | ||
|
|
||
| // Get the downloaded file | ||
| reader, err = s.getCachedBlob(d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we getting the Blob from Cache twice ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Get from cache:
- If cache hit -> return
- If cache miss:
2. Download- Get from cache.(Blob in the cache after the download),
| case <-checkCtx.Done(): | ||
| } | ||
| }() | ||
| wg.Wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't we waiting for all 3 checks to finish for the readiness check ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are waiting in select in the loop. It was possible to hang indefinitely with just waitgroup:
- starting 3 goroutines
- timing out on nginx
- server goroutine stays in the wg.Wait
Now all the calls have timeout protection.
| var wg sync.WaitGroup | ||
| results := make(chan checkResult, 3) | ||
|
|
||
| wg.Add(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should not modify the behaviour of readiness check if the purpose of the PR is to improve Logging in Kraken Agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, can we split into 2 PRs?
Anton-Kalpakchiev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall feedback - is it possible to split into smaller PRs to improve reviewability?
| ReadinessTimeout time.Duration `yaml:"readiness_timeout"` | ||
|
|
||
| // Enable detailed request logging | ||
| EnableRequestLogging bool `yaml:"enable_request_logging"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we planning to use this flag internally? Are we gonna enable it globally or only on some agents?
Are there any performance limitations as to how many logs or what logs cardinality kraken-agent can have when deployed on Job Controller?
| var wg sync.WaitGroup | ||
| results := make(chan checkResult, 3) | ||
|
|
||
| wg.Add(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, can we split into 2 PRs?
| ( \ | ||
| unset http_proxy https_proxy no_proxy && \ | ||
| apt-get update -o Acquire::Retries=3 -o Acquire::http::No-Cache=true \ | ||
| apt-get update --allow-releaseinfo-change -o Acquire::Retries=10 -o Acquire::http::No-Cache=true -o Acquire::http::timeout=60 -o Acquire::ForceIPv4=true \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the retry logic work here? Is this relevant to logging? Could we please split functional changes in another PR?
🔧 Key Changes