-
Notifications
You must be signed in to change notification settings - Fork 464
fix: resolve race condition in file state verification and improve error handling #501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ror handling This commit fixes critical issues with blob downloads in the agent: 1. Race Condition Fix: - Use Any() instead of Cache() after sched.Download() completes - Tolerates files in download state during brief move window - Eliminates spurious 500 errors when serving newly downloaded blobs - Safe due to Unix file descriptor semantics (fd stays valid after rename) 2. Improved Error Handling: - Add typed errors (ErrBlobNotFound, ErrTagNotFound) with detailed reasons - Map scheduler errors to appropriate response codes (404 vs 500) - Use %w for error wrapping to preserve error chains - Extract mapSchedulerError() helper to centralize error mapping 3. Code Quality: - Fix double slash in error messages (use filepath.Join) - Reduce nesting with early return patterns - Consistent error handling across methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical race condition in blob downloads and improves error handling throughout the transfer layer:
- Replaces
Cache()withAny()after download completion to handle files still in transit between download and cache directories - Converts error sentinels to typed errors (
ErrBlobNotFound,ErrTagNotFound) with detailed context - Improves error wrapping using
%wformat verb to preserve error chains
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| lib/store/base/file_entry.go | Removes redundant f.Close() return (defer already handles closing) |
| lib/store/base/errors.go | Fixes double slash in error messages using filepath.Join |
| lib/dockerregistry/transfer/errors.go | Introduces typed errors with helper functions for error checking |
| lib/dockerregistry/transfer/testing.go | Updates test transferer to use new typed error format |
| lib/dockerregistry/transfer/rw_transferer.go | Refactors error handling with early returns and error wrapping |
| lib/dockerregistry/transfer/ro_transferer.go | Fixes race condition using Any() and adds mapSchedulerError helper |
| lib/dockerregistry/transfer/rw_transferer_test.go | Updates tests to use new error checking functions |
| lib/dockerregistry/transfer/ro_transferer_test.go | Updates tests to use new error checking functions |
| lib/dockerregistry/storage_driver.go | Updates error checking to use new typed error helpers |
| agent/agentserver/server.go | Fixes race condition using Any() after scheduler download |
Comments suppressed due to low confidence (1)
agent/agentserver/server.go:179
- The file reader
fobtained at line 171 is not closed after use, causing a resource leak. Adddefer f.Close()after the error check at line 174 to ensure the file handle is properly closed.
f, err = s.cads.Any().GetFileReader(d.Hex())
if err != nil {
return handler.Errorf("store: %s", err)
}
if _, err := io.Copy(w, f); err != nil {
return fmt.Errorf("copy file: %s", err)
}
return nil
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
agent/agentserver/server.go
Outdated
| } else { | ||
| return handler.Errorf("store: %s", err) | ||
|
|
||
| // Happy path: file already exists in cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove these comments that repeat the code logic? IMO they get stale with time and the cost of maintaining them is higher than the value they provide by clarifying the code. Were they written by AI and forgotten after? I think this happens a lot with AI code :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also love the nesting reduction here too! Code is much more readable like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
| // Get file reader after download completes | ||
| // Use Any() to check both download and cache directories, as the file | ||
| // might still be in the process of being moved from download to cache. | ||
| f, err = s.cads.Any().GetFileReader(d.Hex()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What happens when a file is in the download dir, but is partially downloaded? Does it get returned?
- If we can serve blobs directly from the download dir, what is the purpose of having a download and a cache dir separately? Aren't we violating any atomicity invariants by serving data from the download dir?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.sched.Download()blocks until download is complete, concurrent requests are deduplicated, so it will not be returned- I guess the purpose is:
- Download dir: Incomplete files being assembled piece-by-piece
- Cache dir: Complete, verified files ready for serving
But Any() is just handling the microsecond window where the move operation is in flight
| if err != nil { | ||
| return nil, fmt.Errorf("stat cache: %s", err) | ||
|
|
||
| // Happy path: file already exists in cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note about these comments as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
Co-authored-by: akalpakchiev <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem
Agents were returning
500 UNKNOWNerrors when serving blobs immediately after P2P download, causing image pulls to fail intermittently.Root Cause
A race condition between file download completion and file state verification:
sched.Download()returns success when file is fully downloaded to/download/directory/download/→/cache/Cache().GetFileStat()expects file in cache stateverifyStateHelperfails → 500 errorError message: transferer stat: stat cache: failed to perform "verifyStateHelper" on
/var/cache/udocker/kraken-agent/download/cf5f39ef...:
Solution
1. Fix Race Condition
Change
Cache()toAny()after download:// Before
fi, err = t.cads.Cache().GetFileStat(d.Hex()) // Only accepts cache state
// After
fi, err = t.cads.Any().GetFileStat(d.Hex()) // Accepts download OR cache state
This eliminates false negatives during the move window while maintaining safety:
sched.Download()blocks until file is complete (no partial data)2. Improve Error Handling
Added typed errors with detailed reasons:
Map scheduler errors appropriately:
ErrTorrentNotFound→ 404 BLOB_UNKNOWNUse
%wfor error wrapping to preserve error chains in logs.3. Code Quality
filepath.Join()Impact
Files Changed
lib/dockerregistry/transfer/ro_transferer.go- Use Any(), add error mappinglib/dockerregistry/transfer/rw_transferer.go- Error wrappinglib/dockerregistry/transfer/errors.go- Custom error typeslib/dockerregistry/storage_driver.go- Improved error logginglib/store/base/errors.go- Fix path formattingagent/agentserver/server.go- Use Any(), reduce nesting