@hweawer hweawer commented Nov 21, 2025

Problem

Agents were returning 500 UNKNOWN errors when serving blobs immediately after P2P download, causing image pulls to fail intermittently.

Root Cause

A race condition between file download completion and file state verification:

  1. sched.Download() returns success when file is fully downloaded to /download/ directory
  2. Scheduler asynchronously moves file from /download/ to /cache/
  3. Client requests blob immediately
  4. Cache().GetFileStat() expects file in cache state
  5. File still has download state → verifyStateHelper fails → 500 error

Error message: transferer stat: stat cache: failed to perform "verifyStateHelper" on
/var/cache/udocker/kraken-agent/download/cf5f39ef...:
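
The sequence above can be reproduced with a toy in-memory model. This is a minimal sketch, not the real store: `Store`, `State`, `StatCache`, and `StatAny` are hypothetical names that only mimic the idea of a cache-only lookup versus an any-state lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical stand-in for the store's file states.
type State int

const (
	StateDownload State = iota
	StateCache
)

var errWrongState = errors.New("file not in an accepted state")

// Store is a toy in-memory store; it is illustrative, not the real API.
type Store struct {
	files map[string]State
}

// stat succeeds only if the file is in one of the accepted states.
func (s *Store) stat(name string, accepted ...State) error {
	st, ok := s.files[name]
	if !ok {
		return errors.New("file not found")
	}
	for _, a := range accepted {
		if st == a {
			return nil
		}
	}
	return errWrongState
}

// StatCache mimics a Cache()-style lookup: cache state only.
func (s *Store) StatCache(name string) error { return s.stat(name, StateCache) }

// StatAny mimics an Any()-style lookup: download or cache state.
func (s *Store) StatAny(name string) error {
	return s.stat(name, StateDownload, StateCache)
}

func main() {
	// The download has finished, but the move to the cache state is still pending.
	s := &Store{files: map[string]State{"blob": StateDownload}}
	fmt.Println("cache-only:", s.StatCache("blob"))
	fmt.Println("any-state:", s.StatAny("blob"))

	s.files["blob"] = StateCache // the asynchronous move completes
	fmt.Println("cache-only after move:", s.StatCache("blob"))
}
```

In this model the cache-only lookup fails only while the file is still in the download state, which corresponds to steps 3 to 5 above.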

Solution

1. Fix Race Condition

Change Cache() to Any() after download:

// Before
fi, err = t.cads.Cache().GetFileStat(d.Hex()) // Only accepts cache state

// After
fi, err = t.cads.Any().GetFileStat(d.Hex()) // Accepts download OR cache state

This eliminates false negatives during the move window while maintaining safety:

  • ✅ Unix file descriptors remain valid after rename (no partial reads)
  • ✅ sched.Download() blocks until file is complete (no partial data)
  • ✅ Move operation is atomic (no corruption)
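
The file descriptor claim can be checked in isolation with the standard library. The sketch below is not taken from the PR: `readAcrossRename` and the file names are made up for illustration, and it assumes a Unix-like system where renaming an open file is allowed.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// readAcrossRename opens a file, renames it while the handle is still open,
// and then reads the content through the original handle.
func readAcrossRename(content string) (string, error) {
	dir, err := os.MkdirTemp("", "rename-demo")
	if err != nil {
		return "", fmt.Errorf("mkdir temp: %w", err)
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "download-blob") // hypothetical "download" location
	dst := filepath.Join(dir, "cache-blob")    // hypothetical "cache" location
	if err := os.WriteFile(src, []byte(content), 0o644); err != nil {
		return "", fmt.Errorf("write: %w", err)
	}

	f, err := os.Open(src)
	if err != nil {
		return "", fmt.Errorf("open: %w", err)
	}
	defer f.Close()

	// Move the file while the descriptor is open.
	if err := os.Rename(src, dst); err != nil {
		return "", fmt.Errorf("rename: %w", err)
	}

	// The descriptor refers to the underlying file, not to its old path.
	data, err := io.ReadAll(f)
	if err != nil {
		return "", fmt.Errorf("read: %w", err)
	}
	return string(data), nil
}

func main() {
	got, err := readAcrossRename("blob bytes")
	fmt.Println(got, err)
}
```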

2. Improve Error Handling

Add typed errors with detailed reasons:

type ErrBlobNotFound struct {
    Digest string
    Reason string  // e.g., "torrent not found in tracker"
}

Map scheduler errors appropriately:

  • ErrTorrentNotFound → 404 BLOB_UNKNOWN
  • Other errors → 500 with wrapped error details

Use %w for error wrapping to preserve error chains in logs.

3. Code Quality

  • Fix double slash in error paths using filepath.Join()
  • Reduce nesting with early return patterns
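
The double slash typically comes from concatenating a directory that already ends in a separator. A short sketch of the difference, using a hypothetical directory and assuming a Unix-style path separator:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// concatPath builds a path by string concatenation; a trailing slash on dir
// produces a double slash in the result.
func concatPath(dir, name string) string {
	return dir + "/" + name
}

// joinPath uses filepath.Join, which cleans the result and collapses
// redundant separators.
func joinPath(dir, name string) string {
	return filepath.Join(dir, name)
}

func main() {
	dir := "/var/cache/agent/download/" // hypothetical directory with a trailing slash
	fmt.Println(concatPath(dir, "abc"))
	fmt.Println(joinPath(dir, "abc"))
}
```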

Impact

  • ✅ Eliminates spurious 500 errors
  • ✅ Better error messages for debugging
  • ✅ No risk of serving partial/corrupted data
  • ✅ Consistent error handling across codebase

Files Changed

  • lib/dockerregistry/transfer/ro_transferer.go - Use Any(), add error mapping
  • lib/dockerregistry/transfer/rw_transferer.go - Error wrapping
  • lib/dockerregistry/transfer/errors.go - Custom error types
  • lib/dockerregistry/storage_driver.go - Improved error logging
  • lib/store/base/errors.go - Fix path formatting
  • agent/agentserver/server.go - Use Any(), reduce nesting

…ror handling

This commit fixes critical issues with blob downloads in the agent:

1. Race Condition Fix:
   - Use Any() instead of Cache() after sched.Download() completes
   - Tolerates files in download state during brief move window
   - Eliminates spurious 500 errors when serving newly downloaded blobs
   - Safe due to Unix file descriptor semantics (fd stays valid after rename)

2. Improved Error Handling:
   - Add typed errors (ErrBlobNotFound, ErrTagNotFound) with detailed reasons
   - Map scheduler errors to appropriate response codes (404 vs 500)
   - Use %w for error wrapping to preserve error chains
   - Extract mapSchedulerError() helper to centralize error mapping

3. Code Quality:
   - Fix double slash in error messages (use filepath.Join)
   - Reduce nesting with early return patterns
   - Consistent error handling across methods
Copilot AI review requested due to automatic review settings November 21, 2025 14:51
Copilot finished reviewing on behalf of hweawer November 21, 2025 14:54

Copilot AI left a comment

Pull request overview

This PR fixes a critical race condition in blob downloads and improves error handling throughout the transfer layer:

  • Replaces Cache() with Any() after download completion to handle files still in transit between download and cache directories
  • Converts error sentinels to typed errors (ErrBlobNotFound, ErrTagNotFound) with detailed context
  • Improves error wrapping using %w format verb to preserve error chains

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| lib/store/base/file_entry.go | Removes redundant f.Close() return (defer already handles closing) |
| lib/store/base/errors.go | Fixes double slash in error messages using filepath.Join |
| lib/dockerregistry/transfer/errors.go | Introduces typed errors with helper functions for error checking |
| lib/dockerregistry/transfer/testing.go | Updates test transferer to use new typed error format |
| lib/dockerregistry/transfer/rw_transferer.go | Refactors error handling with early returns and error wrapping |
| lib/dockerregistry/transfer/ro_transferer.go | Fixes race condition using Any() and adds mapSchedulerError helper |
| lib/dockerregistry/transfer/rw_transferer_test.go | Updates tests to use new error checking functions |
| lib/dockerregistry/transfer/ro_transferer_test.go | Updates tests to use new error checking functions |
| lib/dockerregistry/storage_driver.go | Updates error checking to use new typed error helpers |
| agent/agentserver/server.go | Fixes race condition using Any() after scheduler download |

Comments suppressed due to low confidence (1)

agent/agentserver/server.go:179

  • The file reader f obtained at line 171 is not closed after use, causing a resource leak. Add defer f.Close() after the error check at line 174 to ensure the file handle is properly closed.
	f, err = s.cads.Any().GetFileReader(d.Hex())
	if err != nil {
		return handler.Errorf("store: %s", err)
	}

	if _, err := io.Copy(w, f); err != nil {
		return fmt.Errorf("copy file: %s", err)
	}
	return nil

@hweawer hweawer self-assigned this Nov 21, 2025
Copilot AI review requested due to automatic review settings November 21, 2025 15:34
Copilot finished reviewing on behalf of hweawer November 21, 2025 15:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


Copilot AI review requested due to automatic review settings November 21, 2025 15:58
Copilot finished reviewing on behalf of hweawer November 21, 2025 16:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


} else {
return handler.Errorf("store: %s", err)

// Happy path: file already exists in cache

Collaborator

Can we remove these comments that repeat the code logic? IMO they get stale with time and the cost of maintaining them is higher than the value they provide by clarifying the code. Were they written by AI and forgotten after? I think this happens a lot with AI code :D

Collaborator

Also love the nesting reduction here too! Code is much more readable like this

Collaborator Author

Removed

// Get file reader after download completes
// Use Any() to check both download and cache directories, as the file
// might still be in the process of being moved from download to cache.
f, err = s.cads.Any().GetFileReader(d.Hex())

Collaborator

  1. What happens when a file is in the download dir, but is partially downloaded? Does it get returned?
  2. If we can serve blobs directly from the download dir, what is the purpose of having a download and a cache dir separately? Aren't we violating any atomicity invariants by serving data from the download dir?

Collaborator Author

  1. s.sched.Download() blocks until the download is complete, and concurrent requests are deduplicated, so a partially downloaded file will not be returned.
  2. I guess the purpose is:
  • Download dir: Incomplete files being assembled piece-by-piece
  • Cache dir: Complete, verified files ready for serving

But Any() only handles the microsecond window in which the move operation is in flight.

if err != nil {
return nil, fmt.Errorf("stat cache: %s", err)

// Happy path: file already exists in cache

Collaborator

Same note about these comments as above

Collaborator Author

Removed

Copilot AI review requested due to automatic review settings November 24, 2025 15:49
Copilot finished reviewing on behalf of hweawer November 24, 2025 15:52

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

