Made chunk reading explicit when using read or pread #2772

Open
wants to merge 1 commit into
base: main
from

@rpecka rpecka commented Jul 9, 2024

Resolved an issue where reading a file in chunks using an unbounded range would read from the current file pointer even for regular files.

Motivation:

When a FileChunks object is initialized, it checks whether the range is 0..<Int.max. If it is, then instead of reading at a series of explicit offsets with the pread syscall, it calls the read syscall repeatedly. This is fine the first time a file handle is read this way, but if we read this way twice, the second read is affected by the first: read advances the file pointer as a side effect, so the second read starts wherever the first one stopped.

For example:

// Read the file the first time. This will repeatedly call `read` until EOF.
var firstRead = ByteBuffer()
for try await chunk in handle.readChunks(in: 0..<Int.max, chunkLength: .bytes(128)) {
  XCTAssertLessThanOrEqual(chunk.readableBytes, 128)
  firstRead.writeImmutableBuffer(chunk)
}
// Read the file again using `read` until EOF. This will read zero bytes since the previous call moved the file pointer to the end of the file without resetting it.
var secondRead = ByteBuffer()
for try await chunk in handle.readChunks(in: 0..<Int.max, chunkLength: .bytes(128)) {
  XCTAssertLessThanOrEqual(chunk.readableBytes, 128)
  secondRead.writeImmutableBuffer(chunk)
}

The main issue is that the read syscall affects the file pointer while the pread syscall does not.
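
The difference can be reproduced outside of NIO with the raw syscalls. The following is a minimal sketch, not code from this PR: the helper names (`makeTemporaryFile`, `readToEndUsingRead`, `readToEndUsingPread`) and the temporary file path are made up for illustration, and it assumes a Darwin or Glibc platform.

```swift
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#endif

/// Hypothetical helper: writes `contents` to a fresh temporary file and
/// returns a descriptor whose file pointer has been rewound to offset 0.
func makeTemporaryFile(contents: [UInt8]) -> Int32 {
    var template = Array("/tmp/read-vs-pread-XXXXXX".utf8CString)
    let fd = mkstemp(&template)
    precondition(fd >= 0, "mkstemp failed")
    unlink(template)  // The file disappears once the descriptor is closed.
    let written = contents.withUnsafeBytes { write(fd, $0.baseAddress, $0.count) }
    precondition(written == contents.count, "short write")
    lseek(fd, 0, SEEK_SET)
    return fd
}

/// Reads until EOF with `read`. Every call advances the file pointer,
/// so the descriptor is left positioned at the end of the file.
func readToEndUsingRead(_ fd: Int32, chunkLength: Int) -> [UInt8] {
    var result: [UInt8] = []
    var buffer = [UInt8](repeating: 0, count: chunkLength)
    while true {
        let n = buffer.withUnsafeMutableBytes { read(fd, $0.baseAddress, $0.count) }
        if n <= 0 { break }  // 0 is EOF, negative is an error.
        result.append(contentsOf: buffer[0..<n])
    }
    return result
}

/// Reads until EOF with `pread` at explicit offsets. The file pointer
/// of the descriptor is never moved.
func readToEndUsingPread(_ fd: Int32, chunkLength: Int) -> [UInt8] {
    var result: [UInt8] = []
    var buffer = [UInt8](repeating: 0, count: chunkLength)
    var offset: off_t = 0
    while true {
        let n = buffer.withUnsafeMutableBytes { pread(fd, $0.baseAddress, $0.count, offset) }
        if n <= 0 { break }  // 0 is EOF, negative is an error.
        result.append(contentsOf: buffer[0..<n])
        offset += off_t(n)
    }
    return result
}
```

Calling `readToEndUsingPread` twice on the same descriptor returns the full contents both times. Calling `readToEndUsingRead` twice returns the full contents the first time and nothing the second time, which is the behavior described above.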

Modifications:

  • Add a readChunksFromFilePointer method to ReadableFileHandleProtocol to explicitly read from the current file pointer instead of relying on the magic 0..<Int.max range.
  • Use the new function when reading from an unseekable file in .readToEnd.
  • Rename the ChunkRange cases to make what they are doing clearer.

Result:

Reading 0..<Int.max over and over again will have the same result each time.

@rpecka force-pushed the file-pointer-offset-bug branch from 30b9cec to 7baadcc on July 9, 2024 06:56
Comment on lines +25 to +26
case filePointerToEnd
case range(Range<Int64>)
Contributor:

I don't think there's any reason to change these names

Author:

Isn't entireFile misleading because it reads from the current file pointer, not from the beginning of the file?

Comment on lines +206 to +211
/// Returns an asynchronous sequence of chunks read from the file starting from the current file pointer.
///
/// - Parameters:
///   - size: The maximum length of the chunk to read as a ``ByteCount``.
/// - Returns: A sequence of chunks read from the file.
func readChunksFromFilePointer(chunkLength size: ByteCount) -> FileChunks
Contributor:

This isn't quite what I had in mind. Rather than adding new API I think we should use the type of the file to determine how to do the read inside FileChunks. Once we know the type of the file we can determine whether the range passed in is acceptable and then call the appropriate read function.
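
A rough sketch of that idea, using the raw `fstat` call rather than anything in NIOFileSystem. Everything here is an assumption for illustration: `ReadStrategy`, `ReadStrategyError`, and `chooseReadStrategy` are hypothetical names, and the rule that an unseekable file only accepts the unbounded range is one possible reading of "acceptable", not something this thread has settled.

```swift
#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#endif

/// Hypothetical strategy type; not part of NIOFileSystem.
enum ReadStrategy: Equatable {
    /// Regular (seekable) file: read with `pread` at explicit offsets.
    case positional(Range<Int64>)
    /// Unseekable file (pipe, socket, ...): read with `read` until EOF.
    case fromFilePointer
}

/// Hypothetical error type for this sketch.
enum ReadStrategyError: Error {
    case statFailed(code: Int32)
    case rangeNotSupported
}

/// Picks a read strategy from the file type instead of from a magic range.
func chooseReadStrategy(fd: Int32, requested: Range<Int64>) throws -> ReadStrategy {
    var info = stat()
    guard fstat(fd, &info) == 0 else {
        throw ReadStrategyError.statFailed(code: errno)
    }

    // Regular files support positional reads for any range.
    if (info.st_mode & S_IFMT) == S_IFREG {
        return .positional(requested)
    }

    // Assumption: anything else cannot seek, so only the unbounded
    // range makes sense and it is served from the current position.
    guard requested == 0..<Int64.max else {
        throw ReadStrategyError.rangeNotSupported
    }
    return .fromFilePointer
}
```

With this shape, a regular file asked for `0..<Int64.max` always goes through the positional path, so repeated reads start at offset 0 each time. Note that the function throws and performs a `stat`, so it cannot be called from a non-throwing initializer without further changes.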

Author:

This would make the checks that happen inside of .readToEnd redundant. Should we keep those or remove them?

Author:

This also means that calling readToEnd will stat the file twice.

I’m trying out your recommended solution and it also causes problems: if we check the file type in the FileChunks initializer, then the initializer has to be async throws. But the readChunks function from ReadableFileHandleProtocol is neither async nor throws, so that would be an API change.
