Conversation

@sbassam sbassam commented Nov 7, 2025

Have you read the Contributing Guidelines?

Issue #

Describe your changes

Clearly and concisely describe what's in this pull request. Include screenshots, if necessary.


Note

Increases max file size to 50.1GB, switches multipart target part size to 250MB with sliding-window concurrency, adds progress bars to validation, and sets a download request timeout.

  • Uploads:
    • Multipart: Set TARGET_PART_SIZE_MB to 250; keep MAX_MULTIPART_PARTS at 250 (a sizing sketch follows this note).
    • Implement sliding-window concurrency in MultipartUploadManager._upload_parts_concurrent to respect max_concurrent_parts while continuously feeding new parts.
    • Increase max supported file size to MAX_FILE_SIZE_GB = 50.1.
  • Download:
    • Add request_timeout=3600 to raw GET in DownloadManager.download.
  • File validation:
    • Wrap JSONL iteration with tqdm for progress; import tqdm.
    • Improve UTF-8 check to read in chunks.
  • Tests:
    • Update expectations for part sizing/count (e.g., 500MB → 2×250MB, 50GB → ~205 parts) and size-limit error message.
    • Adjust as_completed mocking to align with sliding-window behavior.

Written by Cursor Bugbot for commit 7ac3d44. This will update automatically on new commits.
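
For illustration, a minimal sketch of the part-sizing rules summarized in the note above; plan_parts is a hypothetical helper written for this description, not the actual upload code:

import math

TARGET_PART_SIZE_MB = 250
MAX_MULTIPART_PARTS = 250
MAX_FILE_SIZE_GB = 50.1

def plan_parts(file_size_mb: float) -> tuple[int, float]:
    """Return (part_count, part_size_mb) for a multipart upload."""
    if file_size_mb > MAX_FILE_SIZE_GB * 1024:
        raise ValueError(f"File exceeds the {MAX_FILE_SIZE_GB}GB limit")
    part_count = max(1, math.ceil(file_size_mb / TARGET_PART_SIZE_MB))
    # Never exceed the maximum part count; grow the part size instead.
    part_count = min(part_count, MAX_MULTIPART_PARTS)
    return part_count, file_size_mb / part_count

# e.g. plan_parts(500) -> (2, 250.0); plan_parts(50 * 1024) -> 205 parts of ~250MB each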

@sbassam sbassam marked this pull request as ready for review November 7, 2025 03:21
@sbassam sbassam requested a review from vorobyov01 November 8, 2025 03:38

@nikita-smetanin nikita-smetanin left a comment

Hi Soroush, PR looks nice, I left a few suggestions :)

+chunk_size = 8192
 with file.open(encoding="utf-8") as f:
-    f.read()
+    for chunk in iter(lambda: f.read(chunk_size), ""):

I think just doing for line in f: pass would be simpler, and you won't have to specify the chunk size manually (it'll use line buffering, but you can override that if you think we should use larger buffers). I'd also add a comment noting that this is just a dry run to let the file reader decode the file as UTF-8.
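
For reference, a minimal sketch of that suggestion; check_utf8 is a placeholder name, not necessarily the real validator:

from pathlib import Path

def check_utf8(path: Path) -> None:
    # Dry run: iterate lines so the text reader decodes the whole file as
    # UTF-8; a UnicodeDecodeError is raised if the file is not valid UTF-8.
    with path.open(encoding="utf-8") as f:
        for _ in f:
            pass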


# Submit next part if available
if part_index < len(parts):
    part_info = parts[part_index]

It would be great to rewrite this to deduplicate it with the piece above. I think you can either use a single for loop that submits tasks and waits on results once enough are already in flight, or use executor.map with buffersize to limit concurrent tasks.
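
A rough sketch of the first option, assuming a standalone upload_one_part callable rather than the real MultipartUploadManager internals:

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def upload_parts_sliding_window(parts, upload_one_part, max_concurrent_parts=8):
    # Single submit loop that keeps at most max_concurrent_parts uploads in flight.
    results = []
    in_flight = set()
    with ThreadPoolExecutor(max_workers=max_concurrent_parts) as executor:
        for part_info in parts:
            if len(in_flight) >= max_concurrent_parts:
                # Window is full: wait for at least one part to finish.
                done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            in_flight.add(executor.submit(upload_one_part, part_info))
        # Drain whatever is still in flight.
        done, _ = wait(in_flight)
        results.extend(f.result() for f in done)
    return results

On a new enough Python (3.14+), executor.map(upload_one_part, parts, buffersize=max_concurrent_parts) gives a similar bound with less code, though results come back in submission order.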

-TARGET_PART_SIZE_MB = 100  # Target part size for optimal performance
-MAX_MULTIPART_PARTS = 250  # Maximum parts per upload (S3 limit)
+TARGET_PART_SIZE_MB = 250  # Target part size
+MAX_MULTIPART_PARTS = 250  # Maximum parts per upload

I see you dropped the "S3 limit" mention; is that still the case?
