
Commit b289316

chore(datasets): improve logging and retry logic for MPUs (#387)
This adds basic retry logic for multipart upload parts so we don't choke immediately if a single part fails for any reason. It also adds some decidedly jank logging to update the end user on the progress of large uploads. Unfortunately this can create some noisy console logs, but the way we have Halo implemented makes a cleaner approach just annoying enough not to be worthwhile. The important thing is that users should now get relevant, timely information on how large uploads are progressing, or an informative error message if the upload fails to complete for any reason.
1 parent 2769721 commit b289316

1 file changed: +32, -4 lines


gradient/commands/datasets.py

```diff
@@ -640,12 +640,40 @@ def _put(self, path, url, content_type, dataset_version_id=None, key=None):
                 )[0]['url']

                 chunk = f.read(part_minsize)
-                part_res = session.put(
-                    presigned_url,
-                    data=chunk,
-                    timeout=5)
+                for attempt in range(0, 5):
+                    part_res = session.put(
+                        presigned_url,
+                        data=chunk,
+                        timeout=5)
+                    if part_res.status_code == 200:
+                        break
+
+                if part_res.status_code != 200:
+                    # Why do we silence exceptions that get
+                    # explicitly raised? Mystery for the ages, but
+                    # there you have it I guess...
+                    print(f'\nUnable to complete upload of {path}')
+                    raise ApplicationError(
+                        f'Unable to complete upload of {path}')
                 etag = part_res.headers['ETag'].replace('"', '')
                 parts.append({'ETag': etag, 'PartNumber': part})
+                # This is a pretty jank way to get about multipart
+                # upload status updates, but we structure the Halo
+                # spinner to report on the number of completed
+                # tasks dispatched to the workers in the pool.
+                # Since it's more of a PITA to properly distribute
+                # this MPU among all workers than I really want to
+                # deal with, that means we can't easily plug into
+                # Halo for these updates. But we can print to
+                # console! Which again, jank and noisy, but arguably
+                # better than a task sitting forever, never either
+                # completing or emitting an error message.
+                if len(parts) % 7 == 0:  # About every 100MB
+                    print(
+                        f'\nUploaded {len(parts) * part_minsize / 10e5}MB '
+                        f'of {int(size / 10e5)}MB for '
+                        f'{path}'
+                    )

                 r = api_client.post(
                     url=mpu_url,
```
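The retry pattern in this change can be factored into a small helper. This is a minimal sketch, not code from the commit: `upload_part_with_retry`, `put_part`, and the `backoff` parameter are assumptions, and the exponential backoff between attempts is an addition (the commit's loop retries immediately).

```python
import time


def upload_part_with_retry(put_part, part_number, max_attempts=5, backoff=0.1):
    """Attempt a single multipart-upload part up to max_attempts times.

    put_part is a hypothetical callable (e.g. wrapping session.put on a
    presigned URL) that returns a response-like object with a
    status_code attribute. Sleeps backoff * 2**attempt seconds between
    failed attempts; raises if no attempt returns HTTP 200.
    """
    for attempt in range(max_attempts):
        res = put_part()
        if res.status_code == 200:
            return res
        # Simple exponential backoff: 0.1s, 0.2s, 0.4s, ... by default.
        time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(
        f'part {part_number} failed after {max_attempts} attempts')
```

A per-part helper like this keeps the retry policy in one place, so the main upload loop only handles chunking, ETag collection, and progress reporting.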
