CRTTransferManager.shutdown() fails silently #307

Open
cboin1996 opened this issue Jun 7, 2024 · 0 comments

When calling .shutdown() on a CRTTransferManager instance, failures are not raised. After some digging into this repo, I saw the code below (referenced from here):

    def _shutdown(self, cancel=False):
        if cancel:
            self._cancel_transfers()
        try:
            self._finish_transfers()
        except KeyboardInterrupt:
            self._cancel_transfers()
        except Exception:
            pass
        finally:
            self._wait_transfers_done()

The pass statement should probably re-raise the exception, so that errors are actually surfaced. What do you think?
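
For illustration only, here is roughly what that change would look like (a sketch against the snippet above; whether the re-raise should happen before or after waiting on in-flight transfers is up to the maintainers):

    def _shutdown(self, cancel=False):
        if cancel:
            self._cancel_transfers()
        try:
            self._finish_transfers()
        except KeyboardInterrupt:
            self._cancel_transfers()
        except Exception:
            # re-raise instead of swallowing, so callers of shutdown() see the failure
            raise
        finally:
            # still wait for in-flight transfers before propagating
            self._wait_transfers_done()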

For the record, my code was this:

import boto3
import botocore.config
import boto3.s3.transfer as s3transfer  # assumption: the snippet's s3transfer.* calls resolve via boto3.s3.transfer
from io import BytesIO
from typing import List


def parallel_upload(bucket: str, data: List[dict], workers=100):
    """performs parallel upload of records

    Args:
        bucket (str): the bucket
        data (List[dict]): records to process
            ex. [
                {
                    "key": "path/file.type",
                    "data": "data"
                }
            ]
        workers (int, optional): number of parallel workers. Defaults to 100.
    """
    botocore_config = botocore.config.Config(
      max_pool_connections=workers,
      tcp_keepalive=True
    )
    s3client = boto3.Session().client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(max_concurrency=workers)
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    upload_bytes = []
    for item in data:
      # IMPORTANT! we need to retain the reference to BytesIO outside of this loop, otherwise
      # the call to shutdown will fail all transfers as there is no longer a reference to the bytes
      # being transferred - hence the use of a list to maintain them
      upload_bytes.append(BytesIO(item['data'].encode()))
      s3t.upload(
          upload_bytes[-1], bucket, item['key']
      )
    # wait for transfers to complete
    s3t.shutdown()
    print(f"\t - {len(data)} objects uploaded to {bucket} successfully.")

As a workaround, I have manually handled collecting the results from s3t.upload, like below:

def parallel_upload(bucket: str, data: List[dict], workers=100):
    """performs parallel upload of records

    Args:
        bucket (str): the bucket
        data (List[dict]): records to process
            ex. [
                {
                    "key": "path/file.type",
                    "data": "data"
                }
            ]
        workers (int, optional): number of parallel workers. Defaults to 100.
    """
    botocore_config = botocore.config.Config(
      max_pool_connections=workers,
      tcp_keepalive=True
    )
    s3client = boto3.Session().client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(max_concurrency=workers)
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    upload_bytes = []
    futures = []
    for item in data:
      # IMPORTANT! we need to retain the reference to BytesIO outside of this loop, otherwise
      # the call to shutdown will fail all transfers as there is no longer a reference to the bytes
      # being transferred - hence the use of a list to maintain them
      upload_bytes.append(BytesIO(item['data'].encode()))
      # keep track of all submissions for validation of successful uploads
      futures.append(
        s3t.upload(
            upload_bytes[-1], bucket, item['key']
        )
      )
    # ensure all results do not throw an exception.
    for f in futures:
        f.result() # if upload fails, exception is thrown from this call
    print(f"\t - {len(data)} objects uploaded to {bucket} successfully.")

I simply iterate over the futures after queueing the uploads and call .result() on each, which works as expected: uploads still run concurrently, while failures are still detected.
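
One thing the workaround above skips is the explicit shutdown of the transfer manager. A small variation (just a sketch, reusing the names from the snippet above) keeps the per-future error check but still shuts the manager down once the results are collected:

    # check each upload's result, then release the transfer manager's resources
    try:
        for f in futures:
            f.result()  # raises here if the corresponding upload failed
    finally:
        s3t.shutdown()  # shutdown itself still swallows transfer errors, but cleanup runs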
