Google Cloud Storage hFILE not using multipart / resumable uploads or downloads for reliability #1814

Open
pettyalex opened this issue Jul 31, 2024 · 1 comment

@pettyalex
Contributor

Summary: GCS Read/Write could use resumable uploads and downloads to recover better from network failures or other transient issues

I've noticed that the Google Cloud Storage support in htslib makes a single request for each download and upload, without using resumable, multi-chunk, or multipart uploads. I believe that using resumable uploads, and Range headers on downloads, could significantly increase reliability when working with GCS, potentially fixing most of the GCS read/write bugs that I've seen reported here. I've personally had a pretty bad time trying to read and write large files in GCS: it works intermittently, but we hit failures every few hours, which makes working with large files infeasible.
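To illustrate the download side, here is a minimal Python sketch of resuming a read with a Range header after a dropped connection. It is not htslib code: `fetch` is a hypothetical callable standing in for whatever issues the HTTP GET, and the sketch assumes the server honours an open-ended `Range: bytes=N-` request and that the object does not change between attempts.

```python
from typing import Callable, Iterator


def range_header(offset: int) -> dict:
    """Build an open-ended HTTP Range header starting at `offset`."""
    return {"Range": f"bytes={offset}-"}


def resumable_read(
    fetch: Callable[[dict], Iterator[bytes]],
    max_retries: int = 5,
) -> Iterator[bytes]:
    """Yield the object's bytes, reconnecting at the current offset on failure.

    `fetch` is a hypothetical callable: given request headers, it returns an
    iterator of body chunks and may raise ConnectionError mid-stream.
    """
    offset = 0
    failures = 0
    while True:
        try:
            for chunk in fetch(range_header(offset)):
                offset += len(chunk)
                failures = 0  # progress was made, so reset the retry budget
                yield chunk
            return  # body ended cleanly
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise
```

A real implementation would also want backoff between attempts and a check (for example on the object generation or ETag) that the object is still the same one.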

My group is trying to work in Google Cloud via terra.bio, and we're hoping to stream input and output directly from Google Cloud Storage so that we can avoid copying around >1TB vcf.gz and bcf files: samtools/bcftools#2235

Google's recommendations for streaming uploads and downloads are here:
https://cloud.google.com/storage/docs/streaming-uploads
https://cloud.google.com/storage/docs/streaming-downloads

I see two main ways to approach this. It would be possible to have hfile_gcs wrap hfile_libcurl just like it currently does, make a request to start a resumable upload before it starts sending data, and then handle creating a new hFILE for each large chunk. hfile_gcs could also handle retrying, although if we want robust retry logic we'd need to keep each chunk in memory until we know it's been successfully sent.
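As a rough sketch of the bookkeeping this first approach needs, the Python below builds the `Content-Range` value for each chunk of a resumable upload and works out how much of a buffered chunk still has to be resent. It assumes the behaviour described in Google's resumable upload documentation: non-final chunks are a multiple of 256 KiB, the total is sent as `*` while it is unknown, and a 308 response reports persisted bytes in a `Range: bytes=0-N` header. All function names here are hypothetical.

```python
from typing import Optional

CHUNK_GRANULARITY = 256 * 1024  # non-final chunks should be a multiple of this


def content_range(start: int, length: int, total: Optional[int]) -> str:
    """Content-Range value for one chunk; `total` is None while still streaming."""
    if length <= 0:
        raise ValueError("chunk must contain at least one byte")
    total_str = "*" if total is None else str(total)
    return f"bytes {start}-{start + length - 1}/{total_str}"


def parse_persisted(range_value: Optional[str]) -> int:
    """Number of bytes the server has stored, from a 308 response's Range header.

    A missing header means nothing has been persisted yet.
    """
    if not range_value:
        return 0
    return int(range_value.rsplit("-", 1)[1]) + 1


def unsent_tail(chunk_start: int, chunk: bytes, persisted: int) -> bytes:
    """Part of a buffered chunk that still has to be sent after a partial write."""
    return chunk[max(0, persisted - chunk_start):]
```

Keeping the chunk buffered until `parse_persisted` covers it is what makes the retry safe, which is the memory cost mentioned above.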

The other approach would be to rework, extend, or wrap hfile_s3_write, because Google Cloud Storage also supports XML multipart uploads matching the S3 API: https://cloud.google.com/storage/docs/multipart-uploads
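For reference, the S3-style multipart flow is: initiate the upload to get an upload ID, PUT each part with its part number, then POST a completion document listing every part's ETag. The Python sketch below only builds the part URL and the completion body; the helper names are hypothetical, and the path-style `storage.googleapis.com` endpoint is an assumption about how requests would be addressed.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple
from urllib.parse import quote, urlencode


def part_url(bucket: str, key: str, part_number: int, upload_id: str) -> str:
    """URL for uploading one part of an XML API multipart upload."""
    query = urlencode({"partNumber": part_number, "uploadId": upload_id})
    return f"https://storage.googleapis.com/{bucket}/{quote(key)}?{query}"


def completion_body(parts: Iterable[Tuple[int, str]]) -> str:
    """XML body for the final request; `parts` is (part number, ETag) pairs."""
    root = ET.Element("CompleteMultipartUpload")
    for number, etag in sorted(parts):
        part = ET.SubElement(root, "Part")
        ET.SubElement(part, "PartNumber").text = str(number)
        ET.SubElement(part, "ETag").text = etag
    return ET.tostring(root, encoding="unicode")
```

Because each part is acknowledged separately with its own ETag, a failed part can be retried without resending the parts before it.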

It's also possible to work around this in some cases by using the gcloud CLI to do the reading and writing, but that isn't an option everywhere. For example, one can: gcloud storage cat gs://my-bucket/my-file.bcf | bcftools view | gcloud storage cp - gs://my-bucket/the-output.bcf

It would also be really nice to use range requests for reading, as it'd be possible to request just one bgzf block at a time if you're doing random I/O.
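A small Python sketch of what per-block range requests could look like: a BGZF virtual offset is split into the compressed file offset and the offset within the block, and the block's exact size is read from the BSIZE field of its header. Since a BGZF block is at most 64 KiB, requesting 64 KiB from the block start always covers it. The function names are hypothetical and this is not how htslib currently does its I/O.

```python
import struct
from typing import Tuple

MAX_BGZF_BLOCK = 65536  # upper bound on the size of one compressed block


def split_virtual_offset(voffset: int) -> Tuple[int, int]:
    """Return (compressed file offset, offset inside the uncompressed block)."""
    return voffset >> 16, voffset & 0xFFFF


def block_range(coffset: int, size: int = MAX_BGZF_BLOCK) -> dict:
    """Range header covering `size` bytes starting at a block boundary."""
    return {"Range": f"bytes={coffset}-{coffset + size - 1}"}


def bgzf_block_size(header: bytes) -> int:
    """Total compressed size of the BGZF block that starts with `header`."""
    if header[:4] != b"\x1f\x8b\x08\x04":
        raise ValueError("not a BGZF block header")
    (xlen,) = struct.unpack_from("<H", header, 10)
    pos, end = 12, 12 + xlen
    # Walk the gzip extra subfields looking for the 'BC' one that holds BSIZE.
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", header, pos)
        if (si1, si2, slen) == (66, 67, 2):
            (bsize,) = struct.unpack_from("<H", header, pos + 4)
            return bsize + 1  # BSIZE stores the block size minus one
        pos += 4 + slen
    raise ValueError("BC subfield not found")
```

In practice one request per block may be too chatty for sequential reads, so a reader would probably fetch larger ranges and fall back to single-block requests only for random access.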

@whitwham
Contributor

whitwham commented Aug 8, 2024

These are all good things that we should add. Our entire cloud storage code needs a review to see what we can do better. At the moment we are spread a bit thin due to other projects, but hopefully we can get around to this in the not too distant future.
