
Support for using a custom storage for temporary uploads #51

Open
sttwister opened this issue Jan 20, 2021 · 13 comments
Labels: enhancement (New feature or request)

@sttwister

Saving temporary uploads in the server's filesystem assumes there is a single Django instance serving all requests.

This poses problems in a production environment with multiple webserver instances with ephemeral storage (think containers in a Kubernetes environment).

Using a remote storage backend (e.g. S3) supported by django-storages for saving temporary uploads would solve this issue.
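
For illustration, a rough sketch of the kind of configuration this would enable. The temporary-storage setting below is hypothetical (not part of django-drf-filepond today); the stored-upload setting already exists:

```python
# settings.py -- a sketch only; DJANGO_DRF_FILEPOND_TEMP_STORAGES_BACKEND is a
# hypothetical name for the setting requested in this issue.
AWS_STORAGE_BUCKET_NAME = 'my-upload-bucket'  # example bucket name

# Existing setting: remote storage for *stored* uploads via django-storages.
DJANGO_DRF_FILEPOND_STORAGES_BACKEND = 'storages.backends.s3boto3.S3Boto3Storage'

# Hypothetical setting: remote storage for *temporary* uploads as well.
DJANGO_DRF_FILEPOND_TEMP_STORAGES_BACKEND = 'storages.backends.s3boto3.S3Boto3Storage'
```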

@jcohen02
Collaborator

Thanks for raising this issue @sttwister. I had been considering the addition of support for remote storage for temporary uploads when implementing the original remote storage functionality for stored uploads. I seem to recall there was something that made this less straightforward than I at first thought it might be! Nonetheless, I can see that this would be a useful feature and it would be good to revisit this and try and get it implemented.

In the sort of production environment you have in mind, would you expect all the webserver instances to be talking to a single, common database instance? i.e. if files are stored out to S3 or similar, but the application database is still used to track these, would that work? I'm happy to have a go at adding this capability to django-drf-filepond on the basis that there is still a shared database available. I can see that this would certainly enhance the scope for using this in a larger-scale production environment.

@sttwister
Author

Yes, the webservers would share a database instance; it's only the local storage that is not shared (i.e. the webservers are stateless).

For my particular use case, I've opted for an in-house implementation instead of using django-drf-filepond in order to minimize complexity.

When using stateless webservers with remote storage, the concept of separate temporary and stored uploads sometimes doesn't really make sense, because you would end up saving files to S3 twice.

My approach, which attempts to simplify this, does the following (sketched at the end of this comment):

  • Temporary uploads are saved directly on S3. Metadata is saved in the database.
  • Instead of moving temporary uploads to stored uploads, I just update metadata in the database on "form submit" and leave the files in the original location in S3.
  • Temporary files that have not been promoted to "stored files" will expire after some time and be cleaned up by some automated process (cron, Celery etc.)

However, I still find the flexibility of the temporary and stored upload storage very appealing. Being able to use any storage for temporary uploads opens up a lot of possibilities (like saving them in the database temporarily using projects like django-db-file-storage and then moving them to S3).
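
A minimal sketch of the approach described above, assuming django-storages' S3 backend is configured as the default file storage; the model and helper names are illustrative, not part of django-drf-filepond:

```python
from datetime import timedelta

from django.db import models
from django.utils import timezone


class Upload(models.Model):
    file = models.FileField(upload_to='uploads/')  # written straight to S3 via django-storages
    created = models.DateTimeField(auto_now_add=True)
    stored = models.BooleanField(default=False)    # False == still "temporary"


def promote(upload):
    """'Storing' an upload is just a metadata change; the object stays where it is in S3."""
    upload.stored = True
    upload.save(update_fields=['stored'])


def purge_expired(max_age_hours=24):
    """Run periodically (cron, Celery beat, ...) to delete stale temporary uploads."""
    cutoff = timezone.now() - timedelta(hours=max_age_hours)
    for upload in Upload.objects.filter(stored=False, created__lt=cutoff):
        upload.file.delete(save=False)  # removes the underlying S3 object
        upload.delete()
```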

@jcohen02
Collaborator

I take your point on the issue of considering temporary and stored uploads as separate entities and then being in a position where you may need to save them twice. This is clearly an important consideration, especially if you're talking about large files.

When django-drf-filepond was originally created, there was no remote storage capability so the concept of temporary and stored uploads worked since storing temporary uploads just meant moving them to an alternative location on the same local storage.

If we have remote storage capability for temporary uploads, it would be good to find a way to avoid multiple uploads. A simple option might just be to have a setting for the remote storage location of temporary and stored uploads; if they're the same, storing a temporary upload simply involves a change to the metadata, as you have highlighted in your approach.

If using S3 as the storage backend, I'm assuming that an S3 move happens within AWS and doesn't result in files being downloaded and re-uploaded, so even if different buckets are being used, I could include something that uses that approach when S3 is the backend.

@sttwister
Author

Yes, sharing the same storage and path between temporary and stored files sounds like a good plan.

AFAIK, S3 doesn't natively support mv operations; clients implement them using multiple operations (a copy followed by a delete). It also has the disadvantage that it's not storage agnostic.
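
For example, a copy-then-delete "move" done server-side with boto3 keeps the bytes inside AWS, but it is S3-specific (bucket and key names below are illustrative):

```python
import boto3

s3 = boto3.client('s3')

def s3_move(src_bucket, src_key, dst_bucket, dst_key):
    # Server-side copy: the object's bytes never pass through the Django host.
    # (For objects over 5 GB, boto3's managed s3.copy() would be needed instead.)
    s3.copy_object(
        Bucket=dst_bucket,
        Key=dst_key,
        CopySource={'Bucket': src_bucket, 'Key': src_key},
    )
    s3.delete_object(Bucket=src_bucket, Key=src_key)
```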

@pasevin
Contributor

pasevin commented Feb 2, 2021

Just my two cents :) Not sure if this is helpful at all.
My setup also includes multiple instances on AWS with load balancing, and for temporary uploads I am using EFS.
Permanent uploads then go to S3, of course.
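
For reference, a sketch of that kind of setup in Django settings (the EFS mount path is just an example):

```python
# Temporary uploads go to a shared EFS mount visible to every instance;
# stored uploads go to S3 via django-storages.
DJANGO_DRF_FILEPOND_UPLOAD_TMP = '/mnt/efs/filepond-tmp'
DJANGO_DRF_FILEPOND_STORAGES_BACKEND = 'storages.backends.s3boto3.S3Boto3Storage'
```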

@huynguyengl99

Hi @jcohen02, when do you plan to release this feature? (It looks like v0.4.0 is for big file uploads only.)

@jcohen02
Collaborator

jcohen02 commented Sep 6, 2021

Hi @huynguyengl99, you are correct that v0.4.0 does not include this feature and is focused on adding the large file upload support.

I'm hoping to get this feature added soon but have been caught up with other work. I seem to recall that there were going to be some challenges with this that mean it's going to take longer than originally anticipated to implement.

If you're keen to use this, I can try to get a prototype implementation in place within the next 2-3 weeks that you could test.

@huynguyengl99

Thank you @jcohen02. I have a workaround in my project for that case right now, so I can wait calmly for the feature.

@jcohen02
Collaborator

jcohen02 commented Sep 6, 2021

No problem, I'll try to get this new feature implemented as soon as possible.

@jcohen02 jcohen02 modified the milestones: v0.4.0, v0.5.0 Sep 29, 2021
@jcohen02 jcohen02 self-assigned this Sep 29, 2021
@jcohen02 jcohen02 added the enhancement New feature or request label Sep 29, 2021
@jcohen02 jcohen02 modified the milestones: v0.5.0, v0.6.0 Dec 5, 2022
@pplonski

Hi @jcohen02,

Thank you for this amazing package. I found this issue when looking into how to set up uploading temporary files directly to S3. If you have some tips or examples on how to do this, it would be fantastic.

I'm using django-drf-filepond in my framework for converting Jupyter Notebooks to data science web apps. When users are working with the framework locally, storing files in a local directory is OK; for larger deployments, direct S3 upload is required because the workers and the server are on different machines.

Thank you!

@jcohen02
Collaborator

jcohen02 commented Feb 24, 2023

Hi @pplonski, good to hear that you're using the package and thanks for the feedback. Mercury looks like a great framework.

Just to check that I understand correctly what you're trying to do: when users run Mercury locally and upload a file, filepond sends it from the browser to the Django backend (running on the same machine), where it's handled by django-drf-filepond and stored as a temporary file on the local host. When the upload is stored, django-drf-filepond then sends the file from the local temporary storage to S3 (if you have the S3 backend configured). As you say, this is fine when running locally, but when the client and server are on separate machines and the upload is large, you want to avoid the upload first being sent from the client to the Django server as a temporary upload and then subsequently being sent by the Django server to S3 when the temporary upload is stored?

So what you're looking for is some way to send the initial temporary upload directly from the client to S3?

The main challenge I see is that django-drf-filepond has no awareness or control of anything on the client side, so you'd need the filepond client to handle the upload to S3 and then also send a message to django-drf-filepond with the upload metadata so that the server side can keep track of the upload. You'd then need some sort of security configuration so that both the client and server have access to a common location on S3, with the client only able to upload the file but not see or alter any of the files stored there?

In principle, I don't think the client-side upload to S3 is a major problem; a possible solution seems to be discussed in this filepond issue. However, I can see potential security challenges in terms of managing access. The client would presumably need to have some S3 credentials to upload the file to S3? However, the server would also need to have control over the temporary uploads in S3. I guess the person setting up the Mercury server would need to embed some credentials in the Mercury deployment and these would somehow also need to be used by the client. I'll need to think this through in more detail but it doesn't sound practical.

Hope I understood correctly what you're hoping to achieve. Do let me know if you have any thoughts on potential options for how this might be designed/implemented. Thanks.

UPDATE: It looks like something like this should, in principle, be possible using one-time pre-signed AWS URLs. At the same time, I'm guessing that this will also require updates to the client-side filepond code to make it possible 🤔. Further investigation required.
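
For reference, a sketch of generating such a URL with boto3 (pre-signed URLs aren't strictly one-time, so a short expiry is used here as an approximation; bucket and key names are illustrative):

```python
import boto3

s3 = boto3.client('s3')

def make_temp_upload_url(key, expires_in=300):
    # The browser can PUT the file directly to this URL, so the upload bytes
    # never pass through the Django server.
    return s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': 'my-temp-upload-bucket', 'Key': key},
        ExpiresIn=expires_in,
    )
```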

@pplonski

Hi @jcohen02,

Thank you for your response! In Mercury, a user can create a Python notebook that accepts a file upload. The user can upload a new file and the notebook will compute an analysis for it. All file uploads are temporary; I don't need to store files for later use. I just use the temporary files, do computations in the notebook, and display the analysis. After some time, all temporary uploads are simply deleted.

Users can run Mercury locally or deploy it in the cloud. When running locally, saving temporary files on the hard drive is OK. In the cloud I was only doing small deployments (workers and server on the same machine), so saving on the hard drive was OK there too. I would like to prepare the code for large deployments with multiple workers on different machines, so I need some external storage.

I need to look at one-time pre-signed AWS URLs. I can detect when I have external storage configured, and filepond will then send files directly to S3 and store that information in Django's database. That might be a good solution. Thanks for the tip!

BTW: I really enjoy reading your code in django-drf-filepond - well written! Thank you!

@jcohen02
Collaborator

Thanks for the additional information @pplonski. It sounds like, if you're only using temporary uploads, you may be able to combine the approach of overriding the filepond process function to send uploads directly to an S3 bucket with some custom implementation in Mercury that generates a one-time upload link when a user tries to upload a file in a notebook. I guess you'd need to make a call to the server when the user attempts an upload, return the URL, and then trigger filepond to upload to that URL.

You'd also, presumably, need to think about security and have some way to ensure that when you go to process data in an uploaded file, it is actually the file the user uploaded.

I'll take a look at whether it's possible to add any support for something like this into django-drf-filepond but I'm afraid it will take me a little time to investigate due to other commitments at the moment. Hope these thoughts are of some help anyway.
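
As a rough illustration of that flow, a hypothetical endpoint (not part of django-drf-filepond) that the client could call before triggering the filepond upload, assuming Django REST Framework and boto3:

```python
import uuid

import boto3
from rest_framework.decorators import api_view
from rest_framework.response import Response

s3 = boto3.client('s3')
BUCKET = 'my-temp-upload-bucket'  # illustrative bucket name


@api_view(['POST'])
def request_upload_url(request):
    # Issue a short-lived pre-signed URL that the filepond client can PUT the
    # file to, so the upload bytes go straight from the browser to S3.
    upload_id = uuid.uuid4().hex
    key = f'tmp/{upload_id}'
    url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': BUCKET, 'Key': key},
        ExpiresIn=300,
    )
    # Persist (upload_id, key) here, e.g. in a model, so the server can verify
    # and clean up the upload once the client reports completion.
    return Response({'upload_id': upload_id, 'url': url})
```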
