
File Storage #1

Open · vezenovm opened this issue Apr 23, 2020 · 1 comment
Labels: question (Further information is requested)

vezenovm (Contributor) commented Apr 23, 2020

Storing files directly in the PostgreSQL database would, I imagine, require some sort of buffering to account for the possibility of large files. Is there a recommended strategy for dealing with file storage?

What are everyone's thoughts on simply storing the file in a cloud service like Google Drive for the moment, and, rather than hashing the entire file contents into our database, just storing a hashed file id? Something like the sketch below is what I have in mind.
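A rough, untested sketch of the hashed file id idea, assuming a Node/TypeScript backend (the function name and the notion of a Drive file id here are just placeholders for illustration):

// Rough sketch (untested): the letter itself lives in Google Drive and we only
// persist a hash of its Drive file id, never the file contents.
import { createHash } from "crypto";

function hashDriveFileId(driveFileId: string): string {
  // sha256 of the Drive id gives a fixed-size, opaque identifier for the db
  return createHash("sha256").update(driveFileId).digest("hex");
}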

Also for reference here is my current database schema for the MVP:

create table users (
	user_id				varchar(32)		not null,
	name				varchar(32)		not null,		
	public_address			varchar(42)		unique not null,
	creation_timestamp		timestamp 		not null,
	primary key (user_id)
);

create table letters (
	letter_id			varchar(32)		not null,
	letter_contents			bytea			not null,
	letter_writer_id		varchar(32)		not null references users(user_id),
	letter_requestor		varchar(32)		not null references users(user_id),
	primary key (letter_id)
);

create table sent_letters (
	user_id					varchar(32) 	        not null references users(user_id),
	letter_id 				varchar(32)		not null references letters(letter_id),
	primary key (user_id, letter_id)
);
vezenovm added the question label on Apr 23, 2020
zcstarr commented Apr 24, 2020

Actually this is a good point. What people tend to do is not store files in databases. The standard pattern here is, like you're thinking, to hash the content and store that content hash in the db.

The data itself (often referred to as a "blob") goes into a blob storage facility. In your case maybe that is Google Drive, especially if you're thinking that people can write the letters in any format they want.

For the MVP it might make sense for you to just return a url that links to the underlying data, and display the metadata.
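To make that concrete, here's roughly what the write path could look like. This is an untested sketch assuming a Node backend with node-postgres; content_hash and content_url are hypothetical columns standing in for letter_contents:

// Rough sketch (untested): hash the content, put the blob itself elsewhere,
// and keep only the hash plus a pointer in postgres.
import { createHash } from "crypto";
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

async function saveLetter(
  letterId: string,
  contents: Buffer,
  contentUrl: string,   // wherever the blob actually went (drive, s3, ...)
  writerId: string,
  requestorId: string
): Promise<string> {
  const contentHash = createHash("sha256").update(contents).digest("hex");
  await pool.query(
    `insert into letters (letter_id, content_hash, content_url, letter_writer_id, letter_requestor)
     values ($1, $2, $3, $4, $5)`,
    [letterId, contentHash, contentUrl, writerId, requestorId]
  );
  return contentHash;
}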

On Blob Storage:

If you have the resources available, it might be better to use S3 directly (see https://devcenter.heroku.com/articles/s3-upload-node). It'll give you more flexibility and probably be a better storage system for your needs.
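Roughly what the upload itself looks like with aws-sdk v2 (untested sketch; the bucket name and key scheme are placeholders, not anything that exists in your project):

// Rough sketch (untested): direct server-side upload, keyed by content hash.
import * as AWS from "aws-sdk";

const s3 = new AWS.S3(); // credentials come from the standard AWS env vars

async function uploadLetter(contentHash: string, body: Buffer): Promise<string> {
  const result = await s3
    .upload({
      Bucket: "letters-mvp-bucket",   // placeholder bucket
      Key: `letters/${contentHash}`,  // content-addressed key
      Body: body,
    })
    .promise();
  return result.Location; // the url you'd hand back for the MVP
}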

The other thing to note is that you might consider using a CDN like CloudFront with S3. It will help with the retrieval of the data, but is trickier to set up. So I'd start with the foundational layer, which is the s3 bucket, set to public for the proof of concept (setting the bucket to public is against best practices).

As a first pass I would not complicate things and would simply use s3 and heroku. You'll get better performance because heroku is deployed on aws to start with and will do a better job of colocating your data. In theory it'll be cheaper as well, because you won't pay for outgoing network bandwidth.

The rest of this might be tl;dr :D and is just background information on file upload schemes. I believe for the 1.x these aren't necessary, so that's why there's the tl;dr caveat :)

So with file upload schemes there are a few things to consider:

  1. the processing of the file data

  2. the outgoing bandwidth consumption of that data

  3. the accessibility of that data

  4. For point 1, in some APIs accessing disk might block i/o, be really slow, or consume too much memory, so when APIs get large enough they pull file storage out into a separate system that might queue the data for further processing and is dedicated to slurping files in (but that's a scale issue and chances are you won't hit that point soon).

  5. Typically when you store large blob data in a db, you're then responsible for serving that data anywhere in the world. This means you don't get to take advantage of CDNs and caching. It means you'll spend a lot of time in your API server reading from disk and then spend the bandwidth serving content. Initially this isn't terrible, but it becomes expensive from a bandwidth and i/o perspective.

  6. One downside to hashed content on s3 is the difficulty of restricting access to the content once a url for it has been distributed. Share links etc. often work as a security-by-obscurity protocol, so if a link leaks, whoever has it has access forever. In a sense this is the reality we live in every day: post something to twitter and there's no guarantee someone won't screen grab it and retweet.
    Content access management could happen in a couple of different ways: sometimes APIs like s3 allow for restricted access, you could time-manage the access yourself (see the sketch below), or you could generate public-facing urls for each share, where you generate a url and a corresponding CDN url that maps to the underlying asset. Each has its own complexity tradeoffs. The best approach when building these systems is to introduce complexity only when you need it.
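Here's the sketch mentioned in point 6: a pre-signed url that expires, which is the simplest way to time-manage access without a public bucket (untested; the bucket and key are placeholders):

// Rough sketch (untested): time-limited, pre-signed s3 url with aws-sdk v2.
import * as AWS from "aws-sdk";

const s3 = new AWS.S3();

function getTemporaryLetterUrl(contentHash: string): string {
  return s3.getSignedUrl("getObject", {
    Bucket: "letters-mvp-bucket",   // placeholder bucket
    Key: `letters/${contentHash}`,
    Expires: 3600,                  // the link stops working after an hour
  });
}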

For the MVP and the initial builds I would keep it simple and just return an s3 url or a CloudFront url that maps to s3.
