Data Problems and Questions #10
Thanks for laying out all the questions. To answer some of them:
[...] denotes my insertion.
Just to answer another question that I hadn't posed yet: how much will it cost to store the data in Google Cloud? To store it on a drive for access by a compute cluster, we are subject to this pricing scheme. Alternatively, we can store it in a Google Cloud Storage bucket, which is cheaper for just storing the data, and it appears we can access it from compute, but then we would also need to pay a per-operation (query) cost.
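As a back-of-envelope check (the per-GB rate below is an assumption based on GCS Standard storage in a US region; check the current pricing page for real numbers), storing the full data set should be inexpensive compared with compute:

```shell
# Rough monthly cost to store the full data set in a GCS bucket.
# RATE_CENTS_PER_GB is an assumed rate (~$0.02/GB-month); verify
# against the current Google Cloud Storage pricing page.
DATA_GB=200
RATE_CENTS_PER_GB=2
COST_CENTS=$((DATA_GB * RATE_CENTS_PER_GB))
echo "Approx monthly storage cost: \$$((COST_CENTS / 100))"
```

At these assumed rates, 200 GB is on the order of a few dollars per month; per-operation charges would come on top of that.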
Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years have stored their data there and accessed it using the gsutil tool. Also of note, we can automate launching compute engines via the gcloud command.
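For reference, a minimal sketch of what that workflow could look like (the bucket name, instance name, region, and machine type below are placeholders, not decisions):

```shell
# Create a bucket and upload data to it (names/region are placeholders)
gsutil mb -l us-central1 gs://our-project-data
gsutil -m cp -r ./sample_data gs://our-project-data/

# Later, pull the data down onto a compute instance
gsutil -m cp -r gs://our-project-data/sample_data ./

# Launch a compute engine from the command line instead of the console
gcloud compute instances create analysis-node \
    --zone=us-central1-a \
    --machine-type=n1-standard-4
```

The `-m` flag parallelizes the copy, which matters once the data gets large.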
I guess this means we need to decide how much of the data we want to use before we start paying to put all 200+ GB into a Google Cloud bucket.
Exactly. Prof. Wachs advised that we start with <1% of the entire data set, store it in cloud buckets, and do some analysis. Once we are sure our code won't break, we can either slowly increase the size of the data (from <1% to 5% to 10% to 20%, and so on) or try putting the entire 200 GB there. He did mention, though, that storage won't burn our money quickly; it's the computation that will cost us a lot. He also reminded us to select a compute engine with 500 GB of disk when we want to upload the full data set to the cloud.
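Sketching that incremental approach (bucket and instance names are hypothetical; the 500 GB boot disk follows Prof. Wachs's suggestion for the full data set):

```shell
# Step 1: upload only a <1% sample and validate the pipeline on it
gsutil -m cp -r ./sample_1pct gs://our-project-data/sample_1pct/

# Step 2: once the code is stable, scale up the uploaded fraction
# (5%, 10%, 20%, ...) or push the full 200+ GB.

# For the full data set, create an instance with a 500 GB boot disk
gcloud compute instances create full-data-node \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --boot-disk-size=500GB
```

Keeping the sample in its own bucket prefix (as above) makes it easy to rerun the same code against larger fractions later by just changing the path.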
I have found a few issues that will add to the complexity of our analysis that I think we should start thinking about. Feel free to add any questions or problems you find with the data to this issue.