Data Problems and Questions #10

Open
tonofshell opened this issue Apr 29, 2019 · 5 comments
Labels
question Further information is requested

Comments

@tonofshell
Collaborator

I have found a few issues that will add to the complexity of our analysis that I think we should start thinking about. Feel free to add any questions or problems you find with the data to this issue.

  • The data is in XML format NOT raw text files
    • Can we parse XML line by line?
    • Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file.
      • Each row is one user or comment
  • Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format
    • Do we simply drop all HTML tags and how would we do this?
    • Do we process the data and save it to disk in another format to use for our analysis?
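Both questions above can be sketched with just the standard library (the thread later mentions lxml and BeautifulSoup, which do the same jobs more robustly; the sample row below is made up but follows the Stack Exchange dump layout):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only text content, dropping every HTML tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    return "".join(parser.parts)


def parse_rows(lines):
    """Yield one attribute dict per <row .../> line, skipping wrapper tags.

    Works line by line because each row element sits on its own line.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("<row"):
            yield dict(ET.fromstring(line).attrib)


# made-up sample in the Stack Exchange dump layout
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<users>",
    '<row Id="1" DisplayName="alice"'
    ' AboutMe="&lt;p&gt;I like &lt;b&gt;Python&lt;/b&gt;.&lt;/p&gt;" />',
    "</users>",
]

for row in parse_rows(sample):
    print(row["DisplayName"], "->", strip_html(row["AboutMe"]))
# prints: alice -> I like Python.
```

Note that the XML parser already decodes the `&lt;p&gt;` entities, so `strip_html` receives ordinary HTML and only has to discard the tags.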
@sanittawan
Collaborator

sanittawan commented Apr 29, 2019

Thanks for laying out all the questions. To answer some of your questions:

  • The data is in XML format NOT raw text files

    • Can we parse XML line by line? >> I think we can use lxml or beautifulsoup to do it. Check this link and this StackOverflow thread out.
    • Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file. >> Answer below (last point)
  • Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format

    • Do we simply drop all HTML tags and how would we do this? >> We can use beautifulsoup to extract the text, theoretically.
    • Do we process the data and save it to disk in another format to use for our analysis? >> Here's what professor Wachs said:

You could [parse XML] with either MPI or MapReduce. Although I agree that this is not a typical MapReduce job, for the reason you said [I said there's really nothing to be reduced], you can do unusual things with MapReduce if you want. If you just yielded a key-value pair for each entry, and the reducer didn't do anything to change them, then it would have the effect of writing out all the results.

A bigger issue is that MapReduce breaks things up line by line. XML is usually on separate lines, even for things that belong together. This could mean that MapReduce sends different parts of the same thing to different nodes. That's an immediate problem. Are there newlines within things that belong together in your data file?

Putting things in a database is often a good idea, but not always suitable for big data environments. Sometimes, working with a raw CSV is much easier than trying to query a database. For instance, you can take things from a raw CSV and put them right into MapReduce, but not so for a database. I would plan on not using a database, but if there is some specific reason why it turns out to be useful for part of your project, you can reconsider.

[...] denotes my insertion
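For what it's worth, the "reducer that doesn't change anything" idea could look like the following as a Hadoop Streaming mapper, assuming (as seems to hold for the Stack Exchange dumps) that each `<row .../>` element sits on its own line — exactly the newline question Prof. Wachs raises:

```python
import sys
import xml.etree.ElementTree as ET


def map_line(line):
    """Emit 'Id<TAB>raw line' for each <row .../> element, None otherwise.

    With an identity reducer, the job just writes every row back out.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None  # wrapper lines like <users> / </users> carry no data
    attrs = ET.fromstring(line).attrib
    return "{}\t{}".format(attrs.get("Id", ""), line)


if __name__ == "__main__":
    # Hadoop Streaming feeds each file split to the mapper on stdin
    for raw in sys.stdin:
        pair = map_line(raw)
        if pair is not None:
            print(pair)
```

If rows did span multiple lines, this splitting would break in precisely the way described above, so that should be checked on the real dump first.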

@tonofshell
Collaborator Author

To answer another question I hadn't posed yet: how much will it cost to store the data in Google Cloud? To store it on a drive for access by a compute cluster, we are subject to this pricing scheme:
[screenshot: Google Cloud persistent disk pricing table, 2019-04-29]

We can store it in a Google Cloud storage bucket which is cheaper for just storing the data, and it appears we can access it from compute, but then we would need to pay a cost per operation (query) as well.

@sanittawan
Collaborator

Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.

Also to note, we can automate the launching of compute engines via the gcloud command.

@tonofshell
Collaborator Author

> Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.
>
> Also to note, we can automate the launching of compute engines via the gcloud command.

I guess this means we need to decide how much of the data we want to use before we start paying to put all 200+ GB into a Google Cloud bucket.

@sanittawan
Collaborator

sanittawan commented May 1, 2019

> Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.
>
> Also to note, we can automate the launching of compute engines via the gcloud command.
>
> I guess this means we need to decide how much of the data we want to use before we start paying to put all 200+ GB into a Google Cloud bucket.

Exactly. Prof. Wachs advised that we start with <1% of the entire data set, store it in a cloud bucket, and do some analysis. Once we are sure that our code won't break, we can either slowly increase the size of the data (from <1% to 5% to 10% to 20% and so on) or try putting the entire 200 GB there. He did mention, though, that the storage won't burn through our money quickly; it's the computation that will cost us a lot. He also reminded us to select a compute engine with 500 GB of disk when we want to upload the full data set to the cloud.
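One way to cut that <1% slice before paying for a bucket upload (a hypothetical helper, not from the thread) is to Bernoulli-sample row lines while streaming the dump, so only the sample ever touches the network:

```python
import random


def sample_rows(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the <row .../> lines from a streamed
    XML dump; wrapper lines and the other ~99% of rows are dropped."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for line in lines:
        if line.lstrip().startswith("<row") and rng.random() < fraction:
            yield line


# e.g. stream a dump file and write the ~1% sample to disk:
# with open("Comments.xml") as src, open("sample.xml", "w") as dst:
#     dst.writelines(sample_rows(src))
```

The sample file could then be copied up with gsutil, and the fraction raised (5%, 10%, ...) as the pipeline proves stable.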

@sanittawan added the question label on May 3, 2019