Data Problems and Questions #10

Open
tonofshell opened this issue Apr 29, 2019 · 5 comments
Labels
question Further information is requested

Comments

@tonofshell
Collaborator

I have found a few issues that will add to the complexity of our analysis that I think we should start thinking about. Feel free to add any questions or problems you find with the data to this issue.

  • The data is in XML format NOT raw text files
    • Can we parse XML line by line?
    • Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file.
      • Each row is one user or comment
  • Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format
    • Do we simply drop all HTML tags and how would we do this?
    • Do we process the data and save it to disk in another format to use for our analysis?
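Both questions above can be sketched with just the standard library (the thread later mentions lxml and BeautifulSoup, which do the same jobs more robustly; the sample row below is made up but follows the Stack Exchange dump layout):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only text content, dropping every HTML tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(fragment)
    return "".join(parser.parts)


def parse_rows(lines):
    """Yield one attribute dict per <row .../> line, skipping wrapper tags.

    Works line by line because each row element sits on its own line.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("<row"):
            yield dict(ET.fromstring(line).attrib)


# made-up sample in the Stack Exchange dump layout
sample = [
    '<?xml version="1.0" encoding="utf-8"?>',
    "<users>",
    '<row Id="1" DisplayName="alice"'
    ' AboutMe="&lt;p&gt;I like &lt;b&gt;Python&lt;/b&gt;.&lt;/p&gt;" />',
    "</users>",
]

for row in parse_rows(sample):
    print(row["DisplayName"], "->", strip_html(row["AboutMe"]))
# prints: alice -> I like Python.
```

Note that the XML parser already decodes the `&lt;p&gt;` entities, so `strip_html` receives ordinary HTML and only has to discard the tags.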
@sanittawan
Collaborator

sanittawan commented Apr 29, 2019

Thanks for laying out all the questions. To answer some of your questions:

  • The data is in XML format NOT raw text files

    • Can we parse XML line by line? >> I think we can use lxml or beautifulsoup to do it. Check this link and this StackOverflow thread out.
    • Depending on the XML parser, this is likely stored in Python as a list of dictionaries, one dictionary for each row in the XML file. >> Answer below (last point)
  • Any comment or About Me attribute (basically anything with a sentence or more of text) within each XML row is in HTML format

    • Do we simply drop all HTML tags and how would we do this? >> We can use beautifulsoup to extract the text, theoretically.
    • Do we process the data and save it to disk in another format to use for our analysis? >> Here's what professor Wachs said:

You could [parse XML] with either MPI or MapReduce. Although I agree that this is not a typical MapReduce job, for the reason you said [I said there's really nothing to be reduced], you can do unusual things with MapReduce if you want. If you just yielded a key-value pair for each entry, and the reducer didn't do anything to change them, then it would have the effect of writing out all the results.

A bigger issue is that MapReduce breaks things up line by line. XML is usually on separate lines, even for things that belong together. This could mean that MapReduce sends different parts of the same thing to different nodes. That's an immediate problem. Are there newlines within things that belong together in your data file?

Putting things in a database is often a good idea, but not always suitable for big data environments. Sometimes, working with a raw CSV is much easier than trying to query a database. For instance, you can take things from a raw CSV and put them right into MapReduce, but not so for a database. I would plan on not using a database, but if there is some specific reason why it turns out to be useful for part of your project, you can reconsider.

[...] denotes my insertion
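For what it's worth, the "reducer that doesn't change anything" idea could look like the following as a Hadoop Streaming mapper, assuming (as seems to hold for the Stack Exchange dumps) that each `<row .../>` element sits on its own line — exactly the newline question Prof. Wachs raises:

```python
import sys
import xml.etree.ElementTree as ET


def map_line(line):
    """Emit 'Id<TAB>raw line' for each <row .../> element, None otherwise.

    With an identity reducer, the job just writes every row back out.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None  # wrapper lines like <users> / </users> carry no data
    attrs = ET.fromstring(line).attrib
    return "{}\t{}".format(attrs.get("Id", ""), line)


if __name__ == "__main__":
    # Hadoop Streaming feeds each file split to the mapper on stdin
    for raw in sys.stdin:
        pair = map_line(raw)
        if pair is not None:
            print(pair)
```

If rows did span multiple lines, this splitting would break in precisely the way described above, so that should be checked on the real dump first.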

@tonofshell
Collaborator Author

To answer another question I hadn't posed yet: how much will it cost to store the data in Google Cloud? To store it on a drive for access by a compute cluster, we are subject to this pricing scheme:
[screenshot: Google Cloud persistent disk pricing table, 2019-04-29]

We can store it in a Google Cloud storage bucket which is cheaper for just storing the data, and it appears we can access it from compute, but then we would need to pay a cost per operation (query) as well.

@sanittawan
Collaborator

Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.

Also to note, we can automate the launching of compute engines via the gcloud command.

@tonofshell
Collaborator Author

> Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.
>
> Also to note, we can automate the launching of compute engines via the gcloud command.

I guess this means we need to decide how much of the data we want to use before we start paying to put all 200+ GB into a Google Cloud bucket.

@sanittawan
Collaborator

sanittawan commented May 1, 2019

> Regarding @tonofshell's comment on storage, I chatted with Prof. Wachs today and he said that we should store the data in Google Cloud buckets (ref. here and here). Students in past years stored their data there and accessed it using the gsutil tool.
>
> Also to note, we can automate the launching of compute engines via the gcloud command.
>
> I guess this means we need to decide how much of the data we want to use before we start paying to put all 200+ GB into a Google Cloud bucket.

Exactly. Prof. Wachs advised that we start with <1% of the entire data set, store it in a cloud bucket, and do some analysis. Once we are sure that our code won't break, we can either slowly increase the size of the data (from <1% to 5% to 10% to 20% and so on) or try putting the entire 200 GB there. He did mention, though, that the storage won't burn through our money quickly; it's the computation that will cost us a lot. He also reminded us to select a compute engine with 500 GB of disk when we want to upload the full data set to the cloud.
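One way to cut that <1% slice before paying for a bucket upload (a hypothetical helper, not from the thread) is to Bernoulli-sample row lines while streaming the dump, so only the sample ever touches the network:

```python
import random


def sample_rows(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the <row .../> lines from a streamed
    XML dump; wrapper lines and the other ~99% of rows are dropped."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for line in lines:
        if line.lstrip().startswith("<row") and rng.random() < fraction:
            yield line


# e.g. stream a dump file and write the ~1% sample to disk:
# with open("Comments.xml") as src, open("sample.xml", "w") as dst:
#     dst.writelines(sample_rows(src))
```

The sample file could then be copied up with gsutil, and the fraction raised (5%, 10%, ...) as the pipeline proves stable.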

@sanittawan added the question label on May 3, 2019