Week 7 plan - Division of Labor #19

Open

sanittawan opened this issue May 10, 2019 · 3 comments

@sanittawan (Collaborator) commented May 10, 2019

| Task | Name | Date |
| --- | --- | --- |
| **Data uploading** | | |
| Convert to CSV (Adam's code, CSV module, get rid of tags) | Nikki, Adam | Fri May 17 |
| Upload to buckets | Adam | Fri May 17 |
| Figure out sharing/access | Adam | Fri May 17 |
| **Data prep** | | |
| Decide necessary vars | Dhruval | Fri May 17 |
| Find data keys | Dhruval | Fri May 17 |
| Decide on data structure | Dhruval | Fri May 17 |
| Join data using MapReduce | Dhruval, Nikki | Fri May 17 |
| **Sentiment logistics** | | |
| Decide on a dictionary | Li | Fri May 17 |
| Decide on sentiments | Li | Fri May 17 |
| Define sentiments (n-grams) | Li | Fri May 17 |
| Specify inputs for models | Li | Fri May 17 |
| **Data analysis** | | |
| NLTK | Li | Fri May 17 |
| What other packages to use | ?? | |
| Split up analysis on clusters | ?? | |
| **Pres/Viz** | | |
| ?? | ?? | |
| ?? | ?? | |
sanittawan added the High Priority and Difficult labels on May 10, 2019
sanittawan changed the title from "Division of Labor" to "Week 7 plan - Division of Labor" on May 10, 2019
@tonofshell (Collaborator) commented:

Progress update 1:

  • Created a storage bucket
  • I believe I can grant access to it through the emails associated with your Google Cloud accounts
  • Some of the smaller XML files have already finished converting; the larger ones might have to be done on the cloud
    • I believe Nikki is adapting my code to work with MapReduce for this reason
  • Tomorrow morning, before the workshop, I will upload any finished CSVs and all of the XML files; this should (hopefully) take no more than a few hours (roughly what the sketch below does)
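
For reference, here is a rough sketch of the upload step using the google-cloud-storage Python client. The bucket name and local paths are placeholders, not our actual ones.

```python
# Sketch of uploading converted CSVs and raw XML files to a Cloud Storage bucket.
# The bucket name and glob patterns below are placeholders.
import glob
from google.cloud import storage

client = storage.Client()                      # uses the default Google Cloud credentials
bucket = client.bucket("our-project-bucket")   # placeholder bucket name

for path in glob.glob("converted/*.csv") + glob.glob("raw/*.xml"):
    blob = bucket.blob(path)                   # keep the relative path as the object name
    blob.upload_from_filename(path)
    print("uploaded", path)
```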

@sanittawan (Collaborator, Author) commented May 16, 2019

@tonofshell Not sure if you've seen my comment in another issue (I posted it last Saturday), so I'm posting it here again.

  • Can you please tell me which line in your code causes the script to scan the whole file?

  • I have a question about these lines (in startElement()).

```python
if self.row == 1:
    self.out.write(str(attributes.keys())[1:-1] + "\n")
if len(attributes) > 0:
    self.out.write(str(attributes.values())[1:-1] + "\n")
```

It seems to me that attributes is a dictionary-like object, and the order of its keys is not guaranteed to be the same for every row. How do you know that the keys and values of each row of data will come out in exactly the same order? (If it happens to yield the right thing, that's luck.) For example, if the attributes of row 0 are {id: 0, name: "a", link: "url0"}, how can you be sure that row 1's attributes would not be something like {name: "b", id: 1, link: "url1"}? The resulting CSV would then be:

id, name, link
0, "a", "url0"
"b", 1, "url1"

If you agree that this could be a problem, we should control the order by keeping an explicit list of column names. (That was the reason I hard-coded the column names, though I did plan to make that more generic.)
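
Something along these lines is what I have in mind (a minimal sketch, not your converter; the column names, the "row" element name, and the file names are just placeholders):

```python
# Minimal sketch: write the CSV with a fixed column order by looking each
# attribute up by name. COLUMNS and the element name are placeholders.
import csv
import xml.sax

COLUMNS = ["Id", "Name", "Link"]

class RowHandler(xml.sax.ContentHandler):
    def __init__(self, out_path):
        super().__init__()
        self.out = open(out_path, "w", newline="")
        self.writer = csv.writer(self.out)
        self.writer.writerow(COLUMNS)  # header written once, in a known order

    def startElement(self, name, attributes):
        if name == "row" and len(attributes) > 0:
            # Look each column up by name, so every row comes out in the same
            # order no matter how the attributes are ordered in the XML.
            self.writer.writerow([attributes.get(col, "") for col in COLUMNS])

    def endDocument(self):
        self.out.close()

# xml.sax.parse("input.xml", RowHandler("output.csv"))
```

Using csv.writer also takes care of quoting, instead of slicing the str() of the keys and values.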

MapReduce might not be a good fit for this task. I am going to write an MPI script that does the conversion tomorrow and will change the code to address the potential problem I pointed out above.
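
Roughly what I have in mind for the MPI part (a sketch with mpi4py; the glob pattern and convert_to_csv() are placeholders for the actual converter):

```python
# Sketch of the MPI approach: rank 0 lists the XML files and scatters them,
# then each rank converts its share of files independently.
import glob
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    files = sorted(glob.glob("data/*.xml"))          # placeholder path
    chunks = [files[i::size] for i in range(size)]   # round-robin split, one list per rank
else:
    chunks = None

my_files = comm.scatter(chunks, root=0)

for path in my_files:
    convert_to_csv(path)  # placeholder: the (fixed) SAX-to-CSV converter
```

It would be launched with something like mpiexec -n 4 python convert_xml.py.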

@liu431 (Owner) commented May 17, 2019

I've uploaded a notebook on applying the VADER package to extract sentiments from the data. It is easy to use and easy to parallelize.
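
A minimal example of the idea (not the notebook itself; the sample sentences are made up):

```python
# Score each piece of text with NLTK's VADER sentiment analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

texts = [
    "This answer is fantastic, thank you!",
    "This code is broken and the docs are useless.",
]
for text in texts:
    scores = sia.polarity_scores(text)  # dict with 'neg', 'neu', 'pos', 'compound'
    print(text, scores)
```

Since each document is scored independently, the polarity_scores calls can be split across workers or cluster nodes.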
