Week 7 plan - Division of Labor #19

Open

sanittawan opened this issue May 10, 2019 · 3 comments

@sanittawan (Collaborator) commented May 10, 2019

| Task | Name | Date |
| --- | --- | --- |
| **Data uploading** | | |
| Convert to CSV (Adam's code, CSV module, get rid of tags) | Nikki, Adam | Fri May 17 |
| Upload to buckets | Adam | Fri May 17 |
| Figure out sharing/access | Adam | Fri May 17 |
| **Data prep** | | |
| Decide necessary vars | Dhruval | Fri May 17 |
| Find data keys | Dhruval | Fri May 17 |
| Decide on data structure | Dhruval | Fri May 17 |
| Join data using MapReduce | Dhruval, Nikki | Fri May 17 |
| **Sentiment logistics** | | |
| Decide on a dictionary | Li | Fri May 17 |
| Decide on sentiments | Li | Fri May 17 |
| Define sentiments (n-grams) | Li | Fri May 17 |
| Specify inputs for models | Li | Fri May 17 |
| **Data analysis** | | |
| NLTK | Li | Fri May 17 |
| What other packages to use | ?? | |
| Split up analysis on clusters | ?? | |
| **Pres/Viz** | | |
| ?? | ?? | |
| ?? | ?? | |
sanittawan added the High Priority and Difficult labels on May 10, 2019
sanittawan changed the title from "Division of Labor" to "Week 7 plan - Division of Labor" on May 10, 2019
@tonofshell (Collaborator) commented:

Progress update 1:

  • Created a storage bucket
  • I believe I can grant access to it through the emails associated with your Google Cloud accounts
  • Some of the smaller XML files have already finished converting; the larger ones might have to be done on the cloud
    • I believe Nikki is adapting my code to work with MapReduce for this reason
  • Tomorrow morning, before the workshop, I will upload any finished CSVs and all of the XML files; this should (hopefully) take no more than a few hours (roughly what the sketch below does)
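
For reference, here is a rough sketch of the upload step using the google-cloud-storage Python client. The bucket name and local paths are placeholders, not our actual ones.

```python
# Sketch of uploading converted CSVs and raw XML files to a Cloud Storage bucket.
# The bucket name and glob patterns below are placeholders.
import glob
from google.cloud import storage

client = storage.Client()                      # uses the default Google Cloud credentials
bucket = client.bucket("our-project-bucket")   # placeholder bucket name

for path in glob.glob("converted/*.csv") + glob.glob("raw/*.xml"):
    blob = bucket.blob(path)                   # keep the relative path as the object name
    blob.upload_from_filename(path)
    print("uploaded", path)
```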

@sanittawan (Collaborator, Author) commented May 16, 2019

@tonofshell Not sure if you've seen my comment in another issue (I posted it last Saturday), so I'm posting it here again.

  • Can you please tell me which line in your code causes the script to scan the whole file?

  • I have a question about these lines (in startElement()).

```python
if self.row == 1:
    self.out.write(str(attributes.keys())[1:-1] + "\n")
if len(attributes) > 0:
    self.out.write(str(attributes.values())[1:-1] + "\n")
```

It seems to me that attributes is a dictionary-like object, and the order of its keys is not guaranteed to be the same for every row. How do you know that the keys and values of each row of data will come out in exactly the same order? (If it happens to yield the right thing, that's luck.) For example, if the attributes of row 0 are {id: 0, name: "a", link: "url0"}, how can you be sure that row 1's attributes would not be something like {name: "b", id: 1, link: "url1"}? The resulting CSV would then be:

id, name, link
0, "a", "url0"
"b", 1, "url1"

If you agree that this could be a problem, we should control the order by keeping an explicit list of column names. (That was the reason I hard-coded the column names, though I did plan to make that more generic.)
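
Something along these lines is what I have in mind (a minimal sketch, not your converter; the column names, the "row" element name, and the file names are just placeholders):

```python
# Minimal sketch: write the CSV with a fixed column order by looking each
# attribute up by name. COLUMNS and the element name are placeholders.
import csv
import xml.sax

COLUMNS = ["Id", "Name", "Link"]

class RowHandler(xml.sax.ContentHandler):
    def __init__(self, out_path):
        super().__init__()
        self.out = open(out_path, "w", newline="")
        self.writer = csv.writer(self.out)
        self.writer.writerow(COLUMNS)  # header written once, in a known order

    def startElement(self, name, attributes):
        if name == "row" and len(attributes) > 0:
            # Look each column up by name, so every row comes out in the same
            # order no matter how the attributes are ordered in the XML.
            self.writer.writerow([attributes.get(col, "") for col in COLUMNS])

    def endDocument(self):
        self.out.close()

# xml.sax.parse("input.xml", RowHandler("output.csv"))
```

Using csv.writer also takes care of quoting, instead of slicing the str() of the keys and values.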

MapReduce might not be a good fit for this task. I am going to write an MPI script that does the conversion tomorrow and will change the code to address the potential problem I pointed out above.
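
Roughly what I have in mind for the MPI part (a sketch with mpi4py; the glob pattern and convert_to_csv() are placeholders for the actual converter):

```python
# Sketch of the MPI approach: rank 0 lists the XML files and scatters them,
# then each rank converts its share of files independently.
import glob
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    files = sorted(glob.glob("data/*.xml"))          # placeholder path
    chunks = [files[i::size] for i in range(size)]   # round-robin split, one list per rank
else:
    chunks = None

my_files = comm.scatter(chunks, root=0)

for path in my_files:
    convert_to_csv(path)  # placeholder: the (fixed) SAX-to-CSV converter
```

It would be launched with something like mpiexec -n 4 python convert_xml.py.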

@liu431 (Owner) commented May 17, 2019

I've uploaded a notebook on applying the VADER package to extract sentiments from the data. It is easy to use and easy to parallelize.
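
A minimal example of the idea (not the notebook itself; the sample sentences are made up):

```python
# Score each piece of text with NLTK's VADER sentiment analyzer.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

texts = [
    "This answer is fantastic, thank you!",
    "This code is broken and the docs are useless.",
]
for text in texts:
    scores = sia.polarity_scores(text)  # dict with 'neg', 'neu', 'pos', 'compound'
    print(text, scores)
```

Since each document is scored independently, the polarity_scores calls can be split across workers or cluster nodes.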
