For Adam to run on Dataproc #37

sanittawan · 2019-05-31T21:08:50Z

These can be done in Dataproc.

Please follow these steps:

Run "Questions with highest number of answers per year" (decrs_max_ans_q.py) with Posts.csv
Run "Users who answered/asked questions the most" (decrs_users_activities.py) with Posts.csv
Add a column ",badges" to Badges.csv and save it to a new file in the bucket
Add a column ",users" to Users.csv and save it to a new file in the bucket
Cat the tagged Badges.csv and Users.csv files and save the result to a new file, "badges_users.csv"
Run "The locations where users with gold answer badges are from" (decrs_users_gold_ans.py) with "badges_users.csv"
Run "2-grams tags that are usually tagged together" (decrs_n_grams_tags.py) with Posts.csv
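
For the three file-preparation steps (tag Badges.csv, tag Users.csv, concatenate), here is a minimal sketch of the idea in Python. It assumes one record per line and no header row, which may not hold for the real files. The function names and sample rows are made up for illustration, and it works on in-memory lines rather than on the bucket.

```python
def tag_rows(lines, tag):
    """Append `tag` as a final column to every non-blank CSV line.

    Assumes one record per line (no embedded newlines inside fields).
    """
    return [line.rstrip("\r\n") + "," + tag for line in lines if line.strip()]


def combine(badges_lines, users_lines):
    """Tag each source, then concatenate (the equivalent of `cat`)."""
    return tag_rows(badges_lines, "badges") + tag_rows(users_lines, "users")


# Hypothetical sample rows, not taken from the real data.
badges = ["1,5,Teacher\n", "2,6,Great Answer\n"]
users = ["5,Alice,Chicago\n", "6,Bob,Bangkok\n"]

combined = combine(badges, users)
for row in combined:
    print(row)
```

The extra column lets a downstream script tell which file each row of badges_users.csv came from.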

I will add more to the list as we have more. Thanks!

tonofshell · 2019-06-01T02:28:50Z

Got the top posts analysis completed, but the output isn't looking right. I posted the raw file on my repo.

tonofshell · 2019-06-01T04:08:40Z

The user activities output was too large to push to GitHub, so here it is on OneDrive.

tonofshell · 2019-06-01T04:10:37Z

Do you want me to come up with the code to add columns to the Badges and Users files?

sanittawan · 2019-06-02T06:49:38Z

> Got the top posts analysis completed but the output isn't looking right, posted the raw file on my repo

I am not sure why it's outputting that way, though some of the results are what I was expecting. It's supposed to yield (year, [num answers, title]). My guess is that something odd in the processed data is giving the code weird years.
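
To make the expected shape concrete, here is a rough sketch of a per-year maximum that yields (year, [num answers, title]). It is plain Python and not the actual decrs_max_ans_q.py. The function name and the sample records are hypothetical.

```python
def max_answers_per_year(records):
    """Return [(year, [num_answers, title]), ...], one entry per year.

    `records` is an iterable of (year, num_answers, title) tuples.
    """
    best = {}
    for year, num_answers, title in records:
        # Keep the question with the most answers seen so far for this year.
        if year not in best or num_answers > best[year][0]:
            best[year] = [num_answers, title]
    return sorted(best.items())


# Hypothetical records, not taken from Posts.csv.
records = [
    ("2008", 12, "Question A"),
    ("2008", 40, "Question B"),
    ("2009", 7, "Question C"),
]

result = max_answers_per_year(records)
for year, value in result:
    print(year, value)
```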

> Do you want me to come up with the code to add columns to the Badges and Users files?

Yes, if you could figure out how to use gsutil to add those two columns to the files in the bucket, that'd be ideal. On Linux, @dhruvalb and I used awk or sed to do it. I can probably look into it tomorrow night. I have a presentation on Tuesday, so I won't be able to get to it earlier :( sorry!

sanittawan · 2019-06-02T06:52:30Z

Oh! I think I know why. It must be because of the newline character ("\n"). I wasn't aware of it until weeks later. That's why it's giving us weird years! What we can do is probably filter the results we got. We know that the data range from 2008 to 2019, so I can write a RegEx to grab those years.
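
A minimal sketch of that filter, assuming the output can be read back as (year, value) pairs with the year as a string. The pattern, function name, and sample pairs are only illustrative.

```python
import re

# Matches exactly the years 2008 through 2019.
YEAR_RE = re.compile(r"200[89]|201[0-9]")


def keep_valid_years(pairs):
    """Keep only (year, value) pairs whose key is a year in 2008-2019."""
    return [(year, value) for year, value in pairs if YEAR_RE.fullmatch(str(year))]


# Hypothetical output rows, including keys broken by a stray newline.
pairs = [
    ("2008", [40, "Question B"]),
    ("and then I tried", [0, ""]),
    ("2019", [3, "Question D"]),
    ("2020", [1, "Question E"]),
    ("20081", [2, "Question F"]),
]

filtered = keep_valid_years(pairs)
for year, value in filtered:
    print(year, value)
```

This only drops the malformed keys from the output. Any rows that were split by an embedded newline would still be missing from the counts.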

sanittawan assigned tonofshell May 31, 2019
