For Adam to run on Dataproc #37

Open
sanittawan opened this issue May 31, 2019 · 5 comments
@sanittawan
Collaborator

sanittawan commented May 31, 2019

These can be done in Dataproc.

Please follow these steps:

  1. Run "questions with the highest number of answers per year", decrs_max_ans_q.py, with Posts.csv.

  2. Run "users who answered/asked the most questions", decrs_users_activities.py, with Posts.csv.

  3. Add a column ",badges" to Badges.csv and save it to a new file in the bucket.

  4. Add a column ",users" to Users.csv and save it to a new file in the bucket.

  5. Cat the Badges and Users files (with the added columns) and save the result to a new file, "badges_users.csv".

  6. Run "the locations where users with gold answer badges are from", decrs_users_gold_ans.py, with "badges_users.csv".

  7. Run "2-grams of tags that are usually tagged together", decrs_n_grams_tags.py, with Posts.csv.

I will add more to the list as we have more. Thanks!
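For reference, steps 3 to 5 could be sketched roughly as below. This is only an illustration, not the code that was actually used: it assumes the files have been copied locally, that they are well-formed CSV, and that the new column is simply appended to every row (including any header row). The function names `tag_rows` and `build_combined` are made up for this example.

```python
import csv
from typing import Iterable, Iterator, List


def tag_rows(rows: Iterable[List[str]], label: str) -> Iterator[List[str]]:
    """Yield each row with `label` appended as a final column."""
    for row in rows:
        yield row + [label]


def build_combined(badges_path: str, users_path: str, out_path: str) -> None:
    """Write tagged Badges rows followed by tagged Users rows to out_path.

    Assumption: both inputs are local, well-formed CSV files.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, lineterminator="\n")
        for path, label in ((badges_path, "badges"), (users_path, "users")):
            with open(path, newline="", encoding="utf-8") as src:
                # csv.reader keeps quoted fields with embedded newlines
                # together as one row instead of splitting them.
                writer.writerows(tag_rows(csv.reader(src), label))
```

Something like `build_combined("Badges.csv", "Users.csv", "badges_users.csv")` would then produce the combined file, which would still need to be uploaded to the bucket separately.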

@tonofshell
Collaborator

Got the top posts analysis completed but the output isn't looking right, posted the raw file on my repo

@tonofshell
Collaborator

The user activities output was too large to push to GitHub, so here it is on OneDrive

@tonofshell
Collaborator

Do you want me to come up with the code to add columns to the Badges and Users files?

@sanittawan
Collaborator Author

> Got the top posts analysis completed but the output isn't looking right, posted the raw file on my repo

I am not sure why it's outputting that way, though some of the results are what I was expecting. It's supposed to yield (year, [num answers, title]). My guess is that something odd in the processed data is giving the code weird years.
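To make the expected shape concrete, here is a rough sketch of the reduction in plain Python. It is not the code in decrs_max_ans_q.py; the function name `max_answers_per_year` and the sample records are hypothetical and only illustrate the (year, [num answers, title]) output.

```python
from typing import Dict, Iterable, List, Tuple, Union


def max_answers_per_year(
    records: Iterable[Tuple[int, int, str]]
) -> Dict[int, List[Union[int, str]]]:
    """Map each year to [num_answers, title] of its most-answered question."""
    best: Dict[int, List[Union[int, str]]] = {}
    for year, num_answers, title in records:
        # Keep the record with the highest answer count seen so far.
        if year not in best or num_answers > best[year][0]:
            best[year] = [num_answers, title]
    return best


# Hypothetical sample records: (year, number of answers, title).
sample = [
    (2008, 12, "Question A"),
    (2008, 40, "Question B"),
    (2009, 7, "Question C"),
]
result = max_answers_per_year(sample)
```

With this sample, every key in `result` is a plain four-digit year, which is what the real output should look like too.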

> Do you want me to come up with the code to add columns to the Badges and Users files?

Yes, if you could figure out how to use gsutil to add those two columns to the files in the bucket, that'd be ideal. On Linux, @dhruvalb and I used awk or sed to do it. I can probably look into it tomorrow night. I have a presentation on Tuesday, so I'm not going to be able to get to it earlier :( sorry!

@sanittawan
Collaborator Author

Oh! I think I know why. It must have been because of the newline character, "\n". I wasn't aware of it until weeks after. That's why it's giving us weird years! What we can probably do is filter the result based on the output we got. We know that the data ranges from 2008 to 2019, so I can write a RegEx to grab those years.
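A minimal sketch of that filter, assuming the output can be read as (key, value) pairs where a valid key is a four-digit year. The names `YEAR_RE` and `keep_valid_years` are just for illustration, and the sample pairs are made up.

```python
import re
from typing import Any, Iterable, List, Tuple

# Matches exactly the years 2008 through 2019.
YEAR_RE = re.compile(r"200[89]|201[0-9]")


def keep_valid_years(pairs: Iterable[Tuple[Any, Any]]) -> List[Tuple[Any, Any]]:
    """Keep only the pairs whose key is a year between 2008 and 2019."""
    return [
        (key, value)
        for key, value in pairs
        # fullmatch rejects keys that merely contain a year-like substring.
        if YEAR_RE.fullmatch(str(key).strip())
    ]


# Hypothetical output rows, including keys broken by stray newlines.
rows = [
    ("2008", [40, "Question B"]),
    ("some text fragment", [1, ""]),
    ("2020", [3, "Out of range"]),
    ("2019", [9, "Question D"]),
]
filtered = keep_valid_years(rows)
```

This only cleans up the output after the fact; rows that were split by the embedded newlines are dropped rather than repaired.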
