For Adam to run on Dataproc #37
Comments
I completed the top posts analysis, but the output doesn't look right. I posted the raw file on my repo.
The user activities output was too large to push to GitHub, so I put it on OneDrive.
Do you want me to come up with the code to add columns to the Badges and Users files?
I'm not sure why it's outputting that way, though some of the results are what I was expecting. It's supposed to yield (year, [num answers, title]). My guess is that something odd in the processed data is giving the code weird years.
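For reference, here is a minimal sketch of the output shape described above, written in plain Python rather than as the actual decrs_max_ans_q.py job. The function name and the (year, num_answers, title) input layout are assumptions made for illustration only; the real script and the real Posts.csv columns may differ.

```python
def max_answers_per_year(rows):
    """Return {year: [num_answers, title]} for the most-answered question per year.

    `rows` is assumed to be an iterable of (year, num_answers, title) tuples.
    This layout is hypothetical, not the actual Posts.csv schema.
    """
    best = {}
    for year, num_answers, title in rows:
        # Keep the first question seen for a year, then replace it only
        # when a later question has strictly more answers.
        if year not in best or num_answers > best[year][0]:
            best[year] = [num_answers, title]
    return best
```

If the year key is read from a broken record, it shows up here as its own "year", which would match the odd keys in the raw output.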
Yes, if you could figure out how to use gsutil to add those two columns to the files in the bucket, that'd be ideal. On Linux, @dhruvalb and I used
Oh! I think I know why. It must be the newline character, "\n". I wasn't aware of it until weeks later. That's why it's giving us weird years! What we can do is filter the result based on the output we got. We know that the data ranges from 2008 to 2019, so I can write a regex to grab those years.
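A rough sketch of that filter, assuming the results come back as (year, value) pairs with the year as a string or int. The names YEAR_RE and keep_valid_years are made up for this example and are not part of the existing scripts.

```python
import re

# Matches exactly the years 2008 through 2019 and nothing else.
YEAR_RE = re.compile(r"2008|2009|201[0-9]")


def keep_valid_years(results):
    """Keep only (year, value) pairs whose year is between 2008 and 2019.

    fullmatch is used so that keys with trailing characters (for example a
    stray newline) or partial matches inside longer strings are rejected.
    """
    return [
        (year, value)
        for year, value in results
        if YEAR_RE.fullmatch(str(year))
    ]
```

This only hides the bad keys after the fact. Rows that were split on an embedded newline would still be missing from the counts.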
These can be done in Dataproc.
Please follow these steps:
Run "Questions with highest number of answers per year" (decrs_max_ans_q.py) with Posts.csv.
Run "Users who answered/asked questions the most" (decrs_users_activities.py) with Posts.csv.
Add a column ",badges" to Badges.csv and save it to a new file in the bucket.
Add a column ",users" to Users.csv and save it to a new file in the bucket.
Cat Badges.csv and Users.csv and save the result to a new file, "badges_users.csv".
Run "The locations where users with gold answer badges are from" (decrs_users_gold_ans.py) with "badges_users.csv".
Run "2-grams tags that are usually tagged together" (decrs_n_grams_tags.py) with Posts.csv.
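The column and cat steps could be sketched roughly like this in Python. It assumes that "add a column" means appending ",badges" or ",users" to the end of every line, that each record sits on a single line, and that the merged file is built from the tagged versions. The function names tag_lines and merge_tagged are hypothetical.

```python
def tag_lines(lines, tag):
    """Yield each non-empty line with ',<tag>' appended as a last column."""
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line ending before appending
        if line:
            yield f"{line},{tag}"


def merge_tagged(badges_lines, users_lines):
    """Yield the tagged Badges lines followed by the tagged Users lines."""
    yield from tag_lines(badges_lines, "badges")
    yield from tag_lines(users_lines, "users")


# Example with local copies of the files (paths are placeholders):
# with open("Badges.csv") as b, open("Users.csv") as u, \
#         open("badges_users.csv", "w") as out:
#     for row in merge_tagged(b, u):
#         out.write(row + "\n")
```

Because this works line by line, any record containing an embedded newline would get the tag in the middle of the record, so it is worth checking the files for that first. The resulting file would still need to be copied to the bucket, for example with gsutil cp.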
I will add more to the list as we have more. Thanks!