Updated location prediction to openstreetmap #7
Nice analyses, @nrosed! I'm approving these changes. I left some comments with questions. Regarding the R code, I just took a quick look at it, because I don't use R much (although I left some minor comments in some parts). As a general comment regarding code, in the future, I would add more documentation about functions and their arguments.
Similar to before, we will match columns based on their names: `raw_loc_df` has `est_` columns and `bm_loc_df` has matching `true_` columns.
Now let's first look at the benchmark data. ![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-1.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-2.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-3.png)
OK, so we see a strong signal that US/Americas/Europe are mentioned at a much higher rate than other regions. We would like to also see this pattern in our predicted locations, but first we need to show that our estimations are accurate. Shown below are analyses comparing our hand-annotated benchmark data against the fully automated processed data. We would like to show that the true number of articles with a region mention is highly correlated to the estimated number of articles from our full pipeline.
I have a basic question here. When you say "We would like to show that the true **number** of articles with a region mention is highly correlated to the estimated **number** of articles from our full pipeline", are you only comparing the final numbers? For instance, with this analysis, what happens if you have 10 articles from Europe (set A) and 10 articles from America (set B), and you predict that all B articles are from Europe and all A articles are from America? In this case, the numbers are correct, but the predictions are wrong.
Yes, this could happen if we only considered the correlations (the last 6 plots). However, the confusion matrix and the somewhat strong Kappa value show that this type of switching behavior is not occurring.
I focus on the number of articles per country because this is the metric that I will use later for the analysis of the full dataset.
Does this make sense?
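To make the distinction concrete, here is a minimal sketch of the 10/10 example from the question. It is written in Python rather than the R used in this PR, the labels are made up, and `cohen_kappa` is a hypothetical helper, not a function from the PR. It shows that fully swapped predictions leave the per-region counts identical while Cohen's kappa drops to its minimum.

```python
from collections import Counter


def cohen_kappa(true_labels, pred_labels):
    """Cohen's kappa for two equal-length label sequences (illustrative helper)."""
    n = len(true_labels)
    # Observed agreement: fraction of items where the two labels match.
    p_observed = sum(t == p for t, p in zip(true_labels, pred_labels)) / n
    # Chance agreement, computed from the marginal label frequencies.
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    p_chance = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    # Note: undefined (division by zero) if both sequences use one single label.
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up data: 10 articles truly from Europe, 10 truly from America.
true_labels = ["Europe"] * 10 + ["America"] * 10
# Worst case from the question: every prediction is swapped.
swapped = ["America"] * 10 + ["Europe"] * 10

counts_match = Counter(true_labels) == Counter(swapped)  # per-region totals agree
kappa_swapped = cohen_kappa(true_labels, swapped)        # per-article agreement does not
kappa_perfect = cohen_kappa(true_labels, true_labels)
```

Here `counts_match` is true even though no single article is labelled correctly, which is why a count-level correlation needs a per-article check such as the confusion matrix or kappa alongside it.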
Yes it does!
In this PR, location prediction has been updated. It is now done by querying openstreetmap or a cached version of openstreetmap. Previously, it used a database of all universities/businesses that have published in Nature.
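The cache-first lookup described above could look roughly like the sketch below. This is not the PR's implementation (which is in R): `CachedLocationLookup`, `query_fn`, and `fake_query` are hypothetical names, and `fake_query` is a stand-in for whatever actually sends the request to openstreetmap.

```python
class CachedLocationLookup:
    """Resolve a place name to a country, consulting a local cache first."""

    def __init__(self, query_fn, cache=None):
        # query_fn: hypothetical callable that performs the remote lookup.
        self.query_fn = query_fn
        # cache: normalized place name -> country (or None if unresolved).
        self.cache = {} if cache is None else cache
        self.remote_calls = 0

    def lookup(self, place):
        key = place.strip().lower()
        if key in self.cache:
            # Cache hit, including cached "not found" (None) results.
            return self.cache[key]
        self.remote_calls += 1
        result = self.query_fn(place)
        self.cache[key] = result
        return result


def fake_query(place):
    """Stand-in for a real openstreetmap query; uses a tiny made-up table."""
    table = {"Heidelberg": "Germany", "Boston": "United States"}
    return table.get(place.strip())


lookup = CachedLocationLookup(fake_query)
first = lookup.lookup("Boston")      # goes to query_fn
second = lookup.lookup(" boston ")   # served from the cache
missing = lookup.lookup("Atlantis")  # unresolved; None is cached too
```

Caching unresolved names as well as resolved ones keeps repeated misses from triggering repeated remote queries, which matters when the same affiliation strings recur across many articles.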
There are 3 main analysis updates for location data:
The following scripts were modified to enable the above analyses: