Updated location prediction to openstreetmap #7

nrosed · 2021-02-08T21:17:56Z

This PR updates location prediction. Locations are now predicted by querying OpenStreetMap or a cached version of it. Previously, prediction relied on a database of all universities and businesses that have published in Nature.
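The query-or-cache lookup described above can be sketched roughly as follows. This is a Python illustration only, not the R code in this PR: the function names, the in-memory dictionary cache, and the choice of the public Nominatim search endpoint are all assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Nominatim endpoint; the PR may query OSM differently.
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def fetch_from_osm(name):
    """Query the Nominatim search endpoint and return a country code or None."""
    params = urllib.parse.urlencode(
        {"q": name, "format": "json", "addressdetails": 1, "limit": 1}
    )
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{params}",
        # Nominatim's usage policy asks clients to identify themselves.
        headers={"User-Agent": "location-prediction-sketch"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    if not results:
        return None
    return results[0].get("address", {}).get("country_code")


def lookup_country(name, cache, fetch=fetch_from_osm):
    """Return a country code for `name`, consulting the cache before OSM."""
    key = name.strip().lower()
    if key not in cache:
        # Misses (None) are cached too, so unknown names are not re-queried.
        cache[key] = fetch(name)
    return cache[key]
```

Passing `fetch` as an argument keeps the cache logic testable without network access; a persistent cache (for example a file on disk) would replace the dictionary in practice.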

There are 3 main analysis updates for location data:

analysis_scripts/analyze_benchmark_data/benchmark_analysis.md (the R version has the same name with .Rmd)
- This analysis compares a hand-annotated benchmark reference (the gold standard) against the location predictions from the full pipeline.
- Its main purpose is to quantify how well the pipeline predicts the location of different "named entities".
- The benchmark analysis covers both gender and location predictions; only the location analysis has been updated.
analysis_scripts/analyze_quote_data/location-analysis_all_years.md (the R version has the same name with .Rmd)
- This analysis looks at the location pipeline output from 2010 and 2015-2020.
- It first compares how different the benchmark dataset is from the full dataset for the years 2010, 2015, and 2020.
- It then looks at the proportion of articles that mention a specific country.
analysis_scripts/qc_scripts/pipeline_qc.md (the R version has the same name with .Rmd)
- This analysis compares statistics across years for each step of the pipeline. Ideally, feature counts stay stable across pipeline steps, as they do for the number of locations identified. The gender counts are less consistent because genders are predicted with genderize.io between pipeline steps 3 and 4; even so, drastic changes are not expected.
- The main change in this analysis is the addition of the last set of plots (QC pipeline step 4: Check coreNLP processed output), where the plot style changed for the location/gender data.
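The stability check described for the QC analysis can be sketched as below. This is a Python illustration under assumptions, not the code in `pipeline_qc.Rmd`: the function names, the 10% tolerance, and the counts are all made up for the example.

```python
def relative_changes(counts_by_step):
    """Relative change in a count between consecutive pipeline steps."""
    changes = []
    for previous, current in zip(counts_by_step, counts_by_step[1:]):
        changes.append((current - previous) / previous)
    return changes


def unstable_steps(counts_by_step, tolerance=0.10):
    """1-based positions of steps whose count moved more than `tolerance`
    relative to the step before it (tolerance is a hypothetical threshold)."""
    return [
        index + 2  # change at index 0 is between step 1 and step 2
        for index, change in enumerate(relative_changes(counts_by_step))
        if abs(change) > tolerance
    ]


# Made-up counts for four pipeline steps, for illustration only.
location_counts = [1000, 1000, 990, 985]
gender_counts = [1000, 1000, 980, 700]

print(unstable_steps(location_counts))
print(unstable_steps(gender_counts))
```

With these invented numbers, only the gender counts are flagged, at step 4, which mirrors the kind of step 3 to step 4 shift the description attributes to genderize.io.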

The following scripts were modified to enable the above analyses:

analysis_scripts/analysis_utils.R
process_scraped_data/process_corenlp_locations_corenlp_output.R
utils/plotting_utils.R
utils/scraper_processing_utils.R (most changes here)

miltondp

Nice analyses, @nrosed! I'm approving these changes. I left some comments with questions. Regarding the R code, I just took a quick look at it, because I don't use R much (although I left some minor comments in some parts). As a general comment regarding code, in the future, I would add more documentation about functions and their arguments.

miltondp · 2021-02-09T14:37:25Z

analysis_scripts/analyze_benchmark_data/benchmark_analysis.md


 Similar to before, we will match columns based on their names: `raw_loc_df` has `est_` columns and `bm_loc_df` has matching `true_` columns

 Now let's first look at the benchmark data ![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-1.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-2.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-3.png)
+
+Ok, so we see a strong signal that US/Americas/Europe are mentioned at a much higher rate than other regions. We would like to also see this pattern in our predicted locations, but first we need to show that our estimations are accurate. The analyses below compare our hand-annotated benchmark data against the fully automated processed data. We would like to show that the true number of articles with a region mention is highly correlated with the estimated number of articles from our full pipeline.


I have a basic question here. When you say "We would like to show that the true **number** of articles with a region mention, is highly correlated to the estimated **number** of articles from our full pipeline", are you only comparing the final numbers? For instance, what happens if you have 10 articles from Europe (set A) and 10 articles from America (set B), and you predict all B articles as being from Europe and all A articles as being from America? In that case the numbers are correct, but the predictions are wrong.

nrosed · 2021-02-09

Yes, this could happen if we only considered the correlations (the last 6 plots). However, the confusion matrix and the somewhat strong kappa value show that this kind of switching is not occurring.

I focus on the number of articles per country because this is the metric that I will use later for the analysis of the full dataset.

Does this make sense?

miltondp · 2021-02-19

Yes, it does!
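The scenario raised in this thread can be worked through in a few lines. The sketch below is a Python illustration, not code from the PR, and it assumes the kappa in question is Cohen's kappa; the labels are the hypothetical sets A and B from the question.

```python
from collections import Counter


def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(true_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(
        true_counts[label] * predicted_counts[label] for label in true_counts
    ) / (n * n)
    # Undefined (division by zero) when both labelings use a single label.
    return (observed - expected) / (1 - expected)


# Hypothetical example: 10 European and 10 American articles, all swapped.
true_regions = ["Europe"] * 10 + ["America"] * 10
predicted_regions = ["America"] * 10 + ["Europe"] * 10

# Per-region article counts are identical, so count correlations look perfect.
print(Counter(true_regions) == Counter(predicted_regions))  # True
# Per-article agreement is as bad as it can be.
print(cohen_kappa(true_regions, predicted_regions))  # -1.0
```

Here observed agreement is 0 and chance agreement is 0.5, so kappa is (0 - 0.5) / (1 - 0.5) = -1. Count-level correlation alone cannot detect the swap, while a per-article measure such as kappa or the confusion matrix does.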

analysis_scripts/qc_scripts/pipeline_qc.md

analysis_scripts/analysis_utils.R

process_scraped_data/process_corenlp_locations_corenlp_output.R

utils/plotting_utils.R

process_scraped_data/run_process_target_year.sh

nrosed added 8 commits February 3, 2021 20:00

initial OSM query code and first run

2628478

initial comparison on location prediction accuracy for the benchmark dataset

4c986b3

benchmark location analysis bug fixes

fd871af

painful location hand annotations, but it was worth it, kappa up from 0.61 to 0.72

be4aa0b

re run location analysis

9886bc5

removed unneccesary methods for location prediction

12ef904

changed headers in markdown

6ec2e87

added location plot over 2010, 2015-2020

2bba9ee

nrosed requested a review from miltondp February 8, 2021 21:17

miltondp approved these changes Feb 9, 2021

nrosed added 3 commits February 9, 2021 11:30

using file.path instead of paste, and added function details

3fb8701

now handling reading warning of cdh code file

e1ff68b

added R2 and RMSE to benchmark location plots

e2c0219

nrosed mentioned this pull request Feb 9, 2021

GB, IT, CA country frequencies very different from benchmark #8

Open

nrosed added 2 commits February 9, 2021 14:25

added more comments on the location analysis

39ebb1e

sharpened language in pipeline qc, quote matching

3fec287

nrosed merged commit 526069b into greenelab:main Mar 8, 2021
