Updated location prediction to openstreetmap #7

Merged
merged 13 commits into from
Mar 8, 2021

Collaborator

@nrosed nrosed commented Feb 8, 2021

This PR updates location prediction. Locations are now predicted by querying OpenStreetMap or a cached version of OpenStreetMap. Previously, prediction used a database of all universities/businesses that have published in Nature.
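
The cache-first lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code in this PR: `lookup_location`, `fake_query`, and the returned country code are all hypothetical names and values, and the real OpenStreetMap query is replaced by a stand-in function so the example runs without network access.

```python
from typing import Callable, Dict, Optional


def lookup_location(name: str,
                    cache: Dict[str, Optional[str]],
                    query_fn: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a country code for `name`, checking the cache before querying."""
    key = name.strip().lower()  # normalize so trivial variants share one entry
    if key in cache:
        return cache[key]
    result = query_fn(name)
    # Store misses (None) as well, so unknown names are not queried again.
    cache[key] = result
    return result


# Hypothetical stand-in for a real OpenStreetMap query.
calls = []


def fake_query(name: str) -> Optional[str]:
    calls.append(name)
    return {"University of Pennsylvania": "us"}.get(name)


cache: Dict[str, Optional[str]] = {}
first = lookup_location("University of Pennsylvania", cache, fake_query)
second = lookup_location("university of pennsylvania ", cache, fake_query)
print(first, second, len(calls))
```

Passing the query function in as an argument keeps the caching logic separate from whichever service is actually queried.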

There are 3 main analysis updates for location data:

  • analysis_scripts/analyze_benchmark_data/benchmark_analysis.md (R version is the same name with .Rmd)
    • This analysis compares a hand-annotated benchmark reference (the gold standard) to the location predictions produced by the full pipeline.
    • The main purpose of this analysis is to quantify how well the pipeline predicts the location of different "named entities".
    • The benchmark analysis covers both gender and location predictions; only the location analysis has been updated.
  • analysis_scripts/analyze_quote_data/location-analysis_all_years.md (R version is the same name with .Rmd)
    • This analysis looks at the location pipeline output from 2010, 2015-2020.
    • This analysis first compares how different the benchmark dataset is from the full dataset for the years 2010, 2015, 2020.
    • Afterwards, the analysis looks at the proportion of articles that mention a specific country.
  • analysis_scripts/qc_scripts/pipeline_qc.md (R version is the same name with .Rmd)
    • This analysis compares statistics across years at different steps of the pipeline. Ideally, the counts of features stay stable across pipeline steps. For example, the number of locations identified is stable between pipeline steps. The gender counts are less consistent because genders are predicted using genderize.io between pipeline steps 3 and 4; even so, drastic changes are not expected.
    • The main change in this analysis is the addition of the last set of plots (QC pipeline step 4: Check coreNLP processed output). Here, the plot style changed for the location/gender data.

The following scripts were modified to enable the above analyses:

  • analysis_scripts/analysis_utils.R
  • process_scraped_data/process_corenlp_locations_corenlp_output.R
  • utils/plotting_utils.R
  • utils/scraper_processing_utils.R (most changes here)

@nrosed nrosed requested a review from miltondp February 8, 2021 21:17
Member

@miltondp miltondp left a comment

Nice analyses, @nrosed! I'm approving these changes. I left some comments with questions. Regarding the R code, I just took a quick look at it, because I don't use R much (although I left some minor comments in some parts). As a general comment regarding code, in the future, I would add more documentation about functions and their arguments.


As before, we will match columns based on their names: `raw_loc_df` has `est_` columns and `bm_loc_df` has matching `true_` columns.
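
A minimal sketch of this kind of prefix-based column matching, written in Python for illustration rather than taken from the R scripts in this PR. The function `pair_columns` and the example column names after each prefix are assumptions; only the `est_` and `true_` prefixes come from the text.

```python
def pair_columns(est_cols, true_cols, est_prefix="est_", true_prefix="true_"):
    """Pair each `est_` column with the `true_` column that shares its suffix."""
    est_by_suffix = {c[len(est_prefix):]: c
                     for c in est_cols if c.startswith(est_prefix)}
    true_by_suffix = {c[len(true_prefix):]: c
                      for c in true_cols if c.startswith(true_prefix)}
    # Only suffixes present on both sides can be compared.
    shared = sorted(est_by_suffix.keys() & true_by_suffix.keys())
    return [(est_by_suffix[s], true_by_suffix[s]) for s in shared]


# Hypothetical column names standing in for raw_loc_df and bm_loc_df.
raw_loc_cols = ["text", "est_country_code", "est_un_region", "est_un_subregion"]
bm_loc_cols = ["text", "true_country_code", "true_un_region"]

pairs = pair_columns(raw_loc_cols, bm_loc_cols)
print(pairs)
```

Columns without a counterpart, such as `est_un_subregion` here, are dropped from the comparison rather than raising an error.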

Now let's first look at the benchmark data ![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-1.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-2.png)![](benchmark_analysis_files/figure-markdown_github/unnamed-chunk-9-3.png)

Ok, so we see a strong signal that US/Americas/Europe are mentioned at a much higher rate than other regions. We would like to also see this pattern in our predicted locations, but first we need to show that our estimations are accurate. The analyses below compare our hand-annotated benchmark data against the fully-automated processed data. We would like to show that the true number of articles with a region mention is highly correlated to the estimated number of articles from our full pipeline.
Member

I have a basic question here. When you say "We would like to show that the true **number** of articles with a region mention, is highly correlated to the estimated **number** of articles from our full pipeline", are you only comparing the final numbers? For instance, with this analysis, what happens if you have 10 articles from Europe (set A) and 10 articles from America (set B), and you predict all B articles as being from Europe and all A articles as being from America? In this case, the numbers are correct, but the prediction is wrong.

Collaborator Author

Yes, this could happen if we only consider the correlations (so the last 6 plots). However, the confusion matrix and the somewhat strong Kappa value show that this type of switching behavior is not occurring.
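
To make the point concrete, here is a small Python sketch of the swap scenario from the question (10 Europe and 10 Americas articles with every label switched). It is not the R code from the analysis; the labels are synthetic and Cohen's kappa is computed by hand for illustration.

```python
from collections import Counter


def confusion_counts(true_labels, est_labels):
    """Count (true, estimated) label pairs, i.e. the cells of a confusion matrix."""
    return Counter(zip(true_labels, est_labels))


def cohens_kappa(true_labels, est_labels):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(true_labels)
    observed = sum(t == e for t, e in zip(true_labels, est_labels)) / n
    true_freq = Counter(true_labels)
    est_freq = Counter(est_labels)
    expected = sum(true_freq[c] * est_freq[c] for c in true_freq) / (n * n)
    # Undefined when expected == 1 (a single class); not handled in this sketch.
    return (observed - expected) / (1 - expected)


# Synthetic labels: every Europe article predicted as Americas and vice versa.
true_labels = ["Europe"] * 10 + ["Americas"] * 10
swapped = ["Americas"] * 10 + ["Europe"] * 10

same_totals = Counter(true_labels) == Counter(swapped)
kappa_swapped = cohens_kappa(true_labels, swapped)
kappa_perfect = cohens_kappa(true_labels, true_labels)
print(same_totals, kappa_swapped, kappa_perfect)
print(confusion_counts(true_labels, swapped))
```

The per-region totals are identical in the swapped case, so a count correlation alone would look perfect, while the confusion matrix has no entries on its diagonal and kappa drops to its minimum.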

I focus on the number of articles per country because this is the metric that I will use later for the analysis of the full dataset.

Does this make sense?

Member

Yes it does!

analysis_scripts/qc_scripts/pipeline_qc.md (resolved)
analysis_scripts/analysis_utils.R (outdated, resolved)
utils/plotting_utils.R (resolved)
process_scraped_data/run_process_target_year.sh (outdated, resolved)
@nrosed nrosed merged commit 526069b into greenelab:main Mar 8, 2021