Fix: ArXiv API Migration - OAI-PMH Implementation #243
base: main
Conversation
- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure

Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis
…nformation
- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility

Test run processed 10 journals and 1 article sample successfully.
- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis

New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions

Test data shows successful extraction of BY, NC, SA flags and license URLs.
- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail

Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research
… integration
- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification

Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary information. Boolean flags add complexity without providing additional analytical value for commons quantification purposes.
- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)

This branch now focuses exclusively on ArXiv-related improvements. All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.
@TimidRobot I hope these fixes help address the issues in #236. I'd be happy to get your review of the script and make changes where necessary.
…iptive identifiers
5920576 to
5739ad3
Compare
"http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
"http://creativecommons.org/licenses/by-nd/3.0/": "CC BY-ND 3.0",
"http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
"http://creativecommons.org/share-your-work/public-domain/cc0/": "CC0",
Please add:
"http://creativecommons.org/licenses/publicdomain": "CC CERTIFICATION 1.0 US",

),
]
# License mapping for structured data from OAI-PMH
LICENSE_MAPPING = {
Please sort/order these entries
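One way to act on this request is to keep the mapping keys in lexicographic order and check that order automatically. The sketch below is only illustrative: it uses the four entries visible in the diff above plus the entry requested in review, and the helper names `is_sorted_by_key` and `lookup_license` and the "Unknown" fallback are hypothetical, not taken from arxiv_fetch.py.

```python
# Illustrative subset of LICENSE_MAPPING, ordered by URL.
# Entries come from the diff above and the review request; the real
# mapping in arxiv_fetch.py likely contains more.
LICENSE_MAPPING = {
    "http://creativecommons.org/licenses/by-nd/3.0/": "CC BY-ND 3.0",
    "http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
    "http://creativecommons.org/licenses/publicdomain": "CC CERTIFICATION 1.0 US",
    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
    "http://creativecommons.org/share-your-work/public-domain/cc0/": "CC0",
}


def is_sorted_by_key(mapping):
    """Return True when the dict's keys appear in lexicographic order."""
    keys = list(mapping)
    return keys == sorted(keys)


def lookup_license(url, mapping=LICENSE_MAPPING):
    """Map a license URL to its label; "Unknown" is a hypothetical fallback."""
    return mapping.get(url, "Unknown")
```

A check like `is_sorted_by_key(LICENSE_MAPPING)` could run in a test so that new entries are added in order.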
scripts/1-fetch/arxiv_fetch.py
Outdated
f"Provenance file write failed: {e}", 1
)

LOGGER.info(f"Total CC licensed papers fetched: {total_fetched}")
If total_fetched includes "Non-CC" papers, this message is misleading
@TimidRobot, thanks for this. I investigated and fixed the license filtering logic in arxiv_fetch.py, which incorrectly included non-Creative Commons licenses because of substring matching. The condition "CC" in metadata["license"] would match false positives like "Non-CC" or "Apache with CC mention". I changed it to metadata["license"].startswith("CC") so that only licenses beginning with "CC" are processed, which prevents contamination of CC license statistics with misclassified non-CC licenses.
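The difference between the two conditions can be sketched as follows. This is a minimal illustration, not the code from arxiv_fetch.py: the helper name `is_cc_license` and the sample labels are assumptions chosen to mirror the false positives named in this comment.

```python
def is_cc_license(label: str) -> bool:
    """Return True only when the license label begins with "CC"."""
    return label.startswith("CC")


# Sample labels (illustrative): two CC licenses and the two false
# positives mentioned in the comment.
labels = ["CC BY 4.0", "CC0 1.0", "Non-CC", "Apache with CC mention"]

# Old behaviour: substring matching accepts every label containing "CC".
substring_matches = [label for label in labels if "CC" in label]

# New behaviour: prefix matching keeps only labels that start with "CC".
prefix_matches = [label for label in labels if is_cc_license(label)]
```

With these sample labels, the substring check accepts all four entries, while the prefix check keeps only the two that start with "CC".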
as per Issue #236 by @TimidRobot
Problem
The current arxiv_fetch.py script relies on ArXiv's Atom API, which provides unreliable license information through two problematic fields:
- <rights> field: Does not exist in the API response schema according to ArXiv API documentation
- <summary> field: Uses text pattern matching, which incorrectly identifies papers that discuss CC licenses rather than papers that are actually CC-licensed
Impact
Example Case
Query [ALL:]("CC BY") returns paper [2008.00774v3] "Elsevier OA CC-By Corpus", which discusses CC BY works but is actually licensed under "arXiv.org - Non-exclusive license to distribute", not CC BY.
Proposed Solution
Migrate arxiv_fetch.py to use ArXiv's OAI-PMH API (https://oaipmh.arxiv.org/oai), which provides:
- <license> elements in arXiv namespace
Query Strategy Implemented
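As a rough illustration of reading structured <license> elements instead of matching text, the sketch below parses an embedded sample response with the standard library. The sample XML, the record ids, the function name `extract_licenses`, and the arXiv namespace URI are assumptions made for illustration; the actual response layout should be confirmed against the OAI-PMH endpoint, and this is not the code in arxiv_fetch.py.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: the OAI-PMH one is defined by the protocol; the arXiv
# one is assumed here and should be checked against a real response.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}

# Hand-written sample response (illustrative ids, not real papers).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00001</id>
          <license>http://creativecommons.org/licenses/by/4.0/</license>
        </arXiv>
      </metadata>
    </record>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00002</id>
        </arXiv>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_licenses(xml_text):
    """Return (paper id, license URL or None) for each record."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall(".//oai:record", NAMESPACES):
        paper_id = record.findtext(".//arxiv:id", namespaces=NAMESPACES)
        license_url = record.findtext(".//arxiv:license", namespaces=NAMESPACES)
        results.append((paper_id, license_url))
    return results
```

Because the license comes from a dedicated element, a paper that merely discusses CC BY in its abstract would not be counted unless its own <license> element points to a CC license.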
API Endpoint Migration
License Extraction Method
Request Parameters
Implementation Details
New Features
Data Structure Changes
Performance Improvements
API Requirements
https://oaipmh.arxiv.org/oai
Checklist
- … Update index.md).
- … main or master).
- … visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."