Fix: ArXiv API Migration - OAI-PMH Implementation #243

Opsmithe · 2025-11-08T18:04:19Z

Problem

The current arxiv_fetch.py script relies on ArXiv's Atom API, which provides unreliable license information through two problematic fields:

<rights> field: Does not exist in the API response schema according to ArXiv API documentation
<summary> field: Uses text pattern matching which incorrectly identifies papers that discuss CC licenses rather than papers that are actually CC-licensed

Impact

False positives: Papers discussing CC-licensed works are incorrectly classified as CC-licensed
Data integrity: Unreliable license detection compromises the accuracy of commons quantification
Reproducibility: Inconsistent results across different API responses

Example Case

Query [ALL:]("CC BY") returns paper [2008.00774v3] "Elsevier OA CC-By Corpus" which discusses CC BY works but is actually licensed under "arXiv.org - Non-exclusive license to distribute", not CC BY.
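
The failure mode in this example case can be reproduced offline. The sketch below is only an illustration: the record dictionary, its field names (`summary`, `license_url`) and the abstract text are invented for the example and are not the script's actual data structure. It contrasts keyword matching on the abstract with a check against the declared license URL.

```python
import re

# Hypothetical record modelled on the example case: the (invented) abstract
# mentions CC BY, but the declared license is arXiv's non-exclusive license.
record = {
    "id": "2008.00774v3",
    "summary": "We present a corpus of open access CC-BY articles.",
    "license_url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
}

CC_TEXT_PATTERN = re.compile(r"\bCC[- ]BY\b", re.IGNORECASE)


def looks_cc_by_text(rec):
    """Keyword heuristic: true whenever the abstract mentions CC BY."""
    return bool(CC_TEXT_PATTERN.search(rec["summary"]))


def is_cc_licensed(rec):
    """Structured check: true only if the declared license URL is a CC one."""
    return "creativecommons.org/" in rec["license_url"]


print(looks_cc_by_text(record), is_cc_licensed(record))
```

The heuristic flags the record while the structured check does not, which is the false positive described above.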

Proposed Solution

Migrate arxiv_fetch.py to use ArXiv's OAI-PMH API (https://oaipmh.arxiv.org/oai) which provides:

Structured metadata: XML-based responses with dedicated license fields
Reliable extraction: Direct access to <license> elements in arXiv namespace
Better coverage: Access to historical papers with proper license metadata
Standards compliance: OAI-PMH is a standardized protocol for metadata harvesting

Query Strategy Implemented

API Endpoint Migration

+ BASE_URL = "https://oaipmh.arxiv.org/oai"

License Extraction Method

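+ # Structured XML parsing from OAI-PMH
+ import xml.etree.ElementTree as ET
+
+ def extract_license_from_xml(record_xml):
+     root = ET.fromstring(record_xml)
+     license_elem = root.find(".//{http://arxiv.org/OAI/arXiv/}license")
+     # Records without a <license> element fall back to "Unknown"
+     if license_elem is None or not license_elem.text:
+         return "Unknown"
+     return LICENSE_MAPPING.get(license_elem.text.strip(), "Unknown")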
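
As a rough, self-contained illustration of this extraction step, the sketch below runs it against a hand-written sample record. The sample XML, the two-entry mapping and the "Unknown" fallback for records without a license element are assumptions made for the example; the real mapping in arxiv_fetch.py is larger.

```python
import xml.etree.ElementTree as ET

ARXIV_NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}

# Illustrative subset of the license mapping.
LICENSE_MAPPING = {
    "http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
}

# Hand-written sample record, not real harvested data.
SAMPLE_RECORD = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00000</id>
  <license>http://creativecommons.org/publicdomain/zero/1.0/</license>
</arXiv>
"""


def extract_license(record_xml):
    """Map the record's <license> URL to a label, or "Unknown"."""
    root = ET.fromstring(record_xml.strip())
    elem = root.find(".//arxiv:license", ARXIV_NS)
    if elem is None or not elem.text:
        return "Unknown"
    return LICENSE_MAPPING.get(elem.text.strip(), "Unknown")


print(extract_license(SAMPLE_RECORD))
```

Passing a prefix-to-URI dictionary to `find` keeps the namespace in one place instead of repeating it in every path.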

Request Parameters

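+ # OAI-PMH harvesting request
+ params = {
+     'verb': 'ListRecords',
+     'metadataPrefix': 'arXiv',
+     'from': from_date,
+ }
+ # For pagination, follow-up requests send only 'verb' and 'resumptionToken'
+ next_params = {'verb': 'ListRecords', 'resumptionToken': token}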
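
In OAI-PMH, `resumptionToken` is an exclusive argument: a follow-up request carries only `verb` and the token, without `metadataPrefix` or `from`. A minimal sketch of building the two request shapes follows; `build_params` is a hypothetical helper name, not a function taken from the PR.

```python
def build_params(from_date=None, token=None):
    """Return query parameters for a ListRecords request.

    With a resumption token, only 'verb' and 'resumptionToken' are sent;
    otherwise the initial harvesting parameters are used.
    """
    if token:
        return {"verb": "ListRecords", "resumptionToken": token}
    params = {"verb": "ListRecords", "metadataPrefix": "arXiv"}
    if from_date:
        params["from"] = from_date
    return params


print(build_params(from_date="2020-01-01"))
print(build_params(token="example-token"))
```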

Implementation Details

New Features

Date-based harvesting: Focus on recent years where CC licensing is more prevalent
Resumption token support: Handle large datasets with proper pagination
Enhanced error handling: Robust XML parsing with namespace awareness
License URL mapping: Standardized license identification from URLs
Provenance tracking: Detailed metadata for audit trails
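
Resumption token support could look roughly like the sketch below, which reads the token for the next page out of a response body. The two sample responses are hand-written and the helper name `next_resumption_token` is hypothetical; an empty `resumptionToken` element is treated as the end of the list, as the OAI-PMH protocol specifies.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample responses, trimmed to the parts that matter here.
PAGE_WITH_MORE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

LAST_PAGE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>
"""


def next_resumption_token(response_xml):
    """Return the token for the next page, or None when harvesting is done."""
    root = ET.fromstring(response_xml.strip())
    elem = root.find(".//oai:resumptionToken", OAI_NS)
    if elem is None or not (elem.text or "").strip():
        return None
    return elem.text.strip()


print(next_resumption_token(PAGE_WITH_MORE))
print(next_resumption_token(LAST_PAGE))
```

A harvesting loop would keep requesting pages until this helper returns None.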

Data Structure Changes

XML namespace handling: Proper parsing of arXiv-specific metadata elements
License normalization: Map license URLs to standardized identifiers
Category extraction: Use structured category fields instead of text parsing
Author counting: Extract from structured author elements
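
A sketch of category extraction and author counting from structured elements is shown below. It assumes the arXiv metadata format exposes a space-separated `categories` element and one `author` element per author inside `authors`; the sample record and the helper name are illustrative and should be checked against real responses.

```python
import xml.etree.ElementTree as ET

ARXIV_NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}

# Hand-written sample record; the element layout is an assumption.
SAMPLE_RECORD = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <authors>
    <author><keyname>Doe</keyname><forenames>Jane</forenames></author>
    <author><keyname>Roe</keyname><forenames>Richard</forenames></author>
  </authors>
  <categories>cs.DL cs.CL</categories>
</arXiv>
"""


def extract_categories_and_author_count(record_xml):
    """Return (list of category codes, number of author elements)."""
    root = ET.fromstring(record_xml.strip())
    categories_text = root.findtext(
        ".//arxiv:categories", default="", namespaces=ARXIV_NS
    )
    authors = root.findall(".//arxiv:authors/arxiv:author", ARXIV_NS)
    return categories_text.split(), len(authors)


print(extract_categories_and_author_count(SAMPLE_RECORD))
```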

Performance Improvements

Reduced API calls: OAI-PMH provides batch processing capabilities
Rate limiting compliance: Built-in delays following OAI-PMH best practices
Retry logic: Enhanced error recovery for network issues
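
The retry behaviour could be sketched as follows. This is not the PR's implementation: `fetch_with_retry` is a hypothetical helper, `fetch` stands for any zero-argument callable that performs one request, and treating `OSError` subclasses as retryable network failures is an assumption.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=3.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is spent."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as error:  # assumed to cover network-level failures
            last_error = error
            if attempt < retries:
                # Linear backoff: wait longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without real waiting.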

API Requirements

ArXiv OAI-PMH endpoint: https://oaipmh.arxiv.org/oai
No authentication required
Rate limiting: 3-second delays between requests (OAI-PMH recommendation)
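
One way to enforce the delay between requests is a small throttle object, sketched below. The class name `Throttle` and its injectable `clock` and `sleep` parameters are assumptions for illustration, not part of arxiv_fetch.py.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=3.0, clock=time.monotonic,
                 sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each request keeps the spacing correct even when parsing a response takes a variable amount of time.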

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure

Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis

…nformation
- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility

Test run processed 10 journals and 1 article sample successfully.

- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis

New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions

Test data shows successful extraction of BY, NC, SA flags and license URLs.

- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail

Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research

… integration
- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification

Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary information. Boolean flags add complexity without providing additional analytical value for commons quantification purposes.

- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)

This branch now focuses exclusively on ArXiv-related improvements. All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.

Opsmithe · 2025-11-10T14:56:54Z

@TimidRobot I hope these fixes help address the issues with #236. I'd be happy to get your review of the script and make changes where necessary.

TimidRobot · 2025-11-14T09:19:51Z

scripts/1-fetch/arxiv_fetch.py

+    "http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
+    "http://creativecommons.org/licenses/by-nd/3.0/": "CC BY-ND 3.0",
+    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
+    "http://creativecommons.org/share-your-work/public-domain/cc0/": "CC0",


Please add:

"http://creativecommons.org/licenses/publicdomain": "CC CERTIFICATION 1.0 US",

TimidRobot · 2025-11-14T09:20:28Z

scripts/1-fetch/arxiv_fetch.py

-    ),
-]
+# License mapping for structured data from OAI-PMH
+LICENSE_MAPPING = {


Please sort/order these entries
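
One way to satisfy this request is to keep the mapping sorted by its URL key and to check that ordering mechanically. The sketch below uses only the entries quoted in this review. Sorting by key is an assumption, since the comment does not say which order is wanted, and `is_sorted_by_url` is a hypothetical helper.

```python
# Entries quoted in the review, deliberately out of order.
LICENSE_MAPPING = {
    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
    "http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
    "http://creativecommons.org/licenses/publicdomain": "CC CERTIFICATION 1.0 US",
    "http://creativecommons.org/licenses/by-nd/3.0/": "CC BY-ND 3.0",
}

# dicts preserve insertion order, so rebuilding from sorted items yields
# a mapping whose keys iterate in alphabetical order.
SORTED_LICENSE_MAPPING = dict(sorted(LICENSE_MAPPING.items()))


def is_sorted_by_url(mapping):
    """Return True if the mapping's keys are in ascending order."""
    keys = list(mapping)
    return keys == sorted(keys)


for url, label in SORTED_LICENSE_MAPPING.items():
    print(f"{label}: {url}")
```

A check like `is_sorted_by_url` could run in a test so that later additions do not silently break the ordering.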

TimidRobot · 2025-11-14T09:45:56Z

scripts/1-fetch/arxiv_fetch.py

+            f"Provenance file write failed: {e}", 1
+        )

    LOGGER.info(f"Total CC licensed papers fetched: {total_fetched}")


If total_fetched includes "Non-CC" papers, this message is misleading

@TimidRobot, thanks for this. I investigated and fixed the license filtering logic in arxiv_fetch.py, which incorrectly included non-Creative Commons licenses because of substring matching. The condition "CC" in metadata["license"] also matched labels such as "Non-CC" or "Apache with CC mention". It is now metadata["license"].startswith("CC"), so only licenses whose label begins with "CC" are processed and the CC license statistics are no longer contaminated by misclassified non-CC licenses.
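
The difference between the two conditions can be seen on a handful of labels. The list below is illustrative: it combines the two false positives named in this reply with two CC labels from the license mapping.

```python
licenses = [
    "CC BY-ND 4.0",
    "CC0 1.0",
    "Non-CC",
    "Apache with CC mention",
]

# Substring test: matches any label that contains "CC" anywhere.
substring_matches = [name for name in licenses if "CC" in name]

# Prefix test: matches only labels that begin with "CC".
prefix_matches = [name for name in licenses if name.startswith("CC")]

print(substring_matches)
print(prefix_matches)
```

The substring test keeps all four labels, while the prefix test keeps only the two CC labels.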

Opsmithe added 2 commits November 8, 2025 08:48

Add simple name output script

74a099c

Update arxiv_fetch.py - minimal recent changes

586accc

Opsmithe requested review from a team as code owners November 8, 2025 18:04

Opsmithe requested review from TimidRobot and possumbilities and removed request for a team November 8, 2025 18:04

Delete output_name.py

19b32bb

cc-open-source-bot added this to TimidRobot Nov 8, 2025

cc-open-source-bot moved this to In review in TimidRobot Nov 8, 2025

Opsmithe added 13 commits November 10, 2025 00:07

Update arxiv_fetch.py with last 12 commits from feature/arxiv

d3f0fe3

style: fix formatting in arxiv_fetch.py to meet project standards

d0facfc

chore: increase ArXiv fetch limit to 2000 CC-licensed papers

f371e13

chore: adjust ArXiv fetch parameters to 1000 limit and 5 years back

346b90f

docs: update arXiv API documentation to include OAI-PMH interface

3e77cac

style: apply consistent hard wrapping to arXiv section in sources.md

6ef2736

style: fix line length and trailing whitespace issues

ad0ce28

Opsmithe added 6 commits November 11, 2025 12:50

Improve arXiv license extraction with stricter CC validation and specific error codes

2945f63

Replace 'Unknown' with more descriptive terms and rename variables in arxiv_fetch.py

f1fdccb

Update exception handler to use consistent generic terms

6431e72

Improve code readability by replacing vague variable names with descriptive identifiers

faaa174

Improve code readability by replacing vague variable names with descriptive identifiers

2a3bcbb

Replace vague variable names in query_arxiv function for better readability

5db5b81

Opsmithe added 7 commits November 11, 2025 13:26

Complete variable name refactoring for improved code readability

0921353

Remove data files from PR - keep only arxiv_fetch.py changes

af1e59a

Restore output_name.py - keep only arxiv_fetch.py and sources.md changes

6c95f00

Enhance arXiv documentation with OAI-PMH interface details

4149a18

Apply consistent hard wrapping to arXiv section in sources.md

8ed2cdc

Fix static analysis issues: trailing whitespace and code formatting

2d3fd5a

Fix arXiv API documentation links in sources.md

5739ad3

Opsmithe force-pushed the arxiv-minimal-fix branch from 5920576 to 5739ad3 Compare November 12, 2025 08:09

Opsmithe added 3 commits November 12, 2025 09:33

Update arXiv section to focus on OAI-PMH API and add data format details

5cee2ac

Delete output_name.py

bc70c78

Simplify Non-CC license display to remove URL from output

8fd22b0

TimidRobot self-assigned this Nov 14, 2025

TimidRobot reviewed Nov 14, 2025

TimidRobot reviewed Nov 14, 2025

Opsmithe added 4 commits November 15, 2025 08:49

Add CC CERTIFICATION 1.0 US license mapping

3a6b24f

Order license mapping alphabetically

b6e19a9

Simplify author count bucketing to continuous logic block

cd24b78

Fix license filtering to use startswith instead of substring matching

04a74a0
