@Opsmithe Opsmithe commented Nov 8, 2025

As per issue #236 by @TimidRobot.

Problem

The current arxiv_fetch.py script relies on ArXiv's Atom API, which provides unreliable license information through two problematic fields:

  1. <rights> field: Does not exist in the API response schema according to the ArXiv API documentation
  2. <summary> field: License detection relies on text pattern matching, which flags papers that merely discuss CC licenses as if they were CC-licensed

Impact

  • False positives: Papers discussing CC-licensed works are incorrectly classified as CC-licensed
  • Data integrity: Unreliable license detection compromises the accuracy of commons quantification
  • Reproducibility: Inconsistent results across different API responses

Example Case

Query [ALL:]("CC BY") returns paper [2008.00774v3] "Elsevier OA CC-By Corpus" which discusses CC BY works but is actually licensed under "arXiv.org - Non-exclusive license to distribute", not CC BY.
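The failure mode in this example can be reproduced with a small sketch. The record below is a simplified stand-in for one search result: the dictionary keys and both helper functions are assumptions made for illustration, not the script's actual data model. Only the paper ID, title, and license string come from the case above.

```python
import re

# Simplified stand-in for one search result. The keys are hypothetical;
# the ID, title, and license string are the ones cited in the example case.
record = {
    "id": "2008.00774v3",
    "text": "Elsevier OA CC-By Corpus",
    "license": "arXiv.org - Non-exclusive license to distribute",
}

# Text heuristic: matches "CC BY" or "CC-BY" in any letter case.
CC_PATTERN = re.compile(r"\bCC[ -]BY\b", re.IGNORECASE)


def looks_cc_by_text(rec):
    """Flag any record whose text merely mentions CC BY."""
    return bool(CC_PATTERN.search(rec["text"]))


def is_cc_by_metadata(rec):
    """Trust only the declared license value."""
    return rec["license"].startswith("CC BY")


text_match = looks_cc_by_text(record)
metadata_match = is_cc_by_metadata(record)
print(text_match, metadata_match)
```

The text heuristic reports a match while the declared license does not, which is the false positive described in this section.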

Proposed Solution

Migrate arxiv_fetch.py to use ArXiv's OAI-PMH API (https://oaipmh.arxiv.org/oai) which provides:

  • Structured metadata: XML-based responses with dedicated license fields
  • Reliable extraction: Direct access to <license> elements in arXiv namespace
  • Better coverage: Access to historical papers with proper license metadata
  • Standards compliance: OAI-PMH is a standardized protocol for metadata harvesting

Query Strategy Implemented

API Endpoint Migration

+ BASE_URL = "https://oaipmh.arxiv.org/oai"

License Extraction Method

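+ # Structured XML parsing from OAI-PMH
+ import xml.etree.ElementTree as ET
+
+ def extract_license_from_xml(record_xml):
+     root = ET.fromstring(record_xml)
+     license_elem = root.find(".//{http://arxiv.org/OAI/arXiv/}license")
+     # Records without a <license> element (or with an empty one) map to "Unknown"
+     if license_elem is None or not license_elem.text:
+         return "Unknown"
+     return LICENSE_MAPPING.get(license_elem.text.strip(), "Unknown")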
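To make the extraction step concrete, here is a minimal, self-contained sketch run against hand-written sample XML. The two sample records are invented for illustration (they are not real arXiv records), the mapping is a small subset of the PR's LICENSE_MAPPING, and the function is a standalone variant rather than the exact code in arxiv_fetch.py.

```python
import xml.etree.ElementTree as ET

ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"

# Illustrative subset only; the mapping in the PR has more entries.
LICENSE_MAPPING = {
    "http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
    "http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
}

# Hand-written samples shaped like arXiv-namespace metadata (not real records).
SAMPLE_WITH_LICENSE = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00000</id>
  <license>http://creativecommons.org/licenses/by-nd/4.0/</license>
</arXiv>
""".strip()

SAMPLE_WITHOUT_LICENSE = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00001</id>
</arXiv>
""".strip()


def extract_license(record_xml):
    """Return the mapped license name, or "Unknown" if absent or unmapped."""
    root = ET.fromstring(record_xml)
    elem = root.find(f".//{ARXIV_NS}license")
    if elem is None or not elem.text:
        return "Unknown"
    return LICENSE_MAPPING.get(elem.text.strip(), "Unknown")


print(extract_license(SAMPLE_WITH_LICENSE))
print(extract_license(SAMPLE_WITHOUT_LICENSE))
```

The namespace-qualified tag is required here: a plain `license` lookup would not match, because the elements live in the arXiv namespace.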

Request Parameters

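+ # OAI-PMH harvesting request
+ if token:
+     # resumptionToken is an exclusive argument: send it with the verb only
+     params = {'verb': 'ListRecords', 'resumptionToken': token}
+ else:
+     params = {
+         'verb': 'ListRecords',
+         'metadataPrefix': 'arXiv',
+         'from': from_date,
+     }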
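A possible shape for the pagination loop is sketched below. In the OAI-PMH specification, resumptionToken is an exclusive argument, so follow-up requests carry only the verb and the token, and the final page of a list signals completion with an empty resumptionToken element. The function names, the injected `fetch` callable, and the two canned responses are assumptions for this sketch, not the PR's implementation.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def initial_params(from_date):
    """Parameters for the first ListRecords request."""
    return {"verb": "ListRecords", "metadataPrefix": "arXiv", "from": from_date}


def next_params(response_xml):
    """Return parameters for the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def harvest(fetch, from_date):
    """Yield each response page; `fetch` maps a params dict to XML text."""
    params = initial_params(from_date)
    while params is not None:
        page = fetch(params)
        yield page
        params = next_params(page)


# Canned responses standing in for the network (hypothetical content).
FIRST_PAGE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<resumptionToken>tok-1</resumptionToken></ListRecords></OAI-PMH>"
)
LAST_PAGE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
    "<resumptionToken/></ListRecords></OAI-PMH>"
)

requests_seen = []


def fake_fetch(params):
    requests_seen.append(params)
    return LAST_PAGE if "resumptionToken" in params else FIRST_PAGE


pages = list(harvest(fake_fetch, "2024-01-01"))
print(len(pages), requests_seen[-1])
```

Injecting `fetch` keeps the loop testable without network access; a real caller would pass a function that performs the HTTP GET against the OAI-PMH endpoint.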

Implementation Details

New Features

  • Date-based harvesting: Focus on recent years where CC licensing is more prevalent
  • Resumption token support: Handle large datasets with proper pagination
  • Enhanced error handling: Robust XML parsing with namespace awareness
  • License URL mapping: Standardized license identification from URLs
  • Provenance tracking: Detailed metadata for audit trails

Data Structure Changes

  • XML namespace handling: Proper parsing of arXiv-specific metadata elements
  • License normalization: Map license URLs to standardized identifiers
  • Category extraction: Use structured category fields instead of text parsing
  • Author counting: Extract from structured author elements

Performance Improvements

  • Reduced API calls: OAI-PMH provides batch processing capabilities
  • Rate limiting compliance: Built-in delays following OAI-PMH best practices
  • Retry logic: Enhanced error recovery for network issues

API Requirements

  • ArXiv OAI-PMH endpoint: https://oaipmh.arxiv.org/oai
  • No authentication required
  • Rate limiting: 3-second delays between requests (the request rate arXiv asks API clients to respect; the OAI-PMH protocol itself does not fix a delay)
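The delay and retry behavior described above might be combined as in the following sketch. The function name, the linear backoff, the choice of OSError as the "transient" error type, and the injectable `fetch` and `sleep` callables are all assumptions for illustration, not the PR's code.

```python
import time


def fetch_with_retry(fetch, params, delay=3.0, retries=3, sleep=time.sleep):
    """Call fetch(params), pausing `delay` seconds after each successful
    request and backing off linearly after each failed attempt."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            result = fetch(params)
        except OSError as exc:  # treated as transient in this sketch
            last_error = exc
            sleep(delay * attempt)
            continue
        sleep(delay)  # courtesy pause before the caller's next request
        return result
    raise RuntimeError(f"request failed after {retries} attempts") from last_error


# Demonstration with a fetch that fails once, then succeeds.
attempts = []
sleeps = []


def flaky_fetch(params):
    attempts.append(params)
    if len(attempts) == 1:
        raise ConnectionError("simulated network failure")
    return "<ok/>"


result = fetch_with_retry(flaky_fetch, {"verb": "ListRecords"}, sleep=sleeps.append)
print(result, sleeps)
```

Passing `sleep` as a parameter lets tests record the pauses instead of actually waiting.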

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Opsmithe Opsmithe requested review from a team as code owners November 8, 2025 18:04
@Opsmithe Opsmithe requested review from TimidRobot and possumbilities and removed request for a team November 8, 2025 18:04
- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure

Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis
…nformation

- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility

Test run processed 10 journals and 1 article sample successfully.
- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis

New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions

Test data shows successful extraction of BY, NC, SA flags and license URLs.
- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail

Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research
… integration

- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification

Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary
information. Boolean flags add complexity without providing additional analytical
value for commons quantification purposes.
- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)

This branch now focuses exclusively on ArXiv-related improvements.
All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.
@Opsmithe
Author

@TimidRobot I hope these fixes help address the issues in #236. I'd be happy to get your review of the script and make changes where necessary.

@TimidRobot TimidRobot self-assigned this Nov 14, 2025
"http://creativecommons.org/licenses/by-nd/4.0/": "CC BY-ND 4.0",
"http://creativecommons.org/licenses/by-nd/3.0/": "CC BY-ND 3.0",
"http://creativecommons.org/publicdomain/zero/1.0/": "CC0 1.0",
"http://creativecommons.org/share-your-work/public-domain/cc0/": "CC0",
Member

Please add:

    "http://creativecommons.org/licenses/publicdomain": "CC CERTIFICATION 1.0 US",

),
]
# License mapping for structured data from OAI-PMH
LICENSE_MAPPING = {
Member

Please sort these entries.

f"Provenance file write failed: {e}", 1
)

LOGGER.info(f"Total CC licensed papers fetched: {total_fetched}")
Member

If total_fetched includes "Non-CC" papers, this message is misleading.

Author

@Opsmithe Opsmithe Nov 15, 2025

@TimidRobot, thanks for this. I investigated and fixed the license filtering logic in arxiv_fetch.py, which incorrectly included non-Creative Commons licenses because of substring matching. The condition "CC" in metadata["license"] also matched values such as "Non-CC" or "Apache with CC mention". I changed it to metadata["license"].startswith("CC") so that only licenses beginning with "CC" are processed, which keeps misclassified non-CC licenses out of the CC license statistics.
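The difference between the two conditions can be checked with a short sketch. The license names in the list are sample values chosen for illustration.

```python
# Sample license names; "Non-CC" is the false positive discussed above.
licenses = ["CC BY 4.0", "CC0 1.0", "Non-CC", "Unknown"]

# Old condition: substring match also accepts "Non-CC".
substring_matches = [name for name in licenses if "CC" in name]

# New condition: prefix match accepts only names that begin with "CC".
prefix_matches = [name for name in licenses if name.startswith("CC")]

print(substring_matches)
print(prefix_matches)
```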

Labels

None yet

Projects

Status: In review

2 participants