@muthukumaranR muthukumaranR commented Jul 2, 2025

Description

Title

Fix Source Validator Crossref API, Thread Safety and Performance Issues

Summary

  • Make CrossRef API calls more reliable
  • Replace asyncio.to_thread() with ThreadPoolExecutor for CrossRef API calls
  • Add 30-second timeout to prevent hanging requests
  • Improve error handling and logging
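
The executor-plus-timeout pattern listed above can be sketched roughly as follows. This is an illustration, not the code from this PR: `fetch_crossref_metadata`, `lookup_with_timeout`, and the pool size of 4 are hypothetical, and the blocking call is a stand-in that only sleeps. The 30-second default is the one value taken from the summary.

```python
import asyncio
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

logger = logging.getLogger(__name__)

# Dedicated pool so CrossRef lookups do not compete with other to_thread() users.
# The pool size is an assumption chosen for illustration.
_CROSSREF_EXECUTOR = ThreadPoolExecutor(max_workers=4, thread_name_prefix="crossref")


def fetch_crossref_metadata(doi: str, delay: float = 0.0) -> dict:
    """Hypothetical stand-in for the blocking CrossRef request."""
    time.sleep(delay)  # simulates network latency
    return {"DOI": doi}


async def lookup_with_timeout(
    doi: str, timeout: float = 30.0, delay: float = 0.0
) -> Optional[dict]:
    """Run the blocking lookup in the pool and give up after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(_CROSSREF_EXECUTOR, fetch_crossref_metadata, doi, delay)
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        # The caller stops waiting; the worker thread itself still runs to completion.
        logger.warning("CrossRef lookup for %s timed out after %.1fs", doi, timeout)
        return None
```

Note that a timeout here only stops the coroutine from waiting. A real HTTP client would typically also need its own request timeout so the worker thread is released.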

Validation Improvements:

  • Add ISSN checksum validation (was format-only)
  • Enhance DOI format validation with stricter checks
  • Use weighted scoring instead of simple max() for confidence scores
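
For reference, the standard ISSN check-digit rule behind the checksum bullet can be sketched as below. The function name `is_valid_issn` is hypothetical and the PR's actual implementation may differ. The rule itself is the usual one: the first seven digits are weighted 8 down to 2, and the check digit brings the weighted sum to a multiple of 11, with `X` standing for 10.

```python
import re

_ISSN_FORMAT = re.compile(r"^\d{4}-?\d{3}[\dXx]$")


def is_valid_issn(issn: str) -> bool:
    """Return True if `issn` has a valid format and a correct check digit."""
    candidate = issn.strip()
    if not _ISSN_FORMAT.match(candidate):
        return False  # format-only check, as before the change

    chars = candidate.replace("-", "").upper()
    # Weights 8, 7, ..., 2 apply to the first seven digits.
    weighted_sum = sum(int(ch) * weight for ch, weight in zip(chars[:7], range(8, 1, -1)))
    expected = (11 - weighted_sum % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    return chars[7] == expected_char
```

With this check, `0378-5955` passes while `0378-5954`, which has the right shape but the wrong final digit, is rejected.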

Performance:

  • Add class-level caching for whitelist data
  • Consolidate overlapping DOI regex patterns (9 → 6 patterns)
  • Add cache management methods
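
A minimal sketch of class-level caching with cache management methods, assuming the whitelist is loaded once and shared by all validator instances. `CachedWhitelist`, `get_whitelist`, `clear_cache`, and `is_cached` are hypothetical names, and the loader returns hard-coded sample ISSNs instead of reading the real whitelist.

```python
import threading
from typing import ClassVar, FrozenSet, Optional


class CachedWhitelist:
    """Shares one loaded whitelist across all instances."""

    _cache: ClassVar[Optional[FrozenSet[str]]] = None
    _lock: ClassVar[threading.Lock] = threading.Lock()
    load_count: ClassVar[int] = 0  # only here to show how often loading happens

    @classmethod
    def _load_from_source(cls) -> FrozenSet[str]:
        # Hypothetical loader: a real one would read a file or database.
        cls.load_count += 1
        return frozenset({"0028-0836", "0378-5955"})

    @classmethod
    def get_whitelist(cls) -> FrozenSet[str]:
        # The lock keeps concurrent callers from loading the data twice.
        with cls._lock:
            if cls._cache is None:
                cls._cache = cls._load_from_source()
            return cls._cache

    @classmethod
    def clear_cache(cls) -> None:
        """Drop the cached data so the next lookup reloads it."""
        with cls._lock:
            cls._cache = None

    @classmethod
    def is_cached(cls) -> bool:
        return cls._cache is not None

    def contains(self, issn: str) -> bool:
        return issn in self.get_whitelist()
```

Creating several instances and calling `contains` on each triggers a single load; `clear_cache()` forces the next call to reload.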

…commons, unidecode, httpx, requests-cache, orjson, and ftfy
- Updated import from `CompositeWebScraper` to `CompositeScraper` for consistency.
- Introduced `create_default_source_validator` function to facilitate the creation of a source validator with default parameters, enhancing the validation process.
- Refactored `source_validator.py` to include comprehensive DOI and ISSN validation with detailed error handling.
- Implemented a journal index for efficient fuzzy matching of source titles against a whitelist.
- Updated the `arun` method to handle validation results and exceptions more gracefully.
- Modified example scripts and tests to reflect changes from journal to source validation terminology.
- Improved logging for better debugging and tracking of validation processes.
- Introduced `source_validation_pipeline_test.py` to demonstrate the complete source validation process.
- Implemented functionality to search for research papers, extract DOIs, validate sources against a whitelist, and generate validation reports.
- Included sample queries for Earth Sciences and Astronomy to showcase the pipeline's capabilities.
- Added comprehensive inline comments and docstrings for clarity and maintainability.
- Introduced `simple_source_validation_example.py` to demonstrate the core functionality of the source validation process without complex dependencies.
- Implemented a test case with real DOIs to validate against a whitelist and included a paper without a DOI to showcase validation failure.
- Added comprehensive inline comments and docstrings for clarity and maintainability.
- Utilized asynchronous programming to handle validation and output results effectively.
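
The journal index mentioned in the commit notes above could look roughly like the sketch below. It is not the PR's implementation: `JournalIndex`, the normalization rules, and the 0.85 cutoff are assumptions, and the standard library's `difflib` stands in for whatever fuzzy matcher the code actually uses.

```python
import difflib
import re
from typing import Dict, Iterable, Optional


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


class JournalIndex:
    """Maps normalized titles to whitelist entries for exact and fuzzy lookup."""

    def __init__(self, titles: Iterable[str], cutoff: float = 0.85) -> None:
        self._by_normalized: Dict[str, str] = {normalize_title(t): t for t in titles}
        self._cutoff = cutoff  # assumed similarity threshold

    def match(self, title: str) -> Optional[str]:
        key = normalize_title(title)
        # Fast path: exact match after normalization.
        if key in self._by_normalized:
            return self._by_normalized[key]
        # Slow path: closest normalized title above the cutoff, if any.
        close = difflib.get_close_matches(
            key, list(self._by_normalized), n=1, cutoff=self._cutoff
        )
        return self._by_normalized[close[0]] if close else None
```

Normalizing once at index build time keeps each lookup to a dictionary hit in the common case and falls back to similarity scoring only for near misses such as typos.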