-
Notifications
You must be signed in to change notification settings - Fork 3
Implement ISSN-based source validation system #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Switch from fuzzy title matching to precise ISSN validation using CrossRef API data for improved accuracy and maintainability. - Extract ISSNs from CrossRef 'ISSN' and 'issn-type' fields - Add ISSN normalization (NNNN-NNNN format) with validation - Support multiple whitelist formats: flat list, categorized, nested - Add arXiv bypass option for preprint repositories - Enhanced debug logging with journal names, DOIs, and URLs - Rename ValidationResult.journal_info to source_info - Add matched_issn field to ValidationResult for debugging Breaking Changes: - Default whitelist path: pubs_whitelist.json → issn_whitelist.json - Whitelist format changed from title-based to ISSN-based
Update source_validation_test.py to use correct ValidationResult attribute names after schema changes. - Change journal_info references to source_info - Maintain backward compatibility with existing test workflows - Fix attribute access for validation result processing
Add comprehensive test coverage for the new ISSN-based validation system with 26 tests covering all functionality. - Test ISSN normalization and validation logic - Test whitelist loading for multiple JSON formats - Test DOI extraction from search results - Test CrossRef integration and ISSN matching - Test validation workflow end-to-end - Test edge cases, error handling, and configuration - Test backward compatibility with existing structures - Achieve 100% test pass rate for all validation scenarios
Provide sample ISSN whitelist files for testing and development with journal metadata mapping for enhanced validation. - Add issn_whitelist.json with sample trusted journal ISSNs - Add issn_whitelist_map.json with ISSN to journal metadata mapping - Support different whitelist formats for flexible configuration - Include major scientific journals for validation testing - Provide clear examples for users migrating from title-based validation
Provide automated tool for building and maintaining ISSN whitelists from various journal data sources. - Build ISSN whitelist from journal name lists - Fetch ISSN data from CrossRef API - Support batch processing for large journal collections - Generate both flat ISSN lists and metadata mappings - Include error handling and validation for ISSN formats - Facilitate migration from title-based to ISSN-based validation
- Removed unused TYPE_CHECKING block to streamline imports. - Simplified ISSN normalization and whitelist loading logic for clarity. - Enhanced error handling and logging messages for better debugging. - Consolidated duplicate code in ISSN processing to reduce redundancy. - Updated inline comments and removed outdated docstrings for better documentation.
|
@muthukumaranR can you put the usage example code snippet as well. Thanks. |
|
@muthukumaranR Is this an active PR? |
| if params.whitelist_file_path: | ||
| self.config.whitelist_file_path = params.whitelist_file_path | ||
| self._whitelist = self._load_whitelist() | ||
|
|
||
| validated_results = [] | ||
| self._load_whitelist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has nasty side effect. We don't want to change the config at runtime based on param self.config.whitelist_file_path = params.whitelist_file_path
Intead maybe just have whitelist preloaded and not called from _arun(...).
Modifying self.config during execution breaks thread safety in general...since it also setting/building up these attributes once.
self._allowed_issn_set = allowed_issn
self._issn_to_category = issn_to_category
Since we're setting these params in load_whitelist(...), so dynamically changing the config might be bad idea. Maybe what we can do is self._load_whitelist(<path_str>) is a better way.
| # Load whitelist on initialization (ISSN-based) | ||
| self._allowed_issn_set: Set[str] = set() | ||
| self._issn_to_category: Dict[str, str] = {} | ||
| self._load_whitelist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegate this to maybe _post_init()?
def _post_init(self):
super()._post_init()
self._load_whitelist()
| issn: List[str] = Field( | ||
| default_factory=list, description="List of normalized ISSNs" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see issn changed from list[str] to optional[list[str]]. Not sure what are the consequences. Are we guaranteeing some issn?
| import asyncio | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the import inside methods
| matches = re.findall(pattern, url_str, re.IGNORECASE) | ||
| matches = re.findall(pattern, str(url).strip(), re.IGNORECASE) | ||
| if matches: | ||
| doi = matches[0] if isinstance(matches[0], str) else matches[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly cause indexing error for match?
| def _normalize_issn(issn: str) -> Optional[str]: | ||
| if not issn: | ||
| return None | ||
| candidate = re.sub(r"[^0-9xX]", "", issn).upper() | ||
| return f"{candidate[:4]}-{candidate[4:]}" if len(candidate) == 8 else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add digit check as well?
"To confirm the check digit, calculate the sum of all eight digits of the ISSN multiplied by their position in the number, counting from the right"
This might be handy in case the upstream is actually LLM generated ISSN?
| async def validate_with_semaphore( | ||
| result: Any, | ||
| ) -> ValidationResult: | ||
| async def validate_with_semaphore(result: Any) -> ValidationResult: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for using Any?
Summary
Replaces fuzzy title matching with precise ISSN-based source validation for improved accuracy and maintainability. This change eliminates false positives/negatives from title variations while providing a more robust validation system.
Key Changes
Technical Implementation
ISSNandissn-typefieldsNNNN-NNNNformat with validationmatched_issnfield for debugging and transparencyBreaking Changes
pubs_whitelist.json→issn_whitelist.jsonValidationResult.journal_info→ValidationResult.source_infoTesting
Migration Path
issn_whitelist.jsonwith ISSN list/categoriesjournal_infotosource_infoallow_arxiv: truefor preprint supportCommits