Implement ISSN-based source validation system #117

muthukumaranR · 2025-08-17T23:43:41Z

Summary

Replaces fuzzy title matching with exact ISSN-based source validation. Matching on ISSNs avoids the false positives and false negatives caused by title variations, and the validation logic becomes easier to maintain.

Key Changes

ISSN-based validation: Uses CrossRef ISSN data instead of title matching
Flexible whitelist formats: Supports flat lists, categorized maps, and nested structures
ISSN normalization: Standardizes format and validates structure
ArXiv support: Configurable bypass for preprint repositories
Enhanced debugging: Detailed logging with journal names, DOIs, and URLs
Test improvements: Fixed attribute references and moved to proper directory

Technical Implementation

Extracts ISSNs from CrossRef ISSN and issn-type fields
Normalizes to NNNN-NNNN format with validation
Supports multiple whitelist JSON formats for flexibility
Maintains backward compatibility with existing search structures
Adds matched_issn field for debugging and transparency
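
The extraction and normalization steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper names normalize_issn and extract_issns are hypothetical, and the sample record only mimics the shape of a CrossRef work message, where ISSN is a list of strings and issn-type is a list of objects with a value key. The sketch is also slightly stricter than a pure length check, since it only accepts X as the final character.

```python
import re
from typing import Any, Optional


def normalize_issn(raw: str) -> Optional[str]:
    """Return the ISSN as NNNN-NNNN, or None if it is not structurally valid."""
    if not raw:
        return None
    candidate = re.sub(r"[^0-9xX]", "", raw).upper()
    # Seven digits followed by a digit or X (the check character).
    if not re.fullmatch(r"\d{7}[\dX]", candidate):
        return None
    return f"{candidate[:4]}-{candidate[4:]}"


def extract_issns(message: dict[str, Any]) -> list[str]:
    """Collect normalized, de-duplicated ISSNs from a CrossRef-style work record."""
    raw_values = list(message.get("ISSN") or [])
    for entry in message.get("issn-type") or []:
        value = entry.get("value")
        if value:
            raw_values.append(value)
    issns: list[str] = []
    for raw in raw_values:
        issn = normalize_issn(raw)
        if issn and issn not in issns:
            issns.append(issn)
    return issns


# Sample record (illustrative values only).
sample = {
    "ISSN": ["0028-0836", "14764687"],
    "issn-type": [
        {"value": "0028-0836", "type": "print"},
        {"value": "1476-4687", "type": "electronic"},
    ],
}
print(extract_issns(sample))
```

Keeping the order of first appearance makes it easy to report which ISSN matched, which is what a field like matched_issn would need.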

Breaking Changes

Default whitelist path: pubs_whitelist.json → issn_whitelist.json
Schema change: ValidationResult.journal_info → ValidationResult.source_info
Whitelist format: title-based → ISSN-based

Testing

Created test suite with 26 tests (100% pass rate)
Covers ISSN normalization, whitelist loading, DOI extraction
Tests CrossRef integration, validation logic, end-to-end workflow
Includes edge cases, error handling, and configuration validation
Updated example script with proper attribute references
Verified backward compatibility with existing search results

Migration Path

Create new issn_whitelist.json with ISSN list/categories
Update any code referencing journal_info to source_info
Optional: Enable allow_arxiv: true for preprint support
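
As a rough illustration of the whitelist formats mentioned above (flat list, categorized map, nested structure), the sketch below flattens each of them into a single ISSN-to-category mapping. The JSON layouts and the flatten_whitelist helper are assumptions made for this example; the actual schema accepted by issn_whitelist.json may differ.

```python
import json
from typing import Any


def flatten_whitelist(data: Any, category: str = "uncategorized") -> dict[str, str]:
    """Map every ISSN string found in `data` to the nearest enclosing key."""
    mapping: dict[str, str] = {}
    if isinstance(data, str):
        mapping[data] = category
    elif isinstance(data, list):
        for item in data:
            mapping.update(flatten_whitelist(item, category))
    elif isinstance(data, dict):
        for key, value in data.items():
            mapping.update(flatten_whitelist(value, key))
    return mapping


# Three hypothetical layouts for the same kind of data.
flat = json.loads('["0028-0836", "1476-4687"]')
categorized = json.loads('{"general": ["0028-0836"], "physics": ["0031-9007"]}')
nested = json.loads('{"science": {"general": ["0028-0836"], "physics": ["0031-9007"]}}')

for layout in (flat, categorized, nested):
    print(flatten_whitelist(layout))
```

The keys of the resulting mapping form the set of allowed ISSNs, and the values can be used to report a category for a matched source.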

Commits

Replace title-based validation with ISSN matching
Fix example script attribute references
Add test suite for ISSN source validation
Add ISSN whitelist documentation and mapping
Add ISSN whitelist builder script

Switch from fuzzy title matching to precise ISSN validation using CrossRef API data for improved accuracy and maintainability.
- Extract ISSNs from CrossRef 'ISSN' and 'issn-type' fields
- Add ISSN normalization (NNNN-NNNN format) with validation
- Support multiple whitelist formats: flat list, categorized, nested
- Add arXiv bypass option for preprint repositories
- Enhanced debug logging with journal names, DOIs, and URLs
- Rename ValidationResult.journal_info to source_info
- Add matched_issn field to ValidationResult for debugging
Breaking Changes:
- Default whitelist path: pubs_whitelist.json → issn_whitelist.json
- Whitelist format changed from title-based to ISSN-based

Update source_validation_test.py to use correct ValidationResult attribute names after schema changes.
- Change journal_info references to source_info
- Maintain backward compatibility with existing test workflows
- Fix attribute access for validation result processing

Add comprehensive test coverage for the new ISSN-based validation system with 26 tests covering all functionality.
- Test ISSN normalization and validation logic
- Test whitelist loading for multiple JSON formats
- Test DOI extraction from search results
- Test CrossRef integration and ISSN matching
- Test validation workflow end-to-end
- Test edge cases, error handling, and configuration
- Test backward compatibility with existing structures
- Achieve 100% test pass rate for all validation scenarios

Provide sample ISSN whitelist files for testing and development with journal metadata mapping for enhanced validation.
- Add issn_whitelist.json with sample trusted journal ISSNs
- Add issn_whitelist_map.json with ISSN to journal metadata mapping
- Support different whitelist formats for flexible configuration
- Include major scientific journals for validation testing
- Provide clear examples for users migrating from title-based validation

Provide automated tool for building and maintaining ISSN whitelists from various journal data sources.
- Build ISSN whitelist from journal name lists
- Fetch ISSN data from CrossRef API
- Support batch processing for large journal collections
- Generate both flat ISSN lists and metadata mappings
- Include error handling and validation for ISSN formats
- Facilitate migration from title-based to ISSN-based validation

- Removed unused TYPE_CHECKING block to streamline imports.
- Simplified ISSN normalization and whitelist loading logic for clarity.
- Enhanced error handling and logging messages for better debugging.
- Consolidated duplicate code in ISSN processing to reduce redundancy.
- Updated inline comments and removed outdated docstrings for better documentation.

NISH1001 · 2025-08-18T19:26:53Z

@muthukumaranR can you put the usage example code snippet as well. Thanks.

NISH1001 · 2025-08-27T18:59:58Z

@muthukumaranR Is this an active PR?

NISH1001 · 2025-08-27T19:52:44Z

akd/tools/source_validator.py

        if params.whitelist_file_path:
            self.config.whitelist_file_path = params.whitelist_file_path
-            self._whitelist = self._load_whitelist()
-
-        validated_results = []
+            self._load_whitelist()


This has a nasty side effect. We don't want to change the config at runtime based on a param: self.config.whitelist_file_path = params.whitelist_file_path

Instead, maybe just have the whitelist preloaded and not loaded from _arun(...).

Modifying self.config during execution breaks thread safety in general, since the load also rebuilds these attributes, which are otherwise set only once:

    self._allowed_issn_set = allowed_issn
    self._issn_to_category = issn_to_category

Since these attributes are set in _load_whitelist(...), dynamically changing the config might be a bad idea. Maybe self._load_whitelist(<path_str>) is a better way.
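
One way to picture this suggestion is a loader that takes the path as an argument and returns an immutable value, so that a per-call override never touches shared instance state. The sketch below is hypothetical: Whitelist, load_whitelist, and SourceValidatorSketch are illustrative names, not the classes in akd/tools/source_validator.py, and it only handles a categorized map for brevity.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Whitelist:
    allowed: frozenset[str]
    issn_to_category: dict[str, str]


def load_whitelist(path: str) -> Whitelist:
    """Read a {category: [issn, ...]} JSON file and return a new Whitelist."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    mapping = {issn: category for category, issns in data.items() for issn in issns}
    return Whitelist(allowed=frozenset(mapping), issn_to_category=mapping)


class SourceValidatorSketch:
    def __init__(self, default_path: str) -> None:
        # Loaded once at construction time; never reassigned afterwards.
        self._default = load_whitelist(default_path)

    def is_allowed(self, issn: str, whitelist_file_path: Optional[str] = None) -> bool:
        # A per-call override lives in a local variable only.
        whitelist = (
            load_whitelist(whitelist_file_path) if whitelist_file_path else self._default
        )
        return issn in whitelist.allowed
```

With this shape, two concurrent calls that pass different whitelist paths cannot overwrite each other's ISSN sets, because neither call writes to self.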

NISH1001 · 2025-08-27T19:59:35Z

akd/tools/source_validator.py

+        # Load whitelist on initialization (ISSN-based)
+        self._allowed_issn_set: Set[str] = set()
+        self._issn_to_category: Dict[str, str] = {}
+        self._load_whitelist()


Maybe delegate this to _post_init()?

    def _post_init(self):
        super()._post_init()
        self._load_whitelist()

NISH1001 · 2025-08-27T20:00:50Z

akd/tools/source_validator.py

+    issn: List[str] = Field(
+        default_factory=list, description="List of normalized ISSNs"
+    )


I see issn changed from list[str] to Optional[list[str]]. I'm not sure what the consequences are. Are we guaranteeing at least one ISSN?

NISH1001 · 2025-08-27T20:01:16Z

akd/tools/source_validator.py

            import asyncio

Remove imports from inside methods; keep them at module level.

NISH1001 · 2025-08-27T20:02:49Z

akd/tools/source_validator.py

-            matches = re.findall(pattern, url_str, re.IGNORECASE)
+            matches = re.findall(pattern, str(url).strip(), re.IGNORECASE)
            if matches:
                doi = matches[0] if isinstance(matches[0], str) else matches[0][0]


Could this possibly cause an indexing error on matches?
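
For context, re.findall returns a list of strings when the pattern has zero or one capturing group and a list of tuples when it has several, which is why the snippet branches on isinstance. A possible way to sidestep the indexing question is to use re.search and take the whole match. The sketch below is an assumption-laden illustration: DOI_PATTERN is a simplified, hypothetical pattern and extract_doi is not the function in this PR.

```python
import re
from typing import Optional

# Hypothetical, simplified DOI pattern: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>?#]+', re.IGNORECASE)


def extract_doi(url: object) -> Optional[str]:
    """Return the first DOI-like substring in `url`, or None if there is none."""
    match = DOI_PATTERN.search(str(url).strip())
    if match is None:
        return None
    # group(0) is always the full match, regardless of how many groups exist.
    return match.group(0).rstrip(".,;)")


print(extract_doi("https://doi.org/10.1038/s41586-020-2649-2"))
print(extract_doi("https://example.org/article?id=5"))
```

Because the result is read through group(0), there is no list or tuple indexing left that could raise.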

NISH1001 · 2025-08-27T20:05:07Z

akd/tools/source_validator.py

+    def _normalize_issn(issn: str) -> Optional[str]:
+        if not issn:
+            return None
+        candidate = re.sub(r"[^0-9xX]", "", issn).upper()
+        return f"{candidate[:4]}-{candidate[4:]}" if len(candidate) == 8 else None


Should we add a check-digit test as well?

"To confirm the check digit, calculate the sum of all eight digits of the ISSN multiplied by their position in the number, counting from the right"

This might be handy in case the upstream ISSN is actually LLM-generated.
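
A minimal sketch of the check-digit rule quoted above: each of the eight characters is multiplied by its position counted from the right (so the leftmost digit has weight 8 and the check character has weight 1), a final X counts as 10, and the ISSN is valid when the weighted sum is divisible by 11. The function name has_valid_check_digit is hypothetical and not part of this PR.

```python
import re


def has_valid_check_digit(issn: str) -> bool:
    """Return True if `issn` has eight ISSN characters and a correct check digit."""
    chars = re.sub(r"[^0-9xX]", "", issn).upper()
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    values = [10 if c == "X" else int(c) for c in chars]
    # Weights 8, 7, ..., 1 correspond to positions counted from the right.
    total = sum(value * weight for value, weight in zip(values, range(8, 0, -1)))
    return total % 11 == 0


print(has_valid_check_digit("0378-5955"))  # weighted sum is 165 = 15 * 11
print(has_valid_check_digit("0378-5954"))  # weighted sum is 164, not divisible by 11
```

Such a test could run after normalization, so that a well-formed but invented ISSN is rejected before any whitelist lookup.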

NISH1001 · 2025-08-27T20:06:05Z

akd/tools/source_validator.py

-            async def validate_with_semaphore(
-                result: Any,
-            ) -> ValidationResult:
+            async def validate_with_semaphore(result: Any) -> ValidationResult:


Is there a reason for using Any?
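
If the only thing the validator needs from a search result is a URL, one alternative to Any is a small structural type. The sketch below assumes exactly that; HasUrl, SearchHit, ValidationOutcome, validate_one, and validate_all are hypothetical names used for illustration, and the real result and ValidationResult types in this PR may expose different fields.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, Sequence


class HasUrl(Protocol):
    """Anything with a string `url` attribute satisfies this type."""

    url: str


@dataclass
class SearchHit:
    url: str


@dataclass
class ValidationOutcome:
    url: str
    is_valid: bool


async def validate_one(result: HasUrl) -> ValidationOutcome:
    await asyncio.sleep(0)  # stand-in for the real network lookup
    return ValidationOutcome(url=result.url, is_valid="doi.org" in result.url)


async def validate_all(
    results: Sequence[HasUrl], max_concurrent: int = 5
) -> list[ValidationOutcome]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def validate_with_semaphore(result: HasUrl) -> ValidationOutcome:
        # At most `max_concurrent` validations run at the same time.
        async with semaphore:
            return await validate_one(result)

    return list(await asyncio.gather(*(validate_with_semaphore(r) for r in results)))


hits = [SearchHit("https://doi.org/10.1000/xyz123"), SearchHit("https://example.org/post")]
outcomes = asyncio.run(validate_all(hits))
print(outcomes)
```

A Protocol keeps the function open to any result class with the right attribute while still letting a type checker catch misuse, which Any would not.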

muthukumaranR added 5 commits August 18, 2025 05:06

muthukumaranR marked this pull request as ready for review August 17, 2025 23:44

muthukumaranR added 2 commits August 18, 2025 05:33

Remove test script for journal validator tool

b90ec90

muthukumaranR closed this Aug 18, 2025

muthukumaranR reopened this Aug 18, 2025

github-actions bot added a commit that referenced this pull request Aug 18, 2025

Auto-merge PR #117 (refactor/issn-source-validation) into integration…

819c49b

… for testing

NISH1001 requested changes Aug 27, 2025
