@muthukumaranR
Collaborator

Summary

Replaces fuzzy title matching with precise ISSN-based source validation for improved accuracy and maintainability. This change eliminates false positives/negatives from title variations while providing a more robust validation system.

Key Changes

  • ISSN-based validation: Uses CrossRef ISSN data instead of title matching
  • Flexible whitelist formats: Supports flat lists, categorized maps, and nested structures
  • ISSN normalization: Standardizes format and validates structure
  • ArXiv support: Configurable bypass for preprint repositories
  • Enhanced debugging: Detailed logging with journal names, DOIs, and URLs
  • Test improvements: Fixed attribute references and moved to proper directory

Technical Implementation

  • Extracts ISSNs from CrossRef ISSN and issn-type fields
  • Normalizes to NNNN-NNNN format with validation
  • Supports multiple whitelist JSON formats for flexibility
  • Maintains backward compatibility with existing search structures
  • Adds matched_issn field for debugging and transparency
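A minimal sketch of the extraction and normalization steps listed above, assuming the CrossRef work record is already parsed into a dict. The function names `normalize_issn` and `extract_issns` are illustrative and not necessarily the names used in this PR; the normalization mirrors the `_normalize_issn` helper quoted in the review below.

```python
import re
from typing import Any, Dict, List, Optional


def normalize_issn(raw: str) -> Optional[str]:
    """Return the ISSN as NNNN-NNNN, or None if it does not yield 8 usable characters."""
    if not raw:
        return None
    # Keep only digits and the letter X, then upper-case a trailing "x".
    candidate = re.sub(r"[^0-9xX]", "", raw).upper()
    return f"{candidate[:4]}-{candidate[4:]}" if len(candidate) == 8 else None


def extract_issns(message: Dict[str, Any]) -> List[str]:
    """Collect normalized, de-duplicated ISSNs from a CrossRef work record.

    Reads both the flat 'ISSN' list and the 'value' of each 'issn-type' entry.
    """
    raw_values: List[str] = list(message.get("ISSN") or [])
    for entry in message.get("issn-type") or []:
        value = entry.get("value")
        if value:
            raw_values.append(value)

    issns: List[str] = []
    for raw in raw_values:
        issn = normalize_issn(raw)
        if issn and issn not in issns:  # preserve first-seen order
            issns.append(issn)
    return issns


# Hand-written sample record (not fetched from the API).
sample_message = {
    "ISSN": ["0028-0836", "14764687"],
    "issn-type": [
        {"value": "0028-0836", "type": "print"},
        {"value": "1476-4687", "type": "electronic"},
    ],
}
print(extract_issns(sample_message))
```

Validation then reduces to checking whether any extracted ISSN is in the whitelist set; the first hit would be a natural value for `matched_issn`.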

Breaking Changes

  • Default whitelist path: pubs_whitelist.json → issn_whitelist.json
  • Schema change: ValidationResult.journal_info → ValidationResult.source_info
  • Whitelist format: title-based → ISSN-based

Testing

  • Created test suite with 26 tests (100% pass rate)
  • Covers ISSN normalization, whitelist loading, DOI extraction
  • Tests CrossRef integration, validation logic, end-to-end workflow
  • Includes edge cases, error handling, and configuration validation
  • Updated example script with proper attribute references
  • Verified backward compatibility with existing search results

Migration Path

  1. Create new issn_whitelist.json with ISSN list/categories
  2. Update any code referencing journal_info to source_info
  3. Optional: Enable allow_arxiv: true for preprint support
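For step 1, here is a hedged sketch of what the three whitelist shapes (flat list, categorized map, nested structure) might look like and how a loader could flatten them. The JSON layouts, the `parse_whitelist` name, and the `parent/child` category labels are assumptions for illustration; the actual format accepted by this PR may differ.

```python
import json
import re
from typing import Any, Dict, Optional, Set, Tuple


def _normalize(raw: str) -> Optional[str]:
    """Normalize to NNNN-NNNN; return None for strings that are not 8 ISSN characters."""
    candidate = re.sub(r"[^0-9xX]", "", raw).upper()
    return f"{candidate[:4]}-{candidate[4:]}" if len(candidate) == 8 else None


def parse_whitelist(data: Any) -> Tuple[Set[str], Dict[str, str]]:
    """Flatten a flat, categorized, or nested whitelist.

    Returns (allowed ISSN set, ISSN -> category label). Flat lists carry no
    category, so they contribute nothing to the mapping.
    """
    allowed: Set[str] = set()
    issn_to_category: Dict[str, str] = {}

    def walk(node: Any, category: str) -> None:
        if isinstance(node, str):
            issn = _normalize(node)
            if issn:
                allowed.add(issn)
                if category:
                    issn_to_category[issn] = category
        elif isinstance(node, list):
            for item in node:
                walk(item, category)
        elif isinstance(node, dict):
            for key, value in node.items():
                # Nested keys are joined into one label (an assumed convention).
                walk(value, f"{category}/{key}" if category else key)

    walk(data, "")
    return allowed, issn_to_category


# Three assumed shapes for issn_whitelist.json, with sample ISSN values.
flat = json.loads('["0028-0836", "1476-4687"]')
categorized = json.loads('{"multidisciplinary": ["0028-0836"], "physics": ["0031-9007"]}')
nested = json.loads('{"science": {"general": ["0036-8075"]}}')

for shape in (flat, categorized, nested):
    print(parse_whitelist(shape))
```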

Commits

  • Replace title-based validation with ISSN matching
  • Fix example script attribute references
  • Add test suite for ISSN source validation
  • Add ISSN whitelist documentation and mapping
  • Add ISSN whitelist builder script

Switch from fuzzy title matching to precise ISSN validation using
CrossRef API data for improved accuracy and maintainability.

- Extract ISSNs from CrossRef 'ISSN' and 'issn-type' fields
- Add ISSN normalization (NNNN-NNNN format) with validation
- Support multiple whitelist formats: flat list, categorized, nested
- Add arXiv bypass option for preprint repositories
- Enhanced debug logging with journal names, DOIs, and URLs
- Rename ValidationResult.journal_info to source_info
- Add matched_issn field to ValidationResult for debugging

Breaking Changes:
- Default whitelist path: pubs_whitelist.json → issn_whitelist.json
- Whitelist format changed from title-based to ISSN-based

Update source_validation_test.py to use correct ValidationResult
attribute names after schema changes.

- Change journal_info references to source_info
- Maintain backward compatibility with existing test workflows
- Fix attribute access for validation result processing

Add comprehensive test coverage for the new ISSN-based validation
system with 26 tests covering all functionality.

- Test ISSN normalization and validation logic
- Test whitelist loading for multiple JSON formats
- Test DOI extraction from search results
- Test CrossRef integration and ISSN matching
- Test validation workflow end-to-end
- Test edge cases, error handling, and configuration
- Test backward compatibility with existing structures
- Achieve 100% test pass rate for all validation scenarios

Provide sample ISSN whitelist files for testing and development
with journal metadata mapping for enhanced validation.

- Add issn_whitelist.json with sample trusted journal ISSNs
- Add issn_whitelist_map.json with ISSN to journal metadata mapping
- Support different whitelist formats for flexible configuration
- Include major scientific journals for validation testing
- Provide clear examples for users migrating from title-based validation

Provide automated tool for building and maintaining ISSN whitelists
from various journal data sources.

- Build ISSN whitelist from journal name lists
- Fetch ISSN data from CrossRef API
- Support batch processing for large journal collections
- Generate both flat ISSN lists and metadata mappings
- Include error handling and validation for ISSN formats
- Facilitate migration from title-based to ISSN-based validation
@muthukumaranR muthukumaranR marked this pull request as ready for review August 17, 2025 23:44
- Removed unused TYPE_CHECKING block to streamline imports.
- Simplified ISSN normalization and whitelist loading logic for clarity.
- Enhanced error handling and logging messages for better debugging.
- Consolidated duplicate code in ISSN processing to reduce redundancy.
- Updated inline comments and removed outdated docstrings for better documentation.
@muthukumaranR muthukumaranR reopened this Aug 18, 2025
github-actions bot added a commit that referenced this pull request Aug 18, 2025
@NISH1001
Collaborator

@muthukumaranR can you add a usage example code snippet as well? Thanks.

@NISH1001
Collaborator

@muthukumaranR Is this an active PR?

Comment on lines 436 to +438
if params.whitelist_file_path:
self.config.whitelist_file_path = params.whitelist_file_path
self._whitelist = self._load_whitelist()

validated_results = []
self._load_whitelist()
Collaborator

This has a nasty side effect. We don't want to change the config at runtime based on a param: self.config.whitelist_file_path = params.whitelist_file_path

Instead, maybe just have the whitelist preloaded and not loaded from _arun(...).

Modifying self.config during execution breaks thread safety in general, since loading the whitelist also sets these attributes, which are meant to be built once:

            self._allowed_issn_set = allowed_issn
            self._issn_to_category = issn_to_category

Since we're setting these attributes in _load_whitelist(...), dynamically changing the config might be a bad idea. Maybe passing the path explicitly, as in self._load_whitelist(<path_str>), is a better way.
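One way to read this suggestion, as a rough sketch: make the loader take the path as an argument and return the set, so a per-call override never touches shared state. `ValidatorConfig`, `SourceValidator`, and `allowed_issns` are hypothetical stand-ins for the real classes, and the loader only handles a flat JSON list for brevity.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Set


@dataclass(frozen=True)  # frozen: accidental runtime mutation raises an error
class ValidatorConfig:  # hypothetical stand-in for the tool's config
    whitelist_file_path: str


class SourceValidator:  # hypothetical stand-in for the validation tool
    def __init__(self, config: ValidatorConfig) -> None:
        self.config = config
        # Default whitelist is loaded once, at construction time.
        self._allowed_issn_set: Set[str] = self._load_whitelist(
            config.whitelist_file_path
        )

    @staticmethod
    def _load_whitelist(path: str) -> Set[str]:
        """Pure function of the path: reads a flat JSON list and returns a set."""
        return set(json.loads(Path(path).read_text(encoding="utf-8")))

    def allowed_issns(self, override_path: Optional[str] = None) -> Set[str]:
        """Use a per-call whitelist if a path is given; otherwise the preloaded one.

        Neither self.config nor self._allowed_issn_set is modified here.
        """
        if override_path:
            return self._load_whitelist(override_path)
        return self._allowed_issn_set
```

Because the override lives in a local value, concurrent calls with different `whitelist_file_path` params cannot see each other's whitelist. Caching per-path results would be a natural follow-up if reloading on every call is too slow.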

# Load whitelist on initialization (ISSN-based)
self._allowed_issn_set: Set[str] = set()
self._issn_to_category: Dict[str, str] = {}
self._load_whitelist()
Collaborator

Maybe delegate this to _post_init()?

def _post_init(self):
  super()._post_init()
  self._load_whitelist()

Comment on lines +32 to +34
issn: List[str] = Field(
    default_factory=list, description="List of normalized ISSNs"
)
Collaborator

I see issn changed from list[str] to optional[list[str]]. Not sure what the consequences are. Are we guaranteeing that some ISSN is present?

Comment on lines 447 to 448
import asyncio

Collaborator

Remove the imports inside methods and move them to module level.

matches = re.findall(pattern, url_str, re.IGNORECASE)
matches = re.findall(pattern, str(url).strip(), re.IGNORECASE)
if matches:
    doi = matches[0] if isinstance(matches[0], str) else matches[0][0]
Collaborator

Could this cause an indexing error on matches?
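A possible way to sidestep the question, sketched under assumptions: `re.findall` returns strings or tuples depending on how many groups the pattern has, whereas `re.search(...).group(0)` always returns the whole match. The `DOI_PATTERN` below is an illustrative pattern, not the one used in this PR, and `extract_doi` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix
# drawn from a conservative character set. Real-world DOIs can be messier.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def extract_doi(url: object) -> Optional[str]:
    """Return the first DOI-looking substring of the URL, or None.

    group(0) is the full match regardless of capture groups, so there is
    no string-vs-tuple branch and no positional indexing into results.
    """
    match = DOI_PATTERN.search(str(url).strip())
    return match.group(0) if match else None


print(extract_doi("https://doi.org/10.1038/s41586-020-2649-2"))
print(extract_doi("https://example.com/article/42"))
```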

Comment on lines +237 to +241
def _normalize_issn(issn: str) -> Optional[str]:
    if not issn:
        return None
    candidate = re.sub(r"[^0-9xX]", "", issn).upper()
    return f"{candidate[:4]}-{candidate[4:]}" if len(candidate) == 8 else None
Collaborator

Should we add a check-digit validation as well?

"To confirm the check digit, calculate the sum of all eight digits of the ISSN multiplied by their position in the number, counting from the right"

This might be handy in case the upstream ISSN is actually LLM-generated.
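A small sketch of that check, assuming the standard ISSN rule: weight the eight characters 8 down to 1 from left to right (equivalently, by position counting from the right), treat a final X as 10, and require the sum to be divisible by 11. `has_valid_check_digit` is a hypothetical helper name, not something in this PR.

```python
import re


def has_valid_check_digit(issn: str) -> bool:
    """Return True if the ISSN has the right shape and a correct check digit."""
    chars = re.sub(r"[^0-9xX]", "", issn).upper()
    # Seven digits followed by a digit or X; X is only legal in the last slot.
    if not re.fullmatch(r"\d{7}[\dX]", chars):
        return False
    total = sum(
        (10 if ch == "X" else int(ch)) * weight
        for ch, weight in zip(chars, range(8, 0, -1))
    )
    return total % 11 == 0


for candidate in ("0028-0836", "1050-124X", "1234-5678", "X028-0836"):
    print(candidate, has_valid_check_digit(candidate))
```

This is stricter than the regex in `_normalize_issn`, which accepts an X in any position; a made-up value such as 1234-5678 passes normalization but fails the check digit.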

async def validate_with_semaphore(
    result: Any,
) -> ValidationResult:
async def validate_with_semaphore(result: Any) -> ValidationResult:
Collaborator

Is there a reason for using Any?
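If the only requirement on `result` is a handful of attributes, a `typing.Protocol` could replace `Any` without coupling to a concrete search-result class. The sketch below is illustrative only: `SearchResultLike`, `FakeResult`, `validate_one`, and the simplified `ValidationResult` are hypothetical stand-ins, and the "validation" is a placeholder rather than a CrossRef lookup. It also keeps `import asyncio` at module level.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class SearchResultLike(Protocol):  # hypothetical structural type
    url: str


@dataclass
class ValidationResult:  # simplified stand-in for the real model
    url: str
    is_valid: bool


@dataclass
class FakeResult:  # satisfies SearchResultLike by having a `url` attribute
    url: str


async def validate_one(result: SearchResultLike) -> ValidationResult:
    await asyncio.sleep(0)  # placeholder for the network call
    return ValidationResult(url=result.url, is_valid="doi.org" in result.url)


async def validate_all(
    results: Sequence[SearchResultLike], max_concurrency: int = 5
) -> List[ValidationResult]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def validate_with_semaphore(result: SearchResultLike) -> ValidationResult:
        async with semaphore:  # at most max_concurrency validations in flight
            return await validate_one(result)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(validate_with_semaphore(r) for r in results)))


if __name__ == "__main__":
    sample = [FakeResult("https://doi.org/10.1000/xyz"), FakeResult("https://example.com")]
    print(asyncio.run(validate_all(sample, max_concurrency=1)))
```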


3 participants