@FarnazSalehi94 FarnazSalehi94 commented Jun 30, 2025

🚀 SASSY Integration for CRISPRapido

Summary

Replace WFA2 with SASSY for approximate string matching, significantly simplifying the build process and improving performance for short DNA sequences.

Changes

  • ✅ Integrated SASSY library for approximate string matching
  • ✅ Removed WFA2 dependency and complex build requirements
  • ✅ Updated CIGAR string parsing to handle SASSY format
  • ✅ Simplified installation (just cargo build - no external deps!)
  • ✅ All tests passing (12/12)
  • ✅ Updated README with new installation instructions
  • ✅ Added CI/CD pipeline for automated testing

Performance

  • 🚀 Faster builds (no C library compilation)
  • 🚀 Better suited for CRISPR guide RNA sequences (~20bp)
  • 🚀 Easier installation and distribution

Testing

  • All existing tests updated and passing
  • CI/CD pipeline added to verify builds and tests
  • Manual testing confirms functionality

Breaking Changes

  • Removed WFA2LIB_PATH environment variable requirement
  • CIGAR output format changed from WFA2 style to standard format
  • Installation now requires only cargo build; the separate WFA2 build step is gone

let max_errors = (max_mismatches + max_bulges) as usize;

// Create SASSY searcher with DNA profile
let mut searcher: Searcher<Dna> = Searcher::new(false, None);

If you want to support N characters, consider IUPAC profile here.

To search the reverse complement as well, also change false to true.
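
For context on why an IUPAC profile matters for N characters, here is a minimal, self-contained sketch of IUPAC-aware base comparison. It does not use SASSY at all: `iupac_mask` and `iupac_compatible` are hypothetical helper names, and whether SASSY's IUPAC profile applies exactly this rule is not stated in this thread. The idea is that each code stands for a set of bases, and two codes are compatible when their sets overlap.

```rust
/// Map an IUPAC nucleotide code to a 4-bit set over {A, C, G, T}.
/// Hypothetical helper for illustration; returns 0 for non-IUPAC bytes.
fn iupac_mask(base: u8) -> u8 {
    match base.to_ascii_uppercase() {
        b'A' => 0b0001,
        b'C' => 0b0010,
        b'G' => 0b0100,
        b'T' | b'U' => 0b1000,
        b'R' => 0b0101, // A or G
        b'Y' => 0b1010, // C or T
        b'S' => 0b0110, // G or C
        b'W' => 0b1001, // A or T
        b'K' => 0b1100, // G or T
        b'M' => 0b0011, // A or C
        b'B' => 0b1110, // anything but A
        b'D' => 0b1101, // anything but C
        b'H' => 0b1011, // anything but G
        b'V' => 0b0111, // anything but T
        b'N' => 0b1111, // any base
        _ => 0,
    }
}

/// Two codes are compatible when the base sets they stand for overlap.
fn iupac_compatible(a: u8, b: u8) -> bool {
    iupac_mask(a) & iupac_mask(b) != 0
}
```

With a plain four-letter DNA alphabet, an N in the genome has no such set to fall back on, which is presumably the case the suggestion is meant to cover.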

let max_errors = (max_mismatches + max_bulges) as usize;

// Create SASSY searcher with DNA profile
let mut searcher: Searcher<Dna> = Searcher::new(false, None);

Depending on how you use this, you may want to reuse the searcher object between consecutive invocations, so allocations are reused.
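
A rough illustration of the reuse pattern, not based on SASSY's internals: `ReusableSearcher` below is a hypothetical stand-in that owns a scratch buffer and runs a simple dynamic-programming approximate search (Sellers' algorithm). Constructing it once and calling `search` per window means the buffer is allocated once and then recycled.

```rust
/// Hypothetical stand-in for a searcher that owns reusable scratch memory.
struct ReusableSearcher {
    col: Vec<usize>, // one DP column, recycled across calls
}

impl ReusableSearcher {
    fn new() -> Self {
        Self { col: Vec::new() }
    }

    /// Return the exclusive end positions in `text` where `pattern`
    /// matches some substring with at most `k` edits.
    fn search(&mut self, pattern: &[u8], text: &[u8], k: usize) -> Vec<usize> {
        let m = pattern.len();
        // Reset the column without giving up its allocation.
        self.col.clear();
        self.col.extend(0..=m); // col[i] = cost of pattern[..i] vs. empty text
        let mut ends = Vec::new();
        for (j, &t) in text.iter().enumerate() {
            // Row 0 stays 0, so a match may start at any text position.
            let mut diag = self.col[0];
            for i in 1..=m {
                let prev = self.col[i]; // value from the previous text column
                let sub = diag + usize::from(pattern[i - 1] != t);
                let skip_text = prev + 1; // text byte left unpaired
                let skip_pattern = self.col[i - 1] + 1; // pattern byte left unpaired
                self.col[i] = sub.min(skip_text).min(skip_pattern);
                diag = prev;
            }
            if self.col[m] <= k {
                ends.push(j + 1);
            }
        }
        ends
    }
}
```

In a window-scanning loop this would mean creating the searcher before the loop and passing it by `&mut` into each scan, rather than calling the constructor inside the per-window function.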

let mut searcher: Searcher<Dna> = Searcher::new(false, None);

// Convert window to a Vec so it implements SearchAble
let window_vec = window.to_vec();

Oh, we should fix this @rickbeeloo

src/main.rs Outdated

// Convert SASSY CIGAR to standard format
let cigar_debug = format!("{:?}", best_match.cigar);
let cigar_str = parse_sassy_cigar_debug(&cigar_debug);

You should be able to use Cigar::to_string() here, right?
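
To show why `to_string()` is preferable to parsing `{:?}` output, here is a small sketch with made-up types. `Op` and `Cigar` below are hypothetical and are not SASSY's definitions. Any type that implements `std::fmt::Display` gets `to_string()` for free, and its output is a deliberate, stable format, whereas `Debug` output is meant for humans and can change shape between versions.

```rust
use std::fmt;

/// Hypothetical alignment operations, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Match,
    Sub,
    Ins,
    Del,
}

/// Hypothetical run-length encoded CIGAR: (count, operation) pairs.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Cigar {
    ops: Vec<(usize, Op)>,
}

impl fmt::Display for Cigar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for &(n, op) in &self.ops {
            let c = match op {
                Op::Match => '=',
                Op::Sub => 'X',
                Op::Ins => 'I',
                Op::Del => 'D',
            };
            // Emit each run as count followed by the operation character.
            write!(f, "{n}{c}")?;
        }
        Ok(())
    }
}
```

With this in place, `cigar.to_string()` yields a string such as `3=1X16=` directly, and no helper like `parse_sassy_cigar_debug` is needed to reverse-engineer the `Debug` representation.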

FarnazSalehi94 and others added 6 commits July 15, 2025 12:00
- Fixed incorrect position reporting in scan_window_sassy function
- Added manual verification for match positions in windows
- Improved SASSY CIGAR parsing with proper fallback handling
- Enhanced debug output for troubleshooting alignment issues
- Added support for correct query start/end coordinates in PAF output
- Fixed CFD score calculation with proper target sequence extraction
- Added test files for multi-sequence testing
- Fixed CFD key construction to use correct RNA-DNA pairing format
- Corrected position calculation for match reporting
- Updated CIGAR parsing to handle count+operation format properly
- All tests now passing including CFD score validation
- Cleaned up debug output for production
- Basic CFD functionality works correctly for simple cases
- Comprehensive test suite has 12/20 failing cases that need investigation
- Main tool functionality (position calculation, CIGAR parsing) is working
- Will debug CFD matrix lookup issues in separate PR

Major improvements:
- Replace alignment engine with SASSY for accurate sequence matching
- Implement position-dependent CFD (Cutting Frequency Determination) scoring
- Fix target coordinate calculation for proper off-target detection
- Add support for mismatches and indels in alignment
- Clean up debug output for production-ready tool
- Improve PAF output format with CFD scores

Technical changes:
- Integrate SASSY library for approximate string matching
- Add CFD calculation with position-specific mismatch penalties
- Fix coordinate mapping from window positions to absolute positions
- Implement proper target sequence extraction for CFD scoring
- Add comprehensive test cases for validation

Breaking changes:
- Output format now includes CFD scores (cf:f tag)
- Improved coordinate accuracy may change previous results
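
As an illustration of the "count+operation" CIGAR format mentioned in the commits above, here is a minimal parsing sketch. `parse_cigar` is a hypothetical name and this is not the code from the PR; the accepted operation set (`=`, `X`, `I`, `D`, `M`) is an assumption based on common CIGAR conventions.

```rust
/// Parse a CIGAR string such as "3=1X16=" into (count, operation) pairs.
/// Hypothetical helper: every operation must be preceded by a count.
fn parse_cigar(s: &str) -> Result<Vec<(usize, char)>, String> {
    let mut ops = Vec::new();
    let mut count: Option<usize> = None;
    for ch in s.chars() {
        if let Some(d) = ch.to_digit(10) {
            // Accumulate a multi-digit run length.
            count = Some(count.unwrap_or(0) * 10 + d as usize);
        } else if "=XIDM".contains(ch) {
            let n = count
                .take()
                .ok_or_else(|| format!("missing count before '{ch}'"))?;
            ops.push((n, ch));
        } else {
            return Err(format!("unexpected character '{ch}'"));
        }
    }
    if count.is_some() {
        return Err("trailing count without an operation".to_string());
    }
    Ok(ops)
}
```

Returning a `Result` instead of falling back silently makes malformed input visible in tests, which may help when tracking down the kind of position and CIGAR discrepancies these commits describe.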