
Conversation

kevinschaper
Contributor

Summary

  • Add memory configuration options (--memory, --heap-size) for Docker/JVM tuning
  • Implement chunked parallel loading with configurable chunk size and worker count (a rough sketch follows this list)
  • Add CBOR format support via /update/cbor endpoint for faster binary data loading
  • Auto-configure Solr performance settings (RAM buffer, disable autocommits)
  • Single commit at end instead of per-file commits to reduce overhead
  • New configure-performance command for manual Solr optimization
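
A minimal sketch of the chunked-parallel-load / single-commit flow, assuming a local core at `http://localhost:8983/solr/mycore`; the helper names, worker count, and toy inputs below are illustrative, not the actual lsolr implementation:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

SOLR = "http://localhost:8983/solr/mycore"  # hypothetical core URL

def upload_chunk(docs: list[dict]) -> int:
    # Post the chunk without committing; the single commit happens at the end.
    resp = requests.post(f"{SOLR}/update", json=docs,
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return len(docs)

def bulkload(chunks: list[list[dict]], workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(upload_chunk, chunks))
    # One commit for the whole load instead of a commit per file.
    commit = requests.post(f"{SOLR}/update", params={"commit": "true"},
                           headers={"Content-Type": "application/json"},
                           data="{}")
    commit.raise_for_status()
    print(f"loaded {total} documents with a single commit")
```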

Test plan

  • Test chunked loading with large CSV files (1GB+)
  • Compare CSV vs CBOR performance on same dataset
  • Verify memory options work with Docker containers
  • Test auto-configuration properly sets Solr parameters
  • Confirm parallel workers improve loading speed
  • Test error handling and cleanup of temporary files

Performance Benefits

These optimizations target large file loading (25GB+) and should provide:

  • 5-10x faster loading through parallel chunk processing
  • Reduced memory pressure via configurable RAM buffers
  • Faster binary format support with CBOR
  • Elimination of per-file commit overhead

🤖 Generated with Claude Code

kevinschaper and others added 19 commits August 12, 2025 18:21
- Add memory configuration options (--memory, --heap-size) for Docker/JVM
- Implement chunked parallel loading with configurable chunk size and workers
- Add CBOR format support via /update/cbor endpoint for faster binary loading
- Auto-configure Solr performance settings (RAM buffer, disable autocommits)
- Single commit at end of all uploads instead of per-file commits
- Add configure-performance command for manual Solr tuning

These optimizations should significantly improve loading performance for large files (25GB+) through:
- Parallel processing of file chunks
- Optimized memory allocation
- Reduced commit overhead
- Binary format support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add cbor2 for binary format support
- Add pandas for CSV chunk processing
- Add requests for HTTP API configuration

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Update pandas from ^1.3.0 to ^2.0.0 for better performance and compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Replace pandas with DuckDB for better handling of malformed data
- Add ignore_errors=true to skip bad rows instead of failing
- Auto-detect TSV vs CSV based on file extension
- More efficient row counting and chunk processing for large files
- Set duckdb dependency to '*' for maximum compatibility

This should resolve issues with inconsistent field counts in TSV files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
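
As a rough illustration of the DuckDB-based reading in the commit above (the file name is hypothetical; `delim`, `header`, and `ignore_errors` are standard DuckDB `read_csv` options):

```python
import duckdb
from pathlib import Path

path = "nodes.tsv"  # hypothetical input file
# Auto-detect TSV vs CSV from the file extension.
delim = "\t" if Path(path).suffix.lower() == ".tsv" else ","

con = duckdb.connect()  # in-memory connection
# ignore_errors=true skips rows DuckDB cannot parse (e.g. inconsistent
# field counts) instead of failing the whole load.
total = con.execute(
    f"SELECT count(*) FROM read_csv('{path}', delim='{delim}', "
    "header=true, ignore_errors=true)"
).fetchone()[0]
print(f"{total} loadable rows in {path}")
```
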
Add proper Content-Type header to Solr commit requests to avoid 'Missing ContentType' error.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
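
For reference, an explicit-header commit request of the kind described above might look like this (the core URL is illustrative):

```python
import requests

SOLR = "http://localhost:8983/solr/mycore"  # hypothetical core URL

# Solr rejects a bodied POST without a Content-Type ("Missing ContentType"),
# so the commit is sent as an empty JSON body with an explicit header.
resp = requests.post(
    f"{SOLR}/update",
    params={"commit": "true"},
    headers={"Content-Type": "application/json"},
    data="{}",
)
resp.raise_for_status()
```
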
- Move upload operations inside ThreadPoolExecutor for genuine parallelism
- Add HTTP connection pooling with 20 concurrent connections
- Use as_completed() to process uploads as they finish
- Combine chunk creation and upload into single parallel tasks
- Add automatic temp file cleanup per chunk

This should dramatically improve performance as workers now upload
simultaneously instead of sequentially. 8 workers will actually
process 8 uploads concurrently to Solr.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
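
A sketch of the pooled, truly parallel upload pattern described above, assuming chunks are already materialized in memory; the pool sizes, worker count, and core URL are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from requests.adapters import HTTPAdapter

SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"  # hypothetical

session = requests.Session()
# Size the connection pool to the concurrency so uploads do not queue
# behind a single keep-alive connection.
session.mount("http://", HTTPAdapter(pool_connections=20, pool_maxsize=20))

def build_and_upload(chunk: list[dict]) -> int:
    # In the real flow each task would also create and clean up its own
    # temp file; here the chunk is posted directly.
    resp = session.post(SOLR_UPDATE, json=chunk,
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return len(chunk)

chunks = [[{"id": f"doc-{i}"} for i in range(start, start + 1000)]
          for start in range(0, 8000, 1000)]  # toy data: 8 chunks of 1k docs

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(build_and_upload, c) for c in chunks]
    for fut in as_completed(futures):  # handle results as uploads finish
        print(f"uploaded {fut.result()} docs")
```
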
Show actual error response text when Solr configuration requests fail,
instead of just HTTP status codes. This will help debug issues like
the 400 error when setting RAM buffer size.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
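
The improved reporting amounts to surfacing `response.text` alongside the status code, roughly (the URL and payload here are illustrative):

```python
import requests

config_url = "http://localhost:8983/solr/mycore/config"  # hypothetical Config API URL
payload = {"set-property": {"updateHandler.autoCommit.maxTime": -1}}  # illustrative

resp = requests.post(config_url, json=payload)
if not resp.ok:
    # Include the response body, not just the status code, so failures like
    # the 400 from the RAM buffer request explain themselves.
    raise RuntimeError(
        f"Solr configuration request failed ({resp.status_code}): {resp.text}"
    )
```
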
- Remove CBOR format (not supported in Solr 8.x)
- Replace with CSV to JSON conversion for performance comparison
- Fix RAM buffer configuration to use Docker environment variables
- Remove failing HTTP API call for RAM buffer setting
- Add --ram-buffer-mb option to start-server command
- Remove cbor2 dependency

Now works with Solr 8.x and properly configures RAM buffer via SOLR_OPTS.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
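
A hedged sketch of starting the container with the heap and RAM buffer passed as environment variables, assuming the core's solrconfig.xml reads the `solr.ramBufferSizeMB` system property; the image tag, container name, and values are illustrative:

```python
import subprocess

heap = "8g"            # e.g. from --heap-size (illustrative value)
ram_buffer_mb = 1024   # e.g. from --ram-buffer-mb (illustrative value)

# Solr 8.x picks up extra JVM system properties from SOLR_OPTS at startup,
# so the RAM buffer is configured when the container starts rather than
# through an HTTP API call.
subprocess.run(
    [
        "docker", "run", "-d", "--name", "lsolr-solr", "-p", "8983:8983",
        "-e", f"SOLR_HEAP={heap}",
        "-e", f"SOLR_OPTS=-Dsolr.ramBufferSizeMB={ram_buffer_mb}",
        "solr:8",
    ],
    check=True,
)
```
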
Correct 'solrschcema' to 'solrschema' in solrschemagen.py import.
This was causing ModuleNotFoundError when running lsolr commands.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Show total time taken and commit time separately to help
compare performance between different formats and settings.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Break down timing to show:
- Preprocessing time (chunk creation)
- Upload time (parallel HTTP uploads)
- Total processing time

This helps identify whether bottlenecks are in data processing
or network/Solr uploads.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Show documents per second for:
- Upload phase (pure HTTP throughput)
- Processing phase (including preprocessing)
- Overall end-to-end (including commit)

This makes it easy to compare performance across different
settings, formats, and optimizations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
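
Roughly, the phase timings and docs-per-second figures described in the last few commits come from bracketing each phase with a timer (the document count is illustrative):

```python
import time

n_docs = 1_000_000  # total documents in the load (illustrative)

t0 = time.perf_counter()
# ... preprocessing: chunk creation ...
t1 = time.perf_counter()
# ... parallel HTTP uploads ...
t2 = time.perf_counter()
# ... final commit ...
t3 = time.perf_counter()

print(f"preprocess {t1 - t0:.1f}s, upload {t2 - t1:.1f}s, "
      f"commit {t3 - t2:.1f}s, total {t3 - t0:.1f}s")
print(f"upload rate:     {n_docs / max(t2 - t1, 1e-9):,.0f} docs/sec")
print(f"processing rate: {n_docs / max(t2 - t0, 1e-9):,.0f} docs/sec")
print(f"end-to-end rate: {n_docs / max(t3 - t0, 1e-9):,.0f} docs/sec")
```
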
Implement new 'bulkload-db' command with:
- Read-only DuckDB connections for safety
- Parallel query execution with OFFSET/LIMIT chunking
- Direct streaming (no temp files): DuckDB → JSON → HTTP
- SQL filtering support (WHERE, columns, ORDER BY)
- Auto-detected optimal worker count
- Comprehensive timing and throughput metrics

Usage: lsolr bulkload-db data.duckdb table_name [options]

Expected performance: 50k-100k+ docs/sec, vs ~30k docs/sec from the CSV approach.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
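
A sketch of the read-only, OFFSET/LIMIT-chunked streaming approach described above; the table, core URL, chunk size, and worker count are illustrative, and the real bulkload-db command may differ in its details:

```python
from concurrent.futures import ThreadPoolExecutor

import duckdb
import requests

DB_PATH = "data.duckdb"
TABLE = "table_name"
SOLR_UPDATE = "http://localhost:8983/solr/mycore/update"  # hypothetical
CHUNK = 100_000

def upload_slice(offset: int) -> int:
    # Each worker opens its own read-only connection, pulls one slice, and
    # streams it straight to Solr as JSON (no temp files). Deterministic
    # paging really wants an ORDER BY, which the command supports.
    con = duckdb.connect(DB_PATH, read_only=True)
    cur = con.execute(f"SELECT * FROM {TABLE} LIMIT {CHUNK} OFFSET {offset}")
    cols = [d[0] for d in cur.description]
    rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    con.close()
    if rows:
        resp = requests.post(SOLR_UPDATE, json=rows,
                             headers={"Content-Type": "application/json"})
        resp.raise_for_status()
    return len(rows)

con = duckdb.connect(DB_PATH, read_only=True)
total_rows = con.execute(f"SELECT count(*) FROM {TABLE}").fetchone()[0]
con.close()

with ThreadPoolExecutor(max_workers=8) as pool:
    loaded = sum(pool.map(upload_slice, range(0, total_rows, CHUNK)))
print(f"streamed {loaded} rows to Solr")
```
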
- Reduce default chunk size from 500k to 100k rows for better memory usage
- Make worker auto-detection less aggressive: CPU × 1.5 instead of × 2
- Cap workers at 12 instead of 16 to avoid over-parallelization
- Maintains manual override capability for fine-tuning

This provides better out-of-the-box performance while allowing customization.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
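
The new defaults amount to something like this (the function name is illustrative; the numbers come from the commit above):

```python
import os

DEFAULT_CHUNK_SIZE = 100_000  # down from 500k rows

def default_workers() -> int:
    # CPU count × 1.5, capped at 12; a manually supplied worker count still
    # overrides this.
    return min(12, max(1, int((os.cpu_count() or 1) * 1.5)))
```
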
Adds null_padding and max_line_size parameters to DuckDB CSV reader
to handle rows with inconsistent column counts. This prevents bulk
loading failures when encountering malformed CSV/TSV data.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The bulkload process was creating CSV files but Solr was configured to expect
TSV (tab-separated) format with separator=%09. This caused all field names to
be concatenated into one field and all values into a single array.

Changes:
- Remove problematic null_padding and max_line_size DuckDB parameters
- Use DELIMITER '\t' in DuckDB COPY command to create TSV files
- Ensure format consistency between export and Solr import

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
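
A sketch of the matching export side, assuming the Solr import is configured with separator=%09 as described above; the file and table names are illustrative:

```python
import duckdb

con = duckdb.connect("data.duckdb", read_only=True)
# Export with a tab delimiter so the chunk file matches the TSV format
# (separator=%09) that the Solr import is configured to expect.
con.execute("""
    COPY (SELECT * FROM table_name)
    TO 'chunk_000.tsv' (HEADER, DELIMITER '\t')
""")
con.close()
```
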
kevinschaper merged commit 34abf88 into main on Aug 30, 2025
0 of 2 checks passed