Add high-performance bulk loading optimizations for large files #16
Merged
Conversation
- Add memory configuration options (--memory, --heap-size) for Docker/JVM
- Implement chunked parallel loading with configurable chunk size and workers
- Add CBOR format support via /update/cbor endpoint for faster binary loading
- Auto-configure Solr performance settings (RAM buffer, disable autocommits)
- Single commit at end of all uploads instead of per-file commits
- Add configure-performance command for manual Solr tuning

These optimizations should significantly improve loading performance for large files (25GB+) through:
- Parallel processing of file chunks
- Optimized memory allocation
- Reduced commit overhead
- Binary format support

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
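The "buffer everything, commit once" pattern behind this commit can be sketched briefly. This is a minimal illustration, not the lsolr implementation: the Solr URL, core name, and in-memory chunk representation are assumptions.

```python
# Minimal sketch: upload chunks in parallel with commit=false, then issue a
# single commit at the end instead of per-file commits. URL/core are assumed.
from concurrent.futures import ThreadPoolExecutor
import requests

SOLR_URL = "http://localhost:8983/solr/mycore"  # assumed core name

def upload_chunk(docs: list[dict]) -> None:
    # commit=false lets Solr buffer the documents without committing yet
    resp = requests.post(f"{SOLR_URL}/update", json=docs, params={"commit": "false"})
    resp.raise_for_status()

def bulk_load(chunks: list[list[dict]], workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(upload_chunk, chunks))
    # one commit after every chunk has been uploaded
    requests.post(
        f"{SOLR_URL}/update",
        params={"commit": "true"},
        headers={"Content-Type": "application/json"},
        data="{}",
    ).raise_for_status()
```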
- Add cbor2 for binary format support
- Add pandas for CSV chunk processing
- Add requests for HTTP API configuration

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Update pandas from ^1.3.0 to ^2.0.0 for better performance and compatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Replace pandas with DuckDB for better handling of malformed data
- Add ignore_errors=true to skip bad rows instead of failing
- Auto-detect TSV vs CSV based on file extension
- More efficient row counting and chunk processing for large files
- Set duckdb dependency to '*' for maximum compatibility

This should resolve issues with inconsistent field counts in TSV files.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
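A rough sketch of the DuckDB-based read this commit describes, assuming the file path, chunk bounds, and query shape used here (they are illustrative, not the exact lsolr code):

```python
# Illustrative only: read one chunk of a possibly malformed CSV/TSV file with
# DuckDB, skipping bad rows and picking the delimiter from the file extension.
import duckdb

def load_chunk(path: str, offset: int, limit: int) -> list[tuple]:
    delim = "\t" if path.lower().endswith(".tsv") else ","  # auto-detect TSV vs CSV
    con = duckdb.connect()  # in-memory database; the data file is only read
    query = (
        f"SELECT * FROM read_csv_auto('{path}', delim='{delim}', ignore_errors=true) "
        f"LIMIT {limit} OFFSET {offset}"
    )
    return con.execute(query).fetchall()
```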
Add proper Content-Type header to Solr commit requests to avoid 'Missing ContentType' error.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
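A minimal sketch of the fix, assuming a core URL like the one shown: the commit request carries an explicit Content-Type and a valid empty JSON body so Solr does not reject it.

```python
# Send an explicit Content-Type (and an empty JSON body) with the commit request.
import requests

def commit(core_url: str) -> None:  # e.g. "http://localhost:8983/solr/mycore" (assumed)
    resp = requests.post(
        f"{core_url}/update",
        params={"commit": "true"},
        headers={"Content-Type": "application/json"},
        data="{}",
    )
    resp.raise_for_status()
```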
- Move upload operations inside ThreadPoolExecutor for genuine parallelism
- Add HTTP connection pooling with 20 concurrent connections
- Use as_completed() to process uploads as they finish
- Combine chunk creation and upload into single parallel tasks
- Add automatic temp file cleanup per chunk

This should dramatically improve performance, as workers now upload simultaneously instead of sequentially: 8 workers will actually process 8 uploads concurrently to Solr.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
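The parallel upload pattern described here looks roughly like the following sketch; the update URL and in-memory chunk representation are assumptions, not the actual lsolr code.

```python
# ThreadPoolExecutor workers that build and upload chunks concurrently, with a
# pooled requests.Session and as_completed() to react as each upload finishes.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from requests.adapters import HTTPAdapter

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # assumed core name

def make_session(pool_size: int = 20) -> requests.Session:
    # Connection pool sized for 20 concurrent connections to Solr
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    return session

def build_and_upload(session: requests.Session, docs: list[dict]) -> int:
    # Chunk preparation and upload run inside the same worker task
    resp = session.post(SOLR_UPDATE_URL, json=docs, params={"commit": "false"})
    resp.raise_for_status()
    return len(docs)

def parallel_upload(chunks: list[list[dict]], workers: int = 8) -> None:
    session = make_session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(build_and_upload, session, chunk) for chunk in chunks]
        for fut in as_completed(futures):  # handle each upload as soon as it finishes
            print(f"uploaded {fut.result()} docs")
```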
Show the actual error response text when Solr configuration requests fail, instead of just HTTP status codes. This will help debug issues like the 400 error when setting the RAM buffer size.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
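The reporting change amounts to printing Solr's response body alongside the status code. A tiny sketch, with the config request shown purely as an example:

```python
# Print Solr's response body, not just the status code, when a request fails.
import requests

resp = requests.post(
    "http://localhost:8983/solr/mycore/config",  # assumed core name
    json={"set-property": {"updateHandler.autoCommit.maxTime": -1}},
)
if resp.status_code != 200:
    print(f"Solr config request failed ({resp.status_code}): {resp.text}")
```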
- Remove CBOR format (not supported in Solr 8.x)
- Replace with CSV-to-JSON conversion for performance comparison
- Fix RAM buffer configuration to use Docker environment variables
- Remove failing HTTP API call for RAM buffer setting
- Add --ram-buffer-mb option to start-server command
- Remove cbor2 dependency

Now works with Solr 8.x and properly configures the RAM buffer via SOLR_OPTS.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
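Passing the setting through SOLR_OPTS when starting the container could look like the sketch below. The -Dsolr.ramBufferSizeMB property name is an assumption here; it only takes effect if the core's solrconfig.xml reads ramBufferSizeMB from that system property.

```python
# Sketch of starting the Solr container with heap and RAM-buffer settings
# passed as environment variables; the property name is an assumption.
import subprocess

def start_server(ram_buffer_mb: int = 1024, heap: str = "4g") -> None:
    subprocess.run(
        [
            "docker", "run", "-d", "--name", "solr", "-p", "8983:8983",
            "-e", f"SOLR_HEAP={heap}",
            "-e", f"SOLR_OPTS=-Dsolr.ramBufferSizeMB={ram_buffer_mb}",  # assumed property
            "solr:8",
        ],
        check=True,
    )
```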
Correct 'solrschcema' to 'solrschema' in the solrschemagen.py import. This was causing a ModuleNotFoundError when running lsolr commands.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Show total time taken and commit time separately to help compare performance between different formats and settings.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Break down timing to show:
- Preprocessing time (chunk creation)
- Upload time (parallel HTTP uploads)
- Total processing time

This helps identify whether bottlenecks are in data processing or in network/Solr uploads.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Show documents per second for:
- Upload phase (pure HTTP throughput)
- Processing phase (including preprocessing)
- Overall end-to-end (including commit)

This makes it easy to compare performance across different settings, formats, and optimizations.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
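Taken together, the timing commits boil down to a report like this sketch, where the three phase callables stand in for the real preprocessing, upload, and commit steps:

```python
# Illustrative timing/throughput breakdown per phase and end-to-end.
import time

def timed_load(n_docs: int, preprocess, upload, commit) -> None:
    t0 = time.perf_counter()
    preprocess()
    t1 = time.perf_counter()
    upload()
    t2 = time.perf_counter()
    commit()
    t3 = time.perf_counter()

    print(f"preprocess {t1 - t0:.1f}s | upload {t2 - t1:.1f}s | commit {t3 - t2:.1f}s")
    print(f"upload throughput:     {n_docs / (t2 - t1):,.0f} docs/sec")
    print(f"processing throughput: {n_docs / (t2 - t0):,.0f} docs/sec")
    print(f"end-to-end throughput: {n_docs / (t3 - t0):,.0f} docs/sec")
```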
Implement new 'bulkload-db' command with:
- Read-only DuckDB connections for safety
- Parallel query execution with OFFSET/LIMIT chunking
- Direct streaming (no temp files): DuckDB → JSON → HTTP
- SQL filtering support (WHERE, columns, ORDER BY)
- Auto-detected optimal worker count
- Comprehensive timing and throughput metrics

Usage: lsolr bulkload-db data.duckdb table_name [options]

Expected performance: 50k-100k+ docs/sec vs 30k from the CSV approach.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
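The overall bulkload-db flow can be sketched as below. This is a simplified illustration rather than the actual command: the core name, table handling, error handling, and worker count are assumptions.

```python
# Rough sketch: read-only DuckDB connection, OFFSET/LIMIT chunking, and direct
# streaming of each chunk to Solr as JSON, followed by a single commit.
from concurrent.futures import ThreadPoolExecutor
import duckdb
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # assumed core name

def upload_range(db_path: str, table: str, offset: int, limit: int) -> int:
    con = duckdb.connect(db_path, read_only=True)  # read-only for safety
    cur = con.execute(f"SELECT * FROM {table} LIMIT {limit} OFFSET {offset}")
    cols = [d[0] for d in cur.description]
    docs = [dict(zip(cols, row)) for row in cur.fetchall()]  # rows -> JSON docs
    resp = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "false"})
    resp.raise_for_status()
    return len(docs)

def bulkload_db(db_path: str, table: str, chunk_size: int = 100_000, workers: int = 8) -> None:
    con = duckdb.connect(db_path, read_only=True)
    total = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = sum(
            pool.map(lambda off: upload_range(db_path, table, off, chunk_size),
                     range(0, total, chunk_size))
        )
    # single commit once every chunk is in
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"},
                  headers={"Content-Type": "application/json"}, data="{}").raise_for_status()
    print(f"uploaded {loaded} of {total} rows")
```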
- Reduce default chunk size from 500k to 100k rows for better memory usage
- Make worker auto-detection less aggressive: CPU × 1.5 instead of × 2
- Cap workers at 12 instead of 16 to avoid over-parallelization
- Maintain manual override capability for fine-tuning

This provides better out-of-the-box performance while allowing customization.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
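The revised auto-detection rule is easy to state in a few lines; the function name and fallback CPU count here are illustrative.

```python
# CPU count x 1.5, capped at 12, with a manual override taking precedence.
import os

def auto_workers(manual: int | None = None) -> int:
    if manual is not None:
        return manual  # an explicit --workers value wins
    cpus = os.cpu_count() or 4
    return max(1, min(12, int(cpus * 1.5)))
```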
Add null_padding and max_line_size parameters to the DuckDB CSV reader to handle rows with inconsistent column counts. This prevents bulk loading failures when encountering malformed CSV/TSV data.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
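For illustration, this is how the two reader parameters added here are used (a later commit removes them again); the file path and size limit are examples only.

```python
# null_padding fills short rows with NULLs; max_line_size raises the per-line limit.
import duckdb

con = duckdb.connect()
rows = con.execute(
    "SELECT * FROM read_csv_auto('data.tsv', delim='\t', "
    "null_padding=true, max_line_size=10000000)"
).fetchall()
```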
The bulkload process was creating CSV files, but Solr was configured to expect TSV (tab-separated) format with separator=%09. This caused all field names to be concatenated into one field and all values into a single array.

Changes:
- Remove problematic null_padding and max_line_size DuckDB parameters
- Use DELIMITER '\t' in the DuckDB COPY command to create TSV files
- Ensure format consistency between export and Solr import

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
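A sketch of the matching export step, assuming an illustrative table name, chunk bounds, and output file: DuckDB's COPY writes tab-delimited output so the file agrees with the separator=%09 (tab) that the Solr CSV handler is configured to expect.

```python
# Export one chunk as TSV so the delimiter matches Solr's separator=%09 (tab).
import duckdb

con = duckdb.connect("data.duckdb", read_only=True)
con.execute(
    "COPY (SELECT * FROM my_table LIMIT 100000 OFFSET 0) "
    "TO 'chunk_0.tsv' (FORMAT CSV, DELIMITER '\t', HEADER)"
)
```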
Summary
Test plan
Performance Benefits
These optimizations target large file loading (25GB+) and should provide:
🤖 Generated with Claude Code