Add high-performance bulk loading optimizations for large files #16
Merged
Conversation
- Add memory configuration options (--memory, --heap-size) for Docker/JVM
- Implement chunked parallel loading with configurable chunk size and workers
- Add CBOR format support via /update/cbor endpoint for faster binary loading
- Auto-configure Solr performance settings (RAM buffer, disable autocommits)
- Single commit at end of all uploads instead of per-file commits
- Add configure-performance command for manual Solr tuning

These optimizations should significantly improve loading performance for large files (25GB+) through:
- Parallel processing of file chunks
- Optimized memory allocation
- Reduced commit overhead
- Binary format support

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
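The "buffer everything, commit once" pattern behind this commit can be sketched briefly. This is a minimal illustration, not the lsolr implementation: the Solr URL, core name, and in-memory chunk representation are assumptions.

```python
# Minimal sketch: upload chunks in parallel with commit=false, then issue a
# single commit at the end instead of per-file commits. URL/core are assumed.
from concurrent.futures import ThreadPoolExecutor
import requests

SOLR_URL = "http://localhost:8983/solr/mycore"  # assumed core name

def upload_chunk(docs: list[dict]) -> None:
    # commit=false lets Solr buffer the documents without committing yet
    resp = requests.post(f"{SOLR_URL}/update", json=docs, params={"commit": "false"})
    resp.raise_for_status()

def bulk_load(chunks: list[list[dict]], workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(upload_chunk, chunks))
    # one commit after every chunk has been uploaded
    requests.post(
        f"{SOLR_URL}/update",
        params={"commit": "true"},
        headers={"Content-Type": "application/json"},
        data="{}",
    ).raise_for_status()
```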
- Add cbor2 for binary format support
- Add pandas for CSV chunk processing
- Add requests for HTTP API configuration

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Update pandas from ^1.3.0 to ^2.0.0 for better performance and compatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
- Replace pandas with DuckDB for better handling of malformed data
- Add ignore_errors=true to skip bad rows instead of failing
- Auto-detect TSV vs CSV based on file extension
- More efficient row counting and chunk processing for large files
- Set duckdb dependency to '*' for maximum compatibility

This should resolve issues with inconsistent field counts in TSV files.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
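A rough sketch of the DuckDB-based read this commit describes, assuming the file path, chunk bounds, and query shape used here (they are illustrative, not the exact lsolr code):

```python
# Illustrative only: read one chunk of a possibly malformed CSV/TSV file with
# DuckDB, skipping bad rows and picking the delimiter from the file extension.
import duckdb

def load_chunk(path: str, offset: int, limit: int) -> list[tuple]:
    delim = "\t" if path.lower().endswith(".tsv") else ","  # auto-detect TSV vs CSV
    con = duckdb.connect()  # in-memory database; the data file is only read
    query = (
        f"SELECT * FROM read_csv_auto('{path}', delim='{delim}', ignore_errors=true) "
        f"LIMIT {limit} OFFSET {offset}"
    )
    return con.execute(query).fetchall()
```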
Add proper Content-Type header to Solr commit requests to avoid 'Missing ContentType' error.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
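A minimal sketch of the fix, assuming a core URL like the one shown: the commit request carries an explicit Content-Type and a valid empty JSON body so Solr does not reject it.

```python
# Send an explicit Content-Type (and an empty JSON body) with the commit request.
import requests

def commit(core_url: str) -> None:  # e.g. "http://localhost:8983/solr/mycore" (assumed)
    resp = requests.post(
        f"{core_url}/update",
        params={"commit": "true"},
        headers={"Content-Type": "application/json"},
        data="{}",
    )
    resp.raise_for_status()
```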
- Move upload operations inside ThreadPoolExecutor for genuine parallelism
- Add HTTP connection pooling with 20 concurrent connections
- Use as_completed() to process uploads as they finish
- Combine chunk creation and upload into single parallel tasks
- Add automatic temp file cleanup per chunk

This should dramatically improve performance, as workers now upload simultaneously instead of sequentially: 8 workers will actually process 8 uploads concurrently to Solr.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
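The parallel upload pattern described here looks roughly like the following sketch; the update URL and in-memory chunk representation are assumptions, not the actual lsolr code.

```python
# ThreadPoolExecutor workers that build and upload chunks concurrently, with a
# pooled requests.Session and as_completed() to react as each upload finishes.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
from requests.adapters import HTTPAdapter

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # assumed core name

def make_session(pool_size: int = 20) -> requests.Session:
    # Connection pool sized for 20 concurrent connections to Solr
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    return session

def build_and_upload(session: requests.Session, docs: list[dict]) -> int:
    # Chunk preparation and upload run inside the same worker task
    resp = session.post(SOLR_UPDATE_URL, json=docs, params={"commit": "false"})
    resp.raise_for_status()
    return len(docs)

def parallel_upload(chunks: list[list[dict]], workers: int = 8) -> None:
    session = make_session()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(build_and_upload, session, chunk) for chunk in chunks]
        for fut in as_completed(futures):  # handle each upload as soon as it finishes
            print(f"uploaded {fut.result()} docs")
```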
Show the actual error response text when Solr configuration requests fail, instead of just HTTP status codes. This will help debug issues like the 400 error when setting the RAM buffer size.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
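The reporting change amounts to printing Solr's response body alongside the status code. A tiny sketch, with the config request shown purely as an example:

```python
# Print Solr's response body, not just the status code, when a request fails.
import requests

resp = requests.post(
    "http://localhost:8983/solr/mycore/config",  # assumed core name
    json={"set-property": {"updateHandler.autoCommit.maxTime": -1}},
)
if resp.status_code != 200:
    print(f"Solr config request failed ({resp.status_code}): {resp.text}")
```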
- Remove CBOR format (not supported in Solr 8.x)
- Replace with CSV-to-JSON conversion for performance comparison
- Fix RAM buffer configuration to use Docker environment variables
- Remove failing HTTP API call for RAM buffer setting
- Add --ram-buffer-mb option to start-server command
- Remove cbor2 dependency

Now works with Solr 8.x and properly configures the RAM buffer via SOLR_OPTS.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
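Passing the setting through SOLR_OPTS when starting the container could look like the sketch below. The -Dsolr.ramBufferSizeMB property name is an assumption here; it only takes effect if the core's solrconfig.xml reads ramBufferSizeMB from that system property.

```python
# Sketch of starting the Solr container with heap and RAM-buffer settings
# passed as environment variables; the property name is an assumption.
import subprocess

def start_server(ram_buffer_mb: int = 1024, heap: str = "4g") -> None:
    subprocess.run(
        [
            "docker", "run", "-d", "--name", "solr", "-p", "8983:8983",
            "-e", f"SOLR_HEAP={heap}",
            "-e", f"SOLR_OPTS=-Dsolr.ramBufferSizeMB={ram_buffer_mb}",  # assumed property
            "solr:8",
        ],
        check=True,
    )
```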
Correct 'solrschcema' to 'solrschema' in the solrschemagen.py import. This was causing a ModuleNotFoundError when running lsolr commands.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Show total time taken and commit time separately to help compare performance between different formats and settings.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Break down timing to show:
- Preprocessing time (chunk creation)
- Upload time (parallel HTTP uploads)
- Total processing time

This helps identify whether bottlenecks are in data processing or in network/Solr uploads.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Show documents per second for:
- Upload phase (pure HTTP throughput)
- Processing phase (including preprocessing)
- Overall end-to-end (including commit)

This makes it easy to compare performance across different settings, formats, and optimizations.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
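Taken together, the timing commits boil down to a report like this sketch, where the three phase callables stand in for the real preprocessing, upload, and commit steps:

```python
# Illustrative timing/throughput breakdown per phase and end-to-end.
import time

def timed_load(n_docs: int, preprocess, upload, commit) -> None:
    t0 = time.perf_counter()
    preprocess()
    t1 = time.perf_counter()
    upload()
    t2 = time.perf_counter()
    commit()
    t3 = time.perf_counter()

    print(f"preprocess {t1 - t0:.1f}s | upload {t2 - t1:.1f}s | commit {t3 - t2:.1f}s")
    print(f"upload throughput:     {n_docs / (t2 - t1):,.0f} docs/sec")
    print(f"processing throughput: {n_docs / (t2 - t0):,.0f} docs/sec")
    print(f"end-to-end throughput: {n_docs / (t3 - t0):,.0f} docs/sec")
```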
Implement new 'bulkload-db' command with:
- Read-only DuckDB connections for safety
- Parallel query execution with OFFSET/LIMIT chunking
- Direct streaming (no temp files): DuckDB → JSON → HTTP
- SQL filtering support (WHERE, columns, ORDER BY)
- Auto-detected optimal worker count
- Comprehensive timing and throughput metrics

Usage: lsolr bulkload-db data.duckdb table_name [options]

Expected performance: 50k-100k+ docs/sec vs 30k from the CSV approach.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
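The overall bulkload-db flow can be sketched as below. This is a simplified illustration rather than the actual command: the core name, table handling, error handling, and worker count are assumptions.

```python
# Rough sketch: read-only DuckDB connection, OFFSET/LIMIT chunking, and direct
# streaming of each chunk to Solr as JSON, followed by a single commit.
from concurrent.futures import ThreadPoolExecutor
import duckdb
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # assumed core name

def upload_range(db_path: str, table: str, offset: int, limit: int) -> int:
    con = duckdb.connect(db_path, read_only=True)  # read-only for safety
    cur = con.execute(f"SELECT * FROM {table} LIMIT {limit} OFFSET {offset}")
    cols = [d[0] for d in cur.description]
    docs = [dict(zip(cols, row)) for row in cur.fetchall()]  # rows -> JSON docs
    resp = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "false"})
    resp.raise_for_status()
    return len(docs)

def bulkload_db(db_path: str, table: str, chunk_size: int = 100_000, workers: int = 8) -> None:
    con = duckdb.connect(db_path, read_only=True)
    total = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        loaded = sum(
            pool.map(lambda off: upload_range(db_path, table, off, chunk_size),
                     range(0, total, chunk_size))
        )
    # single commit once every chunk is in
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"},
                  headers={"Content-Type": "application/json"}, data="{}").raise_for_status()
    print(f"uploaded {loaded} of {total} rows")
```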
- Reduce default chunk size from 500k to 100k rows for better memory usage
- Make worker auto-detection less aggressive: CPU × 1.5 instead of × 2
- Cap workers at 12 instead of 16 to avoid over-parallelization
- Maintain manual override capability for fine-tuning

This provides better out-of-the-box performance while allowing customization.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
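The revised auto-detection rule is easy to state in a few lines; the function name and fallback CPU count here are illustrative.

```python
# CPU count x 1.5, capped at 12, with a manual override taking precedence.
import os

def auto_workers(manual: int | None = None) -> int:
    if manual is not None:
        return manual  # an explicit --workers value wins
    cpus = os.cpu_count() or 4
    return max(1, min(12, int(cpus * 1.5)))
```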
Add null_padding and max_line_size parameters to the DuckDB CSV reader to handle rows with inconsistent column counts. This prevents bulk loading failures when encountering malformed CSV/TSV data.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
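For illustration, this is how the two reader parameters added here are used (a later commit removes them again); the file path and size limit are examples only.

```python
# null_padding fills short rows with NULLs; max_line_size raises the per-line limit.
import duckdb

con = duckdb.connect()
rows = con.execute(
    "SELECT * FROM read_csv_auto('data.tsv', delim='\t', "
    "null_padding=true, max_line_size=10000000)"
).fetchall()
```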
The bulkload process was creating CSV files, but Solr was configured to expect TSV (tab-separated) format with separator=%09. This caused all field names to be concatenated into one field and all values into a single array.

Changes:
- Remove problematic null_padding and max_line_size DuckDB parameters
- Use DELIMITER '\t' in the DuckDB COPY command to create TSV files
- Ensure format consistency between export and Solr import

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
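A sketch of the matching export step, assuming an illustrative table name, chunk bounds, and output file: DuckDB's COPY writes tab-delimited output so the file agrees with the separator=%09 (tab) that the Solr CSV handler is configured to expect.

```python
# Export one chunk as TSV so the delimiter matches Solr's separator=%09 (tab).
import duckdb

con = duckdb.connect("data.duckdb", read_only=True)
con.execute(
    "COPY (SELECT * FROM my_table LIMIT 100000 OFFSET 0) "
    "TO 'chunk_0.tsv' (FORMAT CSV, DELIMITER '\t', HEADER)"
)
```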
Summary
Test plan
Performance Benefits
These optimizations target large file loading (25GB+) and should provide:
🤖 Generated with Claude Code