-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-54305][SQL][PYTHON] Add admission control support to Python DataSource streaming API #53001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-54305][SQL][PYTHON] Add admission control support to Python DataSource streaming API #53001
Conversation
|
👋 Hello reviewers! JIRA Status: My Apache JIRA account is pending approval (account request submitted). I will update this PR with the actual JIRA number once my account is approved (typically 1-2 days). Review Status: This PR is ready for technical review. All implementation is complete and tested:
What this PR does: Adds admission control support to Python DataSource streaming API, enabling Python sources to control microbatch sizes via Looking forward to your feedback! 🚀 |
6476629 to
14571af
Compare
|
✅ CI Check Issue Fixed Fixed the failing
CI checks should now run properly. The workflows will start automatically. |
14571af to
10e8f59
Compare
fe55e7a to
4b86c76
Compare
a3cd1a2 to
345b024
Compare
051fe97 to
7e5af68
Compare
…taSource streaming API This change adds admission control capabilities to the Python DataSource streaming API, bringing it to feature parity with the Scala SupportsAdmissionControl interface. Changes include: - Modified PythonMicroBatchStream to implement SupportsAdmissionControl - Updated PythonStreamingSourceRunner to serialize ReadLimit to Python - Enhanced python_streaming_source_runner.py to deserialize and pass parameters - Extended DataSourceStreamReader.latestOffset() to accept start_offset and read_limit - Added reportLatestOffset() method for monitoring - Full backward compatibility maintained - Added comprehensive unit tests - Added example demonstrating admission control This enables Python streaming sources to: - Control microbatch sizes via maxRecordsPerBatch option - Implement rate limiting and backpressure - Match capabilities of built-in Scala sources (Kafka, Delta) JIRA: https://issues.apache.org/jira/browse/SPARK-54305
7e5af68 to
624830f
Compare
- Add signature detection in report_latest_offset_func to handle both old and new latestOffset signatures - This fixes test failures in existing data sources that use latestOffset() without parameters - Maintains backward compatibility while supporting new admission control API
- Fix compilation error: 'not found: value ReadLimit' - Import org.apache.spark.sql.connector.read.streaming.ReadLimit in PythonStreamingDataSourceSuite
…ion-control' into fix-pyspark-ci-failures-BEXD6
…asource_admission_control.py - Remove unused imports (StringType, StructType, StructField, IntegerType, ReusedSQLTestCase) - Fix line length violations by splitting long import statement - Remove trailing blank line
…ke8/mypy before commit
…, packaging, Scala checks)
…s run all applicable checks
What changes were proposed in this pull request?
This PR adds admission control support to the Python DataSource streaming API, bringing it to feature parity with Scala's
SupportsAdmissionControlinterface.JIRA: https://issues.apache.org/jira/browse/SPARK-54305
Problem
Currently, Python streaming data sources cannot control microbatch sizes because the
DataSourceStreamReader.latestOffset()method has no parameters to receive the configured limits. This forces Python sources to either:In contrast, Scala sources can implement
SupportsAdmissionControlto properly control batch sizes.Solution
This PR extends the Python DataSource API to support admission control by:
DataSourceStreamReader.latestOffset()to accept optionalstart_offsetandread_limitparametersPythonMicroBatchStreamto implementSupportsAdmissionControlReadLimitserialization inPythonStreamingSourceRunnerpython_streaming_source_runner.pyto deserialize and pass parametersreportLatestOffset()method for observabilityKey Features
ReadLimittypes (maxRows, maxFiles, maxBytes, minRows, composite)Why are the changes needed?
This change is critical for production streaming workloads using Python DataSources:
Does this PR introduce any user-facing changes?
Yes - API Enhancement (Backward Compatible)
New API Signature
Usage Example
Backward Compatibility
How was this patch tested?
Unit Tests
test_streaming_datasource_admission_control.pyReadLimitdictionary formatsIntegration Tests
structured_blockchain_admission_control.pyTest Environment
./build/mvn clean package -DskipTests -Phivepython/run-tests --testnames 'pyspark.sql.tests.streaming.test_streaming_datasource_admission_control'Files Changed
Total: 8 files changed, 842 insertions(+), 30 deletions(-)
Scala Changes
sql/core/.../python/PythonMicroBatchStream.scala- ImplementSupportsAdmissionControlsql/core/.../python/streaming/PythonStreamingSourceRunner.scala- SerializeReadLimitPython Changes
python/pyspark/sql/datasource.py- Enhanced API signaturepython/pyspark/sql/streaming/python_streaming_source_runner.py- Deserialize parameterspython/pyspark/sql/datasource_internal.py- Internal updatesTests & Examples
python/pyspark/sql/tests/streaming/test_streaming_datasource_admission_control.py- Unit testsexamples/.../structured_blockchain_admission_control.py- DemonstrationDocumentation
python/docs/source/tutorial/sql/python_data_source.rst- Tutorial updatesLicense Declaration
I confirm that this contribution is my original work and I license the work to the Apache Spark project under the Apache License 2.0.