[SPARK-48755] State V2 base implementation and ValueState support #47133
val field = structType.fields(0)
val encoder = getEncoder(field.dataType)
val state = statefulProcessorHandle.getValueState[String](stateName, Encoders.STRING)
// val state = statefulProcessorHandle.getValueState(stateName, encoder)
I'm not sure this is the correct way to build value state from a schema. It cannot support a class as the state type right now, so we might need to find a different approach. cc @sahnib
val groupingKeyType = groupingKeySchema.fields(0).dataType
val castedData = castToType(key, groupingKeyType)
logWarning(s"setting implicit key to $castedData with type ${castedData.getClass}")
val row = Row(castedData)
I'm setting the grouping key to a Row here because that is the only approach that works in my local setup. If I set a plain String key instead, it throws a mismatch error like the one below:
24/06/26 13:13:13 WARN ExpressionEncoder$Serializer: inputRow: [test_key]
Exception in thread "stateConnectionListenerThread" org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, value), LongType, ObjectType(class java.lang.Long)).longValue AS value#129L to a row. SQLSTATE: 42846
at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1309)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:227)
at org.apache.spark.sql.execution.streaming.StateTypesEncoder.serializeGroupingKey(StateTypesEncoderUtils.scala:101)
at org.apache.spark.sql.execution.streaming.StateTypesEncoder.encodeGroupingKey(StateTypesEncoderUtils.scala:80)
at org.apache.spark.sql.execution.streaming.ValueStateImpl.get(ValueStateImpl.scala:64)
at org.apache.spark.sql.execution.python.TransformWithStateInPandasStateServer.handleRequest(TransformWithStateInPandasStateServer.scala:133)
at org.apache.spark.sql.execution.python.TransformWithStateInPandasStateServer.run(TransformWithStateInPandasStateServer.scala:78)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassCastException: class java.lang.String cannot be cast to class org.apache.spark.sql.Row (java.lang.String is in module java.base of loader 'bootstrap'; org.apache.spark.sql.Row is in unnamed module of loader 'app')
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:222)
... 8 more
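To make the failure in the trace above concrete, here is a minimal toy sketch in plain Python of the cast-then-wrap step. It is not Spark's implementation: `CASTERS`, `cast_to_type`, `to_grouping_key_row`, and `serialize_grouping_key` are hypothetical names, a tuple stands in for `Row`, and the type names are illustrative assumptions. It only shows the idea that a row-based serializer rejects a bare string key, while a key that is first cast to the schema type and then wrapped in a one-field row passes.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical type names and casters, for illustration only.
# These are NOT Spark's DataType classes.
CASTERS: Dict[str, Callable[[str], Any]] = {
    "string": str,
    "long": int,
    "double": float,
}


def cast_to_type(raw_key: str, type_name: str) -> Any:
    """Cast a key that arrives as a string to the declared key type."""
    try:
        caster = CASTERS[type_name]
    except KeyError:
        raise ValueError(f"unsupported grouping key type: {type_name}") from None
    return caster(raw_key)


def to_grouping_key_row(raw_key: str, type_name: str) -> Tuple[Any, ...]:
    """Cast the key, then wrap it in a one-field row (a tuple in this toy model)."""
    return (cast_to_type(raw_key, type_name),)


def serialize_grouping_key(row: Any, type_name: str) -> bytes:
    """Toy stand-in for a row-based serializer: accepts only a one-field row."""
    if not isinstance(row, tuple) or len(row) != 1:
        raise TypeError(f"expected a one-field row, got {type(row).__name__}")
    if not isinstance(row[0], CASTERS[type_name]):
        raise TypeError(
            f"field 0 has type {type(row[0]).__name__}, expected {type_name}"
        )
    return repr(row[0]).encode("utf-8")
```

In this model, passing the bare string `"test_key"` to `serialize_grouping_key` raises a `TypeError`, loosely mirroring the `ClassCastException` above, whereas `serialize_grouping_key(to_grouping_key_row("11", "long"), "long")` succeeds.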
import org.apache.spark.tags.SlowSQLTest

@SlowSQLTest
class TransformWithStateInPandasSuite extends StreamTest {
I will add more unit tests
inputRows: Iterator["PandasDataFrameLike"]) -> Iterator["PandasDataFrameLike"]:
handle = StatefulProcessorHandle(state_api_client)

print(f"checking handle state: {state_api_client.handle_state}")
I'm keeping all the prints and logWarnings for testing purposes and will remove them in the final revision.
Mind filing a JIRA?
Yeah, will do, thanks!
What changes were proposed in this pull request?
Why are the changes needed?
Support Python State V2 API
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Ran a local integration test with the command below
Verified from the logs that value state methods work as expected for key
11
Will add unit tests
Was this patch authored or co-authored using generative AI tooling?
No