
Conversation

@gaogaotiantian
Contributor

What changes were proposed in this pull request?

Add sanity check for number of configurations being passed.

Why are the changes needed?

This is helpful for recognizing a malformed message and avoiding a potential deadlock when the message does not conform to the protocol.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This error should not happen and it should not break CI either.

Was this patch authored or co-authored using generative AI tooling?

No.

@gaogaotiantian marked this pull request as ready for review on December 5, 2025 23:50
def load(self, infile):
    num_conf = read_int(infile)
    if num_conf < 0 or num_conf > 10000:
        ...
    for i in range(num_conf):
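For context, a minimal self-contained sketch of what the full check could look like around the quoted lines; the holder class, error type, message, and key/value layout are illustrative assumptions, not the exact code in this PR:

from pyspark.serializers import UTF8Deserializer, read_int

utf8_deserializer = UTF8Deserializer()


class RunnerConfLoader:
    # Hypothetical holder class, only for illustration.
    def __init__(self):
        self._conf = {}

    def load(self, infile):
        num_conf = read_int(infile)
        # Sanity check: a negative or absurdly large count almost certainly means
        # the stream is misaligned with the protocol, so fail fast instead of
        # blocking on reads that will never be satisfied.
        if num_conf < 0 or num_conf > 10000:
            raise ValueError(
                "Invalid number of configurations: %d; the message likely does "
                "not conform to the protocol." % num_conf
            )
        for _ in range(num_conf):
            # Assumed entry layout: a UTF-8 key followed by a UTF-8 value.
            key = utf8_deserializer.loads(infile)
            value = utf8_deserializer.loads(infile)
            self._conf[key] = value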
Member


10,000 seems to be too small though, @gaogaotiantian.

Member


Hm, don't we already send only the allowed confs? e.g., ArrowPythonRunner.getPythonRunnerConfMap

Member


Ah, okay so this is adding a sanity check

Member

@dongjoon-hyun left a comment


Instead of introducing a hard-coded magic number, please provide an environment variable to control this, @gaogaotiantian. Also, it would be great if we could have a higher default value to be safe.

if num_conf < 0 or num_conf > 10000:
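As a sketch of this suggestion (not what was ultimately merged, since the discussion below kept a hard-coded bound), the limit could be read from an environment variable with a generous default; the variable name and default are hypothetical:

import os

# Hypothetical env var with a generous default; the discussion below kept a
# hard-coded bound instead of introducing a knob like this.
_MAX_NUM_CONF = int(os.environ.get("PYSPARK_MAX_RUNNER_CONF", "100000"))


def check_num_conf(num_conf):
    # Reject counts that cannot possibly be a valid runner_conf size.
    if num_conf < 0 or num_conf > _MAX_NUM_CONF:
        raise ValueError("Invalid number of configurations: %d" % num_conf)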

@dongjoon-hyun changed the title from [SPARK-54619] Add a sanity check for configuration numbers to [SPARK-54619][PYTHON] Add a sanity check for configuration numbers on Dec 6, 2025
@gaogaotiantian
Contributor Author

Hi @dongjoon-hyun, runner_conf is hand-picked by the daemon to pass to the worker on a need-to-know basis. For now, the maximum number is less than 10, so 10000 is a very safe upper limit. Also, we have only a few thousand Spark configs in total, so even if we passed everything, it would still be less than 10000.

I don't think this should be controllable by an env var because:

  1. It's a sanity check, not a run-time validation. We just want to make sure the number at least makes some sense.
  2. One of the reasons to introduce runner_conf is to avoid passing too many arguments through env vars. Hopefully we can put more environment setup in runner_conf. Having another env var to control it is kind of against that purpose.
  3. We will have more sanity checks in the protocol - with very loose bounds, but still helpful for ruling out crazy situations. Having an env var for each of them would explode our env var namespace.

If you really hate the magic number, we can check only that the number is positive. However, that leaves plenty of unreasonable values that the sanity check would miss.

Thanks!

@dongjoon-hyun
Member

Got it. Thank you for the details.

@dongjoon-hyun dismissed their stale review on December 6, 2025 19:10

Got the rationale.


def load(self, infile):
    num_conf = read_int(infile)
    for i in range(num_conf):
Member


Let's add a bit of background here in a comment. I was also wondering why we need this as we're already controlling the confs to send.

@gaogaotiantian
Contributor Author

I updated the comments; let's see if it's clearer.

The rationale behind this is: when we change something in the protocol (passing an extra integer at a random place, which is what we often do now), it's common for things to just get "stuck" somewhere. The test hangs and we don't know what happened - we don't even know where the message went wrong.

Having sanity checks in different places in our protocol can stop the communication early, so we know the message is already wrong at that point. It's helpful for debugging.

More than that, there could be communication errors in production (rare, but possible). There could be dark corners that we forgot to test. Raising an error explicitly is always better than hanging there.

That's why I think we should introduce more sanity checks and real runtime validation checks on the data passed in.

Of course, eventually we might just want a more dedicated RPC, but for now this is helpful.
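To illustrate the pattern described above (not code from this PR), a small bounded-read helper could make such sanity checks reusable across the protocol; the helper name, bounds, and error type are hypothetical:

from pyspark.serializers import read_int


def read_bounded_int(infile, lower, upper, what):
    # Read an int from the stream and fail fast when it falls outside the
    # expected range, which usually means the stream is misaligned with the
    # protocol; the situation that otherwise shows up as a silent hang.
    value = read_int(infile)
    if value < lower or value > upper:
        raise ValueError(
            "Unexpected %s read from stream: %d (expected between %d and %d); "
            "the message likely does not conform to the protocol."
            % (what, value, lower, upper)
        )
    return value


# Hypothetical usage when loading runner_conf:
# num_conf = read_bounded_int(infile, 0, 10000, "number of configurations")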

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.2.0. Thank you, @gaogaotiantian and all.
