@gaogaotiantian
Contributor

What changes were proposed in this pull request?

Add the py4j path to sys.path when searching for tests.

Why are the changes needed?

When a user runs tests with an arbitrary Python executable (one without py4j installed), import pyspark.sql fails because py4j is not on the import path, which confuses the test-discovery script. bin/pyspark adds the py4j path, so the tests themselves run without issues. We should also add that path when the script searches for the test module.
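As a rough illustration of the idea, the sketch below prepends the bundled py4j source zip to `sys.path` when it exists. The helper name `add_py4j_to_sys_path` and its `zip_name` parameter are hypothetical and not taken from the patch; only the `python/lib/py4j-0.10.9.9-src.zip` location comes from the diff in this PR.

```python
import importlib
import os
import sys


def add_py4j_to_sys_path(spark_home, zip_name="py4j-0.10.9.9-src.zip"):
    """Prepend the bundled py4j source zip to sys.path if it exists.

    Hypothetical helper for illustration only. Returns the path that is
    on sys.path afterwards, or None when the zip file is missing.
    """
    py4j_zip = os.path.join(spark_home, "python", "lib", zip_name)
    if not os.path.isfile(py4j_zip):
        return None
    if py4j_zip not in sys.path:
        # Python can import directly from a zip archive listed on sys.path.
        sys.path.insert(0, py4j_zip)
        # Make sure the import system notices the new entry.
        importlib.invalidate_caches()
    return py4j_zip
```

Calling it once before any import check, for example `add_py4j_to_sys_path(os.environ["SPARK_HOME"])`, would be enough; a missing zip is reported as `None` rather than raising.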

Does this PR introduce any user-facing change?

No

How was this patch tested?

Locally: the system Python does not work without the patch but works with it.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the PYTHON label Dec 9, 2025
def split_and_validate_testnames(testnames):
    testnames_to_test = []

    py4j_module_path = os.path.join(SPARK_HOME, "python/lib/py4j-0.10.9.9-src.zip")
Member

Hm, I believe we're already handling this via the bin/pyspark script (see run_individual_python_test).

Member

Does it fail on import? If that's the case, importlib.util.find_spec(name) is not None should work.
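One detail worth keeping in mind: for a dotted name, `importlib.util.find_spec` imports the parent package first, so a missing dependency in that package surfaces as a `ModuleNotFoundError` instead of a `None` result. The sketch below wraps the check to tell the two cases apart; `probe_module` is a hypothetical helper, not something in the Spark scripts.

```python
import importlib.util


def probe_module(name):
    """Return (found, missing_dependency) for a dotted module name.

    Hypothetical helper: `missing_dependency` is set only when the lookup
    failed because some other module (not a part of `name`) is absent.
    """
    try:
        return importlib.util.find_spec(name) is not None, None
    except ModuleNotFoundError as exc:
        missing = exc.name
        # The failure is about `name` itself or one of its parent packages.
        if missing is not None and (
            name == missing or name.startswith(missing + ".")
        ):
            return False, None
        # A parent package exists but failed to import a dependency.
        return False, missing
```

With this shape, a caller could report "module not found" and "module found but a dependency is missing" as separate outcomes.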

Contributor Author

@gaogaotiantian Dec 9, 2025

Yes, we do this in bin/pyspark, so we need it here as well. importlib.util.find_spec(name) fails because it can't find py4j. It can still find pyspark, which is why we get the errors.

Basically, when the script tries to locate the test module, it can find pyspark but not pyspark.sql (because py4j is not there), so it concludes that the test name should be split as pyspark sql.xxx. We need py4j in this specific script (which runs before bin/pyspark) to check the test module properly.
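To make the failure mode concrete, here is a minimal sketch of a longest-importable-prefix split. It is not the actual code of split_and_validate_testnames; `split_testname`, the injected `is_importable` predicate, and the test name `pyspark.sql.tests.test_foo` are all hypothetical and only illustrate the behaviour described above.

```python
def split_testname(testname, is_importable):
    """Split a dotted test name into (module, remainder).

    The module part is the longest dotted prefix that `is_importable`
    accepts; whatever follows is treated as the test case or method.
    """
    parts = testname.split(".")
    for end in range(len(parts), 0, -1):
        module = ".".join(parts[:end])
        if is_importable(module):
            return module, ".".join(parts[end:])
    raise ValueError(f"no importable prefix in {testname!r}")


name = "pyspark.sql.tests.test_foo"  # hypothetical test module name

# With py4j available, every prefix resolves, so the whole name is the module.
print(split_testname(name, lambda m: True))
# ('pyspark.sql.tests.test_foo', '')

# Without py4j, only the top-level package resolves, so the split is wrong.
print(split_testname(name, lambda m: m == "pyspark"))
# ('pyspark', 'sql.tests.test_foo')
```

The second call reproduces the "pyspark sql.xxx" split: the check does not fail loudly, it just picks a shorter module prefix.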

Member

Hm, the problem here is that the import may also require other dependencies; e.g., numpy is a required dependency for ML, and the check will fail without it. Can we handle this case as well?
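One possible way to avoid depending on numpy or any other package during the lookup is to check for the module file on disk rather than importing anything. This is only a sketch of that alternative and was not proposed in this thread; `module_file_exists` and its `python_root` argument are hypothetical, and the approach ignores namespace packages and modules shipped inside zip files.

```python
import os


def module_file_exists(python_root, dotted_name):
    """Check whether a dotted module name maps to a file under python_root.

    Looks for <root>/a/b/c.py or <root>/a/b/c/__init__.py without
    importing anything, so missing third-party dependencies cannot
    interfere with the result.
    """
    base = os.path.join(python_root, *dotted_name.split("."))
    return os.path.isfile(base + ".py") or os.path.isfile(
        os.path.join(base, "__init__.py")
    )
```

The trade-off is that a file-based check says nothing about whether the module can actually be imported; it only answers where the module name ends and the test name begins.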

@dongjoon-hyun changed the title from "[SPARK-54450][FOLLOWUP] Add py4j to sys path to support import check" to "[SPARK-54450][PYTHON][FOLLOWUP] Add py4j to sys path to support import check" Dec 10, 2025