Adding optimization rewrite pass to utilize server with information about masked columns #443

knassre-bodo · 2025-10-09T04:49:56Z

Augmenting relational optimization to rewrite expressions containing an UNMASK operator when a server is mounted to the PyDough session (and the PYDOUGH_ENABLE_MASK_REWRITES environment variable is set to 1):

When this is the case, two additional shuttles are added to the additional_shuttles list, before the masking literal comparisons shuttle.
The first shuttle, MaskServerCandidateShuttle, is a no-op shuttle that traverses the entire tree to find expressions that can potentially be rewritten and adds them to a pool.
The second shuttle, MaskServerRewriteShuttle, looks for expressions in the candidate shuttle's pool. Once it finds one, it sends every candidate in the pool to the mask server in a single batch request and processes the results to create the new relational node. The candidate pool is then emptied so future invocations do not redo the same batch calculation.
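The two-pass flow described above can be sketched roughly as follows. This is a minimal illustration, not the PyDough implementation: `Expr`, `CandidateVisitor`, `RewriteShuttle`, and the `send_batch` callable are hypothetical stand-ins, and the "server" is a local function.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for a relational expression."""
    text: str
    rewritable: bool = False


@dataclass
class CandidateVisitor:
    """First pass: collects rewritable expressions, changes nothing."""
    candidate_pool: set = field(default_factory=set)

    def visit(self, exprs):
        for expr in exprs:
            if expr.rewritable:
                self.candidate_pool.add(expr)


class RewriteShuttle:
    """Second pass: the first pooled expression it meets triggers one batch."""

    def __init__(self, visitor, send_batch):
        self.visitor = visitor
        self.send_batch = send_batch  # assumed: list[Expr] -> list[Expr | None]
        self.responses = {}

    def process_batch(self):
        batch = sorted(self.visitor.candidate_pool, key=lambda e: e.text)
        for expr, answer in zip(batch, self.send_batch(batch)):
            if answer is not None:
                self.responses[expr] = answer
        # Empty the pool so later visits do not repeat the same request.
        self.visitor.candidate_pool.clear()

    def rewrite(self, expr):
        if expr in self.visitor.candidate_pool:
            self.process_batch()
        return self.responses.get(expr, expr)


calls = []


def fake_server(batch):
    """Local stand-in for the mask server; records each batch size."""
    calls.append(len(batch))
    return [Expr(f"REWRITTEN({e.text})") for e in batch]


tree = [
    Expr("YEAR(UNMASK(x)) == 2024", rewritable=True),
    Expr("a + b"),
    Expr("UNMASK(y) < 3", rewritable=True),
]
visitor = CandidateVisitor()
visitor.visit(tree)
shuttle = RewriteShuttle(visitor, fake_server)
rewritten = [shuttle.rewrite(e) for e in tree]
```

In this sketch the server is contacted once for both candidates, and the non-candidate `a + b` passes through untouched.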

tests/test_mock_mask_server.py

pydough/conversion/mask_server_candidate_visitor.py

john-sanchez31

Comments regarding IN and ISIN operators and a type hint

pydough/conversion/mask_server_candidate_visitor.py

pydough/mask_server/mask_server_candidate_visitor.py

john-sanchez31

Just fix the missing type hints and the TODO docstrings, but overall LGTM! Nice job with the new dry run algorithm, impressive!

john-sanchez31 · 2025-11-24T22:19:35Z

pydough/mask_server/mask_server.py

    """

-    def __init__(self, base_url: str, token: str | None = None):
+    def __init__(self, base_url: str, server_address: str, token: str | None = None):


What would be the difference between base_url and server_address?

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.

I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Yes, that is correct.

Where will server_address be configured?

When you configure/mount the MaskServerInfo class, you pass in the server_address (same place the token gets passed).

john-sanchez31 · 2025-11-24T22:38:58Z

pydough/mask_server/mask_server.py

+                response: dict = item.get("response", None)
+                if response is None:
+                    # In this case, use a dummy value as a default to indicate
+                    # the dry run was successful


Did you mean to indicate the dry run was unsuccessful?

No, I mean successful. I do need to adjust this slightly.

pydough/mask_server/mask_server_candidate_visitor.py

pydough/mask_server/mask_server_rewrite_shuttle.py

pydough/mask_server/min_cover_set.py

pydough/configs/session.py

juankx-bodo · 2025-11-26T02:02:04Z

pydough/conversion/relational_converter.py

                        pydop.MaskedExpressionFunctionOperator(
-                            hybrid_expr.column.column_property, True
+                            hybrid_expr.column.column_property,
+                            node.collection.collection.table_path,


Is this the reason why we need to use the full table path in metadata?

EXACTLY (plus it's a good idea in general)

juankx-bodo · 2025-11-26T02:10:48Z

pydough/conversion/relational_converter.py

-    # PYDOUGH_ENABLE_MASK_REWRITES is set to 1.
+    # PYDOUGH_ENABLE_MASK_REWRITES is set to 1. If a masking rewrite server has
+    # been attached to the session, include the shuttles for that as well.
    if os.getenv("PYDOUGH_ENABLE_MASK_REWRITES") == "1":


Is there a reason why PYDOUGH_ENABLE_MASK_REWRITES is not in PyDoughConfigs?

Because we wanted an environment variable as a "switch"
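As a rough sketch of the "switch" behavior discussed here: the environment variable gates all mask rewrites, and the server shuttles are only added when a server is also attached. The function names and the `MaskLiteralComparisonShuttle` label are hypothetical; only the variable name and the two server shuttle names come from this PR.

```python
import os


def mask_rewrites_enabled(environ=None) -> bool:
    """True only when the switch variable is exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get("PYDOUGH_ENABLE_MASK_REWRITES") == "1"


def select_shuttles(server_attached: bool, environ=None) -> list[str]:
    """Return the (illustrative) shuttle names, in the order they would run."""
    shuttles: list[str] = []
    if mask_rewrites_enabled(environ):
        if server_attached:
            # The server shuttles go before the literal comparison shuttle.
            shuttles += ["MaskServerCandidateShuttle", "MaskServerRewriteShuttle"]
        shuttles.append("MaskLiteralComparisonShuttle")  # hypothetical name
    return shuttles
```

Passing `environ` explicitly keeps the sketch testable without mutating the real process environment.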

juankx-bodo · 2025-11-26T02:31:59Z

pydough/mask_server/mask_server.py

    """

-    def __init__(self, base_url: str, token: str | None = None):
+    def __init__(self, base_url: str, server_address: str, token: str | None = None):


I think base_url is to contact the predicate server and server_address is to write the Fully Qualified Column Name as f"{server_address}/{table_path}". Is this correct?

Where will server_address be configured? This is very specific to the database instance we are connecting to. For example, metadata can be re-used for the same database on different servers, even with different engines. However, the server_address is directly associated (1:1) with the database instance.

juankx-bodo · 2025-11-26T02:47:59Z

pydough/mask_server/mask_server.py

+        for idx, item in enumerate(batch):
+            pyd_logger.info(
+                f"({idx + 1}) {item.table_path}.{item.column_name}: {item.expression}"
+            )


Should this log entry be debug level instead of info?

🤷 Rn I'm keeping everything the same logging level for simplicity. We can revise down if we think it is appropriate.

juankx-bodo · 2025-11-26T02:55:58Z

pydough/mask_server/mask_server.py

+        request: ServerRequest = self.generate_request(
+            batch, path, method, dry_run, hard_limit
+        )
        response_json = self.connection.send_server_request(request)


In case of a predicate_server failure, users will not be able to query the database at all. Not even with the MASK functions. This could be a critical point of failure for the system.

Failure vs error are very different. If there is a legitimate error with connecting to the server, my understanding was that we wanted to abort. If the server responds just fine but indicates it failed to derive an answer, then that's fine and we proceed normally.
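The distinction drawn in this reply might look like the sketch below. The reply shape (`"result"`, `"rewritten"`), the status string, and the exception class are assumptions made for illustration; the actual server contract is not shown in this thread.

```python
class MaskServerConnectionError(Exception):
    """Hypothetical error raised when the server cannot be reached."""


def apply_server_answer(expression, send):
    """Return the rewritten expression, or the original if no answer exists.

    `send` is an assumed callable that contacts the server and returns a dict.
    """
    try:
        reply = send(expression)
    except OSError as exc:
        # Error: a genuine connection problem, so abort.
        raise MaskServerConnectionError(str(exc)) from exc
    if reply.get("result") != "SUCCESS":
        # Failure: the server responded but could not derive an answer,
        # so the query proceeds with the original expression.
        return expression
    return reply["rewritten"]
```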

pydough/mask_server/mask_server.py

juankx-bodo · 2025-11-26T03:04:23Z

pydough/mask_server/mask_server.py

-                "column_reference": f"{item.table_path}.{item.column_name}",
+                "column_ref": {
+                    "kind": "fqn",
+                    "value": f"{self.server_address}.{item.table_path}.{item.column_name}",


The separator should be "/". item.table_path is a composite name whose elements are separated by ".". Any element can be enclosed in double quotes or backticks and can contain "." as part of its name. Additionally, any character in the name that equals the enclosing character is escaped by doubling it.

We could wait to see the real implementation before making this kind of change.

Gonna do this. The problem for table path is how to handle varying edge cases of what table_path looks like:

db.schema.col -> db/schema/col

"a.b"."c.d"."e.f" -> a.b/c.d/e.f
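One way to handle the two cases above is a small quote-aware splitter. This is a sketch under the quoting rules stated in the review comment (double quotes or backticks, with the enclosing character escaped by doubling); `split_table_path` and `to_slash_path` are hypothetical names, not the helpers used in the PR.

```python
def split_table_path(path: str) -> list[str]:
    """Split a dotted path into elements, honoring "..." and `...` quoting."""
    parts: list[str] = []
    current: list[str] = []
    quote = None  # the active enclosing character, if any
    i = 0
    while i < len(path):
        ch = path[i]
        if quote is not None:
            if ch == quote:
                if i + 1 < len(path) and path[i + 1] == quote:
                    # Doubled enclosing character: a literal quote.
                    current.append(quote)
                    i += 1
                else:
                    quote = None  # closing quote
            else:
                current.append(ch)  # dots inside quotes are plain text
        elif ch in ('"', "`"):
            quote = ch
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def to_slash_path(path: str) -> str:
    """Join the unquoted elements with "/"."""
    return "/".join(split_table_path(path))
```

Note that the sketch does not address element names that themselves contain "/", which would need a separate decision.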

juankx-bodo · 2025-11-26T14:55:17Z

pydough/mask_server/mask_server.py

+
        assert batch != [], "Batch cannot be empty."

        path: str = "v1/predicates/batch-evaluate"


path could be a class variable, so we don't need to pass it as a parameter to other class methods like generate_request()

Good point, moved into a class var of MaskServerInfo that gets passed into the ServerRequest by generate_request

juankx-bodo · 2025-11-26T15:04:32Z

pydough/mask_server/mask_server.py

+        self,
+        batch: list[MaskServerInput],
+        path: str,
+        method: RequestMethod,


Including path and method in the parameters looks like an attempt to make generate_request() more general. However, given all the other specific parameters and actions, I think this method is very specific to batch-evaluate. Maybe path and method could be class properties, since they will not change for this method. If more request methods are required in the future, those paths and methods could also be part of the class.

I think method doesn't need to be a class property; it can just get baked into the method's construction of a ServerRequest instance.
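The outcome of this thread can be pictured as below: the path lives on the class and the HTTP method is fixed where the request is built. The class, method, and path string appear in the PR; the attribute name `batch_evaluate_path`, the `POST` member, the `ServerRequest` fields, and the payload keys are assumptions for the sake of a runnable sketch.

```python
from dataclasses import dataclass
from enum import Enum


class RequestMethod(Enum):
    POST = "POST"  # assumed member; the real enum is not shown here


@dataclass
class ServerRequest:
    path: str
    method: RequestMethod
    payload: dict


class MaskServerInfo:
    # Class-level constant, so callers never pass the path around.
    batch_evaluate_path: str = "v1/predicates/batch-evaluate"

    def generate_request(self, batch: list[dict], dry_run: bool) -> ServerRequest:
        assert batch != [], "Batch cannot be empty."
        return ServerRequest(
            path=self.batch_evaluate_path,
            method=RequestMethod.POST,  # baked in rather than passed as a parameter
            payload={"items": batch, "dry_run": dry_run},  # hypothetical shape
        )
```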

juankx-bodo · 2025-11-26T15:19:37Z

pydough/mask_server/mask_server.py

        """
-        Generate a list of server outputs from the server response.
+        Generate a list of server outputs from the server response of a
+        non-dry-run request.


What happens when the request is a dry-run? We are calling generate_result(response_json) in L174 for all batch-evaluate requests.

I didn't like the design idea of having the dry-run and the actual call on the same API path, because they are different things called at different times. We can't change that, but could it make sense to separate them on our side? Or at least separate how we process the response?

This comment should be rolled-back. The function is the same for both, the difference is that dry-runs have an empty payload for the records.
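To illustrate how a single function could serve both modes, here is a sketch in which a successful item without a `"response"` entry is treated as a successful dry run. Only the `item.get("response", None)` lookup is taken from the diff quoted earlier in this review; the `"items"`, `"result"`, and `"records"` keys, the status string, and the sentinel are assumptions.

```python
DRY_RUN_OK = object()  # hypothetical sentinel meaning "dry run succeeded"


def generate_result(response_json: dict) -> list:
    """Return one entry per item: None, DRY_RUN_OK, or a list of records."""
    results = []
    for item in response_json.get("items", []):
        if item.get("result") != "SUCCESS":
            results.append(None)  # the server could not derive an answer
            continue
        response = item.get("response", None)
        if response is None:
            # No payload: use a dummy value to mark the dry run as successful.
            results.append(DRY_RUN_OK)
        else:
            results.append(response.get("records", []))
    return results
```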

juankx-bodo · 2025-11-26T15:31:32Z

pydough/mask_server/mask_server_candidate_visitor.py

+    - `DATEDIFF`
+    """
+
+    PREDICATE_OPERATORS: set[str] = {


What is the criteria for a predicate operator to be included here?

These are the operators that are actually predicates, i.e. they return a boolean.

E.g. SUBSTRING can be inside the expression, but should not be the expression itself.

E.g. we wouldn't send abs(expr + 2) to the predicate server, but we would send abs(expr + 2) < 13, we wouldn't send LOWER(expr[:5]) but we would send CONTAINS(LOWER(expr[:5]), 'a')
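The rule in these examples amounts to checking only the outermost operator. A minimal sketch, assuming a toy `Call` node and a small illustrative operator set (the real contents of PREDICATE_OPERATORS are not shown in this excerpt):

```python
from dataclasses import dataclass

# Illustrative subset only; these names are assumptions.
PREDICATE_OPERATORS: set[str] = {"LT", "EQ", "CONTAINS"}


@dataclass(frozen=True)
class Call:
    """Hypothetical function-call node: an operator name plus arguments."""
    op: str
    args: tuple = ()


def is_sendable(expr) -> bool:
    """Only the top-level operator must be a predicate; inner ones may be anything."""
    return isinstance(expr, Call) and expr.op in PREDICATE_OPERATORS


abs_expr = Call("ABS", (Call("ADD", ("expr", 2)),))
lower_slice = Call("LOWER", (Call("SLICE", ("expr", 0, 5)),))
```

Here `abs_expr` and `lower_slice` are not sendable on their own, but wrapping them as `Call("LT", (abs_expr, 13))` or `Call("CONTAINS", (lower_slice, "a"))` makes the whole expression a predicate.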

juankx-bodo · 2025-11-26T15:42:44Z

pydough/mask_server/mask_server_candidate_visitor.py

+                # from the earlier check.
+                for inp in input_exprs:
+                    assert inp is not None
+                    result.extend(inp)


Remember that a string literal may need to be wrapped in QUOTE if it matches an operator name.

Ahhh good point. I'll do that for literal string handling.

john-sanchez31 · 2025-12-03T22:23:43Z

pydough/mask_server/mask_server.py

                },
                ...
            ],
            "expression_format": {"name": "linear", "version": "0.2.0"}


Suggested change

"expression_format": {"name": "linear", "version": "0.2.0"}

"expression_format": {"name": "linear", "version": "0.2.0"},

john-sanchez31 · 2025-12-03T22:46:04Z

pydough/mask_server/mask_server_rewrite_shuttle.py

+    Mask Server and replacing the candidate expressions with the appropriate
+    responses from the server.


Suggested change

Mask Server and replacing the candidate expressions with the appropriate

responses from the server.

Mask Server. First sends all candidates using the dry run flag, then selects the best candidates to be replaced with the appropriate response from the Mask Server.

john-sanchez31 · 2025-12-04T16:04:42Z

pydough/mask_server/mask_server_candidate_visitor.py

+        self.processed_candidates: set[RelationalExpression] = set()
+        """
+        The set of all relational expressions that have already been added to
+        the candidate pool at least once. This is used to avoid adding the same


pydough/mask_server/mask_server_candidate_visitor.py

hadia206

Well done!
I have minor comments but overall great work!
Thanks for the efforts

hadia206 · 2025-12-22T19:48:09Z

pydough/errors/error_utils.py

        }

-    def _split_identifier(self, name: str) -> list[str]:
+    @staticmethod


Why change to static?

So it can be used from other files w/o creating an instance of this class

hadia206 · 2025-12-22T20:04:19Z

pydough/mask_server/mask_server_rewrite_shuttle.py

+        # of the same expression. The responses will be stored in self.responses
+        # for later lookup.
+        if expr in self.candidate_visitor.candidate_pool:
+            self.process_batch()


Can we add a comment to explain why batching happens here and not at the end of traversal?

hadia206 · 2025-12-22T20:06:22Z

pydough/mask_server/mask_server_rewrite_shuttle.py

+            assert mask_op.masking_metadata.server_masked
+            assert mask_op.masking_metadata.server_dataset_id is not None


nit: Should these be asserts or explicit checks and exceptions?

I'm treating these as assertions since if these conditions are not true, then it shouldn't have been placed in the candidate pool in the first place.

hadia206 · 2025-12-22T20:16:24Z

pydough/mask_server/mask_server_candidate_visitor.py

+        expression. This is used to build the `heritage_tree` mapping.
+        """
+
+    def reset(self):


hadia206 · 2025-12-22T20:18:42Z

pydough/mask_server/mask_server_candidate_visitor.py

+        input_op: pydop.MaskedExpressionFunctionOperator
+        input_expr: RelationalExpression
+        combined_exprs: list[str | int | float | None | bool] | None
+


nit: following what happens with the stack is hard. Adding an example in the code that shows the flow would be good.

hadia206 · 2025-12-22T20:19:10Z

pydough/mask_server/mask_server_candidate_visitor.py

+        self.processed_candidates: set[RelationalExpression] = set()
+        """
+        The set of all relational expressions that have already been added to
+        the candidate pool at least once. This is used to avoid adding the same


hadia206 · 2025-12-22T20:20:43Z

pydough/mask_server/mask_server_candidate_visitor.py

+        # If there are zero unmasking operators in the inputs, or more than
+        # one, this expression is not a candidate.


I may have missed it, but can you explain why more than one is not a candidate?

YEAR(x) == 2024 is fine

MONTH(y) == 6 is fine

(YEAR(x) == 2024) & (MONTH(y) == 6) is not fine because the predicate is on x and y, so we can't rewrite it as x IN (...) or y IN (...)

However, we could do (YEAR(x) == 2024) & (MONTH(x) == 6), because that is a predicate on just x
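Going by the examples in this reply, what matters is how many distinct masked columns a predicate touches. The sketch below uses nested tuples as a toy expression form, with `("UNMASK", "x")` marking an unmasked column; this representation and the helper names are hypothetical and not how PyDough models expressions.

```python
def unmasked_columns(expr) -> set[str]:
    """Collect the distinct columns that appear under an UNMASK node."""
    if not isinstance(expr, tuple):
        return set()  # literals contribute nothing
    op, *args = expr
    if op == "UNMASK":
        return {args[0]}
    cols: set[str] = set()
    for arg in args:
        cols |= unmasked_columns(arg)
    return cols


def is_candidate(expr) -> bool:
    """A predicate is rewritable only if it depends on exactly one masked column."""
    return len(unmasked_columns(expr)) == 1


year_x = ("EQ", ("YEAR", ("UNMASK", "x")), 2024)
month_y = ("EQ", ("MONTH", ("UNMASK", "y")), 6)
month_x = ("EQ", ("MONTH", ("UNMASK", "x")), 6)
```

With these definitions, `year_x` and `month_y` each qualify, `("AND", year_x, month_y)` does not because it spans x and y, and `("AND", year_x, month_x)` does because it is a predicate on x alone.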

hadia206 · 2025-12-22T20:21:31Z

pydough/mask_server/mask_server_candidate_visitor.py

+                # from the earlier check.
+                for inp in input_exprs:
+                    assert inp is not None
+                    result.extend(inp)


hadia206 · 2025-12-22T20:47:41Z

pydough/sqlglot/override_pushdown_predicates.py

+
+def contains_real_aggregate(expression) -> bool:
+    """
+    Check if the expression contains a real aggregate function (e.g. SUM, AVG),


why do we need this check?

Because we discovered an unfortunate bug where MIN / MAX are treated by SQLGlot as aggregations even when doing MIN(a, b, c) (which is how some dialects do LEAST / GREATEST). This is highly problematic because it would make SQLGlot do buggy stuff during filter pushdown.
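The idea behind the check can be shown on a toy tree. This sketch deliberately avoids SQLGlot's classes: `Func`, the `AGGREGATES` set, and the multi-argument rule are assumptions chosen to mirror the explanation above, and the body of the real contains_real_aggregate is not shown in this excerpt.

```python
from dataclasses import dataclass

AGGREGATES = {"SUM", "AVG", "COUNT", "MIN", "MAX"}  # illustrative set


@dataclass(frozen=True)
class Func:
    """Hypothetical function-call node: a name plus positional arguments."""
    name: str
    args: tuple = ()


def contains_real_aggregate(node) -> bool:
    """True if the tree holds an aggregate other than multi-argument MIN / MAX."""
    if not isinstance(node, Func):
        return False
    if node.name in AGGREGATES:
        # MIN(a, b, c) acts like LEAST and MAX(a, b, c) like GREATEST:
        # scalar functions, not aggregations.
        scalar_min_max = node.name in {"MIN", "MAX"} and len(node.args) > 1
        if not scalar_min_max:
            return True
    return any(contains_real_aggregate(arg) for arg in node.args)
```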

hadia206 · 2025-12-22T20:48:06Z

pydough/sqlglot/override_pushdown_predicates.py

    return expression
+
+
+def pushdown(condition, sources, scope_ref_count, dialect, join_index=None):


If this is new code (not copy/paste from SQLGlot), add a docstring

Nope, the only new thing is the parts with contains_real_aggregate (which is a brand new function).

knassre-bodo added 10 commits October 9, 2025 00:49

Initial implementaitons of candidate vs rewrite shuttle

4d6488c

Initial implementation of predicate server integration working on cry…

5369379

…ptbank_filter_count_01

WIP adding to lookup table

36cab6e

Rewriting the rest of the filter count queries

ed6650c

Moving server address into mask server info setup

cc2bbed

[RUN ALL]

a6d4b29

Adding more tests

beadb15

Merge branch 'main' into kian/mask_server_rewrite

1b4bcac

Switching up relational shuttle handling for simplification

5ea82f1

Minor adjustments to file placement

f0f512c

knassre-bodo commented Oct 15, 2025

tests/test_mock_mask_server.py

knassre-bodo added 5 commits October 15, 2025 13:32

Moved some logic from rewrite shuttle to candidate visitor

54ecef1

Added more tests

557aaeb

Added rewrite shuttle docstrings/comments

6b109d9

Adding remaining documentation

1377916

Removing dead rule

891c472

knassre-bodo marked this pull request as ready for review October 16, 2025 19:08

knassre-bodo added 2 commits October 16, 2025 15:08

Merge branch 'main' into kian/mask_server_rewrite

7d7580b

[RUN ALL]

62db4bf

knassre-bodo requested review from a team, hadia206, john-sanchez31 and juankx-bodo and removed request for a team October 16, 2025 19:09

[RUN ALL]

c9f6a59

knassre-bodo commented Oct 16, 2025

pydough/conversion/mask_server_candidate_visitor.py

john-sanchez31 reviewed Oct 17, 2025

pydough/conversion/mask_server_candidate_visitor.py

pydough/conversion/mask_server_candidate_visitor.py

pydough/mask_server/mask_server_candidate_visitor.py

knassre-bodo added 3 commits October 26, 2025 09:25

Adding logging to keep track of the batch requests sent

7c37110

Ensuring non-predicate sub-expressions are not sent to the server [RU…

127244f

…N CI]

Ensuring non-predicate sub-expressions are not sent to the server [RU…

1f2dc6d

…N CI]

knassre-bodo added 8 commits November 19, 2025 11:45

Adding four-phase algorithm, need to implement step #3

0371ec5

Updating rewrite handling, need to add DP algorithm

3996ced

Finishing implementation of min cover set

29e0e3f

Added edge case tests for selection algorithm

f9c05b2

Minor test adjustment

4f274fd

Minor test adjustment

18379ef

Merge branch 'main' into kian/mask_server_rewrite

f512f8b

Resolving conflicts [RUN ALL]

90f0671

knassre-bodo requested review from john-sanchez31 and juankx-bodo November 24, 2025 18:43

john-sanchez31 reviewed Nov 25, 2025

juankx-bodo reviewed Nov 26, 2025

knassre-bodo added 3 commits November 26, 2025 11:05

Merge branch 'main' into kian/mask_server_rewrite

f6a571b

Added the FQN slash handling

b728348

Revisions, QUOTE operator handling, docstrings/documentation [RUN ALL]

8e03b04

knassre-bodo requested review from john-sanchez31 and juankx-bodo December 2, 2025 19:14

Fixing mask server tests [RUN ALL]

a3c79cf

john-sanchez31 reviewed Dec 4, 2025

knassre-bodo added 4 commits December 10, 2025 11:47

API-based revisions overhaul WIP

32d7ee2

Mask server working, need to iron out kinks with 'retail_transactions…

0ed7303

…_filter' test

More documentation

7e98a09

[RUN CI]

28c7478

hadia206 approved these changes Dec 22, 2025

knassre-bodo added 4 commits December 22, 2025 19:53

More tests after TS fixed, still need to iterate and remove prints [R…

af7089e

…UN ALL]

Edge case debugging WIP

f619356

Adding PYDOUGH_MASK_SERVER_PATH to CI

8a5f82d

Resolving conflicts

aaf9af1

