Snowflake Integer Range (User Collections) #393

hadia206 · 2025-07-15T00:25:36Z

No description provided.

pydough/qdag/collections/user_collections.py

pydough/qdag/collections/range_collection.py

pydough/qdag/collections/user_collections.py

pydough/qdag/collections/user_collection_qdag.py

pydough/conversion/hybrid_tree.py

pydough/sqlglot/sqlglot_relational_visitor.py

pydough/qdag/collections/user_collection_qdag.py

pydough/relational/relational_nodes/generated_table.py

pydough/conversion/hybrid_operations.py

pydough/qdag/collections/user_collection_qdag.py

pydough/user_collections/range_collection.py

tests/test_pipeline_tpch_custom.py

tests/test_pydough_functions/user_collections.py

hadia206 · 2025-10-31T04:31:52Z

pydough/sqlglot/override_merge_subqueries.py

+    if has_seq4_or_table(inner_scope.expression):
+        if (
+            outer_scope.expression.args.get("joins") is not None
+            or outer_scope.expression.find(exp.Window)
+            or outer_scope.expression.find(exp.Limit)
+            or outer_scope.expression.find(exp.AggFunc)
+        ):
+            return False


Without this check, the merge removed the table-generator CTE and inlined 1 + SEQ4() * 5 in two places in the query.
I'm open to other ways to prevent that.

Here's the query before and after the merge, without that fix.

WITH sizes AS (
  SELECT 1 + SEQ4() * 5 AS part_size
  FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
), _s0 AS (
  SELECT sizes.part_size AS part_size
  FROM sizes AS sizes
), _t2 AS (
  SELECT part.p_name AS p_name, part.p_size AS p_size
  FROM tpch.part AS part
  WHERE CONTAINS(part.p_name, 'turquoise')
), _t1 AS (
  SELECT _t2.p_size AS p_size
  FROM _t2 AS _t2
  WHERE TRUE
), _s1 AS (
  SELECT _t1.p_size AS p_size, COUNT(*) AS n_rows
  FROM _t1 AS _t1
  GROUP BY _t1.p_size
), _t0 AS (
  SELECT _s1.n_rows AS n_rows, _s0.part_size AS part_size
  FROM _s0 AS _s0
  LEFT JOIN _s1 AS _s1 ON _s0.part_size = _s1.p_size
)
SELECT _t0.part_size AS part_size, COALESCE(_t0.n_rows, 0) AS n_parts
FROM _t0 AS _t0

to

WITH _s1 AS (
  SELECT part.p_size AS p_size, COUNT(*) AS n_rows
  FROM tpch.part AS part
  WHERE TRUE AND CONTAINS(part.p_name, 'turquoise')
  GROUP BY part.p_size
)
SELECT 1 + SEQ4() * 5 AS part_size, COALESCE(_s1.n_rows, 0) AS n_parts
FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
LEFT JOIN _s1 AS _s1 ON ( 1 + SEQ4() * 5 ) = _s1.p_size

Let's see if we can move this logic to the end of the giant `and` block at the bottom of this function, so that we call has_seq4_or_table as infrequently as possible.

hadia206 · 2025-10-31T04:33:02Z

pydough/sqlglot/sqlglot_helpers.py

+    if isinstance(expr, SQLGlotColumn):
+        if isinstance(expr.this, Identifier):
+            return expr.this.this
+        return expr.this


This is related to the SQLite code. It might be removed in the follow-up PR that adds support for other dialects.

tests/test_sql_refsols/simple_range_1_ansi.sql

john-sanchez31 · 2025-11-05T21:44:46Z

documentation/usage.md

+- `end`: The ending value of the range (exclusive).
+- `step`: The increment between consecutive values (default: 1).
+
+Supported Signatures:


Let's also include name and column_name in these supported signatures.

john-sanchez31 · 2025-11-06T16:27:50Z

pydough/sqlglot/transform_bindings/sf_transform_bindings.py

            # For other units, use base implementation
            return super().convert_datediff(args, types)
+
+    def _convert_user_generated_range(


Type hints are missing for query, inner_select, and subquery.

john-sanchez31 · 2025-11-06T16:32:21Z

pydough/unqualified/unqualified_node.py

+    """Represents a user-generated collection of values."""
+
+    def __init__(self, user_collection: PyDoughUserGeneratedCollection):
+        self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,)


NIT:

Suggested change

self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,)

self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection)

Without the trailing comma, it won't be a tuple:

>>> x = (1)
>>> type(x)
<class 'int'>
>>> x = (1,)
>>> type(x)
<class 'tuple'>

john-sanchez31 · 2025-11-06T16:38:25Z

pydough/user_collections/README.md

+```python
+import pydough
+
+my_range = pydough.range_collection(


Can you add the output for this code? It would be useful for the user to see what the result looks like.

john-sanchez31 · 2025-11-06T16:47:45Z

pydough/user_collections/README.md

+
+## Available APIs
+
+### [range_collection.py](range_collection.py)


What is the difference between these two sections? Why are they separate if the content is almost the same?

Right now they're similar. This section is specific to that one API; the other will include new operations as we add them.

john-sanchez31 · 2025-11-06T17:04:48Z

tests/test_pydough_functions/user_collections.py

+def simple_range_4():
+    #  Generate a table with 1 column named `N` counting backwards
+    # from 10 to 1 (inclusive)
+    return pydough.range_collection("T2", "N", 10, 0, -1).ORDER_BY(N.ASC())


I didn't know about this feature. Can we add it to the documentation? (Or maybe I missed it.)

This is the feature the PR adds: generating ranges of numbers.
Do you mean the backward counting? That is just part of the range. I'm not sure what you mean by adding it to the documentation.
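For reference, the values that simple_range_4 is expected to produce can be previewed with Python's built-in range. This is only a sketch: it assumes range_collection follows the same start/end/step semantics as range (with `end` exclusive, as the usage.md excerpt states), and the helper name preview_range is hypothetical, not part of PyDough.

```python
def preview_range(start: int, end: int, step: int = 1) -> list[int]:
    """Return the values a range with these bounds would contain.

    Assumption: range_collection mirrors Python's built-in range,
    so `end` is exclusive and `step` may be negative.
    """
    return list(range(start, end, step))


# Same bounds as simple_range_4: counts backwards from 10 down to 1.
backward = preview_range(10, 0, -1)
print(backward)

# ORDER_BY(N.ASC()) in the test would then list the same values ascending.
print(sorted(backward))
```

Under that assumption, the end value 0 is never produced, which is why the test comment says "from 10 to 1 (inclusive)".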

knassre-bodo

A few more comments to address, but once those are handled we can merge it :)

Great job @hadia206

knassre-bodo · 2025-11-06T18:13:53Z

documentation/usage.md

+<!-- TOC --><a name="user-collection-apis"></a>
+## User Collection APIs
+
+> [!WARNING]  
+> NOTE: User collections are currently supported **only in the Snowflake context**.
+
+This section describes APIs for dynamically creating PyDough collections and using them alongside other data sources.


These shouldn't be in usage.md; they should be in dsl.md. Add a section called User Generated Collections (between Collection Operators and Induced Properties). Also, we can delete the Induced Properties section.

knassre-bodo · 2025-11-06T18:22:54Z

pydough/qdag/collections/user_collection_qdag.py

+    @property
+    def all_terms(self) -> set[str]:
+        """
+        The set of expression/subcollection names accessible by the context.
+        """
+        return self.calc_terms


Still need to address this. See how CollectionAccess has a property _all_property_names containing all of the calc terms AND all of the ancestral terms.
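A rough illustration of the idea, using a toy class rather than PyDough's actual QDAG nodes: all_terms would be the union of the node's own calc terms and everything its ancestors expose. The class ToyContext and the term names are hypothetical.

```python
class ToyContext:
    """Hypothetical stand-in for a QDAG collection node (not PyDough's class)."""

    def __init__(self, calc_terms: set[str], ancestor: "ToyContext | None" = None):
        self.calc_terms = calc_terms
        self.ancestor = ancestor

    @property
    def ancestral_terms(self) -> set[str]:
        # Everything visible through the ancestors, walking up the chain.
        if self.ancestor is None:
            return set()
        return self.ancestor.all_terms

    @property
    def all_terms(self) -> set[str]:
        # Own calc terms plus whatever the ancestors expose.
        return self.calc_terms | self.ancestral_terms


root = ToyContext({"global_threshold"})
range_node = ToyContext({"N"}, ancestor=root)
print(sorted(range_node.all_terms))
```

Returning only calc_terms, as in the diff above, would hide global_threshold from the range node in this toy setup.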

knassre-bodo · 2025-11-06T18:23:52Z

pydough/qdag/collections/user_collection_qdag.py

+    def to_string(self) -> str:
+        return f"range_collection(table={self.name}, column={self.collection.columns}, range=({self.collection.data.start}, {self.collection.data.stop}, {self.collection.data.step}))"


This needs to be more generic so it works for all user generated collections, not just range collections. I think it should just be f"UserCollection[{self.collection.to_string()}]"
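A minimal sketch of that delegation, with toy classes standing in for the real ones: the node wraps whatever the collection's own to_string() returns, so new collection kinds need no changes in the node. The class names and the exact string format of ToyRange.to_string are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod


class ToyUserCollection(ABC):
    """Hypothetical base class: each collection kind describes itself."""

    @abstractmethod
    def to_string(self) -> str: ...


class ToyRange(ToyUserCollection):
    def __init__(self, name: str, column: str, start: int, stop: int, step: int):
        self.name, self.column = name, column
        self.start, self.stop, self.step = start, stop, step

    def to_string(self) -> str:
        # The format here is made up for the example.
        return f"range({self.name}.{self.column}, {self.start}, {self.stop}, {self.step})"


class ToyUserCollectionNode:
    """Hypothetical QDAG-like wrapper that knows nothing about ranges."""

    def __init__(self, collection: ToyUserCollection):
        self.collection = collection

    def to_string(self) -> str:
        return f"UserCollection[{self.collection.to_string()}]"


node = ToyUserCollectionNode(ToyRange("T2", "N", 10, 0, -1))
print(node.to_string())
```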

knassre-bodo · 2025-11-06T18:24:06Z

pydough/qdag/collections/user_collection_qdag.py

+    ):
+        assert ancestor is not None
+        super().__init__(ancestor)
+        self._collection = collection


Add a type annotation here

knassre-bodo · 2025-11-06T18:25:37Z

pydough/sqlglot/override_merge_subqueries.py


+    # PYDOUGH CHANGE: avoid merging CTEs when the inner scope uses
+    # SEQ4()/TABLE() and if any of these exist in the outer query:
+    # - joins - window functions - aggregations - limit/offset


I think the formatting of this got messed up

knassre-bodo · 2025-11-06T18:29:10Z

pydough/sqlglot/transform_bindings/base_transform_bindings.py

+
+        match collection:
+            case RangeGeneratedCollection():
+                return self._convert_user_generated_range(collection)


NIT: rename to convert_user_generated_range

knassre-bodo · 2025-11-06T18:31:09Z

pydough/unqualified/unqualified_node.py

+            result = "generated_collection("
+            result += f"name={unqualified._parcel[0].name!r}, "
+            result += f"columns=[{', '.join(unqualified._parcel[0].columns)}],"
+            result += f"data={unqualified._parcel[0].data}"
+            return result + ")"


Let's rely on unqualified._parcel[0].to_string() to help us out here.

knassre-bodo · 2025-11-06T18:36:39Z

pydough/sqlglot/transform_bindings/sf_transform_bindings.py

+                            this=sqlglot_expressions.Literal.number(collection.start),
+                            expression=sqlglot_expressions.Mul(
+                                this=sqlglot_expressions.Anonymous(this="SEQ4"),
+                                expression=sqlglot_expressions.Literal.number(
+                                    collection.step
+                                ),
+                            ),
+                        ),


Minor optimizations here for cleanliness:

Skip the Add if collection.start == 0

Skip the Mul if collection.step == 1
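The two simplifications can be sketched as follows. To keep the example dependency-free it builds SQL text directly instead of the sqlglot expression tree used in the PR; the function name seq4_value_expr and the choice to parenthesize negative steps are assumptions for illustration.

```python
def seq4_value_expr(start: int, step: int) -> str:
    """Build SQL text for the i-th range value, i.e. start + SEQ4() * step.

    Skips the multiplication when step == 1 and the addition when start == 0.
    """
    term = "SEQ4()"
    if step != 1:
        # Parenthesize negative steps so the text stays unambiguous.
        step_text = f"({step})" if step < 0 else str(step)
        term = f"{term} * {step_text}"
    if start != 0:
        term = f"{start} + {term}"
    return term


print(seq4_value_expr(1, 5))    # general case: both Add and Mul are kept
print(seq4_value_expr(0, 1))    # neither is needed
print(seq4_value_expr(0, 5))    # Add skipped
print(seq4_value_expr(3, 1))    # Mul skipped
```

In the sqlglot version, the same two conditions would decide whether to wrap the SEQ4 call in a Mul node and then in an Add node.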

knassre-bodo · 2025-11-06T18:39:52Z

pydough/sqlglot/override_merge_subqueries.py

+    if has_seq4_or_table(inner_scope.expression):
+        if (
+            outer_scope.expression.args.get("joins") is not None
+            or outer_scope.expression.find(exp.Window)
+            or outer_scope.expression.find(exp.Limit)
+            or outer_scope.expression.find(exp.AggFunc)
+        ):
+            return False


Let's see if we can move this logic to the end of the giant `and` block at the bottom of this function, so that we call has_seq4_or_table as infrequently as possible.
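The reasoning relies on short-circuit evaluation: when the expensive check is the last operand of an `and` chain, it only runs if every earlier condition already passed. A small sketch of the effect, where has_seq4_or_table_stub is a hypothetical string-based stand-in for the PR's has_seq4_or_table and can_merge is a simplified stand-in for the merge predicate:

```python
calls = {"expensive": 0}


def has_seq4_or_table_stub(sql: str) -> bool:
    """Stand-in for has_seq4_or_table; counts how often it runs."""
    calls["expensive"] += 1
    upper = sql.upper()
    return "SEQ4(" in upper or "TABLE(" in upper


def can_merge(cheap_checks: list[bool], outer_is_complex: bool, inner_sql: str) -> bool:
    # `and` short-circuits: the stub only runs when every cheap check passed
    # and the outer query has a join, window, limit, or aggregate.
    return all(cheap_checks) and not (
        outer_is_complex and has_seq4_or_table_stub(inner_sql)
    )


generator_sql = "SELECT SEQ4() FROM TABLE(GENERATOR(ROWCOUNT => 20))"

# A cheap check fails, so the expensive scan is never reached.
print(can_merge([True, False], True, generator_sql), calls["expensive"])

# The outer query is simple, so the scan is skipped and the merge is allowed.
print(can_merge([True, True], False, generator_sql), calls["expensive"])

# Only now is the inner expression actually scanned, and the merge is blocked.
print(can_merge([True, True], True, generator_sql), calls["expensive"])
```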

knassre-bodo · 2025-11-06T18:40:39Z

pydough/relational/relational_nodes/generated_table.py

+    @property
+    def name(self) -> str:
+        """Returns the name of the generated table."""
+        return self.collection.name


Why do we need the name here? By now, relational nodes don't care about the names.

This is the table name. Keeping it here makes it easier to access later.

knassre-bodo reviewed Jul 15, 2025

hadia206 added 6 commits July 15, 2025 16:49

attempt 1

499c000

add collection back

c192a6e

Merge branch 'main' of https://github.com/bodo-ai/PyDough into Hadia/…

77f63d1

…update_reference

range collections base, initial try

721eecc

add range_collection to pydough top

4a53e01

add test

d0ce87f

hadia206 force-pushed the Hadia/user_collections_range branch from 207f36b to d0ce87f July 16, 2025 17:55

hadia206 changed the base branch from main to Hadia/update_reference July 16, 2025 17:56

pre-commit-ci bot and others added 6 commits July 16, 2025 17:57

[pre-commit.ci] auto fixes from pre-commit.com hooks

3c5ed66

for more information, see https://pre-commit.ci

address comments

8893d79

merge conflict

a7469d3

update test

9e66de2

make range inside UnqualifiedGeneratedCollection

025b556

[run CI] hybrid and execute

29e1e74

knassre-bodo reviewed Jul 18, 2025

pydough/conversion/hybrid_tree.py (outdated)

knassre-bodo reviewed Jul 18, 2025

pydough/conversion/hybrid_tree.py (outdated)

hadia206 and others added 5 commits July 18, 2025 15:35

fix uniqueness, singular, always exists, and add another test

dc3c686

[run CI] more tests and fix empty table

a626dc1

add other tests (skipped as they're not passing)

8491cc6

Fixing test bugs

6315f3d

Fixing hybrid/qualification/conversion bugs

d4230e4