Snowflake Integer Range (User Collections) #393
| if has_seq4_or_table(inner_scope.expression): | ||
| if ( | ||
| outer_scope.expression.args.get("joins") is not None | ||
| or outer_scope.expression.find(exp.Window) | ||
| or outer_scope.expression.find(exp.Limit) | ||
| or outer_scope.expression.find(exp.AggFunc) | ||
| ): | ||
| return False |
Without that check, the Table-Generator CTE gets removed and 1 + SEQ4() * 5 is substituted directly into the query in two places.
I'm open to other ways to prevent that.
Here is the query before and after optimization when that fix is absent:
WITH sizes AS (
SELECT
1 + SEQ4() * 5 AS part_size
FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
), _s0 AS (
SELECT
sizes.part_size AS part_size
FROM sizes AS sizes
), _t2 AS (
SELECT
part.p_name AS p_name,
part.p_size AS p_size
FROM tpch.part AS part
WHERE
CONTAINS(part.p_name, 'turquoise')
), _t1 AS (
SELECT
_t2.p_size AS p_size
FROM _t2 AS _t2
WHERE
TRUE
), _s1 AS (
SELECT
_t1.p_size AS p_size,
COUNT(*) AS n_rows
FROM _t1 AS _t1
GROUP BY
_t1.p_size
), _t0 AS (
SELECT
_s1.n_rows AS n_rows,
_s0.part_size AS part_size
FROM _s0 AS _s0
LEFT JOIN _s1 AS _s1
ON _s0.part_size = _s1.p_size
)
SELECT
_t0.part_size AS part_size,
COALESCE(_t0.n_rows, 0) AS n_parts
FROM _t0 AS _t0
becomes:
WITH _s1 AS (
SELECT
part.p_size AS p_size,
COUNT(*) AS n_rows
FROM tpch.part AS part
WHERE
TRUE AND CONTAINS(part.p_name, 'turquoise')
GROUP BY
part.p_size
)
SELECT
1 + SEQ4() * 5 AS part_size,
COALESCE(_s1.n_rows, 0) AS n_parts
FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
LEFT JOIN _s1 AS _s1
ON (
1 + SEQ4() * 5
) = _s1.p_size
Let's see if we can move this logic to the end of the giant `and` block at the bottom of this function, so that we call `has_seq4_or_table` as infrequently as possible.
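A minimal sketch of why placing the check last helps, assuming the function ends in a chain of cheaper conditions joined by `and`. Everything here is a hypothetical stand-in: `is_mergeable`, its parameters, and the string-based `has_seq4_or_table` are illustrative only and not the actual optimizer code, which works on expression trees.

```python
# Counts how often the (comparatively expensive) scan actually runs.
calls = {"has_seq4_or_table": 0}


def has_seq4_or_table(sql: str) -> bool:
    """Hypothetical stand-in for the tree scan: looks for SEQ4()/TABLE() in text."""
    calls["has_seq4_or_table"] += 1
    upper = sql.upper()
    return "SEQ4(" in upper or "TABLE(" in upper


def is_mergeable(cheap_checks: list[bool], inner_sql: str, outer_is_complex: bool) -> bool:
    """Hypothetical mergeability test with the SEQ4/TABLE guard placed last.

    `and` short-circuits, so the scan only runs when every cheap check has
    passed AND the outer query has joins/windows/limits/aggregations.
    """
    return all(cheap_checks) and not (
        outer_is_complex and has_seq4_or_table(inner_sql)
    )
```

With this ordering, a query that already fails one of the cheap checks never triggers the scan at all.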
| if isinstance(expr, SQLGlotColumn): | ||
| if isinstance(expr.this, Identifier): | ||
| return expr.this.this | ||
| return expr.this |
This is related to the SQLite code. It might be removed in the follow-up PR that adds support for other dialects.
documentation/usage.md
Outdated
| - `end`: The ending value of the range (exclusive). | ||
| - `step`: The increment between consecutive values (default: 1). | ||
|
|
||
| Supported Signatures: |
Let's also include name and column_name in these supported signatures.
| # For other units, use base implementation | ||
| return super().convert_datediff(args, types) | ||
|
|
||
| def _convert_user_generated_range( |
Type hints are missing for query, inner_select, and subquery.
| """Represents a user-generated collection of values.""" | ||
|
|
||
| def __init__(self, user_collection: PyDoughUserGeneratedCollection): | ||
| self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,) |
NIT:
| self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,) | |
| self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection) |
Without the trailing comma, it won't be a tuple:
>>> x = (1)
>>> type(x)
<class 'int'>
>>> x = (1,)
>>> type(x)
<class 'tuple'>
| ```python | ||
| import pydough | ||
|
|
||
| my_range = pydough.range_collection( |
Can you add the output for this code? It would be useful for the user to see what the result looks like.
|
|
||
| ## Available APIs | ||
|
|
||
| ### [range_collection.py](range_collection.py) |
What is the difference between these two sections? Why are they separate if the content is almost the same?
Right now they're similar. This section is specific to that one file; the other will include new operations as we add them.
| def simple_range_4(): | ||
| # Generate a table with 1 column named `N` counting backwards | ||
| # from 10 to 1 (inclusive) | ||
| return pydough.range_collection("T2", "N", 10, 0, -1).ORDER_BY(N.ASC()) |
I didn't know about this feature. Can we add it to the documentation? (Or maybe I missed it.)
This is the feature the PR is adding: generating ranges of numbers.
Do you mean counting backward? That is part of the range feature. I'm not sure what you mean by adding it to the documentation.
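For reference, a small sketch of the values involved, assuming range_collection follows the semantics of Python's built-in `range` (exclusive end, as the documentation hunk above states). `range_values` is a hypothetical helper written for this illustration; it is not part of PyDough.

```python
def range_values(start: int, end: int, step: int = 1) -> list[int]:
    """Hypothetical helper: the values a range collection would hold,
    assuming it mirrors Python's built-in range (end is exclusive)."""
    return list(range(start, end, step))


# Same arguments as simple_range_4: count backwards from 10, stopping before 0.
backwards = range_values(10, 0, -1)
print(backwards)          # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# The test then orders by N ascending.
print(sorted(backwards))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```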
knassre-bodo
left a comment
A few more comments to address, but once those are handled we can merge it :)
Great job @hadia206
documentation/usage.md
Outdated
| <!-- TOC --><a name="user-collection-apis"></a> | ||
| ## User Collection APIs | ||
|
|
||
| > [!WARNING] | ||
| > NOTE: User collections are currently supported **only in the Snowflake context**. | ||
| This section describes APIs for dynamically creating PyDough collections and using them alongside other data sources. |
These shouldn't be in usage.md; they should be in dsl.md. Add a section called User Generated Collections (between Collection Operators and Induced Properties). Also, we can delete the Induced Properties section.
| @property | ||
| def all_terms(self) -> set[str]: | ||
| """ | ||
| The set of expression/subcollection names accessible by the context. | ||
| """ | ||
| return self.calc_terms |
Still need to address this. See how CollectionAccess has a property _all_property_names containing all of the calc terms AND all of the ancestral terms.
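A rough sketch of the idea, assuming a context only needs its own calc terms plus whatever its ancestors expose. The `Context` class below is a hypothetical, heavily simplified stand-in and does not reproduce how CollectionAccess actually builds _all_property_names.

```python
class Context:
    """Hypothetical, simplified stand-in for a PyDough collection context."""

    def __init__(self, calc_terms: set[str], ancestor: "Context | None" = None):
        self.calc_terms = calc_terms
        self.ancestor = ancestor

    @property
    def all_terms(self) -> set[str]:
        """Own calc terms plus every term reachable through the ancestors."""
        terms = set(self.calc_terms)  # copy so calc_terms is not mutated
        if self.ancestor is not None:
            terms |= self.ancestor.all_terms
        return terms
```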
| def to_string(self) -> str: | ||
| return f"range_collection(table={self.name}, column={self.collection.columns}, range=({self.collection.data.start}, {self.collection.data.stop}, {self.collection.data.step}))" |
This needs to be more generic so it works for all user generated collections, not just range collections. I think it should just be f"UserCollection[{self.collection.to_string()}]"
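One way this could look, as a sketch only: each user generated collection provides its own to_string(), and the node wraps it. The class names and the exact string produced by the range collection below are assumptions made for illustration, not the PR's actual classes.

```python
from abc import ABC, abstractmethod


class UserGeneratedCollection(ABC):
    """Hypothetical base class: every user collection describes itself."""

    @abstractmethod
    def to_string(self) -> str: ...


class RangeCollection(UserGeneratedCollection):
    """Hypothetical range collection holding a name, a column, and range bounds."""

    def __init__(self, name: str, column: str, start: int, stop: int, step: int):
        self.name, self.column = name, column
        self.start, self.stop, self.step = start, stop, step

    def to_string(self) -> str:
        return (
            f"RangeCollection({self.name}, {self.column}, "
            f"{self.start}, {self.stop}, {self.step})"
        )


class UserCollectionNode:
    """Hypothetical node: it never inspects range-specific fields itself."""

    def __init__(self, collection: UserGeneratedCollection):
        self.collection = collection

    def to_string(self) -> str:
        return f"UserCollection[{self.collection.to_string()}]"
```

Because the node delegates to the collection, adding another kind of user collection later would not require touching the node's to_string.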
| ): | ||
| assert ancestor is not None | ||
| super().__init__(ancestor) | ||
| self._collection = collection |
Add a type annotation here
|
|
||
| # PYDOUGH CHANGE: avoid merging CTEs when the inner scope uses | ||
| # SEQ4()/TABLE() and if any of these exist in the outer query: | ||
| # - joins - window functions - aggregations - limit/offset |
I think the formatting of this got messed up
|
|
||
| match collection: | ||
| case RangeGeneratedCollection(): | ||
| return self._convert_user_generated_range(collection) |
NIT: rename to convert_user_generated_range
| result = "generated_collection(" | ||
| result += f"name={unqualified._parcel[0].name!r}, " | ||
| result += f"columns=[{', '.join(unqualified._parcel[0].columns)}]," | ||
| result += f"data={unqualified._parcel[0].data}" | ||
| return result + ")" |
Let's rely on unqualified._parcel[0].to_string() to help us out here.
| this=sqlglot_expressions.Literal.number(collection.start), | ||
| expression=sqlglot_expressions.Mul( | ||
| this=sqlglot_expressions.Anonymous(this="SEQ4"), | ||
| expression=sqlglot_expressions.Literal.number( | ||
| collection.step | ||
| ), | ||
| ), | ||
| ), |
Minor optimizations here for cleanliness:
- Skip the `Add` if `collection.start == 0`
- Skip the `Mul` if `collection.step == 1`
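The two simplifications can be sketched as follows. To keep the example dependency-free it builds the SQL text directly; the real code constructs sqlglot expression nodes instead, and `seq_expression` is a hypothetical name used only here.

```python
def seq_expression(start: int, step: int) -> str:
    """Hypothetical helper: SQL text for `start + SEQ4() * step`,
    dropping the multiplication and addition when they are no-ops."""
    term = "SEQ4()"
    if step != 1:
        # Only multiply when the step actually changes the value.
        term = f"{term} * {step}"
    if start != 0:
        # Only add when the range does not start at zero.
        term = f"{start} + {term}"
    return term


print(seq_expression(1, 5))  # 1 + SEQ4() * 5
print(seq_expression(0, 1))  # SEQ4()
```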
| @property | ||
| def name(self) -> str: | ||
| """Returns the name of the generated table.""" | ||
| return self.collection.name |
Why do we need the name here? By now, relational nodes don't care about the names.
This is the table name. Keeping it here makes it easier to access later.