Conversation

@hadia206
Contributor

No description provided.

@hadia206 hadia206 force-pushed the Hadia/user_collections_range branch from 207f36b to d0ce87f July 16, 2025 17:55
@hadia206 hadia206 changed the base branch from main to Hadia/update_reference July 16, 2025 17:56
Base automatically changed from Hadia/update_reference to main August 12, 2025 17:06
Comment on lines 231 to 238
    if has_seq4_or_table(inner_scope.expression):
        if (
            outer_scope.expression.args.get("joins") is not None
            or outer_scope.expression.find(exp.Window)
            or outer_scope.expression.find(exp.Limit)
            or outer_scope.expression.find(exp.AggFunc)
        ):
            return False
Contributor Author

Without this check, the merge step removed the table-generator CTE and inlined 1 + SEQ4() * 5 directly in two places in the query.
I'm open to other ways to prevent that.

Here is the query before and after merging, without the fix:

WITH sizes AS (
  SELECT
    1 + SEQ4() * 5 AS part_size
  FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
), _s0 AS (
  SELECT
    sizes.part_size AS part_size
  FROM sizes AS sizes
), _t2 AS (
  SELECT
    part.p_name AS p_name,
    part.p_size AS p_size
  FROM tpch.part AS part
  WHERE
    CONTAINS(part.p_name, 'turquoise')
), _t1 AS (
  SELECT
    _t2.p_size AS p_size
  FROM _t2 AS _t2
  WHERE
    TRUE
), _s1 AS (
  SELECT
    _t1.p_size AS p_size,
    COUNT(*) AS n_rows
  FROM _t1 AS _t1
  GROUP BY
    _t1.p_size
), _t0 AS (
  SELECT
    _s1.n_rows AS n_rows,
    _s0.part_size AS part_size
  FROM _s0 AS _s0
  LEFT JOIN _s1 AS _s1
    ON _s0.part_size = _s1.p_size
)
SELECT
  _t0.part_size AS part_size,
  COALESCE(_t0.n_rows, 0) AS n_parts
FROM _t0 AS _t0

to

WITH _s1 AS (
  SELECT
    part.p_size AS p_size,
    COUNT(*) AS n_rows
  FROM tpch.part AS part
  WHERE
    TRUE AND CONTAINS(part.p_name, 'turquoise')
  GROUP BY
    part.p_size
)
SELECT
  1 + SEQ4() * 5 AS part_size,
  COALESCE(_s1.n_rows, 0) AS n_parts
FROM TABLE(GENERATOR(ROWCOUNT => 20)) AS _q_0
LEFT JOIN _s1 AS _s1
  ON (
    1 + SEQ4() * 5
  ) = _s1.p_size
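
As a rough illustration of what the sizes CTE is meant to produce, the expression 1 + SEQ4() * 5 over 20 generated rows can be mirrored in plain Python. This is only a sketch under the assumption that SEQ4() returns 0, 1, 2, ... without gaps for this generator, which may not hold in general; the constant names are made up for the example.

```python
# Assumption: SEQ4() yields 0, 1, 2, ... (gap-free) for a small GENERATOR.
ROWCOUNT = 20        # from TABLE(GENERATOR(ROWCOUNT => 20))
START, STEP = 1, 5   # from the expression 1 + SEQ4() * 5

# Evaluate the expression once per generated row.
part_sizes = [START + seq * STEP for seq in range(ROWCOUNT)]

print(part_sizes[:4], "...", part_sizes[-1])
```

In the merged query the expression appears twice (in the SELECT list and in the join condition) instead of being computed once inside the CTE.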

Contributor

Let's see if we can move this logic to the end of the giant `and` block at the bottom of this function, so that we call has_seq4_or_table as infrequently as possible.
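
The idea behind this suggestion is Python's short-circuit evaluation: placing the expensive check last in an `and` chain means it only runs when every cheaper condition has already passed. A minimal sketch follows; the helper names, the string-based check, and the call counter are all hypothetical stand-ins and not the PR's code.

```python
# Counter used only to show how often the expensive check actually runs.
calls = {"inner_walk": 0}


def inner_uses_seq4(inner_sql: str) -> bool:
    """Hypothetical stand-in for an expensive tree walk such as has_seq4_or_table."""
    calls["inner_walk"] += 1
    return "SEQ4(" in inner_sql.upper()


def can_merge(
    cheap_conditions: list[bool], inner_sql: str, outer_is_sensitive: bool
) -> bool:
    # `and` evaluates left to right and stops at the first falsy operand,
    # so the inner walk happens only if all cheap conditions hold AND the
    # outer query has a join/window/limit/aggregation.
    return (
        all(cheap_conditions)
        and not (outer_is_sensitive and inner_uses_seq4(inner_sql))
    )
```

With this ordering, a query that already fails a cheap condition never triggers the walk over the inner scope.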

Comment on lines +30 to +33
    if isinstance(expr, SQLGlotColumn):
        if isinstance(expr.this, Identifier):
            return expr.this.this
        return expr.this
Contributor Author

This is related to the SQLite code. It might be removed in the follow-up PR that adds support for other dialects.

@hadia206 hadia206 marked this pull request as ready for review November 3, 2025 23:32
@hadia206 hadia206 changed the title from "[DRAFT] User collections: integer range" to "Snowflake Integer Range (User Collections)" Nov 5, 2025
- `end`: The ending value of the range (exclusive).
- `step`: The increment between consecutive values (default: 1).

Supported Signatures:
Contributor

Let's also include name and column_name in these supported signatures.

# For other units, use base implementation
return super().convert_datediff(args, types)

def _convert_user_generated_range(
Contributor

Type hints are missing for query, inner_select, and subquery.

    """Represents a user-generated collection of values."""

    def __init__(self, user_collection: PyDoughUserGeneratedCollection):
        self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,)
Contributor

NIT:

Suggested change
- self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection,)
+ self._parcel: tuple[PyDoughUserGeneratedCollection] = (user_collection)

Contributor Author

Without the trailing comma, it won't be a tuple:

>>> x = (1)
>>> type(x)
<class 'int'>
>>> x = (1,)
>>> type(x)
<class 'tuple'>

```python
import pydough

my_range = pydough.range_collection(
```
Contributor

Can you add the output for this code? It would be useful for the user to see what the result looks like.


## Available APIs

### [range_collection.py](range_collection.py)
Contributor

What is the difference between these two sections? Why are they separate if the content is almost the same?

Contributor Author

@hadia206 hadia206 Nov 6, 2025

Right now they're similar. This one is specific to the range collection; the other will include new operations as we add them.

def simple_range_4():
    # Generate a table with 1 column named `N` counting backwards
    # from 10 to 1 (inclusive)
    return pydough.range_collection("T2", "N", 10, 0, -1).ORDER_BY(N.ASC())
Contributor

I didn't know about this feature. Can we add it to the documentation? (Or maybe I missed it.)

Contributor Author

This is the feature the PR is adding: generating ranges of numbers.
Do you mean counting backward? That is part of the range functionality. I'm not sure what you mean by adding it to the documentation.
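
For readers unfamiliar with backward ranges: assuming range_collection follows the same start / exclusive end / step convention as Python's built-in range (which the parameter docs quoted earlier in this review suggest), the values in simple_range_4 can be previewed with plain Python. This sketch does not call PyDough at all.

```python
# Same arguments as range_collection("T2", "N", 10, 0, -1):
# start=10, end=0 (exclusive), step=-1.
values = list(range(10, 0, -1))
print(values)

# ORDER_BY(N.ASC()) in the test sorts the rows ascending.
print(sorted(values))
```

The end value 0 is excluded, which is why the comment in the test says "from 10 to 1 (inclusive)".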

Contributor

@knassre-bodo knassre-bodo left a comment

A few more comments to address, but once those are handled we can merge it :)

Great job @hadia206

Comment on lines 728 to 734
<!-- TOC --><a name="user-collection-apis"></a>
## User Collection APIs

> [!WARNING]
> NOTE: User collections are currently supported **only in the Snowflake context**.
This section describes APIs for dynamically creating PyDough collections and using them alongside other data sources.
Contributor

These shouldn't be in usage.md; they should be in dsl.md. Add a section called User Generated Collections (between Collection Operators and Induced Properties). Also, we can delete the Induced Properties section.

Comment on lines 85 to 90
    @property
    def all_terms(self) -> set[str]:
        """
        The set of expression/subcollection names accessible by the context.
        """
        return self.calc_terms
Contributor

Still need to address this. See how CollectionAccess has a property _all_property_names containing all of the calc terms AND all of the ancestral terms.
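
One possible shape for this, sketched with a hypothetical stand-alone class rather than the real PyDough node types: all_terms returns the node's own calc terms unioned with whatever the ancestor exposes. The class name Context and its constructor are assumptions made for the example; how CollectionAccess actually builds _all_property_names is not shown in this review.

```python
from typing import Optional


class Context:
    """Hypothetical stand-in for a collection node with an optional ancestor."""

    def __init__(self, calc_terms: set[str], ancestor: Optional["Context"] = None):
        self.calc_terms = calc_terms
        self.ancestor = ancestor

    @property
    def all_terms(self) -> set[str]:
        # Start from this node's own terms, then add everything that is
        # reachable through the chain of ancestors.
        terms = set(self.calc_terms)
        if self.ancestor is not None:
            terms |= self.ancestor.all_terms
        return terms
```

For example, a node with calc term "N" under an ancestor exposing "a" would report both names.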

Comment on lines 127 to 128
    def to_string(self) -> str:
        return f"range_collection(table={self.name}, column={self.collection.columns}, range=({self.collection.data.start}, {self.collection.data.stop}, {self.collection.data.step}))"
Contributor

This needs to be more generic so it works for all user generated collections, not just range collections. I think it should just be f"UserCollection[{self.collection.to_string()}]"
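
A minimal sketch of the delegation being suggested, using hypothetical classes (RangeCollection and UserCollectionNode are invented here, as is the exact text each to_string returns): the wrapper knows nothing about ranges and simply defers to the wrapped collection.

```python
class RangeCollection:
    """Hypothetical user-generated collection backed by a Python range."""

    def __init__(self, name: str, column: str, data: range):
        self.name = name
        self.column = column
        self.data = data

    def to_string(self) -> str:
        # Each collection kind is responsible for describing itself.
        return f"RangeCollection({self.name!r}, {self.column!r}, {self.data!r})"


class UserCollectionNode:
    """Hypothetical wrapper node; works for any object with to_string()."""

    def __init__(self, collection):
        self.collection = collection

    def to_string(self) -> str:
        return f"UserCollection[{self.collection.to_string()}]"


node = UserCollectionNode(RangeCollection("T2", "N", range(10, 0, -1)))
print(node.to_string())
```

Adding a new kind of user-generated collection would then only require that class to provide its own to_string.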

):
assert ancestor is not None
super().__init__(ancestor)
self._collection = collection
Contributor

Add a type annotation here


# PYDOUGH CHANGE: avoid merging CTEs when the inner scope uses
# SEQ4()/TABLE() and if any of these exist in the outer query:
# - joins - window functions - aggregations - limit/offset
Contributor

I think the formatting of this got messed up


match collection:
    case RangeGeneratedCollection():
        return self._convert_user_generated_range(collection)
Contributor

NIT: rename to convert_user_generated_range

Comment on lines 894 to 898
result = "generated_collection("
result += f"name={unqualified._parcel[0].name!r}, "
result += f"columns=[{', '.join(unqualified._parcel[0].columns)}],"
result += f"data={unqualified._parcel[0].data}"
return result + ")"
Contributor

Let's rely on unqualified._parcel[0].to_string() to help us out here.

Comment on lines 217 to 224
this=sqlglot_expressions.Literal.number(collection.start),
expression=sqlglot_expressions.Mul(
this=sqlglot_expressions.Anonymous(this="SEQ4"),
expression=sqlglot_expressions.Literal.number(
collection.step
),
),
),
Contributor

Minor optimizations here for cleanliness:

  • Skip the Add if collection.start == 0
  • Skip the Mul if collection.step == 1
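
The two optimizations above can be sketched with plain strings. This is only an illustration: the PR builds sqlglot expression nodes, whereas the hypothetical helper range_value_sql below just formats SQL text.

```python
def range_value_sql(start: int, step: int) -> str:
    """Build the per-row value expression, omitting no-op arithmetic."""
    # Skip the multiplication when step == 1.
    term = "SEQ4()" if step == 1 else f"SEQ4() * {step}"
    # Skip the addition when start == 0.
    return term if start == 0 else f"{start} + {term}"


print(range_value_sql(1, 5))
print(range_value_sql(0, 1))
```

Both branches are independent, so a range with start 0 and step 3 drops only the addition, and a range with start 2 and step 1 drops only the multiplication.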

Comment on lines +42 to +45
    @property
    def name(self) -> str:
        """Returns the name of the generated table."""
        return self.collection.name
Contributor

Why do we need the name here? By now, relational nodes don't care about the names.

Contributor Author

This is the table name. Keeping it here makes it easier to access later.

@hadia206 hadia206 merged commit 557b98e into main Nov 7, 2025
19 checks passed
@hadia206 hadia206 deleted the Hadia/user_collections_range branch November 7, 2025 17:03

4 participants