Ak/spec #372
base: master
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff            @@
    ##             master      #372   +/-  ##
    =========================================
      Coverage          ?    89.41%
    =========================================
      Files             ?        53
      Lines             ?      4450
      Branches          ?       809
    =========================================
      Hits              ?      3979
      Misses            ?       325
      Partials          ?       146

☔ View full report in Codecov by Sentry.
dbldatagen/spec/generator_spec.py
Outdated
    class ValidationResult:
        """Container for validation results that collects errors and warnings during spec validation.

        This class accumulates validation issues found while checking a DatagenSpec configuration.
Move it out to a different module.
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
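For orientation, here is a rough sketch of how such a declarative spec might be assembled. It is based only on the class names and constructor parameters visible in this diff; the `tables` and `rows` field names and the exact signatures are assumptions, not the PR's confirmed API.

```python
# Hypothetical sketch only; `tables`, `rows`, and constructor signatures are assumptions.
from dbldatagen.spec.generator_spec import (
    DatagenSpec,
    TableDefinition,
    ColumnDefinition,
    FilePathTarget,
)

spec = DatagenSpec(
    tables={
        "users": TableDefinition(
            rows=1000,  # assumed row-count field
            columns=[
                # Primary key column: unique values, not nullable
                ColumnDefinition(name="user_id", type="long", primary=True),
                # Data column generated from a template option
                ColumnDefinition(name="email", type="string",
                                 options={"template": r"\w.\w@\w.com"}),
                # Data column constrained by min/max options
                ColumnDefinition(name="age", type="int",
                                 options={"min": 18, "max": 90}),
            ],
        )
    },
    output_destination=FilePathTarget(base_path="/tmp/synthetic_data",
                                      output_format="parquet"),
)
```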
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
    from dbldatagen.spec.generator_spec import DatagenSpec
    import pytest
    from dbldatagen.spec.generator_spec import (
        DatagenSpec,
        TableDefinition,
        ColumnDefinition,
        UCSchemaTarget,
        FilePathTarget,
    )
Copilot AI (Dec 4, 2025)
DatagenSpec is imported twice, once on line 1 and again on line 4. Remove the redundant import on line 1 to avoid duplication.
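A sketch of the consolidation Copilot suggests, keeping only the grouped import:

```python
# Keep the grouped import; the separate single-name import of DatagenSpec is redundant.
import pytest

from dbldatagen.spec.generator_spec import (
    DatagenSpec,
    TableDefinition,
    ColumnDefinition,
    UCSchemaTarget,
    FilePathTarget,
)
```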
    the basic stock ticker dataset, corresponding to the BasicStockTickerProvider.
    """

    from random import random
Copilot AI (Dec 4, 2025)
The import 'from random import random' shadows the built-in random module, making it confusing whether 'random' refers to the module or the function. Consider using 'import random' and calling 'random.random()' instead, or aliasing the function as 'from random import random as random_func'.
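Both alternatives the comment mentions, sketched for comparison:

```python
# Option 1: import the module and call the function through it.
import random
jitter = random.random()

# Option 2: alias the function so the module name stays available.
from random import random as random_func
jitter = random_func()
```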
    global_gen_options = config.generator_options if config.generator_options else {}

    prepared_generators: dict[str, dg.DataGenerator] = {}
    generation_order = list(tables_config.keys()) # This becomes impotant when we get into multitable
Copilot AI (Dec 4, 2025)
Corrected spelling of 'impotant' to 'important'.
Suggested change:

    - generation_order = list(tables_config.keys()) # This becomes impotant when we get into multitable
    + generation_order = list(tables_config.keys()) # This becomes important when we get into multitable
    clean:
    -	rm -fr .venv clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    +	rm -fr clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
Copilot AI (Dec 4, 2025)
The clean target attempts to remove a directory named 'clean' itself, which appears to be an error. This should likely only remove cache and output directories, not a directory named 'clean'.
Suggested change:

    - rm -fr clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    + rm -fr htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    from typing import Any

    # Import only the compat layer by default to avoid triggering Spark/heavy dependencies
    from .compat import BaseModel, Field, constr, root_validator, validator
I think we should use absolute imports as a convention where possible. We're cleaning up the relative-style imports in other modules.
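As a concrete example of that convention, the relative import in this diff would become an absolute one (a sketch; the target module path is the one shown in the diff):

```python
# Relative form currently in the diff:
# from .compat import BaseModel, Field, constr, root_validator, validator

# Absolute form following the suggested convention:
from dbldatagen.spec.compat import BaseModel, Field, constr, root_validator, validator
```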
    DbldatagenBasicType = Literal[
        "string",
        "int",
        "long",
        "float",
        "double",
        "decimal",
        "boolean",
        "date",
        "timestamp",
        "short",
        "byte",
        "binary",
        "integer",
        "bigint",
        "tinyint",
    ]
    """Type alias representing supported basic Spark SQL data types for column definitions.

    Includes both standard SQL types (e.g. string, int, double) and Spark-specific type names
    (e.g. bigint, tinyint). These types are used in the ColumnDefinition to specify the data type
    for generated columns.
    """
Let's move this to a types.py module in the /dbldatagen folder. We will also need custom types for other modules.
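A sketch of what that refactor might look like; the module path dbldatagen/types.py follows the suggestion above, but the exact location and re-export strategy are assumptions:

```python
# dbldatagen/types.py (hypothetical new module)
from typing import Literal

DbldatagenBasicType = Literal[
    "string", "int", "long", "float", "double", "decimal", "boolean",
    "date", "timestamp", "short", "byte", "binary", "integer", "bigint", "tinyint",
]
"""Supported basic Spark SQL data types for column definitions."""

# The spec modules would then import the alias absolutely, e.g.:
# from dbldatagen.types import DbldatagenBasicType
```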
    This class encapsulates all the information needed to generate data for a single column,
    including its name, type, constraints, and generation options. It supports both primary key
    columns and derived columns that can reference other columns.
Maybe change this to something like the following to avoid confusion:
It supports primary key columns, data columns, and derived columns that reference other columns.
    :param type: Spark SQL data type for the column (e.g., "string", "int", "timestamp").
        If None, type may be inferred from options or baseColumn
    :param primary: If True, this column will be treated as a primary key column with unique values.
        Primary columns cannot have min/max options and cannot be nullable
    :param options: Dictionary of additional options controlling column generation behavior.
        Common options include: min, max, step, values, template, distribution, etc.
        See dbldatagen documentation for full list of available options
    :param nullable: If True, the column may contain NULL values. Primary columns cannot be nullable
    :param omit: If True, this column will be generated internally but excluded from the final output.
        Useful for intermediate columns used in calculations
    :param baseColumn: Name of another column to use as the basis for generating this column's values.
        Default is "id" which refers to the internal row identifier
    :param baseColumnType: Method for deriving values from the baseColumn. Common values:
Would be good to specify the default values for any optional parameters.
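An illustrative docstring fragment with the defaults called out explicitly; only the baseColumn default ("id") is confirmed by the diff, the other defaults are assumed for the sake of the example:

```python
"""
:param primary: If True, treat this column as a primary key column with unique values.
    Defaults to False (assumed).
:param nullable: If True, the column may contain NULL values. Defaults to False (assumed).
    Primary columns cannot be nullable.
:param baseColumn: Name of another column used as the basis for this column's values.
    Defaults to "id", the internal row identifier.
"""
```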
    constraints that depend on multiple fields being set. It ensures that primary key
    columns meet all necessary requirements and that conflicting options are not specified.

    :param values: Dictionary of all field values for this ColumnDefinition instance
Maybe "Dictionary of all ColumnDefinition parameters"?
    conflicting_opts_for_pk = [
        "distribution", "template", "dataRange", "random", "omit",
        "min", "max", "uniqueValues", "values", "expr"
    ]

    for opt_key in conflicting_opts_for_pk:
        if opt_key in kwargs:
            logger.warning(
                f"Primary key '{col_name}': Option '{opt_key}' may be ignored")
Why do we disallow min, max, and dataRange for primary keys?
    # Process each column
    for col_def in table_spec.columns:
        kwargs = self._columnSpecToDatagenColumnSpec(col_def)
Maybe use column_options instead to avoid any conflicts.
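A minimal sketch of that rename; the surrounding generator call is not shown in the diff, so the usage comment is an assumption:

```python
# Process each column
for col_def in table_spec.columns:
    column_options = self._columnSpecToDatagenColumnSpec(col_def)
    # ... pass column_options on to the underlying dbldatagen column builder (assumed usage)
```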
    if not prepared_generators:
        logger.warning("No prepared data generators to write")
        return
Should we raise an error instead?
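If failing fast is preferred, the guard might look like this (a sketch of the alternative being asked about, not the PR's current code):

```python
if not prepared_generators:
    # Fail fast instead of logging a warning and silently returning.
    raise RuntimeError("No prepared data generators to write")
```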
    # Write data based on destination type
    if isinstance(output_destination, FilePathTarget):
        output_path = posixpath.join(output_destination.base_path, table_name)
        df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
        logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

    elif isinstance(output_destination, UCSchemaTarget):
        output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
        df.write.mode("overwrite").saveAsTable(output_table)
        logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
We should use utils.write_data_to_output for this.
We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?
Changes
Linked issues
Resolves #..
Requirements