Ak/spec #372
base: master
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff            @@
    ##             master      #372   +/-  ##
    =========================================
      Coverage          ?    89.41%
    =========================================
      Files             ?        53
      Lines             ?      4450
      Branches          ?       809
    =========================================
      Hits              ?      3979
      Misses            ?       325
      Partials          ?       146

☔ View full report in Codecov by Sentry.
dbldatagen/spec/generator_spec.py
Outdated
    class ValidationResult:
        """Container for validation results that collects errors and warnings during spec validation.

        This class accumulates validation issues found while checking a DatagenSpec configuration.
Move it out to a different module.
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
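For orientation, here is a rough sketch of how such a declarative spec might be assembled. It is based only on the class names and constructor parameters visible in this diff; the `tables` and `rows` field names and the exact signatures are assumptions, not the PR's confirmed API.

```python
# Hypothetical sketch only; `tables`, `rows`, and constructor signatures are assumptions.
from dbldatagen.spec.generator_spec import (
    DatagenSpec,
    TableDefinition,
    ColumnDefinition,
    FilePathTarget,
)

spec = DatagenSpec(
    tables={
        "users": TableDefinition(
            rows=1000,  # assumed row-count field
            columns=[
                # Primary key column: unique values, not nullable
                ColumnDefinition(name="user_id", type="long", primary=True),
                # Data column generated from a template option
                ColumnDefinition(name="email", type="string",
                                 options={"template": r"\w.\w@\w.com"}),
                # Data column constrained by min/max options
                ColumnDefinition(name="age", type="int",
                                 options={"min": 18, "max": 90}),
            ],
        )
    },
    output_destination=FilePathTarget(base_path="/tmp/synthetic_data",
                                      output_format="parquet"),
)
```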
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
    from dbldatagen.spec.generator_spec import DatagenSpec
    import pytest
    from dbldatagen.spec.generator_spec import (
        DatagenSpec,
        TableDefinition,
        ColumnDefinition,
        UCSchemaTarget,
        FilePathTarget,
    )
Copilot AI (Dec 4, 2025)
DatagenSpec is imported twice, once on line 1 and again on line 4. Remove the redundant import on line 1 to avoid duplication.
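A sketch of the consolidation Copilot suggests, keeping only the grouped import:

```python
# Keep the grouped import; the separate single-name import of DatagenSpec is redundant.
import pytest

from dbldatagen.spec.generator_spec import (
    DatagenSpec,
    TableDefinition,
    ColumnDefinition,
    UCSchemaTarget,
    FilePathTarget,
)
```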
    the basic stock ticker dataset, corresponding to the BasicStockTickerProvider.
    """

    from random import random
Copilot AI (Dec 4, 2025)
The import 'from random import random' shadows the built-in random module, making it confusing whether 'random' refers to the module or the function. Consider using 'import random' and calling 'random.random()' instead, or aliasing the function as 'from random import random as random_func'.
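Both alternatives the comment mentions, sketched for comparison:

```python
# Option 1: import the module and call the function through it.
import random
jitter = random.random()

# Option 2: alias the function so the module name stays available.
from random import random as random_func
jitter = random_func()
```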
    global_gen_options = config.generator_options if config.generator_options else {}

    prepared_generators: dict[str, dg.DataGenerator] = {}
    generation_order = list(tables_config.keys()) # This becomes impotant when we get into multitable
Copilot AI (Dec 4, 2025)
Corrected spelling of 'impotant' to 'important'.
Suggested change:

    - generation_order = list(tables_config.keys()) # This becomes impotant when we get into multitable
    + generation_order = list(tables_config.keys()) # This becomes important when we get into multitable
    clean:
    -	rm -fr .venv clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    +	rm -fr clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
Copilot AI (Dec 4, 2025)
The clean target attempts to remove a directory named 'clean' itself, which appears to be an error. This should likely only remove cache and output directories, not a directory named 'clean'.
Suggested change:

    - rm -fr clean htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    + rm -fr htmlcov .mypy_cache .pytest_cache .ruff_cache .coverage coverage.xml
    from typing import Any

    # Import only the compat layer by default to avoid triggering Spark/heavy dependencies
    from .compat import BaseModel, Field, constr, root_validator, validator
I think we should use absolute imports as a convention where possible. We're cleaning up the relative-style imports in other modules.
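As a concrete example of that convention, the relative import in this diff would become an absolute one (a sketch; the target module path is the one shown in the diff):

```python
# Relative form currently in the diff:
# from .compat import BaseModel, Field, constr, root_validator, validator

# Absolute form following the suggested convention:
from dbldatagen.spec.compat import BaseModel, Field, constr, root_validator, validator
```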
    DbldatagenBasicType = Literal[
        "string",
        "int",
        "long",
        "float",
        "double",
        "decimal",
        "boolean",
        "date",
        "timestamp",
        "short",
        "byte",
        "binary",
        "integer",
        "bigint",
        "tinyint",
    ]
    """Type alias representing supported basic Spark SQL data types for column definitions.

    Includes both standard SQL types (e.g. string, int, double) and Spark-specific type names
    (e.g. bigint, tinyint). These types are used in the ColumnDefinition to specify the data type
    for generated columns.
    """
Let's move this to a types.py module in the /dbldatagen folder. We will also need custom types for other modules.
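A sketch of what that refactor might look like; the module path dbldatagen/types.py follows the suggestion above, but the exact location and re-export strategy are assumptions:

```python
# dbldatagen/types.py (hypothetical new module)
from typing import Literal

DbldatagenBasicType = Literal[
    "string", "int", "long", "float", "double", "decimal", "boolean",
    "date", "timestamp", "short", "byte", "binary", "integer", "bigint", "tinyint",
]
"""Supported basic Spark SQL data types for column definitions."""

# The spec modules would then import the alias absolutely, e.g.:
# from dbldatagen.types import DbldatagenBasicType
```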
    This class encapsulates all the information needed to generate data for a single column,
    including its name, type, constraints, and generation options. It supports both primary key
    columns and derived columns that can reference other columns.
Maybe change this to something like the following to avoid confusion:
It supports primary key columns, data columns, and derived columns that reference other columns.
    :param type: Spark SQL data type for the column (e.g., "string", "int", "timestamp").
        If None, type may be inferred from options or baseColumn
    :param primary: If True, this column will be treated as a primary key column with unique values.
        Primary columns cannot have min/max options and cannot be nullable
    :param options: Dictionary of additional options controlling column generation behavior.
        Common options include: min, max, step, values, template, distribution, etc.
        See dbldatagen documentation for full list of available options
    :param nullable: If True, the column may contain NULL values. Primary columns cannot be nullable
    :param omit: If True, this column will be generated internally but excluded from the final output.
        Useful for intermediate columns used in calculations
    :param baseColumn: Name of another column to use as the basis for generating this column's values.
        Default is "id" which refers to the internal row identifier
    :param baseColumnType: Method for deriving values from the baseColumn. Common values:
Would be good to specify the default values for any optional parameters.
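An illustrative docstring fragment with the defaults called out explicitly; only the baseColumn default ("id") is confirmed by the diff, the other defaults are assumed for the sake of the example:

```python
"""
:param primary: If True, treat this column as a primary key column with unique values.
    Defaults to False (assumed).
:param nullable: If True, the column may contain NULL values. Defaults to False (assumed).
    Primary columns cannot be nullable.
:param baseColumn: Name of another column used as the basis for this column's values.
    Defaults to "id", the internal row identifier.
"""
```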
    constraints that depend on multiple fields being set. It ensures that primary key
    columns meet all necessary requirements and that conflicting options are not specified.

    :param values: Dictionary of all field values for this ColumnDefinition instance
Maybe "Dictionary of all ColumnDefinition parameters"?
    conflicting_opts_for_pk = [
        "distribution", "template", "dataRange", "random", "omit",
        "min", "max", "uniqueValues", "values", "expr"
    ]

    for opt_key in conflicting_opts_for_pk:
        if opt_key in kwargs:
            logger.warning(
                f"Primary key '{col_name}': Option '{opt_key}' may be ignored")
Why do we disallow min, max, and dataRange for primary keys?
    # Process each column
    for col_def in table_spec.columns:
        kwargs = self._columnSpecToDatagenColumnSpec(col_def)
Maybe use column_options instead to avoid any conflicts.
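A minimal sketch of that rename; the surrounding generator call is not shown in the diff, so the usage comment is an assumption:

```python
# Process each column
for col_def in table_spec.columns:
    column_options = self._columnSpecToDatagenColumnSpec(col_def)
    # ... pass column_options on to the underlying dbldatagen column builder (assumed usage)
```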
    if not prepared_generators:
        logger.warning("No prepared data generators to write")
        return
Should we raise an error instead?
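If failing fast is preferred, the guard might look like this (a sketch of the alternative being asked about, not the PR's current code):

```python
if not prepared_generators:
    # Fail fast instead of logging a warning and silently returning.
    raise RuntimeError("No prepared data generators to write")
```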
    # Write data based on destination type
    if isinstance(output_destination, FilePathTarget):
        output_path = posixpath.join(output_destination.base_path, table_name)
        df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
        logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

    elif isinstance(output_destination, UCSchemaTarget):
        output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
        df.write.mode("overwrite").saveAsTable(output_table)
        logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
We should use utils.write_data_to_output for this.
We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?
Changes
Linked issues
Resolves #..
Requirements