Metadata cleanup #272

tnixon · 2022-10-25T00:45:39Z

This is the first big chunk of the v0.2 refactor work. This update includes a major reworking of the metadata structure for TSDF. I wanted to present it internally to the team as a PR to get eyes on this.

THIS IS A WORK IN PROGRESS
I have done quite a bit to get the test code passing, but not all tests pass yet. There are still some loose ends to clean up, and more changes are coming. The focus here should be on the new metadata structure for TSDF, its new constructors, and how these will simplify and streamline other code. I invite your reviews, questions and comments - let's get a good discussion going!

Also note - this is not going into master, just into the v0.2-integration branch. This is just a temporary place to integrate lots of changes and get things ready for a final merge to master when all changes are completed.

First round of TSDF code changes to use the new classes

getting test code passing

will need bigger refactoring...

This reverts commit c2ef72b.

lgtm-com · 2022-10-25T01:03:08Z

This pull request introduces 6 alerts when merging ab1ff0c into 38ec63f - view on LGTM.com

new alerts:

4 for Wrong name for an argument in a class instantiation
2 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-10-25T01:31:05Z

This pull request introduces 6 alerts when merging 1459ff5 into 38ec63f - view on LGTM.com

new alerts:

4 for Wrong name for an argument in a class instantiation
2 for `__eq__` not overridden when adding attributes

python/tempo/tsdf.py

R7L208 · 2022-10-25T13:25:38Z

python/tempo/tsdf.py

+        if validate_schema:
+            self.ts_schema.validate(df.schema)


What's the scenario where we would not want to validate the schema?

I see there are some protected methods where we don't validate schema, but seems like exposing this arg could cause issues if set to False when users initialize a TSDF.

I also don't think it hurts to validate the schema each time we manipulate the underlying DF in any way, even in protected methods.

Most TSDF transformer methods make some changes to the underlying DF and then return it wrapped in a new TSDF object. I think of this validation as primarily for end-users who might need guidance on how they're building a TSDF. Internal transformations should already be safe, so shouldn't require validation.

However, I'm open to doing validation on every constructor. I don't think it'll be a hugely heavy function.
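To make the trade-off in this thread concrete, here is a minimal sketch in plain Python (no Spark). `ToySchema` and `ToyFrame` are hypothetical stand-ins for the PR's schema and TSDF classes, and a dict of column name to type stands in for `df.schema`; none of this is the actual tempo code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToySchema:
    """Hypothetical stand-in for the schema class: holds the required column names."""

    ts_col: str
    series_ids: List[str]

    def validate(self, df_schema: Dict[str, str]) -> None:
        # Raise if any structural column is missing from the data's schema.
        required = [self.ts_col, *self.series_ids]
        missing = [c for c in required if c not in df_schema]
        if missing:
            raise ValueError(f"missing structural columns: {missing}")


class ToyFrame:
    """Hypothetical stand-in for TSDF, with an optional validation step."""

    def __init__(self, df_schema: Dict[str, str], ts_schema: ToySchema,
                 validate_schema: bool = True) -> None:
        self.df_schema = df_schema
        self.ts_schema = ts_schema
        if validate_schema:
            self.ts_schema.validate(df_schema)

    def _with_transformed(self, new_schema: Dict[str, str]) -> "ToyFrame":
        # Internal path: trusts the caller and skips validation.
        return ToyFrame(new_schema, self.ts_schema, validate_schema=False)
```

In this sketch the public constructor rejects a schema that lacks `event_ts`, while `_with_transformed` would accept it silently. That is the risk raised above if the flag is exposed to users or an internal transformation has a bug.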

R7L208 · 2022-10-25T13:39:30Z

python/tempo/tsdf.py

-        # If we see a string, we will proactively created a double
-        # version of the string timestamp for sorting purposes and
-        # rename to ts_col
+    @classmethod


Should we use __withTransformedDF instead of a class method for this?

__withTransformedDF returns a TSDF, this returns a DataFrame. It's a helper to build compound-timestamp columns from multiple columns for special situations (timestamp + sub-sequence, string col -> parsed ts col, etc.)

I guess I'm struggling to understand why we'd instantiate an instance of TSDF to do a df -> df operation. Seems like this could be a static method, except that it's protected. And if it's protected, do we want to expose it as a class method?
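For reference, a small sketch of the Python mechanics behind this question. The class `Toy` and its methods are hypothetical, and a list of column names stands in for a DataFrame. Neither a class method nor a static method creates an instance. A class method receives the class as `cls`, while a static method receives nothing implicit. Double-underscore names such as `__withTransformedDF` are name-mangled, which is what makes them awkward to call from outside the class.

```python
from typing import List


class Toy:
    @classmethod
    def _combined_name_cls(cls, cols: List[str]) -> str:
        # Receives the class (or a subclass) as `cls`; no instance is created.
        return cls.__name__.lower() + "_" + "_".join(cols)

    @staticmethod
    def _combined_name_static(cols: List[str]) -> str:
        # Receives no implicit argument; a plain function namespaced in the class.
        return "_".join(cols)

    def __hidden(self) -> str:
        # Stored on the class as `_Toy__hidden` because of name mangling.
        return "mangled"
```

If the helper never needs `cls` (for example, to call another constructor or respect subclasses), a static method or a module-level function expresses the same df -> df operation with less coupling.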

R7L208 · 2022-10-25T13:47:48Z

python/tempo/tsdf.py

-    def __addColumnsFromOtherDF(self, other_cols: Sequence[str]):
+        return TSDF(df, ts_col=ts_col, series_ids=self.series_ids)
+
+    def __addColumnsFromOtherDF(self, other_cols):
        """
        Add columns from some other DF as lit(None), as pre-step before union.


unionByName with allowMissingColumns=True may allow us to get rid of this method.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unionByName.html#pyspark-sql-dataframe-unionbyname

Good thought! This could simplify things a lot here!
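A sketch of what that replacement might look like, assuming Spark 3.1 or later, where `DataFrame.unionByName` accepts `allowMissingColumns`. The function name `union_filling_missing` is hypothetical. The pyspark import is for type hints only, so the snippet does not need pyspark installed to be loaded.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # type hints only; skipped at runtime
    from pyspark.sql import DataFrame


def union_filling_missing(left: "DataFrame", right: "DataFrame") -> "DataFrame":
    """Union two DataFrames by column name.

    Columns present on only one side are filled with nulls on the other,
    which is what the lit(None) pre-step did by hand.
    """
    return left.unionByName(right, allowMissingColumns=True)
```

Columns that exist on both sides still need compatible types, so this covers missing columns but not type mismatches.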

R7L208 · 2022-10-25T13:52:37Z

python/tempo/tsdf.py

+        if set(self.structural_cols).issubset(set(cols)):
+            return self.__withTransformedDF(self.df.select(*cols))
        else:
-            raise Exception(
-                "In TSDF's select statement original ts_col, partitionCols and seq_col_stub(optional) must be present"
+            raise TSDFStructureChangeError(
+                "select that does not include all structural columns"


💯 this is super clean. Can you tag #248 in the PR description?

Yes! definitely.

Also... couldn't we just short-cut select("*") by just returning self?
After all select("*") doesn't change anything...
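A minimal sketch of the short-cut, in plain Python with a list of column names standing in for the DataFrame. `ToySelectable` is hypothetical. `TSDFStructureChangeError` is redefined here only so the snippet is self-contained. The sketch assumes `"*"` is not expanded into real column names before the subset check, which is the situation described in #248.

```python
from typing import List


class TSDFStructureChangeError(Exception):
    """Local stand-in for the error class named in the diff."""


class ToySelectable:
    def __init__(self, columns: List[str], structural_cols: List[str]) -> None:
        self.columns = list(columns)
        self.structural_cols = list(structural_cols)

    def select(self, *cols: str) -> "ToySelectable":
        # select("*") keeps every column, so nothing changes: return self.
        if cols == ("*",):
            return self
        # Otherwise every structural column must survive the projection.
        if set(self.structural_cols).issubset(cols):
            return ToySelectable(list(cols), self.structural_cols)
        raise TSDFStructureChangeError(
            "select that does not include all structural columns"
        )
```

Without the first branch, `select("*")` would fall through to the subset check and raise, because the literal string `"*"` is not one of the structural column names.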

R7L208 · 2022-10-25T13:54:15Z

python/tempo/tsdf.py

            dbutils.fs.ls("/")
            return full_smry
-        # TODO: Can we raise something other than generic Exception?
-        #  perhaps refactor to check for IS_DATABRICKS
        except Exception:


Suggested change

-        except Exception:
+        except NameError:

Just a thought to narrow this assuming we are just trying to catch dbutils not being imported/installed
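A small sketch of why `NameError` is the relevant exception here. Referencing a global name that was never defined raises `NameError`, which is what happens to `dbutils` outside a Databricks notebook. The helper name `dbutils_available` is hypothetical and not part of tempo.

```python
def dbutils_available() -> bool:
    """Return True if a global named `dbutils` is defined in this module."""
    try:
        dbutils  # noqa: F821 - bare name lookup; raises NameError when undefined
    except NameError:
        return False
    return True
```

One consequence of narrowing the clause: any other failure inside the `try` block, such as an error raised by `dbutils.fs.ls("/")` itself, would no longer be swallowed and would propagate to the caller.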

R7L208 · 2022-10-25T14:07:09Z

python/tempo/tsschema.py

+    Timeseries Index when we have a primary timeseries column and a secondary sequencing
+    column that indicates the


Excellent cliffhanger 😂

CLAassistant · 2023-11-27T20:15:28Z

All committers have signed the CLA.

tnixon added 13 commits August 11, 2022 16:35

created new TSIndex and TSSchema classes to represent TSDF metadata.

6a17569

First round of TSDF code changes to use the new classes

saving progess to this point

3c3e5f8

Merge branch 'master' into metadata_cleanup

9ceac4d

getting tsdf_tests.BasicTests to pass

5519daf

big search & replace: partition_cols -> series_ids

6084277

getting test code passing

all as_of tests passing but 1

0dd3b43

will need bigger refactoring...

Merge branch 'v0.2-integration' into metadata_cleanup

8199b60

checkpoint save of current progress...

c2ef72b

Revert "checkpoint save of current progress..."

ef1f4ee

This reverts commit c2ef72b.

Merge branch 'v0.2-integration' into metadata_cleanup

0c9e32a

Merge branch 'v0.2-integration' into metadata_cleanup

f5de397

merging changes from integration branch

99c5e9c

Merge branch 'v0.2-integration' into metadata_cleanup

ab1ff0c

tnixon requested review from rportilla-databricks, R7L208, souvik-databricks, bendoan-db, guanjieshen and Sonali-guleria October 25, 2022 00:45

black code formatting

1459ff5

R7L208 reviewed Oct 25, 2022

tnixon linked an issue Oct 31, 2022 that may be closed by this pull request

tsdf.select("*") throws Exception stating columns must be present when they are by nature of the projection #248

Open

tnixon added 4 commits January 11, 2023 13:16

Merge branch 'v0.2-integration' into metadata_cleanup

c924f16

Standardizing pyspark.sql.functions as Fn

f2d2669

Merge branch 'v0.2-integration' into metadata_cleanup

ea47327

committing WIP - migrating to new laptop

b8e8f8e

tnixon marked this pull request as draft April 10, 2023 22:32

merging non-code changes from master (via v0.2-integration)

e9578be

tnixon closed this Jul 15, 2024
