-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata cleanup #272
Metadata cleanup #272
Conversation
First round of TSDF code changes to use the new classes
getting test code passing
will need bigger refactoring...
This reverts commit c2ef72b.
This pull request introduces 6 alerts when merging ab1ff0c into 38ec63f - view on LGTM.com new alerts:
|
This pull request introduces 6 alerts when merging 1459ff5 into 38ec63f - view on LGTM.com new alerts:
|
python/tempo/tsdf.py
Outdated
if validate_schema: | ||
self.ts_schema.validate(df.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the scenario where we would not want to validate the schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there are some protected methods where we don't validate schema, but seems like exposing this arg could cause issues if set to False
when users initialize a TSDF.
I also don't think it hurts to validate the schema each time we manipulate the underlying DF in any way, even protected args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most TSDF transformer methods make some changes to the underlying DF and then return it wrapped in a new TSDF object. I think of this validation as primarily for end-users who might need guidance on how they're building a TSDF. Internal transformations should already be safe, so shouldn't require validation.
However, I'm open to doing validation on every constructor. I dont' think it'll be a hugely heavy function.
# If we see a string, we will proactively created a double | ||
# version of the string timestamp for sorting purposes and | ||
# rename to ts_col | ||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use __withTransformedDF
instead of a class method for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__withTransformedDF
returns a TSDF, this returns a DataFrame. It's a helper to build compound-timestamp columns from multiple columns for special situations (timestamp + sub-sequence, string col -> parsed ts col, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm struggling to understand why we'd instantiate an instance of TSDF to do a df -> df operation. Seems like this could be a static method other than it's protected. And if it's protected, do we want to expose it as a class method?
def __addColumnsFromOtherDF(self, other_cols: Sequence[str]): | ||
return TSDF(df, ts_col=ts_col, series_ids=self.series_ids) | ||
|
||
def __addColumnsFromOtherDF(self, other_cols): | ||
""" | ||
Add columns from some other DF as lit(None), as pre-step before union. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unionByName
with allowMissingCols=True
may allow us to get rid of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thought! This could simplify things a lot here!
python/tempo/tsdf.py
Outdated
if set(self.structural_cols).issubset(set(cols)): | ||
return self.__withTransformedDF(self.df.select(*cols)) | ||
else: | ||
raise Exception( | ||
"In TSDF's select statement original ts_col, partitionCols and seq_col_stub(optional) must be present" | ||
raise TSDFStructureChangeError( | ||
"select that does not include all structural columns" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 this is super clean. Can you tag #248 in the PR descritption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! definitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also... couldn't we just short-cut select("*")
by just returning self
?
After all select("*")
doesn't change anything...
dbutils.fs.ls("/") | ||
return full_smry | ||
# TODO: Can we raise something other than generic Exception? | ||
# perhaps refactor to check for IS_DATABRICKS | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except Exception: | |
except NameError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought to narrow this assuming we are just trying to catch dbutils not being imported/installed
python/tempo/tsschema.py
Outdated
Timeseries Index when we have a primary timeseries column and a secondary sequencing | ||
column that indicates the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent cliffhanger 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤣
This is the first big chunk of the v0.2 refactor work. This update includes a major reworking of the metadata structure for TSDF. I wanted to present it internally to the team as a PR to get eyes on this.
THIS IS A WORK IN PROGRESS
I have done quite a bit to get test code passing, but not all are working yet. There are still some loose ends to clean up and there will be more changes coming. The focus here should be on the new meta-data structure for TSDF, and its new constructors, and how these will simplify and streamline other code. I invite your reviews, questions and comments - let's get a good discussion going!
Also note - this is not going into master, just into the
v0.2-integration
branch. This is just a temporary place to integrate lots of changes and get things ready for a final merge to master when all changes are completed.