Support for Views #87

Open
KadeG opened this issue Feb 9, 2024 · 6 comments
Comments

@KadeG

KadeG commented Feb 9, 2024

Adding view support to @filter, @distinct, and @select would be great. Views are far faster and more memory-efficient than copies. Right now my code for a project is split between massive Tidier @chains for reading in and wrangling data (which is shockingly fast) and DataFramesMeta expressions of the form view_df = @subset(df, :col [condition]; view = true) to get SubDataFrames for logging and validation. It would be nice to have a cleaner namespace and use the same syntax throughout.

@kdpsingh
Member

kdpsingh commented Feb 9, 2024

Thanks! Quick question. Is it the direct support for views that is important, or would you be satisfied with in-place versions of each of the macros?

I've been planning to add @mutate!, @select!, etc.

@kdpsingh
Member

kdpsingh commented Feb 9, 2024

Ah, I think I see what you mean, which serves a different purpose.

Let me look and see how to go about doing this.

As an aside, I've also been planning to add logging support, which is automated printing of the changes that each step in the chain achieved. That's also a bit different than what you requested.

We will look into views and how best to support them.

@KadeG
Author

KadeG commented Feb 12, 2024

You're awesome :) In-place operations are great, glad to hear it's being worked on. I'll look into this view thing as well, but I imagine y'all will have much better ideas about how to proceed.

The DataFramesMeta @chain syntax for this is:

@chain df begin
   @rsubset(:col1 == 1; view = true)
   unique(:col2; view = true)
end

Or

@chain df begin
   @rsubset begin
      :col1 == 1
      @kwarg view = true
   end
   unique(:col2; view = true)
end

It seems like the current, very tidy, Tidier syntax for passing args would have this looking something like:

@chain df begin
   @filter col1 == 1 view = true
   @distinct col2 view = true
end

Maybe passing the arg a single time within the @chain block would be preferable? Something like:

@chain df begin
   @filter col1 == 1
   @distinct col2
   @kwarg view = true
end

Maybe extending the @view macro to accept Tidier expressions? Something like:

@chain df begin
   @view @filter col1 == 1
   @view @distinct col2
end

It would be very nice if it were possible to do something like this, but I have no idea how it would even work:

@chain df begin
   @filter col1 == 1
   @distinct col2
   @view
end

> As an aside, I've also been planning to add logging support, which is automated printing of the changes that each step in the chain achieved. That's also a bit different than what you requested.

This is interesting! To be clear, I'm storing intermediate views in order to have a count of unique records at each step of joining and transforming data from two databases. If those counts aren't what is expected, I save the views to CSVs so the databases can be audited later. For example, if the count of records with mismatched values in some status field that should be identical between the two databases is greater than 0, the view is saved and written.
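
A minimal sketch of that kind of check, written in plain DataFrames.jl rather than TidierData. The table, its column names (id, status_a, status_b), and the printed message are assumptions made up for illustration. The CSV step is only indicated in a comment.

```julia
using DataFrames

# Hypothetical joined table: status_a and status_b should be identical.
joined = DataFrame(id = [1, 2, 3, 4],
                   status_a = ["open", "closed", "open", "closed"],
                   status_b = ["open", "open", "open", "closed"])

# View of the rows whose two status fields disagree (no rows are copied).
mismatched = subset(joined, [:status_a, :status_b] => ByRow(!=); view = true)

# Count the unique records in that view, again without copying.
n_bad = nrow(unique(mismatched, :id; view = true))

if n_bad > 0
    # This is the point where the view would be written out for auditing,
    # for example with CSV.write from CSV.jl (not loaded in this sketch).
    println("Found $n_bad mismatched record(s)")
end
```

Because both steps pass view = true, only row indices are stored for each intermediate result, which is what makes keeping dozens of these around feasible.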

@kdpsingh
Member

Thanks for raising this issue. I've been reading through the documentation for DataFrames.jl and trying to understand what I would need to implement to make this work as intended.

I think this might get you what you're looking for. Can you see if this works as you'd expect?

@chain df begin
   @filter col1 == 1
   @distinct col2
   view(:,:)
end

I'm happy to make a macro for this if this turns out to be what you want.

Also, note that you can use the @aside macro that comes with Chain.jl (which is automatically re-exported by TidierData.jl) to store the intermediate results to a variable.

For example:

@chain df begin
   @filter col1 == 1
   @distinct col2
   @aside temp = view(_, :,:) # Creates a variable named `temp` that you can access afterwards
   # ... rest of piped functions continued here...
end

Thoughts?

@kdpsingh
Member

Thinking more on this, I'm not sure this will quite give you what you're looking for.

The example I shared will return a view that you can work with further for logging purposes. However, if you feed that view back into TidierData, then it will get instantiated as a copy (a new data frame).

Looking at DataFramesMeta documentation, it looks to me like only the subset macros support views. Because the select macros can create new columns, I don't think they can operate purely on views (unless you only point to existing columns).

The big question: is it sufficient to return a view, or would you want TidierData to also be able to operate on that view without making a copy?
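
To make the copy-versus-view distinction concrete, here is a minimal sketch in plain DataFrames.jl (not TidierData). The toy data frame and its column names are assumptions for illustration only. It shows that a view shares memory with its parent while a copy does not, and that a column-restricted view points at existing columns.

```julia
using DataFrames

# Toy data; the column names are hypothetical.
df = DataFrame(col1 = [1, 1, 2, 1],
               col2 = ["a", "b", "a", "a"],
               col3 = [10, 20, 30, 40])

# A row-filtered view: no rows are copied.
v = subset(df, :col1 => ByRow(==(1)); view = true)

# A view restricted to existing columns of that view.
w = view(v, :, [:col2, :col3])

# An independent copy, for comparison.
c = copy(df)

println(v isa SubDataFrame)   # true
println(parent(v) === df)     # true

# A write to the parent is visible through the view but not through the copy.
df.col3[1] = 100
println(w.col3[1])            # 100
println(c.col3[1])            # 10
```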

@KadeG
Author

KadeG commented Feb 20, 2024

Your second reply is exactly right. Sorry, I should have clarified that I'd already experimented with what you mentioned in the earlier reply. This returns a view:

@chain df begin
   @filter col1 == 1
   @distinct col2
   view(:,:)
end

but internally (I think this is the right line), the if branch in the definition of @filter, for example, does this:

local df_copy = copy($(esc(df)))

This creates an internal copy that has to be garbage-collected, which is often slower and uses more memory than operating on views.

You are correct that only @subset and @rsubset support views in DataFramesMeta; @select does not. I've been using something like this to output filtered views with column selections and unique values:

log = @chain df begin
   @rsubset(:col1 == 1; view = true)
   unique([:col2, :col3]; view = true)
   view(:, [:col4, :col5, :col6])
end

To answer the big question: Yes, it would be great if @filter, @distinct, and @select could accept and output views without creating full intermediate copies. At least in this case, I'm creating dozens of views for error checking/validation and logging, so storing all of them as copies in memory isn't feasible. Something like this (or another syntax option, like those I listed above):

df2 = @chain df1 begin
   @view @filter col2 == 0
   @view @distinct(col1, col3)
   @view @select col4 col5 col6
end

I didn't know about @aside! This is great for cases where the logs build on each other with sequential filters. Something like (ignoring view syntax):

@chain df1 begin
   @filter col1 == 0
   @aside df2 = @distinct(col2, col3)
   @filter col4 == 0
   @aside df3 = @distinct col5
   #etc . . .
end

which would be especially nice if extended to support begin and end blocks, as in the other issue I opened (#88). Something like:

@chain df1 begin
   @filter col2 == 0
   @aside begin
      @select col1 col3
      df2 = @distinct col1
   end
   @filter col4 == 0
   @aside begin
      @select col1 col5
      df3 = @distinct col1
   end
   #etc . . .
end

When combined with views, something like the above should be faster than how it currently has to be written:

df2 = @chain df1 begin
   @filter col2 == 0
   @select col1 col3
   @distinct col1
end

df3 = @chain df1 begin
   @filter col2 == 0 && col4 == 0
   @select col1 col5
   @distinct col1
end

#etc . . .
