Implement dplyr::lead() and dplyr::lag() functions #8

Open
christophscheuch opened this issue Mar 29, 2023 · 7 comments
Labels: enhancement (New feature or request)

Comments

@christophscheuch

christophscheuch commented Mar 29, 2023

Your fantastic package motivated me to revisit Julia after more than 5 years of abstinence - please keep going :)

Would be great if Tidier.jl could support the computation of lagged or leading values in data frames. For instance, for the typical financial application, I'd want to compute returns over different horizons like this:

using Tidier, MarketData

prices = yahoo(:AAPL, YahooOpt(period1 = DateTime(2020, 1, 1), period2 = DateTime(2021, 12,31))) |> DataFrame

prices = @chain prices begin
    @clean_names()
end

returns = @chain prices begin
    @arrange(timestamp)
    @mutate(ret1m = adj_close / lag(adj_close) - 1, ret12m = adj_close / lag(adj_close, 12) - 1)
end

I guess there is some workaround using ShiftedArrays.jl but I can't seem to make it work with my limited Julia experience 😞

@kdpsingh
Member

kdpsingh commented Mar 29, 2023

Thank you @christophscheuch for the kind words and for the suggestion!

Can you try calling them with a preceding tilde, like this: @mutate(ret1m = adj_close / ~lag(adj_close) - 1, ...

If that works let us know (and I can explain why). If not, I'll take a look. Agree that we should have this functionality natively supported with Tidier.jl, whether or not we can borrow the implementation from ShiftedArrays.jl.

I am also posting the dplyr implementation here as a placeholder in case we want to just base our implementation off of that: https://github.com/tidyverse/dplyr/blob/main/R/lead-lag.R

Key things for our implementation:

  • needs to handle missing values
  • need to add these functions to the do-not-vectorize list (which is why the ~ may be required at the moment).

@christophscheuch
Author

Thanks for the quick response! So I tried the following to simply construct a lagged variable:

returns = @chain prices begin
    @mutate(adj_close_lag = ~lag(adj_close))
end

I get the following error messages:

ERROR: MethodError: no method matching lag(::Vector{Float64})
Closest candidates are:
  lag(::TimeArray{T, N, D, A} where {D<:TimeType, A<:AbstractArray{T, N}}) where {T, N} at C:\Users\...\.julia\packages\TimeSeries\vYT6q\src\apply.jl:9
  lag(::TimeArray{T, N, D, A} where {D<:TimeType, A<:AbstractArray{T, N}}, ::Int64; padding, period) where {T, N} at C:\Users\...\.julia\packages\TimeSeries\vYT6q\src\apply.jl:9  

Unfortunately, I do not know anything about these array types in Julia, so I'm lost :/

@kdpsingh
Member

Have no fear. We will look into this and sort it out!

@kdpsingh kdpsingh added the enhancement New feature or request label Mar 29, 2023
@kdpsingh
Member

Ok, I figured out the issue.

While the MarketData package exposes lead() and lag() functions for time series data, these are not the same lead() and lag() functions as the ones in ShiftedArrays.jl. As the error message shows, the MarketData versions only accept a TimeArray, not a plain Vector.

And if you include using ShiftedArrays in your code, even this doesn't work because lead() and lag() aren't exported to the namespace (akin to package internal functions in R).

This pattern is a bit more common in Julia because it's intended to protect your namespace. You can still access the non-exported functions by writing ShiftedArrays.lag() after writing using ShiftedArrays. This is the same reason why you have to type Pkg.add() to add a package (if not using the package manager console) even after typing using Pkg.
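To make the exported versus non-exported distinction concrete, here is a minimal sketch that uses only base Julia. The module `Shifts` and its `lag` function are hypothetical stand-ins for a package such as ShiftedArrays.jl, not its real implementation, and the helper assumes a non-negative shift.

```julia
# Hypothetical stand-in for a package that does not export its functions.
module Shifts
# No `export` statement here, so `lag` stays out of the caller's namespace.
# Assumes n >= 0; positions with no earlier value are filled with `missing`.
lag(v::AbstractVector, n::Integer = 1) =
    [i - n >= firstindex(v) ? v[i - n] : missing for i in eachindex(v)]
end

using .Shifts            # brings only the module name `Shifts` into scope

x = [10, 20, 30]
a = Shifts.lag(x)        # qualified access works without any export

using .Shifts: lag       # explicitly bring the non-exported name into scope
b = lag(x, 2)            # now the unqualified call works too
```

The explicit `using .Shifts: lag` form is the same pattern as importing `lag` and `lead` by name from a package: you opt in to exactly the names you want and nothing else enters your namespace.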

So tl;dr: here's a solution that will work for now. Add the following code to the top of your script:

using ShiftedArrays: lead, lag

Then, if you use ~lead() and ~lag() to reference them, this will work for now.

In an update in the near future, we'll directly add support for these functions within Tidier.jl and will remove the need for the tilde. The reason the tilde is needed for now is that Tidier.jl is "auto-vectorizing" lead() and lag(), which we don't want. You can read more about this here: https://kdpsingh.github.io/Tidier.jl/dev/examples/generated/UserGuide/autovec/

Adding the tilde to the front prevents this auto-vectorization. In the next update, we will add these functions to our "do-not-vectorize" list, which will remove the need for a tilde. Additionally, I plan to expose a function that will allow users to manually add functions to that list.
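As a rough illustration of why auto-vectorizing lag() is a problem, the sketch below assumes that auto-vectorization behaves like Julia's broadcast (dot) syntax, so the function is applied to each element rather than to the whole column. `my_lag` is a hypothetical helper written in base Julia, not the ShiftedArrays.jl function, and it assumes a non-negative shift.

```julia
# Hypothetical helper; needs the whole vector to look up earlier elements.
my_lag(v::AbstractVector, n::Integer = 1) =
    [i - n >= firstindex(v) ? v[i - n] : missing for i in eachindex(v)]

prices = [100.0, 110.0, 121.0]

# Desired behavior: call my_lag once on the whole vector, then broadcast
# only the arithmetic around it.
ret = prices ./ my_lag(prices) .- 1

# Broadcasting the call itself applies my_lag to each Float64 in turn,
# for which no method exists, so it throws a MethodError.
broadcast_failed = try
    my_lag.(prices)
    false
catch err
    err isa MethodError
end
```

The first return is `missing` because there is no earlier price to compare against. Functions that need to see the whole column, like this one, are the ones that should not be vectorized.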

@christophscheuch
Author

christophscheuch commented Mar 30, 2023

Indeed this works and I learned something - thanks a lot for the explanation! So for completeness, this code snippet now calculates returns :)

using ShiftedArrays: lag, lead
using Tidier, MarketData

prices = yahoo(:AAPL, YahooOpt(period1 = DateTime(2020, 1, 1), period2 = DateTime(2021, 12,31))) |> DataFrame

prices = @chain prices begin
    @clean_names()
end

returns = @chain prices begin
    @mutate(ret = adj_close / ~lag(adj_close) - 1)
end

@kdpsingh
Member

kdpsingh commented Mar 30, 2023

Note to self: I need to update the auto-vectorization docs to give an example of when functions should not be auto-vectorized inside of @mutate(). This would be a good example of that, as would ntile().

While we will generally try to automate this, users need to understand this when using custom functions that they have written inside of Tidier verbs.

@kdpsingh kdpsingh self-assigned this Mar 31, 2023
@kdpsingh
Member

kdpsingh commented Apr 2, 2023

This is now fixed in TidierOrg/Tidier.jl#82. lag() and lead() are re-exported from ShiftedArrays.jl. This implementation supports an optional argument to indicate how many values to shift by and includes an optional keyword argument to specify the default value to fill in instead of missing.

Both functions are now included in the do-not-vectorize list, so there's no more need to use a tilde.
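For readers unfamiliar with the semantics described above, here is a base-Julia sketch of a shift count plus a fill value. The names `lag_sketch` and `lead_sketch` are hypothetical, and the keyword name `default` is an assumption made for illustration; check the package docs for the actual signature. Both helpers assume a non-negative shift.

```julia
# Hypothetical sketches of the lag/lead semantics, not the package code.
# `n` is how many positions to shift; `default` fills the vacated slots.
function lag_sketch(v::AbstractVector, n::Integer = 1; default = missing)
    return [i - n >= firstindex(v) ? v[i - n] : default for i in eachindex(v)]
end

function lead_sketch(v::AbstractVector, n::Integer = 1; default = missing)
    return [i + n <= lastindex(v) ? v[i + n] : default for i in eachindex(v)]
end

x = [1, 2, 3, 4]

lagged      = lag_sketch(x)                   # previous value, missing at the start
lagged_fill = lag_sketch(x, 2; default = 0)   # shift by 2, fill the gap with 0
led         = lead_sketch(x)                  # next value, missing at the end
```

With a one-period lag, a simple return is then `x ./ lag_sketch(x) .- 1`, which mirrors the `adj_close / lag(adj_close) - 1` expression from the original request.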

I'll close this issue out after updating the docs for @mutate to include an example of lag/lead and updating the auto-vectorization page to explain when tildes should be used inside of user-created functions within @mutate().

@kdpsingh kdpsingh transferred this issue from TidierOrg/Tidier.jl Jul 31, 2023