-
Notifications
You must be signed in to change notification settings - Fork 2.5k
feat: Allow quantile to compute multiple quantiles at once #25516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for adding the change. This might need some extra use-cases. For instance, a test case with an array input for a few valid quantiles followed by an invalid one. etc |
| .map(|v: Option<f64>| v.map(|f| f as i64)) | ||
| .collect::<Int64Chunked>() | ||
| .into_series(); | ||
| // Cast the int64 series to the time type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not use a cast, but rather do into_time on the Int64Chunked.
| .map(|v: Option<f64>| v.map(|f| f as i64)) | ||
| .collect::<Int64Chunked>() | ||
| .into_series(); | ||
| // Cast the int64 series to the duration type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not use a cast, but rather do into_duration on the Int64Chunked.
| .map(|v: Option<f64>| v.map(|f| f as i64)) | ||
| .collect::<Int64Chunked>() | ||
| .into_series(); | ||
| // Cast the int64 series to the datetime type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not use a cast, but rather do into_datetime on the Int64Chunked.
| .map(|v: Option<f64>| v.map(|f| (f * (US_IN_DAY as f64)) as i64)) | ||
| .collect::<Int64Chunked>() | ||
| .into_series(); | ||
| // Cast the int64 series to the datetime type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not use a cast, but rather do into_date on the Int64Chunked.
| for _q in quantiles { | ||
| out.push(None); | ||
| } | ||
| Ok(out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec![None; quantiles.len()]
| } | ||
|
|
||
| fn quantiles_reduce(&self, quantiles: &[f64], method: QuantileMethod) -> PolarsResult<Scalar> { | ||
| let v = self.quantiles(quantiles, method)?; // Vec<Option<f64>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't annotate types in comments like this.
|
|
||
| fn quantiles_reduce(&self, quantiles: &[f64], method: QuantileMethod) -> PolarsResult<Scalar> { | ||
| let v = self.quantiles(quantiles, method)?; // Vec<Option<f64>> | ||
| // build a Float64 series from the optional results, preserving nulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not building a Float64 series here.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #25516 +/- ##
==========================================
+ Coverage 79.61% 79.81% +0.20%
==========================================
Files 1729 1729
Lines 239727 239952 +225
Branches 3038 3038
==========================================
+ Hits 190857 191521 +664
+ Misses 48087 47648 -439
Partials 783 783 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| /// Get the quantile of the Series as a new Series of length 1. | ||
| /// Default implementation delegates to `quantiles_reduce` with a single element | ||
| /// and unwraps the resulting `List` scalar to a plain scalar where possible. | ||
| fn quantile_reduce(&self, quantile: f64, method: QuantileMethod) -> PolarsResult<Scalar> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is fundamentally wrong with the datatypes. You changed everything to go through quantiles_reduce and made the dtype quantiles_reduce dynamic based on the length of the input.
quantiles_reduce should always return a List, even if the input has length 1. Please undo the changes that makes everything go through quantiles_reduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Just want to make sure I fully understand what the issue is here and what you would like done. Putting aside the specific internal function implementations for a second, the desired behavior (confirming) is that we have one user exposed function (quantile) that has two different overloaded types behaviors:
- f64 input quantile -> f64 output
- list of f64 inputs -> list of f64 outputs
Obviously from a math/process standpoint, (1) is a just a version of (2), but from a types standpoint they are different.
internally, we need to go through a lot of steps/functions along the way to go all the way in and all the way out. My original implementation had those two different types cases handled in side by side functions (essentially) all the way down to the bottom and then back out again, including implementing a (new)“quantiles_reduce” function sitting side by side next to “quantile_reduce.” (each of which needing an implementation for all the different numerical types). This seemed bad to me, so I collapsed the two reduce functions into a single one that could handle both cases.
Are you saying you don’t want that and we should go back to having two reduce functions, one for f64 -> f64 and one for list -> list? Or are you saying everything can go through the new quantiles_reduce, but make that function always do list -> list and handle the scalar/list type conversion in the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying you don’t want that and we should go back to having two reduce functions, one for f64 -> f64 and one for list -> list?
Yes. They are different operations.
Or are you saying everything can go through the new quantiles_reduce, but make that function always do list -> list and handle the scalar/list type conversion in the caller?
This also would've been fine, but my preference goes to the above. What isn't fine is that the output datatype depends on the length of the input.
Fixes #25451
Fixes #21755
Fixes #9014
A first cut here. Interested in any feedback.
Note: for multiple quantiles, we will always just sort up front (for now). quick select is still used on the case where we are only looking for one quantile, we can get a contiguous slice, and the data is not already sorted. Could consider implementing a multi-quick sort if we think it is worth it.
I don't love having .quantile and quantiles in so many places in the internal code, but wasn't sure how hard it would be to avoid that.
Note: I have not refactored .describe() yet to actually use this, can do. I did make qcut utilize this pathway.