-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix some issues with sampling #879
base: master
Are you sure you want to change the base?
Conversation
Currently sampling functions need to perform the same set of checks on the inputs and those checks are copied and pasted for each method. We can instead define a simple input validation function that can be used by all sampling functions so that any additional corner cases that need to be caught can be fixed in one place and propagated elsewhere. Relatedly, this adds checks for agreement between the length of the source array to be sampled and the array of weights (issue 871) as well as that the destination array is not larger than the source when sampling without replacement (issue 877).
Not all `AbstractWeights` subtypes have that field, e.g. `UnitWeights`, but all have indexing defined, so that can be used instead of trying to index into the underlying array.
src/sampling.jl
Outdated
n = length(a) | ||
length(wv) == n || throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv))).")) | ||
k = length(x) | ||
n, k = _validate_sample_inputs(a, wv, x, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function really part of the official API and needs checks of the arguments? IIRC I had never intended it to be called by any user directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if users use the (IMO) intended sample
API then the arguments are already checked I assume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't have thought it was intended to be user-facing at all except, as pointed out in #876, it's included in the manual. 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's not exported, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, no. That said, there are implementations of three different Efraimidis-Spirakis algorithms (A, A-Res, and AExpJ), only one of which (AExpJ) is actually used internally by a function like sample
. That suggests to me that there was the intention of use of these outside of the context sample
but I could very well be mistaken as I don't know the history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs were added in #254.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was actually summer 2016 and you reviewed it
LOL amazing. My brain runs the GC often so 7 years ago is long gone.
I definitely buy the argument that the separate, non-exported functions that each implement specific algorithms should not be considered user-facing and thus shouldn't need to perform the same kind of safety checks as those intended to be called directly. What gets me nervous is that there's nothing saying they aren't user-facing, hence issues like #876 and #877. Perhaps we could add admonitions to the docstrings, e.g.
!!! note
This function is not intended to be called directly and is not considered
part of the package's API.
?
A bit tangential to this discussion but in the future we could do something for sampling algorithms as is done for sorting algorithms in Base: each algorithm gets a type that subtypes some abstract sampling algorithm type then the user may select a particular algorithm via a keyword argument to sample
, e.g. sample(x, wv; alg=EfraimidisAExpJ())
, and internally that dispatches to use e.g. efraimidis_aexpj_wsample_norep!
(after doing any appropriate argument checking 😄).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, passing types via an alg
keyword argument would be the best API.
Better perform checks anyway, except if this means we run checks twice when called from sample
. Is that the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except if this means we run checks twice when called from
sample
. Is that the case?
Currently yes. I can add a flag to the internal checking function that makes it a no-op if called from sample
but perhaps that's more complex than necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How expensive are the checks? Is there a noticeable performance difference between calling sample
and the internal function?
The alg keyword argument seems a reasonable suggestion for future refactorings.
For the time being I would prefer adding a warning or note to the docstrings of these internal functions. I think it was a mistake to add them to the docs at all (also based on the initial + follow-up PRs), so I would be fine even with just removing them from the docs. They're not exported and IMO have never been part of the official API (or at least they were not supposed to be).
sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights) = a[sample(rng, wv)] | ||
function sample(rng::AbstractRNG, a::AbstractArray, wv::AbstractWeights) | ||
_validate_sample_inputs(a, wv) | ||
return a[sample(rng, wv)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird that this line isn't tested.
src/sampling.jl
Outdated
n = length(a) | ||
length(wv) == n || throw(DimensionMismatch("a and wv must be of same length (got $n and $(length(wv))).")) | ||
k = length(x) | ||
n, k = _validate_sample_inputs(a, wv, x, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, passing types via an alg
keyword argument would be the best API.
Better perform checks anyway, except if this means we run checks twice when called from sample
. Is that the case?
Co-authored-by: Milan Bouchet-Valat <[email protected]>
What's holding up this PR? |
The following, which I've not had time for:
|
This PR is separated into three distinct changes:
sample
fails to throw error for inconsistent lengths #871 and SegFault withefraimidis_a_wsample_norep!
#877UnitWeights
See commit messages for additional info as applicable.
Fixes #871
Fixes #877