improve docs (#58)
* improve docs

* Update api.md
Tortar committed Apr 16, 2024
1 parent e0af817 commit aba9e6c
Showing 8 changed files with 143 additions and 31 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -29,28 +29,28 @@ julia> iter = Iterators.filter(x -> x != 10, 1:10^7);
 julia> wv(el) = 1.0

 julia> @btime itsample($rng, $iter, 10^4, algRSWRSKIP);
-  14.579 ms (5 allocations: 156.39 KiB)
+  14.578 ms (5 allocations: 156.39 KiB)

 julia> @btime sample($rng, collect($iter), 10^4; replace=true);
-  136.973 ms (20 allocations: 146.91 MiB)
+  136.139 ms (20 allocations: 146.91 MiB)

 julia> @btime itsample($rng, $iter, 10^4, algL);
-  10.630 ms (3 allocations: 78.22 KiB)
+  10.591 ms (3 allocations: 78.22 KiB)

 julia> @btime sample($rng, collect($iter), 10^4; replace=false);
-  138.207 ms (27 allocations: 147.05 MiB)
+  134.352 ms (27 allocations: 147.05 MiB)

 julia> @btime itsample($rng, $iter, $wv, 10^4, algWRSWRSKIP);
-  32.756 ms (5 allocations: 156.41 KiB)
+  32.892 ms (12 allocations: 568.83 KiB)

 julia> @btime sample($rng, collect($iter), Weights($wv.($iter)), 10^4; replace=true);
-  548.043 ms (45 allocations: 702.33 MiB)
+  545.058 ms (45 allocations: 702.33 MiB)

 julia> @btime itsample($rng, $iter, $wv, 10^4, algAExpJ);
-  40.849 ms (11 allocations: 234.78 KiB)
+  41.092 ms (11 allocations: 234.78 KiB)

 julia> @btime sample($rng, collect($iter), Weights($wv.($iter)), 10^4; replace=false);
-  316.312 ms (43 allocations: 370.19 MiB)
+  312.880 ms (43 allocations: 370.19 MiB)
 ```

 More information can be found in the [documentation](https://juliadynamics.github.io/StreamSampling.jl/stable/).
5 changes: 3 additions & 2 deletions docs/make.jl
@@ -5,8 +5,9 @@ println("Documentation Build")
 makedocs(
     modules = [StreamSampling],
     sitename = "StreamSampling.jl",
-    pages = [
-        "API" => "index.md",
+    pages = [
+        "Introduction" => "index.md",
+        "API" => "api.md",
     ],
     warnonly = [:doctest, :missing_docs, :cross_references],
 )
22 changes: 22 additions & 0 deletions docs/src/api.md
@@ -0,0 +1,22 @@
+# API
+
+## General functionalities
+
+```@docs
+ReservoirSample
+update!
+value
+ordered_value
+itsample
+```
+
+# Algorithms
+
+```@docs
+StreamSampling.algL
+StreamSampling.algR
+StreamSampling.algRSWRSKIP
+StreamSampling.algAExpJ
+StreamSampling.algARes
+StreamSampling.algWRSWRSKIP
+```
52 changes: 40 additions & 12 deletions docs/src/index.md
@@ -1,18 +1,46 @@
-# API

-## General functionalities
+# Introduction

+This package allows to sample from any stream in a single pass through the data, even if the number of items is unknown.

+If the iterable is lazy, the memory required grows in relation to the size of the sample, instead of the all population, which can be useful for sampling from big data streams.

+# Example Usage

+The [`itsample`](@ref) instead allows to consume all the stream at once and return the sample collected:

-```@docs
-itsample
 ```
+julia> using StreamSampling
+julia> st = 1:10;
-# Implemented ALgorithms
+julia> itsample(st, 5)
+5-element Vector{Int64}:
+9
+15
+52
+96
+91
+```
+In some cases, one needs to control the updates the [`ReservoirSample`](@ref) will be subject to. In this case
+you can simply use the [`update!`](@ref) function to fit new values in the reservoir:

-```@docs
-StreamSampling.algL
-StreamSampling.algR
-StreamSampling.algRSWRSKIP
-StreamSampling.algAExpJ
-StreamSampling.algARes
-StreamSampling.algWRSWRSKIP
 ```
+julia> using StreamSampling
+julia> rs = ReservoirSample(Int, 5);
+julia> for x in 1:100
+           @inline update!(rs, x)
+       end
+julia> value(rs)
+5-element Vector{Int64}:
+7
+9
+20
+49
+74
+```

+Consult the [API page](https://juliadynamics.github.io/StreamSampling.jl/stable/api/) for more information on the available functionalities.
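The single-pass idea described in the new introduction page can be illustrated with a minimal, self-contained sketch of Vitter's Algorithm R, the algorithm that the `algR` docstring cites. This is not the package's implementation: the function name `reservoir_sample_r` is hypothetical, and only the `Random` standard library is assumed.

```julia
using Random

# Hypothetical helper, NOT part of StreamSampling.jl: a textbook Algorithm R.
# Keeps at most `n` elements in memory, however long `iter` turns out to be.
function reservoir_sample_r(rng::AbstractRNG, iter, n::Int)
    reservoir = Vector{eltype(iter)}()
    sizehint!(reservoir, n)
    seen = 0
    for el in iter
        seen += 1
        if seen <= n
            push!(reservoir, el)       # fill phase: keep the first n elements
        else
            j = rand(rng, 1:seen)      # uniform index over everything seen so far
            if j <= n
                reservoir[j] = el      # replace a slot with probability n/seen
            end
        end
    end
    return reservoir
end

rng = MersenneTwister(42)
lazy = Iterators.filter(x -> x != 10, 1:10^5)   # lazy iterator, never collected
s = reservoir_sample_r(rng, lazy, 5)
```

Because the stream is never collected, memory use depends on the sample size `n` rather than on the length of `lazy`, which is the property the introduction text refers to.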
66 changes: 64 additions & 2 deletions src/StreamSampling.jl
@@ -41,34 +41,46 @@ struct AlgAExpJ <: ReservoirAlgorithm end
 struct AlgWRSWRSKIP <: ReservoirAlgorithm end

 """
+Implements random sampling without replacement.
 Adapted from algorithm L described in "Random sampling with a reservoir, J. S. Vitter, 1985".
 """
 const algL = AlgL()

 """
+Implements random sampling without replacement.
 Adapted from algorithm R described in "Random sampling with a reservoir, J. S. Vitter, 1985".
 """
 const algR = AlgR()

 """
+Implements random sampling with replacement.
 Adapted fron algorithm RSWR_SKIP described in "Reservoir-based Random Sampling with Replacement from
 Data Stream, B. Park et al., 2008".
 """
 const algRSWRSKIP = AlgRSWRSKIP()

 """
+Implements weighted random sampling without replacement.
 Adapted from algorithm A-Res described in "Weighted random sampling with a reservoir,
 P. S. Efraimidis et al., 2006".
 """
 const algARes = AlgARes()

 """
+Implements weighted random sampling without replacement.
 Adapted from algorithm A-ExpJ described in "Weighted random sampling with a reservoir,
 P. S. Efraimidis et al., 2006".
 """
 const algAExpJ = AlgAExpJ()

 """
+Implements weighted random sampling with replacement.
 Adapted from algorithm WRSWR_SKIP described in "A Skip-based Algorithm for Weighted Reservoir
 Sampling with Replacement, A. Meligrana, 2024".
 """
@@ -83,12 +95,62 @@ include("UnweightedSamplingMulti.jl")
 include("WeightedSamplingSingle.jl")
 include("WeightedSamplingMulti.jl")


+"""
+    ReservoirSample([rng], T, method = algL)
+    ReservoirSample([rng], T, n::Int, method = algL; ordered = false)
+Initializes a reservoir sample which can then be fitted with [`update!`](@ref).
+The first signature represents a sample where only a single element is collected.
+Look at the [`Algorithms`](@ref) section for the supported methods.
+"""
+function ReservoirSample end

+export ReservoirSample

+"""
+    update!(rs::AbstractReservoirSample, el, [w])
+Updates the reservoir sample by scanning the passed element.
+In the case of weighted sampling also the weight of the element
+needs to be passed to the function.
+"""
+function update! end

+export update!

+"""
+    value(rs::AbstractReservoirSample)
+Returns the elements collected in the sample at the current
+sampling stage.
+"""
+function value end

+export value

+"""
+    ordered_value(rs::AbstractReservoirSample)
+Returns the elements collected in the sample at the current
+sampling stage in the order they were collected. This applies
+only when `ordered = true` is passed in [`ReservoirSample`](@ref).
+"""
+function ordered_value end

+export ordered_value


 """
     itsample([rng], iter, method = algL)
-    itsample([rng], iter, wv, method = algAExpJ)
+    itsample([rng], iter, weight, method = algAExpJ)
 Return a random element of the iterator, optionally specifying a `rng`
-(which defaults to `Random.default_rng()`) and a `wv` function.
+(which defaults to `Random.default_rng()`) and a `weight` function which
+accept each element as input and outputs the corresponding weight.
 If the iterator is empty, it returns `nothing`.
 -----
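The weighted variants documented in this file take a `weight` function that maps each element to its weight. As an illustration of how such a function can drive a reservoir, here is a minimal sketch of the A-Res scheme cited in the `algARes` docstring: each element receives the key `u^(1/w)` with `u` uniform in [0, 1), and the `n` elements with the largest keys are kept. The name `weighted_sample_ares` is hypothetical, skipping non-positive weights is an assumption of this sketch, and the code is not taken from the package.

```julia
using Random

# Hypothetical helper, NOT the package's code: a plain A-Res sketch.
function weighted_sample_ares(rng::AbstractRNG, iter, weight, n::Int)
    items = Vector{eltype(iter)}()    # sampled elements
    ks = Float64[]                    # key attached to each sampled element
    for el in iter
        w = weight(el)
        w > 0 || continue             # assumption: non-positive weights are skipped
        k = rand(rng)^(1 / w)         # A-Res key: u^(1/w), u uniform in [0, 1)
        if length(items) < n
            push!(items, el)          # fill phase
            push!(ks, k)
        else
            kmin, imin = findmin(ks)  # smallest key currently in the reservoir
            if k > kmin
                items[imin] = el      # evict the element holding the smallest key
                ks[imin] = k
            end
        end
    end
    return items
end

rng = MersenneTwister(1)
wv(el) = el <= 5 ? 10.0 : 1.0         # elements 1:5 weigh ten times more
s = weighted_sample_ares(rng, 1:100, wv, 3)
```

The linear `findmin` scan keeps the sketch short; a priority queue would be the usual choice when `n` is large.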
5 changes: 3 additions & 2 deletions src/WeightedSamplingMulti.jl
@@ -90,11 +90,12 @@ function update!(s::SampleMultiAlgWRSWRSKIP, el, w)
     if s.seen_k <= n
         s.value[s.seen_k] = el
         s.weights[s.seen_k] = w
-        if s.seen_k == n
+        if s.seen_k == n
             s.value = sample(s.rng, s.value, weights(s.weights), n)
             @inline recompute_skip!(s, n)
+            empty!(s.weights)
         end
-    elseif s.skip_w < s.state
+    elseif s.skip_w <= s.state
         p = w/s.state
         z = (1-p)^(n-3)
         q = rand(s.rng, Uniform(z*(1-p)*(1-p)*(1-p),1.0))
6 changes: 3 additions & 3 deletions test/runtests.jl
@@ -8,9 +8,9 @@ using StableRNGs
 using Test

 @testset "StreamSampling.jl Tests" begin
-    #include("package_sanity_tests.jl")
-    #include("unweighted_sampling_single_tests.jl")
-    #include("unweighted_sampling_multi_tests.jl")
+    include("package_sanity_tests.jl")
+    include("unweighted_sampling_single_tests.jl")
+    include("unweighted_sampling_multi_tests.jl")
     include("weighted_sampling_single_tests.jl")
     include("weighted_sampling_multi_tests.jl")
 end
2 changes: 0 additions & 2 deletions test/weighted_sampling_multi_tests.jl
@@ -89,8 +89,6 @@ end
 else
     ps_exact = [prob_no_replace(k) for (k, v) in pairs_dict if length(unique(k)) == size]
 end
-println(method)
-println(sum(ps_exact))
 count_est = [v for (k, v) in pairs_dict]
 chisq_test = ChisqTest(count_est, ps_exact)
 @test pvalue(chisq_test) > 0.05
