add time_series tutorial #272

Open · wants to merge 7 commits into master
4 changes: 4 additions & 0 deletions tutorials/time_series/.gitignore
@@ -0,0 +1,4 @@
Manifest.toml
jena_climate*
.vscode
.ipynb_*
12 changes: 12 additions & 0 deletions tutorials/time_series/Project.toml
@@ -0,0 +1,12 @@
[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Debugger = "31a5f54b-26ea-5ae9-a837-f05ce5417438"
FFTW = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
MLDataPattern = "9920b226-0b2a-5f5f-9153-9aa70a013f8b"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"
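To use this environment, the standard Pkg workflow applies (the path assumes the repository root; nothing here is specific to this PR):

    julia> using Pkg
    julia> Pkg.activate("tutorials/time_series")
    julia> Pkg.instantiate()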
16 changes: 16 additions & 0 deletions tutorials/time_series/TODO
@@ -0,0 +1,16 @@
* Handle the target values for dataset y's better. Is there a way to ensure that time series always have a batch axis using MLDataPattern, especially lazily?
* Decide what level to pitch this tutorial at: base Flux, or incorporating higher-level packages?
* Make time-series batching more idiomatic and faster.
* Add benchmarking integrations so we can tell how other PRs affect this functionality.
* Put things in functions as needed.
* eachbatch: size or maxsize? Which is more common? Probably integrate with DataLoaders.jl.
* Requires Flux on master for the Dense behavior; this needs to be the latest release when the tutorial is published.
* In train_model!, is it critical to return and reassign the model (`linear = train_model!(linear, single_step_1h, opt; bs=16, epochs=20)`)? Without doing so, Flux.update! works during training, but the model called outside the function isn't mutated. Could this be related to calling params at the beginning?
* Early stopping could use some work; not sure it does exactly what I want. Could use EarlyStopping.jl.
* Need to convert the data to Float32? Without it, I get the warning below when running conv_model; see the conversion sketch at the end of this list. Related to the params being Float32?
┌ Warning: Slow fallback implementation invoked for conv! You probably don't want this; check your datatypes.
│ yT = Float64
│ T1 = Float64
│ T2 = Float32
└ @ NNlib ~/.julia/packages/NNlib/fxLrD/src/conv.jl:206
* Make sure conv examples are implemented correctly.
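* A minimal sketch of the Float32 fix referenced above (the model and sizes are hypothetical; only the Float32 conversion is the point):

      using Flux
      x = rand(10, 4, 16)     # width × channels × batch; Float64 by default
      x32 = Float32.(x)       # match Flux's default Float32 parameters
      conv_model = Chain(Conv((3,), 4 => 8, relu), Flux.flatten, Dense(64, 1))
      conv_model(x32)         # runs without NNlib's slow-fallback warning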
31 changes: 31 additions & 0 deletions tutorials/time_series/batch_ts.jl
@@ -0,0 +1,31 @@
using Flux: unsqueeze

"""
Takes in `t`, an array of (`sequence`, `target`) tuples, where `sequence` is a timestamps × features array
and `target` is either an array or a single value. Outputs a tuple of batched sequences and batched targets,
with the batch along the third dimension.

julia> z = [([1 2 3; 2 3 4], 5), ([6 7 8; 7 8 9], 0)]
2-element Array{Tuple{Array{Int64,2},Int64},1}:
([1 2 3; 2 3 4], 5)
([6 7 8; 7 8 9], 0)

julia> batch_ts(z)
([1 2 3; 2 3 4]

[6 7 8; 7 8 9], [5]

[0])

julia> size(batch_ts(z)[1])
(2, 3, 2)

julia> size(batch_ts(z)[2])
(1, 1, 2)
"""
batch_ts(t) = reduce((x, y) -> (cat(x[1], y[1], dims=3), cat(x[2], y[2], dims=3)), t)


"""
Handles a batch of size 1: a single (`sequence`, `target`) tuple rather than an array of them.
"""
batch_ts(t::Tuple) = (unsqueeze(t[1],3), unsqueeze(t[2],3))
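
# Note: the pairwise `reduce` above concatenates (and re-allocates) once per element.
# A sketch of a one-shot equivalent with the same semantics, assuming all tuples hold
# same-sized parts (`batch_ts_splat` is a hypothetical name, not part of this PR):
batch_ts_splat(t::AbstractArray) = (cat(first.(t)..., dims=3), cat(last.(t)..., dims=3))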
45 changes: 45 additions & 0 deletions tutorials/time_series/runtests_batch.jl
@@ -0,0 +1,45 @@
using Test

include("batch_ts.jl")

@testset "multi batch - single element y" begin
for _ in 1:100
dim1 = rand(1:20)
dim2 = rand(1:20)
num = rand(2:100) # ≥2 items; a single-item batch isn't used with single-element targets

tups = [(rand(dim1, dim2), rand()) for i in 1:num]
b_tups = batch_ts(tups)
@test b_tups |> length == 2
@test size(b_tups[1]) == (dim1, dim2, num)
@test size(b_tups[2]) == (1, 1, num)
end
end

@testset "multi batch - multi element y" begin
for _ in 1:100
dim1 = rand(1:20)
dim2 = rand(1:20)
dim3 = rand(1:20)
dim4 = rand(1:20)
num = rand(2:100) # ≥2 items; the single-item case is tested separately below

tups = [(rand(dim1, dim2), rand(dim3, dim4)) for i in 1:num]
b_tups = batch_ts(tups)
@test b_tups |> length == 2
@test size(b_tups[1]) == (dim1, dim2, num)
@test size(b_tups[2]) == (dim3, dim4, num)
end
end

@testset "single item batch" begin
for _ in 1:100
dim1 = rand(1:20)
dim2 = rand(1:20)

tups = (rand(dim1, dim2), rand(dim1, 1))
b_tups = batch_ts(tups)
@test ndims(b_tups[1]) == ndims(b_tups[2]) == 3
end
end

91 changes: 91 additions & 0 deletions tutorials/time_series/runtests_wg.jl
@@ -0,0 +1,91 @@
using Test
using DataFrames
using Random
using Dates

rng = MersenneTwister(123)

times = 150
df = DataFrame("timestamp"=>1:times,
"A"=>randn(rng, Float16, times),
"B"=>sin.(rand(rng, times)),
"C"=>1:times)
train_prop = 0.8
train_end = Int(times*train_prop)
train_df = df[1:train_end,:]
valid_df = df[train_end+1:end,:]


include("window_generator.jl")
@testset "historical window of length 1, future window 1" begin
h = 1
f = 1
wg = WindowGenerator(h, f, train_df, valid_df, "C")
@test wg.target_idx == [4]
@test size(wg.train[1][1]) == (length(names(df)),h)
@test size(wg.valid[1][1]) == (length(names(df)),h)
@test length(wg.train) == train_end - (h+f-1) # (h+f-1) rows at the edges can't form a complete window
@test length(wg.valid) == times - train_end - (h+f-1)
@test (wg.train[1][1][1]) == (wg.train[1][2][1] - h)
@test (wg.train[end][1][1]) == (wg.train[end][2][1] - h)
@test wg.train[end][1][1,end] == (wg.train[end][2][1,begin] - 1) # 1st pred is immediately after last hist

end

@testset "historical window of length 7, future window 5" begin
h = 7
f = 5
wg = WindowGenerator(h, f, train_df, valid_df, "C")
@test wg.target_idx == [4]
@test size(wg.train[1][1]) == (length(names(df)),h)
@test size(wg.valid[1][1]) == (length(names(df)),h)
@test length(wg.train) == train_end - (h+f-1)
@test length(wg.valid) == times - train_end - (h+f-1)
@test (wg.train[1][1][1]) == (wg.train[1][2][1] - h)
@test (wg.train[end][1][1]) == (wg.train[end][2][1] - h)
@test wg.train[end][1][1,end] == (wg.train[end][2][1,begin] - 1) # 1st pred is immediately after last hist
end

@testset "multiple label columns" begin
h = 16
f = 3
wg = WindowGenerator(h, f, train_df, valid_df, "C")
@test length(wg.target_idx) == 1
wg = WindowGenerator(h, f, train_df, valid_df, ["A","C"])
@test length(wg.target_idx) == 2
wg = WindowGenerator(h, f, train_df, valid_df, ["A","C","B"])
@test length(wg.target_idx) == 3
wg = WindowGenerator(h, f, train_df, valid_df, ["timestamp","A","C","B"])
@test length(wg.target_idx) == 4
end

@testset "ignores nonexistent/duplicate columns" begin
h = 5
f = 3
wg = WindowGenerator(h, f, train_df, valid_df, ["timestamp","A","C","B","D"])
@test length(wg.target_idx) == 4

wg = WindowGenerator(h, f, train_df, valid_df, ["timestamp","timestamp"])
@test length(wg.target_idx) == 1
end

@testset "labels_indices are correct" begin
valid_end = times - train_end # limited by size of validation set
for h in 1:(valid_end - 1)
for f in 1:(valid_end - h)
wg = WindowGenerator(h, f, train_df, valid_df, ["timestamp","A","C","B"])
@test f == length(wg.label_indices)
@test first(wg.label_indices) == h + 1
@test last(wg.label_indices) == h + f
end
end
end

# batch_ts
include("batch_ts.jl")
@testset "batch_ts batches WindowGenerator windows" begin
    wg = WindowGenerator(3, 2, train_df, valid_df, "C")
    b = batch_ts(wg.train[1:8])
    @test length(b) == 2
    @test size(b[1], 3) == 8
    @test size(b[2], 3) == 8
end
19,314 changes: 19,314 additions & 0 deletions tutorials/time_series/time_series.ipynb

Large diffs are not rendered by default.
