Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tables API enhancement #131

Merged
merged 15 commits into from
Feb 8, 2020
Merged

Tables API enhancement #131

merged 15 commits into from
Feb 8, 2020

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Feb 1, 2020

Ok, here's a proposal for an official 1.0 API interface for Tables.jl. This PR is actually non-breaking in the sense that everything currently relying on the Tables.jl interface and it's current API definition will continue working with this code (see the unchanged, yet passing tests on this PR), but it enhances the API definitions to require a bit more in order to provide additional functionality, conveniences, and consistency. The enhanced APIs mean sources will need to enhance their implementations to be compatible with sinks who start using the new enhanced APIs.

To try and boil the changes down as succinctly as possible, we have:

  • For the "Row" interface, from requiring propertynames(x) and getproperty(x, nm) to requiring Tables.columnnames(x) and Tables.getcolumn(x, col::Union{Int, Symbol})
  • Also requiring the same columnnames and getcolumn definitions on the "Columns" interface
  • When you call Tables.getcolumn(x, col) on a "Columns" object, it must return a column as an indexable collection with known length, instead of just as an iterator

The motivation for these changes include:

  • Allowing objects where it's difficult/piracy to overload propertynames/getproperty to participate safely in the Tables interface
  • More consistency: we will now have the same consistent methods when working with a "Columns" object or "Row" object
  • More functionality: you can now access columns/column values by name or index consistently

The biggest disadvantage, IMO, of these changes is requiring less convenient methods for the interfaces; i.e. before, you could reliably count on doing row.col1, but that's not strictly required anymore (though could be strongly encouraged). The counterargument to that is the api enhancements make working generically (like with generic packages) with rows/columns more convenient via consistency/functionality, and for more casual users interested in a specific table/row type, they'll probably be more aware of that types official api (like DataFrames, or that CSV.Row documents that it supports property-access).

Along those lines, this PR also defines the AbstractColumns and AbstractRow abstract types, which allow for custom types to inherit useful behavior in the form of default definitions using the interface methods; i.e. they automatically get indexing, property access, abstractdict interface, iteration, etc. just by relying on the getcolumn and columnnames interface methods. Hopefully that encourages types to still be functionally robust (and for the most part, more consistent) in the case that users want to use the Tables api directly.

As a final note, this PR hasn't completely updated all the docs, and hasn't implemented tests to cover the new functionality, but it's far enough along that I wanted to get feedback and people's thoughts.

"Abstract row type with a simple required interface: row values are accessible via `getproperty(row, field)`; for example, a NamedTuple like `nt = (a=1, b=2, c=3)` can access its value for `a` like `nt.a` which turns into a call to the function `getproperty(nt, :a)`"
abstract type Row end
"""
Tables.AbstractColumns
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we highlight explict that not all column types inherit from AbstractColumn ?
And that it thus should not be depended on for dispatch.
That it is just for convieinve of source authors (not sink authors)
(And same for Row)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call out.

| `Tables.getcolumn(table, nm::Symbol)` | getproperty(table, nm) | Retrieve a column by name |
| `Tables.columnnames(table)` | propertynames(table) | Return column names for a table as an indexable collection |
| Optional methods | | |
| `Tables.getcolumn(table, ::Type{T}, i::Int, nm::Symbol)` | Tables.getcolumn(table, nm) | Given a column eltype `T`, index `i`, and column name `nm`, retrieve the column. Provides a type-stable or even constant-prop-able mechanism for efficiency.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include in this PR a test suite method that takes a instance of a subtype and checks it has a correct implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty here in doing a static check of interface implementation is we have generic fallback methods for all the interface methods (so things like NamedTuples and Generators of NamedTuples just work). But I think there's something we could provide to run in your test suite to put your type through the works.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We need to send something though it.
We can't just use hasmethod

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this, I have a thought on interface versioning. See: #133

getcolumn(row::TransformsRow, i::Int) = (getfunc(row, getfuncs(row), i))(getcolumn(getrow(row), i))
columnnames(row::TransformsRow) = columnnames(getrow(row))

struct Transforms{C, T, F} <: AbstractColumns
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before 1.0 should we take the chance to rename this ? (Or to rename the other?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or remove it entirely. I know @bkamins and @nalimilan have been hammering on the select/transform API in DataFrames, so if there's a way we could generalize their excellent design work into a generic Tables.jl API, I think it'd be a great starting point for a TablesOperations.jl package or something.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.
Deleting things before 1.0 is a good way to free up real estate for redevelopment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just mention in the docs that this isn't part of the API, and ensure it's unexported? Then we can decide what to do in the long run.

getfunc(row, nt::NamedTuple, i::Int) = i > fieldcount(typeof(nt)) ? identity : getfield(nt, i)
getfunc(row, d::Dict{String, <:Base.Callable}, i::Int) = get(d, String(columnnames(row)[i]), identity)
getfunc(row, d::Dict{Symbol, <:Base.Callable}, i::Int) = get(d, columnnames(row)[i], identity)
getfunc(row, d::Dict{Int, <:Base.Callable}, i::Int) = get(d, i, identity)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop the dispatch on Base.Callable ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it hurt at all? Or overly restrict?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's overly restrictive.
Doesn't work with functors.

Maybe I want to pass something though a trained NN. Flux.Chain is callable, but not a subtype of Base.Callaeble

@bkamins
Copy link
Member

bkamins commented Feb 1, 2020

If we make these changes could also adding a "public" method taking a table and returning an iterator of NamedTuples representing rows be considered as a part of public Tables.jl API for 1.0 release? In this way you could efficiently (and in type stable way) pretend that your table is a Tables.rowtable for iteration purposes.

Currently Tables.rows does not give this guarantee (and I think it should not), but can return rows of different type depending on the type of source table, so something like a lazy variant of Tables.rowtable would be useful (so that no matter what is the source table you can rely on the fact that you iterate NamedTuples - this is relevant when designing API using Tables.jl).

@quinnj
Copy link
Member Author

quinnj commented Feb 2, 2020

@bkamins, can you clarify the use-case a little for the public namedtuple iterator? Note that the machinery exists to do type-stable iteration over generic rows already: using Tables.eachcolumn and the getproperty(row, T, i, nm) (now getcolumn(row, T, i, nm)) method, which also gracefully handles extremely large datasets by not over-compiling (it switches to a run-length encoding of column types, and if that's still too much, it switches to a single for-loop).

I just worry about publicly blessing NamedTuples too much in relation to data tasks because of the inevitable production issues people will run into with extremely large datasets, especially when we already have machinery to pretty much achieve the same benefits without.

@oschulz
Copy link

oschulz commented Feb 2, 2020

you could reliably count on doing row.col1, but that's not strictly required anymore (though could be strongly encouraged)

Yes, please (strongly encouragement, at least), this makes code so much more readable (and writeable) ...

@bkamins
Copy link
Member

bkamins commented Feb 2, 2020

can you clarify the use-case a little for the public namedtuple iterator?

I agree that for very wide tables one should not use a NamedTuple iterator, but there are common cases when you want to transform several (e.g. four as in an example below) columns into a single column using each-row operation, e.g. something like:

r -> r.a in r.b ? r.c : r.d

and then you apply f to all rows of some table. Now what is crucial:

  • to allow using getproperty in the code
  • to allow using r[1] (i.e. getindex) in the code
  • to be able to tell the user what is the contract for r above

Currently in DataFrames.jl we say that the contract for r is that it is a NamedTuple and produce it using Tables.namedtupleiterator. We could stop doing this, but the we need to set a different contract which NamedTuple meets and for sure I would like not to drop allowing the usage of getproperty and getindex.

@codecov-io
Copy link

codecov-io commented Feb 3, 2020

Codecov Report

Merging #131 into master will increase coverage by 1.83%.
The diff coverage is 95.39%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #131      +/-   ##
=========================================
+ Coverage   95.87%   97.7%   +1.83%     
=========================================
  Files           7       7              
  Lines         436     523      +87     
=========================================
+ Hits          418     511      +93     
+ Misses         18      12       -6
Impacted Files Coverage Δ
src/tofromdatavalues.jl 100% <100%> (+1.78%) ⬆️
src/fallbacks.jl 99.09% <100%> (+4.04%) ⬆️
src/matrix.jl 100% <100%> (ø) ⬆️
src/namedtuples.jl 100% <100%> (ø) ⬆️
src/utils.jl 96% <100%> (+1.31%) ⬆️
src/operations.jl 93.33% <78.57%> (+3.33%) ⬆️
src/Tables.jl 98.64% <97.95%> (-1.36%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59de960...8bdd327. Read the comment docs.

@quinnj
Copy link
Member Author

quinnj commented Feb 4, 2020

@bkamins, ok, I understand that use-case, but I also wonder why Tables.rowtable isn't sufficient? For a limited, known use-case, it seems like using a Vector of NamedTuples would be just as useful as a namedtuple iterator, without needing to add to the public API surface area of Tables.jl.

I should clarify that I'm not against including namedtupleiterator as an official utility function; just doing some regular push-back to make sure existing solutions/work-arounds aren't adequate.

@bkamins
Copy link
Member

bkamins commented Feb 4, 2020

it seems like using a Vector of NamedTuples would be just as useful as a namedtuple iterator

The problem is that materializing a Vector of NamedTuples is potentially expensive (as opposed - as I assume - to using just an iterator that does not allocate).

Just a quick code (not useful, but similar things happen in practice):

for i in 1:10^6
    df2 = select(df, :, [:a, :b] => Row(x -> x.a + x.b + i) => :c, copycols=false)
    # and do something with df2
end

Now Tables.rowtable for a DataFrame that has 10^6 rows and 3 columns (a normal use case) takes:

julia> @benchmark Tables.rowtable(df)
BenchmarkTools.Trial:
  memory estimate:  221.23 MiB
  allocs estimate:  10998485
  --------------
  minimum time:     417.965 ms (1.66% GC)
  median time:      445.184 ms (2.71% GC)
  mean time:        466.125 ms (6.89% GC)
  maximum time:     601.872 ms (17.16% GC)
  --------------
  samples:          11
  evals/sample:     1

which is an overkill (now I see that Tables.rowtable(Tables.columntable(df)) is much faster - I will make a PR to DataFrames.jl fixing this and CC you).

But even given that it is faster (and allocating less to do):

x = Tables.rows(Tables.columntable(df))
Tables.namedtupleiterator(eltype(x), x)

than

Tables.rowtable(Tables.columntable(df))

(and iterating both seems to have the same performance later; actually - surprisingly - iterating Tables.namedtupleiterator seems 10% faster in my benchmarks)

that is why I would prefer to have Tables.namedtupleiterator exposed with simplified API (now it is hard to use except for library codes, as you have to pass two arguments to it).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some smallish comments and typos.

pages=[
"Home" => "index.md",
],
repo="https://github.com/JuliaData/Tables.jl/blob/{commit}{path}#L{line}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? It's not set up in other packages I know, and the link works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all generated from PkgTemplates.jl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we got a bit overzealous? Example.jl is much simpler: https://github.com/JuliaLang/Example.jl/blob/master/docs/make.jl


makedocs(;
modules=[Tables],
format=Documenter.HTML(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary AFAIK. Same for assets.

@@ -0,0 +1,2 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's recommended to specify a particular version to avoid breakage. Then the Manifest isn't needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manifest is useful becaue then you can include in it

[[Tables]]
path = ".."
uuid = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

which makes sure the docstrings are right for the dev version docs.

src/utils.jl Outdated
```julia
vectors = [collect(col) for col in Tables.eachcolumn(Tables.columns(x))]
```
While the first definition applies to an `Row` object, the last definition simply returns an AbstractColumn iterator for a `Columns` object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
While the first definition applies to an `Row` object, the last definition simply returns an AbstractColumn iterator for a `Columns` object.
While the first definition applies to a `Row` object, the last definition simply returns an `AbstractColumn` iterator for a `Columns` object.

"""
Tables.datavaluerows(x) => NamedTuple iterator

Takes any table input `x` and returns a NamedTuple iterator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Takes any table input `x` and returns a NamedTuple iterator
Takes any table input `x` and returns a `NamedTuple` iterator

src/Tables.jl Outdated
Check if an object has specifically defined that it is a table. Note that
not all valid tables will return true, since it's possible to satisfy the
Tables.jl interface at "run-time", e.g. a Generator of NamedTuples iterates
NamedTuples, which satisfies the Row interface, but there's no static way
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NamedTuples, which satisfies the Row interface, but there's no static way
`NamedTuples`, which satisfies the `Row` interface, but there's no static way

src/Tables.jl Outdated

Check whether an object has specifically defined that it implements the `Tables.rows`
function. Note that `Tables.rows` will work on any object that iterates Row-compatible
objects, even if they don't define `rowaccess`, e.g. a Generator of NamedTuples. Also
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
objects, even if they don't define `rowaccess`, e.g. a Generator of NamedTuples. Also
objects, even if they don't define `rowaccess`, e.g. a `Generator` of `NamedTuples`. Also

src/Tables.jl Outdated
Tables.jl-compatible table input and make an instance of the table type. This enables "transform"
workflows that take table inputs, apply transformations, potentially converting the table to
a different form, and end with producing a table of the same type as the original input. The
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
default materializer is `Tables.columntable`, which converts any table input into a `NamedTuple`

src/Tables.jl Outdated
workflows that take table inputs, apply transformations, potentially converting the table to
a different form, and end with producing a table of the same type as the original input. The
default materializer is `Tables.columntable`, which converts any table input into a NamedTuple
of Vectors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
of Vectors.
of `Vector`s.


### Tables.rows usage

First up, let's take a look at the [SQLite.jl]() package and how it uses the Tables.jl interface to allow loading of generic table-like data into a sqlite relational table. Here's the code:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing link.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@quinnj
Copy link
Member Author

quinnj commented Feb 5, 2020

@bkamins, I cleaned up Tables.namedtupleiterator internally, added docs, and put it in the official list of "utility functions".

@bkamins
Copy link
Member

bkamins commented Feb 5, 2020

Thank you. I will add an efficient method for this it in DataFrames.jl when you release this change.

an existing row table and appends the input table source `x`
to the existing row table.
"""
function rowtable end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to add isrowtable trait function #134. It would introduce "row table" to mean something different. I'm not sure if you want to include #134 in 1.0. But can you at least remove Tables.rowtable from the public API? Of course, rejecting #134 and keeping Tables.rowtable is a reasonable choice, too.

# sink function
"""
Tables.rowtable(x) => Vector{NamedTuple}
Tables.rowtable(rt, x) => rt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough context to understand why rowtable(rt, x) and columntable(ct, x) exist. But wouldn't it be better to introduce tablecat(table, tables...) or maybe even Tables.cat(table, tables...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we should probably remove the 2 appending methods; I agree they don't really make sense.

@tkf
Copy link
Contributor

tkf commented Feb 6, 2020

I filed a few feature requests that may be easier to resolve before going 1.0:

@quinnj
Copy link
Member Author

quinnj commented Feb 7, 2020

@bkamins , just fyi, I just pushed a commit that improves Tables.namedtupleiterator performance by 3x (at least in one benchmark I've been running). Just posting in case you have benchmarks you could check as well.

@bkamins
Copy link
Member

bkamins commented Feb 7, 2020

Just posting in case you have benchmarks you could check as well.

I do not see a significant difference in my benchmarks. What is exactly your test scenario? I was generating Tables.namedtupleiterator from a NamedTuple of vectors and checked the speed of iterating it (maybe you meant that it is faster to create this iterator rather than iterate it?)

@quinnj
Copy link
Member Author

quinnj commented Feb 7, 2020

Current master:

julia> using BenchmarkTools, DataFrames, Tables
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]

julia> df = DataFrame(a=1:10000, b=1.0:10000.0, c=["hey" for i = 1:10000]);

julia> @btime rowtable(df);
  3.102 ms (117974 allocations: 2.49 MiB)

This PR:

julia> using BenchmarkTools, DataFrames, Tables
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]

julia> df = DataFrame(a=1:10000, b=1.0:10000.0, c=["hey" for i = 1:10000]);

julia> @btime rowtable(df);
  1.632 ms (97975 allocations: 1.88 MiB)

@bkamins
Copy link
Member

bkamins commented Feb 7, 2020

Are you benchmarking against DataFrames.jl master?

The issue is that we have release-pending https://github.com/JuliaData/DataFrames.jl/blob/master/src/other/tables.jl#L5 change (it seems you are benchmarking against v20.0 release). The change was due to the fact that in really wide tables we want to avoid compilation of Tables.columntable (e.g. for CSV.write).

That is why I have opened JuliaData/DataFrames.jl#2100 to make sure we are fast after you merge this PR and release Tables.jl (so I am waiting with fingers crossed to see it merged 😄 - thank you for all the efforts here).

@bkamins
Copy link
Member

bkamins commented Feb 7, 2020

Sorry - I see you are benchmarking it against master of DataFrames.jl. What I mean to do in JuliaData/DataFrames.jl#2100 is to move from:

julia> @btime rowtable(df);
  3.230 ms (117974 allocations: 2.49 MiB)

to

julia> @btime rowtable(Tables.columntable(df));
  85.400 μs (10018 allocations: 391.42 KiB)

(i.e. when the user requests rowtable from an AbstractDataFrame you move through Tables.columntable)

What I meant earlier with my comment is that the commit did not affect the performance of Tables.columntable. It affects Tables.rowtable from AbstractDataFrame, but I will improve that part anyway soon with JuliaData/DataFrames.jl#2100.

@quinnj quinnj merged commit fb4a125 into master Feb 8, 2020
@quinnj
Copy link
Member Author

quinnj commented Feb 8, 2020

Merging this now, since there's a good amount of work that's built up. I still want to address #134, and allow anyone else to review who would like before making the 1.0 release. I also want to make sure the new docs get built correctly. Please comment here or open an issue for anything else you have concerns with.

@bkamins
Copy link
Member

bkamins commented Feb 8, 2020

Tables in the documentation do not render correctly in section https://juliadata.github.io/Tables.jl/dev/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1

@quinnj quinnj deleted the jq/1dotOh branch February 8, 2020 08:29
@quinnj
Copy link
Member Author

quinnj commented Feb 8, 2020

Hmmm.........I tried to fix them using https://www.tablesgenerator.com/markdown_tables, but even that seems to not have them rendering. Anyone know what's going on there?

@bkamins
Copy link
Member

bkamins commented Feb 8, 2020

See #138 for a fix.

EDIT: actually indeed something strange is happening. I will try to investigate into it. It seems that the tables that are in doc strings have the newlines removed (which in particular makes them print pad in REPL; and in general the tables are too wide for printing on 80-character terminals).

@mortenpi probably should know the right approach to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants