Should `Row` objects be required to support iteration #75

pdeffebach · 2019-03-09T19:05:53Z

It looks like row objects from Tables.rows have different iteration methods. Or, rather, the rows in JuliaDB allow for iteration and the ones in DataFrames do not.

This is despite that DataFrameRows were changed somewhat recently to allow iteration so they behave more like a NamedTuple.

Does Tables care about this? Or is it enough that they both implement Tables.rows and not that they return objects with the same behavior.

julia> using JuliaDB, Tables, DataFrames

julia> df = DataFrame(rand(3, 2))
3×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.286823 │ 0.045351 │
│ 2   │ 0.262343 │ 0.807344 │
│ 3   │ 0.532561 │ 0.360337 │

julia> t = table(df)
Table with 3 rows, 2 columns:
x1        x2
──────────────────
0.286823  0.045351
0.262343  0.807344
0.532561  0.360337

julia> for row in Tables.rows(df)
       @show sum(row)
       end
ERROR: MethodError: no method matching iterate(::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}})
Closest candidates are:
  iterate(::Core.SimpleVector) at essentials.jl:589
  iterate(::Core.SimpleVector, ::Any) at essentials.jl:589
  iterate(::ExponentialBackOff) at error.jl:171
  ...
Stacktrace:
 [1] mapfoldl_impl(::Function, ::Function, ::NamedTuple{(),Tuple{}}, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:53
 [2] #mapfoldl#170(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:70
 [3] mapfoldl(::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:70
 [4] #mapreduce#174(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:203
 [5] mapreduce(::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:203
 [6] sum(::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:397
 [7] sum(::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:414
 [8] macro expansion at .\show.jl:555 [inlined]
 [9] top-level scope at .\REPL[12]:2 [inlined]
 [10] top-level scope at .\none:0

julia> for row in Tables.rows(t)
       @show sum(row)
       end
sum(row) = 0.3321738206655105
sum(row) = 1.0696873516171348
sum(row) = 0.8928982979305948

The text was updated successfully, but these errors were encountered:

pdeffebach · 2019-03-09T20:34:05Z

As mentioned on Slack, I think a way forward would be to up the requirements for these row types. They should not only be property accessible, but also iterable.

pdeffebach · 2019-03-15T15:32:46Z

The correct place to enforce this property is here, right?

Tables.jl/src/iteratorwrapper.jl

Line 37 in 0f75893

function Base.iterate(rows::IteratorWrapper)

We need can just check if row.x is iterable and if not break.

pdeffebach · 2019-03-15T18:49:06Z

So, the issue is that DataFrames doesn't do anything to implement row iteration, so it defaults to the fallback Tables row implementation defined in fallbacks.jl . getproperty defined for it here:

Base.getproperty(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getproperty(getfield(c, 1), T, col, nm)[getfield(c, 2)]
Base.getproperty(c::ColumnsRow, nm::Symbol) = getproperty(getfield(c, 1), nm)[getfield(c, 2)]
Base.propertynames(c::ColumnsRow) = propertynames(getfield(c, 1))

JuliaDB returns a named tuple instead of a ColumnsRow object but I can't find where that behavior is defined in the source code.

quinnj · 2019-03-15T19:40:53Z

JuliaDB returns a NamedTuple because of the definitions in IndexedTables (and StructArrays).

I don't think it's reasonable to require rows to be iterable. Especially when it's easy enough to write your own "property iterator" to accomplish what you're going for here:

struct PropertyIterator{T}
    source::T
end

function properties(x::T) where {T}
    props = propertynames(x)
    length(props) == 0 && throw(ArgumentError("$T isn't property-iterable because it has no propertynames"))
    return PropertyIterator(x)
end

Base.length(p::PropertyIterator) = length(propertynames(p.source))
Base.IteratorEltype(::Type{<:PropertyIterator}) = Base.EltypeUnknown()

function Base.iterate(p::PropertyIterator, (props, i)=(propertynames(p.source), 1))
    i > length(props) && return nothing
    return getproperty(p.source, props[i]), (props, i + 1)
end

Happy to put this definition in Tables.jl if you think it'll be helpful generally, but I think it's about as efficient as we could expect for the various Row types that packages define for the Tables.jl interface. Some could be a tad more efficient if they're storing values in an indexable container, so they could perhaps overload Tables.properties, but anyway, we're now getting down the rabbit hole pretty far here.

pdeffebach · 2019-03-15T19:55:42Z

I don't think it's reasonable to require rows to be iterable. Especially when it's easy enough to write your own "property iterator" to accomplish what you're going for here:

Your proposed solution would be

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(properties(row))
end

Ideally, I would like the following (which work for eachrow in DataFrames.

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(row)
end

Or even

df.total_income = map(Tables.rows(df)) do row
    sum(row[income_sources]) # okay this doesn't work for NamedTuples, but it would be nice.
end

In my view, a row is a one dimensional object. I think that mean(row) and sum(row) are very intuitive operations and it would be nice to enforce that they work for all tables.

nalimilan · 2019-03-15T21:30:00Z

@quinnj Why do you think some row types couldn't implement iteration? That sounds like an essential feature for this kind of object, and falling back to propertynames is probably going to be quite slow (at least it would be for DataFrameRow).

quinnj · 2019-04-26T13:17:25Z

I feel like the argument here is essentially the same as "why aren't structs naturally iterable over their fields?" Answer: because they're fields. You access them by name. The core idea of the Tables.jl API is the dual nature of two key interfaces: PropertyAccessible and Iterable. Tables.rows produces an iterator of property-accessible objects, while Tables.columns produces a property-accessible object of iterators.

It feels uncomfortable to me, therefore, to consider imposing additional requirements on what Tables.rows returns. It makes the Tables.jl API that much more complex to implement/maintain/reason about, for what benefit? Like I pointed out, writing your own property-iterator is extremely trivial and adds what, 10 extra characters to the use-case here?

@nalimilan, I explicitly wrote the PropertyIterator example above so you would only need to call propertynames(row) once, and the same names could be passed in to each call to iterate from there on, so there shouldn't be a cause for performance concern.

nalimilan · 2019-04-28T17:01:40Z

@nalimilan, I explicitly wrote the PropertyIterator example above so you would only need to call propertynames(row) once, and the same names could be passed in to each call to iterate from there on, so there shouldn't be a cause for performance concern.

But for DataFrameRow iteration should use column indices rather than names to avoid an additional dict lookup for each cell. Do you see a generic way to define PropertyIterator that would be as efficient as that?

I understand the tension between feature completeness and keeping API requirements low, but IMHO iteration is still worth it.

quinnj · 2019-05-15T21:59:24Z

I just remembered that we also already provide the Tables.eachcolumn(row) function, that is already a property-iterator shipped w/ Tables.jl. So you could do something like:

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(Tables.eachcolumn(row))
end

quinnj · 2019-05-15T22:55:34Z

One other thought on this: by requiring Row types to support iteration, we subtly start to change how people generally use the Tables.jl interface, with possibly negative effects. For example, I would imagine for most tables, column types are not homogenous, which means that iterating over a row would be plagued with dynamic dispatch, whereas via Tables.eachcolumn using getproperty and constant propagation, accessing row values can be type stable. So while doing Tables.eachcolumn(row) might be a tad less convenient than just iterating the row directly, it also encourages better type stability.

randyzwitch · 2019-11-19T16:25:50Z

I just ran into this, and I also think its weird that iteration over rows isn't supported. I don't think it's "easy enough" for Jacob's solution, as someone who barely understands Tables.jl to start

Edit: Is the solution really this simple? If so, then it does feel like it should be a part of Tables.jl, because it's essentially syntactic sugar:

TStringRow(cols) = TStringRow([TStringValue(x) for x in Tables.eachcolumn(cols)]) #Tables.jl method

instead of

TStringRow(cols) = TStringRow([TStringValue(x) for x in cols]) #currently doesn't work

quinnj · 2020-01-09T23:13:54Z

So I've been thinking about this more recently and I think what I'm coming around to is requiring Row objects to support indexing, as well as Columns. This follows the notion that fields within a row are ordered and can be indexed by column index. I recently implemented this for CSV.Row and it felt very natural/nice and made working with rows a bit more accessible, particularly when when dealing w/ homogenous types. Being aware of other row implementations of Tables.jl-compatible packages, I also think it shouldn't pose a problem to support this.

Requiring indexing could be a good update to then tag 1.0. Anyone have thoughts/concerns with this plan?

randyzwitch · 2020-01-10T01:00:46Z

~~If I'm thinking about this correctly, wouldn't we need indexing to support threading over rows? I could easily see wanting to do this:~~

Edit: I think I'm mangling concepts, nevermind

Threads.@threads for i in 1:iter.n
      v[i] = somefunction(iter[i, :])
end

tpapp · 2020-01-10T08:04:55Z

@quinnj: so the proposal is to have two ways for row element iteration: Tables.eachcolumn(row) for performant (type stable) code, and indexing which has no such guarantees and may result in suboptimal code?

I think that indexing in Julia should be ideally be reserved for collections with homogeneous element types (ie things that support eltype in a meaningful way). I know there are exceptions (out of necessity), but I think they should be avoided when there is an alternative.

nalimilan · 2020-01-10T08:22:03Z

@tpapp IIUC what Jacob said, the idea is to make require rows and columns to be indexable (which is generally the case already for most implementations), not tables.

quinnj · 2020-02-08T04:54:26Z

In #131, the Tables.jl API has been enhanced and clarified so that Row and Columns objects (i.e. things that get returned from Tables.rows and Tables.columns) support the same interface:

Tables.getcolumn(row_or_columns, i::Int): get a column value (Row) or entire column (Columns) by column index
Tables.getcolumn(row_or_columns, nm::Symbol): get a column value (Row) or entire column (Columns) by column name
Tables.getcolumn(row_or_columns, ::Type{T}, i::Int, nm::Symbol): get a column value (Row) or entire column (Columns) by column eltype (T), column index, and column name
Tables.columnnames(row_or_columns): return column names of a Row or Columns object

While this streamlines and simplifies the interface, it was discussed that it potentially makes things less convenient in casual settings when users want/expect things like indexing, iteration, and property-access to work on rows or columns. With the new 1.0 API, my hope is that it helps clarifies things that you just use Tables.getcolumn and Tables.columnnames) instead of wondering whether this or that interface is supported. Also note that two new abstract types, Tables.AbstractRow and Tables.AbstractColumns were added, which custom row/columns types can subtype in order for their type to act like a NamedTuple (row) or NamedTuple of vectors (columns). That is, by subtyping one of those types, the custom types automatically have the indexing, iteration, dictionary, and property-access interfaces implemented by just using the getcolumn and columnnames definitions. My hope is that while the Tables.jl API becomes less convenient for casual use, it also encourages consistent behavior across row/columns types.

quinnj · 2020-02-11T21:35:53Z

Ok, one more thought I just had on this is that we could perhaps have something like the following defined in Tables.jl:

struct Row{T} <: Tables.AbstractRow
    x::T
end

Tables.getcolumn(x::Row, i::Int) = Tables.getcolumn(x.x, i)
Tables.getcolumn(x::Row, nm::Symbol) = Tables.getcolumn(x.x, nm)
Tables.getcolumn(x::Row, ::Type{T}, i::Int, nm::Symbol) where {T} = Tables.getcolumn(x.x, T, i, nm)
Tables.columnnames(x::Row) = Tables.columnnames(x.x)

which would then allow a user to do something like:

x = 0.0
for raw_row in Tables.rows(table)
    row = Tables.Row(raw_row)
    x += sum(row)
end

i.e. the very simple Tables.Row struct would wrap any interface-implementing "row" object and because Tables.Row <: Tables.AbstractRow, it would have a ton of desired properties, indexing, iteration, property-access, abstract dict, etc.

Do people think it's worth defining something like that? Would it be useful for more casual users in order to not lose convenience? It's not a ton of code and it's not super complicated, so I'm inclined to do it, but I also just wonder if people would actually use it.

nalimilan · 2020-02-11T21:46:58Z

Sounds like it could be useful. And it would be clearer than having abstract type Row end.

quinnj · 2020-02-12T07:39:32Z

Just to wrap this issue up for those on this thread: for the 1.0 release, we added a Tables.Row(row) wrapper struct that will take any Tables.AbstractRow-compatible object (i.e. any plain row iterated from Tables.row row iterator) and give it all the useful, convenient behavior discussed here and elsewhere: iteration, indexing with both index and name, property access, AbstractDict methods. Hopefully that's a good compromise to allowing more useful rows.

quinnj changed the title ~~Inconsistent behavior for Tables.rows between DataFrames and JuliaDB~~ Should Row objects be required to support iteration May 15, 2019

pdeffebach mentioned this issue Nov 26, 2019

add nrow and ncol #124

Closed

quinnj closed this as completed Feb 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should `Row` objects be required to support iteration #75

Should `Row` objects be required to support iteration #75

pdeffebach commented Mar 9, 2019

pdeffebach commented Mar 9, 2019 •

edited

Loading

pdeffebach commented Mar 15, 2019

pdeffebach commented Mar 15, 2019

quinnj commented Mar 15, 2019

pdeffebach commented Mar 15, 2019

nalimilan commented Mar 15, 2019

quinnj commented Apr 26, 2019

nalimilan commented Apr 28, 2019

quinnj commented May 15, 2019

quinnj commented May 15, 2019

randyzwitch commented Nov 19, 2019 •

edited

Loading

quinnj commented Jan 9, 2020

randyzwitch commented Jan 10, 2020 •

edited

Loading

tpapp commented Jan 10, 2020

nalimilan commented Jan 10, 2020

quinnj commented Feb 8, 2020

quinnj commented Feb 11, 2020

nalimilan commented Feb 11, 2020

quinnj commented Feb 12, 2020

Should Row objects be required to support iteration #75

Should Row objects be required to support iteration #75

Comments

pdeffebach commented Mar 9, 2019

pdeffebach commented Mar 9, 2019 • edited Loading

pdeffebach commented Mar 15, 2019

pdeffebach commented Mar 15, 2019

quinnj commented Mar 15, 2019

pdeffebach commented Mar 15, 2019

nalimilan commented Mar 15, 2019

quinnj commented Apr 26, 2019

nalimilan commented Apr 28, 2019

quinnj commented May 15, 2019

quinnj commented May 15, 2019

randyzwitch commented Nov 19, 2019 • edited Loading

quinnj commented Jan 9, 2020

randyzwitch commented Jan 10, 2020 • edited Loading

tpapp commented Jan 10, 2020

nalimilan commented Jan 10, 2020

quinnj commented Feb 8, 2020

quinnj commented Feb 11, 2020

nalimilan commented Feb 11, 2020

quinnj commented Feb 12, 2020

Should `Row` objects be required to support iteration #75

Should `Row` objects be required to support iteration #75

pdeffebach commented Mar 9, 2019 •

edited

Loading

randyzwitch commented Nov 19, 2019 •

edited

Loading

randyzwitch commented Jan 10, 2020 •

edited

Loading