Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should Row objects be required to support iteration #75

Closed
pdeffebach opened this issue Mar 9, 2019 · 19 comments
Closed

Should Row objects be required to support iteration #75

pdeffebach opened this issue Mar 9, 2019 · 19 comments

Comments

@pdeffebach
Copy link

It looks like row objects from Tables.rows have different iteration methods. Or, rather, the rows in JuliaDB allow for iteration and the ones in DataFrames do not.

This is despite that DataFrameRows were changed somewhat recently to allow iteration so they behave more like a NamedTuple.

Does Tables care about this? Or is it enough that they both implement Tables.rows and not that they return objects with the same behavior.

julia> using JuliaDB, Tables, DataFrames

julia> df = DataFrame(rand(3, 2))
3×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.286823 │ 0.045351 │
│ 2   │ 0.262343 │ 0.807344 │
│ 3   │ 0.532561 │ 0.360337 │

julia> t = table(df)
Table with 3 rows, 2 columns:
x1        x2
──────────────────
0.286823  0.045351
0.262343  0.807344
0.532561  0.360337

julia> for row in Tables.rows(df)
       @show sum(row)
       end
ERROR: MethodError: no method matching iterate(::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}})
Closest candidates are:
  iterate(::Core.SimpleVector) at essentials.jl:589
  iterate(::Core.SimpleVector, ::Any) at essentials.jl:589
  iterate(::ExponentialBackOff) at error.jl:171
  ...
Stacktrace:
 [1] mapfoldl_impl(::Function, ::Function, ::NamedTuple{(),Tuple{}}, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:53
 [2] #mapfoldl#170(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:70
 [3] mapfoldl(::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:70
 [4] #mapreduce#174(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:203
 [5] mapreduce(::Function, ::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:203
 [6] sum(::Function, ::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:397
 [7] sum(::Tables.ColumnsRow{NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}}) at .\reduce.jl:414
 [8] macro expansion at .\show.jl:555 [inlined]
 [9] top-level scope at .\REPL[12]:2 [inlined]
 [10] top-level scope at .\none:0

julia> for row in Tables.rows(t)
       @show sum(row)
       end
sum(row) = 0.3321738206655105
sum(row) = 1.0696873516171348
sum(row) = 0.8928982979305948
@pdeffebach
Copy link
Author

pdeffebach commented Mar 9, 2019

As mentioned on Slack, I think a way forward would be to up the requirements for these row types. They should not only be property accessible, but also iterable.

@pdeffebach
Copy link
Author

The correct place to enforce this property is here, right?

function Base.iterate(rows::IteratorWrapper)

We need can just check if row.x is iterable and if not break.

@pdeffebach
Copy link
Author

So, the issue is that DataFrames doesn't do anything to implement row iteration, so it defaults to the fallback Tables row implementation defined in fallbacks.jl . getproperty defined for it here:

Base.getproperty(c::ColumnsRow, ::Type{T}, col::Int, nm::Symbol) where {T} = getproperty(getfield(c, 1), T, col, nm)[getfield(c, 2)]
Base.getproperty(c::ColumnsRow, nm::Symbol) = getproperty(getfield(c, 1), nm)[getfield(c, 2)]
Base.propertynames(c::ColumnsRow) = propertynames(getfield(c, 1))

JuliaDB returns a named tuple instead of a ColumnsRow object but I can't find where that behavior is defined in the source code.

@quinnj
Copy link
Member

quinnj commented Mar 15, 2019

JuliaDB returns a NamedTuple because of the definitions in IndexedTables (and StructArrays).

I don't think it's reasonable to require rows to be iterable. Especially when it's easy enough to write your own "property iterator" to accomplish what you're going for here:

struct PropertyIterator{T}
    source::T
end

function properties(x::T) where {T}
    props = propertynames(x)
    length(props) == 0 && throw(ArgumentError("$T isn't property-iterable because it has no propertynames"))
    return PropertyIterator(x)
end

Base.length(p::PropertyIterator) = length(propertynames(p.source))
Base.IteratorEltype(::Type{<:PropertyIterator}) = Base.EltypeUnknown()

function Base.iterate(p::PropertyIterator, (props, i)=(propertynames(p.source), 1))
    i > length(props) && return nothing
    return getproperty(p.source, props[i]), (props, i + 1)
end

Happy to put this definition in Tables.jl if you think it'll be helpful generally, but I think it's about as efficient as we could expect for the various Row types that packages define for the Tables.jl interface. Some could be a tad more efficient if they're storing values in an indexable container, so they could perhaps overload Tables.properties, but anyway, we're now getting down the rabbit hole pretty far here.

@pdeffebach
Copy link
Author

I don't think it's reasonable to require rows to be iterable. Especially when it's easy enough to write your own "property iterator" to accomplish what you're going for here:

Your proposed solution would be

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(properties(row))
end

Ideally, I would like the following (which work for eachrow in DataFrames.

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(row)
end

Or even

df.total_income = map(Tables.rows(df)) do row
    sum(row[income_sources]) # okay this doesn't work for NamedTuples, but it would be nice.
end

In my view, a row is a one dimensional object. I think that mean(row) and sum(row) are very intuitive operations and it would be nice to enforce that they work for all tables.

@nalimilan
Copy link
Member

@quinnj Why do you think some row types couldn't implement iteration? That sounds like an essential feature for this kind of object, and falling back to propertynames is probably going to be quite slow (at least it would be for DataFrameRow).

@quinnj
Copy link
Member

quinnj commented Apr 26, 2019

I feel like the argument here is essentially the same as "why aren't structs naturally iterable over their fields?" Answer: because they're fields. You access them by name. The core idea of the Tables.jl API is the dual nature of two key interfaces: PropertyAccessible and Iterable. Tables.rows produces an iterator of property-accessible objects, while Tables.columns produces a property-accessible object of iterators.

It feels uncomfortable to me, therefore, to consider imposing additional requirements on what Tables.rows returns. It makes the Tables.jl API that much more complex to implement/maintain/reason about, for what benefit? Like I pointed out, writing your own property-iterator is extremely trivial and adds what, 10 extra characters to the use-case here?

@nalimilan, I explicitly wrote the PropertyIterator example above so you would only need to call propertynames(row) once, and the same names could be passed in to each call to iterate from there on, so there shouldn't be a cause for performance concern.

@nalimilan
Copy link
Member

@nalimilan, I explicitly wrote the PropertyIterator example above so you would only need to call propertynames(row) once, and the same names could be passed in to each call to iterate from there on, so there shouldn't be a cause for performance concern.

But for DataFrameRow iteration should use column indices rather than names to avoid an additional dict lookup for each cell. Do you see a generic way to define PropertyIterator that would be as efficient as that?

I understand the tension between feature completeness and keeping API requirements low, but IMHO iteration is still worth it.

@quinnj
Copy link
Member

quinnj commented May 15, 2019

I just remembered that we also already provide the Tables.eachcolumn(row) function, that is already a property-iterator shipped w/ Tables.jl. So you could do something like:

df.total_income = map(Tables.rows(df[income_sources])) do row
    sum(Tables.eachcolumn(row))
end

@quinnj quinnj changed the title Inconsistent behavior for Tables.rows between DataFrames and JuliaDB Should Row objects be required to support iteration May 15, 2019
@quinnj
Copy link
Member

quinnj commented May 15, 2019

One other thought on this: by requiring Row types to support iteration, we subtly start to change how people generally use the Tables.jl interface, with possibly negative effects. For example, I would imagine for most tables, column types are not homogenous, which means that iterating over a row would be plagued with dynamic dispatch, whereas via Tables.eachcolumn using getproperty and constant propagation, accessing row values can be type stable. So while doing Tables.eachcolumn(row) might be a tad less convenient than just iterating the row directly, it also encourages better type stability.

@randyzwitch
Copy link
Contributor

randyzwitch commented Nov 19, 2019

I just ran into this, and I also think its weird that iteration over rows isn't supported. I don't think it's "easy enough" for Jacob's solution, as someone who barely understands Tables.jl to start

Edit: Is the solution really this simple? If so, then it does feel like it should be a part of Tables.jl, because it's essentially syntactic sugar:

TStringRow(cols) = TStringRow([TStringValue(x) for x in Tables.eachcolumn(cols)]) #Tables.jl method

instead of

TStringRow(cols) = TStringRow([TStringValue(x) for x in cols]) #currently doesn't work

@quinnj
Copy link
Member

quinnj commented Jan 9, 2020

So I've been thinking about this more recently and I think what I'm coming around to is requiring Row objects to support indexing, as well as Columns. This follows the notion that fields within a row are ordered and can be indexed by column index. I recently implemented this for CSV.Row and it felt very natural/nice and made working with rows a bit more accessible, particularly when when dealing w/ homogenous types. Being aware of other row implementations of Tables.jl-compatible packages, I also think it shouldn't pose a problem to support this.

Requiring indexing could be a good update to then tag 1.0. Anyone have thoughts/concerns with this plan?

@randyzwitch
Copy link
Contributor

randyzwitch commented Jan 10, 2020

If I'm thinking about this correctly, wouldn't we need indexing to support threading over rows? I could easily see wanting to do this:

Edit: I think I'm mangling concepts, nevermind

Threads.@threads for i in 1:iter.n
      v[i] = somefunction(iter[i, :])
end

@tpapp
Copy link
Contributor

tpapp commented Jan 10, 2020

@quinnj: so the proposal is to have two ways for row element iteration: Tables.eachcolumn(row) for performant (type stable) code, and indexing which has no such guarantees and may result in suboptimal code?

I think that indexing in Julia should be ideally be reserved for collections with homogeneous element types (ie things that support eltype in a meaningful way). I know there are exceptions (out of necessity), but I think they should be avoided when there is an alternative.

@nalimilan
Copy link
Member

@tpapp IIUC what Jacob said, the idea is to make require rows and columns to be indexable (which is generally the case already for most implementations), not tables.

@quinnj
Copy link
Member

quinnj commented Feb 8, 2020

In #131, the Tables.jl API has been enhanced and clarified so that Row and Columns objects (i.e. things that get returned from Tables.rows and Tables.columns) support the same interface:

  • Tables.getcolumn(row_or_columns, i::Int): get a column value (Row) or entire column (Columns) by column index
  • Tables.getcolumn(row_or_columns, nm::Symbol): get a column value (Row) or entire column (Columns) by column name
  • Tables.getcolumn(row_or_columns, ::Type{T}, i::Int, nm::Symbol): get a column value (Row) or entire column (Columns) by column eltype (T), column index, and column name
  • Tables.columnnames(row_or_columns): return column names of a Row or Columns object

While this streamlines and simplifies the interface, it was discussed that it potentially makes things less convenient in casual settings when users want/expect things like indexing, iteration, and property-access to work on rows or columns. With the new 1.0 API, my hope is that it helps clarifies things that you just use Tables.getcolumn and Tables.columnnames) instead of wondering whether this or that interface is supported. Also note that two new abstract types, Tables.AbstractRow and Tables.AbstractColumns were added, which custom row/columns types can subtype in order for their type to act like a NamedTuple (row) or NamedTuple of vectors (columns). That is, by subtyping one of those types, the custom types automatically have the indexing, iteration, dictionary, and property-access interfaces implemented by just using the getcolumn and columnnames definitions. My hope is that while the Tables.jl API becomes less convenient for casual use, it also encourages consistent behavior across row/columns types.

@quinnj quinnj closed this as completed Feb 8, 2020
@quinnj
Copy link
Member

quinnj commented Feb 11, 2020

Ok, one more thought I just had on this is that we could perhaps have something like the following defined in Tables.jl:

struct Row{T} <: Tables.AbstractRow
    x::T
end

Tables.getcolumn(x::Row, i::Int) = Tables.getcolumn(x.x, i)
Tables.getcolumn(x::Row, nm::Symbol) = Tables.getcolumn(x.x, nm)
Tables.getcolumn(x::Row, ::Type{T}, i::Int, nm::Symbol) where {T} = Tables.getcolumn(x.x, T, i, nm)
Tables.columnnames(x::Row) = Tables.columnnames(x.x)

which would then allow a user to do something like:

x = 0.0
for raw_row in Tables.rows(table)
    row = Tables.Row(raw_row)
    x += sum(row)
end

i.e. the very simple Tables.Row struct would wrap any interface-implementing "row" object and because Tables.Row <: Tables.AbstractRow, it would have a ton of desired properties, indexing, iteration, property-access, abstract dict, etc.

Do people think it's worth defining something like that? Would it be useful for more casual users in order to not lose convenience? It's not a ton of code and it's not super complicated, so I'm inclined to do it, but I also just wonder if people would actually use it.

@nalimilan
Copy link
Member

Sounds like it could be useful. And it would be clearer than having abstract type Row end.

@quinnj
Copy link
Member

quinnj commented Feb 12, 2020

Just to wrap this issue up for those on this thread: for the 1.0 release, we added a Tables.Row(row) wrapper struct that will take any Tables.AbstractRow-compatible object (i.e. any plain row iterated from Tables.row row iterator) and give it all the useful, convenient behavior discussed here and elsewhere: iteration, indexing with both index and name, property access, AbstractDict methods. Hopefully that's a good compromise to allowing more useful rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants