Make row lookup easier #3051
Comments
I'm not a fan of this. You only save 3 characters and the syntax is harder to understand. It wouldn't really improve the workflow discussed in the Discourse thread, would it?
It would not - that is why we did not implement it originally. It is just non-conflicting and would make the lives of newcomers slightly easier. I wanted to discuss it to make sure we are OK with the current design.
I was thinking of the case of a unique index.
This then allows easy lookup, but loses all the other DataFrame functionality. There are so many times I want to do several lookups on different columns but using the same index column. Example:
Currently I'm calling a function for each group. The function then converts each sub-DataFrame to named arrays.
I am moving it to the 1.5 release. Essentially the requirement is for a lookup function. So this would be something like (in general this could probably be made more efficient, but it shows the idea - please look at the API, not at the implementation):
This would allow for easy indexing like (referring to the example above):
Do we think it is worth adding something like this?
If this is for a unique index, would it be better to return a DataFrameRow rather than a 1-row DataFrame? Then it would be nice to have the following.
Yes - it was a typo - I meant
This cannot be done, as that form of indexing is already taken. What we could consider doing is to add syntax like:
with the requirement that the condition must identify a unique row. @nalimilan - what do you think?
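Since the snippet this comment referred to did not survive extraction, here is a hedged sketch of a helper with that behavior (the name `getrow` and the pair-based signature are illustrative assumptions, not an actual DataFrames.jl API):

```julia
using DataFrames

# Hypothetical helper emulating the proposed "unique row by value" lookup:
# return the single DataFrameRow matching all column => value pairs,
# erroring when zero or several rows match.
function getrow(df::AbstractDataFrame, conds::Pair...)
    mask = trues(nrow(df))
    for (col, val) in conds
        mask .&= df[!, col] .== val
    end
    idx = findall(mask)
    length(idx) == 1 ||
        throw(ArgumentError("expected exactly one matching row, found $(length(idx))"))
    return df[only(idx), :]   # a DataFrameRow
end

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])
getrow(df, :name => "Sally")                 # DataFrameRow for Sally
getrow(df, :name => "Sally", :age => 42.0)   # several conditions, still unique
```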
It seems problematic to assume that the condition always identifies a unique row. I feel like we're not targeting the real problem here. Maybe the problem is just that when using DataFramesMeta you have to write something like this:

```julia
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

@chain df begin
    @transform :Age_relative_to_Sally = :age .- only(:age[:name .== "Sally"])
end
```

Or even without `only`:

```julia
@chain df begin
    @transform :Age_relative_to_Sally = :age .- :age[:name .== "Sally"]
end
```

Maybe what's missing is a way to do the same using the plain `source => function => target` syntax.
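For comparison, here is one way the same computation can be written today with DataFrames.jl's plain `source => function => target` mini-language, without DataFramesMeta (a sketch; the column name `:Age_relative_to_Sally` is taken from the example above):

```julia
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

# Whole-column transformation: subtract Sally's age from every age.
transform(df,
    [:age, :name] =>
        ((age, name) -> age .- only(age[name .== "Sally"])) =>
        :Age_relative_to_Sally)
```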
CC @pdeffebach
I can't think of an easy way to mark just some columns as "whole column" references and others as not. The current implementation just takes the anonymous function created by the parsing and wraps it. We could have something that expands to a whole-column lookup instead. Aren't there array types that allow for indexing by name? That seems pretty clean to me.
That's a really nice syntax if it's possible. With NamedArrays.jl you can do something along those lines. But of course changing the NamedArray won't change the column.
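A sketch of the kind of NamedArrays.jl lookup referred to above (assuming the constructor that takes dimension names as a tuple of vectors):

```julia
using DataFrames, NamedArrays

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])

# Build a NamedArray from two columns to get dictionary-like lookup by name.
age = NamedArray(df.age, (df.name,))
age["Sally"]   # 42.0
```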
This still has some performance issues, though.
OK, suppose there are a few columns you would like to attach the index to.
Could this be condensed to something like the following?
And could the NamedArray functionality be merged with the DataFrame column such that you can alter cells as well as reference them? E.g.:
Something I just realised: DataFrame columns can be defined as NamedArrays, and cell values can then be referenced within a transformation.
https://discourse.julialang.org/t/why-is-it-so-complicated-to-access-a-row-in-a-dataframe/103162/10
As indicated in the post above, "The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple)". Although I don't think this is really in line with the philosophy of DataFrames.jl or its internal implementation, one feature I really like in Pandas is MultiIndex (https://pandas.pydata.org/docs/user_guide/advanced.html). Even if they're not obvious to use, they make the DataFrame much more readable and could help solve this kind of problem.
I had some bad experiences with pandas.MultiIndex. Would something as flexible as Boost.MultiIndex be more desirable?
Having thought about it:
The syntax (note a slight change, but this is a minor thing):
is tempting, but in the past @nalimilan and I wanted to avoid adding too much functionality to the indexing of data frames. The alternative would be to define something like:
which would be a wrapper around the data frame with a precomputed index. Then one could write the lookup as plain indexing into that wrapper. If only a single lookup were required, then one could of course write it directly with the existing syntax.
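Since the concrete code in this comment did not survive, here is a hedged sketch of what such a wrapper could look like (`lookup` and `RowLookup` are illustrative names; the indexing style follows the usage shown in the next comment):

```julia
using DataFrames

# Illustrative sketch, not DataFrames.jl API: build a Dict from key tuples to
# row numbers once, then do repeated O(1) row lookups via indexing.
struct RowLookup{D<:AbstractDataFrame}
    df::D
    index::Dict{Any,Int}
end

function lookup(df::AbstractDataFrame, cols::Symbol...)
    index = Dict{Any,Int}()
    for (i, key) in enumerate(zip((df[!, c] for c in cols)...))
        haskey(index, key) && throw(ArgumentError("key $key is not unique"))
        index[key] = i
    end
    return RowLookup(df, index)
end

Base.getindex(l::RowLookup, key...) = l.df[l.index[key], :]   # DataFrameRow

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])
l = lookup(df, :name, :age)
l["Sally", 42.0]   # DataFrameRow for Sally
```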
Is the problem here the syntax or the speed? If it's the former, I'm not sure there's any benefit from notation like this:

```julia
df[(:name => "Sally", :age => 49), :]
lookup(df, :name, :age)["Sally", 49]
```

relative to the existing boolean-mask indexing. We'd avoid writing only a little boilerplate. I'm usually against adding new notation, especially if it creates additional ways of doing the same thing while having approximately the same readability/number of characters. Maybe it's a problem more related to documentation? In the indexing documentation, I noticed that the explanations for column selectors and row indexing are intertwined. Moreover, there's only one example with the broadcasting notation, and it's at the end of the section. If the problem is speed, then I agree that having an optimized function like `lookup` could be beneficial.
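For reference, a runnable version of the existing idioms these proposals are compared against (values are from the running example in this thread):

```julia
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])

df[df.name .== "Sally", :]          # boolean mask -> 1-row DataFrame
only(df[df.name .== "Sally", :])    # -> DataFrameRow; errors unless exactly one row matches
only(df[df.name .== "Sally", :age]) # -> 42.0, the single matching value
```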
My proposal with `lookup` was about both speed and correctness (ensuring that exactly one row is matched). So the question is whether we want the addition of such a function.
If it's for speed and correctness, I like the idea. Maybe we could think more about the API and the name of the function? I'm thinking of two possibilities. If it emulates `filter` (in the sense of a function returning a data frame), something along the lines of:

```julia
rowsubset(df, :id => 10, :name => "John")
```

The name makes it clear that it should return a row and that the API is similar to `subset`. Regardless of the internals, it makes it clear that it emulates that family of functions. If the intention is for it to be used for indexing, maybe a different name would be better.

BTW, is the use of grouped dfs for selecting multiple rows documented?
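A hedged sketch of the `rowsubset` idea described above, expressed in terms of `subset` (the name and pair-based call come from this comment; the implementation is only illustrative):

```julia
using DataFrames

# Emulate subset's argument style via column => value pairs, but return the
# single matching DataFrameRow (erroring when zero or several rows match).
rowsubset(df::AbstractDataFrame, conds::Pair...) =
    only(eachrow(subset(df, (c => ByRow(==(v)) for (c, v) in conds)...)))

df = DataFrame(x=1.0:4.0, id=[10, 10, 20, 20], name=["John", "Sally", "John", "Sally"])
rowsubset(df, :id => 10, :name => "John")   # DataFrameRow with x = 1.0
```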
Call the function `indexby`, say:

```julia
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age
```

(I first considered just overloading an existing function.)
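A minimal sketch of one way `indexby` could work on top of `groupby` (illustrative only; `indexby` is a proposal in this thread, not an existing function):

```julia
using DataFrames

struct IndexedDataFrame{G<:GroupedDataFrame}
    gdf::G
end

# Wrap groupby; on access return the single DataFrameRow for a key,
# erroring when the key maps to zero or several rows.
indexby(df::AbstractDataFrame, cols) = IndexedDataFrame(groupby(df, cols))

Base.getindex(idf::IndexedDataFrame, key...) = only(eachrow(idf.gdf[key]))

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age   # 42.0
```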
I agree. I have avoided grouped data frames until today, and after today I still find them confusing. This is a lot of parentheses/brackets to index two keys: (link)

```julia
iris_gdf = groupby(iris, :Species)
iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]]
```

Why does this error?

```julia
julia> iris_gdf["Iris-virginica"]
ERROR: ArgumentError: invalid index: "Iris-virginica" of type String
```

It is the API proposed above for by-value lookup. It should also work for grouped data frames.
See the first post of this discussion. That was the original proposal. The idea is that you need a tuple, so that a group key is not confused with a positional group index.

I misinterpreted the whole discussion, I think. Given the benchmarks I showed, I thought the discussion was about performing fast indexing, but it's more related to subsetting a data frame on the assumption that you'll perform multiple operations on it. Maybe there should be a clarification of which approach to use if you just want to index. I'm still unsure about the benefits of adding this functionality, even more so now that it's not necessarily more efficient.
This is an issue with legacy integer indexing of `GroupedDataFrame`. Because of this I proposed the alternative syntax above.
This would be consistent. The only drawback is that we would use a syntax different from normal indexing.
DataFrames 2.0?
Square brackets would be best, but I like the proposed alternative too.
Could be, but realistically DataFrames 2.0 is like Julia 2.0 - a very distant future.
Yes, the iteration protocol is separate from the indexing protocol. You would be able to index as before.

@nalimilan - having said that, maybe, even though it could create a minor confusion, we could start allowing:
for a by-value lookup? The rule would be that everything that is not an `Integer` is treated as a group key. As I have said - this creates a minor confusion/API inconsistency, but maybe it is better than requiring users to always use the tuple wrapper? If we went for this then we could even drop the idea of making `GroupedDataFrame` callable (and the related syntax).

So, to put concrete proposals on the table, I tend to like adding one of the two forms discussed above (value-based indexing of `GroupedDataFrame`, or the callable form). The point is that the current syntax requires the tuple wrapper, which is easy to get wrong. What do you think?
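To make the callable-`GroupedDataFrame` direction concrete, here is a hedged sketch of what it boils down to, written as a plain helper rather than by overloading call on `GroupedDataFrame` (the helper name `grouplookup` is made up):

```julia
using DataFrames

# In the proposal, gdf(keys...) would simply mean gdf[(keys...,)]
grouplookup(gdf::GroupedDataFrame, keys...) = gdf[(keys...,)]

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])
gdf = groupby(df, :name)
grouplookup(gdf, "Sally")   # same as gdf[("Sally",)] -> 1-row SubDataFrame
```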
I prefer the rip-the-band-aid DataFrames 2.0 solution, but I understand.
I am unfortunately against this since `Integer` cannot be included. It seems like an easy way for silent "correctness" errors to slip in. I guess misunderstanding the integer indexing is already possible, but allowing indexing to work one way for integers and another way for other values seems risky:

```julia
julia> df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
3×3 DataFrame
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3
   2 │ Sally      42.0         5
   3 │ Kirk       59.0         2

julia> gdf = groupby(df, :children)
GroupedDataFrame with 3 groups based on key: children
First Group (1 row): children = 2
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Kirk       59.0         2
⋮
Last Group (1 row): children = 5
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Sally      42.0         5

julia> gdf[2]
1×3 SubDataFrame
 Row │ name    age      children
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3
```

Syntax is okay, but would this be performant? I imagine the grouping step is not free.
Yes, using a `GroupedDataFrame` is fast for repeated lookups, since the grouping (and its key index) is computed only once.
I tried to create a summary of the current and proposed methods for looking up a value. Please add or correct as necessary.

```julia
using DataFrames, DataFramesMeta

# Single Category
## Definition
df = DataFrame(
    x = 1.0:8.0,
    id = 1001:1008,
)

## Group Method
gdf = groupby(df, :id)
xval = only(gdf[(1007,)]).x                        # Existing
xval = only(gdf(1007)).x                           # Proposed
idf = indexby(df, :id)
xval = idf(1007).x                                 # Proposed

## Subset Method
xval = only(subset(df, :id => ByRow(==(1007)))).x  # Existing
xval = only(@rsubset(df, :id == 1007)).x           # Meta

## Index Method
xval = only(df[df.id .== 1007, :x])                # Existing
xval = @with(df, df[:id .== 1007, ^(:x)])          # Meta
xval = only(df[:id => 1007, :x])                   # Proposed

## Filter Method
xval = only(filter(:id => ==(1007), df)).x         # Existing

# Multiple Categories
## Definition
df = DataFrame(
    x = 1.0:8.0,
    cat1 = [1,2,3,4,1,2,3,4],
    cat2 = ["A","A","A","A","B","B","B","B"],
)

## Group Method
gdf = groupby(df, [:cat1, :cat2])
xval = only(gdf[(3, "B")]).x                       # Existing
xval = only(gdf(3, "B")).x                         # Proposed
idf = indexby(df, [:cat1, :cat2])
xval = idf(3, "B").x                               # Proposed

## Subset Method
xval = only(subset(df, :cat1 => ByRow(==(3)), :cat2 => ByRow(==("B")))).x  # Existing
xval = only(@rsubset(df, :cat1 == 3, :cat2 == "B")).x                      # Meta

## Index Method
xval = only(df[df.cat1 .== 3 .&& df.cat2 .== "B", :x])       # Existing
xval = @with(df, df[:cat1 .== 3 .&& :cat2 .== "B", ^(:x)])   # Meta
xval = only(df[(:cat1 => 3, :cat2 => "B"), :x])              # Proposed

## Filter Method
xval = only(filter([:cat1, :cat2] => ((x, y) -> x .== 3 .&& y .== "B"), df)).x  # Existing
```

The objective to me is to simplify the syntax for obtaining `xval`. Note that these forms error:
```julia
julia> gdf[1007]
ERROR: BoundsError: attempt to access 8-element Vector{Int64} at index [1007]

julia> gdf["v"]
ERROR: ArgumentError: invalid index: "v" of type String
```

I'm not sure now if the callable syntax is what I want. For some reason, I find the existing options unintuitive. I think my favorite solution, though, is the Pair indexing that was dismissed for performance reasons.
If I have not missed anything, you have captured it all correctly. A small comment is that, in general, the grouping-based approach is the one intended for repeated lookups.
I find myself wanting this again. Any chance of adding it?
For the case of a unique index column: maybe you could make it so that `df[i] = di[i]` for some specified index column, to give the above functionality without explicitly creating the dictionary object.
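The explicit-dictionary workaround alluded to above, as a hedged sketch (column names match the earlier summary example):

```julia
using DataFrames

df = DataFrame(x=1.0:8.0, id=1001:1008)

# Build the dictionary once, then look rows up by the unique index column.
di = Dict(df.id .=> eachrow(df))
di[1007].x   # 7.0
```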
OK, let us try to finalize this. There are two cases: a) single lookup, b) many lookups.

Case a. Single lookup

Here the current syntax is the boolean-mask form or the `subset`-based form (see the summary above). The question is whether we find it too cumbersome and, if yes, what the proposed alternative is.

Case b. Multiple lookups

First group the data frame, and then efficiently index into the `GroupedDataFrame` (a runnable sketch follows below). I think this pattern is OK, as if someone wants efficiency this is relatively short (note that you need two separate statements to ensure efficiency). But again - please comment on how you see this.
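A runnable sketch of the two-statement pattern for case b, using the columns from the summary above: pay the grouping cost once, then do repeated keyed lookups.

```julia
using DataFrames

df = DataFrame(x=1.0:8.0, id=1001:1008)

gdf = groupby(df, :id)    # statement 1: build the keyed structure once
only(gdf[(1007,)]).x      # statement 2 (repeatable): fast lookup -> 7.0
only(gdf[(1003,)]).x      # another lookup against the same gdf -> 3.0
```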
Keep in mind that row lookup is something a very early beginner will want to do: "How old is Sally?", but some of the concepts above are relatively advanced:
By comparison, I think these require less Julia knowledge to use:
Yes - that is why I am keeping this discussion 😄. I want to find a solution for a beginner. We could also define for single lookup:
and also a kwarg variant
For multiple lookups, the question is whether it needs to be added at all (i.e. whether a beginner would really need this fast variant).
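A hedged sketch of what the kwarg variant of a single-lookup helper could look like (the name `lookuprow` is a placeholder; this is not an existing or agreed API):

```julia
using DataFrames

# Single lookup by keyword, built on groupby; returns the one matching
# DataFrameRow and errors when zero or several rows match.
function lookuprow(df::AbstractDataFrame; kwargs...)
    cols = collect(keys(kwargs))
    gdf = groupby(df, cols)
    return only(eachrow(gdf[Tuple(values(kwargs))]))
end

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.])
lookuprow(df; name="Sally").age   # 42.0
```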
Would it be possible to index a grouped data frame with a plain key?
@nathanrboyer Is this feature for making indexing easier in general? Or is it for indexing where you know you are only going to return one row (or maybe 0 rows)? If it's for indexing with one row, do you want this check to happen when a new object is created? Like:
I understand that it is indexing when exactly one row is returned, and that @nathanrboyer is looking for a simpler syntax than we have now. Side note: the snippet above needs a small tweak.
I would find it weird to offer a convenience syntax for selecting a single row without an equivalent for selecting multiple rows. So far I don't see a good solution that would be significantly more convenient than what we have now without increasing the complexity of the API with new ad-hoc concepts or syntaxes...
The existing approaches do work, however. Maybe good enough documentation can close this knowledge gap without new syntax and point people to the right tool for each situation. On the other hand, though, some cases remain awkward, and in those I would probably point people to a different tool. It is just hard to know what to reach for unless you are already familiar with everything.
@pdeffebach - would you accept adding to DataFramesMeta.jl something like:
that would return a single row or error if no row or multiple rows are found?
Yes, I would!
Great. @nathanrboyer - I will close this issue once the functionality lands in DataFramesMeta.jl.
Maybe it should be called something else, though.
I hate to say this, but if the functionality is going to require DataFramesMeta.jl, then it doesn't gain much over what DataFramesMeta.jl already offers:

```julia
@lookup(df, :x == 1, :y == 3) == only(@rsubset(df, :x == 1, :y == 3))
```

(I edited my above post, in which I previously thought otherwise.) The issue then is convincing DataFrames.jl users to use DataFramesMeta.jl:
I've still yet to dive deeply into DataFramesMeta.jl, thinking it is probably overkill for what I usually need to do. However, it seems to be simpler and more helpful than I gave it credit for.
I may be biased, but users should probably go to DataFramesMeta.jl first rather than after they learn DataFrames.jl |
I think you are right. I also think my misunderstanding, that DataFramesMeta.jl is an extension of DataFrames.jl for advanced power users, is a common one. |
I will add a blog post about it :), and also update the documentation of DataFrames.jl. @pdeffebach - probably one could also write something like this in the DataFramesMeta.jl documentation to be more explicit (as per JuliaData/DataFramesMeta.jl#393). I am closing this issue; if we decide we need more, we can revisit it.
Original issue text:

This is a speculative idea. Maybe we could define `GroupedDataFrame` to be callable. In this way, instead of writing `gdf[("val",)]`, users could write `gdf("val")`.

@nalimilan, @pdeffebach - what do you think?

Ref: https://discourse.julialang.org/t/any-plan-for-functionality-like-pandas-loc/81134