[aggr-] allow ranking rows by key column #2417

midichef · 2024-05-31T06:54:34Z

This PR adds a rank aggregator that returns a list, and a command addcol-rank, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.

It also fixes a bug in memo-aggregate where long output takes an extremely long time to show up in the statusbar.
For example: seq 1222333 |vd -, then z+ list. After the list is calculated, visidata will get stuck for many seconds showing processing…, because it's very slow to run format() on a long sequence.
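
One way to keep a status message cheap is to format only the first few items of a long result. The sketch below is an illustration of that idea, not the code in this PR; the helper name status_preview and the max_items parameter are hypothetical.

```python
from itertools import islice

def status_preview(values, max_items=5):
    """Format at most max_items elements of a possibly huge sequence.

    Hypothetical helper: only a short prefix is ever converted to text,
    so the cost no longer grows with the length of the sequence.
    """
    # Read one extra item so we can tell whether anything was cut off.
    head = list(islice(values, max_items + 1))
    shown = ", ".join(repr(v) for v in head[:max_items])
    suffix = ", ..." if len(head) > max_items else ""
    return f"[{shown}{suffix}]"

print(status_preview(range(1, 1222334)))
```

A time-based cap would be another option; this sketch caps the item count only because that is simpler to show.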

I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?

Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:

File "/home/midichef/.local/lib/python3.10/site-packages/visidata/aggregators.py", line 169, in rank
    keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
TypeError: '<' not supported between instances of 'float' and 'list'

What's the standard way to handle sorting mixed types for Visidata?

saulpw · 2024-06-06T00:04:06Z

What's the standard way to handle sorting mixed types for Visidata?

The standard way is to convert the column into a known type; anything that can't be converted (errors and nulls) then becomes a TypedWrapper, which is sortable with any type. Does that work acceptably here too?
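
The idea of "convert, and wrap whatever fails in something sortable" can be sketched in plain Python. The class Unconvertible and the function to_float below are hypothetical stand-ins, not VisiData's TypedWrapper; in particular, sorting wrapped values before all real values is an assumption made for this example.

```python
from functools import total_ordering

@total_ordering
class Unconvertible:
    """Hypothetical wrapper for a value that failed type conversion.

    Assumption for this sketch: wrapped values sort before any real value
    and compare equal to each other.
    """
    def __init__(self, raw):
        self.raw = raw

    def __eq__(self, other):
        return isinstance(other, Unconvertible)

    def __lt__(self, other):
        # Less than every value that is not itself a wrapper.
        return not isinstance(other, Unconvertible)

def to_float(value):
    """Convert to float, wrapping anything that cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return Unconvertible(value)

raw_keys = ["3.5", [1, 2], "1", None, "2"]
ordered = sorted(to_float(v) for v in raw_keys)
print([v.raw if isinstance(v, Unconvertible) else v for v in ordered])
```

With the column converted this way, sorted() no longer hits the float-versus-list comparison from the traceback above, because every pair of values has a defined ordering.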

midichef · 2024-06-22T08:17:37Z

Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column?

saulpw · 2024-07-01T05:58:51Z

Since it's not obvious which type to pick, the user can convert the column.

saulpw

I love what this is adding, and I think with a few tweaks it would be even more powerful!

visidata/aggregators.py

midichef · 2024-07-20T07:49:09Z

There are two kinds of ranking operations people may want.

keycol-based rank within sheet: what is the rank of this row, vs. all rows in the sheet, ranking by the value of its key columns? In this example, the key column is keycol, and the current column col is ignored:

keycol	col	keycol_sheetrank
1	10	1
1	20	1

2	60	2
2	50	2
2	30	2
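
A minimal sketch of this first kind of rank, assuming a dense ranking (equal keys share a rank and no rank numbers are skipped), which is consistent with the table above. The function name sheet_rank is hypothetical and this is not the PR's implementation.

```python
def sheet_rank(keys):
    """Return a 1-based dense rank for each key; equal keys share a rank.

    Each key may be a single value or a tuple of key-column values.
    """
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]

keycol = [1, 1, 2, 2, 2]
print(sheet_rank(keycol))  # [1, 1, 2, 2, 2]
```

This sketch requires the keys to be hashable and mutually comparable, so mixed types would need to be converted first.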

column-based rank within group: when grouping the rows by key columns, what is the rank of this row, within its group? The current column determines the rank. In this example, the current column is col:

keycol	col	col_grouprank
1	10	1
1	20	2

2	60	3
2	50	2
2	30	1
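
A minimal sketch of this second kind of rank, reproducing the table above. The function name group_rank is hypothetical, and tie handling is an assumption: tied values here get consecutive ranks in row order, which the thread does not specify.

```python
from collections import defaultdict

def group_rank(keys, values):
    """Rank each row's value within the group of rows sharing its key."""
    # Collect row indexes per key, preserving row order.
    groups = defaultdict(list)
    for i, key in enumerate(keys):
        groups[key].append(i)

    ranks = [None] * len(keys)
    for indexes in groups.values():
        # Sort the group's rows by value, then number them from 1.
        by_value = sorted(indexes, key=lambda i: values[i])
        for rank, i in enumerate(by_value, start=1):
            ranks[i] = rank
    return ranks

keycol = [1, 1, 2, 2, 2]
col = [10, 20, 60, 50, 30]
print(group_rank(keycol, col))  # [1, 2, 3, 2, 1]
```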

What is a good name for these two aggregators? sheetrank and grouprank? Or maybe rank_key and rank_col?
Any suggestions?


midichef · 2024-07-29T04:38:21Z

Okay, I implemented a command that adds a column and applies an aggregator to rows after grouping them by key columns. It's addcol-aggregate.

To get this to work with list aggregators, I made a new class ListAggregator for aggregators that return lists. Their most common use would be with addcol-aggregate. Right now the only two ListAggregators are list and rank.
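
The difference between an ordinary aggregator and a list aggregator in this setting can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's code: the function addcol_aggregate and its returns_list flag are hypothetical, and rows are represented by parallel lists of keys and values.

```python
from collections import defaultdict
from statistics import mean

def addcol_aggregate(keys, values, aggregator, returns_list=False):
    """Build a new column by applying aggregator to each group of rows.

    A scalar aggregator gives every row in the group the same result.
    A list aggregator (returns_list=True) is assumed to return one item
    per row in the group, in the group's row order.
    """
    groups = defaultdict(list)
    for i, key in enumerate(keys):
        groups[key].append(i)

    new_col = [None] * len(keys)
    for indexes in groups.values():
        result = aggregator([values[i] for i in indexes])
        for pos, i in enumerate(indexes):
            new_col[i] = result[pos] if returns_list else result
    return new_col

keys = ["a", "a", "b"]
vals = [1, 3, 10]
print(addcol_aggregate(keys, vals, mean))                     # [2, 2, 10]
print(addcol_aggregate(keys, vals, list, returns_list=True))  # [1, 3, 10]
```

With the list aggregator, the new column simply mirrors the input column, which is why the two columns would be expected to look the same.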

I also tried making a sheetrank aggregator, but it's too different from normal aggregators. Normal aggregators apply to a column, but sheetrank is more for the sheet. So I broke it out into a separate command, addcol-sheetrank.

I'm a bit unsure about the new behavior of the list aggregator when used with addcol-aggregate. Right now, if the input column has cells with Exceptions, they show up in the new column, but the error text is shown directly in the cell instead of being hidden behind an error note (!). So I could use guidance on a couple of issues here:

Should these Exceptions be passed through by the list aggregator, or should they be translated to null?
If they should be passed through the aggregator, how do I make them look/behave like the original cell with an exception?
The relevant code is here:

visidata/visidata/aggregators.py

Line 120 in c6c608e

vals = [ col.getTypedValue(r) for r in row_group ]

To see it in action, vd sample_data/test.jsonl, then addcol-aggregate list. The key1 and key1_list columns ought to look the same, but the fourth cell in key1_list reads Expecting ':' delimiter: line 1 column 34 (char 33) instead of empty.


midichef · 2024-07-29T04:54:09Z

There is one detail about the grouping in addcol-aggregate. If the key column holds multiple cells with null, all nulls are grouped together as one group of rows. But if the key column holds multiple error cells, each error cell forms its own unique group of 1 row, even if all the errors have the same traceback text. I didn't design this; it's just how it behaved on sorting. Does that error cell treatment sound reasonable?
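
Plain Python shows how this kind of asymmetry can arise. The sketch below is only an illustration and has not been checked against VisiData's grouping code: None compares equal to None, while two exception objects with identical text are still distinct objects and compare unequal by default.

```python
from itertools import groupby

# Two separate error objects with the same message text.
err1 = ValueError("Expecting ':' delimiter")
err2 = ValueError("Expecting ':' delimiter")

# Keys are already arranged so that equal keys are adjacent.
keys = [None, None, err1, err2]

# groupby starts a new group whenever the key is not equal to the previous one.
group_sizes = [len(list(group)) for _, group in groupby(keys)]
print(group_sizes)  # [2, 1, 1]
```

Grouping errors together would require comparing them by something other than object identity, such as their message text.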

saulpw requested changes Jul 1, 2024


midichef added 3 commits July 28, 2024 21:17

[aggr-] cap runtime when formatting memo status

daeccd0

[aggr-] fix chooser lacking aggs starting with 'p'

43b202e

[aggr-] display stdev error note for lists of size 1

6c7e178

It is needed now that addcol-aggregate can apply stdev to groups, which may include lists of size 1.

midichef force-pushed the aggr_rank branch 2 times, most recently from 8078fb6 to c6c608e Compare July 29, 2024 04:20

midichef requested a review from saulpw July 29, 2024 04:38

[listaggr-] add rank aggregator, add cmd addcol-aggregate

3ef6238

Also adds a command addcol-sheetrank.

midichef force-pushed the aggr_rank branch from c6c608e to 3ef6238 Compare July 29, 2024 04:53

midichef mentioned this pull request Sep 9, 2024

addcol-window should pad first list with None to indicate no rows above first row #2279

Closed

anjakefala added 3.1 waiting on maintainer labels Sep 22, 2024
