
Stream from stdin rather than doing stdin.readlines() #138

Open
pjvandehaar opened this issue Nov 30, 2016 · 2 comments

pjvandehaar commented Nov 30, 2016

Currently, tabview doesn't work well when used with large or unending files. For example, cat /dev/urandom | tr -cd "fish,\n" | tabview - doesn't work.

I'd like stdin to be read only as needed. Maybe this would also let tabview display iterators when used from Python.

(Do you know of any other commandline csv-viewer that does streaming to handle large files? I haven't found one.)

Changes that will be needed:

  1. process_data needs to be a generator. Then view() will do data_processor = process_data(...). Viewer will do csv_data.append(next(data_processor)) when it reaches the end of csv_data.
  2. detect_encoding() will be run on the first 1000 lines to determine enc. After those lines are exhausted, detect_encoding() will be run on each new line, updating enc if needed.
  3. pad_data() can't happen in process_data. Viewer will run csv_data = pad_data(csv_data) if a new line from data_processor is longer than self.num_data_columns.
  4. Viewer needs to have a few minor changes.

Forward searching will still work, and will just rapidly consume lines from data_processor.

When the user tries to sort, Viewer will do csv_data.extend(data_processor), which might take too long or possibly forever. User's problem.
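That sort step is essentially a one-liner on top of the generator. The function name and signature below are hypothetical, not tabview's API:

```python
def sort_by_column(csv_data, data_processor, col):
    # Consume everything remaining on stdin before sorting; for an
    # unending stream this never returns -- as noted, the user's problem.
    csv_data.extend(data_processor)
    return sorted(csv_data, key=lambda row: row[col])
```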

Later on, it'd be fun to make mode and max column widths update as new data is read in, by keeping a collections.Counter of cell widths per column, updating it for each new line, and recomputing self.column_width as needed.
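A running Counter makes that incremental update cheap. A rough sketch (class and method names are invented for illustration):

```python
from collections import Counter

class ColumnWidths:
    """Incrementally track the mode (most common) cell width per column."""
    def __init__(self):
        self.counters = {}   # column index -> Counter mapping width -> count

    def update(self, row):
        """Fold one new row into the per-column width counts."""
        for i, cell in enumerate(row):
            self.counters.setdefault(i, Counter())[len(cell)] += 1

    def mode_width(self, col):
        """Most common width seen so far in this column."""
        return self.counters[col].most_common(1)[0][0]
```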

The problems this introduces are:

  1. When self.column_width_mode is max or mode, the width won't reflect rows that haven't been read yet.
  2. Early lines could be legal as both utf8 and latin1. But maybe later lines would be illegal as utf8, meaning that the earlier ones should have been interpreted as latin1. But now we've already decoded them, looked at them, and forgotten the original binary data. Is this likely?
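The ambiguity in point 2 is real: bytes that decode cleanly as UTF-8 can be followed later by bytes that don't, and by then the original bytes are gone. A minimal demonstration:

```python
early = b'caf\xc3\xa9'   # valid as UTF-8 ("café") and as Latin-1 ("cafÃ©")
late = b'caf\xe9'        # valid Latin-1 ("café"), but illegal UTF-8

early.decode('utf-8')    # succeeds, so we commit to UTF-8
try:
    late.decode('utf-8')
except UnicodeDecodeError:
    # Too late to reinterpret `early` as Latin-1: only its decoded
    # text was kept, not the original bytes.
    pass
```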

If you consider these drawbacks quite bad, I'd be happy with a flag --stream.

If I start work on a PR, do you have any recommendations?

@pjvandehaar pjvandehaar changed the title Stream from stdin rather than reading all into memory at once Stream from stdin rather than doing stdin.readlines(). Nov 30, 2016
@pjvandehaar pjvandehaar changed the title Stream from stdin rather than doing stdin.readlines(). Stream from stdin rather than doing stdin.readlines() Nov 30, 2016
wavexx commented Nov 30, 2016 via email

pjvandehaar commented Nov 30, 2016

  1. You run tabview myhugefile.csv.
  2. You press c to get mode-widths.
  3. You scroll down a bunch.
  4. You press c again.
    • Now does tabview recalculate the mode width and show that?
    • Or does it switch to max width, since it already showed mode-width? (the current behavior)
