
Stream from stdin rather than doing stdin.readlines() #138

Open
pjvandehaar opened this issue Nov 30, 2016 · 2 comments

pjvandehaar commented Nov 30, 2016

Currently, tabview doesn't work well when used with large or unending files. For example, cat /dev/urandom | tr -cd "fish,\n" | tabview - doesn't work.

I'd like stdin to be read only as needed. Maybe this would also let tabview display iterators when used from Python.

(Do you know of any other commandline csv-viewer that does streaming to handle large files? I haven't found one.)

Changes that will be needed:

  1. process_data needs to be a generator. Then view() will do data_processor = process_data(...). Viewer will do csv_data.append(next(data_processor)) when it reaches the end of csv_data.
  2. detect_encoding() will be run on the first 1000 lines to determine enc. After those lines are exhausted, detect_encoding() will be run on each new line, updating enc if needed.
  3. pad_data() can't happen in process_data. Viewer will run csv_data = pad_data(csv_data) if a new line from data_processor is longer than self.num_data_columns.
  4. Viewer needs to have a few minor changes.

Forward searching will still work, and will just rapidly consume lines from data_processor.

When the user tries to sort, Viewer will do csv_data.extend(data_processor), which might take too long or possibly forever. User's problem.
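That sort step is essentially a one-liner on top of the generator. The function name and signature below are hypothetical, not tabview's API:

```python
def sort_by_column(csv_data, data_processor, col):
    # Consume everything remaining on stdin before sorting; for an
    # unending stream this never returns -- as noted, the user's problem.
    csv_data.extend(data_processor)
    return sorted(csv_data, key=lambda row: row[col])
```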

Later on, it'd be fun to make mode and max column widths update as new data is read in, by keeping a collections.Counter of cell widths per column, updating it for each new line, and recomputing self.column_width as needed.
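A running Counter makes that incremental update cheap. A rough sketch (class and method names are invented for illustration):

```python
from collections import Counter

class ColumnWidths:
    """Incrementally track the mode (most common) cell width per column."""
    def __init__(self):
        self.counters = {}   # column index -> Counter mapping width -> count

    def update(self, row):
        """Fold one new row into the per-column width counts."""
        for i, cell in enumerate(row):
            self.counters.setdefault(i, Counter())[len(cell)] += 1

    def mode_width(self, col):
        """Most common width seen so far in this column."""
        return self.counters[col].most_common(1)[0][0]
```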

The problems this introduces are:

  1. When self.column_width_mode is max or mode, the width won't reflect rows that haven't been read yet.
  2. Early lines could be legal as both utf8 and latin1. But maybe later lines would be illegal as utf8, meaning that the earlier ones should have been interpreted as latin1. But now we've already decoded them, looked at them, and forgotten the original binary data. Is this likely?
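The ambiguity in point 2 is real: bytes that decode cleanly as UTF-8 can be followed later by bytes that don't, and by then the original bytes are gone. A minimal demonstration:

```python
early = b'caf\xc3\xa9'   # valid as UTF-8 ("café") and as Latin-1 ("cafÃ©")
late = b'caf\xe9'        # valid Latin-1 ("café"), but illegal UTF-8

early.decode('utf-8')    # succeeds, so we commit to UTF-8
try:
    late.decode('utf-8')
except UnicodeDecodeError:
    # Too late to reinterpret `early` as Latin-1: only its decoded
    # text was kept, not the original bytes.
    pass
```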

If you consider these drawbacks quite bad, I'd be happy with a flag --stream.

If I start work on a PR, do you have any recommendations?

@pjvandehaar pjvandehaar changed the title Stream from stdin rather than reading all into memory at once Stream from stdin rather than doing stdin.readlines(). Nov 30, 2016
@pjvandehaar pjvandehaar changed the title Stream from stdin rather than doing stdin.readlines(). Stream from stdin rather than doing stdin.readlines() Nov 30, 2016
wavexx commented Nov 30, 2016 via email

pjvandehaar commented Nov 30, 2016

  1. You run tabview myhugefile.csv.
  2. You press c to get mode-widths.
  3. You scroll down a bunch.
  4. You press c again.
    • Now does tabview recalculate the mode width and show that?
    • Or does it switch to max width, since it already showed mode-width? (the current behavior)
