Add option to specify output format #305

Open
sxdjt opened this issue Jan 23, 2024 · 4 comments

@sxdjt

sxdjt commented Jan 23, 2024

Is your feature request related to a problem? Please describe.

No, not related to a problem.

Describe the solution you'd like

The default output of the conversion is parquet. While parquet is efficient, having the option to output CSV would be beneficial. It would support quick checks of the conversion, as well as further analysis/manipulation without having to deal with a parquet file.

What I would like: the ability to specify the output format, e.g. -output-format csv, to override the default format.

Describe alternatives you've considered

Converting parquet to CSV manually is certainly workable, but the tool should provide this option directly.
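
For reference, the manual conversion mentioned above can be done in a few lines of Python. This is only a minimal sketch: it assumes the third-party pyarrow package is installed, and the function name `parquet_to_csv` and the file paths are hypothetical, not part of this project.

```python
def parquet_to_csv(parquet_path: str, csv_path: str) -> int:
    """Convert one parquet file to CSV and return the number of data rows written."""
    # Assumption: pyarrow is installed (e.g. `pip install pyarrow`).
    # Imported here so the dependency is only needed when the function is called.
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    table = pq.read_table(parquet_path)  # loads the whole file into memory
    pacsv.write_csv(table, csv_path)     # writes a header row, then the data rows
    return table.num_rows
```

Because the whole table is read into memory first, this approach suits quick checks on moderately sized files rather than very large outputs.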

@varunmittal91
Collaborator

Hi @sxdjt, this is one of our planned features, and we will get it resolved soon. One issue to think about here: do we publish one consolidated output file, or do we split the output into multiple files?

@sxdjt
Author

sxdjt commented Feb 1, 2024

Hi @varunmittal91 - that is a consideration, especially when dealing with multi-GB CSV files.

Ideally, users would be presented with options on how they want the output generated, something like:

Select your CSV output option:

1 - A single CSV with all converted rows
2 - Multiple CSVs with up to 1,048,575 data rows each (Microsoft Excel's current limit of 1,048,576 rows per worksheet, less one header row)
3 - Multiple CSVs with a user-specified number of rows per file

If the user selects option 3, they would be prompted to enter the number of rows they want per CSV. These could also be passed as arguments, e.g. --csv-rows all or --csv-rows 1000.
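
To make the argument-based variant concrete, here is a rough sketch using only the Python standard library. Everything in it is an assumption for illustration: the `--output-format` and `--csv-rows` flags are the ones proposed in this thread (they do not exist in the tool today), and `parse_csv_rows`, `write_csv_chunks`, and `build_parser` are hypothetical helper names.

```python
import argparse
import csv
from itertools import islice
from pathlib import Path
from typing import Iterable, List, Optional, Sequence


def parse_csv_rows(value: str) -> Optional[int]:
    """Parse the proposed --csv-rows value; None means a single CSV with all rows."""
    if value == "all":
        return None
    rows = int(value)  # argparse reports a ValueError as a usage error
    if rows < 1:
        raise argparse.ArgumentTypeError("--csv-rows must be 'all' or a positive integer")
    return rows


def write_csv_chunks(header: Sequence[str], rows: Iterable[Sequence], out_dir: str,
                     stem: str, rows_per_file: Optional[int]) -> List[Path]:
    """Stream rows into one CSV, or into numbered CSVs of at most rows_per_file rows."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    row_iter = iter(rows)
    written: List[Path] = []
    while True:
        first = next(row_iter, None)
        if first is None and written:
            break  # input exhausted and at least one file already exists
        part = len(written) + 1
        name = f"{stem}.csv" if rows_per_file is None else f"{stem}_{part:04d}.csv"
        path = target / name
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # every file repeats the header row
            if first is not None:
                writer.writerow(first)
                limit = None if rows_per_file is None else rows_per_file - 1
                writer.writerows(islice(row_iter, limit))
        written.append(path)
        if first is None:
            break  # empty input: a single header-only file was written
    return written


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI surface for the options discussed in this issue."""
    parser = argparse.ArgumentParser(description="Sketch of the proposed output options")
    parser.add_argument("--output-format", choices=["parquet", "csv"], default="parquet")
    parser.add_argument("--csv-rows", type=parse_csv_rows, default=None, metavar="all|N")
    return parser
```

With this shape, option 1 corresponds to `--csv-rows all`, option 2 to `--csv-rows 1048575`, and option 3 to any other positive number. Rows are streamed rather than collected in memory, which matters for the multi-GB case mentioned above.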

There are existing tools that could perhaps be adapted to reduce the coding effort:

xsv: https://github.com/BurntSushi/xsv
csvkit: https://github.com/wireservice/csvkit

@ahullah

ahullah commented Jun 3, 2024

Sorry, I added #348 before I read this. Multiple output options would be beneficial.

@ahullah

ahullah commented Jun 6, 2024

Hey folks, as a workaround I'm looking at using https://github.com/clemensv/avrotize as a second conversion step to get to Avro format. Fingers crossed...
