Word Count #197

Open
alamb opened this issue Apr 26, 2021 · 1 comment
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-12293

I am learning DataFusion and tried to write the canonical big data version of hello world, word count, with it. I have been unsuccessful, and I am wondering whether word count is even possible with DataFusion today.

 

Typically word count involves a flat_map where you split each string based on the white space contained within each string.  
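For reference, here is a minimal sketch of the operation being described, written in plain Rust with only the standard library. It does not use DataFusion or Arrow; the function name `word_count` is made up for illustration.

```rust
use std::collections::HashMap;

/// Count whitespace-separated words across all input lines.
/// `word_count` is a hypothetical helper, not a DataFusion API.
fn word_count(lines: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    // flat_map turns each line into zero or more words and
    // concatenates the results into one stream of words.
    for word in lines.iter().flat_map(|line| line.split_whitespace()) {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Calling `word_count(&["the quick fox", "the lazy dog"])` returns a map in which `"the"` has count 2 and every other word has count 1. The question in this issue is how to express the same split-then-flatten step over a DataFusion column.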

 

There are two issues I am running into

  1. Creating a UDF that goes from &str -> Vec<&str>.  I cannot find an arrow::array type that represents a collection of strings, which prevents me from writing a UDF that performs the split.

  2. Assuming I could get 1 to work, I am not aware of a method similar to flat_map that can be applied to a column.  In SQL, I believe this is called explode. I can't find it in the codebase, which makes me think flat_map style operations aren't possible.
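To make the two missing pieces concrete, here is a rough sketch in plain Rust (standard library only) of what a "list of strings" column and an explode over it could look like. The flat values buffer plus row offsets is similar in spirit to how Arrow lays out list arrays, but `StringListColumn`, `split_column`, and `explode` are hypothetical names for illustration, not Arrow or DataFusion APIs.

```rust
/// A list-of-strings column stored as a flat values buffer plus row offsets.
/// Row `i` owns `values[offsets[i]..offsets[i + 1]]`.
/// Hypothetical type for illustration only.
struct StringListColumn {
    offsets: Vec<usize>,
    values: Vec<String>,
}

/// The "split" step (issue 1): produce one list of words per input row.
fn split_column(rows: &[&str]) -> StringListColumn {
    let mut offsets = vec![0];
    let mut values = Vec::new();
    for row in rows {
        values.extend(row.split_whitespace().map(String::from));
        // Record where this row's words end in the flat buffer.
        offsets.push(values.len());
    }
    StringListColumn { offsets, values }
}

/// The "explode" step (issue 2): emit one output row per list element,
/// paired with the index of the input row it came from.
fn explode(column: &StringListColumn) -> Vec<(usize, &str)> {
    let mut out = Vec::with_capacity(column.values.len());
    for (row, window) in column.offsets.windows(2).enumerate() {
        for value in &column.values[window[0]..window[1]] {
            out.push((row, value.as_str()));
        }
    }
    out
}
```

With this layout, a row with no words contributes an empty range and simply disappears after `explode`. Grouping the exploded values and counting them would then give the word count.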

 

My questions are:

Is word count currently possible in DataFusion?  If so, how can I perform the split, and how can I perform a flat_map?  If word count cannot be done, what would need to be implemented to make it possible?

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@andygrove
Member

The analysis seems correct here. I filed #212 for SQL explode support.

We should look into supporting DataFrame map / flat_map functions with lambdas as well.


2 participants