Intelligent CloudFlow partitioned indexing #152

Open
isaacabraham opened this issue Nov 20, 2015 · 1 comment

Comments

@isaacabraham
Contributor

One of the things that Hive, for example, allows you to do is define indexes on flat files based on their physical layout. Imagine a folder structure like:

{country}/{city}/{companyName}.txt

In Hive you can provide hints about this structure so that it searches only the files that match, e.g. UK/London, rather than scanning through all files. Is this something that is (a) needed in MBrace, and (b) achievable?
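
To make the idea concrete, here is a minimal sketch of path-based partition pruning in Python. It is not Hive's or MBrace's API: `PARTITION_KEYS`, `parse_partitions`, `prune` and the sample file names are all hypothetical, and the layout is assumed to be exactly the `{country}/{city}/{companyName}.txt` structure above.

```python
from pathlib import PurePosixPath

# Assumed layout (from the example above): {country}/{city}/{companyName}.txt
PARTITION_KEYS = ("country", "city")


def parse_partitions(path):
    """Map a relative file path to its partition values (country, city)."""
    parts = PurePosixPath(path).parts
    # Leading directories are partition values; the last part is the file name.
    return dict(zip(PARTITION_KEYS, parts[:-1]))


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Made-up sample listing, purely for illustration.
files = [
    "UK/London/Acme.txt",
    "UK/Leeds/Globex.txt",
    "DE/Berlin/Initech.txt",
]

print(prune(files, country="UK", city="London"))  # ['UK/London/Acme.txt']
```

The point is that the filter runs over file names only, so files outside UK/London never need to be opened. A real implementation would presumably list just the matching directory instead of enumerating every path first.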

@dsyme
Contributor

dsyme commented Nov 20, 2015

I believe this use of folders as an implicit partitioning/indexing structure is one of the main reasons that Hadoop/Hive/HDFS have been successful - and hence Spark too. The ease with which people can organize masses of data using mostly normal Unix file system commands and then have it partitioned implicitly is very impressive.

I'd love to see these ideas brought into MBrace more completely. I believe one piece of the puzzle is to have an "mstore.exe" or "mb.exe" command-line utility that can be used in the obvious ways:

mstore cp *.txt /data/foo/*.txt
mstore rm /data/foo/**/logs/*.log
mstore mv ......
mstore ls ....
mstore mkdir ....

just like HDFS, and a bit like azure.exe, but working with any MBrace system. The configuration/get-cluster mechanism used to bind to Azure, Thespian or AWS would have to be implicit, e.g. taken from user environment variables.
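
As a rough sketch of how a pattern such as `/data/foo/**/logs/*.log` might be resolved against store paths, the Python below translates a glob into a regular expression. No `mstore` tool exists in the thread, so the semantics are assumptions borrowed from shell-style globbing: `**/` matches zero or more directories and `*` matches within a single path segment. `glob_to_regex` and the sample paths are hypothetical.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob into a compiled regex (assumed shell-like semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")  # any characters within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(out) + r"\Z")


# Made-up store paths, purely for illustration.
paths = [
    "/data/foo/logs/a.log",
    "/data/foo/x/y/logs/b.log",
    "/data/foo/x/logs/sub/c.log",
    "/data/bar/logs/d.log",
]

rx = glob_to_regex("/data/foo/**/logs/*.log")
print([p for p in paths if rx.match(p)])
# ['/data/foo/logs/a.log', '/data/foo/x/y/logs/b.log']
```

Matching on path strings like this keeps the tool independent of the backing store, which fits the goal of working with any MBrace system.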
