
DelegateCombineFileInputFormat Doesn't Honor CombineFileInputFormat.maxSplitSize #419

Open
gsteelman opened this issue Sep 29, 2014 · 3 comments

Comments

@gsteelman

I'm trying to use DelegateCombineFileInputFormat + LzoTextInputFormat + LzoTextOutputFormat. I'm also trying to specify the maxSplitSize for combining files. I've found that DelegateCombineFileInputFormat doesn't honor maxSplitSize, minSplitSizeNode, or minSplitSizeRack if they are configured before the job is run.

Per @jcoveney: "If there is a maxInputSplitSize in Hadoop's CombineFileInputFormat, no, it is not honored." See:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/util/SplitUtil.java#L35

I can see a couple approaches for a fix:

  1. SplitUtil.getCombinedSplitSize(Configuration): Change it so that it first tries getLong on COMBINE_SPLIT_SIZE and, if that key is not set, falls back to CombineFileInputFormat's "mapreduce.input.fileinputformat.split.maxsize". That key apparently isn't a static constant, just a hard-coded string.

  2. DelegateCombineFileInputFormat could set SplitUtil.COMBINE_SPLIT_SIZE equal to CombineFileInputFormat max split size if it was set. This same approach could be used for minSplitSizeNode and minSplitSizeRack. Where in DelegateCombineFileInputFormat would this go?
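A minimal sketch of the fallback lookup in approach 1, not the actual elephant-bird code. To stay self-contained it uses a plain `Map<String, String>` as a stand-in for Hadoop's `Configuration`. The class name `SplitSizeLookup`, the `defaultSize` parameter, and the value given for `COMBINE_SPLIT_SIZE` are assumptions for illustration; only the Hadoop key string is taken from this issue.

```java
import java.util.Map;

class SplitSizeLookup {
    // Assumed value of SplitUtil.COMBINE_SPLIT_SIZE; verify against the elephant-bird source.
    static final String COMBINE_SPLIT_SIZE = "elephantbird.combine.split.size";
    // Key quoted in this issue for CombineFileInputFormat's max split size.
    static final String CFIF_MAX_SPLIT_SIZE = "mapreduce.input.fileinputformat.split.maxsize";

    // Returns the first configured size, preferring the elephant-bird key,
    // then the CombineFileInputFormat key, then the caller's default.
    // The Map is a stand-in for Hadoop's Configuration (an assumption of this sketch).
    static long getCombinedSplitSize(Map<String, String> conf, long defaultSize) {
        for (String key : new String[] {COMBINE_SPLIT_SIZE, CFIF_MAX_SPLIT_SIZE}) {
            String raw = conf.get(key);
            if (raw != null) {
                return Long.parseLong(raw.trim());
            }
        }
        return defaultSize;
    }
}
```

With this ordering, an explicit elephant-bird setting still wins, so existing jobs that already set COMBINE_SPLIT_SIZE would behave as before.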

@jcoveney
Contributor

I think we should do a mix of what you propose.

  1. DelegateCombineFileInputFormat should check for and honor the CombineFileInputFormat settings, and pass them to getCombinedSplitSize.

  2. getCombinedSplitSize should also check both configuration keys.

@gsteelman
Author

That does seem like the safe approach. What about minSplitSizeNode and minSplitSizeRack? I'd think we would extract them from the conf using the CombineFileInputFormat (CFIF) config keys, then set them into elephant-bird's SplitUtil conf keys too. However, DelegateCombineFileInputFormat doesn't appear to have any notion of a min size.
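A rough sketch of the key-copying idea, under several assumptions. A `Map<String, String>` again stands in for Hadoop's `Configuration`. The per-node and per-rack source keys are what I believe Hadoop 2 uses and should be checked against the Hadoop version in use. The elephant-bird target keys for the min sizes are purely hypothetical, since no such keys appear to exist today, and the value shown for COMBINE_SPLIT_SIZE is also an assumption.

```java
import java.util.Map;

class SplitSizePropagation {
    // Source keys. The maxsize key is quoted in this issue; the two minsize
    // keys are assumed Hadoop 2 names and should be verified.
    static final String CFIF_MAX_SPLIT_SIZE = "mapreduce.input.fileinputformat.split.maxsize";
    static final String CFIF_MIN_PER_NODE = "mapreduce.input.fileinputformat.split.minsize.per.node";
    static final String CFIF_MIN_PER_RACK = "mapreduce.input.fileinputformat.split.minsize.per.rack";

    // Target keys. EB_COMBINE_SPLIT_SIZE is an assumed value of
    // SplitUtil.COMBINE_SPLIT_SIZE; the two min keys are hypothetical.
    static final String EB_COMBINE_SPLIT_SIZE = "elephantbird.combine.split.size";
    static final String EB_MIN_PER_NODE = "elephantbird.combine.split.minsize.per.node";
    static final String EB_MIN_PER_RACK = "elephantbird.combine.split.minsize.per.rack";

    // Copies each CombineFileInputFormat setting into its elephant-bird
    // counterpart, without overwriting values the user set explicitly.
    static void propagate(Map<String, String> conf) {
        copyIfUnset(conf, CFIF_MAX_SPLIT_SIZE, EB_COMBINE_SPLIT_SIZE);
        copyIfUnset(conf, CFIF_MIN_PER_NODE, EB_MIN_PER_NODE);
        copyIfUnset(conf, CFIF_MIN_PER_RACK, EB_MIN_PER_RACK);
    }

    // Returns true only when a value was actually copied.
    static boolean copyIfUnset(Map<String, String> conf, String from, String to) {
        String value = conf.get(from);
        if (value == null || conf.containsKey(to)) {
            return false;
        }
        conf.put(to, value);
        return true;
    }
}
```

Copying only when the target is unset keeps an explicit elephant-bird setting authoritative. The min-size keys would still have no effect until the split logic actually reads them.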

@gsteelman
Author

I've created a pull request: https://github.com/kevinweil/elephant-bird/pull/420

