Add CCFileProcessorSparkJob to support file-wise processing #45
Conversation
Thanks for the contribution, @jt55401! First, I understand the use case: being able to process any kind of file, not only WARC files and their derivatives (WAT, WET).
Of course, this is already possible using CCSparkJob as the base class by overriding the method run_job(...). For example, HostLinksToGraph reads and writes from/to Parquet. From CCSparkJob it uses only the command-line parsing / option processing, the logging definitions, and the optional profiling.
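As a rough illustration of that pattern, here is a minimal sketch of a job that bypasses WARC processing entirely. It assumes a `run_job(self, session)` hook, a `run()` entry point, and `--input`/`--output` arguments roughly as provided by CCSparkJob; the exact names in sparkcc.py may differ.

```python
from pyspark.sql import SparkSession

from sparkcc import CCSparkJob


class ParquetCountJob(CCSparkJob):
    """Hypothetical job that processes Parquet input instead of WARC files
    by overriding run_job() and skipping the WARC record iteration entirely.
    It still reuses CCSparkJob's argument parsing, logging and profiling."""

    name = "ParquetCountJob"

    def run_job(self, session: SparkSession):
        # read arbitrary (non-WARC) input pointed to by the --input argument
        df = session.read.parquet(self.args.input)

        # do whatever per-row work the job needs
        # (the column name below is purely illustrative)
        counts = df.groupBy("url_host_registered_domain").count()

        # write the result; output handling is entirely up to the job
        counts.write.parquet(self.args.output)


if __name__ == "__main__":
    job = ParquetCountJob()
    job.run()
```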
Good catch! And very well expressed! This needs to be put into the README because it's one of the central design decisions. I never really thought about it; I just ported it from cc-mrjob. I do not know why @Smerity decided to use a manifest while most big-data tools read the input list from command-line arguments. I see one advantage of the manifest: it's easy to select a (larger) random sample. Using command-line arguments, you may quickly reach the system limit on the maximum argument length when passing, say, 1k paths to WARC files as arguments.

One question: what's the rationale for using a NamedTemporaryFile? Being able to share the content as a file with other processes?

... and one remark which should be addressed: 90% of the code lines in sparkccfile.py are copied unmodified from sparkcc.py. This complicates maintenance because contributors may forget to implement a bug fix or improvement in both files. Two suggestions for reducing the code duplication:
In any case: the job / class should be listed in the README, maybe together with a simple example.
@sebastian-nagel - thank you for the review, it's greatly appreciated.
Yes - that is exactly right. Some parts of the jobs we run use external tools, so we need a file in the outside world.
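For context, a minimal sketch of that pattern, assuming an external command-line tool that can only read from a real path on disk (the `file` invocation below is only a placeholder for whatever tool the job actually runs):

```python
import subprocess
import tempfile


def process_with_external_tool(payload: bytes) -> str:
    """Write the downloaded file content to a named temporary file so that
    an external process can open it by path, then capture the tool's output."""
    with tempfile.NamedTemporaryFile(suffix=".bin") as tmp:
        tmp.write(payload)
        tmp.flush()  # make sure the bytes are visible to the other process
        # 'file' stands in for the real external tool used by the job
        result = subprocess.run(
            ["file", "--brief", tmp.name],
            capture_output=True, text=True, check=True
        )
    return result.stdout.strip()
```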
Yes - we are commonly processing 250,000-1,000,000+ files in a run. (Example: all the wet/wat files for an entire year of crawls)
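That scale also illustrates the advantage of the manifest noted above: with the input paths in a file, drawing a large random sample is trivial and the argument-length limit never comes into play. A small sketch, with hypothetical file names:

```python
import random

# hypothetical file names: a full listing of WET paths and the sampled manifest
with open("wet.paths") as f:
    paths = [line.strip() for line in f if line.strip()]

# sample 1000 input files; the job then reads this manifest, not the arguments
sample = random.sample(paths, k=min(1000, len(paths)))

with open("wet-sample.paths", "w") as f:
    f.write("\n".join(sample) + "\n")
```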
Ah, very astute. I will review these options and update the PR with a refactor.
Yes, no problem.
OK @sebastian-nagel - I'm reasonably happy with this version. The only slight downside is that, due to the way it's packaged, we now have to depend on warcio even though it's not really needed. I'm not a deep expert in Python modularization, so if there is a clever way to fix this while preserving the cleanliness of this refactor, please let me know; otherwise, I'm fine leaving it as is. Let me know if you have any other feedback.
I've since enhanced this further with 3 more functions:
These are mostly convenience functions that do what their names say, for either local file paths or S3 paths.
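The actual function names are not shown here; purely as an illustration of the kind of helper meant, a hypothetical sketch that handles both cases, using boto3 for S3:

```python
import boto3


def fetch_bytes(path: str) -> bytes:
    """Hypothetical convenience helper: return the content of either a local
    file path or an s3:// URI as bytes."""
    if path.startswith("s3://"):
        bucket, _, key = path[len("s3://"):].partition("/")
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return obj["Body"].read()
    with open(path, "rb") as f:
        return f.read()
```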
For some Spark jobs, we want to process an entire file at a time.
I copied and simplified sparkcc.py to do this.
This is used in the upcoming integrity process.
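As a rough illustration of the intended usage (the exact method names and signatures of CCFileProcessorSparkJob are defined in this PR; the `process_file` hook and `run()` entry point below are assumptions and may differ):

```python
from sparkccfile import CCFileProcessorSparkJob


class FileChecksumJob(CCFileProcessorSparkJob):
    """Hypothetical job built on the file-wise base class: instead of
    iterating over WARC records, each input file is handed to the job
    as a whole and processed in one go."""

    name = "FileChecksumJob"

    def process_file(self, uri, content):
        # 'content' is assumed to be the complete file payload;
        # emit one (path, size) pair per input file
        yield uri, len(content)


if __name__ == "__main__":
    job = FileChecksumJob()
    job.run()
```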