Add CCFileProcessorSparkJob to support file-wise processing #45

Open · jt55401 wants to merge 7 commits into main
Conversation

jt55401 commented Jul 31, 2024

For some spark jobs, we want to process an entire file at one time.
I copied and simplified sparkcc to do this.
This is used in the upcoming integrity process.

jt55401 requested a review from sebastian-nagel on July 31, 2024, 15:57
sebastian-nagel (Contributor) commented

Thanks for the contribution, @jt55401!

First, I understand the use case: being able to process any kind of file, not only WARC files and their derivatives (WAT, WET).
Yes, it is "more generic" and covers more use cases, including those where you want to

  • process a WARC file without using a WARC parser
  • use a custom output format.

Of course, this is already possible using CCSparkJob as a base class and overriding the method run_job(...). For example, HostLinksToGraph reads from and writes to Parquet. From CCSparkJob it uses only the command-line parsing / option processing, the logging definitions and the optional profiling.
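
As a minimal sketch of that pattern (the run_job(session) signature and the self.args attribute names are assumptions based on how I read sparkcc.py, not something this PR defines):

```python
# Minimal sketch: reuse CCSparkJob only for option parsing, logging and
# (optional) profiling, and override run_job(...) to read/write Parquet.
# The run_job(session) signature and self.args.input/output are assumptions.
from sparkcc import CCSparkJob

class ParquetPassThroughJob(CCSparkJob):
    """Hypothetical job: no WARC parsing, custom input/output format."""
    name = "ParquetPassThroughJob"

    def run_job(self, session):
        df = session.read.parquet(self.args.input)
        # ... arbitrary, non-WARC transformation would go here ...
        df.write.parquet(self.args.output)

if __name__ == '__main__':
    job = ParquetPassThroughJob()
    job.run()
```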

A Spark job definition to process individual files from a manifest.

Good catch! And very well expressed! This needs to be put into the README because it's one of the central design decisions. I never really thought about it, I just ported it from cc-mrjob. I do not know why @Smerity decided to use a manifest while most big data tools read the input list from command-line arguments. I do see one advantage of the manifest: it's easy to select a (larger) random sample. With command-line arguments you may quickly hit the system limit on the maximum argument length when passing, say, 1k paths to WARC files.
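
For instance, drawing a random sample manifest is only a few lines (the file names below are made up):

```python
# Hypothetical example: draw a 1000-file random sample from a path manifest
# instead of passing thousands of WARC paths as command-line arguments.
import random

with open('warc.paths') as f:          # full manifest, one path per line
    paths = [line.strip() for line in f if line.strip()]

sample = random.sample(paths, 1000)    # easy to scale the sample size up or down

with open('warc-sample.paths', 'w') as f:
    f.write('\n'.join(sample) + '\n')
```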

One question: what's the rationale for using a NamedTemporaryFile? Being able to share the content as a file with other processes?

... and one remark which should be addressed: 90% of the code lines in sparkccfile.py are copied unmodified from sparkcc.py. This complicates maintenance because contributors may forget to implement a bug fix or improvement in both files.

Two suggestions for reducing the code duplication:

  1. CCSparkJob inherits from CCFileProcessorSparkJob

    • (preferred variant, although more work; expected to remove more duplicated code)
    • move the definition of CCFileProcessorSparkJob into sparkcc.py
      • makes the deployment easier
      • avoids breaking the deployment of existing setups through the changes below
    • remove duplicated code from CCSparkJob
      • keep only code / methods specific to WARC file processing
    • (difficult) have fetch_warc call fetch_file where applicable
      • the method "fetch_warc" is complex (120+ lines of code)
      • "fetch_file" mostly duplicates 70 of these lines
      • ideally, those shared lines would live only in fetch_file
  2. CCFileProcessorSparkJob inherits from CCSparkJob

    • (easier to implement)
    • cf. above and HostLinksToGraph
    • basically, only fetch_file(...) and run_job(...) are then implemented by CCFileProcessorSparkJob (see the sketch at the end of this comment)
    • could also move the definition into sparkcc.py for easier deployment

In any case: the job / class should be listed in the README, maybe together with a simple example.
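
To make suggestion 2 a bit more concrete, a very rough sketch (purely illustrative: the method names fetch_file/process_file/process_files and the attributes borrowed from sparkcc.py are assumptions, not the PR's actual code):

```python
# Rough sketch of suggestion 2: CCFileProcessorSparkJob inherits from CCSparkJob
# and only adds file-level fetching/processing plus run_job(...). Everything here
# is illustrative; attribute names taken from sparkcc.py (args.input,
# num_input_partitions, output_schema, output_format) are assumptions.
import shutil
from tempfile import NamedTemporaryFile

import boto3

from sparkcc import CCSparkJob


class CCFileProcessorSparkJob(CCSparkJob):
    """Process whole files listed in the input manifest, one file per record."""

    name = "CCFileProcessor"

    def fetch_file(self, uri, temp_file):
        # simplified fetch: copy an S3 object or a local file into the temp file
        if uri.startswith('s3://'):
            bucket, key = uri[len('s3://'):].split('/', 1)
            boto3.client('s3').download_fileobj(bucket, key, temp_file)
        else:
            with open(uri, 'rb') as src:
                shutil.copyfileobj(src, temp_file)
        temp_file.flush()
        temp_file.seek(0)

    def process_file(self, uri, temp_file):
        # to be overridden by concrete jobs, yielding (key, value) pairs
        raise NotImplementedError('process_file(...) needs to be implemented')

    def process_files(self, _partition_id, uris):
        # per-partition driver, analogous to CCSparkJob.process_warcs
        for uri in uris:
            with NamedTemporaryFile(mode='w+b') as temp_file:
                self.fetch_file(uri, temp_file)
                yield from self.process_file(uri, temp_file)

    def run_job(self, session):
        manifest = session.sparkContext.textFile(
            self.args.input, minPartitions=self.args.num_input_partitions)
        output = manifest.mapPartitionsWithIndex(self.process_files)
        session.createDataFrame(output, schema=self.output_schema) \
               .write.format(self.args.output_format) \
               .saveAsTable(self.args.output)
```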

jt55401 (Author) commented Aug 2, 2024

@sebastian-nagel - thank you for the review, it's greatly appreciated.

NamedTemporaryFile

Yes, that is exactly right. Some parts of the jobs we run use external tools, so we need the content available as an actual file that those external processes can open.
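
Roughly this pattern (the tool name and payload below are placeholders, not our actual job code):

```python
# Hypothetical illustration: hand fetched content to an external tool by path.
# 'some-external-tool' and 'payload' are placeholders.
import subprocess
from tempfile import NamedTemporaryFile

payload = b'...bytes fetched from S3 or HTTP...'

with NamedTemporaryFile(mode='w+b', suffix='.warc.gz') as tmp:
    tmp.write(payload)
    tmp.flush()                     # make the bytes visible to other processes
    subprocess.run(['some-external-tool', tmp.name], check=True)
```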

Manifests vs. command line args

Yes, we commonly process 250,000 to 1,000,000+ files in a run (for example, all the WAT/WET files for an entire year of crawls).

Two suggestions how to reduce the code duplication:

Ah, very astute. I will review these options and update the PR with a refactor.

In any case: the job / class should be listed in the README, maybe together with a simple example.

Yes, no problem.

jt55401 changed the title from "Add sparkccfile.py to support file-wise processing" to "Add CCFileProcessorSparkJob to support file-wise processing" on Aug 3, 2024
jt55401 (Author) commented Aug 3, 2024

OK @sebastian-nagel - I'm reasonably happy with this version.

The only slight downside is that, due to the way it's packaged, we now have to depend on warcio even when it's not really needed. I'm not a deep expert in Python modularization, so if there is a clever way to fix this while preserving the cleanliness of this refactor, please let me know; otherwise, I'm fine leaving it as is.
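
One idea, in case it applies here (just a sketch, not necessarily the right fit for this refactor): a local import keeps warcio out of the import path for jobs that never parse WARC files.

```python
# Sketch of a local ("lazy") import: warcio is imported only when WARC records
# are actually parsed, so purely file-oriented jobs don't need it installed.
# Function name and signature are illustrative, not taken from the PR.
def iterate_warc_records(stream):
    from warcio.archiveiterator import ArchiveIterator  # deferred import
    for record in ArchiveIterator(stream):
        yield record
```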

Let me know if you have any other feedback.

jt55401 (Author) commented Sep 10, 2024

I've since enhanced this further with 3 more functions:

  • validate_s3_bucket_from_uri
  • check_for_output_file
  • write_output_file

These are mostly convenience functions that do what their names say, for both local file paths and S3 paths.
They make it easier to write jobs that process entire files and output new files to locations other than the default Spark result table (which I've been using more as an audit log for such file-wise processing jobs).
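
For reviewers, this is roughly the shape of those helpers (the implementations below are illustrative sketches, assuming boto3 for the S3 cases; the actual code is in the PR):

```python
# Illustrative sketches only; the real implementations live in the PR.
import os
from urllib.parse import urlparse

import boto3


def validate_s3_bucket_from_uri(uri):
    """Check that the bucket behind an s3:// URI exists and is accessible."""
    bucket = urlparse(uri).netloc
    boto3.client('s3').head_bucket(Bucket=bucket)   # raises ClientError otherwise
    return bucket


def check_for_output_file(uri):
    """Return True if the output file already exists (local path or s3:// URI)."""
    if uri.startswith('s3://'):
        parsed = urlparse(uri)
        resp = boto3.client('s3').list_objects_v2(
            Bucket=parsed.netloc, Prefix=parsed.path.lstrip('/'), MaxKeys=1)
        return resp.get('KeyCount', 0) > 0
    return os.path.exists(uri)


def write_output_file(uri, local_path):
    """Copy a locally written result to its final location (local or s3:// URI)."""
    if uri.startswith('s3://'):
        parsed = urlparse(uri)
        boto3.client('s3').upload_file(local_path, parsed.netloc,
                                       parsed.path.lstrip('/'))
    else:
        os.replace(local_path, uri)
```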
