Add CCFileProcessorSparkJob to support file-wise processing #45

Open · jt55401 wants to merge 7 commits into main
Conversation

jt55401 commented Jul 31, 2024

For some spark jobs, we want to process an entire file at one time.
I copied and simplified sparkcc to do this.
This is used in the upcoming integrity process.

jt55401 requested a review from sebastian-nagel on July 31, 2024, 15:57
sebastian-nagel (Contributor) commented

Thanks for the contribution, @jt55401!

First, I understand the use case: being able to process any kind of file, not only WARC files and their derivatives (WAT, WET).
Yes, it is "more generic" and covers more use cases, including those where you want to

  • process a WARC file without using a WARC parser
  • use a custom output format.

Of course, this is already possible using CCSparkJob as a base class and overriding the method run_job(...). For example, HostLinksToGraph reads from and writes to Parquet. From CCSparkJob it uses only the command-line parsing / option processing, the logging definitions and the optional profiling.
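
As a minimal sketch of that pattern (the run_job(session) signature and the self.args attribute names are assumptions based on how I read sparkcc.py, not something this PR defines):

```python
# Minimal sketch: reuse CCSparkJob only for option parsing, logging and
# (optional) profiling, and override run_job(...) to read/write Parquet.
# The run_job(session) signature and self.args.input/output are assumptions.
from sparkcc import CCSparkJob

class ParquetPassThroughJob(CCSparkJob):
    """Hypothetical job: no WARC parsing, custom input/output format."""
    name = "ParquetPassThroughJob"

    def run_job(self, session):
        df = session.read.parquet(self.args.input)
        # ... arbitrary, non-WARC transformation would go here ...
        df.write.parquet(self.args.output)

if __name__ == '__main__':
    job = ParquetPassThroughJob()
    job.run()
```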

A Spark job definition to process individual files from a manifest.

Good catch! And very well expressed! This needs to be put into the README because it's one of the central design decisions. I never really thought about it, I just ported it from cc-mrjob. I do not know why @Smerity decided to use a manifest while most big data tools read the input list from command-line arguments. I do see one advantage of the manifest: it's easy to select a (larger) random sample. With command-line arguments you may quickly hit the system limit on the maximum argument length when passing, say, 1k paths to WARC files.
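
For instance, drawing a random sample manifest is only a few lines (the file names below are made up):

```python
# Hypothetical example: draw a 1000-file random sample from a path manifest
# instead of passing thousands of WARC paths as command-line arguments.
import random

with open('warc.paths') as f:          # full manifest, one path per line
    paths = [line.strip() for line in f if line.strip()]

sample = random.sample(paths, 1000)    # easy to scale the sample size up or down

with open('warc-sample.paths', 'w') as f:
    f.write('\n'.join(sample) + '\n')
```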

One question: what's the rationale for using a NamedTemporaryFile? Being able to share the content as a file with other processes?

... and one remark which should be addressed: 90% of the code lines in sparkccfile.py are copied unmodified from sparkcc.py. This complicates maintenance because contributors may forget to implement a bug fix or improvement in both files.

Two suggestions for reducing the code duplication:

  1. CCSparkJob inherits from CCFileProcessorSparkJob

    • (preferred variant, although more work; expected to remove more duplicated code)
    • move the definition of CCFileProcessorSparkJob into sparkcc.py
      • makes the deployment easier
      • avoids breaking the deployment of existing setups through the changes below
    • remove duplicated code from CCSparkJob
      • keep only code / methods specific to WARC file processing
    • (difficult) have fetch_warc call fetch_file where applicable
      • the method "fetch_warc" is complex (120+ lines of code)
      • "fetch_file" mostly duplicates 70 of these lines
      • ideally, those shared lines would live only in fetch_file
  2. CCFileProcessorSparkJob inherits from CCSparkJob

    • (easier to implement)
    • cf. above and HostLinksToGraph
    • basically, only fetch_file(...) and run_job(...) are then implemented by CCFileProcessorSparkJob (see the sketch at the end of this comment)
    • could also move the definition into sparkcc.py for easier deployment

In any case: the job / class should be listed in the README, maybe together with a simple example.
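
To make suggestion 2 a bit more concrete, a very rough sketch (purely illustrative: the method names fetch_file/process_file/process_files and the attributes borrowed from sparkcc.py are assumptions, not the PR's actual code):

```python
# Rough sketch of suggestion 2: CCFileProcessorSparkJob inherits from CCSparkJob
# and only adds file-level fetching/processing plus run_job(...). Everything here
# is illustrative; attribute names taken from sparkcc.py (args.input,
# num_input_partitions, output_schema, output_format) are assumptions.
import shutil
from tempfile import NamedTemporaryFile

import boto3

from sparkcc import CCSparkJob


class CCFileProcessorSparkJob(CCSparkJob):
    """Process whole files listed in the input manifest, one file per record."""

    name = "CCFileProcessor"

    def fetch_file(self, uri, temp_file):
        # simplified fetch: copy an S3 object or a local file into the temp file
        if uri.startswith('s3://'):
            bucket, key = uri[len('s3://'):].split('/', 1)
            boto3.client('s3').download_fileobj(bucket, key, temp_file)
        else:
            with open(uri, 'rb') as src:
                shutil.copyfileobj(src, temp_file)
        temp_file.flush()
        temp_file.seek(0)

    def process_file(self, uri, temp_file):
        # to be overridden by concrete jobs, yielding (key, value) pairs
        raise NotImplementedError('process_file(...) needs to be implemented')

    def process_files(self, _partition_id, uris):
        # per-partition driver, analogous to CCSparkJob.process_warcs
        for uri in uris:
            with NamedTemporaryFile(mode='w+b') as temp_file:
                self.fetch_file(uri, temp_file)
                yield from self.process_file(uri, temp_file)

    def run_job(self, session):
        manifest = session.sparkContext.textFile(
            self.args.input, minPartitions=self.args.num_input_partitions)
        output = manifest.mapPartitionsWithIndex(self.process_files)
        session.createDataFrame(output, schema=self.output_schema) \
               .write.format(self.args.output_format) \
               .saveAsTable(self.args.output)
```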

jt55401 (Author) commented Aug 2, 2024

@sebastian-nagel - thank you for the review, it's greatly appreciated.

NamedTemporaryFile

Yes, that is exactly right. Some parts of the jobs we run use external tools, so we need the content available as an actual file that those external processes can open.
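
Roughly this pattern (the tool name and payload below are placeholders, not our actual job code):

```python
# Hypothetical illustration: hand fetched content to an external tool by path.
# 'some-external-tool' and 'payload' are placeholders.
import subprocess
from tempfile import NamedTemporaryFile

payload = b'...bytes fetched from S3 or HTTP...'

with NamedTemporaryFile(mode='w+b', suffix='.warc.gz') as tmp:
    tmp.write(payload)
    tmp.flush()                     # make the bytes visible to other processes
    subprocess.run(['some-external-tool', tmp.name], check=True)
```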

Manifests vs. command line args

Yes, we commonly process 250,000 to 1,000,000+ files in a run (for example, all the WAT/WET files for an entire year of crawls).

Two suggestions how to reduce the code duplication:

Ah, very astute. I will review these options and update the PR with a refactor.

In any case: the job / class should be listed in the README, maybe together with a simple example.

Yes, no problem.

jt55401 changed the title from "Add sparkccfile.py to support file-wise processing" to "Add CCFileProcessorSparkJob to support file-wise processing" on Aug 3, 2024
jt55401 (Author) commented Aug 3, 2024

OK @sebastian-nagel - I'm reasonably happy with this version.

The only slight downside is that, due to the way it's packaged, we now have to depend on warcio even when it's not really needed. I'm not a deep expert in Python modularization, so if there is a clever way to fix this while preserving the cleanliness of this refactor, please let me know; otherwise, I'm fine leaving it as is.
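
One idea, in case it applies here (just a sketch, not necessarily the right fit for this refactor): a local import keeps warcio out of the import path for jobs that never parse WARC files.

```python
# Sketch of a local ("lazy") import: warcio is imported only when WARC records
# are actually parsed, so purely file-oriented jobs don't need it installed.
# Function name and signature are illustrative, not taken from the PR.
def iterate_warc_records(stream):
    from warcio.archiveiterator import ArchiveIterator  # deferred import
    for record in ArchiveIterator(stream):
        yield record
```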

Let me know if you have any other feedback.

jt55401 (Author) commented Sep 10, 2024

I've since enhanced this further with 3 more functions:

  • validate_s3_bucket_from_uri
  • check_for_output_file
  • write_output_file

These are mostly convenience functions that do what their names say, for both local file paths and S3 paths.
They make it easier to write jobs that process entire files and output new files to locations other than the default Spark result table (which I've been using more as an audit log for such file-wise processing jobs).
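
For reviewers, this is roughly the shape of those helpers (the implementations below are illustrative sketches, assuming boto3 for the S3 cases; the actual code is in the PR):

```python
# Illustrative sketches only; the real implementations live in the PR.
import os
from urllib.parse import urlparse

import boto3


def validate_s3_bucket_from_uri(uri):
    """Check that the bucket behind an s3:// URI exists and is accessible."""
    bucket = urlparse(uri).netloc
    boto3.client('s3').head_bucket(Bucket=bucket)   # raises ClientError otherwise
    return bucket


def check_for_output_file(uri):
    """Return True if the output file already exists (local path or s3:// URI)."""
    if uri.startswith('s3://'):
        parsed = urlparse(uri)
        resp = boto3.client('s3').list_objects_v2(
            Bucket=parsed.netloc, Prefix=parsed.path.lstrip('/'), MaxKeys=1)
        return resp.get('KeyCount', 0) > 0
    return os.path.exists(uri)


def write_output_file(uri, local_path):
    """Copy a locally written result to its final location (local or s3:// URI)."""
    if uri.startswith('s3://'):
        parsed = urlparse(uri)
        boto3.client('s3').upload_file(local_path, parsed.netloc,
                                       parsed.path.lstrip('/'))
    else:
        os.replace(local_path, uri)
```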
