Support map-reduce pattern in concurrency utils #17

Open · adivekar-utexas opened this issue Mar 6, 2025 · 0 comments
Labels: enhancement 🚀 New feature or request

@adivekar-utexas (Contributor):
Right now, much of the orchestration around dispatch_executor and dispatch has to be done manually, even though many parallel-processing use cases fit a common pattern:

  1. Get a sequence of data
  2. Process each item in the sequence
  3. (Optional) Aggregate the results.

This is essentially MapReduce (where the Reduce step is optional). We should have a way to batch-submit and retrieve results iteratively.
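For reference, the serial version of this pattern that currently has to be orchestrated by hand looks roughly like this (process_query_df is the per-item function from the examples below):

>>> ## 1. Get a sequence of data, e.g. (query_id, query_df) pairs:
>>> groups = retrieval_dataset.groupby("query_id")
>>> ## 2. Process each item in the sequence:
>>> results = [process_query_df(query_id, query_df) for query_id, query_df in groups]
>>> ## 3. (Optional) Aggregate:
>>> total = sum(results)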

A version of this already exists in dispatch_apply, but it is not fully fleshed out. Instead, we want an API that can be used like this:

Example Usage (parallel map only):

>>> def process_query_df(query_id, query_df):
...     query_df: pd.DataFrame = set_ranks(query_df, sort_col="example_id")
...     query_df['product_text'] = query_df['product_text'].apply(clean_text)
...     return query_df['product_text'].apply(len).mean()
>>>
>>> for mean_query_doc_lens in map_reduce(
...     retrieval_dataset.groupby("query_id"),
...     fn=process_query_df,
...     parallelize='processes',
...     max_workers=20,
...     pbar=dict(miniters=1),
...     batch_size=30,
...     iter=True,
... ):
...     ## Prints the output of each call to process_query_df, which is the mean length of
...     ## product_text for each query_df:
...     print(mean_query_doc_lens)
1171.090909090909
1317.7931034482758
2051.945945945946
1249.9375
...

Example Usage (parallel map and reduce):

>>> def process_query_df(query_id, query_df):
...     query_df: pd.DataFrame = set_ranks(query_df, sort_col="example_id")
...     query_df['product_text'] = query_df['product_text'].apply(clean_text)
...     return query_df['product_text'].apply(len).sum()
>>>
>>> def reduce_query_df(l):
...     ## Applied to every batch of outputs from process_query_df,
...     ## and then again to the final list of reduced outputs:
...     return sum(l)
>>>
>>> print(map_reduce(
...     retrieval_dataset.groupby("query_id"),
...     fn=process_query_df,
...     parallelize='processes',
...     max_workers=20,
...     pbar=True,
...     batch_size=30,
...     reduce_fn=reduce_query_df,
... ))
374453878
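
To make the proposal concrete, here is a minimal sketch of how map_reduce could be built on top of concurrent.futures. The batching and dispatch details are illustrative assumptions, not a proposed final implementation (progress-bar wiring is omitted, so pbar is accepted but ignored):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial

def _run_batch(fn, reduce_fn, batch):
    ## Worker: apply fn to every item in one batch; if a reduce_fn is given,
    ## collapse the batch's outputs into a single partial result.
    out = [fn(*item) if isinstance(item, tuple) else fn(item) for item in batch]
    return reduce_fn(out) if reduce_fn is not None else out

def _batches(items, batch_size):
    ## Split the materialized sequence into contiguous batches.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def map_reduce(data, *, fn, parallelize='processes', max_workers=8,
               batch_size=1, reduce_fn=None, iter=False, pbar=None):
    ## `pbar` is accepted for API parity but ignored in this sketch.
    items = list(data)  ## e.g. (query_id, query_df) pairs from a groupby
    executor_cls = ProcessPoolExecutor if parallelize == 'processes' else ThreadPoolExecutor

    def results():
        with executor_cls(max_workers=max_workers) as ex:
            ## partial() keeps the submitted callable picklable for process pools.
            worker = partial(_run_batch, fn, reduce_fn)
            for batch_out in ex.map(worker, _batches(items, batch_size)):
                if reduce_fn is None:
                    yield from batch_out  ## one result per input item
                else:
                    yield batch_out       ## one partial result per batch

    if iter:
        return results()  ## lazy: stream results as batches complete
    out = list(results())
    return reduce_fn(out) if reduce_fn is not None else out

Batching amortizes per-task serialization overhead when individual items are cheap, and iter=True keeps memory bounded by streaming results instead of collecting them. When reduce_fn is given, it is applied once per batch in the workers and once more over the list of partial results, matching the semantics in the second example above.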