Node computation flexibility #64
Parallelism and threads

Here's an example of how one can use parallelism and threads. Consider the following code:

```python
def process_and_save(items, process, save):
    for key, item in items:
        processed = process(item)
        save(key, processed)
```

How can we optimize it to use parallelism and threads? We can use `concurrent.futures`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def main(items):
    with ThreadPoolExecutor() as executor:
        # First, submit the process tasks and collect futures
        process_futures = {executor.submit(process, item): key for key, item in items}
        # As processing completes, submit save tasks
        for future in as_completed(process_futures):
            key = process_futures[future]
            try:
                processed_item = future.result()
                # Submit the save task
                executor.submit(save, key, processed_item)
            except Exception as e:
                print(f"An error occurred during processing: {e}")

# Example usage
# items = [(key1, item1), (key2, item2), ...]
# main(items)
```
This approach ensures that while one item is being saved (an I/O-bound operation), the processing of the next item can already start, thus better utilizing the available resources and improving overall efficiency.

Does this code take advantage of the fact that the processing of one item doesn't need to wait for the previous item to have been processed or saved? The revised code above does ensure that the processing of each item doesn't have to wait for the previous item to be processed or saved. However, it could be improved to make the processing and saving tasks more clearly independent. Here's an updated version:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def main(items):
    with ThreadPoolExecutor() as executor:
        # Submit all processing tasks and store the futures
        process_futures = {executor.submit(process, item): key for key, item in items}
        # Dictionary to store futures of save tasks
        save_futures = {}
        # Monitor the completion of processing tasks
        for future in as_completed(process_futures):
            key = process_futures[future]
            try:
                processed_item = future.result()
                # Submit the save task and store its future
                save_future = executor.submit(save, key, processed_item)
                save_futures[save_future] = key
            except Exception as e:
                print(f"An error occurred during processing: {e}")
        # Optionally, wait for all save tasks to complete
        for future in as_completed(save_futures):
            try:
                # Check for exceptions in save tasks
                future.result()
            except Exception as e:
                print(f"An error occurred during saving: {e}")

# Example usage
# items = [(key1, item1), (key2, item2), ...]
# main(items)
```
This approach ensures that processing and saving tasks are handled as independently as possible, taking full advantage of parallel execution for both CPU-bound (processing) and I/O-bound (saving) operations.

What would this code look like if you used a reactive programming framework? Reactive programming in Python can be implemented using libraries like RxPy (Reactive Extensions for Python). This approach is quite different from the traditional imperative style, as it involves working with asynchronous data streams and applying various operations on these streams. Here's an example of how the code might look using RxPy to handle the processing and saving of items reactively.

First, ensure you have RxPy installed: `pip install rx`. Then, you can write your code like this:

```python
import multiprocessing

import rx
from rx import operators as ops
from rx.scheduler import ThreadPoolScheduler

def process_item(item):
    processed_item = item  # placeholder: your processing logic here
    return processed_item

def save_item(key, processed_item):
    pass  # placeholder: your saving logic here

def main(items):
    # Calculate the number of available CPU cores and create a ThreadPoolScheduler
    optimal_thread_count = multiprocessing.cpu_count()
    pool_scheduler = ThreadPoolScheduler(optimal_thread_count)
    # Create an observable from the items
    source = rx.from_(items)
    # Define the pipeline
    processed_stream = source.pipe(
        # Map each item to a process function call
        ops.map(lambda key_item: (key_item[0], process_item(key_item[1]))),
        # Execute processing in parallel
        ops.subscribe_on(pool_scheduler),
        # Map each processed item to a save function call
        ops.map(lambda key_processed: save_item(key_processed[0], key_processed[1])),
        # Execute saving in parallel
        ops.subscribe_on(pool_scheduler),
    )
    # Subscribe to the processed stream
    processed_stream.subscribe(
        on_next=lambda x: print("Item processed and saved"),
        on_error=lambda e: print(f"Error occurred: {e}"),
        on_completed=lambda: print("All items processed and saved"),
    )

# Example usage
# items = [(key1, item1), (key2, item2), ...]
# main(items)
```
This reactive approach is beneficial for handling streams of data, especially when the order of items is not critical and you want to process items as they become available. It's also more declarative, expressing the logic of what you want to achieve rather than detailing how to do it step by step.
`meshed` provides tools to compose Python functions into a Directed Acyclic Graph (DAG), defining task dependencies and data flow. By default, tasks (in `meshed`-speak, `FuncNode`s) are executed in topological order by a central, imperative, and sequential scheduler (`DAG.__call__`), which reads inputs from and writes outputs to data nodes.

However, the declarative nature of the `DAG` allows for flexibility in execution models. For example, data parallelism can be achieved by splitting a task into subtasks that operate on different data, and task parallelism can be achieved by executing different tasks concurrently.
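To ground this, here's a minimal sketch of the default model: two functions composed into a `DAG`, where the second consumes the first's output because its parameter name matches the first function's name (the function names and bodies below are made up for illustration; the wiring-by-name convention follows the `meshed` README):

```python
from meshed import DAG

def chunks(wf, chunk_size=3):
    # Split a sequence into fixed-size chunks (illustrative task)
    return [wf[i:i + chunk_size] for i in range(0, len(wf), chunk_size)]

def feature_vector(chunks):
    # Consumes the output of the `chunks` node (wired by parameter name)
    return [sum(chunk) for chunk in chunks]

dag = DAG([chunks, feature_vector])
# DAG.__call__ is the central, sequential scheduler: it runs each FuncNode
# in topological order, reading from and writing to data nodes.
print(dag([1, 2, 3, 4, 5, 6]))  # -> [6, 15]
```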
Furthermore, the location of task execution can be customized. By default, tasks are executed locally and sequentially by the scheduler. However, each task could potentially be assigned to a different worker for execution.
For now, we'll focus on a centralized, imperative execution model, where the DAG scheduler manages task execution and data flow. This contrasts with a distributed system, where tasks are executed independently on networked workers, coordinating their actions via the communication structure imposed by the DAG.
Within the centralized, imperative execution model, the way to alter the computation location is to modify the task function itself. The new function should maintain the same interface as the original, but its internal operations are adjusted to delegate the work elsewhere, as sketched below.
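For example, a generic wrapper can turn any task function into a same-interface variant that delegates execution elsewhere. A minimal sketch, assuming a hypothetical `run_remotely` executor standing in for whatever transport you use:

```python
import functools

def remote_variant(func, run_remotely):
    """Return a drop-in replacement for `func` that keeps its interface
    but delegates the actual computation to `run_remotely`."""
    @functools.wraps(func)  # preserve name/signature metadata for introspection
    def wrapper(*args, **kwargs):
        # Instead of computing locally, hand the call off to the executor
        return run_remotely(func.__name__, args, kwargs)
    return wrapper

def foo(x, y):
    return x + y

# Same interface as foo, different execution location:
# foo_remote = remote_variant(foo, run_remotely=my_cloud_executor)  # hypothetical executor
```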
For instance, consider a task `foo(x, y)` that we want to offload to a cloud-based worker. We would need to create a variant of `foo` that sends a request to the cloud worker to execute `foo` with inputs `x` and `y`, retrieves the response, and returns the relevant part of the response as output.

Typically, communication with cloud-based workers is facilitated through a web service API. Therefore, the revised `foo(x, y)` task would need to convert the Python `x` and `y` inputs into a format compatible with the web service, use the API to send a request to execute `foo` with `x` and `y`, and so on.
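A sketch of what that variant might look like, using `requests` against a hypothetical endpoint (the URL, payload shape, and response format are assumptions for illustration, not a real API):

```python
import requests

WORKER_URL = "https://worker.example.com/execute"  # hypothetical endpoint

def foo(x, y):
    """Cloud-offloaded variant of foo; same interface as the local version."""
    # Convert the Python inputs into a web-service-compatible (JSON) payload
    payload = {"func": "foo", "kwargs": {"x": x, "y": y}}
    # Ask the worker to execute foo with x and y
    response = requests.post(WORKER_URL, json=payload, timeout=30)
    response.raise_for_status()
    # Return the relevant part of the response as the task's output
    return response.json()["result"]
```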