Collapse and expand nodes #54

thorwhalen · 2022-03-03T21:17:37Z

thorwhalen
Mar 3, 2022
Maintainer

As DAGs get more nodes, things can quickly get unwieldy. One of our tools to mitigate this is the ability to collapse and expand sub-DAGs. This is no revolutionary idea. This kind of chunking (a psychology term) can be found all over software: we group lines of code into functions, group these into other functions (and/or classes), group these into modules, modules into packages, etc.

Here, we just want to have tools that do this, specialized for DAGs (and other mesh objects!).

Any sub-DAG is a DAG itself, therefore a callable itself, so could be collapsed into a single FuncNode, as well as expanded back to the original DAG.

Consideration

Though collapsing sub-DAGs one by one results in a partition of the DAG, there are many ways to do that.

When we expand a collapsed node, how the expansion integrates with the rest of the DAG is problematic. Say you make a DAG with two nodes, f and g, and g itself was a collapsed DAG. If you expand it, making a var node x appear, this x might clash with a var node x that f has as an input. It may not be the same x. This problem won't happen if you start with a totally expanded DAG and collapse nodes from there, but it could happen if you make a DAG with FuncNodes that are DAGs themselves.
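The name clash described above can be made concrete with a small sketch. It is not meshed code: nodes are plain `(func_name, input_names, output_name)` triples, and the `expand` helper and its prefixing policy are hypothetical choices made only for illustration.

```python
def var_names(nodes):
    """All variable names read or written by (func_name, inputs, out) triples."""
    return {name for _, inputs, out in nodes for name in (*inputs, out)}


def expand(outer, name, inner):
    """Replace the node called `name` in `outer` by the nodes of `inner`.

    Inner variables that are not part of the collapsed node's interface
    (its inputs and its output) get a prefix when they clash with a
    variable name used elsewhere in the outer DAG.
    """
    [(_, inputs, out)] = [node for node in outer if node[0] == name]
    rest = [node for node in outer if node[0] != name]
    interface = {*inputs, out}
    clashes = (var_names(inner) - interface) & var_names(rest)

    def rename(var):
        # Hypothetical policy: prefix with the name of the collapsed node
        return f"{name}.{var}" if var in clashes else var

    expanded = [(f, tuple(rename(i) for i in ins), rename(o)) for f, ins, o in inner]
    return rest + expanded


# f reads an outer variable x; g is a collapsed DAG that has its own internal x.
outer = [("f", ("x",), "y"), ("g", ("y",), "z")]
g_inner = [("g1", ("y",), "x"), ("g2", ("x",), "z")]

expanded = expand(outer, "g", g_inner)
```

Here the internal `x` of `g` comes out as `g.x`, so it is not confused with the `x` that `f` takes as input. Other policies (raising an error, or asking the user for a mapping) would be just as defensible.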

Once we expand a collapsed node, if we don't do something extra (like having a ledger of what nodes were collapsed), we wouldn't have the required information to revert that expansion back to the collapsed form.

It may be useful to have these "ledgers" that record sets of nodes that are good candidates to be used as FuncNodes, but these sets of nodes (i.e. DAGs) could overlap. When mapping DAG operations to a GUI, it may be useful to enable the selection and naming of sets of nodes into a ledger, which could then be used to highlight the corresponding nodes in the current DAG and allow these to be collapsed. The same ledger could then be used to hold the sets of nodes created by expanding nodes in the DAG, so we keep the info we need to re-collapse them.
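As a rough illustration of such a ledger, here is a minimal sketch. The `Ledger` class, its method names and the node names are all hypothetical; nothing here is part of meshed.

```python
class Ledger:
    """Hypothetical registry of named groups of nodes."""

    def __init__(self):
        self._groups = {}

    def record(self, name, nodes):
        """Remember `nodes` under `name` (on selection, or when a node is expanded)."""
        self._groups[name] = frozenset(nodes)

    def overlapping(self):
        """Pairs of group names that share at least one node."""
        names = sorted(self._groups)
        return [
            (a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if self._groups[a] & self._groups[b]
        ]

    def collapsible_in(self, current_nodes):
        """Names of groups whose nodes are all present in the current DAG."""
        current = set(current_nodes)
        return sorted(name for name, nodes in self._groups.items() if nodes <= current)


ledger = Ledger()
ledger.record("prepare", {"load", "clean"})
ledger.record("features", {"clean", "embed"})
ledger.record("report", {"plot", "export"})

overlaps = ledger.overlapping()
candidates = ledger.collapsible_in({"load", "clean", "plot", "export"})
```

Being present in the current DAG is only a necessary condition here; whether a group can really be collapsed also depends on how its nodes are connected to the rest of the DAG.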

Example: Rolling in (orphaned) nodes

DAG.partial had a (not implemented) _roll_in_orphaned_nodes argument, described as follows:

When True, actually change the DAG so it only contains the nodes it "needs": that is, recursively remove func-nodes (and their output nodes) if their inputs are fixed.

This argument was removed because it was not clear what it should mean or how useful it would be at this juncture.

This issue is to keep track of the thinking around this. It also fits into the general thinking about which operations we'd like to have for DAGs (or other meshed data structures).

Considerations

In the case of rolling in nodes (which is also related to expanding nodes that are DAGs), here are some considerations:

What more do we mean than simply removing nodes (and with them, the edges attached to them) from the underlying graph?

Say we have a node representing f(a=1, b=2): we may not want to have a and b show up in the signature, or the graphviz display, but what do we do with the value it should compute? We could compute the function, then store the result in the out node, to then be used as the default of the functions that use it. Maybe recursively doing the same with those functions.

One problem with this arises if f is not deterministic: the original DAG and our "rolled in" DAG won't behave the same way. This simplification and caching mechanism can still be useful in many contexts, but it should be requested explicitly.
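The "compute, store, and recurse" idea can be sketched as follows. This is only an illustration under stated assumptions: the DAG is a plain dict mapping each output name to `(func, input_names)`, and `roll_in` is a hypothetical helper, not the removed `_roll_in_orphaned_nodes` behavior.

```python
def roll_in(nodes, fixed):
    """Precompute every node whose inputs are all fixed, repeatedly.

    nodes: {output_name: (func, input_names)}
    fixed: {var_name: value}
    Returns the nodes that still need computing, and all known values.
    Only meaningful if the precomputed functions are deterministic.
    """
    nodes, values = dict(nodes), dict(fixed)
    changed = True
    while changed:
        changed = False
        for out, (func, inputs) in list(nodes.items()):
            if all(name in values for name in inputs):
                # The stored value can serve as a default for downstream functions
                values[out] = func(*(values[name] for name in inputs))
                del nodes[out]
                changed = True
    return nodes, values


def add(a, b):
    return a + b


def double(c):
    return 2 * c


def scale(d, w):
    return d * w


dag = {"c": (add, ("a", "b")), "d": (double, ("c",)), "e": (scale, ("d", "w"))}
remaining, values = roll_in(dag, {"a": 1, "b": 2})
```

After rolling in, only the node producing `e` remains, with `d` available as a precomputed value and `w` as the only input left to provide.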

thorwhalen · 2023-09-19T07:00:52Z

thorwhalen
Sep 19, 2023
Maintainer Author

AI chat.

(Beware of hallucinations!)

5 replies

thorwhalen Sep 19, 2023
Maintainer Author

collapse

I'm writing a DAG package to make data processing pipelines. I'd like to write a function collapse(dag, func_nodes) that will return a DAG that is functionally equivalent to the input dag, but where the func_nodes have been "collapsed" into a single node that is functionally equivalent. By functionally equivalent I mean "has same input signature and output, and returns the same output for the same input values".

I called this "collapse", but what is this called in the literature?
Tell me, for 2-3 most common DAG packages what that method/function is called.
Write some python code that shows me what the general algorithm is for this.

The process you're describing, where you are merging multiple nodes of a Directed Acyclic Graph (DAG) into a single node while preserving the original DAG's functional properties, is often referred to as node fusion or graph fusion in the context of optimizing computational graphs.

Popular DAG-based or graph-based packages often employ fusion techniques, particularly when they are used for deep learning or other numerical computation tasks where computational efficiency is critical.

Here are the terms used in some common packages:

TensorFlow: TensorFlow refers to this process as graph optimization and employs various techniques including node fusion under the hood, especially in its GraphDef optimizations.
Apache Airflow: Airflow, a platform for programmatically authoring, scheduling, and monitoring workflows, doesn't directly use the term "node fusion." Its primary focus is on scheduling rather than on optimizing computational tasks. So, the concept of merging tasks isn't a primary concern in this context.
Dask: Dask is a parallel computing library that integrates with Python's existing ecosystem. Its task-graph optimizations include a fuse function (dask.optimization.fuse) that merges several tasks into one, so "fusion" is the term used there.

Now, here's a very simplified algorithm for collapsing/fusing nodes in a DAG:

def fuse_nodes(dag, nodes_to_fuse):
    """
    Given a dag and a list of nodes, fuse those nodes into a single node.
    This is a simplified version and assumes that the given nodes can be fused linearly.
    """
    # Create the new node
    fused_node = create_fused_node(nodes_to_fuse)
    
    # Find all the edges coming into the first node in nodes_to_fuse
    incoming_edges = find_incoming_edges(dag, nodes_to_fuse[0])

    # Find all the edges going out from the last node in nodes_to_fuse
    outgoing_edges = find_outgoing_edges(dag, nodes_to_fuse[-1])
    
    # Remove all the nodes_to_fuse from the dag
    for node in nodes_to_fuse:
        dag.remove_node(node)

    # Add the fused_node to the dag
    dag.add_node(fused_node)
    
    # Connect all the incoming edges to the fused_node
    for edge in incoming_edges:
        dag.add_edge(edge[0], fused_node)
    
    # Connect all the outgoing edges from the fused_node
    for edge in outgoing_edges:
        dag.add_edge(fused_node, edge[1])
    
    return dag

This is a very high-level and basic algorithm. A complete implementation would need to deal with many edge cases, handle data dependencies correctly, and ensure that fusing is performed in a way that doesn't introduce errors or inefficiencies.
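For readers who want something runnable, here is a minimal sketch of the same idea on a plain adjacency dict (`{node: set_of_successors}`, with every node present as a key). That representation and the `fused_name` parameter are assumptions made for this example. Unlike the snippet above, it rewires every edge that crosses the boundary of the group, not only those of the first and last node, and it does not check whether the selection is valid.

```python
def fuse_nodes(dag, nodes_to_fuse, fused_name):
    """Return a new adjacency dict where nodes_to_fuse are merged into fused_name."""
    group = set(nodes_to_fuse)

    def relabel(node):
        return fused_name if node in group else node

    fused = {}
    for node, successors in dag.items():
        targets = fused.setdefault(relabel(node), set())
        targets.update(relabel(s) for s in successors)

    # Edges between two fused nodes became self-loops: drop them
    fused.setdefault(fused_name, set()).discard(fused_name)
    return fused


dag = {"a": {"b"}, "b": {"c", "d"}, "c": {"e"}, "d": {"e"}, "e": set()}
fused = fuse_nodes(dag, ["c", "d"], fused_name="cd")
```

The input dict is left untouched, and the result has `b` pointing to `cd` and `cd` pointing to `e`.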

thorwhalen Sep 19, 2023
Maintainer Author

can_fuse

It's the nodes_to_fuse part that I'm not clear on.

For example, it's not clear to me whether I can fuse any set of nodes_to_fuse, or whether these must satisfy some conditions regarding their position in the dag and their relationship with other nodes.

Please write a function that shows how that works in a bit more detail.

You're correct. There are constraints on which nodes can be fused:

Dependency: The nodes to be fused should form a connected subgraph. In other words, there should be a direct path through only the nodes in nodes_to_fuse from the first node to the last node. No nodes outside of this set should be in between.
No External Dependencies: No node inside nodes_to_fuse should have a dependency on a node outside of this set, except for the first node. Similarly, no node outside of nodes_to_fuse should depend on a node inside this set, except for the last node.

With these constraints in mind, let's write a function that checks if a given set of nodes can be fused:

def can_fuse(dag, nodes_to_fuse):
    """
    Check if a given set of nodes in a dag can be fused together.
    """
    nodes = set(nodes_to_fuse)

    # Every node in nodes_to_fuse should be reachable from the first one
    # through only nodes in nodes_to_fuse
    visited = set()
    stack = [nodes_to_fuse[0]]
    while stack:
        node = stack.pop()
        visited.add(node)

        for successor in dag.successors(node):
            if successor in nodes and successor not in visited:
                stack.append(successor)

    # Fusable (by this criterion) only if the traversal reached every node
    return visited == nodes

Now, let's use this can_fuse function in our fuse_nodes function:

def fuse_nodes(dag, nodes_to_fuse):
    """
    Fuse nodes_to_fuse in the dag into a single node if they can be fused.
    """
    
    if not can_fuse(dag, nodes_to_fuse):
        raise ValueError("The given nodes cannot be fused")
    
    # ... [rest of the code as before]

In this implementation, the can_fuse function checks for a path from the first node to the last node in the nodes_to_fuse list. If there's a path and all the nodes in nodes_to_fuse are visited by following this path, then the nodes can be fused. If not, fuse_nodes raises a ValueError.

thorwhalen Sep 19, 2023
Maintainer Author

Example of invalid nodes-to-collapse selection?

Note: Hallucination? I don't agree with the answer at all. In my mind, BCE can totally be collapsed, with the original dag

graph TD
  A --> B
  B --> C
  A --> D
  D --> E
  E --> C

becoming

graph TD
  A --> BCE
  A --> D
  D --> BCE


Can you give me an example of a dag + func_nodes that would not be a valid selection for fusion?

Certainly! Let's consider a simple DAG represented by the following structure:

A -> B -> C
|         ^
v         |
D ------> E

Here, A, B, C, D, and E are the nodes, and the arrows represent directed edges from one node to another.

Now, let's consider func_nodes = [B, C, E] as the set of nodes we want to fuse.

In this scenario, we cannot fuse B, C, and E into a single node for several reasons:

They don't form a connected subgraph. There's no direct path that goes from B to C to E using only the nodes in func_nodes.
E has an external dependency on D, which is outside of our func_nodes set.
External nodes depend on nodes inside func_nodes besides just the last node. For instance, C (which isn't the last in our list) is an ancestor to E.

Thus, the choice of func_nodes = [B, C, E] is not a valid selection for fusion in this DAG.
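For comparison, a criterion commonly used for this kind of merge is that it must not create a cycle, which is the case exactly when no path leaves the selected group and later re-enters it. The sketch below applies that criterion; the adjacency-dict representation and the `can_collapse` name are assumptions of this example, and the criterion is not the one stated in the answer above.

```python
def reachable(dag, sources):
    """All nodes reachable from `sources` by following one or more edges."""
    seen = set()
    stack = [succ for src in sources for succ in dag[src]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen


def can_collapse(dag, group):
    """True if no path leaves `group` and later re-enters it."""
    group = set(group)
    outside_successors = {s for node in group for s in dag[node]} - group
    downstream = outside_successors | reachable(dag, outside_successors)
    return not (downstream & group)


# The DAG from the example: A -> B -> C, A -> D -> E -> C
dag = {"A": {"B", "D"}, "B": {"C"}, "C": set(), "D": {"E"}, "E": {"C"}}

bce_ok = can_collapse(dag, {"B", "C", "E"})
ac_ok = can_collapse(dag, {"A", "C"})
```

Under this criterion the selection B, C, E is accepted, since nothing downstream of it exists outside the group. A selection such as A, C is rejected, because the path A -> B -> C passes through B, and merging A with C would make B both a successor and a predecessor of the merged node.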

thorwhalen Sep 19, 2023
Maintainer Author

Inverse of fusion?

What would the inverse of "fusion" be? That is, what word is usually used to describe the inverse of fuse_nodes, whereby, with an input dag and (single) func_node (that resulted from the fusion of several nodes), we get the original dag, before the fusion.
How must fuse_nodes work so that this inverse is possible? What are the best-practice designs for this (for example, using a "ledger" to remember the original state, or using an instance of a special class which produces a callable, but contains within it the actual sub-DAG corresponding to the fused nodes)?

The inverse operation of "fusion" is typically referred to as "expansion" or "splitting". When a node that was produced by fusion is expanded or split, it's broken down into its original constituent nodes.

For the inverse operation to be possible:

Retention of Original Information: fuse_nodes should retain all the information about the original nodes and their interconnections. This information should be sufficient to reconstruct the original DAG from the fused node.
Ledger/Metadata Storage: One common way to enable expansion is to maintain a ledger or metadata along with the fused node. This metadata contains the exact specifications of the nodes that were fused and their relationships. If you're working with objects, this metadata can be an attribute of the fused node.
Special Class: As you suggested, another robust method is to use a special class for the fused node. The class would produce a callable for the combined operations but also retain the original sub-DAG or operations internally. This makes splitting/expanding very straightforward, as the original information is encapsulated within the object.

Example design:

class FusedNode:
    def __init__(self, sub_dag):
        self.sub_dag = sub_dag

    def __call__(self, *args, **kwargs):
        # Implement the combined operations of the sub_dag
        # This will be a series of operations represented by the nodes in the sub_dag
        pass

    def expand(self):
        # Use self.sub_dag to recreate the original nodes and their connections
        return self.sub_dag

# Using the design
fused_node = FusedNode(sub_dag=some_sub_dag)
result = fused_node(some_input)  # This will execute the combined operations
original_dag = fused_node.expand()  # This will give you the original nodes and their connections

Expansion Functionality: Whether you're using metadata storage or a special class, you would also need a function to handle the expansion process. This function would take a DAG with a fused node and use the ledger or class information to recreate the original nodes and relationships, replacing the fused node with the original sub-DAG.
Handling Edge Cases: Ensure that external dependencies and interconnections are handled correctly. When expanding, you must re-establish any connections from outside the fused node to nodes inside the sub-DAG.

In conclusion, the best practice would depend on the specific use case and the requirements of the application. If the primary goal is efficient execution with occasional expansion, the special class method might be more suitable. If the DAG undergoes frequent changes, then a metadata/ledger approach might be easier to manage.
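The FusedNode design above leaves `__call__` unimplemented. A minimal runnable variant might look like the following, assuming the sub-DAG is given as a list of `(func, input_names, output_name)` steps already in dependency order. That representation, and the `inputs` and `output` attributes, are assumptions of this sketch and not part of the original design.

```python
class FusedNode:
    """Callable that keeps the steps it was built from, so it can be expanded again."""

    def __init__(self, steps):
        # steps: (func, input_names, output_name) triples, assumed in dependency order
        self.steps = list(steps)
        produced = {out for _, _, out in self.steps}
        # Inputs of the fused node: names that are read but never produced inside
        self.inputs = tuple(dict.fromkeys(
            name for _, ins, _ in self.steps for name in ins if name not in produced
        ))
        self.output = self.steps[-1][2]

    def __call__(self, **kwargs):
        values = dict(kwargs)
        for func, ins, out in self.steps:
            values[out] = func(*(values[name] for name in ins))
        return values[self.output]

    def expand(self):
        """Return the original steps, untouched."""
        return list(self.steps)


def double(x):
    return 2 * x


def square(y):
    return y ** 2


steps = [(double, ("x",), "y"), (square, ("y",), "z")]
fused = FusedNode(steps)
result = fused(x=3)
```

Because the steps are stored as they were given, `fused.expand()` returns exactly what went in, which is what makes the round trip between collapsing and expanding possible without a separate ledger.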

thorwhalen Sep 19, 2023
Maintainer Author

If not collapse or fusion, what term then?

Indeed, this is the approach I was taking. My "collapsing" of func_nodes wasn't so much a "fusion", which suggests that I'd be compiling the func_nodes into something optimized. I do have that as well, and I might call it "fusion" instead of "compilation", which is what I was calling it. But it seems like "fusion" wouldn't be the right term if I'm just, essentially, encapsulating the sub-DAG defined by the func_nodes. What other terms might you suggest (favor those that might be more standard for this domain)?

Encapsulating a sub-DAG without optimizing it sounds like you're primarily aiming to manage or reduce the complexity of the DAG. Here are some terms that might better fit this operation:

Grouping: This term indicates that you're bundling a set of nodes together, without necessarily implying any optimization.
Subgraph Encapsulation: A bit verbose, but it quite clearly describes taking a portion of a graph and treating it as a single unit.
Node Aggregation: Aggregation often implies a bringing together without loss, so it might fit your purpose.
Packing/Unpacking: This would involve 'packing' multiple nodes into one and 'unpacking' them back.
Cluster: This term is used in graph theory and visualization to describe a group of closely related nodes.
Module/Modularization: The idea of treating a sub-DAG as a module might also work, depending on the context.
Decomposition/Recomposition: Break down the DAG into smaller parts and then recompose it back together.

Of these terms, Grouping, Packing, and Module/Modularization are probably the most commonly understood in various contexts. If you're presenting this to an audience familiar with DAGs and graph operations, Cluster and Subgraph Encapsulation might also be appropriate.

Regardless of the term you choose, it's essential to provide clear documentation or explanations to ensure others understand the operation's exact nature and purpose.

thorwhalen · 2025-02-25T12:48:20Z

thorwhalen
Feb 25, 2025
Maintainer Author

expanding and collapsing function calls

The fact that we separate variable and functional nodes in a DAG is not very compliant with
the Uniform Access Principle (UAP).
This obliges us to see (root) var nodes as inputs and func nodes as "parametrization",
or worse, as something we can only change via function injection.

We'd like to be able to choose what should be an input and what should be fixed with less awkwardness.
We can fix an argument's value with partial, thus parametrizing the DAG, but how do we
"unfix" a function so that it becomes an input?

Let's have a look at a "pipeline" that one would use to visualize segments of text.
One would need to prepare the data (so that it's easy to get metadata and text from it),
embed segments of the text (we'll hide the details here), cluster and planarize the embeddings
(vectors of high dimension), and finally aggregate everything so it can be readily used by
a visualization function.

from meshed import code_to_dag

@code_to_dag
def embed_process():
    """A "pipeline" that one would use to visualize segments of text. 
    One would need to prepare the data (so that it's easy to get metadata and text from it),
    embed segments of the text (we'll hide the details here), cluster and planarize the embeddings
    (vectors of high dimension), and finally aggregate everything so it can be readily used by 
    a visualization function. 
    """
    base_data = prepper(data_src)
    texts_metadata = metadata_extractor(base_data)
    embeddings = embedder(base_data)
    planar_vectors = planarizer(embeddings)
    cluster_indices = clusterer(embeddings)
    aggregate = aggregator(texts_metadata, planar_vectors, cluster_indices)

embed_process.dot_digraph()

That's all fine and good, but how do I change my prepper or embedder?
Yes, okay, by injecting the ones I want to use, but what if I want to really consider these
two as inputs of my pipeline, so that the user can easily provide the ones they want to use?

One answer is: By "expanding" the functions into "function callers".
That is, we change the prepper node, for instance, into a call node that takes the prepper and data_src as inputs, and simply does
prepper(data_src) with it.

from meshed.scrap.collapse_and_expand import expand_function_calls

expand_function_calls(embed_process, include=['prepper', 'embedder']).dot_digraph()

If you need to do the inverse to pack things a bit (for example, because it's easier to read the diagram),
you can do so:

from meshed.scrap.collapse_and_expand import collapse_function_calls

collapse_function_calls(embed_process, include=['prepper', 'embedder']).dot_digraph()
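Outside of meshed, the idea behind expanding and collapsing a function call can be sketched in plain Python. Everything below (`call`, `default_prepper`, the two `base_data_*` functions) is hypothetical and only mirrors the prepper step of the pipeline above; it says nothing about how expand_function_calls is implemented.

```python
from functools import partial


def call(func, *args):
    """Generic "function caller": the function is just another input."""
    return func(*args)


def default_prepper(data_src):
    """Stand-in prepper, for illustration only."""
    return [line.strip() for line in data_src]


# Collapsed form: the prepper is baked in, only data_src is an input.
def base_data_collapsed(data_src):
    return default_prepper(data_src)


# Expanded form: prepper becomes an input next to data_src.
def base_data_expanded(prepper, data_src):
    return call(prepper, data_src)


# Collapsing again amounts to fixing the function argument.
recollapsed = partial(base_data_expanded, default_prepper)

lines = [" first segment ", "second segment\n"]
same = base_data_collapsed(lines) == base_data_expanded(default_prepper, lines)
```

In the expanded form a user can pass any prepper they like, while the re-collapsed callable behaves like the original step.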

0 replies
