
efficiency: increment_graph_window #5315

Closed · oliver-sanders opened this issue Jan 20, 2023 · 7 comments
Labels: bug (Something is wrong :(), efficiency (For notable efficiency improvements)
Milestone: 8.1.1

Comments

@oliver-sanders (Member)

I've just tried out an old Cylc 7 scaling example and found that Cylc 8 scales disastrously against this benchmark.

The issue occurs when pushing up the number of edges in a workflow. This makes the increment_graph_window function do a lot of work, creating (and destroying) a vast number of non-functional TaskProxy objects.

Here are some --profile results showing the increment_graph_window call stack soaking up CPU:

[screenshot: --profile results, 2023-01-20 10-50-20]

And here's the example workflow that generated it (I used -s TASKS=50):

#!Jinja2                                                                                                   
                                                                                                           
{% set TASKS = TASKS | default(100) %}                                                                     
{% set CYCLES = CYCLES | default(1) %}                                                                     
{% set SLEEP = SLEEP | default(1) %}                                                                       
                                                                                                           
[meta]                                                                                                     
    description = """                                                                                      
        Example scaling workflow with {{ (TASKS * 2) + 2 }} tasks and {{ (TASKS ** 2) + (TASKS * 2) }}                             
        dependencies per cycle which runs for {{ CYCLES }} cycles.                                         
    """                                                                                                    
                                                                                                           
[task parameters]                                                                                          
    x = 1..{{ TASKS }}                                                                                     
    y = 1..{{ TASKS }}                                                                                     
                                                                                                           
[scheduling]                                                                                               
    cycling mode = integer                                                                                 
    initial cycle point = 1                                                                                
    final cycle point = {{ CYCLES }}                                                                       
    [[graph]]                                                                                              
        P1 = """                                                                                           
            d[-P1] => a => b<x> => c<y> => d                                                               
        """                                                                                                
                                                                                                           
[runtime]                                                                                                  
    [[b<x>, c<y>]]                                                                                         
        script = sleep {{ SLEEP }}                                                                         
    [[a, d]]    

The underlying culprits are the platforms and tokens methods:

[screenshot: profile results showing the platforms and tokens methods, 2023-01-19 17-56-34]

There are things we can do to reduce the overheads of these; however, the best solution is to avoid calling them in the first place.

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).

@oliver-sanders added the bug and efficiency labels Jan 20, 2023
@oliver-sanders added this to the cylc-8.1.x milestone Jan 20, 2023
@oliver-sanders (Member, Author) commented Jan 20, 2023

It appears what's happening is that for an example like <a> => <b> ALL a => b edges are "expanded" for EVERY a.

So the graph expansion is running <a> * <b> * <a> times. Given <a> = <b> in this example, this means expand_graph_window is scaling as O(N^3), which is the source of the issue. We need to bring this down to O(N^2).

But even after that there's another factor of two in there somewhere. For my example N=50, so I'd expect 50^3 calls, which is 125,000, but we actually got 262,651:

[screenshot: profile call counts, 2023-01-20 11-49-54]

Which brings the total to O(2N^3).
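
To make the counting concrete, here's a back-of-envelope sketch (plain Python with made-up names, not Cylc code) of why expanding every b<x> => c<y> edge once per b<x> gives ~N^3 walk operations even though the graph only has ~N^2 such edges:

```python
# Back-of-envelope illustration (not Cylc code): expanding all b<x> => c<y>
# edges once per b<x> task gives N * N**2 walk operations, even though the
# graph only contains N**2 distinct edges between the two task families.
N = 50

# the N**2 distinct edges between the b<x> and c<y> families
edges = [(f"b_x{i}", f"c_y{j}")
         for i in range(1, N + 1)
         for j in range(1, N + 1)]

naive_walks = N * len(edges)   # every edge walked once per active b<x>
ideal_walks = len(edges)       # every edge walked once

print(naive_walks)  # 125000 for N=50 (the O(N^3) behaviour described above)
print(ideal_walks)  # 2500
```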

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Jan 20, 2023
* There is a significant overhead to instantiating a `TaskProxy` object.
* Avoid doing this where it is not necessary.
* In my test with the example in cylc#5315
  with `-s TASKS=15 -s CYCLES=1` this reduced the time taken by
  `expand_graph_window` by ~38%, from 12.2s to 7.54s.
@oliver-sanders (Member, Author) commented Jan 20, 2023

I've had a go at writing a graph-walker thingy which expands the graph around multiple tasks. This way we can add a batch of tasks, then expand the window around the batch, which would save walking over the same graph edges.

https://github.com/oliver-sanders/cylc-flow/pull/new/increment-graph-window

I think I've got vaguely the right edges coming out of it, but I'm having real trouble hooking that into the data store. I'm not really sure what increment_graph_window is meant to do, how it registers nodes/edges, or what all those n_window_* dicts are supposed to hold.

I think having one function which walks the graph and another which registers nodes/edges should be easier to understand/optimise than two functions which call each other recursively, as I keep getting lost in that logic.

In the meantime we're trying to make the platforms/tokens interfaces more efficient in order to reduce the impact of the large number of function calls.
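
For illustration, the batched walk described above could look roughly like the sketch below (plain Python over an adjacency dict; `batch_window` and its arguments are invented for this example, not the cylc-flow data store API):

```python
# Rough sketch of the "expand the window around a batch" idea (hypothetical
# helper, not the cylc-flow implementation): a single breadth-first walk out
# from all the active tasks at once, sharing the visited sets so edges common
# to several active tasks are only walked once.
def batch_window(graph, active_tasks, n_max):
    """graph: {node: set of neighbour nodes}; return (nodes, edges) in the window."""
    nodes = set(active_tasks)
    edges = set()
    frontier = set(active_tasks)
    for _ in range(n_max):
        next_frontier = set()
        for node in frontier:
            for nbr in graph.get(node, ()):
                edge = tuple(sorted((node, nbr)))
                if edge in edges:
                    continue          # already walked for another active task
                edges.add(edge)
                if nbr not in nodes:
                    next_frontier.add(nbr)
        nodes |= next_frontier
        frontier = next_frontier
    return nodes, edges
```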

@dwsutherland (Member) commented Jan 21, 2023

> I've had a go at writing a graph-walker thingy which expands the graph around multiple tasks. This way we can add a batch of tasks, then expand the window around the batch, which would save walking over the same graph edges so many times.

The pool hands increment_graph_window a task proxy, and the data (and some functionality) of the proxy is used to populate the respective store node. The purpose of generating task proxies was to call increment_graph_window in the same way the pool does; I didn't realize how expensive it is. Maybe we can just generate the store off existing objects/info; however, we do get a lot of information from it:

def _process_internal_task_proxy(self, itask, tproxy):

> I think I've got vaguely the right edges coming out of it, but I'm having real trouble hooking that into the data store. I'm not really sure what increment_graph_window is meant to do, how it registers nodes/edges, or what all those n_window_* dicts are supposed to hold.

Essentially increment_graph_window:

  • Generate the store node
  • Get list of children and call _expand_graph_window
  • Get list of parents and call _expand_graph_window

Then _expand_graph_window:

  • Generate store edges
  • For any new edges call increment_graph_window to repeat the above for parent/child

The n_window_* dicts (and some other dicts/sets) keep track of the branches associated with the active tasks, for pruning purposes.
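
A heavily simplified sketch of that mutual recursion (the `store` object and its `add_node`/`add_edge`/`get_children`/`get_parents` methods here are invented stand-ins, not the real data store API):

```python
# Simplified sketch of the mutual recursion described above; not the real
# cylc-flow code, the store/graph API here is invented for illustration.
def increment_graph_window(store, active_id, node, distance, n_max):
    store.add_node(node)                              # generate the store node
    store.n_window_nodes[active_id].add(node)
    if distance >= n_max:
        return
    for child in store.get_children(node):            # walk out to children...
        _expand_graph_window(
            store, active_id, node, child, child, distance + 1, n_max)
    for parent in store.get_parents(node):            # ...and to parents
        _expand_graph_window(
            store, active_id, parent, node, parent, distance + 1, n_max)


def _expand_graph_window(store, active_id, left, right, new_node, distance, n_max):
    edge = (left, right)
    if edge in store.n_window_edges[active_id]:
        return                                        # edge already walked
    store.n_window_edges[active_id].add(edge)
    store.add_edge(left, right)                       # generate the store edge
    # for a new edge, repeat the walk from the node on its far side
    increment_graph_window(store, active_id, new_node, distance, n_max)
```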

> I think having one function which walks the graph and another which registers nodes/edges should be easier to understand/optimise than two functions which call each other recursively as I keep getting lost in that logic.

I think I can refactor the code to put the recursion (graph walk) in the same function; however, the problem is a recursive one, especially with variable window sizes.

There are two things I will look into to reduce the CPU cost significantly:

  • Reduce the reliance on TaskProxy objects in the data_manager, by just passing in the data it needs and externalizing functionality to share (where possible).
  • Try recording reference information for the graph walks (like those mysterious n_window_* dicts) and use it to avoid re-walking the same paths (a toy sketch of this follows below).
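
For the second point, a toy illustration of the idea (pure Python with invented names; nothing about the eventual implementation is implied):

```python
# Toy illustration (invented for this comment, not the cylc-flow design):
# record the set of nodes reachable from each (node, depth) pair so a later
# walk that reaches an already-recorded node can reuse the result instead of
# descending into the same paths again.  A real cache like this would need
# housekeeping so it doesn't grow without bound.
walked: dict[tuple[str, int], set[str]] = {}

def window_from(graph, node, n_max):
    """Nodes within n_max edges of node; graph maps node -> set of neighbours."""
    key = (node, n_max)
    if key in walked:
        return walked[key]              # reuse a previously recorded walk
    reached = {node}
    if n_max > 0:
        for nbr in graph.get(node, ()):
            reached |= window_from(graph, nbr, n_max - 1)
    walked[key] = reached
    return reached
```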

> In the meantime we're trying to make the platforms/tokens interfaces more efficient in order to reduce the impact of the large number of function calls.

Great 👍

In fact, we don't even need to run self.platform = get_platform() for store purposes... (we get the platform name from the config; jobs do the full resolving prior to running, which gets added to the new runtime info).
We probably use too much of the information from TaskProxy to not generate one; however, perhaps we can optionally generate store versions that are less expensive.

@oliver-sanders (Member, Author) commented Jan 23, 2023

Thanks for this info, really helpful.

> I didn't realize how expensive it is

My bad. We used to have a profile-battery to help us spot these things but sadly never got around to porting it to Cylc 8. It's been compounded by the platforms/tokens interfaces, which aren't the fastest.

> to avoid re-walking the same paths.

This is definitely the mid/long term solution. We have to be careful with the cache though as it could get quite large if not correctly housekept. It might be worth caching the edge-id with this value as that re-computation gets expensive. We are also looking into caching in the Tokens objects to see if there are any passive gains to be made there.

> In fact, we don't even need to run self.platform = get_platform() for store purposes

I don't know why we set the localhost platform as a default for all TaskProxies in the first place; this is likely a bug in waiting. We're currently looking into removing this.

> perhaps we can optionally generate store versions that are less expensive.

I think [/hope] if a task isn't in the pool then the additional information the TaskProxy object holds shouldn't be necessary because the task has either already run or not yet run.

@oliver-sanders (Member, Author)

> those mysterious n_window_* dicts

Could you give me a quick rundown of what those dicts are doing at some point? I need to understand this stuff better.

@dwsutherland (Member) commented Jan 26, 2023

> those mysterious n_window_* dicts
>
> Could you give me a quick rundown of what those dicts are doing at some point? I need to understand this stuff better.

An overview of the pruning would help; consider:

a => b => c => e
b => d

with b active (and n=1) the associated set of nodes is B = {a, b, c, d};
when c is active the associated set is C = {b, c, e}.

When b leaves the pool (becomes inactive) we flag it, then we take the union(B, ...) of the association sets of all flagged nodes (and/or those no longer in the active pool).
We also take the union(C, ...) of the association sets of the active nodes.
The difference between these union sets is what we prune (plus anything else flagged). There are a few other rules for exceptions, like a change in window size, isolates, and end of branch, but this is essentially it.
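
In code terms, the pruning arithmetic for this example looks roughly like (plain Python sets, purely illustrative):

```python
# Purely illustrative: the pruning set arithmetic for the example above,
# with n=1 windows on the graph  a => b => c => e,  b => d.
window = {
    "b": {"a", "b", "c", "d"},   # B: nodes associated with b while it was active
    "c": {"b", "c", "e"},        # C: nodes associated with the active c
}

flagged = {"b"}                  # b has left the active pool
active = {"c"}

stale = set().union(*(window[i] for i in flagged))   # union(B, ...)
keep = set().union(*(window[i] for i in active))     # union(C, ...)

print(sorted(stale - keep))      # ['a', 'd'] pruned; b survives via c's window
```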

Now:

self.n_window_nodes[active_id] = set() initializes this association for active_id.

self.n_window_edges[active_id] = set() is the same for edges, but it is only used to determine whether an edge has already been walked down for this task (so we don't create another, or walk in a circle).

self.n_window_boundary_nodes[active_id] = {} is a similar association, but keyed by edge distance from active_id; it actually extends to n+1 distance (hence why we walk 1 distance further than n), so we can trigger pruning for anything beyond the window extent (which catches some rare cases). (We don't generate store nodes at the +1 distance, just associations.)
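
Putting that together for the example graph with b active and n=1, a hypothetical snapshot of those structures (simplified names; real keys would be full Cylc task IDs with cycle points) might look like:

```python
# Hypothetical snapshot for the example graph (a => b => c => e, b => d)
# with b active and n=1; real IDs would be full Cylc task IDs.
n_window_nodes = {"b": {"a", "b", "c", "d"}}

n_window_edges = {"b": {("a", "b"), ("b", "c"), ("b", "d")}}

# keyed by edge distance from the active task, extending to n+1 (distance 2
# here) so pruning can be triggered for anything beyond the window extent;
# no store nodes are generated at the +1 distance, just these associations.
n_window_boundary_nodes = {"b": {0: {"b"}, 1: {"a", "c", "d"}, 2: {"e"}}}
```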

Hope that helps

@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.x, cylc-8.x, 8.1.1 Jan 26, 2023
@oliver-sanders (Member, Author) commented Jan 26, 2023

Closing this issue as fixed by #5319 (and improved further by #5321, #5325, ...), thanks all for your work on this one! I think we should be able to get these fixes deployed before anyone runs into the issue for real.

Tracking of further enhancements to be covered by cylc/cylc-admin#38
