Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix process leak on Supervisor with StarterProcess (DPP-84) #77

Open
wants to merge 2 commits into
base: development
Choose a base branch
from

Conversation

tavisrudd
Copy link
Contributor

Fix https://cloud-haskell.atlassian.net/browse/DPP-84

When the StarterProcess ChildStart variant is used with dynamic
supervision, each cycle of startNewChild + terminateChild/deleteChild
leaks a process.

Note, this required converting handleDeleteChild from RestrictedProcess to Process.

When the StarterProcess ChildStart variant is used with dynamic
supervision, each cycle of startNewChild + terminateChild/deleteChild
leaks a process.
@hyperthunk
Copy link
Member

So it is safe for us to kill this ProcessId from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living. But what if the code that created this re-starter tries to then send it to another supervisor? The ProcessId given to that supervisor will be dead/stale and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId which is confusing.

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec becomes unreachable (and is gc'ed). However, the System.Mem.Weak documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId. In particular, this comment:

WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.

The alternative to this would be to document this behaviour of ChildSpec, explaining that once deleted from a supervisor, the ChildSpec becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec and send it to another node, if the same ChildSpec (or even another ChildSpec that shares the same ToChildStart thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the StarterProcess constructor to take a Weak ProcessId and in the ToChildStart instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart datum becomes garbage. Problem solved? Well no.....

Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak infrastructure time to do the cleanup - maybe even forcing a GC at some point.

Secondly, and this is a serious problem: finalisation is useless if/when the ChildStart datum escapes the local node. Which means, in practice, that you can't serialize and send it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart at all (i.e., we will loose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

  1. remove StarterProcess and make people use the Closure Process constructor instead
  2. re-work StarterProcess to reference count its clients/supervisors
  3. remove the ToChildStart instance for Process () and export StarterProcess then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec that uses this re-starter pid, instead of kill restarterPid ... they should exit restarterPid Shutdown and the re-starter process should catchExit expecting Shutdown and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.

The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement the ToChildStart themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec and you know (in your own code) that you're no longer using the ChildSpec or the StarterProcess then you - the one who knows what is going on are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really IMHO belongs - in the hands of the application developers.

Thoughts? ......

@hyperthunk
Copy link
Member

And of course, it'd be really nice if we came up with a way to do "distributed finalization", but that's really beyond the scope of this ticket. :)

@tavisrudd
Copy link
Contributor Author

(1) isn't an option for our use case and (2) sounds seriously hard to get right. I think (3) makes most sense. It would still require cleanup of the restarterPid somewhere, but as you say, the developer using dynamic supervision is best positioned to do that correctly. They'd do something like terminateChild ... >> deleteChild ... >> cleanupStarterProcess. It may be worth keeping the code from the related ToChildStart instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.

@hyperthunk
Copy link
Member

It may be worth keeping the code from the related ToChildStart instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.

Yes I agree that's probably the right thing to do. We could leave it in the Supervisor API, but as a plain top-level function that returns the process id and then provide a mapping from pid to ChildStart:

-- | Spawns a process that can be used to (re)start child processes
-- when they fail. The `ChildStart` is suitable for use in a `ChildSpec`.
-- Note that once the starter process is no longer used, the caller
-- (of this function) is responsible for cleaning up (i.e., stopping) it.
spawnChildStarter :: Process () -> Process (ProcessId, ChildStart)
spawnChildStarter proc = ... etc

@tavisrudd
Copy link
Contributor Author

@hyperthunk I've been experimenting with spawnChildStarter but don't have a patch ready yet. I'd like to send you a few refactoring patches first. The first just makes the implementation details of the StarterProcess variants of ToChildStart private via where clauses. The second moves all the public class definitions, data types and associated derived instances from Supervisor.hs to Control.Distributed.Process.Platform.Supervisor.Types. Does that sound ok to you? If so, I'll open a new PR for each.

@hyperthunk
Copy link
Member

Does that sound ok to you? If so, I'll open a new PR for each.

Yeah that sounds good to me. Thanks for all the work you've been putting around this - it's much appreciated! I'll try and get all the outstanding PR's QA'd and merged this week.

tavisrudd added a commit to BirdseyeSoftware/distributed-process-platform that referenced this pull request Apr 8, 2014
Moved helper functions to where clauses. See the discussion on PR haskell-distributed#77.
tavisrudd added a commit to BirdseyeSoftware/distributed-process-platform that referenced this pull request Apr 9, 2014
This is in preparation for some changes to the handling of the
StarterProcess variants of ToChildStart. See discussion on PR haskell-distributed#77.
@hyperthunk hyperthunk self-assigned this Apr 20, 2014
@hyperthunk
Copy link
Member

So.... where are we with this pull request now @tavisrudd ?

@hyperthunk
Copy link
Member

I'm going to have to release without this fix I'm afraid, as I'm on holiday next week and this is the last chance I'm going to have to actually make the release. Let's get this merged when I'm back online (next Friday) and do a patch 0.1.1 release with it in.

@hyperthunk
Copy link
Member

Caveats - unless you can get me a fix in the next few hours.... :) ?

@tavisrudd
Copy link
Contributor Author

I won't have time till next week unfortunately. The last few weeks of parenthood and work have been a bit intense and I'm just catching a breath now. Enjoy your holiday!

Tavis

On May 30, 2014, at 5:12 AM, Tim Watson [email protected] wrote:

Caveats - unless you can get me a fix in the next few hours.... :) ?


Reply to this email directly or view it on GitHub.

@hyperthunk
Copy link
Member

Hey @tavisrudd - how're you? Do you want me to pick up the fix for this?

@tavisrudd
Copy link
Contributor Author

I'm well, but busier than I've ever been with a new kid and new job. I've
been meaning to get back to this for the past two months but haven't had
time. If you want to take this one over, I won't be insulted.

On Wed, 13 Aug 2014, Tim Watson wrote:

Hey @tavisrudd - how're you? Do you want me to pick up the fix for this?


Reply to this email directly or view it on GitHub:
#77 (comment)

@hyperthunk
Copy link
Member

Ok @tavisrudd - I'm going to pick this one up now. As you can tell, I've been a bit on the busy side myself!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants