This repository was archived by the owner on Sep 3, 2024. It is now read-only.

Child start handling for Process () is broken. #8

Closed
@hyperthunk

Description

First reported as part of haskell-distributed/distributed-process-platform#77, there is a serious fault in the ToChildStart instance for Process ().

Not only do we potentially spin up and subsequently leak starter processes, but the whole premise of this approach is wrong: it can lead to a supervisor waiting indefinitely for a child to start. This breaks the contract between parent and child processes and goes against the design principles of supervision as laid out in the original OTP implementation.

I propose we remove this instance and leave it to implementors to define, but also that we remove StarterPid from the data type, since we have no clean solutions for using these without hitting the issues mentioned above.

Specifically, spawning should be asynchronous, and the indication that a child has died should come from monitor signals. Once the child has spawned, the monitor should be established before the child's code has a chance to proceed (and potentially crash before monitoring is properly in place).
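To illustrate the barrier this implies, here is a self-contained sketch using plain forkIO/MVar as a stand-in for processes (all names here are hypothetical; in distributed-process the fork would be spawnLocal and the commented step would call monitor):

```haskell
import Control.Concurrent (ThreadId, forkIO)
import Control.Concurrent.MVar

-- Sketch of "monitoring is established before the child runs": the child
-- blocks on a barrier until the parent has done its bookkeeping, so it
-- cannot crash before the parent is watching it.
spawnWithBarrier :: IO () -> IO ThreadId
spawnWithBarrier child = do
  barrier <- newEmptyMVar
  tid <- forkIO (takeMVar barrier >> child)  -- child cannot proceed yet
  -- ... establish monitoring of tid here, before the child can crash ...
  putMVar barrier ()                         -- now release the child
  return tid

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- spawnWithBarrier (putMVar done ())
  takeMVar done
  putStrLn "child ran only after the barrier was released"
```

The point of the sketch is only the ordering guarantee: the parent, not the child, decides when execution may begin.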

Some things I think we can/should rule out...

The original fix (from @tavisrudd before the repos were split)

Here's a playback of the commentary there... When the StarterProcess ChildStart variant is used with dynamic supervision, each cycle of startNewChild + terminateChild/deleteChild leaks a process. The proposed fix was to kill the started process, for which we have an ID.

This was broken for two reasons. The first is that we do not evaluate toChildStart in the supervisor process, thus killing the starter process means we can no longer restart the child. The second reason is that we break location transparency, as per this part of the thread:

... So it is safe for us to kill this ProcessId from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living). But what if the code that created this re-starter tries to then send it to another supervisor? The ProcessId given to that supervisor will be dead/stale, and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId, which is confusing. (UPDATE: note that I'd missed the fact that this supervisor won't be able to start children either...)

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec becomes unreachable (and is gc'ed). However, the System.Mem.Weak documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId. In particular, this comment:

WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.

The alternative to this would be to document this behaviour of ChildSpec, explaining that once deleted from a supervisor, the ChildSpec becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec and send it to another node, if the same ChildSpec (or even another ChildSpec that shares the same ToChildStart thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the StarterProcess constructor to take a Weak ProcessId and in the ToChildStart instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart datum becomes garbage. Problem solved? Well no.....

Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak infrastructure time to do the cleanup - maybe even forcing a GC at some point.
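For reference, the attachment mechanics can be sketched with stdlib primitives alone. In this hypothetical sketch an IORef stands in for the ProcessId, and we force the finalizer with finalize rather than waiting on GC, since (as the warning quoted above notes) GC-driven timing is exactly the unreliable part:

```haskell
import Data.IORef
import System.Mem.Weak (mkWeak, finalize)

-- Sketch: attach a finalizer to a value standing in for the starter's
-- ProcessId. In the proposed fix the finalizer would kill the re-starter
-- process; here it just records that it fired.
main :: IO ()
main = do
  killed <- newIORef False      -- records the "kill restarterPid" action firing
  starterPid <- newIORef ()     -- stand-in for the key we'd hold weakly
  w <- mkWeak starterPid () (Just (writeIORef killed True))
  finalize w                    -- run the finalizer deterministically
  readIORef killed >>= print    -- True
```

In the real scenario finalize would not be called explicitly; the finalizer would be left to run when the key becomes garbage, which is precisely what would need profiling to validate.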

Secondly, and this is a serious problem: finalisation is useless if/when the ChildStart datum escapes the local node. Which means, in practice, that you can't serialize and send it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart at all (i.e., we will lose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

  1. remove StarterProcess and make people use the Closure Process constructor instead
  2. re-work StarterProcess to reference count its clients/supervisors
  3. remove the ToChildStart instance for Process () and export StarterProcess then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

UPDATE: I actually think (1) is what we should do now. I went on to say the following, which I'll recant shortly...

The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec that uses this re-starter pid, instead of kill restarterPid ... they should exit restarterPid Shutdown and the re-starter process should catchExit expecting Shutdown and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.
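The bookkeeping behind (2) can be modelled with an STM counter (a hypothetical sketch; in the actual starter process the decrement would sit inside the catchExit handler matching the Shutdown reason, and reaching zero would trigger termination):

```haskell
import Control.Concurrent.STM

-- Hypothetical ref-counting for option (2): each supervisor that uses the
-- starter bumps the count; each ChildSpec deletion (an exit ... Shutdown,
-- caught via catchExit in the real process) decrements it. The starter
-- would terminate only once the count reaches zero.
attach, detach :: TVar Int -> STM Int
attach refs = modifyTVar' refs (+ 1) >> readTVar refs
detach refs = modifyTVar' refs (subtract 1) >> readTVar refs

main :: IO ()
main = do
  refs <- newTVarIO 0
  _ <- atomically (attach refs)  -- supervisor A starts using the starter
  _ <- atomically (attach refs)  -- supervisor B does too
  _ <- atomically (detach refs)  -- A deletes its ChildSpec
  n <- atomically (detach refs)  -- B deletes its ChildSpec
  putStrLn (if n == 0 then "starter terminates" else "starter keeps running")
```

Note that this only captures the counting; it does nothing to address the remote-escape problem described above, which is why the option is not really viable.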

The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement ToChildStart themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec, and you know (in your own code) that you're no longer using the ChildSpec or the StarterProcess, then you - the one who knows what is going on - are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really (IMHO) belongs - in the hands of the application developers.

The problem with (2) and (3) is that we introduce a huge degree of complexity for very little benefit. Despite the original report in https://cloud-haskell.atlassian.net/browse/DPP-81, it is not difficult to turn a Process () expression into a closure without template haskell, and therefore we're jumping through hoops for hardly any reason at all.

But there's more: the way that StarterProcess works is fundamentally broken, because the supervisor has to interact with this other known process to spawn its children and get back a process id and monitor ref. If we do not own the code for that starter process then we've no way of ensuring we won't block indefinitely waiting for a reply. Even if we do, if the starter resides on a foreign node, network congestion could cause almost indefinite blocking (even in the presence of heartbeats), so this is a big no-no. We cannot have the supervisor process blocked waiting like that - spawning a child has to be asynchronous, and there needs to be a guarantee that monitoring has been set up correctly before the child proceeds to run, which requires a wrapper that we need to have written. We cannot leave this up to 3rd parties.

Because of this (above), we have to have a thunk that we can execute, which we can wrap in the relevant spawn/monitor code for safety. Since we cannot send Process a over the wire, only Closure (Process a) is possible - allowing for the CreateHandle instance obviously, which takes Closure (SupervisorPid -> Process (ChildPid, Message)) - and we cannot operate in terms of Process a. A concession would be to break ChildSpec up into those specs used to define the supervisor and those used for dynamic supervision, allowing Process () in a child spec used to boot the supervisor but not in subsequent calls to addChild etc. I really do not like this idea however, since not only does it create an imbalanced API, it could also prevent us sending supervisor setup data over the wire (since it could potentially contain Process () instead of Closure (Process ()) thunks), and that is a pretty huge disadvantage.

The final concession we might make, would be to separate out the supervisor runtime implementation into two parts, one that handles local children and another for remote. I still don't like this idea though, because we end up with a fractured API, but I will consider it if people shout asking for Process () to be supported as a ToChildStart instance again. The approach I would take to achieve this would involve (a) breaking up the server so that the child start handling could be modified, (b) segregating the client APIs and having two supervisor client handles, one for local only and another for remote only supervisors. The client handle for a local only supervisor would carry an STM.TChan used for sending unserialisable thunks to the server. This would require us to implement haskell-distributed/distributed-process-client-server#9 first.
