This repository was archived by the owner on Sep 3, 2024. It is now read-only.

Child start handling for Process () is broken. #8

Closed
@hyperthunk

Description

First reported as part of haskell-distributed/distributed-process-platform#77, there is a serious fault in the ToChildStart instance for Process ().

Not only do we potentially spin up and subsequently leak starter processes, but the whole premise of this approach is wrong: it can lead to a supervisor waiting indefinitely for a child to start. This breaks the contract between parent and child processes and goes against the design principles of supervision as laid out in the original OTP implementation.

I propose we remove this instance and leave it to implementors to define, but also that we remove StarterPid from the data type, since we have no clean solutions for using these without hitting the issues mentioned above.

Specifically, spawning should be asynchronous, and the indication that a child has died should come from monitor signals. Once the child has spawned, the monitor should be established before the child's code has a chance to proceed (and potentially crash before monitoring is properly in place).
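To illustrate the barrier this implies, here is a self-contained sketch using plain forkIO/MVar as a stand-in for processes (all names here are hypothetical; in distributed-process the fork would be spawnLocal and the commented step would call monitor):

```haskell
import Control.Concurrent (ThreadId, forkIO)
import Control.Concurrent.MVar

-- Sketch of "monitoring is established before the child runs": the child
-- blocks on a barrier until the parent has done its bookkeeping, so it
-- cannot crash before the parent is watching it.
spawnWithBarrier :: IO () -> IO ThreadId
spawnWithBarrier child = do
  barrier <- newEmptyMVar
  tid <- forkIO (takeMVar barrier >> child)  -- child cannot proceed yet
  -- ... establish monitoring of tid here, before the child can crash ...
  putMVar barrier ()                         -- now release the child
  return tid

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- spawnWithBarrier (putMVar done ())
  takeMVar done
  putStrLn "child ran only after the barrier was released"
```

The point of the sketch is only the ordering guarantee: the parent, not the child, decides when execution may begin.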

Some things I think we can/should rule out...

The original fix (from @tavisrudd before the repos were split)

Here's a playback of the commentary there... When the StarterProcess ChildStart variant is used with dynamic supervision, each cycle of startNewChild + terminateChild/deleteChild leaks a process. The proposed fix was to kill the started process, for which we have an ID.

This was broken for two reasons. The first is that we do not evaluate toChildStart in the supervisor process, thus killing the starter process means we can no longer restart the child. The second reason is that we break location transparency, as per this part of the thread:

... So it is safe for us to kill this ProcessId from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living). But what if the code that created this re-starter tries to then send it to another supervisor? The ProcessId given to that supervisor will be dead/stale, and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId, which is confusing. (UPDATE: note that I'd missed the fact that this supervisor won't be able to start children either...)

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec becomes unreachable (and is gc'ed). However, the System.Mem.Weak documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId. In particular, this comment:

WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.

The alternative to this would be to document this behaviour of ChildSpec, explaining that once deleted from a supervisor, the ChildSpec becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec and send it to another node, if the same ChildSpec (or even another ChildSpec that shares the same ToChildStart thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the StarterProcess constructor to take a Weak ProcessId and in the ToChildStart instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart datum becomes garbage. Problem solved? Well no.....

Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak infrastructure time to do the cleanup - maybe even forcing a GC at some point.
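For reference, the attachment mechanics can be sketched with stdlib primitives alone. In this hypothetical sketch an IORef stands in for the ProcessId, and we force the finalizer with finalize rather than waiting on GC, since (as the warning quoted above notes) GC-driven timing is exactly the unreliable part:

```haskell
import Data.IORef
import System.Mem.Weak (mkWeak, finalize)

-- Sketch: attach a finalizer to a value standing in for the starter's
-- ProcessId. In the proposed fix the finalizer would kill the re-starter
-- process; here it just records that it fired.
main :: IO ()
main = do
  killed <- newIORef False      -- records the "kill restarterPid" action firing
  starterPid <- newIORef ()     -- stand-in for the key we'd hold weakly
  w <- mkWeak starterPid () (Just (writeIORef killed True))
  finalize w                    -- run the finalizer deterministically
  readIORef killed >>= print    -- True
```

In the real scenario finalize would not be called explicitly; the finalizer would be left to run when the key becomes garbage, which is precisely what would need profiling to validate.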

Secondly, and this is a serious problem: finalisation is useless if/when the ChildStart datum escapes the local node. Which means, in practice, that you can't serialize and send it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart at all (i.e., we will lose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

  1. remove StarterProcess and make people use the Closure Process constructor instead
  2. re-work StarterProcess to reference count its clients/supervisors
  3. remove the ToChildStart instance for Process () and export StarterProcess then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

UPDATE: I actually think (1) is what we should do now. I went on to say the following, which I'll recant shortly...

The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec that uses this re-starter pid, instead of kill restarterPid ... they should exit restarterPid Shutdown and the re-starter process should catchExit expecting Shutdown and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.
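The bookkeeping behind (2) can be modelled with an STM counter (a hypothetical sketch; in the actual starter process the decrement would sit inside the catchExit handler matching the Shutdown reason, and reaching zero would trigger termination):

```haskell
import Control.Concurrent.STM

-- Hypothetical ref-counting for option (2): each supervisor that uses the
-- starter bumps the count; each ChildSpec deletion (an exit ... Shutdown,
-- caught via catchExit in the real process) decrements it. The starter
-- would terminate only once the count reaches zero.
attach, detach :: TVar Int -> STM Int
attach refs = modifyTVar' refs (+ 1) >> readTVar refs
detach refs = modifyTVar' refs (subtract 1) >> readTVar refs

main :: IO ()
main = do
  refs <- newTVarIO 0
  _ <- atomically (attach refs)  -- supervisor A starts using the starter
  _ <- atomically (attach refs)  -- supervisor B does too
  _ <- atomically (detach refs)  -- A deletes its ChildSpec
  n <- atomically (detach refs)  -- B deletes its ChildSpec
  putStrLn (if n == 0 then "starter terminates" else "starter keeps running")
```

Note that this only captures the counting; it does nothing to address the remote-escape problem described above, which is why the option is not really viable.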

The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement ToChildStart themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec, and you know (in your own code) that you're no longer using the ChildSpec or the StarterProcess, then you - the one who knows what is going on - are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really (IMHO) belongs - in the hands of the application developers.

The problem with (2) and (3) is that we introduce a huge degree of complexity for very little benefit. Despite the original report in https://cloud-haskell.atlassian.net/browse/DPP-81, it is not difficult to turn a Process () expression into a closure without template haskell, and therefore we're jumping through hoops for hardly any reason at all.

But there's more: the way that StarterProcess works is fundamentally broken, because the supervisor has to interact with this other known process to spawn its children and get back a process id and monitor ref. If we do not own the code for that starter process then we've no way of ensuring we won't block indefinitely waiting for a reply. Even if we do, if the starter resides on a foreign node, network congestion could cause almost indefinite blocking (even in the presence of heartbeats), so this is a big no-no. We cannot have the supervisor process blocked waiting like that - spawning a child has to be asynchronous, and there needs to be a guarantee that monitoring has been set up correctly before the child proceeds to run, which requires a wrapper that we need to have written. We cannot leave this up to 3rd parties.

Because of this (above), we have to have a thunk that we can execute, which we can wrap in the relevant spawn/monitor code for safety. Since we cannot send Process a over the wire, only Closure (Process a) is possible - allowing for the CreateHandle instance obviously, which takes Closure (SupervisorPid -> Process (ChildPid, Message)) - and we cannot operate in terms of Process a. A concession would be to break ChildSpec up into those specs used to define the supervisor and those used for dynamic supervision, allowing Process () in a child spec used to boot the supervisor but not in subsequent calls to addChild etc. I really do not like this idea however, since not only does it create an imbalanced API, it could also prevent us sending supervisor setup data over the wire (since it could potentially contain Process () instead of Closure (Process ()) thunks), and that is a pretty huge disadvantage.

The final concession we might make, would be to separate out the supervisor runtime implementation into two parts, one that handles local children and another for remote. I still don't like this idea though, because we end up with a fractured API, but I will consider it if people shout asking for Process () to be supported as a ToChildStart instance again. The approach I would take to achieve this would involve (a) breaking up the server so that the child start handling could be modified, (b) segregating the client APIs and having two supervisor client handles, one for local only and another for remote only supervisors. The client handle for a local only supervisor would carry an STM.TChan used for sending unserialisable thunks to the server. This would require us to implement haskell-distributed/distributed-process-client-server#9 first.
