Description
Note: I'm actively working on this, just documenting here to increase visibility and get outside feedback.
We have three endpoints that need to interact with git: `crates/new`, `versions/yank`, and `versions/unyank`. All three of these are slower than they should be (crate upload is slow for additional reasons, which are outside the scope of this issue). Currently our behavior is as follows (a rough sketch of the loop follows the list):
- Do our git operation
- Attempt to push
- If pushing fails, `git fetch` and `git reset --hard`, then retry
- Retry at most 20 times
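For concreteness, here's roughly what that loop amounts to. This is only a sketch using the git CLI via `std::process::Command`; the repository path, branch name, and the elided commit step are placeholders, not the actual implementation.

```rust
use std::process::Command;

/// Sketch of the current inline behavior: commit, try to push, and on failure
/// fetch + hard reset, then try again, giving up after 20 attempts.
fn commit_and_push(repo_path: &str) -> Result<(), String> {
    for _ in 0..20 {
        // Do our git operation: write the index entry and commit it (elided).

        // Attempt to push.
        let pushed = Command::new("git")
            .args(&["push", "origin", "master"]) // branch name assumed
            .current_dir(repo_path)
            .status()
            .map_err(|e| e.to_string())?;
        if pushed.success() {
            return Ok(());
        }

        // Push was rejected (someone else pushed first): fetch and hard reset
        // our checkout, then go around the loop again.
        Command::new("git")
            .args(&["fetch", "origin"])
            .current_dir(repo_path)
            .status()
            .map_err(|e| e.to_string())?;
        Command::new("git")
            .args(&["reset", "--hard", "origin/master"])
            .current_dir(repo_path)
            .status()
            .map_err(|e| e.to_string())?;
    }
    Err("failed to push after 20 attempts".into())
}
```

All of this happens on the web dyno, while the client waits for a response.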
This has a lot of problems. The most obviously pressing one for me is that we are doing an operation that takes seconds and blocking the response on it. There's a comment that suggests just spawning another thread for this, but that would be very wrong and would cause major problems (what if it fails? What if the dyno is restarted before that thread completes?).
The second problem is that our retry behavior is limited to a very narrow scope. If any failure occurs other than failing to push, we abort. This is even more fragile, because we assume that our local checkout is up to date with the index. While this is almost always true (since the most common operation here is publishing a new crate, which shouldn't fail from being out of sync), it's not guaranteed, which leads to issues like #1296.
What we need here is a proper background queue. For publishing a new crate, yanking, and unyanking, we should immediately update the database, and push updating the index into a queue to be completed by another machine in the future.
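As a rough illustration of what that flow could look like (the table, the column names, and the `enqueue_index_update` function here are hypothetical, not the schema from my prototype):

```rust
#[macro_use]
extern crate diesel;

use diesel::prelude::*;

// Hypothetical queue table; in a real migration `id` would be a BIGSERIAL and
// `retries` would default to 0.
table! {
    background_jobs (id) {
        id -> BigInt,
        job_type -> Text,
        data -> Text,
        retries -> Integer,
    }
}

/// Sketch: do the database work for a publish and enqueue the index update in
/// the same transaction, so the job only becomes visible if the publish commits.
fn enqueue_index_update(conn: &PgConnection, crate_json: &str) -> QueryResult<()> {
    use self::background_jobs::dsl::*;

    conn.transaction(|| {
        // 1. Normal database work for the publish/yank/unyank goes here (elided).

        // 2. Enqueue the index update for the background worker to pick up.
        diesel::insert_into(background_jobs)
            .values((job_type.eq("add_crate"), data.eq(crate_json)))
            .execute(conn)?;
        Ok(())
    })
}
```

The response can then be sent as soon as the transaction commits; the index catches up when the worker runs the job.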
I feel very strongly that we should keep PostgreSQL as our only external dependency for as long as we possibly can. I don't want to introduce RabbitMQ, RocksDB (e.g. Faktory), etc. as a dependency at this point.
For this reason, I've prototyped a very simplistic background queue that uses PG's row locks to track whether a job is ready, taken, or done. You can see the current state of the code here (note: it will not compile right now, as it relies on two features of Diesel 1.3).
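To show what I mean by using row locks for this, here's a minimal sketch of the claiming side, reusing the hypothetical `background_jobs` table from the sketch above (again, not the prototype code, just the shape of the idea): a job is "ready" if its row exists and is unlocked, "taken" while a worker's open transaction holds the row lock, and "done" once the row is deleted.

```rust
use diesel::prelude::*;

#[derive(Queryable)]
struct Job {
    id: i64,
    job_type: String,
    data: String,
    retries: i32,
}

/// Sketch: claim one job, run it, and delete it, all inside a single transaction.
fn run_next_job(conn: &PgConnection) -> QueryResult<()> {
    use self::background_jobs::dsl::*;

    conn.transaction(|| {
        // `FOR UPDATE SKIP LOCKED` claims an unlocked row without blocking on
        // rows that another transaction currently holds.
        let job = background_jobs
            .order(id)
            .for_update()
            .skip_locked()
            .first::<Job>(conn)
            .optional()?;

        let job = match job {
            Some(job) => job,
            None => return Ok(()), // nothing ready
        };

        // Do the actual git work. Returning Err rolls the transaction back,
        // which releases the lock and leaves the job in place to be retried.
        perform(&job)?;

        // Success: the job is done, so remove the row.
        diesel::delete(background_jobs.find(job.id)).execute(conn)?;
        Ok(())
    })
}

fn perform(_job: &Job) -> QueryResult<()> {
    // Placeholder for the real work (add crate / yank / unyank in the index).
    Ok(())
}
```

If the worker crashes mid-job, the lock is dropped along with the connection and the job becomes ready again, which is exactly why the jobs need to be idempotent.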
The implementation I'm looking at assumes the following:
- All background jobs will be idempotent. To put it another way, they will work under "at least once" retry semantics.
- Every job should eventually complete, given sufficient retries
  - This implies that a job continuing to fail for long enough should page us. At best the failure is because GitHub is down (which means we should update our status page), and at worst it means a mismatch between our database and the index which needs manual resolution.
- Failures which don't result in the transaction returning `Err` (meaning, in this implementation, that the retries counter is not incremented) are things we don't care about, and it's fine to immediately try again (see the sketch after this list).
  - This would be caused by panics from OOM, the dyno restarting, or a hardware failure
- We will only have one background worker for git related tasks
  - This isn't really ever actually assumed in the code, but it makes no sense for us to have more than one worker for this. If we have two they'll just constantly conflict with each other, and actually slow things down.
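To make the retry bookkeeping concrete, here's a rough sketch of the failure path, again against the hypothetical `background_jobs` table from above; the threshold and the alerting hook are made up. The retries counter only gets bumped when a job's transaction returns `Err`; a panic or a dyno restart just drops the row lock, and the job gets picked up again unchanged.

```rust
use diesel::prelude::*;

// Made-up threshold; in practice this would be tuned to whatever "failing for
// long enough to page us" means.
const MAX_RETRIES_BEFORE_PAGING: i32 = 10;

/// Sketch: record that a job's transaction returned Err, and page someone if
/// it has now failed too many times in a row.
fn record_failure(conn: &PgConnection, job_id: i64) -> QueryResult<()> {
    use self::background_jobs::dsl::*;

    // This runs in its own transaction, after the failed job transaction has
    // rolled back and released its row lock.
    let new_retries: i32 = diesel::update(background_jobs.find(job_id))
        .set(retries.eq(retries + 1))
        .returning(retries)
        .get_result(conn)?;

    if new_retries >= MAX_RETRIES_BEFORE_PAGING {
        // Hypothetical alerting hook: repeated failures mean either GitHub is
        // down or the index and the database have diverged and need a human.
        page_on_call(job_id, new_retries);
    }
    Ok(())
}

fn page_on_call(_job_id: i64, _retries: i32) {
    // Placeholder for whatever alerting integration we end up using.
}
```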