Move all of our git logic off the web server #1384
Comments
Just FYI, RocksDB is an embedded database. It's inside the Faktory binary, so you don't need to know or do anything with it. It's not something you install and manage. The only thing you haven't handled well is your third point: the infamous poison pill, a job which kills its process. If you retry immediately, you'll get a runaway busy loop. Typically you'll use a timeout or a resurrection process which doesn't run constantly. Sidekiq Pro runs an hourly check for "orphaned" jobs which never finished.
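To make the busy-loop concern concrete, here is a minimal sketch (my own illustration, not Sidekiq's or Faktory's actual formula) of the usual mitigation: each failed attempt increments a retry counter, and the job isn't eligible to run again until an exponentially growing, capped delay has elapsed.

```rust
use std::time::Duration;

// Hypothetical helper: compute how long to wait before the next attempt.
// Exponential growth keeps a poison-pill job from spinning, while still
// retrying transient failures quickly; the cap keeps the delay bounded.
fn next_retry_delay(retries: u32) -> Duration {
    let capped = retries.min(10); // cap the exponent so the delay tops out at ~17 min
    Duration::from_secs(2u64.pow(capped))
}

fn main() {
    println!("{}", next_retry_delay(0).as_secs()); // 1
    println!("{}", next_retry_delay(3).as_secs()); // 8
    println!("{}", next_retry_delay(20).as_secs()); // 1024 (capped)
}
```

Real systems usually also add jitter and a maximum retry count that moves the job to a dead set and pages someone.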
Yeah, I'm definitely making assumptions based on our current job load. There shouldn't be any job that actually results in the process being terminated. We rely on these jobs completing for consistency, so any job repeatedly failing to complete is definitely an event that should page whoever is on call.
Sounds good. Please update us with what you learn in six months: the good and bad that came from this decision. I'm really curious if this will turn out well ("Kept it simple, rock solid") or poorly ("death by edge cases and assumptions"). Most likely somewhere in between. 🤘
Thanks for taking the time to look at this ❤️
I think this is a great proposal! I had an initial concern about reporting success to the client before getting the bits pushed over to the index on GitHub, but there are so many advantages of a background job here and the drawbacks seem workable.

Pros

Cons
I think this is an area that needs a teensy bit more design thought. Currently, if you
posting this comment and then taking a look at the code so far...
Left a few comments on the prototype code; I'd love to see some tests around the queueing logic too as it becomes more real!
This has been deployed for several weeks. |
Note: I'm actively working on this, just documenting here to increase visibility and get outside feedback.
We have three endpoints that need to interact with git.
crates/new
,versions/yank
, andversions/unyank
. All three of these are slower than they should be (crate upload is slow for additional reasons which is outside the scope of this issue). Currently our behavior is:git fetch
andgit reset --hard
, then retryThis has a lot of problems. The most obviously pressing for me is that we are doing an operation that takes seconds, and blocking sending a response on it. There's a comment that talks about just spawning another thread for this, but that is very wrong and will cause major problems if we do that (what if it fails? what if the dyno is restarted before that thread completes?).
The second problem is that our retry behavior is limited to a very narrow scope. If any failure occurs other than failing to push, we abort. This is even more fragile, because we assume that our local checkout is up to date with the index. While this is almost always true (since the most common operation here is publishing a new crate, which shouldn't fail from being out of sync), it's not guaranteed, which leads to issues like #1296.
What we need here is a proper background queue. For publishing a new crate, yanking, and unyanking, we should immediately update the database, and push updating the index into a queue to be completed by another machine in the future.
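As an illustration of that split, here is a sketch (the names `IndexJob` and `background_jobs` are my assumptions, not crates.io's actual code) of how each git operation could become a serializable job row, inserted in the same transaction as the database update so a committed crate or yank always has a matching queued index update:

```rust
// Hypothetical job representation: one variant per endpoint that
// currently touches git on the web server.
#[derive(Debug, Clone, PartialEq)]
enum IndexJob {
    AddCrate { krate: String, vers: String },
    Yank { krate: String, vers: String },
    Unyank { krate: String, vers: String },
}

impl IndexJob {
    // The tag that would be stored in a `background_jobs.job_type` column,
    // used by the worker to decide which git operation to perform.
    fn job_type(&self) -> &'static str {
        match self {
            IndexJob::AddCrate { .. } => "add_crate",
            IndexJob::Yank { .. } => "yank",
            IndexJob::Unyank { .. } => "unyank",
        }
    }
}

fn main() {
    let job = IndexJob::Yank { krate: "rand".into(), vers: "0.4.2".into() };
    println!("{}", job.job_type()); // yank
}
```

Because the job row commits atomically with the crate/version row, a dyno restart between "respond to client" and "update index" can no longer lose the work; the worker just picks the row up later.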
I feel very strongly that we should keep PostgreSQL as our only external dependency for as long as we possibly can. I don't want to introduce RabbitMQ, RocksDB (e.g. Faktory), etc. as a dependency at this point.
For this reason, I've prototyped a very simplistic background queue that uses PG's row locks to track whether a job is ready, taken, or done. You can see the current state of the code here (note: it will not compile right now; it relies on two features of Diesel 1.3).
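For readers unfamiliar with the row-lock approach: the core of such a queue is a claiming query built on PostgreSQL's `FOR UPDATE SKIP LOCKED` (available since PG 9.5). This is a sketch of the general technique, with an assumed table name, not necessarily the prototype's exact query:

```rust
// Each worker runs this inside a transaction. `FOR UPDATE` locks the
// selected row; `SKIP LOCKED` makes other workers skip rows that are
// already locked instead of blocking, so N workers claim N different
// jobs. A row stays "taken" only while the claiming transaction is
// open, so a crashed worker's job becomes ready again automatically.
const CLAIM_JOB: &str = "
    SELECT id, job_type, data
    FROM background_jobs
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
";

fn main() {
    println!("{}", CLAIM_JOB.contains("FOR UPDATE SKIP LOCKED")); // true
}
```

On success the worker deletes the row and commits; on failure it rolls back (or commits an incremented retry counter), and the row becomes claimable again. No separate "taken" column or heartbeat is needed.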
The implementation I'm looking at assumes the following: failures that return `Err` (meaning in this implementation, the retries counter is not incremented) are things we don't care about, and it's fine to immediately try again.
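Those semantics can be sketched as follows. This is my own illustration of the described behavior, and the panic branch is an assumption about what the retries counter tracks (the original text only says `Err` does *not* increment it):

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical dispatch: run one job attempt and report whether it
// succeeded. `Err` leaves the retries counter untouched (retry
// immediately); a panic increments it, on the assumption that a
// crashing job is the case that needs backoff or a page.
fn run_job<F>(job: F, retries: &mut u32) -> bool
where
    F: FnOnce() -> Result<(), String>,
{
    match panic::catch_unwind(AssertUnwindSafe(job)) {
        Ok(Ok(())) => true,  // success: the job row would be deleted
        Ok(Err(_)) => false, // Err: row untouched, safe to retry immediately
        Err(_) => {
            // panic: bump the counter so repeated crashes are visible
            *retries += 1;
            false
        }
    }
}

fn main() {
    let mut retries = 0;
    assert!(run_job(|| Ok(()), &mut retries));
    assert!(!run_job(|| Err("push failed".to_string()), &mut retries));
    assert_eq!(retries, 0); // Err did not increment the counter
    assert!(!run_job(|| panic!("boom"), &mut retries));
    assert_eq!(retries, 1);
}
```

Note this is exactly the spot the poison-pill comment above is warning about: with immediate retry and no backoff, a job that always fails will spin until someone is paged.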