Move all of our git logic off the web server #1384

Closed

sgrif opened this issue May 3, 2018 · 8 comments
Labels
A-backend ⚙️ A-infrastructure 📡 C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear

Comments

@sgrif
Contributor

sgrif commented May 3, 2018

Note: I'm actively working on this, just documenting here to increase visibility and get outside feedback.

We have three endpoints that need to interact with git: crates/new, versions/yank, and versions/unyank. All three of these are slower than they should be (crate upload is slow for additional reasons, which are outside the scope of this issue). Currently our behavior is (see the sketch after this list):

  • Do our git operation
  • Attempt to push
  • If pushing fails, git fetch and git reset --hard, then retry
  • Retry at most 20 times
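
A rough sketch of that flow, shelling out to the git CLI for clarity (this isn't the actual implementation; the branch name, the checkout path, and the do_git_operation closure are purely illustrative):

```rust
use std::path::Path;
use std::process::Command;

/// Retry the whole operation up to 20 times: redo the change, commit, try to
/// push, and on a rejected push re-sync the checkout with a fetch + hard reset.
fn commit_and_push<F>(checkout: &Path, msg: &str, do_git_operation: F) -> Result<(), String>
where
    F: Fn(&Path) -> Result<(), String>,
{
    const MAX_RETRIES: usize = 20;

    for _ in 0..MAX_RETRIES {
        // Do our git operation (re-apply the index change to the working tree).
        do_git_operation(checkout)?;
        git(checkout, &["add", "."])?;
        git(checkout, &["commit", "-m", msg])?;

        // Attempt to push.
        if git(checkout, &["push", "origin", "master"]).is_ok() {
            return Ok(());
        }

        // If pushing fails, fetch and reset --hard onto the remote, then retry.
        git(checkout, &["fetch", "origin"])?;
        git(checkout, &["reset", "--hard", "origin/master"])?;
    }

    Err(format!("failed to push after {} attempts", MAX_RETRIES))
}

fn git(dir: &Path, args: &[&str]) -> Result<(), String> {
    let status = Command::new("git")
        .current_dir(dir)
        .args(args)
        .status()
        .map_err(|e| e.to_string())?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("git {:?} failed with {}", args, status))
    }
}
```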

This has a lot of problems. The most pressing for me is that we are doing an operation that takes seconds and blocking the response on it. There's a comment that suggests just spawning another thread for this, but that would be very wrong and would cause major problems (what if it fails? What if the dyno is restarted before that thread completes?).

The second problem is that our retry behavior is limited to a very narrow scope. If any failure occurs other than failing to push, we abort. This is even more fragile, because we assume that our local checkout is up to date with the index. While this is almost always true (since the most common operation here is publishing a new crate, which shouldn't fail from being out of sync), it's not guaranteed, which leads to issues like #1296.

What we need here is a proper background queue. For publishing a new crate, yanking, and unyanking, we should update the database immediately and push the index update onto a queue to be completed by another machine in the future.

I feel very strongly that we should keep PostgreSQL as our only external dependency for as long as we possibly can. I don't want to introduce RabbitMQ, RocksDB (e.g. Faktory), etc. as a dependency at this point.

For this reason, I've prototyped a very simplistic background queue that uses PG's row locks to track whether a job is ready, taken, or done. You can see the current state of the code here (note: it will not compile right now, since it relies on two features of Diesel 1.3).
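
As a rough illustration of the approach (not the prototype itself), one pass of such a worker could look something like this; the background_jobs table, its columns, and the perform callback are assumptions, and it uses Diesel 1.x-style connection handling:

```rust
#[macro_use]
extern crate diesel;

use diesel::pg::PgConnection;
use diesel::prelude::*;

// Assumed schema for illustration; the real prototype's table and columns may differ.
table! {
    background_jobs (id) {
        id -> BigInt,
        job_type -> Text,
        data -> Text,
        retries -> Integer,
    }
}

/// One pass of a worker. "Ready" is an unlocked row, "taken" is a row locked
/// by an open transaction, and "done" is a deleted row.
fn run_one_job<F>(conn: &PgConnection, perform: F) -> QueryResult<()>
where
    F: FnOnce(&str, &str) -> Result<(), Box<dyn std::error::Error>>,
{
    use self::background_jobs::dsl::*;

    conn.transaction(|| {
        // Lock the next available row; SKIP LOCKED means rows already claimed
        // by another worker are skipped rather than waited on.
        let job = background_jobs
            .select((id, job_type, data))
            .order(id)
            .for_update()
            .skip_locked()
            .first::<(i64, String, String)>(conn)
            .optional()?;

        let (job_id, kind, payload) = match job {
            Some(job) => job,
            None => return Ok(()), // nothing queued right now
        };

        if perform(&kind, &payload).is_ok() {
            // Success: the job is done, so remove it from the queue.
            diesel::delete(background_jobs.find(job_id)).execute(conn)?;
        } else {
            // Failure: keep the row and bump its retries counter.
            diesel::update(background_jobs.find(job_id))
                .set(retries.eq(retries + 1))
                .execute(conn)?;
        }

        Ok(())
    })
}
```

Because the row stays locked only until the transaction commits or aborts, a worker that panics or gets restarted mid-job releases the lock without touching the retries counter, which matches the "fine to immediately try again" case in the list below.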

The implementation I'm looking at assumes the following (a small sketch of the resulting retry policy follows this list):

  • All background jobs will be idempotent. To put it another way, they will work under "at least once" retry semantics.
  • Every job should eventually complete, given sufficient retries
    • This implies that a job continuing to fail for long enough should page us. At best the failure is because GitHub is down (which means we should update our status page), and at worst it means a mismatch between our database and the index which needs manual resolution.
  • Failures which don't result in the transaction returning Err (meaning in this implementation, the retries counter is not incremented) are things we don't care about, and it's fine to immediately try again.
    • This would be caused by panics from OOM, the dyno restarting, or a hardware failure
  • We will only have one background worker for git-related tasks
    • This isn't actually assumed anywhere in the code, but it makes no sense for us to have more than one worker for this. If we have two, they'll just constantly conflict with each other and actually slow things down.
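
To make those assumptions concrete, here's a sketch of the retry policy they imply for a single job; keep_trying, notify_on_call, the paging threshold, and the delay between attempts are all hypothetical:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Illustrative only: keep retrying a job until it succeeds, and page once it
/// has failed too many times in a row.
fn keep_trying<F>(mut perform: F, page_after: u32)
where
    F: FnMut() -> Result<(), String>,
{
    let mut retries = 0;
    loop {
        match perform() {
            // Idempotent jobs under "at least once" semantics: re-running an
            // already-applied job is harmless, so success simply ends the loop.
            Ok(()) => return,
            Err(err) => {
                retries += 1;
                if retries >= page_after {
                    // At best GitHub is down; at worst the database and the
                    // index disagree and need manual resolution. Either way,
                    // someone should be paged.
                    notify_on_call(&err);
                }
                // Made-up delay; the issue doesn't specify retry timing.
                sleep(Duration::from_secs(30));
            }
        }
    }
}

fn notify_on_call(err: &str) {
    // Hypothetical stand-in for whatever actually pages whoever is on call.
    eprintln!("paging on-call: background job is still failing: {}", err);
}
```
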
@sgrif sgrif added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-infrastructure 📡 C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear E-hard A-backend ⚙️ labels May 3, 2018
@sgrif sgrif self-assigned this May 3, 2018
@mperham

mperham commented May 3, 2018

Just FYI, RocksDB is an embedded database. It's inside the Faktory binary so you don't need to know or do anything with it. It's not something you install and manage.

The only thing you haven't handled well is your third point: the infamous poison pill, a job which kills its process. If you retry immediately, you'll get a runaway busy loop. Typically you'll use a timeout or resurrection process which doesn't run constantly. Sidekiq Pro runs an hourly check for "orphaned" jobs which never finished.
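
For reference, a periodic rescue pass of the kind described here might look roughly like the following; the locked_by/locked_at columns and the one-hour cutoff are hypothetical and aren't part of the row-lock design above (there, a crashed worker's lock is released by Postgres automatically):

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

/// A generic "resurrection" pass, meant to run from a scheduler (e.g. hourly)
/// rather than in the hot path: clear stale claims so the jobs become
/// available again.
fn rescue_orphaned_jobs(conn: &PgConnection) -> QueryResult<usize> {
    diesel::sql_query(
        "UPDATE background_jobs \
         SET locked_by = NULL, locked_at = NULL \
         WHERE locked_at < now() - interval '1 hour'",
    )
    .execute(conn)
}
```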

@sgrif
Contributor Author

sgrif commented May 4, 2018

Yeah, I'm definitely making assumptions based on our current job load. There shouldn't be any job that actually results in the process being terminated. We rely on these jobs completing for consistency, so any job repeatedly failing to complete is definitely an event that should page whoever is on call.

@mperham

mperham commented May 4, 2018

Sounds good. Please update us with what you learn in six months: the good and bad that came from this decision. I'm really curious if this will turn out well ("Kept it simple, rock solid") or poorly ("death by edge cases and assumptions"). Most likely somewhere in between. 🤘

@sgrif
Contributor Author

sgrif commented May 4, 2018

Thanks for taking the time to look at this ❤️

@jtgeibel
Member

I think this is a great proposal! I had an initial concern about reporting success to the client before getting the bits pushed over to the index on Github, but there are so many advantages of a background job here and the drawbacks seem workable.

Pros

  • It should eliminate a lot of the git gymnastics you mention.
  • Scaling of web dynos is independent of the git operations (which would perform worse with more workers running).
  • We can accept and queue updates even if Github is down.
  • If Github is down when a web dyno restarts, the dyno won't block/fail on cloning the index.

Cons

  • This would report success after the crate upload is verified and sent to S3, but before the authoritative source of info is updated. Given that this should simplify the logic and move it into a single process, the worst case condition requiring manual intervention should be rare.
  • Heroku may still very briefly run two instances when refreshing dynos; I haven't looked at exactly how they handle non-web instances there. It sounds like you have a plan with row locking that probably covers this even if Heroku doesn't make any guarantees here.

@carols10cents
Member

reporting success to the client before getting the bits pushed over to the index on Github

I think this is an area that needs a teensy bit more design thought. Currently, if you cargo publish and that exits successfully, the user expectation is that you can immediately download that version as a dependency. Putting git operations in a background queue breaks that expectation, so I think we need to make these additional changes:

  • At minimum: A successful cargo publish should print a message that your crate has been successfully queued for publishing
  • At minimum: There should be a page in the crates.io UI where you can see whether publishes for crates you own have completed yet or not (or whether a publish failed and, if it has, what you should do about it: should the user try publishing again, or email us, for example?)
  • Optionally: Have a cargo command that lets a user poll for the status of the publish (for people who want to do everything from the command line)
  • Optionally: Send an email when the publish is complete (if there is an email on file; we don't yet require an email for publishing, but we probably should soon anyway. If we email all owners on publish, this also adds a tiny bit of security against malicious, unexpected publishes)

posting this comment and then taking a look at the code so far...

@carols10cents
Member

Left a few comments on the prototype code, I'd love to see some tests around the queueing logic too as it becomes more real!

@sgrif
Copy link
Contributor Author

sgrif commented Mar 18, 2019

This has been deployed for several weeks.

@sgrif sgrif closed this as completed Mar 18, 2019
@Turbo87 Turbo87 removed the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Mar 11, 2021