Skip to content

Do unconditional redirects for downloads when the db is broken #3564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 27, 2021

Conversation

pietroalbini
Copy link
Member

@pietroalbini pietroalbini commented Apr 27, 2021

We had a few database-related outages in the recent weeks, which impacted people downloading crates with Cargo. Download requests are the majority of the traffic we serve, and they're also the most critical ones. Even if the rest of crates.io stops working downloads must continue to be operational to reduce the impact on our users.

As a preface, the downloads endpoint used the database for two reasons: counting the downloads (which can be skipped during outages) and ensuring the crate name is canonicalized. For example, if someone tries to Foo-bar 1.0.0 by calling /api/v1/crates/foo_Bar/1.0.0/download the crates.io application canonicalizes the name and redirects the user to https://static.crates.io/crates/Foo-bar/Foo-bar-1.0.0.crate. If we were not to perform canonicalization the user would be redirected to https://static.crates.io/crates/foo_Bar/foo_Bar-1.0.0.crate, which would result in a 404.

While analyzing the outages and after investigating the traffic patterns of multiple weeks, we identified that the vast majority of downloads (including all downloads from Cargo) already send the canonical name to crates.io, and only a couple of third-party tools send crate names that are not canonicalized.

This PR changes the downloads endpoint to do unconditional redirects without canonicalizing the crate names during a full database outage. During normal operations the canonicalization will still be performed. The change will allow Cargo builds to continue functioning even without a database, and the really small percentage of requests will get a 404 instead of a 500, which in my view is an acceptable tradeoff.

This PR also adds two metrics:

  • cratesio_instance_downloads_non_canonical_crate_name_total: how many download requests we received with a non-canonical crate name. This will allow us to revisit whether the tradeoff is still worth, and it's way easier to query than the previous method we used to extract the data.
  • cratesio_instance_downloads_unconditional_redirects_total: how many unconditional redirects we performed. We'll want to setup an alert that pages us as soon as the metric is > 0.

Finally, this PR implements full tests for the changes introduced here (tests are actually the bulk of the diff). The test uses the "real" database pool instead of the dummy one used for the rest of the tests, acting on a fresh schema. Creating the fresh schema takes a second or two, so we're only doing it in this test to avoid slowing down the test suite. Also, instead of connecting directly to PostgreSQL the database the pool connects through ChaosProxy, a simple proxy implemented in our codebase that allows to break or restore the connection at will.

Part of #3541
r? @jtgeibel

Some tests need access to a "real" production database pool, but that
means we can't rely on begin_test_transaction(), as that only works
inside a single connection.

This commit adds support creating a brand new schema for each test, and
adds a way for tests to opt into that. This is not enabled for all
tests, as creating a new schema and running migrations on it takes
around 1.5 seconds on my computer.
@jtgeibel
Copy link
Member

🥳 Looks great to me! The only "issue" I ran into is that I finally had to set up my database user with password auth. (Until now, I've just used user auth with TEST_DATABASE_URL=postgres:///cargo_registry_test.)

Local testing of stopping postgres works as expected. r2d2 appears to output the following periodically while the database is offline, and quickly backs off to attempting every 15 seconds, so it does not appear this will grow our logs substantially during an outage.

Apr 27 18:59:52.721 ERROR r2d2: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?

@bors r+

@bors
Copy link
Contributor

bors commented Apr 27, 2021

📌 Commit 4739d30 has been approved by jtgeibel

@bors
Copy link
Contributor

bors commented Apr 27, 2021

⌛ Testing commit 4739d30 with merge 767b358...

@bors
Copy link
Contributor

bors commented Apr 27, 2021

☀️ Test successful - checks-actions
Approved by: jtgeibel
Pushing 767b358 to master...

@bors bors merged commit 767b358 into rust-lang:master Apr 27, 2021
@pietroalbini pietroalbini deleted the yolo branch April 28, 2021 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants