Skip to content

Ensure downloads work without a database connection #3541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
pietroalbini opened this issue Apr 22, 2021 · 0 comments · Fixed by #3572
Closed
3 tasks done

Ensure downloads work without a database connection #3541

pietroalbini opened this issue Apr 22, 2021 · 0 comments · Fixed by #3572
Assignees
Labels
C-tracking-issue Category: A tracking issue for an RFC, an unstable feature, or an issue made of many parts

Comments

@pietroalbini
Copy link
Member

pietroalbini commented Apr 22, 2021

We've had a couple database-related outages in the past few weeks, all of which impacted the download endpoint of crates.io. That endpoint is the most important one in the application, as any error returned by it results in end-user or CI builds failing. To improve crates.io's resilience we need to make sure the endpoint works even without a database connection.

The two uses of a connection during a download are counting the downloads and canonicalizing the crate name, to ensure the user is redirect to the CDN path with the right capitalization and hyphenation. Cargo always provides the canonical name to the API though, so only during outages we can do an unconditional redirects without affecting Cargo downloads.

Third-party clients that don't send the canonical name might get redirected to an invalid CDN path, but 0.00039% of the download requests we served in the past 48 hours would receive a wrong response (and none of those are Cargo builds). That's better than having downloads be down for everyone until the outage is fixed.

Needed steps:

@pietroalbini pietroalbini added the C-tracking-issue Category: A tracking issue for an RFC, an unstable feature, or an issue made of many parts label Apr 22, 2021
@pietroalbini pietroalbini self-assigned this Apr 22, 2021
bors added a commit that referenced this issue Apr 26, 2021
Create `crates-admin migrate` to sync categories and migrate the db

The new command will be used in production to run migrations instead of installing the Diesel CLI and running migrations through it. In addition to that, synchronizing categories has been moved to the `crates-admin migrate` command instead of being executed every time the application starts.

The main advantage of this is, booting the server will not *require* a fully working database connection anymore. A connection is still needed, but this removes all the blockers for removing that limit.

r? `@jtgeibel`
Part of #3541
bors added a commit that referenced this issue Apr 27, 2021
Do unconditional redirects for downloads when the db is broken

We had a few database-related outages in the recent weeks, which impacted people downloading crates with Cargo. Download requests are the majority of the traffic we serve, and they're also the most critical ones. Even if the rest of crates.io stops working downloads *must* continue to be operational to reduce the impact on our users.

As a preface, the downloads endpoint used the database for two reasons: counting the downloads (which can be skipped during outages) and ensuring the crate name is canonicalized. For example, if someone tries to  `Foo-bar 1.0.0` by calling `/api/v1/crates/foo_Bar/1.0.0/download` the crates.io application canonicalizes the name and redirects the user to `https://static.crates.io/crates/Foo-bar/Foo-bar-1.0.0.crate`. If we were not to perform canonicalization the user would be redirected to `https://static.crates.io/crates/foo_Bar/foo_Bar-1.0.0.crate`, which would result in a 404.

While analyzing the outages and after investigating the traffic patterns of multiple weeks, we identified that the vast majority of downloads (including *all* downloads from Cargo) already send the canonical name to crates.io, and only a couple of third-party tools send crate names that are not canonicalized.

This PR changes the downloads endpoint to do unconditional redirects without canonicalizing the crate names **during a full database outage**. During normal operations the canonicalization will still be performed. The change will allow Cargo builds to continue functioning even without a database, and the really small percentage of requests will get a 404 instead of a 500, which in my view is an acceptable tradeoff.

This PR also adds two metrics:

* `cratesio_instance_downloads_non_canonical_crate_name_total`: how many download requests we received with a non-canonical crate name. This will allow us to revisit whether the tradeoff is still worth, and it's way easier to query than the previous method we used to extract the data.
* `cratesio_instance_downloads_unconditional_redirects_total`: how many unconditional redirects we performed. We'll want to setup an alert that pages us as soon as the metric is `> 0`.

Finally, this PR implements full tests for the changes introduced here (tests are actually the bulk of the diff). The test uses the "real" database pool instead of the dummy one used for the rest of the tests, acting on a fresh schema. Creating the fresh schema takes a second or two, so we're only doing it in this test to avoid slowing down the test suite. Also, instead of connecting directly to PostgreSQL the database the pool connects through `ChaosProxy`, a simple proxy implemented in our codebase that allows to break or restore the connection at will.

Part of #3541
r? `@jtgeibel`
@bors bors closed this as completed in 509d7a2 May 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-tracking-issue Category: A tracking issue for an RFC, an unstable feature, or an issue made of many parts
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant