Skip to content

Commit 767b358

Browse files
committed
Auto merge of #3564 - pietroalbini:yolo, r=jtgeibel
Do unconditional redirects for downloads when the db is broken We had a few database-related outages in the recent weeks, which impacted people downloading crates with Cargo. Download requests are the majority of the traffic we serve, and they're also the most critical ones. Even if the rest of crates.io stops working downloads *must* continue to be operational to reduce the impact on our users. As a preface, the downloads endpoint used the database for two reasons: counting the downloads (which can be skipped during outages) and ensuring the crate name is canonicalized. For example, if someone tries to `Foo-bar 1.0.0` by calling `/api/v1/crates/foo_Bar/1.0.0/download` the crates.io application canonicalizes the name and redirects the user to `https://static.crates.io/crates/Foo-bar/Foo-bar-1.0.0.crate`. If we were not to perform canonicalization the user would be redirected to `https://static.crates.io/crates/foo_Bar/foo_Bar-1.0.0.crate`, which would result in a 404. While analyzing the outages and after investigating the traffic patterns of multiple weeks, we identified that the vast majority of downloads (including *all* downloads from Cargo) already send the canonical name to crates.io, and only a couple of third-party tools send crate names that are not canonicalized. This PR changes the downloads endpoint to do unconditional redirects without canonicalizing the crate names **during a full database outage**. During normal operations the canonicalization will still be performed. The change will allow Cargo builds to continue functioning even without a database, and the really small percentage of requests will get a 404 instead of a 500, which in my view is an acceptable tradeoff. This PR also adds two metrics: * `cratesio_instance_downloads_non_canonical_crate_name_total`: how many download requests we received with a non-canonical crate name. This will allow us to revisit whether the tradeoff is still worth, and it's way easier to query than the previous method we used to extract the data. * `cratesio_instance_downloads_unconditional_redirects_total`: how many unconditional redirects we performed. We'll want to setup an alert that pages us as soon as the metric is `> 0`. Finally, this PR implements full tests for the changes introduced here (tests are actually the bulk of the diff). The test uses the "real" database pool instead of the dummy one used for the rest of the tests, acting on a fresh schema. Creating the fresh schema takes a second or two, so we're only doing it in this test to avoid slowing down the test suite. Also, instead of connecting directly to PostgreSQL the database the pool connects through `ChaosProxy`, a simple proxy implemented in our codebase that allows to break or restore the connection at will. Part of #3541 r? `@jtgeibel`
2 parents 7c548f4 + 4739d30 commit 767b358

File tree

15 files changed

+497
-147
lines changed

15 files changed

+497
-147
lines changed

Cargo.lock

Lines changed: 12 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -81,7 +81,7 @@ sha2 = "0.9"
8181
swirl = { git = "https://github.com/sgrif/swirl.git", rev = "e87cf37" }
8282
tar = "0.4.16"
8383
tempfile = "3"
84-
tokio = { version = "1", features = ["net", "signal", "io-std", "io-util", "rt-multi-thread"]}
84+
tokio = { version = "1.5.0", features = ["net", "signal", "io-std", "io-util", "rt-multi-thread", "macros"]}
8585
toml = "0.5"
8686
tracing = "0.1"
8787
tracing-subscriber = "0.2"
@@ -92,7 +92,7 @@ claim = "0.5"
9292
conduit-test = "0.9.0-alpha.4"
9393
hyper-tls = "0.5"
9494
lazy_static = "1.0"
95-
tokio = "1"
95+
tokio = "1.5.0"
9696
tower-service = "0.3.0"
9797

9898
[build-dependencies]

src/app.rs

Lines changed: 34 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
//! Application-wide components in a struct accessible from each request
22
3-
use crate::{db, Config, Env};
3+
use crate::db::{ConnectionConfig, DieselPool};
4+
use crate::{Config, Env};
45
use std::{sync::Arc, time::Duration};
56

67
use crate::downloads_counter::DownloadsCounter;
@@ -18,10 +19,10 @@ use scheduled_thread_pool::ScheduledThreadPool;
1819
#[allow(missing_debug_implementations)]
1920
pub struct App {
2021
/// The primary database connection pool
21-
pub primary_database: db::DieselPool,
22+
pub primary_database: DieselPool,
2223

2324
/// The read-only replica database connection pool
24-
pub read_only_replica_database: Option<db::DieselPool>,
25+
pub read_only_replica_database: Option<DieselPool>,
2526

2627
/// GitHub API client
2728
pub github: GitHubClient,
@@ -103,38 +104,45 @@ impl App {
103104
_ => 30,
104105
};
105106

106-
let primary_db_connection_config = db::ConnectionConfig {
107-
statement_timeout: db_connection_timeout,
108-
read_only: config.db_primary_config.read_only_mode,
109-
};
110-
111107
let thread_pool = Arc::new(ScheduledThreadPool::new(db_helper_threads));
112108

113-
let primary_db_config = r2d2::Pool::builder()
114-
.max_size(db_pool_size)
115-
.min_idle(db_min_idle)
116-
.connection_timeout(Duration::from_secs(db_connection_timeout))
117-
.connection_customizer(Box::new(primary_db_connection_config))
118-
.thread_pool(thread_pool.clone());
119-
120-
let primary_database =
121-
db::diesel_pool(&config.db_primary_config.url, config.env, primary_db_config);
122-
123-
let replica_database = if let Some(url) = config.db_replica_config.as_ref().map(|c| &c.url)
124-
{
125-
let replica_db_connection_config = db::ConnectionConfig {
109+
let primary_database = if config.use_test_database_pool {
110+
DieselPool::new_test(&config.db_primary_config.url)
111+
} else {
112+
let primary_db_connection_config = ConnectionConfig {
126113
statement_timeout: db_connection_timeout,
127-
read_only: true,
114+
read_only: config.db_primary_config.read_only_mode,
128115
};
129116

130-
let replica_db_config = r2d2::Pool::builder()
117+
let primary_db_config = r2d2::Pool::builder()
131118
.max_size(db_pool_size)
132119
.min_idle(db_min_idle)
133120
.connection_timeout(Duration::from_secs(db_connection_timeout))
134-
.connection_customizer(Box::new(replica_db_connection_config))
135-
.thread_pool(thread_pool);
121+
.connection_customizer(Box::new(primary_db_connection_config))
122+
.thread_pool(thread_pool.clone());
123+
124+
DieselPool::new(&config.db_primary_config.url, primary_db_config)
125+
};
136126

137-
Some(db::diesel_pool(&url, config.env, replica_db_config))
127+
let replica_database = if let Some(url) = config.db_replica_config.as_ref().map(|c| &c.url)
128+
{
129+
if config.use_test_database_pool {
130+
Some(DieselPool::new_test(url))
131+
} else {
132+
let replica_db_connection_config = ConnectionConfig {
133+
statement_timeout: db_connection_timeout,
134+
read_only: true,
135+
};
136+
137+
let replica_db_config = r2d2::Pool::builder()
138+
.max_size(db_pool_size)
139+
.min_idle(db_min_idle)
140+
.connection_timeout(Duration::from_secs(db_connection_timeout))
141+
.connection_customizer(Box::new(replica_db_connection_config))
142+
.thread_pool(thread_pool);
143+
144+
Some(DieselPool::new(&url, replica_db_config))
145+
}
138146
} else {
139147
None
140148
};

src/background_jobs.rs

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,10 +2,9 @@ use reqwest::blocking::Client;
22
use std::panic::AssertUnwindSafe;
33
use std::sync::{Arc, Mutex, MutexGuard, PoisonError};
44

5-
use diesel::r2d2::PoolError;
65
use swirl::PerformError;
76

8-
use crate::db::{DieselPool, DieselPooledConn};
7+
use crate::db::{DieselPool, DieselPooledConn, PoolError};
98
use crate::git::Repository;
109
use crate::uploaders::Uploader;
1110

src/config.rs

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ pub struct Config {
2222
pub downloads_persist_interval_ms: usize,
2323
pub ownership_invitations_expiration_days: u64,
2424
pub metrics_authorization_token: Option<String>,
25+
pub use_test_database_pool: bool,
2526
}
2627

2728
#[derive(Debug)]
@@ -211,6 +212,7 @@ impl Default for Config {
211212
.unwrap_or(60_000), // 1 minute
212213
ownership_invitations_expiration_days: 30,
213214
metrics_authorization_token: dotenv::var("METRICS_AUTHORIZATION_TOKEN").ok(),
215+
use_test_database_pool: false,
214216
}
215217
}
216218
}

src/controllers/version/downloads.rs

Lines changed: 62 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -2,53 +2,84 @@
22
//!
33
//! Crate level functionality is located in `krate::downloads`.
44
5+
use super::{extract_crate_name_and_semver, version_and_crate};
56
use crate::controllers::prelude::*;
6-
7-
use chrono::{Duration, NaiveDate, Utc};
8-
7+
use crate::db::PoolError;
98
use crate::models::{Crate, VersionDownload};
109
use crate::schema::*;
1110
use crate::views::EncodableVersionDownload;
12-
13-
use super::{extract_crate_name_and_semver, version_and_crate};
11+
use chrono::{Duration, NaiveDate, Utc};
1412

1513
/// Handles the `GET /crates/:crate_id/:version/download` route.
1614
/// This returns a URL to the location where the crate is stored.
1715
pub fn download(req: &mut dyn RequestExt) -> EndpointResult {
16+
let app = req.app().clone();
1817
let recorder = req.timing_recorder();
1918

20-
let crate_name = &req.params()["crate_id"];
21-
let version = &req.params()["version"];
22-
23-
let (version_id, canonical_crate_name): (_, String) = {
24-
use self::versions::dsl::*;
25-
26-
let conn = recorder.record("get_conn", || req.db_conn())?;
27-
28-
// Returns the crate name as stored in the database, or an error if we could
29-
// not load the version ID from the database.
30-
recorder.record("get_version", || {
31-
versions
32-
.inner_join(crates::table)
33-
.select((id, crates::name))
34-
.filter(Crate::with_name(crate_name))
35-
.filter(num.eq(version))
36-
.first(&*conn)
37-
})?
38-
};
39-
40-
// The increment does not happen instantly, but it's deferred to be executed in a batch
41-
// along with other downloads. See crate::downloads_counter for the implementation.
42-
req.app().downloads_counter.increment(version_id);
19+
let mut crate_name = req.params()["crate_id"].clone();
20+
let version = req.params()["version"].as_str();
21+
22+
let mut log_metadata = None;
23+
match recorder.record("get_conn", || req.db_conn()) {
24+
Ok(conn) => {
25+
use self::versions::dsl::*;
26+
27+
// Returns the crate name as stored in the database, or an error if we could
28+
// not load the version ID from the database.
29+
let (version_id, canonical_crate_name) = recorder.record("get_version", || {
30+
versions
31+
.inner_join(crates::table)
32+
.select((id, crates::name))
33+
.filter(Crate::with_name(&crate_name))
34+
.filter(num.eq(version))
35+
.first::<(i32, String)>(&*conn)
36+
})?;
37+
38+
if canonical_crate_name != crate_name {
39+
app.instance_metrics
40+
.downloads_non_canonical_crate_name_total
41+
.inc();
42+
log_metadata = Some(("bot", "dl"));
43+
}
44+
crate_name = canonical_crate_name;
45+
46+
// The increment does not happen instantly, but it's deferred to be executed in a batch
47+
// along with other downloads. See crate::downloads_counter for the implementation.
48+
app.downloads_counter.increment(version_id);
49+
}
50+
Err(PoolError::UnhealthyPool) => {
51+
// The download endpoint is the most critical route in the whole crates.io application,
52+
// as it's relied upon by users and automations to download crates. Keeping it working
53+
// is the most important thing for us.
54+
//
55+
// The endpoint relies on the database to fetch the canonical crate name (with the
56+
// right capitalization and hyphenation), but that's only needed to serve clients who
57+
// don't call the endpoint with the crate's canonical name.
58+
//
59+
// Thankfully Cargo always uses the right name when calling the endpoint, and we can
60+
// keep it working during a full database outage by unconditionally redirecting without
61+
// checking whether the crate exists or the rigth name is used. Non-Cargo clients might
62+
// get a 404 response instead of a 500, but that's worth it.
63+
//
64+
// Without a working database we also can't count downloads, but that's also less
65+
// critical than keeping Cargo downloads operational.
66+
67+
app.instance_metrics
68+
.downloads_unconditional_redirects_total
69+
.inc();
70+
log_metadata = Some(("unconditional_redirect", "true"));
71+
}
72+
Err(err) => return Err(err.into()),
73+
}
4374

4475
let redirect_url = req
4576
.app()
4677
.config
4778
.uploader
48-
.crate_location(&canonical_crate_name, version);
79+
.crate_location(&crate_name, &*version);
4980

50-
if &canonical_crate_name != crate_name {
51-
req.log_metadata("bot", "dl");
81+
if let Some((key, value)) = log_metadata {
82+
req.log_metadata(key, value);
5283
}
5384

5485
if req.wants_json() {

0 commit comments

Comments
 (0)