Skip to content

database: Change num_no_build to be NOT NULL #9767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 27, 2024

Conversation

Turbo87
Copy link
Member

@Turbo87 Turbo87 commented Oct 27, 2024

This PR was extracted from #9756, because it needs to be deploy separately from the other migrations in that PR.

The backfill SQL script in #9766 has been run on the staging and production databases, so this should be good to go. SELECT * FROM versions WHERE num_no_build IS NULL returns zero rows.

@Turbo87 Turbo87 added C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear A-backend ⚙️ labels Oct 27, 2024
@Turbo87 Turbo87 enabled auto-merge October 27, 2024 15:08
@Turbo87 Turbo87 merged commit 622c786 into rust-lang:main Oct 27, 2024
8 checks passed
@Turbo87 Turbo87 deleted the num_no_build2 branch October 27, 2024 15:17
Copy link

codecov bot commented Oct 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.81%. Comparing base (084d46d) to head (f12378b).
Report is 11 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9767   +/-   ##
=======================================
  Coverage   88.81%   88.81%           
=======================================
  Files         289      289           
  Lines       29810    29820   +10     
=======================================
+ Hits        26475    26485   +10     
  Misses       3335     3335           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +88 to +89
#[builder(default = strip_build_metadata(num))]
num_no_build: &'a str,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argh, I screwed this up... this part of the diff should've been in the previous PR instead. every new version is currently inserted with num_no_build = NULL, which makes the set not null migration in this PR fail 🤦‍♂️

update versions
set num_no_build = split_part(num, '+', 1)
where num_no_build IS NULL;

I will run ⬆️ right before deploying, which should be enough for this to deploy cleanly...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, that worked. all good now! :)

@eth3lbert
Copy link
Contributor

When I tried to import data using scripts/import-database-dump.sh today, I encountered an error, as shown below:

...
    \copy "versions" ("bin_names", "checksum", "crate_id", "crate_size", "created_at", "downloads", "edition", "features", "has_lib", "id", "license", "links", "num", "published_by", "rust_version", "
updated_at", "yanked") FROM 'data/versions.csv' WITH CSV HEADER
ERROR:  null value in column "num_no_build" of relation "versions" violates not-null constraint
...

I believe this error occurred because the "num_no_build" column is set to NOT_NULL and we also exclude it from our dump data. This then violates the NOT_NULL constraint.

@Turbo87
Copy link
Member Author

Turbo87 commented Nov 27, 2024

ugh... unfortunately that makes a lot of sense 😞

we might have to rethink #9786 🤔

@dtolnay any thoughts on this?

@eth3lbert
Copy link
Contributor

Another idea would be to modify the script template, changing the DDL to allow NULL values before copying. We could also include #9767 (comment) in the template to set the value after copying and then set the column back to NOT_NULL after updating.

@Turbo87
Copy link
Member Author

Turbo87 commented Nov 27, 2024

that basic version of the backfill won't help for the full dataset though, because it would cause uniqueness issues on the corresponding index for some of the older versions where we still allowed for duplicates :-/

@dtolnay
Copy link
Member

dtolnay commented Nov 28, 2024

If we have an UPDATE command that can do a 100% faithful backfill cheaply, is there a place we can put that? Somewhere in import.sql? If not, let's revert #9786.

This worked for me on today's db-dump:

CREATE TABLE public.versions ( ..., num_no_build VARCHAR, ... );

\copy versions (bin_names, checksum, crate_id, crate_size, created_at, downloads, edition, features, has_lib, id, license, links, num, published_by, rust_version, updated_at, yanked) FROM 'versions.csv' WITH CSV HEADER

UPDATE versions
SET num_no_build = CASE WHEN deduplicate.count = 1 THEN deduplicate.nnb ELSE versions.num END
FROM
  (SELECT crate_id, split_part(num, '+', 1) nnb, count(*)
   FROM versions
   GROUP BY crate_id, nnb) deduplicate
WHERE versions.crate_id = deduplicate.crate_id
  AND split_part(versions.num, '+', 1) = deduplicate.nnb;

ALTER TABLE versions ALTER COLUMN num_no_build SET NOT NULL;

-- example crate with duplicates:
SELECT crate_id, num, num_no_build
FROM versions
WHERE crate_id = 1643
ORDER BY num;

@Turbo87
Copy link
Member Author

Turbo87 commented Nov 28, 2024

If we have an UPDATE command that can do a 100% faithful backfill cheaply

-- update versions
-- set num_no_build = split_part(num, '+', 1);
--
-- with duplicates as (
-- -- find all versions that have the same `crate_id` and `num_no_build`
-- select crate_id, num_no_build, array_agg(num ORDER BY id) as nums
-- from versions
-- group by crate_id, num_no_build
-- having count(*) > 1
-- ),
-- duplicates_to_update as (
-- -- for each group of duplicates, update all versions except the one that
-- -- doesn't have "build metadata", or the first one that was published if
-- -- all versions have "build metadata"
-- select crate_id, num_no_build, unnest(case when array_position(nums, num_no_build) IS NOT NULL then array_remove(nums, num_no_build) else nums[2:] end) as num
-- from duplicates
-- )
-- update versions
-- set num_no_build = duplicates_to_update.num
-- from duplicates_to_update
-- where versions.crate_id = duplicates_to_update.crate_id
-- and versions.num = duplicates_to_update.num;
is the script that I used on production when we backfilled the data. I'm not sure if I would call it "cheap", but it worked well enough :D

Somewhere in import.sql?

hmm, let me see if I can adjust our scripts to do this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-internal 🔧 Category: Nonessential work that would make the codebase more consistent or clear
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants