Skip to content

Fix comparison of changed files #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 21, 2022
Merged

Fix comparison of changed files #29

merged 6 commits into from
Nov 21, 2022

Conversation

pascalkuthe
Copy link
Contributor

@pascalkuthe pascalkuthe commented Nov 19, 2022

This PR is a major refactor of the core diff logic to fix #26.
The crate completely removes any textual diffing that was previously performed using imara-diff.
I verified that imara-diff function correctly.
The problem is that imara-diff (and diffs in general) are intended to provide linear editing sequences.
That is a set of edits to transform file A into file B without any jumps.
However, for crates-index-diff we actually don't care about this property, because we only care about how any given
checksum changed (removed/added/yanked).
This means that imara-diff provides unideal diffs for this usecase.
For example if a version X is moved from the end of the file to the start of the file imara diff provides to changes:
add X at line 0
remove X at last line

That means that any remove changes provided by imara-diff had to be ignored for correct functionality.
Ignoring removed lines works when crates are never deleted.
However, if comparing two (far apart) commits between which a crate was deleted and later reclaimed by someone else.
Then we need to mark any removed checksum as deleted.

This PR changes the diff logic to perform a simple unordered HashMap<Checksum, CrateVersion> comparison.
The details are a bit tricky to avoid parsing every line (and to avoid storing the checksum twice).
However, the code is functionally equivalent to the following s (non compiling pseudocode, some details left out):

let old_versions: HashMap<Checksum, CrateVersion> = diff.old.lines().map(|line| parse_crate_version(line));
let new_versions: HashMap<Checksum, CrateVersion> = diff.new.lines().map(|line| parse_crate_version(line));
for new_version in new_versions.difference(old_versions){
  self.changes.push(Change::Added(new_version))
}
for old_version in old_versions.difference(new_versions){
  self.changes.push(Change::Deleted(old_version))
}
for (old_version, new_version) in old_versions.intersection(new_versions){
  let change = match (old_version.yanked, new_version.yanked) {
      (true, false) => Change::Unyanked(new_version),
      (false, true) => Change::Yanked(new_version),
      _ => continue,
  };
  self.changes.push(change)
}

In the process I discovered some ambiguities in the API of Change around yanked crates.
To address that I have added two new variants to that enum:

Unyanked(CrateVersion)
AddedAndYanked(CrateVersion)

This ensures that downstream crates can always tell whether a crate
was yanked/unyanked/added without any extra operations.

Another API change:

Due to the refactor of the diff algorithm, changes are now naturally grouped
by crate. That means additional sorting inside crates-index-diff is not required to achieve that property.
To improve performance the sorting by crate-name/version has been removed.
If sorting is required by downstream crates, it is trivial for them to call changes.sort_by themselves.
This change is a nice performance boost for any crates that don't care about the order of changes (or require a different sorting then the one that was used by crates-index-diff).
Note that the order of the changes is completely non-deterministic (except for being grouped by crate).
So if any downstream crate (or a test inside this crate) requires a deterministic order, it must sort the vector.

This PR fixes the bug discussed in #26 where crates-index-diff would report
a different set of changes when splitting the diff across multiple commits.
This PR also fixed the baseline test which previously forget to include the HEAD commit (and only included ancestors).

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, thanks so much!

Thanks to git-lfs I can't push my changes to the PR directly, but have attached the patches here.

patches.zip (Updated)

Before merging, I'd love if you could try to remove the unsafe from the code as it's unclear if these carry any weight without profiling a before and after. From the profile attached here (diffing commit by commit) the time for the actual diffing is marginal at best, and even though bigger steps (i.e. more changes) will result in more work for the differ, I doubt it can stand up to the time it takes to decompress objects. docs.rs pulls regularly and there is only ~650 changes per day.

Screenshot-2022-11-20-at-20 27 12

And with that it should be ready to merge.

Byron and others added 4 commits November 21, 2022 19:28
- typos and form
- improve docs and docs consistency.
- don't run them unless it's a release build
- make the existence of atomic tests more obvious with `make`
- share code between both baseline variants
- some output to learn about progress
The performance difference is rather drastic at about 2.5x, and
definitely worth having if there is no compatibility issue
due to shared C dependencies in the same binary.

Additionally we setup the makefile to use big object caches
to avoid having to decompress the same object too often, accelerating
the diffing process about 4x, for a total of 10x performance boost.
Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's a wrap! Thanks so much for all the effort that you put into this project, it's on an entirely new level thanks to it and I think it might finally be 'done' (despite #30 which might be implemented one fine day or maybe never).

@Byron Byron merged commit 1b5684f into Byron:main Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New publish concurrent with yanks is missed
2 participants