-
Notifications
You must be signed in to change notification settings - Fork 9
Fix comparison of changed files #29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic work, thanks so much!
Thanks to git-lfs
I can't push my changes to the PR directly, but have attached the patches here.
patches.zip (Updated)
Before merging, I'd love if you could try to remove the unsafe
from the code as it's unclear if these carry any weight without profiling a before and after. From the profile attached here (diffing commit by commit) the time for the actual diffing is marginal at best, and even though bigger steps (i.e. more changes) will result in more work for the differ, I doubt it can stand up to the time it takes to decompress objects. docs.rs
pulls regularly and there is only ~650 changes per day.
And with that it should be ready to merge.
- typos and form - improve docs and docs consistency.
- don't run them unless it's a release build - make the existence of atomic tests more obvious with `make` - share code between both baseline variants - some output to learn about progress
The performance difference is rather drastic at about 2.5x, and definitely worth having if there is no compatibility issue due to shared C dependencies in the same binary. Additionally we setup the makefile to use big object caches to avoid having to decompress the same object too often, accelerating the diffing process about 4x, for a total of 10x performance boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's a wrap! Thanks so much for all the effort that you put into this project, it's on an entirely new level thanks to it and I think it might finally be 'done' (despite #30 which might be implemented one fine day or maybe never).
This PR is a major refactor of the core diff logic to fix #26.
The crate completely removes any textual diffing that was previously performed using
imara-diff
.I verified that
imara-diff
function correctly.The problem is that
imara-diff
(and diffs in general) are intended to provide linear editing sequences.That is a set of edits to transform file A into file B without any jumps.
However, for
crates-index-diff
we actually don't care about this property, because we only care about how any givenchecksum changed (removed/added/yanked).
This means that
imara-diff
provides unideal diffs for this usecase.For example if a version X is moved from the end of the file to the start of the file imara diff provides to changes:
add X at line 0
remove X at last line
That means that any
remove
changes provided byimara-diff
had to be ignored for correct functionality.Ignoring removed lines works when crates are never deleted.
However, if comparing two (far apart) commits between which a crate was deleted and later reclaimed by someone else.
Then we need to mark any removed checksum as deleted.
This PR changes the diff logic to perform a simple unordered
HashMap<Checksum, CrateVersion>
comparison.The details are a bit tricky to avoid parsing every line (and to avoid storing the checksum twice).
However, the code is functionally equivalent to the following s (non compiling pseudocode, some details left out):
In the process I discovered some ambiguities in the API of
Change
around yanked crates.To address that I have added two new variants to that enum:
This ensures that downstream crates can always tell whether a crate
was yanked/unyanked/added without any extra operations.
Another API change:
Due to the refactor of the diff algorithm, changes are now naturally grouped
by crate. That means additional sorting inside
crates-index-diff
is not required to achieve that property.To improve performance the sorting by crate-name/version has been removed.
If sorting is required by downstream crates, it is trivial for them to call
changes.sort_by
themselves.This change is a nice performance boost for any crates that don't care about the order of changes (or require a different sorting then the one that was used by
crates-index-diff
).Note that the order of the changes is completely non-deterministic (except for being grouped by crate).
So if any downstream crate (or a test inside this crate) requires a deterministic order, it must sort the vector.
This PR fixes the bug discussed in #26 where
crates-index-diff
would reporta different set of changes when splitting the diff across multiple commits.
This PR also fixed the
baseline
test which previously forget to include theHEAD
commit (and only included ancestors).