Skip to content

nano-optimization for memchr::repeat_byte #50398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 4, 2018

Conversation

llogiq
Copy link
Contributor

@llogiq llogiq commented May 2, 2018

This replaces the multiple shifts & bitwise or with a single multiplication

In my benchmarks this performs equally well or better, especially on 64bit systems (it shaves a stable nanosecond on my skylake). This may go against conventional wisdom, but the shifts and bitwise ors cannot be pipelined because of hard data dependencies.

While it may or may not be worthwile from an optimization standpoint, it also reduces code size, so there's basically no downside.

@rust-highfive
Copy link
Contributor

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 2, 2018
@llogiq
Copy link
Contributor Author

llogiq commented May 3, 2018

cc @Manishearth who wrote the original code

@Manishearth
Copy link
Member

Manishearth commented May 3, 2018

cc @BurntSushi who actually wrote this code (I just git mvd it)

and @jimblandy whose C code this was originally adapted from

@BurntSushi
Copy link
Member

cc @bluss, who is the one who actually wrote the original fast memchr fallback implementation. :-)

This does look good to me though! Cute trick.

@Manishearth
Copy link
Member

tldr nobody wrote it and it just appeared

@nagisa
Copy link
Member

nagisa commented May 3, 2018

The improvement here (if any) is greatly dependent on relative instruction latency between bitwise ops and multiply. Most of the modern architectures have a fast multiply (throughput of 1+ insns per cycle) but latency is greater (3+ cycles). To contrast, bitwise instructions usually have a latency of 1 cycle (and througput of 2 to 4 insns per cycle), so as long as there are no more than 3 bitwise instructions in the critical path, the bitwise code will be almost universally faster (there seem to be 4 for 32-bit targets).

It might be considerably worse for 32-bit architectures which do not have a multiply instruction at all. I verified that we (or LLVM) do not support any 32+-bit targets which do not have a native instruction.

I observed some backends such as MIPS and Lanai to translate the multiply back into the original sequence of bitwise ops, presumably because it is more efficient to do so there. I observed x86 backend to do the same in some cases as well, but, notably, not ARM.

With this in mind, it seems pretty save to

@bors r+

@bors
Copy link
Collaborator

bors commented May 3, 2018

📌 Commit 1cefb5c has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 3, 2018
@nagisa
Copy link
Member

nagisa commented May 3, 2018

Ah, the pain of spotting a typo, but not wanting to confuse bors by editing the comment.

@bors
Copy link
Collaborator

bors commented May 4, 2018

⌛ Testing commit 1cefb5c with merge e78c51a...

bors added a commit that referenced this pull request May 4, 2018
nano-optimization for memchr::repeat_byte

This replaces the multiple shifts & bitwise or with a single multiplication

In my benchmarks this performs equally well or better, especially on 64bit systems (it shaves a stable nanosecond on my skylake). This may go against conventional wisdom, but the shifts and bitwise ors cannot be pipelined because of hard data dependencies.

While it may or may not be worthwile from an optimization standpoint, it also reduces code size, so there's basically no downside.
@bors
Copy link
Collaborator

bors commented May 4, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: nagisa
Pushing e78c51a to master...

@bors bors merged commit 1cefb5c into rust-lang:master May 4, 2018
@llogiq llogiq deleted the memchr-nano-opt branch May 4, 2018 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants