Commit b91e130

mm: optimize PageWaiters bit use for unlock_page()
In commit 6290602 ("mm: add PageWaiters indicating tasks are waiting for a page bit") Nick Piggin made our page locking no longer unconditionally touch the hashed page waitqueue, which not only helps performance in general, but is particularly helpful on NUMA machines where the hashed wait queues can bounce around a lot.

However, the "clear lock bit atomically and then test the waiters bit" sequence turns out to be much more expensive than it needs to be, because you get a nasty stall when trying to access the same word that just got updated atomically.

On architectures where locking is done with LL/SC, this would be trivial to fix with a new primitive that clears one bit and tests another atomically, but that ends up not working on x86, where the only atomic operations that return the result end up being cmpxchg and xadd. The atomic bit operations return the old value of the same bit we changed, not the value of an unrelated bit.

On x86, we could put the lock bit in the high bit of the byte, and use "xadd" with that bit (where the overflow ends up not touching other bits), and look at the other bits of the result. However, an even simpler model is to just use a regular atomic "and" to clear the lock bit, and then the sign bit in eflags will indicate the resulting state of the unrelated bit #7.

So by moving the PageWaiters bit up to bit #7, we can atomically clear the lock bit and test the waiters bit on x86 too. And on architectures with LL/SC (which is all the usual RISC suspects), the particular bit doesn't matter, so they are fine with this approach too.

This avoids the extra access to the same atomic word, and thus avoids the costly stall at page unlock time.

The only downside is that the interface ends up being a bit odd and specialized: clear a bit in a byte, and test the sign bit. Nick doesn't love the resulting name of the new primitive, but I'd rather make the name be descriptive and very clear about the limitation imposed by trying to work across all relevant architectures than make it be some generic thing that doesn't make the odd semantics explicit.

So this introduces the new architecture primitive clear_bit_unlock_is_negative_byte(), and adds the trivial implementation for x86. We have a generic non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)" combination) which can be overridden by any architecture that can do better. According to Nick, Power has the same hiccup x86 has, for example, but some other architectures may not even care.

All these optimizations mean that my page locking stress-test (which is just executing a lot of small short-lived shell scripts: "make test" in the git source tree) no longer makes our page locking look horribly bad. Before all these optimizations, just the unlock_page() costs were just over 3% of all CPU overhead on "make test". After this, it's down to 0.66%, so just a quarter of the cost it used to be.

(The difference on NUMA is bigger, but there this micro-optimization is likely less noticeable, since the big issue on NUMA was not the accesses to 'struct page', but the waitqueue accesses that were already removed by Nick's earlier commit).

Acked-by: Nick Piggin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Bob Peterson <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: Andrew Lutomirski <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 2d706e7 commit b91e130
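
Illustration (not part of the commit): a plain C model of what the new primitive means, assuming the final bit layout with PG_locked at bit #0 and PG_waiters at bit #7. It only shows the arithmetic — clear the lock bit, then report whether the byte it lives in ended up "negative", i.e. whether the waiters bit is set; the real primitive does the clear atomically with release semantics.

#include <stdbool.h>
#include <stdio.h>

#define PG_locked  0	/* lock bit, as in the patched enum pageflags */
#define PG_waiters 7	/* must be bit #7: the sign bit of the low byte */

/* Non-atomic model: clear bit 'nr' and report whether bit #7 of the
 * containing word is set afterwards ("is the byte negative?"). */
static bool model_clear_bit_unlock_is_negative_byte(long nr, unsigned long *flags)
{
	*flags &= ~(1UL << nr);
	return *flags & (1UL << PG_waiters);
}

int main(void)
{
	unsigned long locked_with_waiter = (1UL << PG_locked) | (1UL << PG_waiters);
	unsigned long locked_no_waiter   = (1UL << PG_locked);

	/* Waiter present: the unlocker must go poke the hashed waitqueue. */
	printf("with waiter:    wake needed = %d\n",
	       model_clear_bit_unlock_is_negative_byte(PG_locked, &locked_with_waiter));

	/* No waiter: the unlock fast path never touches the waitqueue at all. */
	printf("without waiter: wake needed = %d\n",
	       model_clear_bit_unlock_is_negative_byte(PG_locked, &locked_no_waiter));
	return 0;
}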

3 files changed, +45 −6 lines changed

arch/x86/include/asm/bitops.h

Lines changed: 13 additions & 0 deletions
@@ -139,6 +139,19 @@ static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
 	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
 }
 
+static __always_inline bool clear_bit_unlock_is_negative_byte(long nr, volatile unsigned long *addr)
+{
+	bool negative;
+	asm volatile(LOCK_PREFIX "andb %2,%1\n\t"
+		CC_SET(s)
+		: CC_OUT(s) (negative), ADDR
+		: "ir" ((char) ~(1 << nr)) : "memory");
+	return negative;
+}
+
+// Let everybody know we have it
+#define clear_bit_unlock_is_negative_byte clear_bit_unlock_is_negative_byte
+
 /*
  * __clear_bit_unlock - Clears a bit in memory
  * @nr: Bit to clear
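
Side note (not from the commit): CC_SET(s)/CC_OUT(s) are kernel helper macros that, on compilers without flag-output support, expand to a "sets" instruction storing the sign flag into a byte operand. A rough userspace sketch of the same operation, with those macros expanded by hand and the kernel's LOCK_PREFIX/ADDR conventions replaced by plain operands, could look like this (x86-64, GCC or Clang inline asm; nr is assumed to be in the low byte, 0..7):

#include <stdbool.h>

/* Clear bit 'nr' in the low byte of *addr with a locked "andb" and return
 * whether the resulting byte is negative, i.e. whether bit #7 is still set. */
static inline bool clear_bit_is_negative_byte(long nr, volatile unsigned long *addr)
{
	bool negative;

	asm volatile("lock; andb %2,%1\n\t"
		     "sets %0"
		     : "=q" (negative), "+m" (*(volatile char *)addr)
		     : "ir" ((char)~(1 << nr))
		     : "memory", "cc");
	return negative;
}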

include/linux/page-flags.h

Lines changed: 1 addition & 1 deletion
@@ -73,13 +73,13 @@
  */
 enum pageflags {
 	PG_locked,		/* Page is locked. Don't touch. */
-	PG_waiters,		/* Page has waiters, check its waitqueue */
 	PG_error,
 	PG_referenced,
 	PG_uptodate,
 	PG_dirty,
 	PG_lru,
 	PG_active,
+	PG_waiters,		/* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
 	PG_slab,
 	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use*/
 	PG_arch_1,

mm/filemap.c

Lines changed: 31 additions & 5 deletions
@@ -912,6 +912,29 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
 }
 EXPORT_SYMBOL_GPL(add_page_wait_queue);
 
+#ifndef clear_bit_unlock_is_negative_byte
+
+/*
+ * PG_waiters is the high bit in the same byte as PG_locked.
+ *
+ * On x86 (and on many other architectures), we can clear PG_locked and
+ * test the sign bit at the same time. But if the architecture does
+ * not support that special operation, we just do this all by hand
+ * instead.
+ *
+ * The read of PG_waiters has to be after (or concurrently with) PG_locked
+ * being cleared, but a memory barrier should be unnecessary since it is
+ * in the same byte as PG_locked.
+ */
+static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem)
+{
+	clear_bit_unlock(nr, mem);
+	/* smp_mb__after_atomic(); */
+	return test_bit(PG_waiters, mem);
+}
+
+#endif
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -921,16 +944,19 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  * mechanism between PageLocked pages and PageWriteback pages is shared.
  * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep.
  *
- * The mb is necessary to enforce ordering between the clear_bit and the read
- * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
+ * Note that this depends on PG_waiters being the sign bit in the byte
+ * that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to
+ * clear the PG_locked bit and test PG_waiters at the same time fairly
+ * portably (architectures that do LL/SC can test any bit, while x86 can
+ * test the sign bit).
  */
 void unlock_page(struct page *page)
 {
+	BUILD_BUG_ON(PG_waiters != 7);
 	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
+		wake_up_page_bit(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
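
For context (an illustrative sketch, not part of the commit): the #ifndef above is the override hook. An architecture that can do better than the clear_bit_unlock() + test_bit() fallback provides its own function plus a same-named macro in its asm/bitops.h, just as the x86 hunk does, and the generic version then compiles out. A hypothetical architecture whose atomics return the old word value might do something along these lines, using the GCC __atomic builtin purely as a stand-in for that architecture's own release-ordered fetch-and-clear:

/* Hypothetical arch/<foo>/include/asm/bitops.h fragment (illustrative only). */
static inline bool clear_bit_unlock_is_negative_byte(long nr,
						     volatile unsigned long *addr)
{
	/* Release-ordered atomic and-not; the returned old value lets us
	 * inspect bit #7 without touching the cache line a second time. */
	unsigned long old = __atomic_fetch_and(addr, ~(1UL << nr),
					       __ATOMIC_RELEASE);

	return (old & ~(1UL << nr)) & (1UL << 7);
}
/* Let mm/filemap.c know the optimized version exists. */
#define clear_bit_unlock_is_negative_byte clear_bit_unlock_is_negative_byte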
