Commit 290b6a5

htejun authored and torvalds committed
Revert "slub: move synchronize_sched out of slab_mutex on shrink"
Patch series "slab: make memcg slab destruction scalable", v3. With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. I've seen machines which end up with hundred thousands of caches and many millions of kernfs_nodes. The current code is O(N^2) on the total number of caches and has synchronous rcu_barrier() and synchronize_sched() in cgroup offline / release path which is executed while holding cgroup_mutex. Combined, this leads to very expensive and slow cache destruction operations which can easily keep running for half a day. This also messes up /proc/slabinfo along with other cache iterating operations. seq_file operates on 4k chunks and on each 4k boundary tries to seek to the last position in the list. With a huge number of caches on the list, this becomes very slow and very prone to the list content changing underneath it leading to a lot of missing and/or duplicate entries. This patchset addresses the scalability problem. * Add root and per-memcg lists. Update each user to use the appropriate list. * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched and asynchronous. * For dying empty slub caches, remove the sysfs files after deactivation so that we don't end up with millions of sysfs files without any useful information on them. This patchset contains the following nine patches. 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch 0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch 0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch 0004-slab-reorganize-memcg_cache_params.patch 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch 0006-slab-implement-slab_root_caches-list.patch 0007-slab-introduce-__kmemcg_cache_deactivate.patch 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch 0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch 0001 reverts an existing optimization to prepare for the following changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release path batched and asynchronous. 0004-0006 separate out the lists. 0007-0008 replace synchronize_sched() in slub destruction path with call_rcu_sched(). 0009 removes sysfs files early for empty dying caches. 0010 makes destruction work items use a workqueue with limited concurrency. This patch (of 10): Revert 89e364d ("slub: move synchronize_sched out of slab_mutex on shrink"). With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. Moving synchronize_sched() out of slab_mutex isn't enough as it's still inside cgroup_mutex. 
The whole deactivation / release path will be updated to avoid all synchronous RCU operations. Revert this insufficient optimization in preparation to ease future changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Reported-by: Jay Vana <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
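For orientation, the mm/slub.c hunk below restores roughly the following shape. This is a simplified sketch only, not the full function: the per-node shrink loop, its locking and the partial-list handling are elided.

int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
{
	if (deactivate) {
		/*
		 * Stop caching empty slabs so that freeable kmem pages
		 * cannot pin an offline memory cgroup.
		 */
		s->cpu_partial = 0;
		s->min_partial = 0;

		/*
		 * s->cpu_partial is read locklessly in put_cpu_partial();
		 * wait until all current readers can observe the change.
		 * The memcg callers in mm/slab_common.c invoke this under
		 * slab_mutex, which is the synchronous-RCU cost the rest
		 * of the series goes on to remove.
		 */
		synchronize_sched();
	}

	/* ... flush per-cpu slabs and discard empty partial slabs ... */
	return 0;
}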
1 parent af3b5f8 commit 290b6a5

5 files changed: +23 -31 lines


mm/slab.c

Lines changed: 2 additions & 2 deletions

@@ -2315,7 +2315,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
 	int ret = 0;
 	int node;
@@ -2335,7 +2335,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep);
+	return __kmem_cache_shrink(cachep, false);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)

mm/slab.h

Lines changed: 1 addition & 1 deletion

@@ -167,7 +167,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;

mm/slab_common.c

Lines changed: 2 additions & 25 deletions

@@ -582,29 +582,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	get_online_cpus();
 	get_online_mems();
 
-#ifdef CONFIG_SLUB
-	/*
-	 * In case of SLUB, we need to disable empty slab caching to
-	 * avoid pinning the offline memory cgroup by freeable kmem
-	 * pages charged to it. SLAB doesn't need this, as it
-	 * periodically purges unused slabs.
-	 */
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
-		if (c) {
-			c->cpu_partial = 0;
-			c->min_partial = 0;
-		}
-	}
-	mutex_unlock(&slab_mutex);
-	/*
-	 * kmem_cache->cpu_partial is checked locklessly (see
-	 * put_cpu_partial()). Make sure the change is visible.
-	 */
-	synchronize_sched();
-#endif
-
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -616,7 +593,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c);
+		__kmem_cache_shrink(c, true);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -787,7 +764,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep);
+	ret = __kmem_cache_shrink(cachep, false);
 	put_online_mems();
 	put_online_cpus();
 	return ret;

mm/slob.c

Lines changed: 1 addition & 1 deletion

@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
 	return 0;
 }

mm/slub.c

Lines changed: 17 additions & 2 deletions

@@ -3891,7 +3891,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
 	int node;
 	int i;
@@ -3903,6 +3903,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;
 
+	if (deactivate) {
+		/*
+		 * Disable empty slabs caching. Used to avoid pinning offline
+		 * memory cgroups by kmem pages that can be freed.
+		 */
+		s->cpu_partial = 0;
+		s->min_partial = 0;
+
+		/*
+		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
+		 * so we have to make sure the change is visible.
+		 */
+		synchronize_sched();
+	}
+
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3959,7 +3974,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+		__kmem_cache_shrink(s, false);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
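Taken together, the hunks above leave two call patterns. The fragment below summarizes them for orientation only; declarations, the memcg child-cache lookup and error handling are elided, so it is a sketch rather than the literal call sites.

/*
 * Plain shrink: kmem_cache_shrink(), __kmem_cache_shutdown() and the
 * memory-offline callback pass deactivate=false and never pay for
 * synchronize_sched().
 */
ret = __kmem_cache_shrink(cachep, false);

/*
 * Deactivation of a dying memcg's caches in
 * memcg_deactivate_kmem_caches(): with SLUB, __kmem_cache_shrink(c, true)
 * now issues synchronize_sched() once per cache while slab_mutex is held,
 * which is exactly the cost later patches in this series remove again.
 */
mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
	/* ... look up the memcg child cache c of root cache s ... */
	__kmem_cache_shrink(c, true);
}
mutex_unlock(&slab_mutex);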
