Commit c73322d

hnaz authored and torvalds committed
mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins at 100% CPU usage. We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or whether to back off and sleep) is based on the amount of scanned pages in proportion to the amount of reclaimable pages. In Jia's and our scenarios, however, there are no reclaimable pages in the node, and the condition for backing off is never met. Kswapd busyloops in an attempt to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on scanning but on whether kswapd is able to actually reclaim pages in MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criterion the page allocator uses for giving up on direct reclaim and invoking the OOM killer. If it cannot free any pages, kswapd will go to sleep and leave further attempts to direct reclaim invocations, which will either make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported; the remainder are smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(), and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the first three patches fixes, the rest cleanups.
This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root  20   0     0    0    0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until commit 1d82de6 ("mm, vmscan: make kswapd reclaim in terms of nodes") kswapd had such a mechanism. It considered zones whose theoretically reclaimable pages it had reclaimed six times over as unreclaimable and backed away from them. This guard was erroneously removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported here: there *are* no reclaimable pages that could be scanned until the threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single page, make it back off from the node. This is the same number of shots direct reclaim takes before declaring OOM. Kswapd will go to sleep on that node until a direct reclaimer manages to reclaim some pages, thus proving the node reclaimable again.
[[email protected]: check kswapd failure against the cumulative nr_reclaimed count]
  Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix condition for throttle_direct_reclaim]
  Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Shakeel Butt <[email protected]>
Reported-by: Jia He <[email protected]>
Tested-by: Jia He <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent a87c75f commit c73322d

5 files changed: 43 additions, 23 deletions

include/linux/mmzone.h

Lines changed: 2 additions & 0 deletions
@@ -630,6 +630,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;

mm/internal.h

Lines changed: 6 additions & 0 deletions
@@ -80,6 +80,12 @@ static inline void set_page_refcounted(struct page *page)
 
 extern unsigned long highest_memmap_pfn;
 
+/*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is consider the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
 /*
  * in mm/vmscan.c:
  */

mm/page_alloc.c

Lines changed: 2 additions & 7 deletions
@@ -3521,12 +3521,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }
 
-/*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
@@ -4534,7 +4528,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			node_page_state(pgdat, NR_PAGES_SCANNED),
-			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+				"yes" : "no");
 	}
 
 	for_each_populated_zone(zone) {

mm/vmscan.c

Lines changed: 32 additions & 15 deletions
@@ -2620,6 +2620,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd gives up on balancing particular nodes after too
+	 * many failures to reclaim anything from them and goes to
+	 * sleep. On reclaim progress, reset the failure counter. A
+	 * successful direct reclaim run will revive a dormant kswapd.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failures = 0;
+
 	return reclaimable;
 }
 
@@ -2694,10 +2703,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 					 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-			if (sc->priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;	/* Let kswapd poll it */
-
 			/*
 			 * If we already have plenty of memory free for
 			 * compaction in this zone, don't free any more.
@@ -2817,14 +2822,17 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	return 0;
 }
 
-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+static bool allow_direct_reclaim(pg_data_t *pgdat)
 {
 	struct zone *zone;
 	unsigned long pfmemalloc_reserve = 0;
 	unsigned long free_pages = 0;
 	int i;
 	bool wmark_ok;
 
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!managed_zone(zone) ||
@@ -2905,7 +2913,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 
 		/* Throttle based on the first usable node */
 		pgdat = zone->zone_pgdat;
-		if (pfmemalloc_watermark_ok(pgdat))
+		if (allow_direct_reclaim(pgdat))
 			goto out;
 		break;
 	}
@@ -2927,14 +2935,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 */
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-			pfmemalloc_watermark_ok(pgdat), HZ);
+			allow_direct_reclaim(pgdat), HZ);
 
 		goto check_pending;
 	}
 
 	/* Throttle until kswapd wakes the process */
 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-		pfmemalloc_watermark_ok(pgdat));
+		allow_direct_reclaim(pgdat));
 
 check_pending:
 	if (fatal_signal_pending(current))
@@ -3114,7 +3122,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
-	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
+	 * soon as allow_direct_reclaim() is true. But there is a potential
 	 * race between when kswapd checks the watermarks and a process gets
 	 * throttled. There is also a potential race if processes get
 	 * throttled, kswapd wakes, a large process exits thereby balancing the
@@ -3128,6 +3136,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3214,9 +3226,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	count_vm_event(PAGEOUTRUN);
 
 	do {
+		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
 
-		sc.nr_reclaimed = 0;
 		sc.reclaim_idx = classzone_idx;
 
 		/*
@@ -3295,7 +3307,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * able to safely make forward progress. Wake them
 		 */
 		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
-				pfmemalloc_watermark_ok(pgdat))
+				allow_direct_reclaim(pgdat))
 			wake_up_all(&pgdat->pfmemalloc_wait);
 
 		/* Check if kswapd should be suspending */
@@ -3306,10 +3318,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !sc.nr_reclaimed)
+		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failures++;
+
 out:
 	/*
	 * Return the order kswapd stopped reclaiming at as
@@ -3509,6 +3525,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3779,9 +3799,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */

mm/vmstat.c

Lines changed: 1 addition & 1 deletion
@@ -1425,7 +1425,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n node_unreclaimable: %u"
 		   "\n start_pfn: %lu"
 		   "\n node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');

0 commit comments