
Commit a4a3ede

Pavel Tatashin authored and torvalds committed
mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: it is not present in memblock.memory (because it is not backed by physical pages), but it is present in memblock.reserved. Such memory has backing struct pages, but they are not initialized by going through __init_single_page().

In some cases these struct pages are accessed even if they do not contain any data. One example is page_to_pfn(), which might access page->flags if this is where section information is stored (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS).

One example of such memory: trim_low_memory_range() unconditionally reserves from pfn 0, but e820__memblock_setup() might provide the existing memory from pfn 1 (i.e. KVM).

Since struct pages are zeroed in __init_single_page(), and not during allocation time, we must zero such struct pages explicitly.

The patch involves adding a new memblock iterator:

  for_each_resv_unavail_range(i, p_start, p_end)

which iterates through the reserved && !memory lists, and we zero struct pages explicitly by calling mm_zero_struct_page().

===

Here is a more detailed example of the problem that this patch is addressing:

Run tested on qemu with the following arguments:

  -enable-kvm -cpu kvm64 -m 512 -smp 2

This patch reports that there are 98 unavailable pages. They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15]; it does not reserve the [159, 255] ones.

e820__memblock_setup() reports to Linux that the following physical ranges are available:

  [1, 158]
  [256, 130783]

Notice that exactly the unavailable pfns are missing!

Now, let's check what we have in zone 0: [1, 131039]

pfn 0 is not part of the zone, but pfns [1, 158] are.

However, the bigger problem we have if we do not initialize these struct pages is with memory hotplug, because that path operates at 2M boundaries (section_nr) and checks if a 2M range of pages is hot removable. It starts with the first pfn from the zone, rounds it down to a 2M boundary (struct pages are allocated at 2M boundaries when vmemmap is created), and checks if that section is hot removable. In this case it starts with pfn 1 and rounds it down to pfn 0. Later the pfn is converted to a struct page, and some fields are checked. Now, if we do not zero struct pages, we get unpredictable results.

In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all vmemmap memory to ones, the following panic is observed with a test kernel without this patch applied:

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: is_pageblock_removable_nolock+0x35/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT
  ...
  task: ffff88001f4e2900 task.stack: ffffc90000314000
  RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
  Call Trace:
   ? is_mem_section_removable+0x5a/0xd0
   show_mem_removable+0x6b/0xa0
   dev_attr_show+0x1b/0x50
   sysfs_kf_seq_show+0xa1/0x100
   kernfs_seq_show+0x22/0x30
   seq_read+0x1ac/0x3a0
   kernfs_fop_read+0x36/0x190
   ? security_file_permission+0x90/0xb0
   __vfs_read+0x16/0x30
   vfs_read+0x81/0x130
   SyS_read+0x44/0xa0
   entry_SYSCALL_64_fastpath+0x1f/0xbd

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Steven Sistare <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Bob Picco <[email protected]>
Tested-by: Bob Picco <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
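
The round-down step in the hotplug path can be illustrated with a little arithmetic. This is only a sketch: it assumes 4 KiB pages, so a 2M range holds 512 pfns, and `PFNS_PER_2M` and `section_start_pfn()` are hypothetical names chosen for illustration, not kernel symbols.

```c
/* Assumption: 4 KiB pages, as on the x86 qemu guest in the example. */
#define PAGE_SHIFT 12

/* Number of pfns in one 2M range, the granularity the message describes. */
#define PFNS_PER_2M ((2UL << 20) >> PAGE_SHIFT)

/* Hypothetical helper: round a pfn down to the first pfn of its 2M range. */
unsigned long section_start_pfn(unsigned long pfn)
{
	return pfn & ~(PFNS_PER_2M - 1);
}
```

With these assumptions, the zone's first pfn (1) rounds down to pfn 0, which is exactly the struct page that was never initialized.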
1 parent ea1f5f3 commit a4a3ede

3 files changed, 71 insertions(+), 0 deletions(-)
include/linux/memblock.h

Lines changed: 16 additions & 0 deletions
@@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
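
To make the "reserved && !memory" semantics concrete, here is a minimal userspace sketch that counts pfns lying in a reserved range but in no memory range. It is not how memblock implements the iterator. `struct pfn_range`, `count_resv_unavail()` and `qemu_example_count()` are hypothetical names. The memory ranges come from the commit message; the reserved list is an assumption, since the message only confirms that [0, 15] is reserved by trim_low_memory_range() and does not say what reserves [159, 255].

```c
#include <stddef.h>

/* Inclusive pfn range; a stand-in for illustration, not struct memblock_region. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Count pfns covered by a "reserved" range but by no "memory" range.
 * Both arrays must be sorted by start and free of overlaps.
 */
unsigned long count_resv_unavail(const struct pfn_range *resv, size_t nr_resv,
				 const struct pfn_range *mem, size_t nr_mem)
{
	unsigned long count = 0;
	size_t r, m;

	for (r = 0; r < nr_resv; r++) {
		unsigned long cur = resv[r].start; /* first pfn not yet classified */
		int done = 0;

		for (m = 0; m < nr_mem && !done; m++) {
			if (mem[m].end < cur)
				continue;	/* memory range lies before the cursor */
			if (mem[m].start > resv[r].end)
				break;		/* memory range lies past this reservation */
			if (mem[m].start > cur)
				count += mem[m].start - cur; /* hole before this memory range */
			if (mem[m].end >= resv[r].end)
				done = 1;	/* rest of the reservation is backed */
			else
				cur = mem[m].end + 1;
		}
		if (!done)
			count += resv[r].end - cur + 1; /* unbacked tail of the reservation */
	}
	return count;
}

/* Ranges from the qemu example; the second reserved range is an assumption. */
unsigned long qemu_example_count(void)
{
	const struct pfn_range resv[] = { { 0, 15 }, { 159, 255 } };
	const struct pfn_range mem[] = { { 1, 158 }, { 256, 130783 } };

	return count_resv_unavail(resv, 2, mem, 2);
}
```

Under those assumed inputs the count is pfn 0 plus the 97 pfns in [159, 255], which matches the 98 pages the commit message reports.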

include/linux/mm.h

Lines changed: 15 additions & 0 deletions
@@ -95,6 +95,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * On some architectures it is expensive to call memset() for small sizes.
+ * Those architectures should provide their own implementation of "struct page"
+ * zeroing by defining this macro in <asm/pgtable.h>.
+ */
+#ifndef mm_zero_struct_page
+#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
+#endif
+
 /*
  * Default maximum number of active map areas, this limits the number of vmas
  * per mm struct. Users can overwrite this number by sysctl but there is a
@@ -2030,6 +2039,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 		unsigned long, enum memmap_context);
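
The `#ifndef` guard above is an override hook: a header included earlier may define `mm_zero_struct_page()`, otherwise the memset() default is used. A minimal userspace sketch of the default path follows. The `struct page` here is a small hypothetical stand-in (the real layout differs), and `page_is_zeroed()` and `poison_then_zero()` are illustrative helpers, not kernel functions.

```c
#include <string.h>

/* Hypothetical stand-in; the kernel's struct page has a different layout. */
struct page {
	unsigned long flags;
	unsigned long private_data;
	void *mapping;
	int refcount;
};

/* An earlier header could define this macro to replace the default. */
#ifndef mm_zero_struct_page
#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
#endif

/* Return 1 if every byte of the struct page is zero, 0 otherwise. */
int page_is_zeroed(const struct page *p)
{
	const unsigned char *bytes = (const unsigned char *)p;
	size_t i;

	for (i = 0; i < sizeof(*p); i++)
		if (bytes[i])
			return 0;
	return 1;
}

/* Fill a struct page with ones, as the debug setup in the message does, then zero it. */
int poison_then_zero(void)
{
	struct page pg;

	memset(&pg, 0xff, sizeof(pg));
	if (page_is_zeroed(&pg))
		return 0;	/* poisoning did not take effect */
	mm_zero_struct_page(&pg);
	return page_is_zeroed(&pg);
}
```

Because the macro takes a pointer and evaluates to void, a replacement built from explicit field stores can be swapped in without touching callers.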

mm/page_alloc.c

Lines changed: 40 additions & 0 deletions
@@ -6215,6 +6215,44 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/*
+	 * Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+
+	/*
+	 * Struct pages that do not have backing memory. This could be because
+	 * firmware is using some of this memory, or for some other reasons.
+	 * Once memblock is changed so such behaviour is not allowed: i.e.
+	 * list of "reserved" memory must be a subset of list of "memory", then
+	 * this code can be removed.
+	 */
+	if (pgcnt)
+		pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6638,6 +6676,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6801,6 +6840,7 @@ void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)
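
The inner loop in zero_resv_unavail() rounds the start address down and the end address up, so a struct page is zeroed even when the range covers only part of its page. The userspace sketch below counts the pfns such a loop visits. It assumes 4 KiB pages and an exclusive end address; `pages_touched()` is a hypothetical helper, and the byte addresses in the usage note are derived from the pfn numbers in the commit message rather than quoted from it.

```c
/* Assumption: 4 KiB pages. The PFN macros mirror the usual kernel definitions. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define PFN_UP(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)

/*
 * Hypothetical helper: count the pfns visited for the byte range
 * [start, end), using the same loop bounds as zero_resv_unavail().
 */
unsigned long long pages_touched(unsigned long long start, unsigned long long end)
{
	unsigned long long pfn, count = 0;

	for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
		count++;	/* the kernel loop zeroes pfn_to_page(pfn) here */
	return count;
}
```

For instance, pfns [159, 255] correspond to bytes 0x9f000 up to 0x100000 under these assumptions, and the loop visits 97 pages. A range starting mid-page at 0x9fc00 still visits the same 97.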
