A pair of memory-allocation improvements in Linux 5.13

By Jonathan Corbet

May 6, 2021

Among the many changes merged for 5.13 can be found a number of
performance improvements throughout the kernel. This kind of work does
not always stand out the way that new features do, but it is vitally
important for the future of the kernel overall. In the memory-management
area, a couple of long-running patch sets have finally made it into the
mainline; these provide a bulk page-allocation interface and huge-page
mappings in the vmalloc() area. Both of these changes should make things
faster, at least for some workloads.

Batch page allocation

The kernel's memory-allocation functions have long been optimized for
performance and scalability, but there are situations where that work
still has not achieved the desired results. One of those is high-speed
networking. Back in 2016, networking developer Jesper Dangaard Brouer
described the challenges that come with the fastest network links; when
the system is processing millions of packets per second, the time
available to deal with any given packet is highly constrained. The
kernel may only have a few hundred CPU cycles available to process each
packet, and obtaining a page from the memory allocator might, by itself,
require more than that. Using the entire CPU-time budget to allocate
memory is not the way to get the best network performance.

At the time, Brouer asked for an API that would allow a large number of
pages to be allocated with a single call, hopefully with a much lower
per-page cost. The networking code could then grab a pile of memory and
quickly hand out pages as needed. Nobody objected to the request at the
time; it is well understood that batching operations can increase
throughput in situations like this. But it took a while for that
interface to come around.

Mel Gorman took on that task and put together a patch series, the sixth
version of which was posted and taken into the -mm tree in March. It
adds two new interfaces for the allocation of single ("order-0") pages,
starting with:

    unsigned long alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages,
                                   struct list_head *list);

The allocation flags to use are stored in gfp, nr_pages is the number of
pages the caller would like to allocate, and list is a list onto which
the allocated pages are to be put. The return value is the number of
pages actually allocated, which may be less than nr_pages for any of a
number of reasons. The page structures for the allocated pages are
assembled into a list (using the lru entry) and attached to the provided
list.
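
As a rough illustration of how a caller might use this interface, here
is a minimal sketch (the function name is invented here; this is not
code from the patch set) that allocates a batch of pages and then
detaches each one from the list via its lru entry:

    #include <linux/gfp.h>
    #include <linux/list.h>
    #include <linux/mm.h>

    static void bulk_list_demo(void)
    {
        LIST_HEAD(pages);
        struct page *page, *tmp;
        unsigned long allocated;

        /* Request 32 order-0 pages; fewer may come back. */
        allocated = alloc_pages_bulk(GFP_KERNEL, 32, &pages);

        /*
         * Detach each page from the list before handing it out;
         * here the pages are simply freed again.
         */
        list_for_each_entry_safe(page, tmp, &pages, lru) {
            list_del(&page->lru);
            __free_page(page);
        }
    }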

Returning the pages in a linked list may seem a bit strange, especially
since "linked lists" and "scalability" tend not to go together well. The
advantage of this approach is that it does not require allocating any
memory to track the allocated pages. Since the list is unlikely to be
traversed (there is never a need to walk through the list as a whole),
the scalability issues do not apply here. Still, this interface may seem
awkward to some. For those who would rather supply an array to be filled
with pointers, a different interface is available:

    unsigned long alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages,
                                         struct page **page_array);

This function will store pointers to the page structures for the
allocated pages into page_array, which really should be at least
nr_pages elements long or unpleasant side effects may ensue.
Interestingly, pages will only be allocated for NULL entries in
page_array, so alloc_pages_bulk_array() can be used to refill a
partially emptied array of pages. This array thus should be zeroed
before the first call to alloc_pages_bulk_array().
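
That refill behavior suggests a usage pattern along these lines (a
hypothetical sketch; NR_PAGES and the function name are invented for
illustration, and a static array is implicitly zero-filled):

    #define NR_PAGES 32

    static struct page *pages[NR_PAGES];  /* static, thus zeroed */

    static void bulk_array_demo(void)
    {
        unsigned long got;

        /* First call: every entry is NULL, so up to NR_PAGES
           pages may be allocated. */
        got = alloc_pages_bulk_array(GFP_KERNEL, NR_PAGES, pages);

        /* Consume one page, leaving a NULL hole in the array. */
        __free_page(pages[0]);
        pages[0] = NULL;

        /* Refill: only the NULL entry at index zero is filled in. */
        got = alloc_pages_bulk_array(GFP_KERNEL, NR_PAGES, pages);
    }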

For users needing more control, the function underneath the hood that
does the work of both alloc_pages_bulk() and alloc_pages_bulk_array()
is:

    unsigned int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
                                    nodemask_t *nodemask, int nr_pages,
                                    struct list_head *page_list,
                                    struct page **page_array);

The additional parameters control the placement of the allocated pages
on a NUMA system; preferred_nid is the node to be used if possible,
while nodemask, if present, indicates the allowable set of nodes.
Exactly one of page_list and page_array should be non-NULL and will be
used to return the allocated pages; if both are supplied, page_array is
used and page_list is ignored.
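
A caller wanting pages placed on a particular node could thus call the
underlying function directly; this sketch (the wrapper name is invented
here) prefers node zero and passes a NULL nodemask, placing no
restriction on which nodes may be used as a fallback:

    static unsigned int bulk_numa_demo(struct page **array, int nr_pages)
    {
        /* Prefer node 0; return the pages through the array, so
           page_list is NULL. */
        return __alloc_pages_bulk(GFP_KERNEL, 0, NULL, nr_pages,
                                  NULL, array);
    }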

Benchmarks included with the patch set show a nearly 13% speed increase
for the high-speed networking case, and something closer to 500% for a
Sun RPC test case. Gorman noted, though, that: "Both potential users in
this series are corner cases (NFS and high-speed networks) so it is
unlikely that most users will see any benefit in the short term." The
Sun RPC and networking uses have gone directly into 5.13; others are
likely to follow.

Huge-page vmalloc()

Most kernel memory-allocation functions return pointers to either pages
or addresses in the kernel's address space; either way, the addresses
correspond to the physical address of the memory that has been
allocated. That works well for small allocations (one page or below),
but physically contiguous allocations become increasingly hard to
satisfy as the size of the allocation grows, due to the fragmentation of
memory over time. For this reason, much work has been done over the
years to avoid the need for multi-page allocations whenever possible.

Sometimes, though, only a large, contiguous area will do; the vmalloc()
interface exists to serve that need. The pages allocated by vmalloc()
will (probably) be scattered around physical memory, but they will be
made virtually contiguous by mapping them into a special portion of the
kernel's address space. Traditionally, heavy use of vmalloc() was
discouraged due to the costs of setting up the mappings and the limited
size of the dedicated address space on 32-bit systems. The address-space
limitation is not a problem on 64-bit systems, though, and use of
vmalloc() has been growing over time.
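
The interface itself is simple; a minimal, hypothetical sketch (the
function name is invented here) allocates and frees a virtually
contiguous buffer:

    #include <linux/errno.h>
    #include <linux/vmalloc.h>

    static int vmalloc_demo(void)
    {
        /* 16MB that is virtually, but not physically, contiguous. */
        void *buf = vmalloc(16 * 1024 * 1024);

        if (!buf)
            return -ENOMEM;
        /* ... use the buffer like any other kernel memory ... */
        vfree(buf);
        return 0;
    }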

Addresses in the vmalloc() range are slower to use than addresses in the
kernel's direct mapping, though, because the latter are mapped using
huge pages whenever possible. That reduces pressure on the CPU's
translation lookaside buffer (TLB), which is used to avoid resolving
virtual addresses through the page tables. Mappings in the vmalloc()
range use small ("base") pages, which are harder on the TLB.

As of 5.13, though, vmalloc() can use huge pages for suitably large
allocations, thanks to this patch from Nicholas Piggin. For vmalloc()
allocations that are larger than the smallest huge-page size, an attempt
will be made to use huge pages rather than base pages. That can improve
performance significantly for some kernel data structures, as Piggin
described:

Several of the most [used] structures in the kernel (e.g., vfs and
network hash tables) are allocated with vmalloc on NUMA machines, in
order to distribute access bandwidth over the machine. Mapping these
with larger pages can improve TLB usage significantly, for example this
reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due to vfs
hashes being allocated with 2MB pages.

There are some potential disadvantages, including wasting larger amounts
of memory due to internal fragmentation; a 3MB allocation would be
placed into two 2MB huge pages, for example, leaving 1MB of unused
memory at the end. It is also possible that the distribution of memory
across NUMA systems may be less balanced when larger pages are used.
Some vmalloc() callers may be unprepared for huge-page allocations, so
they are not done everywhere; in particular, the module loader, which
uses vmalloc() and could presumably benefit from huge pages, does not
currently use them.

Still, the advantages of using huge pages for vmalloc() would appear to
outweigh the disadvantages, at least in the testing that has been done
so far. There is a new command-line parameter, nohugevmalloc, which can
be used to disable this behavior if need be.
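
As an illustration, an administrator would add that parameter to the
kernel command line in the bootloader configuration; on a GRUB-based
system (the file and default options shown are illustrative and vary by
distribution) that might look like:

    # In /etc/default/grub; regenerate the GRUB configuration
    # afterward (e.g. with update-grub) for the change to take effect.
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nohugevmalloc"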

Most users are unlikely to notice any dramatic speed improvements
resulting from these changes. They are an important part of the ongoing
effort to optimize the kernel's behavior wherever possible, though; a
long list of changes like this is the reason why Linux performs as well
as it does.






