With the Meltdown and Spectre fiascos, performance isn't a very hot topic at the moment. In fact, with the release of Linux v4.15, it is one of the rare times I've seen security win over performance in such a one-sided way; normally security features are tucked away behind a kernel config option nobody really uses. Of course the software fixes are also being backported in one way or another, so this isn't really specific to the latest kernel release.

All this said, v4.15 came out with a few performance enhancements across subsystems. The following is an unsorted and incomplete list of changes that went in. Note that the term 'performance' can be vague in that some gains in one area can negatively affect another, so take everything with a grain of salt and reach your own conclusions.

epoll: scale nested calls

Nested epolls are necessary to allow semantics where a file descriptor in an epoll instance's interest list is itself another epoll instance. Such calls are not all that common, but some real-world applications suffered severe performance issues because the implementation relied on global spinlocks, acquired throughout the callbacks in the epoll state machine. By removing them, we can speed up both adding fds to the instance and polling, such that epoll_wait() can improve by 100x, scaling linearly as increasing numbers of cores block on an event.
[Commit 57a173bdf5ba, 37b5e5212a44]
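
To illustrate the pattern in question, here is a minimal userspace sketch of a nested epoll setup, where one epoll fd is watched by another (an eventfd is used as a stand-in event source):

    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            int inner = epoll_create1(0);
            int outer = epoll_create1(0);
            int efd = eventfd(0, 0);
            struct epoll_event ev = { .events = EPOLLIN };

            /* watch a regular event source from the inner instance */
            ev.data.fd = efd;
            if (epoll_ctl(inner, EPOLL_CTL_ADD, efd, &ev) < 0) {
                    perror("epoll_ctl inner");
                    exit(1);
            }

            /* ... and watch the inner epoll fd from the outer one;
               readiness propagates through epoll's nested callbacks,
               the path that used to take the global spinlocks */
            ev.data.fd = inner;
            if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
                    perror("epoll_ctl outer");
                    exit(1);
            }

            eventfd_write(efd, 1); /* make the inner instance ready */

            struct epoll_event out;
            printf("outer epoll_wait returned %d\n",
                   epoll_wait(outer, &out, 1, 1000));
            return 0;
    }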


pvspinlock: hybrid fairness paravirt semantics

Locking under virtual environments can be tricky, balancing performance and fairness while avoiding artifacts such as starvation and lock holder/waiter preemption. The current paravirtual queued spinlocks, while free from starvation, can perform worse than an unfair lock in guests with CPU over-commitment. With Linux v4.15, guest spinlocks now combine the best of both worlds, with an unfair and a queued mode. The idea is, upon contention, to extend the lock-stealing attempt in the slowpath (unfair mode) for as long as there are queued MCS waiters present, improving performance while still avoiding starvation. Kernel build experiments show that as a VM becomes more and more over-committed, the ratio of locks acquired in unfair mode increases.
[Commit 11752adb68a3]
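
The gist of it is captured by the following simplified sketch, loosely modeled on the kernel's pv_hybrid_queued_unfair_trylock(); it assumes kernel-internal helpers such as cmpxchg_acquire() and the qspinlock bit masks:

    static inline bool hybrid_unfair_trylock(struct qspinlock *lock)
    {
            for (;;) {
                    int val = atomic_read(&lock->val);

                    /* unfair mode: try to steal the lock whenever it is
                       not held and no waiter has set the pending bit */
                    if (!(val & _Q_LOCKED_PENDING_MASK) &&
                        (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0))
                            return true;

                    /* keep retrying (stealing) only while MCS waiters
                       are queued; otherwise give up and fall back to
                       the fair, queued slowpath */
                    if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
                            return false;

                    cpu_relax();
            }
    }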


mm,x86: avoid saving/restoring interrupts state in gup

When x86 was converted to use the generic get_user_pages_fast() call, a performance regression was introduced at the microbenchmark level. The generic gup function attempts to walk the page tables without acquiring any locks, such as the mmap semaphore. In order to do this, interrupts must be disabled, which is where things differed between the arch-specific and generic flavors: the latter must save and restore the current interrupt state, introducing extra overhead compared to a simple local_irq_disable/enable() pair.
[Commit 5b65c4677a57]
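
In other words (a sketch; lockless_walk() is merely a hypothetical stand-in for the actual lockless page table walk):

    static void generic_gup_flavor(void)
    {
            unsigned long flags;

            local_irq_save(flags);          /* must read, then mask */
            lockless_walk();
            local_irq_restore(flags);       /* conditional re-enable */
    }

    static void x86_gup_flavor(void)        /* the cheaper variant */
    {
            local_irq_disable();            /* plain cli */
            lockless_walk();
            local_irq_enable();             /* plain sti */
    }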



ipc: scale INFO commands

Any syscall used to get info from sysvipc (such as semctl(IPC_INFO) or shmctl(SHM_INFO)) internally requires computing the last ipc identifier. For cases with large amounts of keys, this operation alone can consume a significant number of cycles, as it is looked up on demand, in O(N). In order to make this information available in constant time, we now keep track of it whenever a new identifier is added.
[Commit 15df03c87983]
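
Conceptually it amounts to something like the following, where the names are illustrative and a plain bitmap stands in for the kernel's idr:

    #include <stdbool.h>

    #define SKETCH_IPCMNI 32768             /* stand-in id limit */

    struct ids_sketch {
            int max_idx;                    /* cached highest in-use index */
            bool in_use[SKETCH_IPCMNI];
    };

    static void sketch_addid(struct ids_sketch *ids, int idx)
    {
            ids->in_use[idx] = true;
            if (idx > ids->max_idx)
                    ids->max_idx = idx;     /* O(1) on every add */
    }

    static void sketch_rmid(struct ids_sketch *ids, int idx)
    {
            ids->in_use[idx] = false;
            /* only deleting the current maximum needs a downward scan */
            while (ids->max_idx >= 0 && !ids->in_use[ids->max_idx])
                    ids->max_idx--;
    }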



ext4: improve smp scalability for inode generation

The superblock's inode generation number was, until now, sequentially increased (from a randomly initialized value) and protected by a spinlock, making the usage pattern quite primitive and not very friendly to workloads that create files/inodes concurrently. The inode generation path was optimized to remove the lock altogether and simply rely on prandom_u32(), such that a fast, seeded pseudo-random number generator is used for computing the i_generation.
[Commit 232530680290]
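
Conceptually, the change boils down to something like this (a sketch of the kernel-side code, not the exact diff):

    /* before: shared counter behind a spinlock, serializing all creators */
    spin_lock(&sbi->s_next_gen_lock);
    inode->i_generation = sbi->s_next_generation++;
    spin_unlock(&sbi->s_next_gen_lock);

    /* after (v4.15): no lock, no shared cacheline to bounce around */
    inode->i_generation = prandom_u32();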
