This year at the Linux Plumbers Conference, Oracle Linux developer Daniel Jordan co-organized the performance and scalability microconference along with Pasha Tatashin from Microsoft and Ying Huang from Intel. The event had nine speakers, about half of whom were from Oracle, so this was a nice opportunity for our team to raise its concerns with the community. Daniel contributes this writeup on the challenges and opportunities they discussed.
This was a good year for Oracle at the 2018 Linux Plumbers Conference: several attendees told me they noticed the heavy representation from Oracle, both in the talks and in the hallway.
Plumbers was most useful for the small, focused discussions that would never happen on mailing lists; with extended face time, you go deeper and find more common ground.
Tim Chen from Intel spoke about a bottleneck in TPC-C with scheduler task accounting for cgroups on multi-socket systems. (Oracle's Unbreakable Enterprise Kernel 5 is configured with the same scheduler options he used in the runs.) An atomic operation tracking the aggregate load average in a task group (load_avg in struct task_group) was showing up at the top of his profiles. Rik van Riel pointed out that this wasn't just a problem on multi-socket boxes; Facebook was seeing it on single-socket systems as well. Peter Zijlstra suggested that Tim revive some old patches that broke this counter up across NUMA nodes, and after further discussion it was agreed to split along Last-Level Cache (LLC) boundaries instead, because systems have grown larger since the patches were last posted. We hope to see good results from these changes soon!
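The fix under discussion amounts to sharding one hot atomic counter. Here is a rough userspace sketch of the idea (the shard count, names, and layout are invented for illustration, not the actual scheduler code): each LLC domain gets its own cache-line-padded counter that is updated locally, and the aggregate is recovered by summing the shards.

```c
#include <stdatomic.h>

#define NR_SHARDS 4  /* stand-in for the number of LLC domains */

/* One padded counter per shard so updates from different LLC domains
 * don't bounce the same cache line. */
struct sharded_load {
	struct {
		_Atomic long val;
		char pad[64 - sizeof(_Atomic long)];
	} shard[NR_SHARDS];
};

/* Update path: touch only the local shard. */
static void load_add(struct sharded_load *s, int llc_id, long delta)
{
	atomic_fetch_add_explicit(&s->shard[llc_id].val, delta,
				  memory_order_relaxed);
}

/* Read path: sum all shards to recover the aggregate. */
static long load_sum(struct sharded_load *s)
{
	long sum = 0;

	for (int i = 0; i < NR_SHARDS; i++)
		sum += atomic_load_explicit(&s->shard[i].val,
					    memory_order_relaxed);
	return sum;
}
```

Writes stay within one cache domain, so cross-socket cacheline bouncing is confined to the rarer summing path.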
Pasha Tatashin from Microsoft spoke about seamlessly updating a host OS while minimizing guest VM downtime, presenting two high-level strategies for the audience to consider. In the first, the host kexecs into a new kernel and hands guest state over as control transfers from the old kernel to the new one. In the second, the host boots a new host OS inside a VM, migrates the guests into that VM (relying on nested virtualization), and kexecs into it, fixing up EPT translations before transferring control. There was some concern about how to support SR-IOV devices in the second approach, but in the end Pasha decided to experiment with it.
Steve Sistare and Subhra Mazumdar spoke about scheduler scalability work they've been involved in. Steve's blog post is coming in January; read Subhra's blog post here.
Mike Kravetz and Christoph Lameter led a session on huge page issues in the kernel. Here's an excerpt about the session from Mike:
During this MC, Christoph Lameter and myself talked about promoting huge page usage. This was mostly a rehash of material previously presented and discussed. The 'hope' was to spark discussion and possibly new ideas. During this session, one really good suggestion was made. Align mmap addresses for THP. I sent out a similar RFC to align for pmd sharing a couple years back (https://lkml.org/lkml/2016/3/28/478) but did not follow through. Will add both to my todo list.
Boqun Feng held a discussion about an issue with workqueues and CPU hotplug that he had hit while optimizing an RCU (Read-Copy-Update) path. According to a comment above its definition, queue_work_on requires callers to ensure the requested CPU for the work item can't go offline before the queueing finishes, and the RCU code path doesn't follow this requirement. Boqun wanted to disable preemption around the queue_work_on call, effectively preventing CPU hotplug, but Thomas Gleixner opposed this, saying that disabling preemption prevents CPU hotplug only by accident and carries no semantic guarantee. Paul McKenney and Thomas went back and forth about what to do, but to make a long story short, in the hallway afterward it was discovered that the workqueue comment stating the CPU hotplug requirement was stale. The workqueue splats Boqun had seen must have some other cause, pending investigation.
I spoke about ktask, an interface for parallelizing CPU-intensive kernel work, intending to discuss a few of the open problems in this project. The audience had a different idea, and we spent the session fielding questions about how ktask worked and where else it might be used. For example, Junaid Shahid from Google had a use case for ktask to multithread kvm dirty page tracking during live migration, but was concerned that threads would have different amounts of work to do in their assigned memory regions, leaving the load unevenly shared. My post-conference plan is to alleviate this by splitting the ranges to be tracked into small pieces interleaved across threads, minimizing the chance that one thread gets stuck with a busy range.
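The interleaving scheme can be sketched as follows (this is an illustration of the idea, not the ktask API): thread t of nthreads processes chunks t, t + nthreads, t + 2*nthreads, and so on, so a busy region is spread across all threads rather than landing on one.

```c
/* Sketch of interleaved chunking: thread tid of nthreads walks chunks
 * tid, tid + nthreads, tid + 2*nthreads, ... so no thread owns one
 * long contiguous region. */
static long process_interleaved(long start, long end, long chunk,
				int tid, int nthreads,
				long (*work)(long start, long end))
{
	long done = 0;

	for (long c = start + (long)tid * chunk; c < end;
	     c += (long)nthreads * chunk) {
		long c_end = c + chunk < end ? c + chunk : end;

		done += work(c, c_end);	/* process [c, c_end) */
	}
	return done;
}

/* Trivial work function for demonstration: count the units covered. */
static long count_units(long start, long end)
{
	return end - start;
}
```

Running all tids over the same range covers every unit exactly once; with small chunks, a single expensive region costs each thread only a slice of it.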
Finally, Yang Shi led a discussion on mmap_sem, a perennial bottleneck in the kernel that often serializes updates to a process's address space, including its rbtree of VMAs and various fields in mm_struct. This discussion is hard to summarize since there were so many comments:
To Yang Shi's suggestion about a per-VMA lock, Davidlohr Bueso said it wouldn't help alleviate contention when multiple threads update the same VMA. Vlastimil Babka suggested splitting large VMAs into many smaller ones, even if they shared the same flags, to make per-VMA locks work better. Rik van Riel was skeptical about this, since application threads may not have an even access pattern across the process's virtual address space.
On a different topic, Waiman Long warned that a strategy used in one of the recent mmap_sem fixes to alleviate contention, downgrading the holder from writer to reader, may not always help, because readers don't optimistically spin the way writers do.
Matthew Wilcox mentioned a planned experiment to use an RCU-safe B-tree (aka the Maple Tree) to avoid taking mmap_sem for read.
Steve Sistare said the problem with range locks is that you have to traverse a tree of ranges to find which range to operate on, which becomes a bottleneck in itself, and suggested instead a hashed array of locks, which parallelizes well but can suffer when the VA region being operated on is very large.
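A hedged sketch of that hashed-lock idea (the table size, granule, and hash below are invented for illustration): each granule of a VA range hashes to a slot in a fixed lock table, and a range operation takes its slots in ascending order so overlapping ranges can't deadlock.

```c
#include <pthread.h>
#include <stdint.h>

#define NR_LOCKS 64		/* assumed lock-table size */
#define GRANULE  (1UL << 21)	/* assumed 2MB of VA per slot */

static pthread_mutex_t lock_table[NR_LOCKS] = {
	[0 ... NR_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER
};

/* Hash every granule of [start, end) to a slot and record the set. */
static uint64_t range_to_slots(uintptr_t start, uintptr_t end)
{
	uint64_t slots = 0;

	for (uintptr_t g = start / GRANULE; g <= (end - 1) / GRANULE; g++)
		slots |= 1ULL << (g % NR_LOCKS);	/* trivial hash */
	return slots;
}

/* Take slots in ascending order to avoid deadlock between overlapping
 * ranges. A huge range degenerates to taking most of the table, which
 * is the weakness noted above. */
static void range_lock(uintptr_t start, uintptr_t end)
{
	uint64_t slots = range_to_slots(start, end);

	for (int i = 0; i < NR_LOCKS; i++)
		if (slots & (1ULL << i))
			pthread_mutex_lock(&lock_table[i]);
}

static void range_unlock(uintptr_t start, uintptr_t end)
{
	uint64_t slots = range_to_slots(start, end);

	for (int i = NR_LOCKS - 1; i >= 0; i--)
		if (slots & (1ULL << i))
			pthread_mutex_unlock(&lock_table[i]);
}
```

Disjoint small ranges usually hash to disjoint slots and proceed in parallel, with no range tree to traverse.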
Davidlohr Bueso mentioned that a range locking primitive exists already upstream. An rwsem will always outperform it because of optimistic spinning, but the worst case scenario described in the range locking series isn't that much worse than rwsem. He believes the main question right now is how to serialize threads operating on the same VMA.
Laurent Dufour said the problem with the VMA is that there are so many ways to get to it: mm_struct's rbtree, mm_struct's VMA list, and anon_vma lists. Matthew Wilcox agreed, and said the kernel doesn't differentiate the case where the whole address space needs protecting and the case where an individual VMA does. Laurent and Matthew agreed on the need for a per-VMA lock. Matthew hopes for a per-process spinlock for the entire address space and a semaphore for each VMA.
Thanks to Paul McKenney, Davidlohr Bueso, Dave Hansen, and Dhaval Giani, who provided helpful advice while organizing this microconference.
Here are a few recommended talks from LPC. Videos and slides are posted on the talk pages, linked from here.
Mike Kravetz's and Christoph Lameter's "Very large Contiguous regions in userspace" for its useful and very interactive discussion about how to proceed with a common problem between different kernel communities.
"RDMA and get_user_pages" from Matthew Wilcox, Dan Williams, Jan Kara, and John Hubbard for the great audience interaction, problem solving, and interesting technical content.
Vlastimil Babka's "The hard work behind large physical allocations in the kernel" because of how well it laid out current issues in this area. The slide deck is very readable on its own, for those who prefer reading to watching but are often frustrated trying to follow slide decks alone.
"Concurrency with tools/memory-model" from Andrea Parri and Paul McKenney. If your work involves memory barriers, this is a good one to watch to learn about the expectations of the maintainers of the LKMM (Linux Kernel Memory Model). It turns out they filter for upstream postings containing barriers (e.g. smp_mb) and review the changes to make sure they're correct and follow the expected commenting style for paired barriers.
Thanks to everyone who helped organize this event: it is a massive undertaking to make a conference this large happen. Excited for next year!