Folio Wins Without Trying

Where git bisect might fail to find the culprit commit, manual bisect can work, and that suggests perhaps some enhancement might be required for git bisect.

Why `git bisect` fails?

It has been a common practice to run git bisect between a known good commit and a known bad commit to identify the culprit commit behind a certain issue or responsible for a certain fix. While sometimes this tool works, other times it produces puzzling results in that, the indicted commit is clearly innocent, couldn’t possiblly be the cause to or fix for a bug.

It appears that the failures are due to merge. As Linux kernel developers know that there are dozens of Linux kernel sub-area such as cpu scheduler, memory management, architectural dependent low level kernel infrastructures and tons of drivers. Some sub-areas have a few more sub-areas to it, all together they form a short tree with a very wide span or branches so-to-speak. The Linux kernel release process involves merging commits from the sub-area, or sub-tree, sub-branch, into higher level sub-trees, and ultimately into Linus’ main tree for release.

The process of git merge is like a card player with stacks of cards, each stack labeled, and he fork-merges the stacks to form one stack of cards. Consequently in the final stack, two adjascent cards might not come from the same sub stack. But if one were to checkout a card, every card in the same sub-stack follows since they share the same label.

I recently encountered a puzzling git bisect experience, luckily the indicted commit is clearly innocent, potentially saving a lot time from dwelling in a rabbit hole.

How Manual bisecting can work?

In my experience, a close examination at the adjacent good and bad commits revealed that they do not share a common ancestor, indicating they came from different merge tags. Another way to view the innocent commit being indicted is that the true culprit commit does share a common ancestor with the innocent commit, at some merge level.

Hence the next step is to identify the closest common ancestor, then go down the hierarchy.

There appear to be two ways to do this –

If already armed with an innocent victim, trace backwards to its closest Merge tag, determine if it is bad. If not, then move up to the next closest Merge tag until a bad Merge tag is identified.
If you start out fresh, bisect the top level Merge tags, then move down the merge tag hierarchy until the lowest level Merge tag is identified.

Once the lowest level Merge tag is identified, git bisect should be able to reliably identify the bad commit. Note that the hierarchy is unlikely a balanced tree, and a Merge tag can host both sub-level Merge tag as well as leaf-level commits.

A Recent Experience

A performance regression issue in UEK8 (Linux kernel v6.12) against UEK7U3 (Linux Stable kernel v5.15. ) has been reported. The issue is that MR (Memory Region) reg (registration) performance has dropped to about half as much for device-dax backed MRs, but not for system RAM backed MRs.

The relevant call chain in the kernel for the MR reg performance test is

    ib_uverbs_reg_mr
      mlx5_ib_reg_user_mr
        ib_umem_get

Bisecting Linux kernel releases between v5.15. and v6.12 quickly identified v6.2-rc1 release as bad, and the last known good release is v6.1-rc8. There are 14K commits between v6.1-rc8 and v6.2-rc1, merge tags included:

$ git log --oneline v6.1-rc8..v6.2-rc1 | wc -l
14762

git bisect indicted commit 44a84da45272, with its immediate “chronically” preceding neighbour commit 3e844d842d49 as the last good commit:

785d21ba2f44 Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio
8fa590bf3448 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
44a84da45272 io_uring: use call_rcu_hurry if signaling an eventfd   <-- BAD
3e844d842d49 x86/mm: Ensure forced page table splitting         <-- GOOD
1cfaac2400c7 x86/kasan: Populate shadow for shared chunk of the CPU entry area
bde258d97409 x86/kasan: Add helpers to align shadow addresses up and down

However, commit 44a84da45272 only has a one line change that does not even apply to UEK8. Besides, the MR reg performance test has nothing to do with io_uring:

--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -538,7 +538,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
        } else {
                atomic_inc(&ev_fd->refs);
                if (!atomic_fetch_or(BIT(IO_EVENTFD_OP_SIGNAL_BIT), &ev_fd->ops))
-                       call_rcu(&ev_fd->rcu, io_eventfd_ops);
+                       call_rcu_hurry(&ev_fd->rcu, io_eventfd_ops);
                else
                        atomic_dec(&ev_fd->refs);
        }

#ifdef CONFIG_RCU_LAZY
void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func);
#else
static inline void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func)
{
    call_rcu(head, func);
}
#endif

where CONFIG_RCU_LAZY is not selected in UEK8, in fact, it’s not selected by default in any OL release.

However interestingly, the good and bad commits did not come from the same sub-branch:

$ git merge-base --is-ancestor 3e844d842d49 44a84da45272 && echo "Yes" || echo "No"
No

That said, the real bad commit should share a same sub-branch with commit 44a84da45272, though not necessarily the same immediate sub-branch. Here is a snippet of the massive --graph view showing the relevant Merge tags:

$ git log --graph --merges --oneline v6.1-rc8..v6.2-rc1
* 72a85e2b0a1e Merge tag 'spi-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
* 0a023cbb11e3 Merge tag 'regulator-fix-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
[..]
| | *   1644d755d0b0 Merge branch 'linus'
| | |\
| | * \   b29e6ece454f Resync master with latest Linus upstream
| | |\ \
[..]
| | * | ce8a79d5601a Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux
| | | | * 8f415307c3ca Merge tag 'nvme-6.2-2022-12-07' of git://git.infradead.org/nvme into for-6.2/block
| | | | * b1476451488b Merge tag 'floppy-for-6.2' of https://github.com/evdenis/linux-floppy into for-6.2/block
| | | | * 368c7f1f8a68 Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
| | | | * 8613dec04e74 Merge tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme into for-6.2/block
| | | | * 5626196a5ae0 Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
| | * | 96f7e448b9f4 Merge tag 'for-6.2/io_uring-next-2022-12-08' of git://git.kernel.dk/linux

In the above --graph view, Merge tags start from * are at the top level, next level down are Merge tags starting from | *, and so on.

Bisecting on Merge tags shows that ce8a79d5601a is the guilty party. So the next step is git checkout -b Merge-tag-ce8a79d5601a ce8a79d5601a in order to bisect further down:

* | ce8a79d5601a Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux
|\
[..]
| * 6d4338cb4070 ABI: sysfs-bus-pci: add documentation for p2pmem allocate
| * 7e9c7ef83d78 PCI/P2PDMA: Allow userspace VMA allocations through sysfs
| * 7ee4ccf57484 block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
| * 5e3e3f2e15df block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
| * 1567b49d1a40 lib/scatterlist: add check when merging zone device pages  <-- BAD
| * 49580e690755 block: add check when merging zone device pages        <-- GOOD
| * d82076403cef iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

The true bad commit is 1567b49d1a40, [PATCH v11 5/9] in patch series [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices. Further debug shows that the regression came from this one line change

    ib_uverbs_reg_mr
      mlx5_ib_reg_user_mr
        ib_umem_get
          sg_alloc_append_table_from_pages
            pages_are_mergeable
              zone_device_pages_have_same_pgmap(a,b)
                return a->pgmap == b->pgmap     <-- culprit

where a and b are struct page * pointers that point to the MR pages.

This additional line was added to ensure that pages not backed by the same ZONE_DEVICE shall not be merged – a concern that applies to PCI P2PDMA use cases but irrelevant to the MR reg performence test. Therefore it is possible to replace return a->pgmap == b->pgmap with return true as an experiment. The result came out that with this slight modification, the regression disappeared.

Two take aways from the experiment:

It explains why the regression issue follows device-dax backed MR and not system RAM backed MR, because for system RAM back MR, the return a->pgmap == b->pgmap line is not reached;
It suggests that the cause to regression is CPU cache thrashing, as in the test scenario, 31 million pages are subjected to the ->pgmap comparsons in a relatively tight loop. The test system is an X9-2, each socket has 48K L1 cache, once the execution enters the loop, the L1 cachelines are quickly being replaced akin to a game of musical chairs.

The Fix

There is a fix for this issue, in v6.15, although the set of patches potentially needed to be backported are going to be more than one, the precise patch that nailed the issue is commit 82ba975e4c43 mm: allow compound zone device pages. The idea of the fix boils down to one line:

    return page_pgmap(page_folio(a)->pgmap, page_folio(b)->pgmap)`

where page_folio(a) adds no additional cache footprint in the test scenario, and foio->pgmap adds only additional 1/512, or roughly 0.2% cache footprint due to the fact device-dax instances are created to present device memory in 2MB-pagesize hugepages by default.

Note that, although the stated purpose of the patch is about doing the sensible thing via folio rather than fixing someone else’ performance issue, it ends up fixing a real world performance regression issue.

Thus thanks to Matthew Wilcox for the foresight of redesigning one of the most fundational data structures in Linux Kernel Memory Management. By replacing struct page with struct folio, people like me keep discovering benefits here and there. Some might argue the same can be achieved with compound page, true, but it’s less likely for developers to form a habit of converting page to compound head whereever possible like they do with folio. Perhaps it has something to do with a sense that the concept of compound page that bundles multiple pages sharing the same characteristics in a group is ‘special’, while the concept of folio treating bundles of 1 or many pages as equally elemental.

`git bisect` – Room for Improvement?

While manual bisecting comes with the benefit of opportunistically utilizing human insights, such as traversing on Merge tags, skipping obviously unrelated commits, or not bisecting in the middle of a large patch series too soon, etc, it’s time consuming, hence not fit for automated bisecting tasks.

Would it be feasible to add a couple of additional features to git bisect?

An option to bisect on Merge tags in a top-down fashion before bisecting leaf commits;
An option to provide a preview given the GOOD and BAD commits, showing the landscape such as Merge tag trees (or hierachy), perhaps just the number of Merge tags at each hierarchical level.

Perhaps someone quite savvy with git might work out some improvement that will benefit the entire community!

Folio Wins Without Trying

Why `git bisect` fails?

How Manual bisecting can work?

A Recent Experience

The Fix

`git bisect` – Room for Improvement?

References

Jane Chu

Binary Compatibility and OpenELA's ELValidated Project

A Closer Look: What Makes Oracle Linux the Ideal Choice for Running Oracle Database

Folio Wins Without Trying

Why git bisect fails?

How Manual bisecting can work?

A Recent Experience

The Fix

git bisect – Room for Improvement?

References

Authors

Jane Chu

Binary Compatibility and OpenELA's ELValidated Project

A Closer Look: What Makes Oracle Linux the Ideal Choice for Running Oracle Database

Why `git bisect` fails?

`git bisect` – Room for Improvement?