Persistent Memory and Oracle Linux

Scott Michael
Director, Software Development

In this blog post, Oracle Linux kernel developer Jane Chu talks about persistent memory, the support we have for it in Oracle Linux, and some examples of how to use it.

Persistent Memory Introduction

Persistent Memory Overview

More than ever, applications are driving hardware technologies. Big Data is responsible for recent advances in artificial intelligence, which demand massive parallel processing capability and immediate access to large data sets: for example, real-time business analytics that provide results based on real-time consumer and product information, or real-time traffic pattern analysis based on data collected by smart cameras. These applications present challenges in transporting, storing, and processing big data. Modern disk technology provides access to large amounts of persistent data, but access times are not fast enough. DRAM offers fast access to data, but it is generally not feasible to hold, say, one and a half terabytes of data in DRAM for CPU-intensive computation without involving I/O.

Various NVRAM solutions have been used to address the needs for speed and capacity. NAND flash offers capacity at a reasonable price, but is slow and not byte-addressable. NVDIMMs based on a combination of DDR4, supercapacitors, and flash offer high speed and byte-addressability, but their capacity is limited to that of DDR4.

A new technology developed jointly by Intel and Micron, 3D XPoint, offers all these features. Intel Optane DC PMEM is the product from Intel that uses 3D XPoint technology.

Intel Optane Data Center Persistent Memory (Optane DC PMEM)

The Optane DC PMEM DIMM is pin compatible with a DDR4 DIMM and is available in 128 GB, 256 GB, and 512 GB capacities. It has much higher endurance than NAND flash, and its density is eight times that of DDR4. It supports byte-addressable load and store instructions; read latency is roughly 4 times that of DDR4, and write latency is roughly 10 times that of DDR4.

There are two PMEM modes that are supported on Intel systems equipped with Optane DC PMEM.

2-Level memory mode (2LM)

In this mode, NVDIMMs are used as main memory, and DRAM is treated as a write-back cache.
A load instruction fetches data from DRAM if there is a cache hit; otherwise the data is fetched from the second-level memory, PMEM, incurring longer latency. A store instruction is not cached, so it always incurs the longer latency. Therefore, performance in 2LM mode depends on both the nature of the workload and the DRAM:PMEM ratio.

PMEM in 2LM mode can be considered volatile, as the hardware ensures that data is not available after a power cycle.

Application Direct mode (AppDirect)

As the name implies, this is the mode in which applications can directly access PMEM in a byte-addressable fashion. Unlike in 2LM mode, PMEM in this mode is not presented as system memory, but rather as device memory that an application can map into its address space.

PMEM in AppDirect mode is persistent, with persistence achieved via Asynchronous DRAM Refresh (ADR). ADR is a mechanism activated by the power-loss signal. It ensures that data which has reached the ADR domain is flushed to the media and persisted. The ADR domain consists of the memory controller, the Write Pending Queue (WPQ), the Transaction Pending Queue (TPQ), and the NVDIMM media. To ensure no data loss, the application is responsible for flushing its data out of the CPU caches (into the ADR domain). ADR could in theory fail due to an unqualified power supply having a signal issue or being unable to sustain sufficient voltage long enough to flush the pending queues.
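To illustrate the application's part of this contract, here is a minimal sketch (not a definitive implementation) of flushing a range of modified cache lines into the ADR domain using the CLWB and SFENCE instructions via compiler intrinsics. It assumes buf points into a PMEM mapping and a CPU/compiler with CLWB support (e.g. gcc -mclwb); in practice, libraries such as PMDK's libpmem wrap this logic.

/*
 * Sketch: write back the CPU cache lines covering [buf, buf + len) so
 * that prior stores to a PMEM mapping reach the ADR domain, where ADR
 * guarantees they will be flushed to media on power loss.
 * Assumes CLWB support; compile with e.g. "gcc -mclwb -c flush.c".
 */
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define CACHELINE 64

static void flush_to_adr_domain(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);    /* write back one cache line */

    _mm_sfence();               /* order the write-backs */
}

On a filesystem-DAX mapping, calling msync(2) on the affected range is the portable alternative to issuing the cache-flush instructions directly.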

To enable PMEM, these basic components are required:

  1. Some NVDIMMs, and CPUs that support PMEM, such as Intel's Cascade Lake
  2. BIOS or UEFI firmware that supports PMEM
  3. Kernel support: NVDIMM drivers and filesystem DAX support
  4. Administrative management tools such as ipmctl and ndctl to manage the NVDIMMs

Oracle Linux support for PMEM

Oracle is committed to providing solid PMEM support to its customers in Oracle Linux. Oracle Linux 7/UEK5 has the latest full set of Linux PMEM support, including but not limited to:

  1. device-dax support, including device-dax memory for the memory hot-plug feature
  2. filesystem-dax support, including MAP_SYNC and DAX-XFS metadata protection
  3. btt block driver support for PMEM used as a traditional block device
  4. ndctl utility

Oracle is actively participating in the upstream development of Linux PMEM support, as well as backporting new upstream PMEM features/fixes to OL/UEK kernels.

How to use PMEM on Oracle Linux

Interleaved vs. Non-Interleaved

NVDIMM interleaving works the same way as DDR4 interleaving; what is worth noting is its storage characteristics. In non-interleaved mode, each NVDIMM is like a disk that can be carved up into partitions in which filesystems can be created. In N-way interleaved mode (N >= 2), the disk is formed by the N-way striped NVDIMMs, hence a single partition spans the N participating NVDIMMs. In PMEM terms, such a disk is called a region, defined as physically contiguous memory. Its raw capacity can be partitioned into logical devices called namespaces. Interleaving is configured at the BIOS/UEFI level, hence a reboot is required for a configuration change initiated from the OS.

PMEM Configuration

To configure 2LM mode with NVDIMMs in non-interleaved mode,

# ipmctl create -goal memory-mode=100 PersistentMemoryType=AppDirectNotInterleaved

To configure 2LM mode with NVDIMMs in interleaved mode,

# ipmctl create -goal memory-mode=100 PersistentMemoryType=AppDirect

To configure NVDIMMs in AppDirect in interleaved mode,

# ipmctl create -goal memory-mode=0 PersistentMemoryType=AppDirect 

To configure NVDIMMs in AppDirect in non-interleaved mode,

# ipmctl create -goal memory-mode=0 PersistentMemoryType=AppDirectNotInterleaved 

For more information, see the ipmctl GitHub repository and the NDCTL User Guide.

On a 2-node system with 8 NVDIMMs configured in AppDirect non-interleaved mode, there will be 8 regions.

# ndctl list -Ru | grep -c region
8
# ndctl list -NRui -r region0
{
  "regions":[
    {
      "dev":"region0",
      "size":"126.00 GiB (135.29 GB)",
      "available_size":"126.00 GiB (135.29 GB)",
      "max_available_extent":"126.00 GiB (135.29 GB)",
      "type":"pmem",
      "iset_id":"0xcc18da901a1e8a22",
      "persistence_domain":"memory_controller",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":0,
          "uuid":"00000000-0000-0000-0000-000000000000",
          "sector_size":512,
          "state":"disabled"
        }
      ]
    }
  ]
}

The above shows that region0 has a capacity of 126 GiB, as indicated by the max_available_extent value, and that no namespace has been created in region0 yet. The namespace0.0 above is a seed namespace purely for programming purposes.

Now, create an fsdax-type namespace in region0 to make a PMEM block device, pmem0.

# ndctl create-namespace -m fsdax -r region0
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"124.03 GiB (133.18 GB)",
  "uuid":"e04893f8-8b50-4232-b71c-9742ea3a6a3b",
  "sector_size":512,
  "align":2097152,
  "blockdev":"pmem0"
}

where "align":2097152 indicates the default namespace alignment size: 2MiB.

Filesystem-DAX

Filesystem DAX is supported by XFS and ext2/ext4. To use FS-DAX, an fsdax-mode namespace has to be created, such as /dev/pmem0 above. Then a filesystem is created on /dev/pmem0 and mounted with the -o dax option.

To reduce memory footprint and improve performance, a 2MiB page size is preferred with FS-DAX. To achieve that, two things must be done:

  1. the fsdax namespace must be created with 2MiB alignment, as above;
  2. the mkfs parameters must be specified as below.
# mkdir /mnt_xfs
# mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -f /dev/pmem0
# mount -o dax /dev/pmem0 /mnt_xfs
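Once the filesystem is mounted with -o dax, an application can map a file and, with MAP_SYNC (one of the UEK5 features listed above), rely on CPU cache flushes alone for persistence. The sketch below is illustrative only: the file name /mnt_xfs/data is made up, and on older glibc the MAP_SYNC and MAP_SHARED_VALIDATE flags may need to be defined manually (as guarded below) or taken from <linux/mman.h>.

/*
 * Sketch: map a file on the DAX-mounted XFS filesystem with MAP_SYNC.
 * With MAP_SYNC, once stores are flushed from the CPU caches (see the
 * CLWB sketch earlier), they are persistent; no msync() is required.
 * The path /mnt_xfs/data is hypothetical.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03        /* from <linux/mman.h> */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000               /* from <linux/mman.h> */
#endif

int main(void)
{
    size_t len = 2UL << 20;             /* 2MiB, matching the namespace alignment */
    int fd = open("/mnt_xfs/data", O_CREAT | O_RDWR, 0644);

    if (fd < 0 || ftruncate(fd, len) < 0) {
        perror("open/ftruncate");
        return 1;
    }

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap(MAP_SYNC)");       /* e.g. EOPNOTSUPP if not mounted -o dax */
        return 1;
    }

    memcpy(addr, "hello pmem", 11);     /* store goes straight to PMEM pages */

    munmap(addr, len);
    close(fd);
    return 0;
}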

To verify 2MiB hugepage support, one can turn on the PMD fault debug trace and look for dax_pmd_fault_done events in the trace log.

# echo 1 > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable
# fallocate --length 1G /mnt_xfs/1G_file
# xfs_bmap -v /mnt_xfs/1G_file
/mnt_xfs/1G_file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:    8192..2105343     0 (8192..2105343)  2097152

Turn off the debug trace with:

# echo 0  > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable

Device-DAX

Device DAX is a character device that supports byte-addressable direct access to PMEM. The command below forcibly (-f) reconfigures namespace2.0 as a devdax device, setting the namespace alignment to 1GiB. The resulting device name is /dev/dax2.0.

# ndctl create-namespace -f -e namespace2.0 -m devdax -a 1G
{
  "dev":"namespace2.0",
  "mode":"devdax",
  "map":"dev",
  "size":"124.00 GiB (133.14 GB)",
  "uuid":"2d058872-4025-4440-b49a-105501d366b7",
  "daxregion":{
    "id":2,
    "size":"124.00 GiB (133.14 GB)",
    "align":1073741824,
    "devices":[
      {
        "chardev":"dax2.0",
        "size":"124.00 GiB (133.14 GB)"
      }
    ]
  },
  "align":1073741824
}

Then, a process can mmap(2) /dev/dax2.0 into its address space.
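For example, a process might map the first 1GiB of the device as sketched below. This is only an illustration: with device DAX, the mapping length and offset must be multiples of the namespace alignment (1GiB here), and persistence still requires flushing CPU caches as described earlier.

/*
 * Sketch: map one 1GiB chunk of the device-DAX device /dev/dax2.0.
 * Device DAX has no filesystem underneath; the mapping goes directly
 * to the NVDIMM media, and unaligned lengths/offsets are rejected.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1UL << 30;             /* 1GiB, matching the devdax alignment */
    int fd = open("/dev/dax2.0", O_RDWR);

    if (fd < 0) {
        perror("open /dev/dax2.0");
        return 1;
    }

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap /dev/dax2.0");
        return 1;
    }

    /* Loads and stores through addr access PMEM directly. */

    munmap(addr, len);
    close(fd);
    return 0;
}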

RDMA with PMEM device

Traditional RDMA via pinned pages works with device-dax, but is not (yet) supported with filesystem-DAX. For filesystem-DAX, RDMA via ODP (On-Demand Paging) is the alternative.
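As a rough sketch of the ODP path, the fragment below registers a buffer (assumed to come from an FS-DAX mapping such as the MAP_SYNC example above) with the IBV_ACCESS_ON_DEMAND flag, so the HCA faults pages in on demand instead of pinning them. It only shows the registration step, assumes the first RDMA device supports ODP, and omits the rest of the RDMA setup.

/*
 * Sketch: register an FS-DAX backed buffer for RDMA via On-Demand
 * Paging (ODP) rather than page pinning.  Error handling and the rest
 * of the RDMA connection setup are omitted for brevity.
 */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdio.h>

static struct ibv_mr *register_odp_mr(void *buf, size_t len)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return NULL;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) {
        ibv_free_device_list(devs);
        return NULL;
    }

    /* IBV_ACCESS_ON_DEMAND requests ODP: the HCA faults pages in as
     * needed, which is what makes RDMA to FS-DAX mappings workable. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_ON_DEMAND);
    if (!mr)
        perror("ibv_reg_mr(ODP)");

    ibv_free_device_list(devs);
    return mr;
}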

PMEM UE report and "badblocks"

Similar to DRAM, PMEM might have errors. Most errors are correctable errors (CE) that are fixed by ECC without OS intervention. Rarely, however, an uncorrectable error (UE) occurs in PMEM; when a read consumes the UE, it triggers a machine check exception (MCE) which traps the CPU, and consequently the memory error handler is invoked.

The handler marks the offending page in order to prevent it from being prefetched. If the page belongs to a kernel thread, the system will panic. If the page belongs to user processes, the page is unmapped from every process that has a mapping of it, and the kernel then sends a SIGBUS to the user process. In the signal payload, these fields are worth noting (a minimal handler sketch follows the list):

  • .si_code := always BUS_MCEERR_AR for a PMEM UE
  • .si_addr := page-aligned user address
  • .si_addr_lsb := the least significant bit (LSB) of the reported address, indicating the extent of the error; for a 2MiB page, that's 21.
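Below is a minimal, illustrative handler that inspects this payload. It assumes a glibc recent enough to expose si_addr_lsb and the BUS_MCEERR_AR constant; a real application would need to recover or discard the affected data rather than simply exit.

/*
 * Sketch: catch the SIGBUS delivered when a PMEM uncorrectable error is
 * consumed and report the poisoned range from the siginfo payload.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void pmem_sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;

    if (si->si_code == BUS_MCEERR_AR) {
        /* si_addr is page aligned; si_addr_lsb gives the extent of the
         * error, e.g. 21 for a 2MiB page.  (fprintf here is for
         * illustration; it is not strictly async-signal-safe.) */
        fprintf(stderr, "UE consumed at %p, 2^%d bytes affected\n",
                si->si_addr, (int)si->si_addr_lsb);
    }
    _exit(1);   /* async-signal-safe exit from the handler */
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = pmem_sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    /* ... map PMEM and access it; a poisoned load lands in the handler ... */
    return 0;
}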

In addition, the libnvdimm driver subscribes to MCE notifications in order to get a record of the UE. The driver maintains a badblocks list to keep track of the known UEs. A known UE is also called poison, as the initial consumption of the UE causes the poison bit to be set in the media.

Here is the command that displays the existing badblocks.

# ndctl list -n namespace6.0 --media-errors
[
  {
    "dev":"namespace6.0",
    "mode":"devdax",
    "map":"mem",
    "size":135289372672,
    "uuid":"d45108b4-2b9d-48f6-a55b-1a095fd1eb51",
    "chardev":"dax6.0",
    "align":2097152,
    "badblock_count":3,
    "badblocks":[
      {
        "offset":4128778,
        "length":1,
        "dimms":[
          "nmem6"
        ]
      },
    ]
  }
]

Following block device tradition, the bad blocks are in units of 512 bytes.
The offset starts from the beginning of the user-visible area of the namespace. In the above example, offset = 4128778 * 512 = 0x7e001400. If a process mmap()s /dev/dax6.0 entirely at virtual address vaddr, then vaddr + 0x7e001400 is the starting address of the poisoned block.
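As a small illustration of that arithmetic (the mapping base address below is made up):

/*
 * Sketch: translate the badblock entry above into the poisoned virtual
 * address range, assuming /dev/dax6.0 is mapped in its entirety at
 * `base`.  The base value is purely illustrative.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t base   = 0x7f4a40000000ULL;  /* hypothetical mmap() return value */
    uint64_t offset = 4128778;            /* "offset" from ndctl list --media-errors */
    uint64_t length = 1;                  /* "length", in 512-byte units */

    uint64_t start = base + offset * 512; /* = base + 0x7e001400 */
    uint64_t end   = start + length * 512;

    printf("poisoned range: [0x%llx, 0x%llx)\n",
           (unsigned long long)start, (unsigned long long)end);
    return 0;
}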

To clear the poison, one may issue

# ndctl clear-errors -r region6 --scrub -v

The command scrubs the entire region6, clearing the known poison as well as poison from UEs in the region that have not yet been consumed.

PMEM Emulation using DRAM

It is possible to emulate PMEM with DRAM via the memmap kernel parameter.

First, examine the e820 table via dmesg and select a usable region.

[    0.000000] user: [mem 0x0000000100000000-0x000000407fffffff] usable

Second, add the memmap parameters to the boot command line and reboot: memmap=16G!20G memmap=16G!36G memmap=16G!52G memmap=16G!68G. Each memmap=nn!ss entry reserves nn of RAM, starting at offset ss, as emulated persistent memory (e820 type 12).

After the system boots up, dmesg shows

[    0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] usable
[    0.000000] user: [mem 0x0000000500000000-0x00000008ffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000000900000000-0x0000000cffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000000d00000000-0x00000010ffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000001100000000-0x00000014ffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000001500000000-0x000000407fffffff] usable

By default, the four memmap regions are emulated as four fsdax devices:

$ sudo ndctl list -Nu
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"mem",
    "size":"16.00 GiB (17.18 GB)",
    "sector_size":512,
    "blockdev":"pmem1"
  },
  {
    "dev":"namespace3.0",
    "mode":"fsdax",
    "map":"mem",
    "size":"16.00 GiB (17.18 GB)",
    "sector_size":512,
    "blockdev":"pmem3"
  },
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"mem",
    "size":"16.00 GiB (17.18 GB)",
    "sector_size":512,
    "blockdev":"pmem0"
  },
  {
    "dev":"namespace2.0",
    "mode":"fsdax",
    "map":"mem",
    "size":"16.00 GiB (17.18 GB)",
    "sector_size":512,
    "blockdev":"pmem2"
  }
]
