Introduction

In Linux storage management, ensuring data availability, reliability, and performance is paramount. Device Mapper Multipath stands as a robust solution designed to address these critical requirements within a storage environment. This technology, integral to the Linux kernel, facilitates the efficient management of multiple paths to storage devices, offering enhanced fault tolerance and load balancing.

Device Mapper Multipath, commonly known as DM Multipath, is a Linux kernel feature that enables the aggregation of multiple physical paths to storage devices into a single logical device. It plays a pivotal role in storage area network (SAN) environments where multiple paths exist between server nodes and storage arrays.

The primary purpose of DM Multipath is to ensure high availability and reliability of storage by managing redundant paths to storage devices. By presenting a unified view of these paths to the system, it not only offers fault tolerance against path failures but also enhances I/O performance through load balancing across available paths.

Key Features and Functionality

DM Multipath achieves its functionality through various key features:

  • Path Redundancy: It enables the system to continue operations even if one or more paths to the storage device fail, thereby reducing the risk of data unavailability or loss. An active/passive configuration is an example.

  • Load Balancing: By distributing I/O requests across multiple paths, it optimizes throughput and improves overall performance. An active/active configuration is an example.

  • Failover and Recovery: In case of path failure, DM Multipath automatically reroutes I/O operations to alternate paths, ensuring seamless continuity of operations.

  • Configuration Flexibility: Administrators can configure and customize DM Multipath to suit specific storage setups and requirements using configuration files and utilities.

Multipath Components

A working multipath setup consists of a number of components divided between user space and kernel space.

User Space

Multipath package: Install the device-mapper-multipath package:

yum install device-mapper-multipath

mpathconf tool: This tool helps configure DM Multipath by creating or editing the file multipath.conf. The command below creates and enables a default multipath configuration and also starts the multipath daemon. The configuration file is located at /etc/multipath.conf.

mpathconf --enable --with_multipathd y

Without user-friendly names, multipath devices are named after their WWID, which is hard for users to read. To give devices user-friendly names, use the following command:

mpathconf --enable --user_friendly_names y --with_multipathd y

A sample file for multipath.conf is located at the following location:

/usr/share/doc/device-mapper-multipath/multipath.conf

After configuring multipath, you can edit /etc/multipath.conf to suit your requirements. The default configuration file contains the following settings:

# device-mapper-multipath configuration file

# For a complete list of the default configuration values, run either:
# # multipath -t
# or
# # multipathd show config

# For a list of configuration options with descriptions, see the
# multipath.conf man page.

defaults {
        user_friendly_names yes
        find_multipaths yes
        enable_foreign "^$"
}

blacklist_exceptions {
        property "(SCSI_IDENT_|ID_WWN)"
}

blacklist {
}
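
The section and keyword layout of this file is simple enough to read programmatically. The following is a minimal sketch, in Python, of a reader for the flat "section { keyword value }" form shown above. It is not how multipath-tools parses the file, it does not handle nested sections, and the name parse_multipath_conf is made up for this illustration.

```python
import shlex

SAMPLE = '''
defaults {
        user_friendly_names yes
        find_multipaths yes
        enable_foreign "^$"
}

blacklist_exceptions {
        property "(SCSI_IDENT_|ID_WWN)"
}

blacklist {
}
'''

def parse_multipath_conf(text):
    """Parse flat 'section { keyword value }' blocks into a dict of dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.endswith("{"):
            current = sections.setdefault(line[:-1].strip(), {})
        elif line == "}":
            current = None
        elif current is not None:
            tokens = shlex.split(line)    # strips the surrounding quotes
            current[tokens[0]] = " ".join(tokens[1:])
    return sections

conf = parse_multipath_conf(SAMPLE)
print(conf["defaults"])
```

For the authoritative view of the effective configuration, the commands named in the file header (multipath -t or multipathd show config) remain the right tool.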

multipathd daemon: This daemon automatically creates and removes multipath devices and monitors their paths. Below are the main responsibilities of the multipathd daemon:

  • Performs path discovery, path policy management, configuration, and path health testing.
  • Is responsible for discovering the network topology for multipath block devices.
  • Path discovery involves determining the set of routes from a host to a particular block device which is configured for multipathing. Path discovery is done by scanning sysfs looking for the block devices.
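
To make the sysfs scan concrete, here is a small Python sketch that lists block devices and their major:minor numbers by reading the dev attribute under /sys/block. It only illustrates the idea; multipathd is a C program and does considerably more (grouping paths by WWID, checking path health, and so on). The function name list_block_devices is an assumption of this sketch, and the demonstration runs against a temporary fake tree so that it works on any machine.

```python
import os
import tempfile

def list_block_devices(sysfs_block="/sys/block"):
    """Return (name, 'major:minor') for each block device listed in sysfs."""
    devices = []
    if not os.path.isdir(sysfs_block):
        return devices
    for name in sorted(os.listdir(sysfs_block)):
        dev_file = os.path.join(sysfs_block, name, "dev")
        try:
            with open(dev_file) as f:
                devices.append((name, f.read().strip()))
        except OSError:
            continue  # entry vanished or has no readable dev attribute
    return devices

# Demonstration against a fake sysfs tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for name, devno in (("sdd", "8:48"), ("sde", "8:64")):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "dev"), "w") as f:
            f.write(devno + "\n")
    found = list_block_devices(root)

print(found)
```

On a Linux host, calling list_block_devices() with no argument reads the real /sys/block.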

Below is the command to start and enable the multipathd daemon.

systemctl start multipathd
systemctl enable --now multipathd

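Remember that any change made in multipath.conf takes effect only after the multipathd daemon re-reads the file, for example through a restart. The commands below flush the unused multipath device maps and restart the daemon: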

multipath -F
systemctl restart multipathd.service

multipath command: This command lists and configures multipath devices. Udev also runs it whenever a new block device is added.

After enabling the multipath feature, we can view the configured devices by running the following command:

multipath -ll

Sample output:

multipath -ll
mpathm (3600140589176222290a46e19b13e1d6d) dm-0 LIO-ORG,IBLOCK
size=100G features='0' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 6:0:0:0  sde 8:64 active ready running
`-+- policy='round-robin 0' prio=50 status=enabled
  `- 4:0:0:0  sdd 8:48 active ready running
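
When this output needs to be consumed by a script, it can be turned into a nested structure of maps, path groups, and paths. The Python sketch below parses exactly the sample shown above. The regular expressions are assumptions fitted to this sample (they expect the user-friendly-name header form) and are not a complete grammar for every multipath -ll layout.

```python
import re

SAMPLE = """\
mpathm (3600140589176222290a46e19b13e1d6d) dm-0 LIO-ORG,IBLOCK
size=100G features='0' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 6:0:0:0  sde 8:64 active ready running
`-+- policy='round-robin 0' prio=50 status=enabled
  `- 4:0:0:0  sdd 8:48 active ready running
"""

HEADER = re.compile(r"^(\S+) \((\w+)\) (dm-\d+) (.+)$")
GROUP = re.compile(r"policy='([^']*)' prio=(\d+) status=(\w+)")
PATH = re.compile(r"(\d+:\d+:\d+:\d+)\s+(\S+)\s+(\d+:\d+)\s+(\w+)\s+(\w+)\s+(\w+)")

def parse_multipath_ll(text):
    """Return a list of maps, each with its path groups and paths."""
    maps = []
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            maps.append({"name": header[1], "wwid": header[2],
                         "dm": header[3], "groups": []})
            continue
        if not maps:
            continue
        group = GROUP.search(line)
        if group:
            maps[-1]["groups"].append({"policy": group[1],
                                       "prio": int(group[2]),
                                       "status": group[3], "paths": []})
            continue
        path = PATH.search(line)
        if path and maps[-1]["groups"]:
            maps[-1]["groups"][-1]["paths"].append(
                {"hcil": path[1], "dev": path[2], "devno": path[3],
                 "dm_state": path[4]})
    return maps

maps = parse_multipath_ll(SAMPLE)
for group in maps[0]["groups"]:
    print(group["status"], [p["dev"] for p in group["paths"]])
```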

Kernel Space

dm_multipath kernel module: This kernel module reroutes IO and supports failover for paths and path groups.

dm-ioctl

dm-ioctl facilitates the communication between user space and the device mapper kernel module. It serves as the interface through which user-level applications or utilities interact with the device mapper, issuing commands and receiving responses. This interface is used to configure multipath devices.

/*-----------------------------------------------------------------
 * Implementation of open/close/ioctl on the special char
 * device.
 *---------------------------------------------------------------*/
static ioctl_fn lookup_ioctl(unsigned int cmd, int *ioctl_flags)
{
        static const struct {
                int cmd;
                int flags;
                ioctl_fn fn;
        } _ioctls[] = {
                {DM_VERSION_CMD, 0, NULL}, /* version is dealt with elsewhere */
                {DM_REMOVE_ALL_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, remove_all},
                {DM_LIST_DEVICES_CMD, 0, list_devices},

                {DM_DEV_CREATE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_create},
                {DM_DEV_REMOVE_CMD, IOCTL_FLAGS_NO_PARAMS | IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_remove},
                {DM_DEV_RENAME_CMD, IOCTL_FLAGS_ISSUE_GLOBAL_EVENT, dev_rename},
                {DM_DEV_SUSPEND_CMD, IOCTL_FLAGS_NO_PARAMS, dev_suspend},
                {DM_DEV_STATUS_CMD, IOCTL_FLAGS_NO_PARAMS, dev_status},
                {DM_DEV_WAIT_CMD, 0, dev_wait},

                {DM_TABLE_LOAD_CMD, 0, table_load},
                {DM_TABLE_CLEAR_CMD, IOCTL_FLAGS_NO_PARAMS, table_clear},
                {DM_TABLE_DEPS_CMD, 0, table_deps},
                {DM_TABLE_STATUS_CMD, 0, table_status},

                {DM_LIST_VERSIONS_CMD, 0, list_versions},

                {DM_TARGET_MSG_CMD, 0, target_message},
                {DM_DEV_SET_GEOMETRY_CMD, 0, dev_set_geometry},
                {DM_DEV_ARM_POLL, IOCTL_FLAGS_NO_PARAMS, dev_arm_poll},
                {DM_GET_TARGET_VERSION, 0, get_target_version},
        };

        if (unlikely(cmd >= ARRAY_SIZE(_ioctls)))
                return NULL;

        cmd = array_index_nospec(cmd, ARRAY_SIZE(_ioctls));
        *ioctl_flags = _ioctls[cmd].flags;
        return _ioctls[cmd].fn;
}
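
The function above is a table-driven dispatch: the command number indexes an array of (flags, handler) entries after a bounds check. The same pattern can be sketched in Python as follows. The handlers and command numbers here are placeholders rather than the real DM values, and the speculation barrier provided by array_index_nospec has no Python counterpart, so it is omitted.

```python
def list_devices(params):
    return "list_devices called"

def table_load(params):
    return "table_load called"

# The index in the list is the command number, mirroring _ioctls[] above.
# Entries and numbering are placeholders for illustration only.
HANDLERS = [
    (0, None),           # slot 0 is handled elsewhere, like DM_VERSION_CMD
    (0, list_devices),
    (0, table_load),
]

def lookup_ioctl(cmd):
    """Return (flags, handler), or None when cmd is out of range."""
    if not 0 <= cmd < len(HANDLERS):
        return None
    return HANDLERS[cmd]

flags, handler = lookup_ioctl(2)
print(handler({}))
```

Adding a command then means adding one table entry, which is why the kernel keeps the handlers in a single array.
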

The header include/uapi/linux/dm-ioctl.h documents each of these commands:

/*
 * A traditional ioctl interface for the device mapper.
 *
 * Each device can have two tables associated with it, an
 * 'active' table which is the one currently used by io passing
 * through the device, and an 'inactive' one which is a table
 * that is being prepared as a replacement for the 'active' one.
 *
 * DM_VERSION:
 * Just get the version information for the ioctl interface.
 *
 * DM_REMOVE_ALL:
 * Remove all dm devices, destroy all tables.  Only really used
 * for debug.
 *
 * DM_LIST_DEVICES:
 * Get a list of all the dm device names.
 *
 * DM_DEV_CREATE:
 * Create a new device, neither the 'active' or 'inactive' table
 * slots will be filled.  The device will be in suspended state
 * after creation, however any io to the device will get errored
 * since it will be out-of-bounds.
 *
 * DM_DEV_REMOVE:
 * Remove a device, destroy any tables.
 *
 * DM_DEV_RENAME:
 * Rename a device or set its uuid if none was previously supplied.
 *
 * DM_SUSPEND:
 * This performs both suspend and resume, depending which flag is
 * passed in.
 * Suspend: This command will not return until all pending io to
 * the device has completed.  Further io will be deferred until
 * the device is resumed.
 * Resume: It is no longer an error to issue this command on an
 * unsuspended device.  If a table is present in the 'inactive'
 * slot, it will be moved to the active slot, then the old table
 * from the active slot will be _destroyed_.  Finally the device
 * is resumed.
 *
 * DM_DEV_STATUS:
 * Retrieves the status for the table in the 'active' slot.
 *
 * DM_DEV_WAIT:
 * Wait for a significant event to occur to the device.  This
 * could either be caused by an event triggered by one of the
 * targets of the table in the 'active' slot, or a table change.
 *
 * DM_TABLE_LOAD:
 * Load a table into the 'inactive' slot for the device.  The
 * device does _not_ need to be suspended prior to this command.
 *
 * DM_TABLE_CLEAR:
 * Destroy any table in the 'inactive' slot (ie. abort).
 *
 * DM_TABLE_DEPS:
 * Return a set of device dependencies for the 'active' table.
 *
 * DM_TABLE_STATUS:
 * Return the targets status for the 'active' table.
 *
 * DM_TARGET_MSG:
 * Pass a message string to the target at a specific offset of a device.
 *
 * DM_DEV_SET_GEOMETRY:
 * Set the geometry of a device by passing in a string in this format:
 *
 * "cylinders heads sectors_per_track start_sector"
 *
 * Beware that CHS geometry is nearly obsolete and only provided
 * for compatibility with dm devices that can be booted by a PC
 * BIOS.  See struct hd_geometry for range limits.  Also note that
 * the geometry is erased if the device size changes.
 */
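
The two-slot model described in this comment (load into the 'inactive' slot, promote on resume) can be illustrated with a toy Python class. This is a sketch of the described behaviour only. The class DmDevice and its method names are invented for the example, and real tables are kernel structures rather than strings.

```python
class DmDevice:
    """Toy model of the 'active' / 'inactive' table slots described above."""

    def __init__(self, name):
        self.name = name
        self.active = None
        self.inactive = None
        self.suspended = True   # per the comment, a new device starts suspended

    def table_load(self, table):
        # DM_TABLE_LOAD: only the inactive slot changes.
        self.inactive = table

    def table_clear(self):
        # DM_TABLE_CLEAR: drop whatever is staged in the inactive slot.
        self.inactive = None

    def suspend(self):
        self.suspended = True

    def resume(self):
        # Resume: promote a staged table; the old active table is discarded.
        if self.inactive is not None:
            self.active = self.inactive
            self.inactive = None
        self.suspended = False

dev = DmDevice("mpathm")
dev.table_load("table A")
dev.resume()
dev.table_load("table B")          # staged, not yet in use
staged_view = (dev.active, dev.inactive)
dev.resume()
print(staged_view, dev.active)
```

The point of the staging slot is that a replacement table can be prepared while I/O keeps flowing through the active one.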

/*
 * All ioctl arguments consist of a single chunk of memory, with
 * this structure at the start.  If a uuid is specified any
 * lookup (eg. for a DM_INFO) will be done on that, *not* the
 * name.
 */
struct dm_ioctl {
        /*
         * The version number is made up of three parts:
         * major - no backward or forward compatibility,
         * minor - only backwards compatible,
         * patch - both backwards and forwards compatible.
         *
         * All clients of the ioctl interface should fill in the
         * version number of the interface that they were
         * compiled with.
         *
         * All recognised ioctl commands (ie. those that don't
         * return -ENOTTY) fill out this field, even if the
         * command failed.
         */
        __u32 version[3];       /* in/out */
        __u32 data_size;        /* total size of data passed in
                                 * including this struct */

        __u32 data_start;       /* offset to start of data
                                 * relative to start of this struct */

        __u32 target_count;     /* in/out */
        __s32 open_count;       /* out */
        __u32 flags;            /* in/out */

        /*
         * event_nr holds either the event number (input and output) or the
         * udev cookie value (input only).
         * The DM_DEV_WAIT ioctl takes an event number as input.
         * The DM_SUSPEND, DM_DEV_REMOVE and DM_DEV_RENAME ioctls
         * use the field as a cookie to return in the DM_COOKIE
         * variable with the uevents they issue.
         * For output, the ioctls return the event number, not the cookie.
         */
        __u32 event_nr;         /* in/out */
        __u32 padding;

        __u64 dev;              /* in/out */

        char name[DM_NAME_LEN]; /* device name */
        char uuid[DM_UUID_LEN]; /* unique identifier for
                                 * the block device */
        char data[7];           /* padding or data */
};
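
To see how this header is laid out in memory, the structure can be mirrored with Python's ctypes. This is a sketch for inspection rather than a working ioctl client. DM_NAME_LEN and DM_UUID_LEN are taken as 128 and 129 (their values in dm-ioctl.h), the version triple (4, 0, 0) is only an example value, and new_request and is_compatible are hypothetical helpers. The compatibility rule follows the version comment above: the major numbers must match, and the kernel's minor must be at least the client's.

```python
import ctypes

DM_NAME_LEN = 128   # value from dm-ioctl.h
DM_UUID_LEN = 129   # value from dm-ioctl.h

class DmIoctl(ctypes.Structure):
    _fields_ = [
        ("version", ctypes.c_uint32 * 3),
        ("data_size", ctypes.c_uint32),
        ("data_start", ctypes.c_uint32),
        ("target_count", ctypes.c_uint32),
        ("open_count", ctypes.c_int32),
        ("flags", ctypes.c_uint32),
        ("event_nr", ctypes.c_uint32),
        ("padding", ctypes.c_uint32),
        ("dev", ctypes.c_uint64),
        ("name", ctypes.c_char * DM_NAME_LEN),
        ("uuid", ctypes.c_char * DM_UUID_LEN),
        ("data", ctypes.c_char * 7),
    ]

def new_request(name, version=(4, 0, 0)):
    """Build a header-only request with no payload after the struct."""
    req = DmIoctl()
    req.version = (ctypes.c_uint32 * 3)(*version)
    req.data_size = ctypes.sizeof(DmIoctl)    # total size, header included
    req.data_start = ctypes.sizeof(DmIoctl)   # payload would begin here
    req.name = name.encode()
    return req

def is_compatible(client, kernel):
    """Majors must match; the kernel minor must be >= the client minor."""
    return client[0] == kernel[0] and client[1] <= kernel[1]

req = new_request("mpathm")
print(ctypes.sizeof(DmIoctl), DmIoctl.dev.offset, req.name)
```

The trailing data[7] member pads the structure so that any payload appended after it starts on an 8-byte boundary.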

DM Multipath Table

To create and configure a multipath device, multipathd sends the device data to the multipath driver through the DM_TABLE_LOAD ioctl, which is handled by table_load. This data includes the features to enable on the paths, the path selection policy, and the path details. The configuration itself comes from /etc/multipath.conf. Below is an example of what this table looks like:

[root@shmsingh-multipath ~]# dmsetup table /dev/mapper/mpathm
0 209715200 multipath 0 1 alua 2 1 round-robin 0 1 1 8:16 1 round-robin 0 1 1 8:32 1

Here mpathm is a multipath device with two path groups. Each path group contains one path (8:16 and 8:32 respectively) and uses the round-robin path selection policy. Each field in the table format is explained below:

 

Arg          Description
0            target starting offset
209715200    target length in 512-byte blocks
multipath    target type
0            number of feature arguments; if non-zero, the feature names follow (e.g. queue_if_no_path)
1            number of hardware handler arguments
alua         hardware handler
2            number of priority groups
1            next path group to try
round-robin  path selection policy
0            number of selector arguments (zero in this example)
1            number of paths in this group
1            number of path arguments (1 in this example)
8:16         path major:minor numbers
1            number of I/O requests to send to this path before switching

 

Note: This table output can vary depending upon the feature set, the path selection policy set and other variables.
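
The field-by-field description above translates directly into a small parser. The Python sketch below walks the example line using the counts embedded in it: each count says how many tokens follow. It is written against the layout explained in this section and has only been checked against the sample line. The function name and the dictionary keys are choices made for this illustration.

```python
def parse_multipath_table(line):
    """Parse one 'dmsetup table' line of a multipath target into a dict."""
    tokens = line.split()
    pos = 0

    def take(n=1):
        nonlocal pos
        chunk = tokens[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("table line ended early")
        pos += n
        return chunk

    start, length, target = take(3)
    if target != "multipath":
        raise ValueError("not a multipath target")
    features = take(int(take()[0]))      # count, then that many feature words
    hw_handler = take(int(take()[0]))    # count, then handler name and args
    num_groups, next_group = (int(x) for x in take(2))
    groups = []
    for _ in range(num_groups):
        selector = take()[0]
        selector_args = take(int(take()[0]))
        num_paths, num_path_args = (int(x) for x in take(2))
        paths = []
        for _ in range(num_paths):
            dev = take()[0]
            paths.append({"dev": dev, "args": take(num_path_args)})
        groups.append({"selector": selector,
                       "selector_args": selector_args,
                       "paths": paths})
    return {"start": int(start), "length_sectors": int(length),
            "features": features, "hw_handler": hw_handler,
            "next_group": next_group, "groups": groups}

LINE = ("0 209715200 multipath 0 1 alua 2 1 "
        "round-robin 0 1 1 8:16 1 round-robin 0 1 1 8:32 1")
table = parse_multipath_table(LINE)
print(table["length_sectors"] * 512 // 2**30, "GiB")
print([g["paths"][0]["dev"] for g in table["groups"]])
```

The length of 209715200 sectors of 512 bytes corresponds to the 100G size reported by multipath -ll earlier.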

dm-mpath Driver Module

static struct target_type multipath_target = {
        .name = "multipath",
        .version = {1, 14, 0},
        .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
                    DM_TARGET_PASSES_INTEGRITY,
        .module = THIS_MODULE,
        .ctr                = multipath_ctr,
        .dtr                = multipath_dtr,
        .clone_and_map_rq   = multipath_clone_and_map,
        .release_clone_rq   = multipath_release_clone,
        .rq_end_io          = multipath_end_io,
        .map                = multipath_map_bio,
        .end_io             = multipath_end_io_bio,
        .presuspend         = multipath_presuspend,
        .postsuspend        = multipath_postsuspend,
        .resume             = multipath_resume,
        .status             = multipath_status,
        .message            = multipath_message,
        .prepare_ioctl      = multipath_prepare_ioctl,
        .iterate_devices    = multipath_iterate_devices,
        .busy               = multipath_busy,
};

/* Multipath context */
struct multipath {
        unsigned long flags;            /* Multipath state flags */

        spinlock_t lock;
        enum dm_queue_mode queue_mode;

        struct pgpath *current_pgpath;
        struct priority_group *current_pg;
        struct priority_group *next_pg; /* Switch to this PG if set */

        atomic_t nr_valid_paths;        /* Total number of usable paths */
        unsigned nr_priority_groups;
        struct list_head priority_groups;

        const char *hw_handler_name;
        char *hw_handler_params;
        wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
        unsigned pg_init_retries;       /* Number of times to retry pg_init */
        unsigned pg_init_delay_msecs;   /* Number of msecs before pg_init retry */
        atomic_t pg_init_in_progress;   /* Only one pg_init allowed at once */
        atomic_t pg_init_count;         /* Number of times pg_init called */

        struct mutex work_mutex;
        struct work_struct trigger_event;
        struct dm_target *ti;

        struct work_struct process_queued_bios;
        struct bio_list queued_bios;

        struct timer_list nopath_timer; /* Timeout for queue_if_no_path */
};

  • The multipath device driver creates a node that combines all the paths belonging to the device. It builds the multipath device from the dm table passed in by table_load.
  • It handles the I/O sent to the multipath device and, depending on the policy set, maps each I/O to the appropriate path.

DM Path Selection Policies

The multipath target driver provides path failover and path load sharing. An I/O failure on one path to a device is captured and the I/O is retried on an alternate path to the same device. An I/O error is returned to the I/O initiator only after all paths to the device have been tried and have failed. The path selection policy determines how bios are distributed among the paths to the same device.
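
The failover rule (retry on another path, report an error only when every path has failed) can be sketched in a few lines of Python. This models the rule only, not the kernel's bio requeueing. PathError, submit_with_failover, and the flaky_send callback are all hypothetical names used for the illustration.

```python
class PathError(Exception):
    """Raised when an I/O fails on a particular path."""

def submit_with_failover(paths, send):
    """Try each path in turn; raise only when every path has failed."""
    failed = []
    for path in paths:
        try:
            return send(path)
        except PathError:
            failed.append(path)       # remember the failure, try the next path
    raise PathError("all paths failed: " + ", ".join(failed))

def flaky_send(path):
    # Pretend the first path is down and the second one is healthy.
    if path == "8:16":
        raise PathError(path + " is down")
    return "ok via " + path

result = submit_with_failover(["8:16", "8:32"], flaky_send)
print(result)
```

The caller sees a single successful completion even though the first attempt failed, which is the behaviour the I/O initiator observes with multipath.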

service-time

Iterates through all valid paths and chooses the best path per the following calculations:

  • If both paths have the same throughput value, choose the less loaded path.
  • If both paths have the same load, choose the path with the higher throughput.
  • If one path has no throughput value, choose the other one.
  • Otherwise, calculate the service time of each path and choose the faster one.

struct path_info {
        struct list_head list;
        struct dm_path *path;
        unsigned repeat_count;
        unsigned relative_throughput;
        atomic_t in_flight_size;        /* Total size of in-flight I/Os */
};
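
The comparison rules listed above can be written out as a pairwise comparison folded over all valid paths. The Python sketch below follows those four rules in order and estimates service time as (in-flight bytes + incoming bytes) / relative throughput. It is a model of the described logic under that assumption; the kernel implementation differs in detail, and the names PathInfo, better, and select_path belong to this sketch.

```python
from dataclasses import dataclass

@dataclass
class PathInfo:
    name: str
    relative_throughput: int
    in_flight_size: int = 0     # total bytes of in-flight I/O

def better(a, b, incoming):
    """Return whichever of a, b should take the next I/O of `incoming` bytes."""
    if a.relative_throughput == b.relative_throughput:
        return a if a.in_flight_size <= b.in_flight_size else b
    if a.in_flight_size == b.in_flight_size:
        return a if a.relative_throughput > b.relative_throughput else b
    if a.relative_throughput == 0:
        return b
    if b.relative_throughput == 0:
        return a
    # Estimated service time once the incoming I/O has been queued.
    st_a = (a.in_flight_size + incoming) / a.relative_throughput
    st_b = (b.in_flight_size + incoming) / b.relative_throughput
    return a if st_a <= st_b else b

def select_path(paths, incoming):
    best = paths[0]
    for candidate in paths[1:]:
        best = better(best, candidate, incoming)
    return best

fast_busy = PathInfo("8:16", relative_throughput=2, in_flight_size=8192)
slow_idle = PathInfo("8:32", relative_throughput=1, in_flight_size=4096)
choice = select_path([fast_busy, slow_idle], incoming=4096)
print(choice.name)
```

In this example the faster path wins despite carrying more in-flight data, because its estimated service time (12288 / 2) is lower than that of the slower path (8192 / 1).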

round-robin

If the path selection policy is set to round-robin, the next path in the valid path list is selected and then moved to the end of the list.

struct path_info {
        struct list_head list;
        struct dm_path *path;
        unsigned repeat_count;
};
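
The "take the head, move it to the tail" behaviour maps naturally onto a double-ended queue. The Python sketch below models it with collections.deque. RoundRobinSelector and its methods are illustrative names, and repeat_count handling is left out.

```python
from collections import deque

class RoundRobinSelector:
    def __init__(self, paths):
        self.valid_paths = deque(paths)

    def select_path(self):
        """Return the head of the valid list and move it to the tail."""
        if not self.valid_paths:
            return None
        path = self.valid_paths.popleft()
        self.valid_paths.append(path)
        return path

    def fail_path(self, path):
        # A failed path leaves the valid list and is no longer selected.
        self.valid_paths.remove(path)

rr = RoundRobinSelector(["8:16", "8:32"])
order = [rr.select_path() for _ in range(4)]
print(order)
```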

queue-length

If the path selection policy is set to queue-length, the selector iterates through the queue lengths of all valid paths and selects the path with the shortest queue.

struct path_info {
        struct list_head        list;
        struct dm_path          *path;
        unsigned                repeat_count;
        atomic_t                qlen;   /* the number of in-flight I/Os */
};
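
As a sketch of this policy, the Python class below keeps a per-path count of in-flight I/Os and picks the minimum. The names are hypothetical, and ties go to the path that was added first, which is a property of this sketch rather than a statement about the kernel.

```python
class QueueLengthSelector:
    def __init__(self, paths):
        self.qlen = {path: 0 for path in paths}   # in-flight I/Os per path

    def select_path(self):
        """Return the path with the fewest in-flight I/Os."""
        if not self.qlen:
            return None
        return min(self.qlen, key=self.qlen.get)

    def start_io(self, path):
        self.qlen[path] += 1

    def end_io(self, path):
        self.qlen[path] -= 1

ql = QueueLengthSelector(["8:16", "8:32"])
picked = []
for _ in range(3):
    path = ql.select_path()
    ql.start_io(path)
    picked.append(path)
print(picked, ql.qlen)
```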

io-affinity

If the path selection policy is set to io-affinity, each path is mapped to a CPU (or a set of CPUs) at configuration time. When an I/O arrives, the selector looks up the current CPU ID and uses the path mapped to that CPU.

struct path_info {
        struct dm_path *path;
        cpumask_var_t cpumask;
        refcount_t refcount;
        bool failed;
};
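
The CPU-to-path lookup can be modelled as a plain dictionary. In the Python sketch below, the class and method names are invented, the first mapping for a CPU wins (a choice made for this sketch), and an unmapped CPU simply yields None rather than modelling any fallback.

```python
class IoAffinitySelector:
    def __init__(self):
        self.path_for_cpu = {}

    def add_path(self, path, cpus):
        """Map each CPU in `cpus` to `path`; the first mapping wins."""
        for cpu in cpus:
            self.path_for_cpu.setdefault(cpu, path)

    def select_path(self, current_cpu):
        # No fallback is modelled: an unmapped CPU gets no path.
        return self.path_for_cpu.get(current_cpu)

ioa = IoAffinitySelector()
ioa.add_path("8:16", cpus=[0, 1])
ioa.add_path("8:32", cpus=[2, 3])
print(ioa.select_path(1), ioa.select_path(2), ioa.select_path(7))
```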

struct dm_dev contains a reference to a struct block_device, while pscontext stores a reference to the path selector's struct path_info.

struct dm_path {
        struct dm_dev *dev;     /* Read-only */
        void *pscontext;        /* For path-selector use */
};

Reference