Saturday Sep 24, 2005

Network configurator for laptops

I have followed the trend of many nascent bloggers by being silent for a long time. So what have I been doing? Aside from the normal work, I have been trying to get things organized so that we can start putting out tools and utilities that make a Solaris laptop a bit more real. Internally we have been using a number of tools delivered by frkit. Now the plan is to make frkit available on the laptop community for OpenSolaris. The community is now open and a number of people are contributing.

Inetmenu is one of the tools used internally to configure network interfaces. Out of the box, Solaris and OpenSolaris don't deal well with nomadic systems like laptops. One typically has to install Solaris as a standalone system. Then, after booting up, the user has to manually plumb the interface, start DHCP, and fix up the files required for DNS or NIS. Inetmenu is a script that does all that. It handles wired, wireless and dial-up interfaces. One can also define profiles for use. All of this makes life a bit nicer for laptop users.

Inetmenu is just one of the network configurators. There is yet another one used internally, called netprof, and hopefully we should be able to make that available in the not too distant future.

Technorati Tag: OpenSolaris

Tuesday Jun 14, 2005

Resync Regions and Optimized Resyncs


Optimized Resyncs in Solaris Volume Manager

Over the last couple of months, a number of people wanted to know about optimized resyncs. People familiar with VxVM might know this as DRL (Dirty Region Logging). In Solaris Volume Manager this functionality is called optimized resync, and it is implemented using resync regions (RR).

The function of DRL or RR is to ensure that a mirror is consistent in the event of a crash. Consistency does not mean that the mirror will contain up-to-date information. What is guaranteed is that a read request for the same block returns the same data from any side of the mirror. For example, if block 10 is read from a 2-sided mirror, the data returned must be identical whether it is supplied from side 1 or side 2 of that mirror.

When parallel writes to a mirror are enabled, a window exists where a system may die before writes to all sides of the mirror have completed. To ensure the mirror is consistent in the event of a crash, a simple implementation might choose one side of the mirror and copy its contents to all the other sides when the system boots up. This obviously is not very efficient. A smarter approach is to track the regions in which writes occurred and resync only those regions. SVM uses this technique.

SVM divides a mirror into at most 1001 regions. The region state is maintained as an incore bitmap and in the mddb. When a write request arrives at the mirror strategy routine, it carries a block number and a length. From this information the impacted regions are computed. Prior to issuing the write, the incore bitmap is checked to see if the region has already been marked dirty. If not, the incore bitmap is updated. An asynchronous resync kernel daemon thread checks this bitmap every few seconds and writes it out to the mddb if required. After the RR bitmap is flushed to the mddb, the bitmap is reset.

On boot up, svc:/system/metainit:default starts the resync kernel threads. There is one thread per mirror. The resync thread scans the mddb, and only the regions that are marked dirty are resynced. When a machine is shut down cleanly, the bitmap is zeroed out and no resync occurs when starting up.
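The region computation described above can be sketched in user-space C. This is only an illustration, not the SVM kernel code: the struct layout and the names rr_bitmap, rr_init, rr_is_dirty and rr_mark_write are all hypothetical, and the region size is simply passed in rather than derived the way SVM derives it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RR_MAX_REGIONS 1001   /* maximum region count quoted in the text */

/* Hypothetical in-core dirty-region state for one mirror (not SVM's layout). */
struct rr_bitmap {
    uint64_t region_blocks;                   /* blocks covered by one region */
    uint32_t nregions;                        /* regions in use */
    uint8_t  dirty[(RR_MAX_REGIONS + 7) / 8]; /* one bit per region */
};

void rr_init(struct rr_bitmap *rr, uint64_t region_blocks, uint32_t nregions)
{
    assert(region_blocks > 0 && nregions > 0 && nregions <= RR_MAX_REGIONS);
    rr->region_blocks = region_blocks;
    rr->nregions = nregions;
    memset(rr->dirty, 0, sizeof(rr->dirty));
}

/* Returns 1 if the region is marked dirty, 0 otherwise. */
int rr_is_dirty(const struct rr_bitmap *rr, uint32_t region)
{
    return (rr->dirty[region / 8] >> (region % 8)) & 1;
}

/*
 * Mark every region touched by a write of nblocks starting at blkno.
 * Returns the number of regions that went from clean to dirty, so a
 * caller can tell whether the bitmap needs to be flushed again.
 */
uint32_t rr_mark_write(struct rr_bitmap *rr, uint64_t blkno, uint64_t nblocks)
{
    uint64_t first, last, r;
    uint32_t newly_dirty = 0;

    if (nblocks == 0)
        return 0;
    first = blkno / rr->region_blocks;
    last = (blkno + nblocks - 1) / rr->region_blocks;
    if (first >= rr->nregions)
        return 0;                 /* write lies beyond the tracked range */
    if (last >= rr->nregions)
        last = rr->nregions - 1;  /* clamp to the last tracked region */
    for (r = first; r <= last; r++) {
        if (!rr_is_dirty(rr, (uint32_t)r)) {
            rr->dirty[r / 8] |= (uint8_t)(1u << (r % 8));
            newly_dirty++;
        }
    }
    return newly_dirty;
}
```

With a region size of 67094 blocks, an 8-block write starting at block 67090 straddles regions 0 and 1, so both are marked dirty. Repeating the same write marks nothing new, which is why most writes never have to touch the bitmap.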

In the mddb, the resync bitmap is called the resync record. Every mirror has two resync records associated with it. To reduce hot spots, the resync records are spread across multiple mddbs. That is, if one has 2 mirrors and 4 mddbs, then the resync records for one mirror will be on mddb1 and mddb2, and for the second mirror they will be on mddb3 and mddb4. The actual algorithm for resync record placement is a bit more sophisticated.
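To make the simplified placement concrete, here is a naive round-robin sketch in C. It reproduces only the 2-mirror, 4-mddb example from the paragraph above. It is not the real SVM placement algorithm, and the function name and the 0-based slot numbering are assumptions made for illustration.

```c
#include <assert.h>

#define RR_RECORDS_PER_MIRROR 2   /* each mirror has two resync records */

/*
 * Naive round-robin placement (illustrative only): record i of mirror m
 * goes to mddb slot (m * 2 + i) modulo the number of mddb replicas.
 * Slots are numbered from 0 here.
 */
void rr_place_records(unsigned mirror, unsigned nmddb,
                      unsigned slot[RR_RECORDS_PER_MIRROR])
{
    unsigned i;

    assert(nmddb > 0);
    for (i = 0; i < RR_RECORDS_PER_MIRROR; i++)
        slot[i] = (mirror * RR_RECORDS_PER_MIRROR + i) % nmddb;
}
```

With 4 replicas, mirror 0 lands on slots 0 and 1 and mirror 1 on slots 2 and 3. A third mirror wraps around to slots 0 and 1 again.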

One can get metastat to display the location of the resync regions for the mirror.

            flags           first blk       block count
         a        u         16              8192            /dev/dsk/c1t1d0s7
         a        u         16              8192            /dev/dsk/c1t0d0s7
         a        u         16              8192            /dev/dsk/c1t2d0s0
    # export MD_DEBUG=STAT
    # metastat d10
    d10: Mirror
        Submirror 0: d0
          State: Okay         Wed Jun  1 19:53:10 2005
        Submirror 1: d1
          State: Okay         Wed Jun  1 19:53:10 2005
        Pass: 1
        Read option: roundrobin (default)
        Write option: parallel (default)
        Size: 67094528 blocks (31 GB)
        Regions which are dirty: 34% (blksize 67094 num 1001)
        Resync record[0]: 0 (/dev/dsk/c1t1d0s7 16 8192)
        Resync record[1]: 1 (/dev/dsk/c1t0d0s7 16 8192)
    d0: Submirror of d10
        State: Okay         Wed Jun  1 19:53:10 2005
        Size: 67094528 blocks (31 GB)
        Stripe 0:
            Device                             Start Dbase State       Reloc Hot Spare Time
            /dev/dsk/c3t50020F23000100F7d9s0       0 No    Okay        Yes             Wed Jun  1 19:52:53 2005
    d1: Submirror of d10
        State: Okay         Wed Jun  1 19:53:10 2005
        Size: 67094528 blocks (31 GB)
        Stripe 0:
            Device                              Start Dbase State       Reloc Hot Spare Time
            /dev/dsk/c3t50020F23000100F7d10s0       0 No    Okay        Yes             Wed Jun  1 19:53:04 2005
    Device Relocation Information:
    Device                            Reloc Device ID
    /dev/dsk/c3t50020F23000100F7d9    Yes   id1,ssd@n60020f20000100f740336c7b00023087
    /dev/dsk/c3t50020F23000100F7d10   Yes   id1,ssd@n60020f20000100f740336ca20001241b

In the output above, notice that the resync records are spread across 2 mddbs. I was running newfs on the mirror, which is why 34% of the regions show up as dirty. The blksize refers to the size of a resync region. If you were monitoring the iostat output for an active mirror, you would notice that the disks that contain the mddbs are being written to. These writes are due to the periodic updates of the resync region bitmaps to the mddb.
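As a rough worked example, the numbers in the metastat output can be turned into an estimate of how much data a resync would have to copy if the machine crashed at that moment. The sketch below assumes 512-byte blocks (which matches 67094528 blocks being reported as 31 GB) and treats the rounded 34% figure as exact, so the result is only approximate. The function names are hypothetical.

```c
#include <stdint.h>

#define BLOCK_BYTES 512u   /* assumed disk block size */

/* Approximate number of blocks covered by the dirty regions. */
uint64_t rr_dirty_blocks(unsigned dirty_pct, uint32_t nregions,
                         uint64_t region_blocks)
{
    uint64_t dirty_regions = ((uint64_t)nregions * dirty_pct) / 100;

    return dirty_regions * region_blocks;
}

/* Convert a block count to bytes. */
uint64_t rr_blocks_to_bytes(uint64_t blocks)
{
    return blocks * BLOCK_BYTES;
}
```

For the mirror shown, 34% of 1001 regions is about 340 regions, or 340 x 67094 = 22811960 blocks. That is roughly 10.9 GB to resync, compared with about 32 GB for copying an entire submirror.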

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

SMS Oban and MT ioctls in SVM


SMS, Oban and multi-threading ioctls in SVM

Now that Solaris is open sourced, it's time to share a bit of inside information. All Volume Manager projects were named after various single malt scotches (SMS). Over the last couple of years we have gone through (yes, ingested) Lagavulin, Laphroaig, Springbank, Ardbeg, and the most recent one was Oban. Our lab machines are named after either distilleries that are defunct or the islands around Islay. For some of us, names like Ronaldsay, Eday, Stronsay and Shapinsay easily roll off our tongues.

Often there are applications, such as Oracle RAC or SAN file systems, where multiple nodes need to access shared storage simultaneously. Cluster volume managers typically provide this functionality. OpenSolaris supports it with Solaris Cluster Volume Manager (SCVM). Oban was its code name. SCVM supports striping, mirroring and soft partitions. In addition, the mirror and soft partition drivers were enhanced to support Application Based Recovery (ABR) ioctls for block and character devices, to improve cluster I/O performance. Oracle, for example, uses this functionality to speed up recovery after a crash.

Solaris Volume Manager can manage storage grouped in named disksets. For instance, one could group all the storage for home directories in a diskset. This diskset could be moved from one node to another as required. Data stored on the disks in this type of diskset can also be remotely replicated, thereby greatly improving disaster recovery. Ardbeg, a single malt that tastes like burnt rubber, delivered the capability to move disksets and to support remote replication of disksets.

A diskset always has a host associated with it. If multiple nodes share the same set of disks, then those nodes can participate in the same diskset. Disksets have single-owner and multi-owner attributes. The single-owner attribute is akin to EVMS's private container; that is, only one node at any given time can access the disks in the diskset. Multi-owner disksets allow multiple nodes to access the disksets simultaneously. These disksets support cluster volume management functionality. Multi-owner disksets are similar to EVMS's shared containers. Internally we called them Oban disksets. The metadevices (or volumes) in these multi-owner disksets can be managed from any node in a cluster.

While other blogs on OpenSolaris have focused on describing complex code, I felt it would be interesting to describe complex problems that were addressed with a simple solution. Normally one multi-threads sections of code for performance. That was not the case here; multi-threading was required for correctness. To appreciate the issues, one needs to understand a bit about the SVM and SCVM functionality and the daemon rpc.mdcommd. The detailed workings of SCVM are beyond the scope of this blog, but here are a few salient features.

  1.  Every multi-owner diskset has a master node.
  2.  The master node controls access to the mddb.
  3.  Node to node communication is handled by rpc.mdcommd.
  4.  Communication is in form of messages.

Ioctls in SVM are used to either get configuration information or change/maintain a metadevice. Since configuration changes are infrequent and reporting the status of metadevices is not on a critical path, multi-threading SVM ioctls was never high on our agenda. SCVM changed that. Our goal was to make cluster volume management functionality a seamless extension of Solaris Volume Manager. This meant that local, single-owner and multi-owner disksets all had to co-exist.

For all the nodes to have a consistent view of the state of metadevices, state changes on one node must be propagated to all the nodes in the cluster. Similarly, when a configuration change is made on any node, it should appear on all the nodes. The daemon rpc.mdcommd is used to marshal and transmit SCVM metadata across a cluster. When metadata from one node needs to be propagated to all the nodes in the cluster, the following events occur:
  • The sending node creates a message and sends it to rpc.mdcommd on the local node.
  • rpc.mdcommd sends the message to the master node.
  • rpc.mdcommd on the master node receives this message and creates a thread to handle the message.
  • The handler is first executed on the master.  The message is then sent to all the slave nodes.  Each of the slave nodes then executes the same handler.

Messages typically contain configuration change information. The handler passes the contents of the message to the kernel through an ioctl. Since each diskset can have a different master node, a single node may be master for one diskset and slave for another. If ioctls are single threaded, then critical messages sent to the master node for one diskset could be blocked on that node by a state change update to another diskset. Some of the messages generate a sub-message that needs to be propagated to all the nodes too. Any attempt to send the sub-message will immediately cause a deadlock, since it is called from the context of the first ioctl. There are other messages that need information from the kernel and therefore need to issue ioctl calls. These again will hang. Since local disksets, named disksets and single-owner disksets can co-exist, operations on a single-owner diskset can block operations on a multi-owner diskset. For all these reasons it was decided to multi-thread the ioctls.
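The sub-message deadlock can be illustrated with a user-space analogy. In the sketch below a single non-recursive pthread mutex stands in for the global ioctl lock, and the function names are hypothetical. The real path goes through rpc.mdcommd and the kernel rather than one thread calling itself, so treat this as a picture of the locking problem only. The nested call uses pthread_mutex_trylock so that the example reports the conflict instead of hanging.

```c
#include <errno.h>
#include <pthread.h>

/* Stand-in for a single global ioctl lock (hypothetical). */
static pthread_mutex_t ioctl_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * An ioctl issued while handling a sub-message. Returns 0 if it could
 * take the lock, or EBUSY if the lock is already held. A blocking
 * pthread_mutex_lock() here would wait forever in the EBUSY case.
 */
int nested_ioctl(void)
{
    int rc = pthread_mutex_trylock(&ioctl_lock);

    if (rc != 0)
        return rc;
    pthread_mutex_unlock(&ioctl_lock);
    return 0;
}

/*
 * The first ioctl holds the lock for its whole duration and, while
 * holding it, triggers a sub-message that needs the same lock.
 */
int outer_ioctl_with_submessage(void)
{
    int rc;

    pthread_mutex_lock(&ioctl_lock);
    rc = nested_ioctl();
    pthread_mutex_unlock(&ioctl_lock);
    return rc;
}
```

Called on its own, nested_ioctl() succeeds. Called from inside outer_ioctl_with_submessage(), it finds the lock held and can never make progress until the outer call returns, which in turn is waiting on it.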

While multi-threading the ioctls themselves is simple, the impact of this change on the rest of the SVM code is potentially significant. Large chunks of the code path must be examined to avoid race conditions. Therefore, at the time of the project, it was decided to multi-thread only the critical ioctls for multi-owner disksets.

The obvious question is: how should single threaded and multi-threaded ioctls interact?

Most of the multi-threaded ioctls are directly related to cluster interaction. For example, when a cluster membership is forming, rpc.mdcommd must be suspended until a new cluster membership list is available. At that point the daemon must update its notion of the active nodes in the cluster and send messages only to those nodes. As a result, the ioctls that handle cluster related operations were deemed to have a higher priority than the ones that change the state of the mddb. These ioctls also do not interact with md structures. Based on this analysis, we decided that while a single threaded ioctl was in progress, multi-threaded ioctls must be allowed. The next issue was:

Should a single threaded ioctl be allowed when a multi-threaded ioctl is in progress?

Recall that the traditional ioctls (i.e. the single threaded ones) can change and update the mddb. These changes can result in messages being sent across the cluster. If the state of the cluster is changing, the mddb state change needs to be held back until the cluster is stable. Hence it was decided to block single threaded ioctls if multi-threaded ioctls were in progress. This also means that we risk starving single threaded ioctls if multi-threaded calls keep occurring. We deemed this risk to be minimal, since only a few ioctls were multi-threaded, and if a large number of these were occurring, it indicated a problem with the cluster. In such a situation, sacrificing a node to preserve the availability of the cluster is reasonable.

The implementation of multi-threaded ioctls in this manner turns out to be quite simple. The code snippets below are from the function mdioctl in usr/src/uts/common/io/lvm/md/md.c

    if (!is_mt_ioctl(cmd) && md_ioctl_lock_enter() == EINTR) {
        return (EINTR);
    }

    /* initialize lock tracker */

    /* Flag to indicate that MD_GBL_IOCTL_LOCK is not acquired */

    if (is_mt_ioctl(cmd)) {
        /* increment the md_mtioctl_cnt */
        lock.l_flags |= MD_MT_IOCTL;
    }

md_ioctl_lock_enter() calls md_global_lock_enter() with ~MD_GBL_IOCTL_LOCK (only the relevant code is shown):

    if (!(global_locks_owned_mask & MD_GBL_IOCTL_LOCK)) {
        while ((md_mtioctl_cnt != 0) ||
            (md_status & MD_GBL_IOCTL_LOCK)) {
            if (cv_wait_sig_swap(&md_cv, &md_mx) == 0) {
                return (EINTR);
            }
        }
        md_status |= MD_GBL_IOCTL_LOCK;
    }

The if (!(global_locks_owned_mask ...)) test will always be true in the above call sequence. We therefore achieve the logic that while a multi-threaded ioctl is in progress, or if another single threaded ioctl call is in progress, subsequent single threaded ioctls will wait.
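For readers who want to experiment with this policy outside the kernel, here is a user-space analog using pthreads. It is a sketch under stated assumptions, not the md.c code: all names are hypothetical, signal handling (the EINTR path) is left out, and a non-blocking st_ioctl_tryenter() is added purely so the behaviour can be observed from a single thread.

```c
#include <pthread.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int mt_cnt;      /* multi-threaded ioctls currently in flight */
static int st_locked;   /* nonzero while a single threaded ioctl runs */

/* Multi-threaded ioctls never wait: they only bump the counter. */
void mt_ioctl_enter(void)
{
    pthread_mutex_lock(&mx);
    mt_cnt++;
    pthread_mutex_unlock(&mx);
}

void mt_ioctl_exit(void)
{
    pthread_mutex_lock(&mx);
    if (--mt_cnt == 0)
        pthread_cond_broadcast(&cv);   /* wake waiting single threaded ioctls */
    pthread_mutex_unlock(&mx);
}

/* Single threaded ioctls wait for all MT ioctls and any other ST ioctl. */
void st_ioctl_enter(void)
{
    pthread_mutex_lock(&mx);
    while (mt_cnt != 0 || st_locked)
        pthread_cond_wait(&cv, &mx);
    st_locked = 1;
    pthread_mutex_unlock(&mx);
}

/* Non-blocking variant: returns 1 if entered, 0 if it would have waited. */
int st_ioctl_tryenter(void)
{
    int entered = 0;

    pthread_mutex_lock(&mx);
    if (mt_cnt == 0 && !st_locked) {
        st_locked = 1;
        entered = 1;
    }
    pthread_mutex_unlock(&mx);
    return entered;
}

void st_ioctl_exit(void)
{
    pthread_mutex_lock(&mx);
    st_locked = 0;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mx);
}
```

The asymmetry is the whole design: mt_ioctl_enter() never blocks, even while a single threaded ioctl holds the gate, whereas a single threaded ioctl has to wait until the multi-threaded count drops to zero and no other single threaded ioctl is running.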

...and that is the story behind multi-threading SVM ioctls.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris



