Quorum monitoring

We have seen previously that, prior to Sun Cluster 3.2 update 2, there was no quorum monitoring. While this can be a non-issue for a disk-based quorum device (you usually have data on the quorum disk, so if the disk fails you will probably notice it), it is an issue with quorum servers: the loss or unavailability of a QS went unnoticed until the next CMM reconfiguration, which typically occurs when a node leaves the cluster. At that point it usually led to a loss of operational quorum and a complete cluster outage.

Also, there was no command reporting the QD status in real time.


Sun Cluster 3.2 update 2 introduces quorum monitoring, which can:


  • detect quorum failure

  • detect quorum repair

Neither detection generates a CMM reconfiguration.



Example: a 2-node cluster with 1 quorum server


On the quorum server:

# ./clqs show
=== Quorum Server on port 9000 ===

Disabled                        False


  ---  Cluster cluster32u2 (id 0x4979D880) Reservation ---

  Node ID:                      1
    Reservation key:            0x4979d88000000001

  ---  Cluster cluster32u2 (id 0x4979D880) Registrations ---

  Node ID:                      1
    Registration key:           0x4979d88000000001

  Node ID:                      2
    Registration key:           0x4979d88000000002


Let's stop the quorum server daemon:



# ./clqs show
=== Quorum Server on port 9000 ===

Disabled                        False

clqs:  (C339181) Quorum server is not yet started on port "9000".



The nodes notice that the QS is unavailable: immediately if we run a status command, or after a timer expires if we do nothing:


c-220ra-1-epar03# /usr/cluster/bin/scstat -q
Jan 26 14:12:38 c-220ra-1-epar03 cl_runtime: WARNING: CMM: Connection to quorum server qs failed with error 146.
Jan 26 14:12:38 c-220ra-1-epar03 last message repeated 1 time
Jan 26 14:12:38 c-220ra-1-epar03 cl_runtime: WARNING: CMM: Erstwhile online quorum device qs (qid 1) is inaccessible now.

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       3


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       c-220ra-1-epar03    1        1       Online
  Node votes:       c-220ra-2-epar03    1        1       Online


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     qs                  0        1       Offline
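
To spot this state automatically, the device rows can be scanned for a status other than Online. The following is only a minimal sketch: it assumes the `scstat -q` layout shown above, and the helper name `offline_devices` and the regular expression are illustrative, not part of Sun Cluster.

```python
import re

# Sample copied from the device section of the scstat -q output above.
SAMPLE = """\
-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     qs                  0        1       Offline
"""

# Assumed row layout: "Device votes: <name> <present> <possible> <status>".
ROW = re.compile(r"^\s*Device votes:\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def offline_devices(scstat_text):
    """Return the names of quorum devices whose Status column is not 'Online'."""
    offline = []
    for line in scstat_text.splitlines():
        match = ROW.match(line)
        if match and match.group(4) != "Online":
            offline.append(match.group(1))
    return offline


if __name__ == "__main__":
    print(offline_devices(SAMPLE))
```

In practice the text would come from running `/usr/cluster/bin/scstat -q` on a cluster node rather than from an embedded sample.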



Let's restart the QS.

The nodes notice that the QS is available again:

c-220ra-1-epar03#  /usr/cluster/bin/scstat -q
Jan 26 14:13:13 c-220ra-1-epar03 cl_runtime: NOTICE: CMM: Quorum device qs: owner set to node 1.
Jan 26 14:13:13 c-220ra-1-epar03 cl_runtime: NOTICE: CMM: Erstwhile inaccessible quorum device qs (qid 1) is online now.

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       3


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       c-220ra-1-epar03    1        1       Online
  Node votes:       c-220ra-2-epar03    1        1       Online


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     qs                  1        1       Online



Now, let's first stop node 2:



c-220ra-1-epar03# /usr/cluster/bin/scstat -q

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       2


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       c-220ra-1-epar03    1        1       Online
  Node votes:       c-220ra-2-epar03    0        1       Offline


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     qs                  1        1       Online



Then let's break the QS:


c-220ra-1-epar03# /usr/cluster/bin/scstat -q
Jan 26 14:19:41 c-220ra-1-epar03 cl_runtime: WARNING: CMM: Connection to quorum server qs failed with error 145.

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       2


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       c-220ra-1-epar03    1        1       Online
  Node votes:       c-220ra-2-epar03    0        1       Offline


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     qs                  0        1       Offline



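So this does not trigger a CMM reconfiguration, and the QS is seen as offline. Note, however, that the summary still reports 2 votes present, although only the single vote of node 1 is actually available.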
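
The arithmetic behind this output can be sketched as follows. This is an illustration only: it assumes the usual strict-majority rule (more than half of the possible votes), which matches the "needed: 2" for "possible: 3" shown above, and the function names are hypothetical.

```python
def votes_needed(possible):
    """Strict majority of the possible votes (assumed rule: more than half)."""
    return possible // 2 + 1


def has_quorum(present, possible):
    """True when the votes actually present reach the majority."""
    return present >= votes_needed(possible)


# Current per-member votes taken from the last output above.
current = {"c-220ra-1-epar03": 1, "c-220ra-2-epar03": 0, "qs": 0}
possible = 3

live_votes = sum(current.values())
print(votes_needed(possible), live_votes, has_quorum(live_votes, possible))
# prints: 2 1 False
```

The summary keeps the value from the last reconfiguration (2 present), while the live sum is 1, which is below the 2 votes needed.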

See bugs 6797576 and 6775902.

Here is the proposed new output:

a. The quorum vote summary is from the last node reconfiguration.
b. The quorum votes contributed by the nodes and the QDs reflect the current status.

As a result, the summary in (a) and the sum of the votes in (b) may differ.
Refer to the examples below:

2 nodes and QS up:

2 node cluster with one QD - all online 
----------------------------------------
# scstat -q
-- Quorum Summary from latest node reconfiguration --
 Quorum votes possible:          3
 Quorum votes needed:            2
 Quorum votes present:           3
-- Quorum Votes by Node (current status) --
                Node Name         Present Possible Status
                ---------         ------- -------- ------
 Node votes:    c-220ra-1-epar03  1       1        Online
 Node votes:    c-220ra-2-epar03  1       1        Online
-- Quorum Votes by Device (current status) --
                Device Name       Present Possible Status
                -----------       ------- -------- ------
 Device votes:  qs                1       1        Online

Now halt one node (the summary and the node 2 row change):

# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from latest node reconfiguration ---
 Needed Present Possible
 ------ ------- --------
 2      2       3
--- Quorum Votes by Node (current status) ---
Node Name         Present Possible Status
---------         ------- -------- ------
c-220ra-1-epar03  1       1        Online
c-220ra-2-epar03  0       1        Offline
--- Quorum Votes by Device (current status) ---
Device Name Present Possible Status
----------- ------- -------- ------
qs          1       1        Online

Now take the QS offline (only the device row changes):

# clq status
=== Cluster Quorum ===
--- Quorum Votes Summary from latest node reconfiguration ---
 Needed Present Possible
 ------ ------- --------
 2      2       3
--- Quorum Votes by Node (current status) ---
Node Name         Present Possible Status
---------         ------- -------- ------
c-220ra-1-epar03  1       1        Online
c-220ra-2-epar03  0       1        Offline
--- Quorum Votes by Device (current status) ---
Device Name Present Possible Status
----------- ------- -------- ------
qs          0       1        Offline




Note that there has been no CMM reconfiguration, and hence only the QD vote has changed.
This is the exact scenario, and the fix was to enhance the display strings: the section headers now state "from latest node reconfiguration" and "(current status)".
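
The difference between the summary and the current status can also be checked mechanically. The sketch below assumes the proposed `clq status` layout shown above (the sample is adapted from the last output); the function name `compare` and both regular expressions are illustrative assumptions.

```python
import re

# Sample adapted from the last "clq status" output above.
SAMPLE = """\
=== Cluster Quorum ===
--- Quorum Votes Summary from latest node reconfiguration ---
 Needed Present Possible
 ------ ------- --------
 2      2       3
--- Quorum Votes by Node (current status) ---
Node Name         Present Possible Status
---------         ------- -------- ------
c-220ra-1-epar03  1       1        Online
c-220ra-2-epar03  0       1        Offline
--- Quorum Votes by Device (current status) ---
Device Name Present Possible Status
----------- ------- -------- ------
qs          0       1        Offline
"""

# Summary row: three integers (needed, present, possible).
SUMMARY = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$")
# Node or device row: name, present, possible, status.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def compare(text):
    """Return (present votes in the summary, sum of current present votes)."""
    summary_present = None
    current_present = 0
    for line in text.splitlines():
        summary = SUMMARY.match(line)
        row = ROW.match(line)
        if summary:
            summary_present = int(summary.group(2))
        elif row:
            current_present += int(row.group(2))
    return summary_present, current_present


if __name__ == "__main__":
    print(compare(SAMPLE))
```

When the two values differ, a node or quorum device has changed state since the last CMM reconfiguration.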
About

Jean-Christophe Lamoure
