Friday Mar 04, 2011

Validating Multicast Transport: Where Did My Instances Go?

For cluster health information and communication of high-availability state, GlassFish 3.1 depends on GMS. To dynamically discover the members of a cluster, GMS depends on UDP multicast. So if there's something about your network preventing multicast communication among hosts, instances will become isolated from each other.

As I wrote in my last blog on the 'asadmin get-health' command, it's a good idea after starting your cluster to make sure everything is working correctly. With the asadmin validate-multicast command, you can diagnose issues with inter-instance communication or plan your cluster deployment before creating the cluster instances.

This asadmin subcommand sends and receives UDP multicast messages, and so validates that various hosts can all communicate with each other. The usage is simple in concept: run the tool on each host at the same time, using the same multicast address and port, and verify from the tool's output that each host receives messages from the others. So if you're running on hosts 1, 2, and 3, then you should see this when running on host 1:

    Listening for data...
    Sending message with content "host1" every 2,000 milliseconds
    Received data from host1 (loopback)
    Received data from host2
    Received data from host3

Likewise, hosts 2 and 3 should see messages from all three machines. Make sure the DAS and instances are not running while you use the tool, or their UDP traffic will interfere with each other. A video showing some features of the tool accompanies this post.

Debugging, Step 1: Use Same Multicast Port/Address as Cluster

While this tool can be useful to check your network before deploying your cluster, it is most helpful when one instance is not communicating with the DAS or the other instances. You may see this if you run the 'get-health' command described in the previous blog. If you know that an instance is up according to its server log, but it's showing up as "not started" in the get-health output, then it's likely that the DAS and the instance are not seeing each other's UDP multicast messages. In this case, you want to run asadmin validate-multicast with the following options:

  • --multicastport The value of gms-multicast-port for your cluster in domain.xml.
  • --multicastaddress The value of gms-multicast-address for your cluster in domain.xml.

Using those options will make the tool use the same values as the members of your cluster, in effect simulating the GMS traffic between the DAS and instances. To find the values for those options, you can read them from the attributes on the <cluster> element in domain.xml. For instance:

  <clusters>
    <cluster name="mycluster"
        gms-multicast-port="22262"
        gms-multicast-address="228.9.244.214"
        [etc.] >
      <server-ref ref="instance1"></server-ref>
      <!-- [etc.] -->
    </cluster>
  </clusters>
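
For example, with the values above, each host would run the tool like this (all hosts at the same time, as before):

    asadmin validate-multicast --multicastport 22262 --multicastaddress 228.9.244.214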

Debugging, Step 2: TTL

Unless specified on the command line, the validate-multicast tool and GMS use the default MulticastSocket time-to-live for your operating system or 4, whichever is greater. You can see this in the tool's output if you run with the --verbose flag. For example:

    McastSender: The default TTL for the socket is 1. Setting it to minimum 4 instead.

You can try increasing this value to see if it is the limiting factor that prevents packets from reaching the cluster members with your network configuration. To specify a different value, use the following option with the tool (in addition to --multicastport and --multicastaddress):

  • --timetolive Sets the time-to-live value of the multicast packets sent by the tool.
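
For example, to try a larger TTL while still matching the cluster's multicast settings (the value 10 here is just an illustration):

    asadmin validate-multicast --multicastport 22262 --multicastaddress 228.9.244.214 --timetolive 10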

If you are now seeing all the instances you expect, you can change your cluster configuration so that GMS uses this TTL value. It is simple to pass this value into the asadmin create-cluster command. See the "To Create a Cluster" section of the High Availability Administration Guide for an example. If your cluster is already running, however, you can set this value with asadmin set. See the "To Change GMS Settings After Cluster Creation" section of the HA guide. The property to be set is GMS_MULTICAST_TIME_TO_LIVE, and it is listed in the "Dotted Names for GMS Settings" section.
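
As a rough sketch, assuming a cluster named mycluster and the dotted property name listed in that section (verify the exact name for your version against the guide), the command looks something like this:

    asadmin set clusters.cluster.mycluster.property.GMS_MULTICAST_TIME_TO_LIVE=10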

Debugging, Step 3: Specifying the Network Adapter

On a multi-homed machine (one with two or more network interfaces), you may need to specify the local interface that should be used for UDP multicast traffic. You can use ifconfig, or the equivalent on your system, to list the network interfaces and obtain the local IP address of the interface you wish to use. This address can then be used with the following command line parameter (along with any you're already specifying):

  • --bindaddress Sets the local interface used to receive packets.
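
For example, if the interface you want on a given host has the (hypothetical) address 10.0.1.12, the full invocation would be:

    asadmin validate-multicast --multicastport 22262 --multicastaddress 228.9.244.214 --bindaddress 10.0.1.12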

Note that this value will be different on each machine where you are running the tool. If you are now seeing all the instances you expect, you can set the GMS bind interface for each instance following the instructions in the "Traffic Separation Using Multi-Homing" section of the HA guide.
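
The guide has the full procedure, but the general shape is to point the cluster's gms-bind-interface-address attribute at a per-instance system property and then define that property on each instance. Assuming a cluster named mycluster and an instance named instance1 whose chosen interface is 10.0.1.12 (all hypothetical names; check the guide for the exact syntax), it looks roughly like this:

    asadmin set 'clusters.cluster.mycluster.gms-bind-interface-address=${GMS-BIND-INTERFACE-ADDRESS-mycluster}'
    asadmin create-system-properties --target instance1 GMS-BIND-INTERFACE-ADDRESS-mycluster=10.0.1.12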

If one or more machines are still missing in the output, then it may be that they are located on different subnets from each other. Or it could be that UDP multicast is not enabled on the network. You may need to ask the network administrator to verify that the network is configured so that the UDP multicast transport is available.

For more information, see the validate-multicast man page. A copy of the help information is attached here as well. The validate-multicast tool is also covered in more depth in the HA guide referenced above.

Monday Feb 28, 2011

State of the Cluster With Get-Health Command

GlassFish 3.1 uses GMS, part of Project Shoal, to provide dynamic membership information about a cluster, including the state of its instances and the DAS. The asadmin subcommand "get-health" gives you a snapshot of this state. For example, here is my cluster ready to be started:

  hostname% ./asadmin get-health mycluster
  instance1 not started
  instance2 not started
  instance3 not started
  Command get-health executed successfully.

...and after the cluster has started:

  hostname% ./asadmin get-health mycluster
  instance1 started since Fri Feb 25 16:27:16 EST 2011
  instance2 started since Fri Feb 25 16:27:15 EST 2011
  instance3 started since Fri Feb 25 16:27:15 EST 2011
  Command get-health executed successfully.

Besides this basic information, the GMS-based system can tell you about instances that have been shut down or that have failed (see instances 1 and 2 in the next example). Because this information comes from the GMS group members (instances or DAS), it remains correct even if the DAS is restarted. In the following, notice the "stopped" and "failed" messages. After the DAS restarts, the startup time for instance3 is still reported; instances 1 and 2 are down, so they cannot report when they failed or stopped, and they simply show as "not started":

  hostname% ./asadmin get-health mycluster
  instance1 stopped since Fri Feb 25 16:57:56 EST 2011
  instance2 failed since Fri Feb 25 16:58:26 EST 2011
  instance3 started since Fri Feb 25 16:27:15 EST 2011
  Command get-health executed successfully.

  hostname% ./asadmin restart-domain
  Successfully restarted the domain
  Command restart-domain executed successfully.

  hostname% ./asadmin get-health mycluster
  instance1 not started
  instance2 not started
  instance3 started since Fri Feb 25 16:27:15 EST 2011
  Command get-health executed successfully.

Note that you can see these events in the server.log as well. Here are some of the messages related to the shutdown and failure above:

  • GMS1017: Received PlannedShutdownEvent Announcement from member: instance1 with shutdown type: INSTANCE_SHUTDOWN of group: mycluster
  • GMS1007: Received FailureSuspectedEvent for member: instance2 of group: mycluster
  • GMS1019: member: instance2 of group: mycluster has failed

The asadmin get-health command is not new, but there is one new feature in GlassFish 3.1. If you have an instance configured as a service that is automatically restarted, it can fail and restart so quickly that the system never gets a chance to confirm the failure. In this case, you could see a message such as:

  hostname% ./asadmin get-health mycluster
  instance1 rejoined since Fri Feb 25 17:01:14 EST 2011
  ...

In a case like this, it is important to find out what is happening in that instance. If an instance fails and stays down, it becomes obvious quickly. But if an instance fails and restarts often, it may not be obvious unless you look through the server logs. So seeing an instance in the "rejoined" state could signal that the instance in question is failing repeatedly. Here are some of the messages you would see in server.log related to the rejoin:

  • GMS1053: member: instance1 was restarted at 3:45:41 PM EST on Feb 26, 2011.
  • GMS1054: Note that there was no Failure notification sent out for this instance that was previously started at ....
  • GMS1024: Adding Join member: instance1 group: mycluster StartupState: INSTANCE_STARTUP rejoining: missed reporting past FAILURE of this instance that had joined the group at ....

Whenever you create and start a cluster, it's a good idea to use the asadmin get-health command to make sure communication is working properly among the instances and the DAS. In my next blog, I'll show how you can use the "validate-multicast" subcommand to help diagnose a problem if one or more instances are not being found by asadmin get-health.

For more information on the get-health subcommand, run asadmin get-health --help. A version of it is attached here as get-health.txt. For more information on clusters and GMS, please see the article "Clustering in GlassFish Version 3.1".

Moving On Up: Upgrading to GlassFish 3.1

GlassFish 3.1 is here, and once you're done playing around with it, it might be time to do some work. If you already have an earlier GlassFish installation working for you, this blog will walk you through the steps to upgrade it to GlassFish 3.1. Specifically, I'm going to upgrade a v2.1.1 installation with a two-instance cluster.

If you're upgrading from a v2 developer profile or a 3.0.1 installation, then the upgrade process is mostly the same. You're just done after the upgrade tool exits, since you won't have cluster instances to recreate. If you're upgrading from a v2 enterprise edition installation, then please see the Upgrade Guide for more information. The upgrade process is the same, but there are some manual steps you will need to perform because GlassFish 3.1 does not support NSS.

All of the steps in this example are shown in the accompanying video, and I'll describe them below.

Step 1: Upgrade the DAS

First, make sure the DAS, node agent, and instances are all stopped. If you have any 3rd-party libraries installed in glassfish/lib (not domainX/lib), you'll need to copy those over to the 3.1 installation. Then you can use the upgrade tool located in the bin directory: bin/asupgrade. While you don't need to give it any options, the following are the most useful (use --help for the full list):

  • -c or --console This will start the tool in console mode instead of GUI mode.
  • -s or --source This will specify the source domain to be upgraded.
  • -t or --target This will specify the target domains directory into which the source domain will be copied. This is only really needed when using the console mode. In the GUI mode, it is filled in for you.
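
For example, a console-mode run might look like this (the source and target paths are only placeholders for your own installation layout):

    bin/asupgrade --console --source /opt/glassfish/domains/domain1 --target /opt/glassfish3/glassfish/domains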

When we say "upgrading the DAS," what you're really doing is upgrading the domain that is running in the DAS. This process hasn't really changed since GlassFish v3. Later, this information will be synchronized with the cluster instances. What the upgrade tool does for you is copy the old domain to the 3.1 server, and then it runs asadmin start-domain --upgrade <domain_name> to upgrade the configuration in the domain. Just like with GlassFish v3, the server itself performs the upgrade duties.

Step 2: Recreate the Cluster Instances

Because the cluster information is stored in domain.xml, we don't have to do any other steps to create the 3.1 cluster. However, we need to recreate the instances. I'm using the asadmin create-local-instance command to create my instances. See Jennifer's blog for more information on the command.

In the video, I specify the --node and --cluster options when creating the instances. These values, along with the instance name, match the node agent, cluster name, and instance names that were used in the v2 cluster. When the cluster is started, all of the configuration and application data in the DAS will be synchronized with the instances. The original instances do not store any per-instance data (with one exception below), so there is no separate "upgrade instances" step.
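
For example, on each instance machine the command is of this form, with the node, cluster, and instance names standing in for the ones used in your v2 cluster, and the DAS host given so the new instance can register with it:

    asadmin --host dashost.example.com create-local-instance --node agent1 --cluster mycluster instance1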

The one extra step you need before starting up the cluster is to copy the IMQ directories from the old instances to the newly-created ones. This persistent JMS information is not part of the domain configuration, and you don't want to lose it during the upgrade process. For instance, copy the directory:

    glassfish/nodeagents/<agentname>/<instance>/imq

To:

    glassfish3/glassfish/nodes/<agentname>/<instance>/imq
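
On a Unix-style system, that copy amounts to the following (substitute your own agent and instance names for the placeholders; if the new instance directory already contains an imq directory, move it aside first):

    cp -r glassfish/nodeagents/<agentname>/<instance>/imq glassfish3/glassfish/nodes/<agentname>/<instance>/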

Then you're ready to start everything up with asadmin start-cluster <cluster_name> and the upgrade of your cluster is complete. Please see the upgrade guide linked above for full information. Happy upgrading!
