By now you have probably read documentation or previous blog posts about how Zero Downtime Patching provides a convenient, automated method of updating a WebLogic domain in a rolling fashion. By automating the process, Zero Downtime Patching saves significant time and eliminates the potential for human error in an otherwise repetitive procedure. In addition, there are special features around replicated HTTP sessions that ensure end users do not lose their session at any point during the rollout. Let's explore the technical details of maintaining session state during Zero Downtime Patching.
One of the key aspects of the WLS replicated session persistence contract is that a session can be maintained within the cluster even in the rare situation where a server crashes. However, the contract cannot guarantee that sessions survive when more than one server goes down within a short time period. This is because the session has a single copy replicated to some secondary server within the cluster, and the session is only replicated when the client makes a request to update it, so that the client's cookie can store a reference to the secondary server. Thus, if the primary server were to go down, and then the secondary server were to go down before a subsequent client request could update the session, the session would be lost. The rolling nature of Zero Downtime Patching fits exactly this pattern, so it must take extra care to avoid losing sessions. Administrators may have already observed how easy it is to lose sessions by restarting the servers in a cluster one at a time.
Before we go into the technical details of how Zero Downtime Patching avoids losing sessions, it is important to note that the entire methodology relies on Oracle Traffic Director for load balancing, dynamic discovery, health checks, and session failover handling. On top of that setup, the ZDT rollout uses three key features directly to prevent the loss of sessions:
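For context, the rollout itself is typically started from WLST connected to the Admin Server. The sketch below is illustrative only; the exact command, arguments, and option names (such as `isSessionCompatible`) vary by release and patch type, so check the ZDT documentation for your version:

```python
# WLST sketch (illustrative; verify arguments against the ZDT documentation).
# Connects to the Admin Server and rolls a patched Oracle Home across a cluster.
connect('admin_user', 'admin_password', 't3://adminhost:7001')
rolloutOracleHome('Cluster1',                  # target cluster to roll
                  '/opt/patched_oracle_home',  # patched Oracle Home to roll out
                  '/opt/backup_oracle_home',   # backup location for the old home
                  'isSessionCompatible=false') # sessions are not assumed compatible
                                               # across versions, so the session
                                               # preservation handling is engaged
```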
1. Preemptive Session Replication - Session data is preemptively propagated to another server in the cluster during graceful shutdown when necessary. To examine this in more detail, consider the scenario where the ZDT rollout has shut down the server holding the HTTP session, and the next step is to shut down the server holding the replica. In that case, WebLogic Server can detect during shutdown that the session would be lost because no backup copy would remain within the cluster, so the ZDT rollout ensures that WebLogic Server replicates that session to another server within the cluster.
The illustration below shows the problematic scenario where the server holding the primary copy of the session, s1, is shut down, followed by the shutdown of the server holding the secondary or replica copy, s2. The ZDT orchestration signals that s2 should preemptively replicate any sessions for which it holds the only remaining copy before shutting down. Thus there is always a copy available within the cluster.
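The shutdown-time decision can be sketched as follows. This is not WebLogic's internal implementation, just a minimal simulation with hypothetical names (`Cluster`, `graceful_shutdown`, and so on) of the rule "before stopping, replicate any session for which this server holds the last copy":

```python
# Illustrative sketch, not WebLogic internals: each session lives on one or
# two servers; a gracefully stopping server first pushes any lone copies
# to a surviving server so no session is lost.

class Cluster:
    def __init__(self, names):
        self.servers = {n: {} for n in names}  # server name -> {session_id: data}

    def live(self, exclude=()):
        return [n for n in self.servers if n not in exclude]

    def copies(self, sid):
        return [n for n, sessions in self.servers.items() if sid in sessions]

    def graceful_shutdown(self, name):
        for sid, data in self.servers[name].items():
            if self.copies(sid) == [name]:             # last copy in the cluster
                target = self.live(exclude=[name])[0]  # pick a surviving server
                self.servers[target][sid] = data       # preemptive replication
        del self.servers[name]

cluster = Cluster(["s1", "s2", "s3", "s4"])
cluster.servers["s1"]["sess-42"] = {"user": "alice"}  # primary copy
cluster.servers["s2"]["sess-42"] = {"user": "alice"}  # replica copy

cluster.graceful_shutdown("s1")  # replica still on s2, nothing extra to do
cluster.graceful_shutdown("s2")  # s2 holds the last copy, so it replicates first
print(cluster.copies("sess-42"))  # the session survives on a remaining server
```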
2. Session State Query Protocol - Because WebLogic Server relies on associating an HTTP session with a primary server and a secondary server, it is not sufficient for the session simply to exist somewhere in the cluster; the session must also be findable when the client request lands on an arbitrary server within the cluster. The ZDT rollout enables WebLogic Server instances to query other servers in the cluster for specific sessions when they do not have their own copy.
The diagram above shows that an incoming request to a server without the session can trigger a query; once the session is found within the cluster, it is fetched so that the request can be served on that server, s4.
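The query-and-fetch step can be sketched like this. Again the names are hypothetical and the peer query is shown as a simple in-process loop rather than WebLogic's actual cluster protocol:

```python
# Illustrative sketch: a server that receives a request for a session it does
# not hold asks its peers for it, then fetches a local copy before serving.

class Server:
    def __init__(self, name):
        self.name, self.sessions, self.peers = name, {}, []

    def has(self, sid):
        return sid in self.sessions

    def handle_request(self, sid):
        if not self.has(sid):
            # Query every peer in the cluster for this session id.
            holder = next((p for p in self.peers if p.has(sid)), None)
            if holder is None:
                raise KeyError(f"session {sid} lost")
            self.sessions[sid] = holder.sessions[sid]  # fetch a local copy
        return self.sessions[sid]

s3, s4 = Server("s3"), Server("s4")
s3.peers, s4.peers = [s4], [s3]
s3.sessions["sess-42"] = {"cart": ["book"]}

# The request lands on s4, which has no copy: it queries the cluster,
# finds the session on s3, and fetches it to serve the request.
print(s4.handle_request("sess-42"))
```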
3. Orphaned Session Cleanup - Once we combine the ability to preemptively replicate session instances with the ability to fetch sessions from within the cluster, we must also take a more active approach to cleaning up the instances that are fetched. Historically, WebLogic Server has not had to worry much about orphaned sessions: front-end load balancers and web servers have been required to honor the session's server affinity, and in the rare case that a request landed on a server containing neither the primary nor the secondary copy, the session would be fetched from the primary or secondary server and the orphaned copy would simply be left to be removed on timeout or at some other regular interval. It was assumed that because the pointer to the session had changed, the stale stored copy would never be used again. However, the ZDT rollout repeatedly presents the scenario where a session must be found within the cluster and fetched from the server that holds it. Not only can the number of session instances proliferate - all various versions of the same session - but the cluster is now queried for the copy, and that query must never find a stale copy, only the current replica of the session.
The above illustration shows the cleanup action after s4 has fetched the session data to serve the incoming request: s4 sends a cleanup request to s3 to ensure no stale copy of the session is left within the cluster.
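The cleanup step can be sketched as a small extension of the fetch: after copying the session, the fetching server tells the old holder to discard its copy, so a later cluster query can never see a stale version. As before, the names are hypothetical:

```python
# Illustrative sketch: fetch a session from the server that holds it, then
# send a cleanup so the old holder drops its now-orphaned copy.

class Server:
    def __init__(self, name):
        self.name, self.sessions, self.peers = name, {}, []

    def fetch_and_cleanup(self, sid):
        holder = next((p for p in self.peers if sid in p.sessions), None)
        if holder is not None:
            self.sessions[sid] = holder.sessions[sid]
            del holder.sessions[sid]  # orphaned-copy cleanup on the old holder
        return self.sessions.get(sid)

s3, s4 = Server("s3"), Server("s4")
s3.peers, s4.peers = [s4], [s3]
s3.sessions["sess-42"] = {"step": 7}

s4.fetch_and_cleanup("sess-42")
print("sess-42" in s3.sessions)  # False: no stale copy remains on s3
```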
Now, during ZDT Patching we can shut down s1 and expect that any lone session copies will be propagated to s2 without the client's knowledge. When the client does send another request, WLS will be able to handle it by querying the cluster to find the session data. The data will be fetched and used on the server handling the request; the orphaned copy will be cleaned up, and the server handling the request will go through the usual process of choosing its preferred secondary server to store the replica.
For more information about Zero Downtime Patching, see the documentation.