In this post I’ll discuss a recent incident we encountered due to surprising behavior of the HTTP protocol. I’ll describe the problem, the analysis process, how we solved the problem, and some conclusions we drew from this experience.
During one of our recent production releases, we introduced a performance regression in our “update application” REST API call. The regression was part of a system-wide performance improvement process, and it was monitored by our automation team. We decided to roll it out to production, expecting no major effect on the user experience. But if things can go wrong, they do. After this rollout, one of our customers began to encounter strange errors and failures, and couldn’t update his application.
Checking our server logs showed that although the update operation took more than 30 seconds, it did end successfully. However, taking a deeper look at the system logs, we found something rather disturbing: after the original update application request was received, we kept getting update HTTP requests every 30 seconds, from the same user – non-stop!
In order to support concurrent editing of our applications and keep data consistent, we keep a version for each application (and use an optimistic locking mechanism to settle versioning conflicts). Each HTTP POST request for update application sends this version, which the server validates against the latest stored version. If the update is successful, the version number is incremented and returned to the client. We realized that the multiple update requests put the client out of sync with the server’s version, which caused the failures observed by the user.
The only thing that finally stopped those repeating HTTP requests was closing the user’s browser.
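To make the failure mode concrete, here is a minimal sketch of a version check of this kind. The class and method names are hypothetical, and an in-memory counter stands in for our real storage; it only illustrates why a repeated request carrying an already-used version is rejected.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for one stored application record.
class VersionedApplication {
    private final AtomicLong version = new AtomicLong(1);

    /**
     * Applies an update only if the client sent the current version.
     * Returns the new version on success, or -1 to signal a conflict.
     */
    long update(long clientVersion) {
        long next = clientVersion + 1;
        // compareAndSet succeeds only while the stored version still equals clientVersion.
        return version.compareAndSet(clientVersion, next) ? next : -1;
    }

    long currentVersion() {
        return version.get();
    }
}
```

With this sketch, a first `update(1)` succeeds and moves the record to version 2, while a second, identical `update(1)` reports a conflict. If the client only ever sees the outcome of the second call, it never learns the new version.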
Can the browser possibly retry the long-running requests? As we know, HTTP POST requests have side effects. How can it possibly be valid to automatically repeat them?
Well, some research on the HTTP spec led us to this quite uncomfortable finding:
“If an HTTP/1.1 client sends a request which includes a request body, but which does not include an Expect request-header field with the "100-continue" expectation, and if the client is not directly connected to an HTTP/1.1 origin server, and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request.” (W3, Hypertext Transfer Protocol -- HTTP/1.1)
It turned out that in some cases the browser SHOULD (and as we saw, modern browsers actually do) repeat HTTP requests, even POSTs. By capturing the requests on the server side, we saw that the repeated requests were identical and that there is no way to know, from the server’s perspective, that they are part of a retry. Also, on the browser side, there’s no trace of this (e.g., opening the network view doesn’t show the retry). We also noted that the browser only cares about the latest retry and not about the previous requests.
That led us to quickly verify all of our connection-timeout configurations along the stack – EC2’s ELB, NGINX, and Tomcat configurations. They were all way above the reported execution time of the update application operation.
Finally, we asked the customer to try working from a non-managed computer outside his enterprise network, meaning no proxy, no firewall, and no corporate IT configurations on the computer itself. The repeating requests were gone, and so were the errors. Therefore, we assumed that one of those components was probably closing connections that stayed open longer than 30 seconds, which led to the HTTP retries.
While still working on system-wide performance fixes, we had to find some kind of a solution – and fast. Knowing we can assume nothing about our clients’ network topology, and knowing we don’t have any quick performance solution for this specific API, we needed to somehow deal with repeating requests. So we decided to patch up this specific API, marking every request with a GUID and ignoring any request whose GUID had already been handled.
We also sought a general solution for handling the HTTP retry issue. As already mentioned, from the server’s perspective, retry requests are identical, so you cannot differentiate between them. One way to solve this is to compute some kind of hash or checksum of the request payload and use it to recognize repeated requests. A simpler way is to attach a GUID to each request (but this requires client-side implementation).
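As a rough illustration of the hashing option, the sketch below fingerprints a request body with SHA-256 from the JDK. The class name `PayloadFingerprint` is hypothetical, and the choice of SHA-256 is an assumption; any stable digest would serve the same purpose.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: identical payloads yield identical fingerprints.
class PayloadFingerprint {
    /** Returns the SHA-256 digest of the payload as a lowercase hex string. */
    static String of(String payload) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

One trade-off with this approach is that two legitimate requests that happen to carry the same payload get the same fingerprint, so they look like a retry. A GUID generated per request avoids that ambiguity.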
Let’s assume we go with the GUID approach. Generally speaking, the solution algorithm will be as follows. For each HTTP request:
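One plausible reading of the steps, based on the requirements discussed in this post: read the request’s GUID, return the stored result if that GUID was already handled, and otherwise handle the request and store its result. Below is a minimal single-JVM sketch of that idea. `RetryDeduplicator` and `handleOnce` are hypothetical names, and a `ConcurrentHashMap` stands in for a cache shared across servers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical helper: runs a handler at most once per request GUID.
class RetryDeduplicator<R> {
    private final Map<String, R> resultsByGuid = new ConcurrentHashMap<>();

    /**
     * Returns the stored result for a GUID that was already handled;
     * otherwise runs the handler and stores its (non-null) result.
     */
    R handleOnce(String guid, Supplier<R> handler) {
        // computeIfAbsent does the check and the store atomically per key,
        // so a concurrent retry with the same GUID waits for the first attempt.
        return resultsByGuid.computeIfAbsent(guid, g -> handler.get());
    }
}
```

Calling `handleOnce` twice with the same GUID runs the handler once and hands the same result back to both callers, which is the behavior we want for a retried POST.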
To make this work for every HTTP request, you can write a special servlet-filter for it. For example:
For each HTTP request, the HttpRetryHandlingFilter will implement the algorithm above.
We need to maintain all successfully handled GUIDs and their results in some kind of a cache. We want this cache to be synced between all our servers in the cluster. Moreover, checking if a request was already handled is a critical section, so we need some way to do distributed locking efficiently.
All those requirements can be easily implemented using Hazelcast, which is an in-memory distributed data grid. It supports both a key-value cache and a locking mechanism. Here is an example of a simple distributed memory-locks service using Hazelcast. We manage locks by name.
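The sketch below shows the shape such a service could take. To stay dependency-free it is written against the JDK’s `java.util.concurrent.locks.Lock` interface, with a hypothetical `NamedLockService` class and a single-JVM `local()` provider. Hazelcast’s lock types implement that same interface, so a clustered provider could be plugged in instead; the call for obtaining a Hazelcast lock depends on the Hazelcast version and is not shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical lock service: callers name a lock, the provider decides where it lives.
class NamedLockService {
    private final Function<String, Lock> lockProvider;

    NamedLockService(Function<String, Lock> lockProvider) {
        this.lockProvider = lockProvider;
    }

    // Single-JVM stand-in (an assumption for this sketch); a clustered
    // deployment would pass a provider backed by the data grid instead.
    static NamedLockService local() {
        ConcurrentMap<String, Lock> locks = new ConcurrentHashMap<>();
        return new NamedLockService(
                name -> locks.computeIfAbsent(name, n -> new ReentrantLock()));
    }

    /** Runs the critical section while holding the lock with the given name. */
    <T> T withLock(String name, Supplier<T> criticalSection) {
        Lock lock = lockProvider.apply(name);
        lock.lock();
        try {
            return criticalSection.get();
        } finally {
            // Always release, even if the critical section throws.
            lock.unlock();
        }
    }
}
```

Using the request GUID as the lock name means that two retries of the same request serialize on each other, while unrelated requests do not block one another.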
Now we can define the GUID requests/results cache. We define a simple map with a 10-minute TTL and a maximum of 1000 items.
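To illustrate what those two limits mean in practice, here is a single-JVM stand-in written with the JDK only. `ExpiringResultCache` is a hypothetical class, not a Hazelcast API; in a cluster, the equivalent limits would be set in the Hazelcast map configuration rather than coded by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Single-JVM stand-in for the clustered map: entries expire after a TTL,
// and the oldest entry is evicted once the size limit is exceeded.
class ExpiringResultCache<V> {
    private static final class TimedValue<T> {
        final T value;
        final long expiresAtMillis;

        TimedValue(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock;
    private final Map<String, TimedValue<V>> entries;

    ExpiringResultCache(final int maxItems, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // Insertion-ordered map that drops its eldest entry when it grows past maxItems.
        this.entries = new LinkedHashMap<String, TimedValue<V>>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, TimedValue<V>> eldest) {
                return size() > maxItems;
            }
        };
    }

    synchronized void put(String guid, V result) {
        entries.put(guid, new TimedValue<>(result, clock.getAsLong() + ttlMillis));
    }

    /** Returns the stored result, or null if the GUID is unknown or its entry expired. */
    synchronized V get(String guid) {
        TimedValue<V> entry = entries.get(guid);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(guid);
            return null;
        }
        return entry.value;
    }
}
```

The settings described above would correspond to `new ExpiringResultCache<>(1000, 10 * 60 * 1000L, System::currentTimeMillis)`. The TTL only needs to outlive the window in which retries of one request can still arrive.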
Finally, here is the general skeleton for the HTTP filter:
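The sketch below shows only the control flow of such a filter. To keep it runnable without a servlet container, `SimpleRequest` and the `chain` function are hypothetical stand-ins for the servlet request and the filter chain, the header name `X-Request-Guid` is an assumption, and plain in-memory maps stand in for the distributed lock service and cache. A real filter would also have to capture the response through a response wrapper in order to replay it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Control-flow skeleton only; see the assumptions listed above.
class HttpRetryHandlingFilter {
    static final String GUID_HEADER = "X-Request-Guid"; // assumed header name

    // Hypothetical stand-in for the servlet request.
    static final class SimpleRequest {
        final Map<String, String> headers;
        final String body;

        SimpleRequest(Map<String, String> headers, String body) {
            this.headers = headers;
            this.body = body;
        }
    }

    private final Map<String, String> resultsByGuid = new ConcurrentHashMap<>();
    private final Map<String, Lock> locksByGuid = new ConcurrentHashMap<>();

    String doFilter(SimpleRequest request, Function<SimpleRequest, String> chain) {
        String guid = request.headers.get(GUID_HEADER);
        if (guid == null) {
            // No GUID: nothing to deduplicate, pass straight through.
            return chain.apply(request);
        }
        Lock lock = locksByGuid.computeIfAbsent(guid, g -> new ReentrantLock());
        lock.lock(); // critical section: the check and the handling must not interleave
        try {
            String cached = resultsByGuid.get(guid);
            if (cached != null) {
                return cached; // retry: replay the stored response
            }
            String response = chain.apply(request);
            if (response != null) {
                resultsByGuid.put(guid, response);
            }
            return response;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch the lock map is never pruned; that is acceptable for an illustration, but a long-running server would need to bound it just like the results cache.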