How to Remediate Application Slowness Due To Incomplete DNS Resolutions

In my role as a Solutions Architect, I encountered instances of application slowness across Oracle internal workloads that had been migrated to Oracle Cloud Infrastructure. Andy Herm, Cloud Architect, and Jim Sirk, Cloud Network Architect, discovered that the slowness was caused by incomplete DNS resolutions.

We wrote this blog post to help other Oracle Cloud Infrastructure users troubleshoot and resolve this issue.  

Why is This Happening?

The issue is related to glibc (starting in 2.9) issuing both IPv4 (A) and IPv6 (AAAA) DNS queries from the client. The IPv6 query doesn’t get a response back from our custom DNS and times out, causing a 5-second delay. For more details, check out Unix & Linux Stack Exchange.
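
A quick way to check whether a client is affected is to time a dual-stack lookup against an IPv4-only one. The sketch below assumes a Linux host with GNU date and glibc's getent; as far as I know, "getent ahosts" resolves through getaddrinfo() with AF_UNSPEC (A and AAAA queries) while "getent ahostsv4" asks for A records only. The function name time_lookup is hypothetical.

```shell
# usage: time_lookup <database> <hostname>
#   database "ahosts"   : dual-stack lookup (A and AAAA queries)
#   database "ahostsv4" : IPv4-only lookup (A query)
# Prints the elapsed time in milliseconds.
time_lookup() {
  local db="$1" host="$2" start end
  start=$(date +%s%N)                        # nanoseconds since the epoch (GNU date)
  getent "$db" "$host" >/dev/null 2>&1 || true   # a failed lookup still gets timed
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}
```

On an affected client, "time_lookup ahosts ddpt0jnsc0.xxx.com" would be expected to take roughly 5000 ms longer than "time_lookup ahostsv4 ddpt0jnsc0.xxx.com".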

One option is to have the clients send the IPv4 and IPv6 queries separately. But this means that you would have to touch all the existing clients that have been migrated to Oracle Cloud Infrastructure. We took the following steps to troubleshoot the issue.

Troubleshooting Steps

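We captured packets on the client servers to identify and isolate the issue. In the capture below (highlighted in yellow in the original post), the A and AAAA queries go out together at 07:09:37, but only the A response comes back. The client waits 5 seconds and then resends both queries at 07:09:42.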

[root@ddpt0jnsb0 tmp]# tcpdump -nvvv -i ens3 host x.y.z.67 -w /var/tmp/tcpdump_byhost.pcap
[root@ddpt0jnsb0 tmp]# tcpdump -r tcpdump_byhost.pcap  > tcpdump_byhost.txt
 
07:09:37.292378 IP ddpt0jnsb0.xxx.com.64168 > x.y.z.67.domain: 49707+ A? ddpt0jnsc0.xxx.com. (59)
07:09:37.292396 IP ddpt0jnsb0.xxx.com.64168 > x.y.z.67.domain: 10048+ AAAA? ddpt0jnsc0.xxx.com. (59)
07:09:37.292933 IP x.y.z.67.domain > ddpt0jnsb0.xxx.com.64168: 49707 1/6/0 A x.y.z.24 (232)
07:09:42.297054 IP ddpt0jnsb0.xxx.com.64168 > x.y.z.67.domain: 49707+ A? ddpt0jnsc0.xxx.com. (59)
07:09:42.297583 IP x.y.z.67.domain > ddpt0jnsb0.xxx.com.64168: 49707 1/6/0 A x.y.z.24 (232)
07:09:42.297638 IP ddpt0jnsb0.xxx.com.64168 > x.y.z.67.domain: 10048+ AAAA? ddpt0jnsc0.xxx.com. (59)
07:09:42.300937 IP x.y.z.67.domain > ddpt0jnsb0.xxx.com.64168: 10048 0/1/0 (124)
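
To spot the same pattern in a longer capture, the text output can be scanned for queries that are sent more than once with the same ID. This is a minimal sketch, assuming the line layout shown above and a capture that does not cross midnight; find_dns_retries is a hypothetical helper name.

```shell
# usage: find_dns_retries [tcpdump-text-file ...]   (reads stdin if no file is given)
# Reports every DNS query that is re-sent with the same ID, type and name,
# together with the delay since it was first seen.
find_dns_retries() {
  awk '
    $2 == "IP" && $7 ~ /\?$/ {               # query lines carry a type such as "A?" in field 7
      split($1, t, ":")                      # timestamp HH:MM:SS.ffffff
      now = t[1] * 3600 + t[2] * 60 + t[3]
      key = $6 " " $7 " " $8                 # query ID, type, name
      if (key in first)
        printf "%s %s retried after %.3f s\n", $7, $8, now - first[key]
      else
        first[key] = now
    }
  ' "$@"
}
```

For example, "find_dns_retries tcpdump_byhost.txt" run on the capture above should flag both the A and the AAAA query as retried after about 5 seconds.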

Additional tests show that the behavior impacts normal operations.

[rgbu_ui@ddpt0jnsb0 ~]$ time ssh -o StrictHostKeyChecking=yes ddpt0jnsc0.xxx.com
No ECDSA host key is known for ddpt0jnsc0.xxx.com and you have requested strict checking.
Host key verification failed.
 
real    0m5.032s
user    0m0.007s
sys     0m0.005s
 
[rgbu_ui@ddpt0jnsb0 ~]$ nslookup ddpt0jnsc0.xxx.com
Server:         x.y.z.67
Address:        x.y.z.67#53
 
Non-authoritative answer:
Name:   ddpt0jnsc0.xxx.com
Address: x.y.z.24
 
[rgbu_ui@ddpt0jnsb0 ~]$ time ssh -o StrictHostKeyChecking=yes x.y.z.24
No ECDSA host key is known for x.y.z.24 and you have requested strict checking.
Host key verification failed.
 
real    0m0.028s
user    0m0.006s
sys     0m0.005s

Running strace on either command shows it pausing in poll(), waiting for a response from the DNS server, timing out, and then retrying. The poll() timeout is 5000 ms, which likely also explains the almost exact 5-second increase in time.

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("x.y.z.67")}, 16) = 0
poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
sendmmsg(3, {{{msg_name(0)=NULL, msg_iov(1)=[{"\177\242\1\0\0\1\0\0\0\0\0\0\nddpt0jnsd0\3iad\7icst"..., 59}], msg_controllen=0, msg_flags=0}, 59}, {{msg_name(0)=NULL, msg_iov(1)=[{"\226\270\1\0\0\1\0\0\0\0\0\0\nddpt0jnsd0\3iad\7icst"..., 59}], msg_controllen=0, msg_flags=0}, 59}}, 2, MSG_NOSIGNAL) = 2
poll([{fd=3, events=POLLIN}], 1, 5000)  = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [232])               = 0
recvfrom(3, "\177\242\201\200\0\1\0\1\0\6\0\0\nddpt0jnsd0\3iad\7icst"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("x.y.z.67")}, [16]) = 232
poll([{fd=3, events=POLLIN}], 1, 4998
    ... it pauses here ...
)  = 0 (Timeout)
poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
sendto(3, "\267\5\1\0\0\1\0\0\0\0\0\0\nddpt0jnsc0\3iad\7icst"..., 59, MSG_NOSIGNAL, NULL, 0) = 59
poll([{fd=3, events=POLLIN}], 1, 5000)  = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [232])               = 0
recvfrom(3, "\267\5\201\200\0\1\0\1\0\6\0\0\nddpt0jnsc0\3iad\7icst"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("x.y.z.67")}, [16]) = 232
poll([{fd=3, events=POLLOUT}], 1, 4999) = 1 ([{fd=3, revents=POLLOUT}])
sendto(3, "<0\1\0\0\1\0\0\0\0\0\0\nddpt0jnsc0\3iad\7icst"..., 59, MSG_NOSIGNAL, NULL, 0) = 59
poll([{fd=3, events=POLLIN}], 1, 4998)  = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [124])               = 0
brk(NULL)                               = 0x5611833af000
brk(0x5611833de000)                     = 0x5611833de000
recvfrom(3, "<0\201\200\0\1\0\0\0\1\0\0\nddpt0jnsc0\3iad\7icst"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("x.y.z.67")}, [16]) = 124
close(3)                                = 0

…and then it goes on to open a port 22 connection to the IP address.
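
The trace above does not show how strace was invoked. The following is a minimal sketch of one way to collect a comparable trace, assuming strace is installed; the wrapper name trace_resolver and the output path are hypothetical, and the filter only keeps network calls and poll(), so calls such as ioctl() or brk() will not appear.

```shell
# usage: trace_resolver <output-file> <command> [args...]
trace_resolver() {
  local out="$1"
  shift
  # -f  follow child processes
  # -tt print wall-clock timestamps with microseconds
  # -T  append the time spent inside each call
  strace -f -tt -T -e trace=network,poll -o "$out" "$@"
}
```

For example, "trace_resolver /var/tmp/ssh.strace ssh -o StrictHostKeyChecking=yes ddpt0jnsc0.xxx.com" followed by "grep Timeout /var/tmp/ssh.strace" should surface the poll() call that runs into the 5-second timeout.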

Possible Solution

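This could be resolved by adding ‘options single-request-reopen’ to /etc/resolv.conf on each client to address the 5-second delay. With this option, the resolver closes the socket and opens a new one before sending the second request, which is why the two queries below use different source ports. After that change, the output of tcpdump looks like the following: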

07:55:28.647859 IP ddpt0jnsb0.xxx.com.36657 > x.y.z.67.domain: 5112+ A? ddpt0jnsc0.xxx.com. (59)
07:55:28.648469 IP x.y.z.67.domain > ddpt0jnsb0.xxx.com.36657: 5112 1/6/0 A x.y.z.24 (232)
07:55:28.648547 IP ddpt0jnsb0.xxx.com.46795 > x.y.z.67.domain: 28682+ AAAA? ddpt0jnsc0.xxx.com. (59)
07:55:28.648945 IP x.y.z.67.domain > ddpt0jnsb0.xxx.com.46795: 28682 0/1/0 (124)
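
The single-request-reopen option is a glibc resolver option that is read from /etc/resolv.conf. A minimal sketch of applying it idempotently on a client follows; the function name enable_single_request_reopen is hypothetical, and the sketch assumes that resolv.conf is not regenerated by DHCP or a network manager on the host (if it is, the change may be overwritten and has to be made in that tool's configuration instead).

```shell
# usage: enable_single_request_reopen [path-to-resolv.conf]
enable_single_request_reopen() {
  local conf="${1:-/etc/resolv.conf}"
  # Nothing to do if an options line already carries the flag.
  if grep -Eq '^[[:space:]]*options[[:space:]].*single-request-reopen' "$conf"; then
    return 0
  fi
  # Make sure the file ends with a newline before appending.
  if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
    printf '\n' >> "$conf"
  fi
  printf 'options single-request-reopen\n' >> "$conf"
}
```

Running it a second time leaves the file unchanged, so it can be rolled out safely by a configuration management tool.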

But that's not ideal, because the change has to be made on every client. The better way to handle it is to change to stateless rules for DNS so that we don't have to modify the clients at all.

The Recommended Solution

When clients initiate DNS queries to their resolver, by default they send both an A and an AAAA request to the name server over the same socket, and therefore from the same source port. Both queries are issued concurrently. With stateful rules, the state table entry gets removed when the first response comes back, so the second response is dropped.

By allowing ingress/egress traffic to be stateless for DNS (TCP/UDP 53), the second response from the name server is no longer dropped.

In this example, the VCN is the /18 aggregate subnet. It is also necessary to update the security list, allowing the clients to send DNS queries to the DNS servers.
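
The stateless rules for the DNS servers might be expressed as JSON along the following lines. This is a sketch under assumptions: the field names follow the OCI security rule model as I understand it (isStateless, udpOptions, tcpOptions, destinationPortRange, sourcePortRange), the helper name dns_stateless_rules is hypothetical, and 10.0.0.0/18 in the usage note is a placeholder for the client CIDR. Because stateless rules do not track connections, the DNS servers need both an ingress rule to destination port 53 and an egress rule from source port 53.

```shell
# usage: dns_stateless_rules ingress|egress <client-cidr>
# Prints a JSON array with one UDP (protocol 17) and one TCP (protocol 6) rule.
dns_stateless_rules() {
  local direction="$1" cidr="$2" peer_key port_key
  case "$direction" in
    ingress) peer_key=source;      port_key=destinationPortRange ;;  # queries arriving at port 53
    egress)  peer_key=destination; port_key=sourcePortRange ;;       # answers leaving from port 53
    *) echo "usage: dns_stateless_rules ingress|egress <cidr>" >&2; return 1 ;;
  esac
  cat <<EOF
[
  {"protocol": "17", "$peer_key": "$cidr", "isStateless": true,
   "udpOptions": {"$port_key": {"min": 53, "max": 53}}},
  {"protocol": "6", "$peer_key": "$cidr", "isStateless": true,
   "tcpOptions": {"$port_key": {"min": 53, "max": 53}}}
]
EOF
}
```

The output of "dns_stateless_rules ingress 10.0.0.0/18" could then be merged into the existing rules in the console or passed to the OCI CLI; to my knowledge an update through the CLI replaces the whole rule list, so existing rules should be included. The client-side security list needs the mirror image: egress to destination port 53 and ingress from source port 53.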

Here is the security list for the DNS servers (shown as screenshots in the original post).

[Screenshot: Ingress Rules (Stateless)]

[Screenshot: Egress Rules (Stateless)]

 

We hope this blog post helps you address any application slowness due to incomplete DNS resolutions. I'd like to recognize Andy Herm, Cloud Architect, and Jim Sirk, Cloud Network Architect, who were instrumental in troubleshooting the issue. Ryan Otis from the Oracle Cloud Infrastructure team also helped with this investigation.

I have replaced the FQDN with xxx.com and the real IP addresses for the hosts with x.y.z.67 and x.y.z.24 so that our internal IPs are not exposed.

See more guidance on resolving common issues with DNS on Oracle Cloud Infrastructure here.

 

Join the discussion

Comments ( 1 )
  • Jon-Eric Eliker Saturday, October 27, 2018
    Excellent write-up and research! Thank you for sharing this. I’ve stashed away a link back to this article as it seems to be a likely issue to encounter sooner or later.