Recently, conntrack started dropping connections on one of our memcache nodes at work. It turned out we were receiving a lot more traffic than usual, but nothing that should have stressed the network stack that badly: we were only in the thousands of established connections. We got bitten by the number of TIME_WAIT connections, which filled up the conntrack buckets. What follows is an analysis of what happened and how we fixed it.
When two nodes communicate, they have to perform 3 steps:
- TCP handshake: client sends TCP SYN, server responds TCP SYN/ACK, client sends TCP ACK
- Data exchange: client sends its request in TCP PUSH/ACK (1 or several packets), server responds in TCP PUSH/ACK (1 or several packets)
- Connection closing: client sends TCP FIN to server, server sends TCP ACK to client, server sends TCP FIN to client, client sends TCP ACK to server
So, for a classic memcache request from a client to a server, we will have at least 10 packets exchanged, and possibly around 15 packets in total if we query webforms.
Under high load, the average RTT between a client and a memcache node is ~0.6 milliseconds. Counting roughly 0.6 ms for each of those ~15 packets, a connection from beginning to end should take ~9 ms. That's 0.009 seconds.
In the default kernel configuration, the connection then stays in the TIME_WAIT state for 60 seconds, in case a delayed or retransmitted packet from that connection still arrives and the kernel needs to answer it (by re-acknowledging a retransmitted FIN, for example, or by sending a RST). But on a gigabit LAN, where the RTT should be well below a millisecond, it's just not plausible that a packet would be stuck in routers for 60 seconds.
The side effect of a 60-second TIME_WAIT is that conntrack must keep track of the connection during that time. So a connection takes 0.009s to process, and is then tracked for 60s. During that time, the conntrack buckets fill up and potentially overflow.
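To see how quickly this adds up, here is a back-of-the-envelope sketch. It assumes a steady rate of new connections and applies Little's law (entries in the table = arrival rate x time each entry lives). The function names and the 1,000 connections per second figure are illustrative assumptions, not measurements from the incident; 65536 is the conntrack limit mentioned in the conntrack note further down.

```python
def tracked_entries(conn_per_sec, active_secs=0.009, time_wait_secs=60.0):
    """Steady-state number of tracked connections (Little's law)."""
    return conn_per_sec * (active_secs + time_wait_secs)


def max_sustainable_rate(table_limit, active_secs=0.009, time_wait_secs=60.0):
    """Highest steady connection rate that keeps the table below table_limit."""
    return table_limit / (active_secs + time_wait_secs)


# Hypothetical load: 1,000 new connections per second.
entries = tracked_entries(1000)
limit_rate = max_sustainable_rate(65536)

print(f"{entries:.0f} tracked entries at 1000 conn/s")
print(f"a 65536-entry table fills at about {limit_rate:.0f} conn/s")
```

Under these assumptions, almost all of the table is occupied by connections that finished long ago: the 9 ms of actual work is negligible next to the 60 seconds spent in TIME_WAIT.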
In the kernel, a reuse parameter exists for this very purpose:
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical experts.
It can be activated as follows:
# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
Or, to make it permanent, add this to sysctl.conf:
net.ipv4.tcp_tw_reuse = 1
This kernel parameter should be enabled everywhere. It's just a smart reuse of network resources.
Note on Conntrack
While I do not believe it is necessary to increase the current conntrack limit from its 65536 value, we could very safely do so. A conntrack entry consumes about 400 bytes in the kernel (see /proc/slabinfo), which means tracking 1,000,000 connections would consume 400 MB of RAM. The current values can be read with:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
# cat /sys/module/nf_conntrack/parameters/hashsize
ip_conntrack_max represents the maximum number of tracked connections, while hashsize is the size of the hash table storing the lists of conntrack entries.
A conntrack entry is stored in a node of a linked list, and there are several lists, each list being an element in a hash table. So each hash table entry (also called a bucket) contains a linked list of conntrack entries.
Each bucket can hold ~8 connections (ip_conntrack_max = hashsize * 8). But to be conservative, the kernel sets these values so that ip_conntrack_max = hashsize * 4 by default.
We can safely increase both values by a factor of 8:
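The relationship between the two settings can be sketched as follows. This is a minimal illustration, not kernel code: the function names are made up for the example, the ratio of 4 is the default described above, and 131072 is just an example bucket count.

```python
def conntrack_max_for(hashsize, entries_per_bucket=4):
    """Connection limit implied by a bucket count and a target chain length."""
    return hashsize * entries_per_bucket


def avg_chain_length(tracked, hashsize):
    """Average linked-list length per bucket for a given number of entries."""
    return tracked / hashsize


hashsize = 131072  # example bucket count
limit = conntrack_max_for(hashsize)

print(f"hashsize={hashsize} -> ip_conntrack_max={limit}")
print(f"average chain length when full: {avg_chain_length(limit, hashsize)}")
```

Keeping the ratio low keeps each linked list short, so a lookup only has to walk a handful of entries even when the table is full.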
# echo 524288 > /proc/sys/net/ipv4/netfilter/ip_conntrack_max
# echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
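Or, to make them permanent, modify /etc/sysctl.conf instead (on kernels where the bucket count is read-only through sysctl, hashsize has to be set as an nf_conntrack module parameter at load time):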
net.ipv4.netfilter.ip_conntrack_max = 524288
net.ipv4.netfilter.ip_conntrack_buckets = 131072
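As a rough check on the memory cost of these limits, the estimate below uses the ~400 bytes per entry figure quoted above. That figure is the only input taken from the text; the helper name is hypothetical, and the real per-entry size depends on the kernel build, so it is worth confirming in /proc/slabinfo.

```python
CONNTRACK_ENTRY_BYTES = 400  # per-entry size quoted in the text; verify on your kernel


def conntrack_memory_mb(entries, entry_bytes=CONNTRACK_ENTRY_BYTES):
    """Approximate kernel memory (in MB, 10^6 bytes) used by a full conntrack table."""
    return entries * entry_bytes / 1_000_000


for entries in (65536, 524288, 1_000_000):
    print(f"{entries:>9} entries -> {conntrack_memory_mb(entries):6.1f} MB")
```

Even the enlarged limit of 524288 entries stays around 210 MB in the worst case, which is why raising it is a safe change on a machine with a few gigabytes of RAM.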
Note on local ports
When a client connects to a server, the connection needs a source IP, a source port, a destination IP and a destination port. The source IP, destination IP and destination port are static. But the source port is taken from the range made available by the kernel:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000
By default, this range is limited to the ports between 32768 and 61000. That means that one client can hold at most about 28,000 connections (61000 - 32768 = 28232) to a given server IP and port at any one time, including those still sitting in TIME_WAIT. This limit can safely be increased in sysctl.conf to use the range 5000 to 65000.
Add this line to sysctl.conf:
net.ipv4.ip_local_port_range = 5000 65000
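The effect of the wider range can be estimated with a short sketch. It assumes that, without tcp_tw_reuse, every source port stays unavailable for the 60 seconds of TIME_WAIT after its connection closes, and it counts the range as inclusive at both ends (hence one more port than the subtraction above). The function names are hypothetical.

```python
def port_count(low, high):
    """Number of source ports in an inclusive local port range."""
    return high - low + 1


def max_conn_rate(low, high, time_wait_secs=60.0):
    """Sustained new connections per second from one client to one server
    IP and port before source ports run out, if each port is held for
    time_wait_secs after use."""
    return port_count(low, high) / time_wait_secs


for low, high in ((32768, 61000), (5000, 65000)):
    print(f"{low}-{high}: {port_count(low, high)} ports, "
          f"about {max_conn_rate(low, high):.0f} new connections/s")
```

Under these assumptions the default range sustains roughly 470 new connections per second from a single client to a single memcache endpoint, and the wider range roughly doubles that.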