When two nodes communicate, they have to perform 3 steps (sketched in code after this list):
- TCP handshake: the client sends a SYN, the server responds with a SYN/ACK, and the client sends back an ACK
- Data exchange: the client sends its request in PSH/ACK packets (1 or several), and the server responds in PSH/ACK packets (1 or several)
- Connection closing: the client sends a FIN to the server, the server sends an ACK to the client, then the server sends its own FIN to the client, and the client sends a final ACK to the server
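Here is a minimal Python sketch of that lifecycle; the memcache host, port and key are made-up placeholders, and the comments map each call to the packets it triggers:

import socket

# 1. TCP handshake: connect() triggers SYN -> SYN/ACK -> ACK.
#    (10.0.0.5:11211 is a placeholder memcache node, not a real host.)
s = socket.create_connection(("10.0.0.5", 11211), timeout=1)

# 2. Data exchange: the request and the response each travel in one
#    or more PSH/ACK segments.
s.sendall(b"get some_key\r\n")
response = s.recv(4096)

# 3. Connection closing: close() starts FIN -> ACK -> FIN -> ACK, after
#    which the client side of the connection sits in TIME_WAIT.
s.close()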
So, for a classic memcache request from a client to a server, we will have at least 10 packets exchanged (3 for the handshake, 2 or more for the request and response plus their ACKs, and 4 for the close), and possibly around 15 packets total if we query webforms.
Under high load, the average RTT between a client and a memcache node is ~0.6 milliseconds. At ~15 packets per connection, a connection from beginning to end should take around 15 x 0.6 ms = ~9 ms. That's 0.009 seconds.
With the default kernel configuration, the closing side's socket will then stay in the TIME_WAIT state for 60 seconds. TIME_WAIT exists so that stray packets from the old connection (a retransmitted FIN, for example) can be handled cleanly instead of being mistaken for part of a new connection on the same ports. But on a gigabit LAN, with an RTT well below a millisecond, it is essentially impossible for a packet to linger in routers for 60 seconds.
The side effect of a 60-second TIME_WAIT is that conntrack must keep track of the connection during all of that time. So a connection takes 0.009s to process, and is then tracked for 60s. During that time, the conntrack table fills up, and potentially overflows.
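To watch this happen on a live host, one can count sockets stuck in TIME_WAIT; a minimal Python sketch reading /proc/net/tcp (state 06 is TIME_WAIT), equivalent to what ss -tan state time-wait lists:

def count_time_wait(path="/proc/net/tcp"):
    with open(path) as f:
        next(f)  # skip the header line
        # the 4th column is the socket state; 06 means TIME_WAIT
        return sum(1 for line in f if line.split()[3] == "06")

print(count_time_wait(), "sockets in TIME_WAIT")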
In the kernel, a reuse parameter exists for this very purpose:
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
It can be activated as follows:
# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
Or, to make it permanent, add this line to /etc/sysctl.conf:
net.ipv4.tcp_tw_reuse = 1
Some background from the HAProxy mailing list: http://marc.info/?l=haproxy&m=132784701528003&w=4
This kernel parameter should be enabled everywhere. It's just a smart reuse of network resources.
Note on Conntrack
While I do not believe it is necessary to increase ip_conntrack_max from its current value of 65536, we could very safely do so. A conntrack entry consumes ~400 bytes of kernel memory (see /proc/slabinfo), which means tracking 1,000,000 connections would consume ~400 MB of RAM:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
65536
# cat /sys/module/nf_conntrack/parameters/hashsize
16384
ip_conntrack_max represents the maximum number of tracked connections, while hashsize is the size of the hash table storing the lists of conntrack entries.
A conntrack entry is stored in a node of a linked list, and there are several lists, each list being an element in a hash table. So each hash table entry (also called a bucket) contains a linked list of conntrack entries.
source: http://wiki.khnet.info/index.php/Conntrack_tuning
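A toy Python model of that structure, with plain lists standing in for the kernel's linked lists:

# hash table of 16384 buckets, each holding a list of conntrack entries
hashsize = 16384
buckets = [[] for _ in range(hashsize)]

def track(src_ip, src_port, dst_ip, dst_port):
    # the kernel hashes the connection tuple to pick a bucket, then
    # links the new entry into that bucket's list
    entry = (src_ip, src_port, dst_ip, dst_port)
    buckets[hash(entry) % hashsize].append(entry)

track("10.0.0.1", 40001, "10.0.0.5", 11211)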
Each bucket can comfortably hold ~8 connections (ip_conntrack_max = hashsize * 8), but to be conservative, the kernel defaults to (ip_conntrack_max = hashsize * 4), i.e. ~4 entries per bucket.
We can safely increase both values by a factor of 8:
ip_conntrack_max=524288
hashsize=131072
# echo 524288 > /proc/sys/net/ipv4/netfilter/ip_conntrack_max
# echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
Or, to make them permanent, set the maximum in /etc/sysctl.conf:
net.ipv4.netfilter.ip_conntrack_max = 524288
hashsize, however, is a module parameter rather than a sysctl; to persist it, pass it as an option to the nf_conntrack module, for example in /etc/modprobe.d/nf_conntrack.conf:
options nf_conntrack hashsize=131072
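As a sanity check on these numbers, a tiny Python sketch computing the entries-per-bucket ratio and the memory cost at both the default and the new settings (using the ~400 bytes per entry figure from above):

for ip_conntrack_max, hashsize in ((65536, 16384), (524288, 131072)):
    # average chain length per bucket, and RAM if the table is full
    print(f"max={ip_conntrack_max}, hashsize={hashsize}: "
          f"{ip_conntrack_max // hashsize} entries/bucket, "
          f"~{ip_conntrack_max * 400 / 2**20:.0f} MB when full")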
Note on Local Ports
When a client connects to a server, it needs a source IP, a source port, a destination IP and a destination port. The source IP, destination IP and destination port are fixed, but the source port is picked from the range made available by the kernel:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000
By default, this range covers the ports between 32768 and 61000. That means one client will be able to open at most (61000 - 32768) = 28232 concurrent connections to a given server IP and port. This limit can safely be increased in sysctl.conf to use the range 5000 to 65000.
Add this line to /etc/sysctl.conf:
net.ipv4.ip_local_port_range = 5000 65000
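As a quick check of what the wider range buys, each concurrent connection to a single destination ip:port consumes one source port from this range:

# compare the default and the widened ephemeral port ranges
for low, high in ((32768, 61000), (5000, 65000)):
    print(f"range {low}-{high}: up to {high - low} concurrent connections")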