Quelques digressions sous GPL...


DIY eMail @ Home

A few weeks ago, I presented my personal email architecture to PLUG, the Philadelphia Linux User Group. I have been using this setup for years now, and every time someone sends me an email at julien[at]linuxwall.info, it reaches one of these servers.

A lot of this is documented on http://wiki.linuxwall.info. Maybe some day I'll get around to writing proper documentation. In the meantime, check out the links in the slides below.


Netfilter workshop at Fosscon 2012: the slides

A couple of weeks ago, I gave a 2-hour workshop at Fosscon Philadelphia on advanced Netfilter features. The workshop went well, and I will probably do it again. In the meantime, I posted the slides below. There is a video too, but the quality isn't great, and filming a workshop didn't turn out as well as I hoped it would.

The goal of the workshop is to demonstrate how netfilter, iptables, ipset and other tools available in Linux can be used to build complex firewall policies for dynamic environments. I mentioned, at the end, some of the work I've done with Chef and the AFW cookbook. It's good stuff, so check it out.
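To give an idea of what that looks like, here is a minimal sketch of a dynamic policy built around ipset (the set name, port and addresses are placeholders, not taken from the workshop material). The set can be updated at runtime without reloading the iptables rules:
# ipset create trusted_admins hash:ip
# ipset add trusted_admins 10.0.1.25
# iptables -A INPUT -p tcp --dport 22 -m set --match-set trusted_admins src -j ACCEPT
# iptables -A INPUT -p tcp --dport 22 -j DROP
Adding or removing an address in the set takes effect immediately, which is what makes this approach practical when hosts come and go frequently.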

Netfilter and Iptables talk at AWeber

Every Tuesday, at AWeber, we have a Tutorial Tuesday. Engineers submit topics they would like to present, and everyone votes for the ones they are interested in. A few weeks ago, I presented on Netfilter and Iptables. The goal was to give a quick overview of the Linux firewall, show some basic and advanced features, and give developers some tools to help them debug firewall issues.
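As an illustration of that last point, here are a few generic commands that are handy when debugging a firewall policy (examples, not a transcript of the talk):
# iptables -L INPUT -v -n --line-numbers
# watch -n1 'iptables -L INPUT -v -n'
# conntrack -L | head
The first one prints the per-rule packet and byte counters, the second watches them evolve in real time, and the third lists the connections currently tracked by conntrack (it comes with the conntrack-tools package).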

I will present some of this material again at FOSSCON Philadelphia, on August 11th. So if you are in the area, come to the workshop: http://fosscon.org/

Netfilter-JulienVehent-AWeber from Julien Vehent on Vimeo.

TCP tuning for the LAN

Recently, conntrack dropped connections on one of our memcache nodes at work. It turns out we were receiving a lot more traffic than usual, but nothing that should have stressed the network stack that badly: we were in the thousands of established connections. We got bitten by the number of TIME_WAIT connections that filled up the conntrack buckets. What follows is the analysis of what happened and how we fixed it.

When two nodes communicate, they have to perform 3 steps:
  1. TCP handshake: client sends TCP SYN, server responds with TCP SYN/ACK, client sends TCP ACK
  2. Data exchange: client sends its request in TCP PUSH/ACK (1 or several packets), server responds in TCP PUSH/ACK (1 or several packets)
  3. Connection closing: client sends TCP/FIN to server, server sends TCP/ACK to client, server sends TCP/FIN to client, client sends TCP/ACK to server
So, for a classic memcache request from a client to a server, we will have at least 10 packets exchanged, and possibly around 15 packets total if we query webforms. 
Under high load, the average RTT between a client and a memcache node is ~0.6 milliseconds. Thus, a connection from beginning to end should take ~9 ms. That's 0.009 seconds. 
In the default kernel configuration, the connection will then stay in TIME_WAIT state for 60 seconds, in case a packet got lost and the kernel needs to send a RST for it. But on a gigabit LAN, with a RTT well below a millisecond, it is just impossible for a packet to sit in routers for 60 seconds. 
The side effect of a 60 second TIME_WAIT is that conntrack must keep track of the connection during that whole time. So a connection takes 0.009s to process, and is then tracked for 60s. During that time, the conntrack buckets fill up, and potentially overflow.
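To see whether a box is getting close to that situation, compare the number of sockets in TIME_WAIT with the conntrack counters (the paths below are for the nf_conntrack module; older kernels expose them under ip_conntrack):
$ ss -tan state time-wait | wc -l
# cat /proc/sys/net/netfilter/nf_conntrack_count
# cat /proc/sys/net/netfilter/nf_conntrack_max
When nf_conntrack_count gets close to nf_conntrack_max, the kernel starts dropping packets for new connections and logs "nf_conntrack: table full, dropping packet".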
In the kernel, a reuse parameter exists for this very purpose:
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
It can be activated as follows:
# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
Or to make it permanent, add this to sysctl.conf 
net.ipv4.tcp_tw_reuse = 1
Some explanation from the HAProxy mailing list: http://marc.info/?l=haproxy&m=132784701528003&w=4
This kernel parameter should be enabled everywhere. It's just a smart reuse of network resources.

Note on Conntrack

While I do not believe it is necessary to increase ip_conntrack_max from its current value of 65536, we could very safely do so. A conntrack entry consumes about 400 bytes in the kernel (see /proc/slabinfo), which means tracking 1,000,000 connections would consume about 400MB of RAM: 
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
65536
# cat /sys/module/nf_conntrack/parameters/hashsize
16384
ip_conntrack_max represents the maximum number of tracked connections, while hashsize is the size of the hash table storing the list of conntrack entries. A conntrack entry is stored in a node of a linked list, and there are several lists, each list being an element in a hash table. So each hash table entry (also called a bucket) contains a linked list of conntrack entries. (source: http://wiki.khnet.info/index.php/Conntrack_tuning)
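The 400 bytes figure can be double-checked on a given machine by looking at the object size column of the conntrack slab cache (the exact size varies between kernel versions):
# grep conntrack /proc/slabinfo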
Each bucket will contain ~8 connections (ip_conntrack_max = hashsize * 8), but to be conservative, the kernel sets those values to (ip_conntrack_max = hashsize * 4) by default. We can safely increase both by a factor of 8: ip_conntrack_max=524288 and hashsize=131072 
# echo 524288 > /proc/sys/net/ipv4/netfilter/ip_conntrack_max 
# echo 131072 > /sys/module/nf_conntrack/parameters/hashsize
 Or, to make them permanent, modify /etc/sysctl.conf instead:
net.ipv4.netfilter.ip_conntrack_max = 524288
net.ipv4.netfilter.ip_conntrack_buckets = 131072
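Note that on some kernels hashsize is only exposed as a module parameter and not as a sysctl. In that case, a way to make it permanent is a modprobe option, for instance:
# echo 'options nf_conntrack hashsize=131072' > /etc/modprobe.d/nf_conntrack.conf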

Note on local ports

When a client connects to a server, it needs a source IP, a source port, a destination IP and a destination port. The source IP, destination IP and destination port are fixed, but the source port is taken from the range made available by the kernel: 
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000
By default, this range is limited to the ports between 32768 and 61000. That means one client will be able to open a maximum of (61000 - 32768) = 28232 connections to a given server. This limit can safely be increased to use the range 5000 to 65000 by adding this line to sysctl.conf:
net.ipv4.ip_local_port_range = 5000	65000
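Then reload the configuration and check that the new range is active:
# sysctl -p
$ sysctl net.ipv4.ip_local_port_range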

TCP Performance and memcache

I am still interested in figuring out the proper set of parameters for a TCP stack on the LAN, and today I took a closer look at the TCP performance of a farm of memcache servers.

Connections from clients to memcache are, for the immense majority, short lived and carry small amounts of data. The average packet size leaving the memcache systems is below the MTU of 1500 bytes, but a significant number of responses are larger than the MTU, and thus have to be split over several packets to reach their destination.

Because the connections are short lived, the TCP window size between the clients and memcache never increases significantly. The initial window advertised by memcache is typically around 5,800 bytes, and very rarely grows larger than 9,500 bytes.
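One way to verify those numbers is to capture a sample of the traffic on a memcache node and look at the packet sizes and advertised windows (11211 is the default memcache port; eth0 is just an example interface):
# tcpdump -nn -i eth0 -c 2000 port 11211 -w memcache.pcap
$ tcpdump -nn -v -r memcache.pcap | head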

The RTT measured between the memcache nodes and their clients is around 0.140 milliseconds (140 microseconds). These nodes being connected over gigabit ethernet, the Bandwidth Delay Product is:

BDP = (1*10^9 bits/s) * (140 * 10^-6 s) = 140,000 bits, or roughly 17.5 kilobytes

Therefore, the amount of data in transit at any given time between memcache and one of its clients can never exceed ~17.5KB, and the actual requests and responses are far smaller than that.

The responses are so small that neither the TCP window nor the congestion window will ever be the limiting factor on either side of the connection. So the only thing we can improve seems to be the size of the packets, by enabling 9,000 byte jumbo frames.

ip link set mtu 9000 dev eth1
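It is worth checking that jumbo frames actually make it end to end, since every switch on the path must support them too. A quick test, with 10.0.0.2 standing in for another memcache node, is to send an ICMP packet of the maximum size with fragmentation forbidden (8972 bytes of payload plus 20 bytes of IP header and 8 bytes of ICMP header add up to 9000):
$ ping -M do -s 8972 10.0.0.2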

Another aspect to consider is increasing the default txqueuelen from 1000 to something more suited to gigabit links, like 1,000,000.

ip link set eth1 txqueuelen 1000000

See this page for a description of txqueuelen: http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control#pfifo_fast