2014. június 10., kedd

How does iptables hashlimit module work?

Hashlimit is an iptables module that allows one to define rules that in effect will limit traffic speed (bytes / time unit) or frequency (connections / time unit) per target or origin ports / IPs. The inner workings of this module and / or how to make it work correctly remains a mystery for many.

Hashlimit is also close friends with the limit module, only much more powerful, capable of expressing rate limiting per source IP (for example) in a single rule.

A hashlimit rule doesn't work like the most people I met imagine. The common false picture of iptables hashlimit is that it throttles connections by "slowing down" things somehow. Why would anyone think that? It's just an iptables rule after all: match, or no match.

But hashlimit doesn't work like that at all. Instead of applying friction to the wheels slowing the wehicle down, it just blows the whole wheel up. Instead of "slowing things down", hashlimit drops packets (in the most common use cases).

Well... a haslimit rule actually either mathches a packet or doesn't, just like any other iptables rule. If a hashlimit rule matches a packet, it means that the packet is below (--hashlimit-upto) or above (--hashlimit-above) a certain rate (bytes / timeframe or frequency / timeframe).

You can, of course, create a rule that -j DROP packets that are --hashlimit-above 10/sec effectively prohibiting traffic faster than 10 packets per second.

This means it isn't inherently connection-oriented either. Yes, it can be used together with connection tracking, but it's not mandatory. TCP connection rate can be controlled with hashlimit by simply matching on the SYN flag.

Ok, so hashlimit can only DROP, or ACCEPT a packet, or do the usual iptably things like directing it to a separate chain where it gets logged and then dropped etc. But...

How can this even work?

I mean how can this smoothly limit traffic, not causing connection resets and other weird end-user-deterrent errors?

There seem to be a contradiction between the theory and practice. If hashlimit rules simply just drop packets then why the end user only feel maybe some slowdown? Why there are no error messages saying that the connection was terminated? (Or at least why such thing never happens if the connection limit is higher than a few / min?)

The reason for this is because almost everybody uses TCP. On the web, folks use TCP exclusively. And those who don't, probably do the same thing as TCP:

The TCP protocol sends acknowledgements, so the sending party can keep track of what have arrived and what was probably-lost on the network, and can re-send data if needed.

The typical use-case for hashlimit is to drop the first TCP segment (the one with the SYN flag set) if there were too many in the near past, effectively limiting connection rate. The sending party keeps retransmitting connection requests for a while. The exact timeout and the number of retries are operating system (and configuration) dependent.

An unfortunate consequence of this is if you need to tune your hashlimit tight, you will get dropped packets all the time.

I saw system administrators carefully analysing the syslog, and trying to tune hashlimit rules so no packet drop log entries appear when traffic is "normal",  but still tgrying to be able to hold off simple DOS attacks (like a stuck F5 key).

Needless to say, such efforts are a waste of time. No matter how hard you try, there will be packet drops with hashlimit, and it's OK. TCP can handle that.

Of course if you have to tune limits that tight, you also gonna have a bad time with NAT users too. If your application can't handle at least a few hundred req/s per thread, then in general you gonna have a bad time, and hashlimit won't do much for you. I had an old client once, whose application started coughing at five requests per second. Hashlimit wasn't quite the solution they needed. Such an application will crash an burn as soon as it is introduced to the general public.

Experimenting with hashlimit

Let's take a look at the documentation: http://ipset.netfilter.org/iptables-extensions.man.html#lbAW and set up a hashlimit rule.

For example, we can create a rule that will DROP packets that are NEW, coming to the http server (at port 80), and are coming too fast. In this case too fast means 20 in every second. We will enforce this limit by source IP address (hence the --hashlimit-mode srcip). The hash table name will be "http", so we can use this very hash table in other rules too if we need.

iptables -I INPUT -m hashlimit -m tcp -p tcp --dport 80 --hashlimit-above 20/sec --hashlimit-mode srcip --hashlimit-name http -m state --state NEW -j DROP
Verify that the rule is active by issuing iptables -L.

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       tcp  --  anywhere             anywhere             limit: above 20/sec burst 5 mode srcip tcp dpt:http state NEW

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

(Yes, it is. Also, please note that this will limit outbound traffic too, since SYN,ACK response packets are considered NEW.)


As you can see, there is a "burst 5" note in the output above. This is the default "burst" value for every hashlimit rule. This has things to do with how the whole limiting stuff is implemented.

When a packet comes in and hits the hashlimit rule, and it also matches all the other criteria for that rule, the hashlimit engine kicks in, and tries to find an entry in a hash table for that packet. In our case, it will hash the source IP address, and store a counter at that hash value, so our rule will maintain a counter for every source IP address.

That counter decreased upon each hit. If the arrival of a packet tries to push the counter below zero, it means that we have just hit the limit. In our case it means that the packet must be dropped. If there isn't any counter stored for the current hash, one will be created and initialized to 1, so the fresh packet will be granted access, and the counter will be set to zero immediately.

The counter is also incremented in certain intervals, and the interval depends on the limit. In our case, the limit is 20/sec, so the counters are incremented by one in every 1/20th of a second, until it reaches the burst value. A counter can never be greater than the burst value.

Actually this description simplifies things a bit, but not too much. The truth is somewhat uglier than that. See the Linux kernel itself for the gory details: http://lxr.free-electrons.com/source/net/netfilter/xt_hashlimit.c#L382

Testing with ab

Let's  run ab (Apache Benchmark) to see what we've achieved. You need to have an apache server running locally. The default installation and the default "It works!" page will do just fine.

ab -c 1 -n 10 http://localhost/

This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:        Apache/2.4.6
Server Hostname:        localhost
Server Port:            80

Document Path:          /
Document Length:        11 bytes

Concurrency Level:      1
Time taken for tests:   9.001 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Total transferred:      2550 bytes
HTML transferred:       110 bytes
Requests per second:    1.11 [#/sec] (mean)
Time per request:       900.064 [ms] (mean)
Time per request:       900.064 [ms] (mean, across all concurrent requests)
Transfer rate:          0.28 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  899 315.9    999     999
Processing:     1    1   0.2      1       1
Waiting:        0    1   0.2      0       1
Total:          1  900 315.8   1000    1000
ERROR: The median and mean for the waiting time are more than twice the standard
       deviation apart. These results are NOT reliable.

Percentage of the requests served within a certain time (ms)
  50%   1000
  66%   1000
  75%   1000
  80%   1000
  90%   1000
  95%   1000
  98%   1000
  99%   1000
 100%   1000 (longest request)

But WHY??? 1.11 requests / second is totally not 20 requests / second. What isn't working properly?

TCP. And it's working just fine.

One can try and attack the local web server on multiple threads, hoping that the effect of the TCP retransmission timeout will disappear once there are many parallel connections. Let's try this one:

ab -c 10 -n 10 http://localhost/

So all ten requests are made at once. Still, we see only 5 requests per second. How can that be?

This is because ab will try to make requests on all threads at once, and there's no chance for the counter to be incremented by even one. As soon as you issue the command, almost all SYN packet will be dropped, and the threads will wait for their chance to retransmit. When they get their chance, the counter will be at the the burst value, and when the threads send their retransmissions, <burst> number of threads will be successful.

One way to circumvent this is to either reduce the retransmission timeout, or wait a little between each requests. Unfortunately, retransmission behaviour currently can't be set per-socket under Linux. There is a patch for that however, so this feature might make it's way into the mainline code later.

Also please note that if we turn "keepalive" on, then nothing will hold ab back:

ab -c 1 -n 10 -k http://localhost/

I got a nice 4500 reuqests / second with this. Fortunately DOSers rarely use keepalive, and it can also be turned off at the server side. Of course one should carefully consider (and measure) the effect of keepalive settings, it can have a quite heavy impact.

Testing with curl

The  following piece of bash script can be used to experiment with hashlimit rules, and their effect on the traffic:

while true; do curl http://localhost/ &> /dev/null; echo -n "#"; sleep 0.05; done

The 0.05 seconds sleep will correspond to 20 reqs/s (or a little less, because curl needs some time to run too). If you try to decrease this, eventually you will see that the flow of # characters is being interrupted for a short period of time every now and then.

If you sleep to little, the second SYN packet will be dropped, and curl will have to wait to retransmit, limiting the connection rate down to 1/<retransmission timeout> reqs / sec in case of a single thread. This causes the above script to print one # character in every second.


TCP is a very sturdy protocol and can be configured to fit many environment. You can use it between two processes on the same core, or you can put a device to an orbit around planet Mars and expect a connection to stay alive despite the 6 - 42 minutes "ping" and probably heavy packet loss.

Even the desktop-default TCP stack can deal with the packet loss caused by hashlimit rules, and this is the way hashlimit rules supposed to do their work, so tuning for no packet drops is an unnecessary and futile effort.

Despite the fact both TCP and hashlimit rules operate along quite simple principles, their interaction can produce fairly complex behaviour. The behaviour of a system with very simple rules can be arbitrarily complex. Just think about how simple rules define the Mandelbrot set.

Because of this, you should always be aware of how the tools you use actually work. No metaphors or graphic images-in-mind will be able to help you pinpoint the source of a complex (mis)behaviour like an ab test producing 1-5 reqs/s under a hashlimit 20/sec rule. You need to know the system works exactly, nothing else will work.