Running a website with a large quantity of useful content is a double-edged sword. You get a lot of hits, your search rank goes up, and people like your site. That's wonderful, but for every ten legitimate users, there's at least one LUSER who decides that he/she wants the whole site. This is bad enough under normal circumstances, but it's even worse when you're on a cohosted link and/or a shared server co-op, because it ruins the experience for the other clients AND the other servers.
Such was the situation today, when I was made aware that a section of my website was generating the vast majority of network traffic on a shared machine on a shared network. I scanned quickly for runaway/zombie procs and saw nothing except an unusually large number of httpd processes. I checked my access log with tail -f and, scrolling down the screen faster than I could read them, were log entries like
[a-luser-ip] - - [07/Feb/2005:12:15:15 -0500] "HEAD / HTTP/1.1" 301 - "http://www.mysite.com/" "SiteSucker/1.6.4"
and
[another-luser-ip] - - [07/Feb/2005:12:47:50 -0500] "HEAD / HTTP/1.0" 200 - "http://www.mysite.com/" "Wget/1.7"
...in other words, some users were running programs designed specifically for the purpose of retrieving all the content of a website (as in the case of "SiteSucker" -- how subtle), or they were running other things in ways to achieve the same result (as in the case of wget, a very useful little tool which is frequently abused for this sort of thing.)
Well, of course I didn't want to have to shut down my site that so many people find useful, but this abuse couldn't continue. I added a couple of the offending IPs to my firewall ban list, but I needed a more long-term, comprehensive solution. So I started to google around for bandwidth-throttling measures that are specific to web traffic, and I came across bw_mod by Ivan "Bruce" Barrera. It, along with a couple of built-in directives in Apache, looks to be the easiest solution to my problem, and here's how I implemented them.
System information: RedHat Linux 9 for x86, Apache 2.0.40 from stock RPMs, 10MBit network link shared by approximately 20 other linux servers. PHP, SSL, and realm authentication all enabled.
Part I: There's very little I need to add about bandwidth mod (bw_mod), since the author's readme explains everything you need to know in a very easy-to-follow, non-technical way. I'll simply condense it and leave out the "if this doesn't work, try that" sections which he kindly included.
Untar it with
This will create a directory called "bw_mod-0.5".
NOTE that your binary may be called "apxs2", and/or may have a different path.
You should see output like:
It started off with the new LoadModule line inside the worker.c block, like this (the new line is the LoadModule bw_module line):
After some troubleshooting, I deleted it from there, and placed it a few lines up, at the end of the long LoadModule section.
[source] may be a full host, part of a domain, an ip address, a network mask, or "all".
Order is relevant. First entries have precedence.
[source] -- see above
The second parameter indicates the minimum speed each client will have.
The FIRST client will have a top speed of 100kb. If more clients come, it will be split accordingly but everyone will have at least 50kb (even if you have 50 clients).
Everyone has 50kb as top speed.
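To make the arithmetic in those two examples concrete, here is a rough sketch. It assumes that "split accordingly" means the BandWidth value is divided evenly among the connected clients and then raised to the MinBandWidth floor; the readme excerpt doesn't spell out the exact rule, so treat this as my guess. The per_client_rate function is a hypothetical helper of my own, not something that ships with bw_mod, and it does not handle a BandWidth of 0 (unlimited).

```shell
# Hypothetical helper, not part of bw_mod.
# Usage: per_client_rate BANDWIDTH MINBANDWIDTH CLIENTS   (rates in bytes/sec)
# Assumption: the limit is shared evenly, then floored at the minimum.
per_client_rate() {
    max=$1
    min=$2
    clients=$3
    # MinBandWidth -1: every client gets the full BandWidth value
    if [ "$min" -eq -1 ]; then
        echo "$max"
        return 0
    fi
    share=$((max / clients))
    # Never drop below the configured minimum
    if [ "$share" -lt "$min" ]; then
        share=$min
    fi
    echo "$share"
}

# First example: BandWidth all 102400 / MinBandWidth all 50000
for n in 1 2 3 50; do
    echo "$n client(s): $(per_client_rate 102400 50000 "$n") bytes/sec each"
done
```

Under that assumption, one client gets the full 102400, two get 51200 each, and from three clients on everybody sits at the 50000 floor, which means the total can exceed the nominal BandWidth value once enough clients show up.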
[type] is the filename extension, or * for all. E.g. you can use .tgz to match only tar-compressed files, .avi to match video files.
[maxconnections] is the maximum number of (simultaneous?) connections allowed from the source. Any connection over the max will get a 503 Service Temporarily Unavailable.
There is a catch: You NEED to have a BandWidth limit for the same origin. It doesn't need to be a low limit; you can use an unlimited setting. The reason has to do with the way the program manages memory. If you don't put a BandWidth using the same origin, MaxConnection will be ignored.
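Since a MaxConnection line without a matching BandWidth line is ignored, a quick sanity check of the config can save some head-scratching. The following is only a sketch: check_maxconn is a hypothetical helper of mine, not part of bw_mod, and it simply compares the source argument of the two directives as plain text (it doesn't understand Include files or per-vhost scoping).

```shell
# Hypothetical lint, not part of bw_mod: list MaxConnection sources that
# have no BandWidth line for the same source in the given config file.
# Directive names are compared case-insensitively; commented-out lines
# are skipped because their first field starts with "#".
check_maxconn() {
    awk '
        tolower($1) == "bandwidth"     { bw[$2] = 1 }
        tolower($1) == "maxconnection" { mc[$2] = 1 }
        END {
            for (src in mc)
                if (!(src in bw))
                    print "MaxConnection " src " has no matching BandWidth line"
        }
    ' "$1"
}

# Example: check_maxconn /etc/httpd/conf/httpd.conf
```

No output means every MaxConnection source has a BandWidth partner.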
Don't forget to restart apache (service httpd restart) after applying any of these changes.
Part II: Apache has a built-in directive called BrowserMatch, and its sister BrowserMatchNoCase. These set an environment variable for any client whose UserAgent string matches a pattern, and you can then deny access based on that variable. They work without mod_rewrite. The UserAgent string can be spoofed, but this will take care of the vast majority of your would-be bandwidth hogs.
You can create other environments for things like known email-harvesting bots, known-evil web spiders, etc. and do more creative things based on which type of malicious visitors they are. I am content to just refuse connections from all of them.
Don't forget to restart apache (service httpd restart) after applying any of these changes.
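If you want to see which agents from your logs a list of "^name" patterns would catch before you touch the live config, something like the sketch below can help. It only approximates what BrowserMatchNoCase does (a case-insensitive match anchored at the start of the string); is_sucker is a hypothetical helper of mine, and the names in it are just a handful taken from my block list.

```shell
# Hypothetical helper: succeed if a User-Agent string starts with one of
# the blocked names, ignoring case, roughly as BrowserMatchNoCase ^name would.
is_sucker() {
    ua_lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    for name in wget sitesucker igetter larbin leechget realdownload teleport; do
        case $ua_lower in
            "$name"*) return 0 ;;
        esac
    done
    return 1
}

# Try it against a few agent strings
for ua in "Wget/1.8.2" "SiteSucker/1.6.4" "Googlebot/2.1 (+http://www.google.com/bot.html)"; do
    if is_sucker "$ua"; then
        echo "blocked: $ua"
    else
        echo "allowed: $ua"
    fi
done
```

Once the real directives are in place, a request with a spoofed agent, for example curl -I -A "Wget/1.7" http://www.mysite.com/, should come back with a 403 Forbidden if the deny rule is doing its job.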
Looking for web spiders and site suckers: Here's a shortcut for identifying all user agents in your logs.
Translation: Run the logfile through a filter, treating double-quotes as field separators, pick out the 6th field (which is the UserAgent field), sort the resulting list, toss out all duplicates, and don't show me anything containing "Mozilla". NOTE that many programs include "Mozilla" in their useragent strings; also, some download accelerators operate as plugins to the regular browser and then append their useragent specifics to the browser's string. So, if you want to be more thorough than this, leave off everything after "uniq" and you'll see it all -- including stuff like
This will take a while for large logs. I filter out Mozilla since it is what appears for most normal browsers, including IE.
Also, this assumes you are using "combined" log format. If you don't get the results you expect (see below), then try replacing the "$6" with another number -- your log format may order the fields differently.
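A variation I find handy, sketched here under the same "combined" log format assumption: count how many requests each agent made, so the heavy hitters float to the top instead of being buried in an alphabetical list. count_agents is just a name I made up for the wrapper.

```shell
# Count how often each User-Agent appears, busiest first.
# Assumes "combined" log format, where the agent is the 6th field
# when the line is split on double quotes.
count_agents() {
    awk -F'"' '{print $6}' "$1" | sort | uniq -c | sort -rn
}

# Example: count_agents /var/log/httpd/access.log | head -20
```

Each output line is a request count followed by the agent string. As above, swap the $6 for another field number if your log format orders things differently.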
Sample results:
When you see something odd or suspicious (like "LeechGet", gee I wonder what that does), Google around for it. If the name is too general, add "user agent" or "spider" or "search engine" to your query.
Note that there is much more to bw_mod than this. In my particular case, the files that users (and lusers) are after all happen to be PDFs, and one of bw_mod's features is to limit based on filename. Here are a few more options and examples, culled from the readme enclosed with the program:
rpm -Uvh /home/mbates/rh9rpms/httpd-devel-2.0.40-21.i386.rpm
tar -xvzf bw_mod-0.5rc1.tgz
cd bw_mod-0.5
/usr/sbin/apxs -i -a -c bw_mod-0.5rc1.c
/usr/lib/httpd/build/libtool --silent --mode=compile gcc -prefer-pic -O2 -g -pipe -march=i386 -mcpu=i686 -I/usr/kerberos/include -DAP_HAVE_DESIGNATED_INITIALIZER -DLINUX=2 -D_REENTRANT -D_XOPEN_SOURCE=500 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -pthread -DNO_DBM_REWRITEMAP -I/usr/include/httpd -c -o bw_mod-0.5rc1.lo bw_mod-0.5rc1.c && touch bw_mod-0.5rc1.slo
/usr/lib/httpd/build/libtool --silent --mode=link gcc -o bw_mod-0.5rc1.la -rpath /usr/lib/httpd/modules -module -avoid-version bw_mod-0.5rc1.lo
/usr/lib/httpd/build/instdso.sh SH_LIBTOOL='/usr/lib/httpd/build/libtool' bw_mod-0.5rc1.la /usr/lib/httpd/modules
/usr/lib/httpd/build/libtool --mode=install cp bw_mod-0.5rc1.la /usr/lib/httpd/modules/
cp .libs/bw_mod-0.5rc1.so /usr/lib/httpd/modules/bw_mod-0.5rc1.so
cp .libs/bw_mod-0.5rc1.lai /usr/lib/httpd/modules/bw_mod-0.5rc1.la
cp .libs/bw_mod-0.5rc1.a /usr/lib/httpd/modules/bw_mod-0.5rc1.a
ranlib /usr/lib/httpd/modules/bw_mod-0.5rc1.a
chmod 644 /usr/lib/httpd/modules/bw_mod-0.5rc1.a
PATH="$PATH:/sbin" ldconfig -n /usr/lib/httpd/modules
----------------------------------------------------------------------
Libraries have been installed in:
/usr/lib/httpd/modules
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,--rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
chmod 755 /usr/lib/httpd/modules/bw_mod-0.5rc1.so
[activating module `bw' in /etc/httpd/conf/httpd.conf]
-bash2-2.05b#
<IfModule worker.c>
LoadModule bw_module /usr/lib/httpd/modules/bw_mod-0.5rc1.so
LoadModule cgid_module modules/mod_cgid.so
</IfModule>
...
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule proxy_connect_module modules/mod_proxy_connect.so
##### NEW added Feb 7 2005 to limit bandwidth
#
# First load the module
LoadModule bw_module /usr/lib/httpd/modules/bw_mod-0.5rc1.so
#
# Now enable it
BandWidthModule On
#
# Now set the default, which is no limit --
# we will tweak it later.
BandWidth all 0
#
# PDFs larger than 1MB go at 10k/sec max
LargeFileLimit .pdf 1000 10000
#
# No more than 40 connections
MaxConnection all 40
#
##### end bandwidth limit section
<IfModule prefork.c>
...
LoadModule cgi_module modules/mod_cgi.so
</IfModule>
...
If speed is 0, there is no limit.
Examples:
# localhost can download at 10 kbytes/sec
BandWidth localhost 10240
# This host has no download limit
BandWidth 192.168.218.5 0
Examples:
BandWidth all 102400
MinBandWidth all 50000
BandWidth all 50000
MinBandWidth all -1
[minimum size] is the minimum size (in kbytes) of the file to be matched. That way you can match huge video files that hog your bandwidth.
Examples:
LargeFileLimit .avi 500 10240
This limits .avi files over (or equal to) 500kb to 10kbytes/s.
Examples:
BandWidth all 0
MaxConnection all 20
or
BandWidth all 0
BandWidth 192.168.0.0/24 10240
MaxConnection all 20
MaxConnection 192.168.0.0/24 5
BrowserMatchNoCase ^NameOfBadProgram1 nameofenv
BrowserMatchNoCase ^NameOfBadProgram2 nameofenv
BrowserMatchNoCase ^NameOfBadProgram3 nameofenv
Use the same "nameofenv" value for all of the agents you want to block. I added this section to some preexisting BrowserMatch directives that had to do with forcing HTTP responses to certain browser versions. Here's what mine looks like right now; I will be adding to it as my logs reveal new twits:
# NEW Feb 7 2005 anti-bandwidth-sucker measures
BrowserMatchNoCase ^wget suckers
BrowserMatchNoCase ^SiteSucker suckers
BrowserMatchNoCase ^iGetter suckers
BrowserMatchNoCase ^larbin suckers
BrowserMatchNoCase ^LeechGet suckers
BrowserMatchNoCase ^RealDownload suckers
BrowserMatchNoCase ^Teleport suckers
BrowserMatchNoCase ^Webwhacker suckers
BrowserMatchNoCase ^WebDevil suckers
BrowserMatchNoCase ^Webzip suckers
BrowserMatchNoCase ^Attache suckers
BrowserMatchNoCase ^SiteSnagger suckers
BrowserMatchNoCase ^WX_mail suckers
BrowserMatchNoCase ^EmailCollector suckers
BrowserMatchNoCase ^WhoWhere suckers
BrowserMatchNoCase ^Roverbot suckers
BrowserMatchNoCase ^ActiveAgent suckers
BrowserMatchNoCase ^EmailSiphon suckers
deny from env=suckers
You could also add this to individual directories' .htaccess files, but I have not tried this yet. I simply added it at the end of my main Directory block:
<Directory /home/www.whoopis.com/html>
Options Indexes FollowSymLinks
AllowOverride AuthConfig
deny from env=suckers
</Directory>
cat access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla
Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; CDSource=v13b.08; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie; Maxthon)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Sgrunt|V104|615|S-132489664; SV1; InterFREE Kit; .NET CLR 1.1.4322)
which may or may not be legit, but are definitely unusual.
bash2-2.05b# cat /var/log/httpd/access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla
Advanced Browser (http://www.avantbrowser.com)
Avant Browser (http://www.avantbrowser.com)
CFNetwork/1.1
CoralWebPrx/0.1.12 (See http://coralcdn.org/)
DA 5.3
ELinks/0.11.CVS (textmode; Linux 2.6.10 i686; 142x68-3)
FDM 1.x
Googlebot-Image/1.0
Googlebot/2.1 (+http://www.google.com/bot.html)
HAM version 6.0.87.204
Holmes/1.0
Html Link Validator (www.lithopssoft.com)
IRLbot/1.0 (+http://irl.cs.tamu.edu/crawler)
Iltrovatore-Setaccio/1.2 (It-bot; http://www.iltrovatore.it/bot.html; info@iltrovatore.it)
Java1.3.1
LWP::Simple/5.64
LeechGet 2004 (www.leechget.net)
Links (2.1pre15; Linux 2.6.7-hardened-r16 i686; 80x40)
Lynx/2.8.4rel.1 libwww-FM/2.14
Mediapartners-Google/2.1
Monica/1.4
Opera/7.50 (X11; Linux i386; U) [en]
RS
RealDownload/4.0.0.40
SIE-M55/10 UP.Browser/6.1.0.5.c.6 (GUI) MMP/1.0 (Google WAP Proxy/1.0)
SafariBookmarkChecker/1.26 (+http://www.coriolis.ch/)
SiteBar/3.2.6
SiteSucker/1.6.4
Space Bison/0.02 [fu] (Win67; X; SK)
SurveyBot/2.3 (Whois Source)
Wget/1.8.2
appie 1.1 (www.walhello.com)
contype
curl/7.10.2 (powerpc-apple-darwin7.0) libcurl/7.10.2 OpenSSL/0.9.7b zlib/1.1.4
findlinks/0.87 (+http://wortschatz.uni-leipzig.de/findlinks/)
gamekitbot/1.0 (+http://www.uchoose.de/crawler/gamekitbot/)
iCab/2.9.8 (Macintosh; U; 68K)
iGetter/2 (Macintosh; U; PPC Mac OS X; en)
ia_archiver
larbin_2.6.3 (larbin2.6.3@unspecified.mail)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
pipeLiner/0.7 (PipeLine Spider; http://www.pipeline-search.com/webmaster.html; webmaster@pipeline-search.com)
psbot/0.1 (+http://www.picsearch.com/bot.html)
sherlock/1.0
sohu-search
updated/0.1beta (updated.com; http://www.updated.com; crawler@updated.om)