Hacking Webalizer

A few weeks ago I found a way to install Webalizer on DreamHost. Webalizer is a somewhat dated utility (compared to Google Analytics, Webalizer stats look crude), but it still has some advantages. One of the biggest is that many hosting providers still offer it, so you don’t have to change every page of your site (or blog) just to make sure Google will track your stats.

Webalizer is a *nix command-line utility that generates HTML stats pages by parsing log files. Although it can parse several types of log files, in practice it is mostly used for Apache web logs. It is usually run from a cron job scheduled shortly after log rotation.

Like any other *nix command-line utility, webalizer accepts dozens of arguments (the full list is in the manual, which you can read with man webalizer). However, most options can also be set in the webalizer.conf file, which is usually located at /etc/webalizer.conf.

Webalizer will probably work even if you don’t change anything in webalizer.conf, but the generated statistics will not be very accurate. For example, a large number of the visits/hits on your site are actually generated by bots (msn, google, yahoo …), which is not something you want counted in your stats. Also, by default the referrers include pages from your own site, which means “Top 10 Referrers” ends up listing mostly your own pages. So, let’s hack webalizer.conf to make the stats more useful to us :)

First off, let’s keep your own site out of the referrers list by adding it to HideReferrer. For example:

HideReferrer dinke.net/

Now, let’s ignore spiders, bots etc.:

# IgnoreSite ignores hits coming from the hosts of well-known bots
IgnoreSite msnbot.msn.com

# IgnoreAgent ignores hits based on the user agent string
IgnoreAgent msnbot
IgnoreAgent Googlebot
IgnoreAgent lmspider
IgnoreAgent Yahoo
IgnoreAgent ZyBorg
IgnoreAgent Jeeves/Teoma

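Bear in mind how these options match. A value without a wildcard matches anywhere in the string, while a leading or trailing “*” ties the match to the end or the beginning of the string. So for, let’s say, “www.yourmama.com”, the values “your”, “*mama.com” and “www.your*” will all match.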
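To make the matching rule concrete, here is a minimal Python sketch of it. This is only an approximation for illustration, not Webalizer’s own code: the function name matches is made up, and case handling is deliberately left out.

```python
def matches(pattern: str, value: str) -> bool:
    """Approximate the wildcard rule described above (illustrative only)."""
    if pattern.startswith("*"):
        # Leading "*": the rest of the pattern must sit at the end of the value.
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        # Trailing "*": the rest of the pattern must sit at the start of the value.
        return value.startswith(pattern[:-1])
    # No wildcard: the pattern may appear anywhere in the value.
    return pattern in value


host = "www.yourmama.com"
for pattern in ("your", "*mama.com", "www.your*", "mama*"):
    print(f"{pattern!r:14} matches {host!r}: {matches(pattern, host)}")
```

The last pattern, “mama*”, is there to show the flip side: with a trailing wildcard the string has to start with “mama”, so it does not match this host even though “mama” appears inside it.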

To leave out visits to a particular part of your site, use the IgnoreURL option. For example, on my own WordPress blog I don’t want to count any access to the /wp-admin/ part, so I added something like:

# ignore visits to particular part of the site
IgnoreURL /wp-admin/*

And last but not least, the user agent list. To group browsers by user agent (MSIE, Mozilla), you could use the following:

GroupAgent MSIE Internet Exploder
HideAgent MSIE
GroupAgent Mozilla Mozilla Based
HideAgent Mozilla

GroupAgent groups matching user agents together, so you would get something like “MSIE – 156 hits” if you put MSIE on the list. If you don’t add the HideAgent option, every agent with the string “MSIE” in it will also be listed individually. I really don’t know why, but nothing except MSIE and Mozilla worked for me (opera, gecko etc. did not), so this option turned out to be useless.
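The interplay between GroupAgent and HideAgent is easier to see in a small Python sketch. It is a rough model and not how Webalizer is actually implemented: agent_report is a made-up helper, matching is plain substring matching, the user agent strings are invented samples, and the rule that only the first matching group counts is my assumption.

```python
from collections import Counter


def agent_report(hits, groups, hidden):
    """Tally hits per user agent, add group totals, and drop hidden agents.

    hits   -- iterable of user agent strings, one per hit
    groups -- list of (pattern, label) pairs, like GroupAgent lines
    hidden -- list of patterns, like HideAgent lines
    """
    report = Counter()
    for agent, count in Counter(hits).items():
        # Assumption: only the first group whose pattern matches gets the hits.
        for pattern, label in groups:
            if pattern in agent:
                report[label] += count
                break
        # Individual agents stay in the list unless a hide pattern matches.
        if not any(pattern in agent for pattern in hidden):
            report[agent] += count
    return report


# Invented sample data: two IE hits, one Firefox hit, one Wget hit.
hits = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; Linux i686; rv:1.8) Gecko/20060101 Firefox/1.5",
    "Wget/1.10",
]
groups = [("MSIE", "Internet Exploder"), ("Mozilla", "Mozilla Based")]
hidden = ["MSIE", "Mozilla"]

for name, count in agent_report(hits, groups, hidden).items():
    print(f"{name}: {count} hits")
```

Note that Internet Explorer’s user agent string also contains “Mozilla”, so in this model the MSIE group has to come before the Mozilla one, just like in the config above.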

And that’s it. With only a few “tweaks” your stats become much more accurate, but with one drawback: now that you have turned off bots etc., the number of visits/hits might be much lower than before :)

Comments

  1. December 16th, 2008 | 5:55 pm

    Just some words about WebAlizer…

    I don’t like it very much.
    It’s rather good and doesn’t cost anything.
    BUT:
    1) the stats it gives are very often incorrect
    2) some WA logs get referrer-spammed, which may even cause a DDoS of the site

    But it’s not difficult to tune WA so that it works correctly!

