Follow 1.51
(c) Copyright 1997 Mark Nottingham

Follow processes Common and Combined Logfile Format files to reveal usage
patterns of groups of pages. It does this by grouping accesses by their
requesting host and then associating them according to referer and time.

The most recent version of Follow can be found at
http://www.pobox.com/~mnot/follow/

Comments, bug reports and queries should go to mnot@pobox.com

--------------------------------------------------------------------------
Configuration

Before you start, edit follow and put your Base URL (i.e., the bare address
of your Web site) in the configuration section at the start of the script.
Make sure you DON'T end the URL with a '/'.

--------------------------------------------------------------------------
Use

Run Follow without any arguments for usage details. The basic usage is

    $ ./follow [infile]

where [infile] is the name of a Common or Combined Logfile Format file.
Follow is very dependent on Common/Combined Format, and will not operate
correctly if the input deviates from this. See below for more information
on these logfile formats.

It's a good idea to limit the amount of log data you analyse, because
processing time increases with file size. The -l option does this by
restricting analysis to a recent time window.

Output should be piped to a pager (' | more') or redirected to a file
(' > file'). It can also be run from a cgi script; see the -h option.

Options

-h  will generate the output in HTML. This option takes no arguments, and
    without it output will be formatted text.

-i  will limit the listing to sessions from hosts whose IP address starts
    with the specified prefix.
    For instance, './follow -i 131.172. infile | more' will list sessions
    from the 131.172.*.* address range. Note that your logfiles must
    identify hosts by their IP address for this to work.

-l  limits the analysis to a recent time window. The argument is in the
    format hh:mm, where hh is the number of hours and mm is the number of
    minutes ago to start analysis.

-m  discards all sessions with fewer than the specified number of
    accesses.
    For instance, './follow -m 5 infile > outfile' will only list sessions
    that have five or more accesses. This option defaults to two.

-n  will limit the listing to sessions from hosts whose name ends with the
    specified suffix.
    For instance, './follow -n .com.au infile | more' will only list
    sessions coming from hosts in the *.com.au domain. Note that your logs
    must identify hosts by name for this to work.

-p  will limit the listing to a directory prefix.
    For instance, './follow -p /~mark infile > outfile' will only list
    sessions in the /~mark directory.

-s  sorts output by time of the first entry in each session, in reverse
    order. Without this switch, Follow defaults to sorting by name.

-t  sets the session timeout: a session from a given host is broken into
    two if there is more than this number of minutes between accesses.
    Defaults to 45 minutes. Set to 0 to lump all accesses together.

-u  restricts output to a single authorised (identd or .htaccess-style)
    user. This will ONLY work if both you and your users' browsers are
    using identd-style authentication (rare), OR you have implemented
    .htaccess-style user authentication.

--------------------------------------------------------------------------
Interpreting Follow Output

Each block of output starts with the requesting hostname and browser type
(if this information is included in your logs). If identd or user
authentication information is available, it will be used in place of the
hostname.

The following lines are a progression of pages that the user has accessed,
first listing the page requested, then the referer (if it was different
from the previous page), followed by the time requested. If the referer is
offsite, you can follow its link to find out where the user came from.

You can use this output to discover how long people are staying on your
pages, where they go 'back', and when they have to reload a page. This data
allows you to analyse how people navigate your site.
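
To make the session blocks easier to picture, here is a minimal sketch in
Python of the grouping that the -t and -m options describe. It is not
Follow's own code: the function name group_sessions, its parameters, the
sample hosts and the (host, time, page) tuples are all assumptions made for
illustration, and it groups by host and time only, leaving out the referer
matching that Follow also performs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_sessions(accesses, timeout_minutes=45, min_accesses=2):
    """Group (host, time, page) tuples into per-host sessions.

    A new session starts when the gap between two consecutive accesses
    from the same host exceeds timeout_minutes; a timeout of 0 disables
    splitting. Sessions with fewer than min_accesses entries are dropped.
    (Hypothetical helper, not part of Follow.)
    """
    by_host = defaultdict(list)
    for host, when, page in accesses:
        by_host[host].append((when, page))

    timeout = timedelta(minutes=timeout_minutes)
    sessions = []
    for host in sorted(by_host):  # sort by host name
        current = []
        for when, page in sorted(by_host[host]):
            # Split when the gap since the previous access is too long.
            if current and timeout_minutes and when - current[-1][0] > timeout:
                sessions.append((host, current))
                current = []
            current.append((when, page))
        sessions.append((host, current))
    return [(h, s) for h, s in sessions if len(s) >= min_accesses]


# Made-up sample data, purely for illustration.
accesses = [
    ("a.example.com", datetime(1997, 5, 1, 10, 0), "/index.html"),
    ("a.example.com", datetime(1997, 5, 1, 10, 2), "/about.html"),
    ("b.example.com", datetime(1997, 5, 1, 10, 5), "/index.html"),
    ("a.example.com", datetime(1997, 5, 1, 12, 0), "/index.html"),
    ("a.example.com", datetime(1997, 5, 1, 12, 1), "/news.html"),
]

for host, session in group_sessions(accesses):
    print(host)
    for when, page in session:
        print("   ", when.strftime("%H:%M"), page)
```

With the sample data, a.example.com yields two sessions because of the
two-hour gap, and the single access from b.example.com is discarded by the
minimum of two accesses.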

Note that Follow will only track non-image 'hits'; currently, it ignores
.jpg, .gif and .xbm files. It also ignores hits from hosts with "proxy",
"squid" or "cache" in their name, because proxies serve several hosts and
may skew your output. (If an authenticated session comes from a proxy, it
will be included.)

--------------------------------------------------------------------------
Automating Follow

follow.cgi is included with Follow as a simple wrapper script that lets you
control Follow output from the Web. To use it, copy follow.cgi and
www-lib.pl to your cgi-bin directory, and edit the configuration portion of
follow.cgi to suit your site. When you access the URL of the script, it
will present you with a form that allows you to access the command-line
options from the Web.

You can also use cron to automate Follow, if you wish; talk to your
sysadmin.

--------------------------------------------------------------------------
About the Combined Logfile Format

Most modern Web servers keep logs in the Common Logfile Format. To get the
most out of Follow, you need to use the Combined format, which is the
Common format with the Referer and Agent logs tacked onto the end. Most Web
servers are capable of doing this. For Apache, enter this line into
httpd.conf:

    LogFormat "%h %l %u %t \"%r\" %>s %b %{referer}i \"%{user-agent}i\""

Other servers should be similar; consult your documentation.

--------------------------------------------------------------------------
To Do

The next version of Follow (2.0) will use more complex structures (most
likely a hash of arrays of arrays) to allow extraction of more data (such
as time spent on individual pages and popular use pattern recognition).

If you have any suggestions, please send them to me at mnot@pobox.com.
Thanks!
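
As a rough illustration of what a "hash of arrays of arrays" could make
possible, the Python sketch below keys a dictionary by host, holds a list
of sessions per host, and stores each session as a list of [time, page]
entries. This is only one reading of that phrase, not the planned 2.0
design; the layout, the sample data and the function name time_on_pages
are assumptions.

```python
from datetime import datetime

# Hypothetical nested structure: host -> list of sessions -> [time, page].
sessions = {
    "a.example.com": [
        [
            [datetime(1997, 5, 1, 10, 0, 0), "/index.html"],
            [datetime(1997, 5, 1, 10, 2, 30), "/about.html"],
            [datetime(1997, 5, 1, 10, 3, 0), "/index.html"],
        ],
    ],
}


def time_on_pages(session):
    """Return (page, seconds) pairs for every access except the last.

    Time on a page is estimated as the gap until the next access in the
    same session. The final page has no following access, so its duration
    is unknown and it is left out.
    """
    return [
        (page, (next_when - when).total_seconds())
        for (when, page), (next_when, _) in zip(session, session[1:])
    ]


for host, host_sessions in sessions.items():
    for session in host_sessions:
        for page, seconds in time_on_pages(session):
            print(host, page, seconds)
```

The estimate is only as good as the log: a long gap may mean the reader
left the browser open rather than read the page.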