Follow 1.51
(c) Copyright 1997 Mark Nottingham <mnot@pobox.com>

Follow processes Common and Combined Logfile Formats to reveal usage
patterns of groups of pages. It does this by grouping accesses by their 
requesting host and then associating them according to referer and time.

The most recent version of Follow can be found at 
http://www.pobox.com/~mnot/follow/

Comments, bug reports and queries should go to mnot@pobox.com


--------------------------------------------------------------------------
Configuration

Before you start, edit the follow script and put your Base URL (i.e., the
bare address of your Web site) in the configuration section at the start of
the script. Make sure you DON'T end the URL with a '/'.


--------------------------------------------------------------------------
Use

Run Follow without any arguments for usage details. The basic usage is

$ ./follow [infile]

where [infile] is the name of a Common or Combined Logfile Format file.
Follow is very dependent on Common/Combined Format, and will not operate
correctly if the input deviates from this. See below for more information
on these logfile formats.

It's a good idea to limit how much of the file you analyse, because
processing time increases for large files. This can be done with the -l
option, which restricts analysis to a recent time window.

Output should be piped to a pager (' | more') or redirected to a file
(' > file'). Follow can also be run from a CGI script; see the -h option.


Options

-h   will generate the output in HTML. This option takes no arguments, and 
     without it output will be formatted text.

-i   will limit the listing to sessions whose host IP address starts with the
     specified prefix. For instance, './follow -i 131.172. infile | more' will
     list sessions from the 131.172.*.* address range. Note that your logfiles
     must identify hosts by their IP address for this to work.

-l   limits the analysis time window. The argument is in the format hh:mm,
     where hh is the number of hours and mm is the number of minutes before
     the present at which analysis should start.

-m   discards all sessions with fewer than the specified number of
     accesses. For instance, './follow -m 5 infile > outfile' will only list
     sessions that have five or more accesses. This option defaults to two.

-n   will limit the listing to sessions from hosts whose names end with the
     specified suffix. For instance, './follow -n .com.au infile | more' will
     only list sessions coming from hosts in the *.com.au domain. Note that
     your logs must identify hosts by name for this to work.

-p   will limit the listing to a directory prefix. For instance,
     './follow -p /~mark infile > outfile' will only list sessions in the
     /~mark directory.

-s   sorts output by time of the first entry in each session, in reverse 
     order. Without this switch, Follow defaults to sorting by name.

-t   sets the session timeout in minutes. If there is more than this number
     of minutes between two accesses from a host, the later access starts a
     new session. Defaults to 45 minutes. Set to 0 to lump all accesses
     together.

-u   restricts output to a single authorised (identd or .htaccess-style) user.
     This will ONLY work if both you and your users' browsers are using
     identd-style authentication (rare), OR you have implemented
     .htaccess-style user authentication.
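
The effect of the -t and -m options can be pictured with a small Python
sketch. This is not Follow's own code; the function name split_sessions and
its parameters are hypothetical, and it assumes you already have the access
times for a single host as datetime objects.

```python
from datetime import datetime, timedelta


def split_sessions(times, timeout_minutes=45, min_accesses=2):
    """Split one host's access times into sessions (illustrative only).

    A new session starts when the gap between consecutive accesses is more
    than timeout_minutes. A timeout of 0 keeps everything in one session.
    Sessions with fewer than min_accesses entries are dropped.
    """
    sessions = []
    current = []
    limit = timedelta(minutes=timeout_minutes)
    for t in sorted(times):
        # Close the current session if the host was idle for too long.
        if current and timeout_minutes > 0 and t - current[-1] > limit:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    # Mirror the -m option: keep only sessions with enough accesses.
    return [s for s in sessions if len(s) >= min_accesses]
```

With the defaults, accesses at 0, 5, 90, 95 and 200 minutes would form two
sessions of two accesses each; the lone access at 200 minutes is discarded
because it falls below the minimum of two.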


--------------------------------------------------------------------------
Interpreting Follow Output

Each block of output starts with the requesting hostname and browser type (if
this information is included in your logs). If identd or user
authentication information is available, it will be used in place of
the hostname. 

The following lines are a progression of pages that the user has accessed,
first listing the page requested, then the referer (if it was different from
the previous page), followed by the time requested. If the referer is offsite,
you can follow its link to find out where the user came from.

You can use this output to discover how long people are staying on your pages,
where they go 'back', and when they have to reload a page. This data allows 
you to analyse how people navigate your site.
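
If you want to put numbers on how long people stay on a page, the gap
between one request and the next is a rough estimate. The Python sketch
below is only an illustration and is not part of Follow 1.51; the function
name time_on_pages and the (page, time) pair layout are assumptions.

```python
from datetime import datetime


def time_on_pages(session):
    """Estimate seconds spent on each page of one session.

    session is a list of (page, time) pairs in request order. The time on a
    page is taken as the gap until the next request, so the last page of a
    session has no estimate and is left out.
    """
    result = []
    for (page, t), (_, t_next) in zip(session, session[1:]):
        result.append((page, int((t_next - t).total_seconds())))
    return result
```

A page that appears twice with another page in between suggests the user
went 'back'; the same page twice in a row suggests a reload.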

Note that Follow will only track non-image 'hits'; currently, it ignores
.jpg, .gif and .xbm files. It also ignores hits from hosts with "proxy",
"squid" or "cache" in their name, because proxies serve several hosts and
may skew your output. (If an authenticated session comes from a proxy, it
will be included.)
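
The filtering rules above can be summarised in a short Python sketch. It is
an approximation, not Follow's actual code: the name is_tracked is
hypothetical, and the case-insensitive matching and the stripping of query
strings are assumptions made for the example.

```python
IMAGE_SUFFIXES = (".jpg", ".gif", ".xbm")
PROXY_WORDS = ("proxy", "squid", "cache")


def is_tracked(path, host, user=None):
    """Return True if a hit would be kept under the rules described above."""
    # Image requests are never tracked (assumption: any query string is
    # ignored and the comparison is case-insensitive).
    if path.split("?", 1)[0].lower().endswith(IMAGE_SUFFIXES):
        return False
    # A logged user of "-" means no authentication information.
    authenticated = user not in (None, "", "-")
    # Proxy-like hosts are skipped unless the session is authenticated.
    if not authenticated and any(w in host.lower() for w in PROXY_WORDS):
        return False
    return True
```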


--------------------------------------------------------------------------
Automating Follow

follow.cgi is included with follow as a simple wrapper script that lets
you control follow output from the Web. To use it, copy follow.cgi and
www-lib.pl to your cgi-bin directory, and edit the configuration portion
of follow.cgi to suit your site.

When you access the URL of the script, it will present you with a form
that allows you to access the command-line options from the Web.

You can also use cron to automate Follow, if you wish; talk to your sysadmin.


--------------------------------------------------------------------------
About the Combined Logfile Format

Most modern Web servers keep logs in the Common Logfile Format. To get the
most out of Follow, you need to use the Combined format, which is similar
to the Common format, with the Referer and Agent logs tacked onto the end.

Most Web servers are capable of doing this. For Apache, enter this line
into httpd.conf:

LogFormat "%h %l %u %t \"%r\" %>s %b %{referer}i \"%{user-agent}i\""

Other servers should be similar; consult your documentation.
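
To check whether your logs match the expected layout, it can help to see how
a line breaks down into fields. The Python sketch below is one way to parse
a Common or Combined format line; it is not taken from Follow itself, and
the names LOG_LINE and parse_line are hypothetical. It accepts the referer
with or without surrounding quotes.

```python
import re
from datetime import datetime

# Common Logfile Format fields, optionally followed by the two extra
# Combined fields (referer and user agent).
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "?(?P<referer>[^" ]*)"? "(?P<agent>[^"]*)")?\s*$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # Timestamps look like 10/Oct/1997:13:55:36 -0700
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry
```

For a Common format line, the referer and agent entries come back as None,
which is why Follow has less to work with when those fields are not logged.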


--------------------------------------------------------------------------
To Do

The next version of follow (2.0) will use more complex structures (most 
likely a hash of arrays of arrays) to allow extraction of more data (such
as time spent on individual pages and popular use pattern recognition). If
you have any suggestions, please send them to me at <mnot@pobox.com>.
Thanks!