HTGREP FAQ List

Last modified: Mon Oct 21 13:17:45 MET DST 1996


Htgrep is a Perl utility written by Oscar Nierstrasz to facilitate the construction of simple search engines for the WWW as CGI scripts. This document attempts to answer a lot of the frequently asked questions that people have about using and installing htgrep.

The actual software (scripts and libraries) is in the SCG Software Archive, where you can also check the current versions of the required packages.


A gzipped, tarred distribution package (including this HTML file) is now available.


You can now register any search engines you implement using htgrep so that you may be notified by e-mail when updates are available. (When you register, you may also optionally have your search engine added to a public list of htgrep search engines.)


Version 2.2 provides some needed sanity checks. It also supports (since version 2.0) boolean keyword queries, multi-file searching, optional case sensitivity, pseudo-URL expansion, footer files, configurable record separators, and a simpler, more maintainable structure. Thanks to Paul Sutton and Brian Exelbierd for the contributed routines and modifications.


This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program (as the file COPYING in the main directory of the distribution); if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


Contents

General
What is htgrep?
How does it work?
Installation
How do I install htgrep?
Htgrep doesn't work. How can I debug it?
Does htgrep work with Perl 5?
Will htgrep run on a PC/Mac/...?
Search Files
What format should the search file conform to?
How can I query plain text files?
How can I query refer bibliography files?
My queries always return the whole file. What's wrong?
How can I make a forms interface to my search engine?
How can I prevent users from searching arbitrary files?
When I query a file, all URLs returned are relative to the CGI script instead of to the searched file. What's wrong?
How can I prevent users from downloading the whole search file?
Querying
How can I search a line at a time?
How do I do a case-sensitive search?
How can I make an AND query?
How can I search across newlines?
How do I match word boundaries?
How do I query for lines starting with "A"?
How can I search a set of files?
Can I use a pattern to specify the files to search?
Wrapper Scripts
How can I hardwire htgrep parameters?
What parameters can I set?
How can I post-process query results?
How can I search just part of a file?
How can I restrict searches to parts of records (not URLs)?
Maintenance
I found a bug. Should I e-mail you?
I have modified htgrep. Would you like to make my changes part of the standard distribution?
Can I include source code from htgrep in my new product/book?
Varia
How did htgrep get developed?
How does htgrep perform?

General


What is htgrep?

Htgrep is a CGI script written in Perl that allows you to query any document accessible to your HTTP server on a paragraph-by-paragraph basis. Htgrep is a generic front-end to htgrep.pl, a Perl package that makes it relatively easy to implement simple search engines for the WWW. Htgrep allows you to pass all parameters to the search package as part of the URL, so a forms interface can be used to set the parameters. Usually, however, you will want to write your own CGI script by adapting htgrep so that most of the parameters are hard-wired.

Some of the built-in features of htgrep are boolean keyword queries, multi-file searching, optional case sensitivity, pseudo-URL expansion, header and footer files, configurable record separators, and automatic formatting of plain text and refer bibliography files.

Back to Contents


How does it work?

Htgrep takes a user-supplied query, converts it, if necessary, to a Perl regular expression, and applies it to a list of files of records. By default, records are expected to be separated by blank lines. Records that match the query are wrapped up by htgrep to form a valid HTML file. The text to be displayed before and after the query results can be separately specified as optional header and footer files. If the records are HTML list items, plain ascii text, or refer bibliography entries, the query results can be automatically processed by htgrep to produce valid HTML. It is also possible to define your own post-processing filter in case the file to search is in some other format.
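In outline, the core of the search looks like the following Perl fragment (a minimal sketch, not htgrep's actual code; the query and file name are hypothetical):

	# Paragraph-mode search: print every record matching the query.
	$/ = "";                    # blank lines separate records
	$query = "composition";     # a hypothetical user query
	open(FILE, "records.html") || die "cannot open records.html: $!";
	while (<FILE>) {
		print "$_\n" if /$query/i;    # emit each matching record
	}
	close(FILE);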

Back to Contents


Installation


How do I install htgrep?

As a bare minimum, you will need to:
  1. Install the required perl packages (htgrep.pl and its supporting libraries) in some directory that is accessible to your HTTP server.
  2. Configure htgrep.pl (if necessary, set $www_home, @USER_HOMES and the list of pseudo-URLs; see the note below).
  3. Install the htgrep.cgi wrapper script in a directory of CGI scripts for your server. Note that this is a separate file from the htgrep.pl package of search routines! (With some servers, you can install it anywhere, calling it "htgrep.cgi", to inform the server that it is an executable script.)
  4. Configure htgrep (or htgrep.cgi) by setting $PERLLIB_INC to the directory containing the installed perl packages. Also make sure that the first line points to the location of Perl on your HTTP server's machine. (Either Perl 4 or Perl 5 will work.)
  5. Install form.html somewhere on your server at a public URL.
  6. Configure form.html so that the form's ACTION points to the URL of the htgrep script (usually "/cgi-bin/htgrep"). You can also set the default value of the parameter indicating the file to be searched (path relative to your server's home).
Now htgrep should work: open the URL of the form, set the file to be searched, provide a query string, and select the submit button. A page containing all records that match the search string should be returned.

You can also query a file by directly composing the URL. Suppose that http://site/path is the URL of an HTML document available at your site, and that you have installed htgrep in the server's cgi-bin directory. Then you can query this document by simply using the URL:

	http://site/cgi-bin/htgrep/file=path
For example, you can query this document with the URL:

	http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/Src/Doc/htgrep.html
(Setting the list of pseudo URLs is optional if you do not expect to search files in the pseudo locations. Similarly, setting @USER_HOMES is not necessary if you only plan to search files in a single location: just set $www_home to the directory containing the search files. You won't need to install either the main htgrep wrapper script or form.html if you are developing your own wrapper script or forms interface, but it may be easier to start by getting these to work first.)

Back to Contents


Htgrep doesn't work. How can I debug it?

Check that everything is installed and configured as described above. If everything seems like it should work, but doesn't - i.e., you always get an uninformative error back from the HTTP server rather than the result of a query - then your best bet is to try running htgrep from the shell via a wrapper script, rather than through the CGI interface, so that Perl can give you more informative error messages. Be sure to run the script on the same machine that the HTTP server runs on!

The wrapper script should set all the relevant parameters that would ordinarily be passed in the URL. You can adapt the following example:

	#! /bin/sh
	#
	# cgi           --- run a cgi script manually
	#
	# the name of the script -- "htgrep", or whatever:
	SCRIPT_NAME=/cgi-bin/htgrep

	# any tags to supply in the URL before the query:
	PATH_INFO="file=~scg/Src/Doc/htgrep.html"

	# the query itself (hardwire, or take as arguments):
	QUERY_STRING="$*"

	export SCRIPT_NAME PATH_INFO QUERY_STRING

	# the full path name of the script:
	exec /home/www/httpd-1.3/cgi-bin/htgrep
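For example, if you save this script as cgi and make it executable, then

	./cgi oscar

run from the shell should print the raw HTML of the query result (or Perl's error messages) directly to your terminal.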
You can also set the tag $htgrep'tags{'debug'} = "yes"; within htgrep.cgi. This will cause htgrep to print out the script it generates to evaluate the query.

If this still doesn't help, get a Perl guru to help you. Please avoid sending e-mail to me unless you are really, really stuck. In that case, please describe your problem very precisely, give a URL with the location of all the bits and pieces that you have managed to install so far and a URL that illustrates the precise problem, and maybe I will try to find time to look at it.

Back to Contents


Does htgrep work with Perl 5?

A couple of minor incompatibilities have been patched. Htgrep now appears to work with either Perl 4 or 5. No special features of Perl 5 are used, however.

Back to Contents


Will htgrep run on a PC/Mac/...?

Htgrep works under SunOS with NCSA HTTPD. It probably works with other Unixes and HTTP servers. I have no idea what it takes to make it run on other platforms. If you port it, please let me know.

Back to Contents


Search Files


What format should the search file conform to?

The file to be searched should consist of a sequence of records separated by blank lines. Htgrep expects a blank line to be the record separator (i.e., "\n\n", or two ASCII newline characters: 012 012). Normally each record will consist of a valid HTML paragraph.
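For example, a small (hypothetical) search file might contain:

	<b>First record.</b>
	Each record is a self-contained HTML paragraph.

	<b>Second record.</b>
	A blank line ends the previous record and starts this one.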

If the file contents are list items rather than standalone HTML blocks, htgrep can be instructed to bracket the results of the search with <DL> and </DL>, <OL> and </OL> or <UL> and </UL>. The tag style=style must be included in the call to htgrep, where style is one of pre, dl, ol or ul. For example, we may query a list of titles of HTML documents and cause the resulting entries to be numbered as follows:

	http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/iam-index&style=ol
Htgrep can also deal with plain text files and refer bibliography files. Other formats can be accommodated if you are willing to write a filter to post-process the query results.

Back to Contents


How can I query plain text files?

The tag style=pre can be used if the source document is a plain text file. This will cause special characters to be escaped and each paragraph to be surrounded by <PRE> and </PRE>.

The tag grab=yes will additionally cause htgrep to search for URLs and ftp pointers and convert them into hypertext links.

The following example queries a refer bibliography file, highlighting any URLs that are found:

	http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/Bib/main.bib&style=pre&grab=yes
Note that refer files can also be automatically formatted by htgrep rather than being treated as plain text.

Back to Contents


How can I query refer bibliography files?

Htgrep can also be used to query a database of refer(1) style bibliography entries. Use the tag refer=yes.

See, for example, the OO Bibliography Database.

The tag abstract=yes is used internally by htgrep and is automatically generated when a bibliography entry contains an abstract (%X field). A link to a new call to htgrep is then generated, which will cause the abstract for a given entry to be displayed. Links to ftpable papers are also generated, if the refer entry contains a line of the form:

	%% ftp:site:file
or
	%% URL
For example, see the list of OO papers available by ftp in the same OO bibliography database.

If the tag ftpstyle=dir is used, the link will be to the containing directory rather than to the file itself (to facilitate exploration).

Back to Contents


My queries always return the whole file. What's wrong?

Records must be separated by a blank line. The <p> paragraph separator is not recognized. A blank line means two consecutive ASCII newline characters (octal 012). The carriage return character (octal 015, ^M) is not the same thing. If the record separator is missing, the entire file is considered as a single record.

If for some reason it is not convenient for you to use a blank line as a record separator, you can change its definition by setting $htgrep'separator from your wrapper script (or configuring it inside htgrep.pl). For example, you might set:

	$htgrep'separator = "<p>";
Back to Contents


How can I make a forms interface to my search engine?

You can adapt the generic form interface to htgrep (please familiarize yourself first with how to write HTML forms). You may wish to set certain tags within the form without giving users the possibility to modify them from within the form by using the TYPE="hidden" qualifier. For example, you might specify:

	<FORM ACTION="/cgi-bin/htgrep.cgi">
	<INPUT TYPE=hidden NAME="file" VALUE="/~scg/Bib/main.bib">
	<INPUT TYPE=hidden NAME="refer" VALUE="yes">
	<INPUT NAME="isindex" SIZE=30> <B>Search String</B>
	<INPUT TYPE="submit" VALUE="Submit">
	<INPUT TYPE="reset" VALUE="Reset">
	</FORM>
In addition, you will probably want to replace the generic header that is generated by htgrep when it accesses the searched document. By default, htgrep produces only a minimal title and introduction to a searchable document. You can produce your own cover page as follows:

Suppose that the file to search is called base.html (or base.bib, etc.). If a header file base.hdr exists, htgrep will print that instead of the default header. In addition, if base.qry exists, it will be used whenever a non-empty query is given. Normally base.hdr will be a cover page with introductory information, whereas base.qry will only contain the title and main headline. Note that it is often convenient to have base.hdr simply be a link to the form interface (i.e., form.html) so that an empty query will reproduce the full form. (Most of the examples given here work this way.)

Brian Exelbierd has provided extensions to htgrep that allow it also to recognize footer files in the same way. If you define the files base.hdr_footer and base.qry_footer, they will be automatically appended to the query output when queries are respectively absent or present.

The header pages can also be specified directly in the URL with the tags hdr=file and qry=file. The footer pages can be specified with the tags hdr_footer=file and qry_footer=file.
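For example (a hypothetical URL, assuming header files installed next to the search file):

	http://site/cgi-bin/htgrep/file=path/base.html&hdr=path/base.hdr&qry=path/base.qry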

Back to Contents


How can I prevent users from searching arbitrary files?

If you do not want users to search arbitrary files, do not install htgrep.cgi as is, but make a separate wrapper script for each file you want to be searchable. Note that htgrep will in any case refuse to search files in directories containing a .htaccess file, unless this is explicitly overridden.

You can hardwire the file to be searched in a CGI wrapper script. Htgrep normally picks up all its parameters from the URL (with the routine &htgrep'ParseTags). These can either be ignored or overridden. Parameters are stored in the associative array %htgrep'tags. Just set the 'file' tag to the file to be searched as follows:

	# Set/Override the file to search:
	$htgrep'tags{'file'} = "~oscar/Wdb/food.w";
before the call to &htgrep'DoSearch (but after the calls to &htgrep'ParseTags).

Back to Contents


When I query a file, all URLs returned are relative to the CGI script instead of to the searched file. What's wrong?

This is correct behaviour. Relative URLs are always expanded by the client (i.e., your browser) relative to the current document's URL. This is the URL that you are using to access htgrep. Your client has no knowledge of the URL of the document being searched. You can correct this in a number of ways: use absolute URLs (or htgrep's pseudo-URLs) in the searched file, or supply a header file that declares the searched document's URL in a <BASE> tag.

Back to Contents


How can I prevent users from downloading the whole search file?

Normally a maximum of 250 records will be retrieved. This can be controlled with the tag max=number. You should hardwire this if you do not want users to override the default maximum. (I.e., set $htgrep'tags{'max'}.)
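For example, in the wrapper script (after the calls to &htgrep'ParseTags):

	# Never return more than 50 records, whatever the URL says:
	$htgrep'tags{'max'} = "50";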

By default, htgrep will only allow documents to be searched that exist underneath the HTTP server's home directory. This means that, in theory, if a document can be searched, it can also be retrieved in full (if the user can discover its URL!).

Now, you can restrict access to a directory with certain HTTP servers by putting a .htaccess file in that directory that specifies which client machines are allowed to access the files in that directory. Htgrep checks if such a file exists in the directory of the file to be searched, and if it does, it simply refuses to search the file. (Parsing .htaccess files goes beyond the simple functionality htgrep was intended to provide.)
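For example, a typical NCSA-style .htaccess file that denies access to all clients outside a (hypothetical) local domain looks like this:

	<Limit GET>
	order deny,allow
	deny from all
	allow from .unibe.ch
	</Limit>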

What you really want is to deny access to the searched file through a direct URL, but permit access via htgrep. You can do this by setting:

	$htgrep'safeopen = 0;
before &htgrep'DoSearch is called. This overrides any .htaccess file that may be present, and allows htgrep to query the file. Since the server will refuse direct access to the file, this means it can only be queried through the htgrep interface.

In addition to limiting the number of records returned by htgrep, you may also want to check the authorization of the client machine yourself. The client's internet address is available in the variable $ENV{'REMOTE_ADDR'}.
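A minimal sketch of such a check in a wrapper script, assuming a hypothetical local network numbered 130.92.*.*:

	# Refuse queries from clients outside the (hypothetical) local network:
	if ($ENV{'REMOTE_ADDR'} !~ /^130\.92\./) {
		print "Content-type: text/html\n\n";
		print "Sorry, this search engine is restricted to local clients.\n";
		exit(0);
	}
	# ... otherwise fall through to &htgrep'DoSearch as usual.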

Back to Contents


Querying


How can I search a line at a time?

Set the tag linemode=yes in the URL. Alternatively set
	$htgrep'tags{'linemode'} = "yes";
in the CGI script. (This is equivalent to setting $htgrep'separator = "\n".)

Back to Contents


How do I do a case-sensitive search?

Set the tag case=yes in the URL, or set:
	$htgrep'tags{'case'} = "yes";
in the wrapper script.

Back to Contents


How can I make an AND query?

Thanks to Paul Sutton, htgrep now supports boolean keyword searches. If a list of keywords is given as a query, only records that match every one of the keywords will be returned. Complex boolean queries using both AND and OR are also supported. For more details, see the htgrep query overview.

Back to Contents


How can I search across newlines?

The support for boolean keyword searches should address this most of the time (i.e., you can ask for records that contain two or more keywords, possibly on separate lines).

If you really want to match records with two words adjacent, and possibly separated by a newline, then use a Perl regular expression instead. For example, suppose you want to search for records containing (in order) the words "black cat" separated by whitespace. Then, instead of the query black cat or black and cat, you could ask for black\s+cat. (\s will match blanks, tabs or newlines.)

Back to Contents


How do I match word boundaries?

Word boundaries are automatically matched when you use keyword searches. The query "black cat" will not match a record containing the words "blackened cattle". To override this feature, either use the wildcard character ("*") - i.e., "black* cat*" will succeed here - or use a Perl regular expression instead.

Back to Contents


How do I query for lines starting with "A"?

Normally you would use the regex "^A", but since the pattern is applied to whole records, this will only match records beginning with "A". (You must force htgrep to interpret this as a regex by setting the tag boolean=no.) What you really want is "\nA" (though this will not match records whose first line begins with "A"!). To cover both cases, use a pattern such as "(^|\n)A".

Back to Contents


How can I search a set of files?

Well, you probably should be using WAIS, but if you want to use htgrep, here's how:

You can set the file tag to be a comma-separated list of files rather than a single file name:

http://iamwww.unibe.ch/cgi-bin/htgrep?file=~scg/Bib/ecoop.refer,~scg/Bib/oopsla.refer&refer=yes
Alternatively, you may set the variable @files to the list of files to search, for example:
	@htgrep'files = (
		"~scg/Bib/ecoop.refer",
		"~scg/Bib/oopsla.refer"
	);
Note that any header and footer files to be used will be relative to the first file to search.

If you really want to have htgrep automatically generate the list of files to search by applying some criteria to a starting directory, then you should write your own wrapper script in Perl to generate that list and set @files before calling &htgrep'DoSearch.

Back to Contents


Can I use a pattern to specify the files to search?

No. This is actually a little harder than it sounds, because the patterns you would want to specify would be relative to URLs, not to the actual file system. In a wrapper script you might use a pattern to get the list of files in the search directory, but then you would have to prepend each item in the list with the relative pathname for the directory. Getting a general routine to work is a bit more complicated. Let me know if you write one.
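A rough sketch of the wrapper-script approach, assuming (hypothetically) that the files to search are the .refer files in a single directory:

	# Build the list of files to search before calling &htgrep'DoSearch:
	$dir = "/home/scg/public_html/Bib";   # the directory on disk (hypothetical)
	$rel = "~scg/Bib";                    # the same directory as htgrep sees it
	opendir(DIR, $dir) || die "cannot open $dir: $!";
	@htgrep'files = grep(/\.refer$/, readdir(DIR));
	closedir(DIR);
	foreach $f (@htgrep'files) { $f = "$rel/$f"; }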

Back to Contents


Wrapper Scripts


How can I hardwire htgrep parameters?

The complete code for htgrep follows:
	#! /usr/local/bin/perl
	# Where htgrep.pl resides:
	$PERLLIB_INC = "/home/scg/local/perl/lib";
	unshift(@INC,$PERLLIB_INC);
	require("htgrep.pl");

	# Set parameters encoded in URL:
	&htgrep'ParseTags($ENV{'PATH_INFO'});
	# Set parameters encoded in query:
	&htgrep'ParseTags($ENV{'QUERY_STRING'});

	# Perform the search:
	&htgrep'DoSearch;
As mentioned above, parameters are stored in the associative array %htgrep'tags. A tag can be set or overridden explicitly as follows:
	# Set/Override the file to search:
	$htgrep'tags{'file'} = "~oscar/Wdb/food.w";
Of course, this must be done before the call to &htgrep'DoSearch. If you do not want users to be able to override the value of the tag in the URL or the query string, you must set the tag after the calls to &htgrep'ParseTags.

Here is a complete example of a wrapper script:

	#! /usr/local/bin/perl
	#
	# hotlist.cgi   --- search Oscar's hotlist
	#
	# Author: Oscar Nierstrasz 19/8/94

	unshift(@INC,"/home/oscar/public_html/cgi-bin/Lib");
	require("htgrep.pl");

	# Pick up parameters and query string:
	&htgrep'ParseTags($ENV{'PATH_INFO'});
	&htgrep'ParseTags($ENV{'QUERY_STRING'});

	# Set the file to search and other options:
	$htgrep'tags{'file'} = "~oscar/WWW/hl.html";
	$htgrep'tags{'style'} = "ol";

	# Do the search!
	&htgrep'DoSearch;
Back to Contents


What parameters can I set?

These parameters can either be set in the URL (or, equivalently, in the entry form), or they can be hard-wired in the wrapper script by setting $htgrep'tags{'tagname'} = value:

	file        -- the list of files to search (comma-separated)
	isindex     -- the query string
	case        -- [no/yes] be case-sensitive 
	boolean     -- [auto/no/yes] use Boolean keywords vs. regexes 

	style       -- [none/pre/ol/ul/dl] format of records
	max         -- max # records to return (default 250)
	grab        -- [no/yes] convert URLs to hypertext (in plain text)
	linemode    -- [no/yes] each record is a single line

	refer       -- [no/yes] format records as refer bibliography entries
	abstract    -- [no/yes] show abstract
	ftpstyle    -- [file/dir] make link to ftp file or dir (for refer)

	hdr         -- header file (to precede output)
	qry         -- query file (alternative header for non-empty query)
	hdr_footer  -- footer file (to follow output) 
	qry_footer  -- footer file (alternative footer for non-empty query) 

	filter      -- name of routine to filter query results
	version     -- [no/yes] only print current version of package 
	debug       -- [no/yes] show the Perl code to evaluate the query 
Back to Contents


How can I post-process query results?

You can write your own callback routine to be applied by htgrep to the query results. Just set $htgrep'tags{'filter'} to the name of your routine. Note that the filter must be within the scope of the htgrep package, so it should be declared as htgrep'filtername, not just filtername.

You can also introduce your own tag names if you want to dynamically set the filter to use according to the value of a tag.

One application of this is the Object-Oriented Information Sources catalogue. Records are stored as sets of tagged entries (similar to the format used for refer). Retrieved records are dynamically formatted as HTML, thus guaranteeing a uniform appearance. A forms interface is also used to allow users to submit new entries. Internally, entries are stored as follows:

	K scg composition open systems concurrency
	C research-home-page
	U http://iamwww.unibe.ch/~scg/
	T Software Composition Group
	D The Software Composition Group conducts research
	  into the use of object technology and related approaches for the
	  development of flexible, open software system.
	U http://iamwww.unibe.ch/~oscar
	W Oscar Nierstrasz
	I University of Berne, CH
which generates:
Software Composition Group
The Software Composition Group conducts research into the use of object technology and related approaches for the development of flexible, open software system.
Who: Oscar Nierstrasz
Where: University of Berne, CH
Here is the complete text of ooinfo.cgi:
	#! /usr/local/bin/perl
	#
	# ooinfo	--- Object-Oriented Information Sources
	#
	# Author: Oscar Nierstrasz 19/8/94
	# Re-written 26.6.95 to use tag-file format
	#
	# This script is a front-end to htgrep.pl
	#
	# This script and friends can be found at:
	# http://iamwww.unibe.ch/~scg/Src/

	unshift(@INC,"/home/scg/local/perl/lib");
	require("htgrep.pl");

	# What is the name of this script?
	$htgrep'tags{'htgrep'} = $ENV{'SCRIPT_NAME'};

	# Pick up parameters and query string:
	&htgrep'ParseTags($ENV{'PATH_INFO'});
	&htgrep'ParseTags($ENV{'QUERY_STRING'});

	# Set the file to search and other options:
	$htgrep'tags{'file'} = "~scg/OOinfo/OOINFO";

	$htgrep'tags{'filter'} = "ooinfo";

	# Do the search!
	&htgrep'DoSearch;

	# Filter for ooinfo records:
	sub htgrep'ooinfo {
		&accent'html;	# convert accented characters to HTML entities
		# Delete keywords:
		s/^K.*/<hr>/;
		s/\nC.*//g;
		# Format URLs:
		s/\nU (.*)\n(\w) (.*)/\n$2 <a href=$1><b>$3<\/b><\/a>/g;
		# Format titles:
		s/\nT (.*)/\n<b>$1<\/b>/;
		# Format descriptions:
		s/\nD (.*)/\n<br><i>$1<\/i>/;
		s/\n\s+(.*)/\n<i>$1<\/i>/g;	# patch continuation lines
		s/<\/i>\n<i>/\n/g;
		# Contact persons:
		# group together multiple entries
		while(s/\nW (.*)\nW (.*)/\nW $1, $2/) { ; }
		s/\nW (.*)/\n<br><b>Who:<\/b> $1/g;
		# Institution:
		s/\nI (.*)/\n<br><b>Where:<\/b> $1/g;
		# See also:
		while(s/\nS (.*)\nS (.*)/\nS $1, $2/) { ; }
		s/\nS (.*)/\n<br><b>See also:<\/b> $1/g;
		# Delete comments:
		s/\n#.*//g;
		s/^#.*//;
	}

	__END__
Cryptic perhaps, but that's Perl for you!

A more advanced application is Oscar's Who's Who, an interface to an address database. Each field of an address record starts with a single-letter tag indicating the kind of data stored. The wrapper script formats records in different ways depending on what information is desired. Furthermore, only privileged clients are allowed to access the complete address information. The filters and the authorization routines are part of a separate package.

The same package is used as an interface to a restaurant database.

Back to Contents


How can I search just part of a file?

You can:

Back to Contents


How can I restrict searches to parts of records (not URLs)?

Probably the best way to solve this is to use a tag-file format for the searchable file, and generate HTML on the fly with a filter. You can then transform queries so that they will only match certain lines or certain fields. For example, to match only text in the title field of the OO Info catalogue, the query "text" can be transformed in the CGI wrapper script to "\nT.*text".
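A minimal sketch of such a wrapper script, assuming the query arrives in the standard isindex tag:

	# Pick up parameters and query string as usual:
	&htgrep'ParseTags($ENV{'PATH_INFO'});
	&htgrep'ParseTags($ENV{'QUERY_STRING'});

	# Force regex interpretation and anchor the query to the title field:
	$htgrep'tags{'boolean'} = "no";
	$htgrep'tags{'isindex'} = "\nT.*" . $htgrep'tags{'isindex'};

	&htgrep'DoSearch;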

Back to Contents


Maintenance


I found a bug. Should I e-mail you?

If you're really sure it's a bug. If you have a bug fix, so much the better. The less e-mail I get about htgrep, the happier I am. If you fix a bug, then I'll get less e-mail!

Please provide URLs that illustrate the bug. If you have a bug fix, please make sure it's relative to the latest version of htgrep and its related packages!

Back to Contents


I have modified htgrep. Would you like to make my changes part of the standard distribution?

Sure, as long as the changes are robust, and they don't impact performance. If you want to make a separate page describing your version, I'll be happy to include a link.

Again, please provide URLs that illustrate your modifications, and make sure that they are relative to the latest version of htgrep and its related packages!

Back to Contents


Can I include source code from htgrep in my new product/book?

You can do anything you want with htgrep as long as you follow the GNU copyright rules (described in detail in the GNU General Public License). If you're in doubt, ask.

Back to Contents


Varia


How did htgrep get developed?

I set up the WWW server at the CUI (University of Geneva, Switzerland) in the summer of 1993. I quickly realized that search engines were desperately needed to complement the navigational features of the Web, but that these were hard to integrate in the pre-CGI version of CERN's HTTP server. As a result, we switched our server to Plexus, an HTTP server written in Perl. Plexus was designed to be easy to extend. The first version of htgrep was called parscan, and was implemented as a library add-on to Plexus.

The first application of parscan was CUI's W3Catalog, one of the first attempts to develop a searchable catalog of WWW resources. W3Catalog was built by nightly mirroring and reformatting selected documents from the CERN Virtual Library and other lists of WWW Resources so that they could easily be searched.

When the htbin interface to the NCSA httpd server was introduced in the Fall of 1993, it was immediately clear that parscan should be re-written so that it could be used as a script with any compliant server. Shortly thereafter the Common Gateway Interface was introduced and rapidly adopted as a standard way to share scripts across different servers. The conversion of htgrep to CGI was finally made in May 1994.

With the explosion of the Web, interest in CGI tools has grown, and it appears that htgrep has been picked up by a large number of sites as a quick way to get a search engine up and running. This FAQ list is a response to the large number of e-mail messages that have been received, and version 2.0 attempts to address some of the more serious shortcomings of the package. (Thanks to Paul Sutton and Brian Exelbierd for the contributed routines and modifications.) Version 2.0 has been completely re-organized and largely re-written to be more maintainable and easier to read. With luck, I should get fewer requests for help!

Back to Contents


How does htgrep perform?

Remarkably well, thanks to the efficiency of Perl's text searching. Unfortunately htgrep is not really suited for heavyweight applications. For each query:
  1. the HTTP server forks a new process and starts the Perl interpreter
  2. htgrep and the dependent packages are loaded and compiled
  3. the entire search file must be read and searched
If the server is popular, a large number of instances of htgrep can be running at any one time, thus slowing the server to a crawl. Since the entire search file is read for each query, this can pound your disk. (The file really should be cached.) If the search file is accessed through the local network, this can pound your network.

For modest applications, htgrep can be a reasonable choice for getting a search engine up and running quickly. For popular or heavyweight applications, the search engine should be encapsulated in a dedicated server (or integrated with the http server itself).

Back to Contents


Back to Software Archive

oscar@iam.unibe.ch