The actual software (scripts and libraries) is in the SCG Software Archive. You can check there for the current versions of the required packages.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program (as the file COPYING in the main directory of the distribution); if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
Some of the built-in features of htgrep are:
Back to Contents
@USER_HOME = ("/user", "/home");
, then ~oscar will be expanded if either /user/oscar or /home/oscar exists]
You can also query a file by directly composing the URL.
Suppose that http://site/path is the URL of an HTML document available at your site, and that you have installed htgrep in the server's cgi-bin directory. Then you can query this document by simply using the URL:
http://site/cgi-bin/htgrep/file=path
For example, you can query this document with the URL:
http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/Src/Doc/htgrep.html
(Setting the list of pseudo URLs is optional if you do not expect to search files in the pseudo locations. Similarly, setting @USER_HOMES is not necessary if you only plan to search files in a single location: just set $www_home to the directory containing the search files. You won't need to install either the main htgrep wrapper script or form.html if you are developing your own wrapper script or forms interface, but it may be easier to start by getting these to work first.)
Back to Contents
The wrapper script should set all the relevant parameters that would ordinarily be passed in the URL. You can adapt the following example:
#! /bin/sh
#
# cgi --- run a cgi script manually
#
# the name of the script -- "htgrep", or whatever:
SCRIPT_NAME=/cgi-bin/htgrep
# any tags to supply in the URL before the query:
PATH_INFO="file=~scg/Src/Doc/htgrep.html"
# the query itself (hardwire, or take as arguments):
QUERY_STRING="$*"
export SCRIPT_NAME PATH_INFO QUERY_STRING
# the full path name of the script:
exec /home/www/httpd-1.3/cgi-bin/htgrep
You can also set the tag
$htgrep'tags{'debug'} = "yes";
within htgrep.cgi.
This will cause htgrep to print out the script it generates
to evaluate the query.
If this still doesn't help, get a Perl guru to help you. Please avoid sending e-mail to me unless you are really, really stuck. In that case, please describe your problem very precisely, give a URL with the location of all the bits and pieces that you have managed to install so far and a URL that illustrates the precise problem, and maybe I will try to find time to look at it.
Back to Contents
Back to Contents
Back to Contents
If the file contents are list items rather than standalone HTML blocks, htgrep can be instructed to bracket the results of the search with <DL> and </DL>, <OL> and </OL>, or <UL> and </UL>. The tag style=style must be included in the call to htgrep, where style is one of pre, dl, ol or ul.
For example, we may query a list of titles of HTML documents
and cause the resulting entries to be numbered as follows:
http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/iam-index&style=ol
Htgrep can also deal with plain text files and refer bibliography files. Other formats can be accommodated if you are willing to write a filter to post-process the query results.
Back to Contents
style=pre can be used if the source document is a plain text file. This will cause special characters to be escaped and each paragraph to be surrounded by <PRE> and </PRE>.
The tag grab=yes will additionally cause htgrep to search for URLs and ftp pointers and convert them into hypertext links.
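As a rough illustration of what grab=yes does (this is not htgrep's own code, which is Perl; the regular expression here is a simplified stand-in), a sketch that wraps bare http URLs in anchors might look like:

```python
# Illustrative sketch only: turn bare http URLs in plain text into
# hypertext links, roughly what the grab=yes option does for results.
import re

def grab_urls(text):
    # Wrap every run of non-space characters starting with http:// in an anchor.
    return re.sub(r"(http://\S+)", r'<a href="\1">\1</a>', text)

assert grab_urls("see http://example.org/ for details") == \
    'see <a href="http://example.org/">http://example.org/</a> for details'
```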
The following example queries a refer bibliography file, highlighting any URLs that are found:
http://iamwww.unibe.ch/cgi-bin/htgrep/file=~scg/Bib/main.bib&style=pre&grab=yes
Note that refer files can also be automatically formatted by htgrep rather than being treated as plain text.
Back to Contents
Refer bibliography files can be formatted automatically by setting refer=yes.
See, for example, the OO Bibliography Database.
The tag abstract=yes is used internally by htgrep and is automatically generated when a bibliography entry contains an abstract (%X field).
A link to a new call to htgrep is then generated, which will cause
the abstract for a given entry to be displayed.
Links to ftpable papers are also generated, if the refer entry
contains a line of the form:
%% ftp:site:file
or
%% URL
For example, see the list of OO papers available by ftp in the same OO bibliography database.
If the tag ftpstyle=dir is used, the link will be to the containing directory rather than to the file itself (to facilitate exploration).
Back to Contents
The <p> paragraph separator is not recognized. A blank line means two ASCII newline characters (octal 12). The carriage return character (octal 15, or ^M) is not the same thing. If the record separator is missing, the entire file is considered as a single record.
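As an illustration of this record model, here is a small sketch (in Python rather than htgrep's Perl) that splits text into blank-line-separated records, the way Perl's paragraph mode does:

```python
# Illustrative sketch only: records are separated by blank lines,
# i.e. two consecutive newline characters.
text = "first record\nstill first\n\nsecond record\n\n\nthird record\n"
records = [r.strip("\n") for r in text.split("\n\n") if r.strip()]
assert records == ["first record\nstill first", "second record", "third record"]
```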
If for some reason it is not convenient for you to use a blank line as a record separator, you can change its definition by setting $htgrep'separator from your wrapper script (or configuring it inside htgrep.pl). For example, you might set:
$htgrep'separator = "<p>";
Back to Contents
Fixed parameters can be supplied as form fields with the TYPE="hidden" qualifier. For example, you might specify:
<FORM ACTION="/cgi-bin/htgrep.cgi">
<INPUT TYPE=hidden NAME="file" VALUE="/~scg/Bib/main.bib">
<INPUT TYPE=hidden NAME="refer" VALUE="yes">
<INPUT NAME="isindex" SIZE=30> <B>Search String</B>
<INPUT TYPE="submit" VALUE="Submit">
<INPUT TYPE="reset" VALUE="Reset">
</FORM>
In addition, you will probably want to replace the generic header that is generated by htgrep when it accesses the searched document. By default, htgrep produces only a minimal title and introduction to a searchable document. You can produce your own cover page as follows:
Suppose that the file to search is called base.html (or base.bib, etc.). If a header file base.hdr exists, htgrep will print that instead of the default header. In addition, if base.qry exists, it will be used whenever a non-empty query is given. Normally base.hdr will be a cover page with introductory information, whereas base.qry will only contain the title and main headline. Note that it is often convenient to have base.hdr simply be a link to the form interface (i.e., form.html) so that an empty query will reproduce the full form. (Most of the examples given here work this way.)
Brian Exelbierd
has provided extensions to htgrep that allow it also to
recognize footer files in the same way.
If you define the files base.hdr_footer and base.qry_footer, they will be automatically appended to the query output when queries are respectively absent or present.
The header pages can also be specified directly in the URL with the tags hdr=file and qry=file. The footer pages can be specified with the tags hdr_footer=file and qry_footer=file.
Back to Contents
Htgrep will refuse to search files in a directory protected by a .htaccess file, unless this is explicitly overridden.
You can hardwire the file to be searched in a CGI wrapper script. Htgrep normally picks up all its parameters from the URL (with the routine &htgrep'ParseTags). These can either be ignored or overridden. Parameters are stored in the associative array %htgrep'tags. Just set the 'file' tag to the file to be searched as follows:
# Set/Override the file to search:
$htgrep'tags{'file'} = "~oscar/Wdb/food.w";
before the call to &htgrep'DoSearch (but after the calls to &htgrep'ParseTags).
Back to Contents
You can include <BASE HREF=URL> in the front page to your search engine. (Thanks to Dale Bewley for pointing this out.)
The maximum number of records returned can be set with the tag max=number. You should hardwire this if you do not want users to override the default maximum. (I.e., set $htgrep'tags{'max'}.)
By default, htgrep will only allow documents to be searched that exist underneath the HTTP server's home directory. This means that, in theory, if a document can be searched, it can also be retrieved in full (if the user can discover its URL!).
Now, you can restrict access to a directory with certain HTTP servers
by putting a .htaccess
file in that directory that
specifies which client machines are allowed to access the files
in that directory. Htgrep checks if such a file exists in the
directory of the file to be searched, and if it does, it simply
refuses to search the file. (Parsing .htaccess
files
goes beyond the simple functionality htgrep was intended to provide.)
What you really want is to deny access to the searched file
through a direct URL, but permit access via htgrep.
You can do this by setting:
$htgrep'safeopen = 0;
before &htgrep'DoSearch is called. This overrides any .htaccess file that may be present, and allows htgrep to query the file.
Since the server will refuse direct access to the file, this means it can
only be queried through the htgrep interface.
In addition to limiting the number of records returned by htgrep, you may also
want to check the authorization of the client machine yourself.
The internet number is available in the variable $ENV{'REMOTE_ADDR'}.
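A minimal authorization check of this kind might look like the following sketch (in Python rather than the wrapper script's Perl; the network prefix and addresses are arbitrary examples, and in the wrapper the client address would come from $ENV{'REMOTE_ADDR'}):

```python
# Illustrative sketch only: allow queries only from clients whose
# address begins with a chosen network prefix.
def client_allowed(remote_addr, allowed_prefix="130.92."):
    # The prefix is a hypothetical example, not part of htgrep.
    return remote_addr.startswith(allowed_prefix)

assert client_allowed("130.92.64.10")
assert not client_allowed("192.0.2.1")
```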
Back to Contents
Set linemode=yes in the URL. Alternatively set
$htgrep'tags{'linemode'} = "yes";
in the CGI script. (This is equivalent to setting $htgrep'separator = "\n".)
Back to Contents
Set case=yes in the URL, or set:
$htgrep'tags{'case'} = "yes";
in the wrapper script.
Back to Contents
Back to Contents
If you really want to match records with two words adjacent, and possibly separated by a newline, then use a Perl regular expression instead. For example, suppose you want to search for records containing (in order) the words "black cat" separated by whitespace. Then, instead of the query black cat or black and cat, you could ask for black\s+cat. (\s will match blanks, tabs or newlines.)
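Python's regular expressions treat \s the same way as Perl's, so the behaviour can be illustrated with this small sketch:

```python
# Illustrative sketch: \s+ matches across a newline inside a record,
# while a literal space does not.
import re

record = "a story about a black\ncat and other animals"
assert re.search(r"black\s+cat", record) is not None
assert re.search(r"black cat", record) is None
```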
Back to Contents
Back to Contents
boolean=no.)
What you really want is "\nA" (though this will not
match records whose first line begins with "A"!).
Sigh.
Back to Contents
You can set the file tag to be a comma-separated list of files rather than a single file name:
http://iamwww.unibe.ch/cgi-bin/htgrep?file=~scg/Bib/ecoop.refer,~scg/Bib/oopsla.refer&refer=yes
Alternatively, you may set the variable @files to the list of files to search, for example:
@htgrep'files = ( "~scg/Bib/ecoop.refer", "~scg/Bib/oopsla.refer" );
Note that any header and footer files to be used will be relative to the first file to search.
If you really want to have htgrep automatically generate the list of files to search by applying some criteria to a starting directory, then you should write your own wrapper script in Perl to generate that list and set @files before calling &htgrep'DoSearch.
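The comma-separated file tag can be illustrated with a minimal sketch (in Python rather than htgrep's Perl; the file names are the example paths used above):

```python
# Illustrative sketch only: the value of the file tag is split on
# commas to give the list of files to search.
tag = "~scg/Bib/ecoop.refer,~scg/Bib/oopsla.refer"
files = tag.split(",")
assert files == ["~scg/Bib/ecoop.refer", "~scg/Bib/oopsla.refer"]
```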
Back to Contents
Back to Contents
#! /usr/local/bin/perl
# Where htgrep.pl resides:
$PERLLIB_INC = "/home/scg/local/perl/lib";
unshift(@INC,$PERLLIB_INC);
require("htgrep.pl");
# Set parameters encoded in URL:
&htgrep'ParseTags($ENV{'PATH_INFO'});
# Set parameters encoded in query:
&htgrep'ParseTags($ENV{'QUERY_STRING'});
# Perform the search:
&htgrep'DoSearch;
As mentioned above, parameters are stored in the associative array %htgrep'tags.
A tag can be set or overridden explicitly as follows:
# Set/Override the file to search:
$htgrep'tags{'file'} = "~oscar/Wdb/food.w";
Of course, this must be done before the call to &htgrep'DoSearch. If you do not want users to be able to override the value of the tag in the URL or the query string, you must set the tag after the calls to &htgrep'ParseTags.
Here is a complete example of a wrapper script:
#! /usr/local/bin/perl
#
# hotlist.cgi --- search Oscar's hotlist
#
# Author: Oscar Nierstrasz 19/8/94
unshift(@INC,"/home/oscar/public_html/cgi-bin/Lib");
require("htgrep.pl");
# Pick up parameters and query string:
&htgrep'ParseTags($ENV{'PATH_INFO'});
&htgrep'ParseTags($ENV{'QUERY_STRING'});
# Set the file to search and other options:
$htgrep'tags{'file'} = "~oscar/WWW/hl.html";
$htgrep'tags{'style'} = "ol";
# Do the search!
&htgrep'DoSearch;
Back to Contents
The following tags can be set with $htgrep'tags{'tagname'} = value:
file -- the list of files to search (comma-separated)
isindex -- the query string
case -- [no/yes] be case-sensitive
boolean -- [auto/no/yes] use Boolean keywords vs. regexes
style -- [none/pre/ol/ul/dl] format of records
max -- max # records to return (default 250)
grab -- [no/yes] convert URLs to hypertext (in plain text)
linemode -- [no/yes] each record is a single line
refer -- [no/yes] format refer bibliography entries
abstract -- [no/yes] show abstract
ftpstyle -- [file/dir] make link to ftp file or dir (for refer)
hdr -- header file (to precede output)
qry -- query file (alternative header for non-empty query)
hdr_footer -- footer file (to follow output)
qry_footer -- footer file (alternative footer for non-empty query)
filter -- name of routine to filter query results
version -- [no/yes] only print current version of package
debug -- [no/yes] show the Perl code to evaluate the query
Back to Contents
Set $htgrep'tags{'filter'} to the name of your routine. Note that the filter must be within the scope of the htgrep package, so it should be declared as htgrep'filtername, not just filtername. You can also introduce your own tag names if you want to dynamically set the filter to use according to the value of a tag.
One application of this is the Object-Oriented Information Sources catalogue. Records are stored as sets of tagged entries (similar to the format used for refer). Retrieved records are dynamically formatted as HTML, thus guaranteeing a uniform appearance. A forms interface is also used to allow users to submit new entries. Internally, entries are stored as follows:
K scg composition open systems concurrency
C research-home-page
U http://iamwww.unibe.ch/~scg/
T Software Composition Group
D The Software Composition Group conducts research into the use of object technology and related approaches for the development of flexible, open software systems.
U http://iamwww.unibe.ch/~oscar
W Oscar Nierstrasz
I University of Berne, CH
which generates:
#! /usr/local/bin/perl
#
# ooinfo --- Object-Oriented Information Sources
#
# Author: Oscar Nierstrasz 19/8/94
# Re-written 26.6.95 to use tag-file format
#
# This script is a front-end to htgrep.pl
#
# This script and friends can be found at:
# http://iamwww.unibe.ch/~scg/Src/
unshift(@INC,"/home/scg/local/perl/lib");
require("htgrep.pl");
# What is the name of this script?
$htgrep'tags{'htgrep'} = $ENV{'SCRIPT_NAME'};
# Pick up parameters and query string:
&htgrep'settags($ENV{'PATH_INFO'});
&htgrep'settags($ENV{'QUERY_STRING'});
# Set the file to search and other options:
$htgrep'tags{'file'} = "~scg/OOinfo/OOINFO";
$htgrep'tags{'filter'} = "ooinfo";
# Do the search!
&htgrep'doit;
# Filter for ooinfo records:
sub htgrep'ooinfo {
    &accent'html;
    # Delete keywords:
    s/^K.*/<hr>/;
    s/\nC.*//g;
    # Format URLs:
    s/\nU (.*)\n(\w) (.*)/\n$2 <a href=$1><b>$3<\/b><\/a>/g;
    # Format titles:
    s/\nT (.*)/\n<b>$1<\/b>/;
    # Format descriptions:
    s/\nD (.*)/\n<br><i>$1<\/i>/;
    s/\n\s+(.*)/\n<i>$1<\/i>/g; # patch continuation lines
    s/<\/i>\n<i>/\n/g;
    # Contact persons:
    # group together multiple entries
    while(s/\nW (.*)\nW (.*)/\nW $1, $2/) { ; }
    s/\nW (.*)/\n<br><b>Who:<\/b> $1/g;
    # Institution:
    s/\nI (.*)/\n<br><b>Where:<\/b> $1/g;
    # See also:
    while(s/\nS (.*)\nS (.*)/\nS $1, $2/) { ; }
    s/\nS (.*)/\n<br><b>See also:<\/b> $1/g;
    # Delete comments:
    s/\n#.*//g;
    s/^#.*//;
}
__END__
Cryptic perhaps, but that's Perl for you!
A more advanced application is Oscar's Who's Who, an interface to an address database. Each field of an address record starts with a single-letter tag indicating the kind of data stored. The wrapper script formats records in different ways depending on what information is desired. Furthermore, only privileged clients are allowed to access the complete address information. The filters and the authorization routines are part of a separate package.
The same package is used as an interface to a restaurant database.
Back to Contents
Back to Contents
Please provide URLs that illustrate the bug. If you have a bug fix, please make sure it's relative to the latest version of htgrep and its related packages!
Back to Contents
Again, please provide URLs that illustrate your modifications, and make sure that they are relative to the latest version of htgrep and its related packages!
Back to Contents
Back to Contents
The first application of parscan was CUI's W3Catalog, one of the first attempts to develop a searchable catalog of WWW resources. W3Catalog was built by nightly mirroring and reformatting selected documents from the CERN Virtual Library and other lists of WWW Resources so that they could easily be searched.
When the htbin interface was introduced in the Fall of 1993 to the NCSA httpd server, it was immediately clear that parscan should be re-written so that it could be used as a script with any compliant server. Shortly thereafter the Common Gateway Interface was introduced and rapidly adopted as a standard way to share scripts across different servers. The conversion of htgrep to CGI was finally made in May 1994.
With the explosion of the Web, interest in CGI tools has grown, and it appears that htgrep has been picked up by a large number of sites as a quick way to get a search engine up and running. This FAQ list is a response to the large number of e-mail messages that have been received, and version 2.0 attempts to address some of the more serious shortcomings of the package. (Thanks to Paul Sutton and Brian Exelbierd for the contributed routines and modifications.) Version 2.0 has been completely re-organized and largely re-written to be more maintainable and easier to read. With luck, I should get fewer requests for help!
Back to Contents
For modest applications, htgrep can be a reasonable choice for getting a search engine up and running quickly. For popular or heavyweight applications, the search engine should be encapsulated in a dedicated server (or integrated with the http server itself).
Back to Contents
oscar@iam.unibe.ch