Xtract: a `grep'-like tool for XML documents


Intro

Xtract is a command-line tool for searching XML and HTML documents. Just as `grep' returns the lines that match your regular expression, so Xtract returns all the sub-trees of XML or HTML documents that match a query pattern. The query expression language is simple but powerful, and is based loosely on XQL, the recently proposed XML Query Language. An introduction to the Xtract query pattern language, together with the full Xtract grammar, is given in the tutorial.
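To make the idea of "returning matching sub-trees" concrete, here is a minimal sketch in Python. It does not use Xtract or its pattern language: it substitutes the limited XPath subset of Python's standard xml.etree.ElementTree module, and the sample document and the function name matching_subtrees are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the Xtract distribution.
SAMPLE = """<library>
  <book><title>First</title><author>A. Writer</author></book>
  <book><title>Second</title></book>
</library>"""


def matching_subtrees(xml_text, path):
    """Return every sub-tree matching `path`, serialised back to XML text."""
    root = ET.fromstring(xml_text)
    # findall() takes ElementTree's XPath subset, NOT an Xtract pattern.
    return [
        ET.tostring(elem, encoding="unicode").strip()
        for elem in root.findall(path)
    ]


if __name__ == "__main__":
    # Print each matching sub-tree on its own line, grep-style.
    for subtree in matching_subtrees(SAMPLE, ".//title"):
        print(subtree)
```

As with grep, the result is a flat sequence of matches rather than a new document: each match is a complete element together with everything nested inside it.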

Usage

Tool usage is as for grep:

      Xtract pattern [file [file ...]] 
The pattern expression must be present, and you almost certainly need to quote it to prevent your shell from interpreting it. If no files are given, standard input is assumed. You can mention standard input explicitly amongst the file arguments by using a single dash (-). Filenames ending in .html or .htm are parsed as HTML; all other files are assumed to be XML, whatever their names. Extracted document sub-trees always go to standard output. Currently, there are no annotations saying which parts of the output came from which input files.
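The argument rules above can be sketched as a small Python function. This is an illustration of the rules as stated, not Xtract's actual source: the names input_kind and plan_inputs are hypothetical, the suffix test is assumed to be case-sensitive (the text does not say), and standard input is assumed to fall under the "all others are XML" rule.

```python
def input_kind(filename):
    """Choose a parser for one file argument: "html" or "xml"."""
    # Assumption: the suffix match is case-sensitive.
    if filename.endswith((".html", ".htm")):
        return "html"
    # Everything else, including "-" for standard input, is treated as XML.
    return "xml"


def plan_inputs(argv):
    """Split the arguments (program name excluded) into pattern and inputs.

    Returns (pattern, [(filename, parser), ...]), where "-" stands for
    standard input.
    """
    if not argv:
        # The pattern expression is mandatory.
        raise SystemExit("usage: Xtract pattern [file [file ...]]")
    pattern, files = argv[0], argv[1:]
    if not files:
        # No file arguments: read standard input.
        files = ["-"]
    return pattern, [(name, input_kind(name)) for name in files]


if __name__ == "__main__":
    import sys

    pattern, inputs = plan_inputs(sys.argv[1:])
    for name, parser in inputs:
        print(f"{name}: parse as {parser}")
```

Note that the choice of parser depends only on the filename suffix, so an HTML document supplied on standard input or under another name would be read as XML.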

Copyright and Licence

Xtract is "Open Source" (see http://www.opensource.org/ ). Development of Xtract was funded by Canon Research Centre Europe Ltd. Xtract re-uses some components written by other members of the Haskell language community. Hence, various modules are copyright to different people and organisations (see the copyright notices). You are licensed to use, distribute, or modify this software under the terms of the GNU GPL. We give you no warranty, express or implied, about this software: use it at your own risk.

Obtaining Xtract

Sources and binaries for the Xtract tool are freely available. You currently need a Haskell compiler to build Xtract from source. The website http://www.haskell.org/ will point you to several free compilers and interpreters. We provide executable binaries for certain machines: check the download page for details. In a future release, we may provide machine-generated C sources. The installation procedure is described on the download page.

Version notes

Xtract is currently BETA QUALITY software. The current version number is 0.5. Please read our list of caveats for known bugs and unimplemented features. Bug reports, suggestions, and other comments are most welcome. Write to Malcolm.Wallace@cs.york.ac.uk


The official Xtract website is at http://www.cs.york.ac.uk/fp/Xtract/