Filtering documents with SWISH::Filter
--------------------------------------

Swish-e knows only how to parse HTML, XML, and text files. 
Other file types may be indexed with the help of filters.

SWISH::Filter is a Perl module designed to make converting
documents from one type of content to another type of content
easy.  It's uses a plug-in type of system where new filters
can be added with little effort.

SWISH::Filter (and associated plug-in filter modules) do not
normally do the actual filtering.  This system provides only
an interface to the programs that do the filtering.

For example, the Swish-e distribution includes a filter plug-in
called SWISH::Filters::Pdf2HTML.  For this filter to work you must
install the xpdf package that includes the pdftotext and pdfinfo
programs.  SWISH::Filters::Pdf2HTML only provides a unified interface
to this programs.

The included program F<spider.pl> will use SWISH::Filter by default.
This means that installing the programs that do the filter is all that
is needed to start filtering documents.  For example, installing the
xpdf package will enable indexing of PDF file when spidering.

The filter modules are in the $libexecdir/perl directory.  Running swish-e
-h will list the setting for $libexecdir, but is typically
/usr/local/lib/swish-e if swish-e was built from source, or /usr/lib/swish-e
if installed as a package.  On Window $libexecdir will be set at 
installation time.

Note that $libexecdir/perl is not normally part of Perl's @INC array. So to
read documenation on a specific filter you will need to either specify the
full path to the filter or set PERL5LIB.  For example:

    export PERL5LIB=/usr/local/lib/swish-e/perl
    perldoc SWISH::Filter

Documentation for SWISH::Filter can also be found in the html directory and 
at http://swish-e.org.

Swish-e has another filter system. The FileFilter directive that can be used
to filter documents through an external program while indexing. That system
requires a separate filter setup for each type of document. See the
SWISH-CONFIG page for information on that type of filtering.


Testing SWISH::Filter
---------------------

The program swish-filter-test in installed by default (in the same location as
the swish-e binary).  This program can be used to test SWISH::Filter.  For example,
run the command:

    $ swish-filter-test foo.pdf foo.txt

    Document foo.pdf was  filtered.
       Document:     foo.pdf
       Content-Type: text/html  (initial was application/pdf)
       Parser type:  HTML*

    Document foo.txt was not filtered.
       Document:     foo.txt
       Content-Type: text/plain  (initial was text/plain)
       Parser type:  TXT*

Run the command

   $ swish-filter-test -man

for documentation.


Current filters distributed with Swish-e:
-----------------------------------------

All of these filters require installation of helper programs and/or Perl modules.
See the individual module's documentation for dependencies.

    SWISH::Filters::Doc2txt     - converts MS Word documents to text
    SWISH::Filters::Pdf2HTML    - converts PDF files to HTML with info tags as metanames
    SWISH::Filters::ID3toHTML   - extracts out ID3 (v1 and v2) tags from MP3 files
    SWISH::Filters::XLtoHTML    - converts MS Excel to HTML

Filters that depend on Perl modules that are not installed will not load.
Setting the environment variable FILTER_DEBUG may report helpful errors when using
filters.

See perldoc SWISH::Filter for instructions on creating filters.

