Index Your World With Wherewas

I don't know if you've ever had anyone tell you that if you kept using computers, you'd stop using your memory. Well, I have, and after a good twenty years playing around with all shapes and sizes of computing device, I've realised that they're right.

My computer remembers names, dates, addresses, phone numbers and everything else for me, so I don't have to. Perhaps it's ultimately self-defeating, but my reaction to realising my memory is getting worse is to write even more software to remember different stuff for me.

There are, of course, a number of tools already available for helping to plug the memory gap - for instance, there's the Emacs Remembrance Agent, which squirrels away information and suggests related topics when you type into your Emacs buffer. Nat Friedman has written a very interesting piece of software called Dashboard, which receives "cluepackets" from co-operating applications such as Evolution or GAIM, and associates them. When I get a mail from a friend, Dashboard presents me with a picture of them, their address and phone number, and the last few instant messages from them.

But both Dashboard and the Remembrance Agent use what's called passive querying - while you're working away on other things, they suggest information that might be useful to you. I'm more interested in the other side of this, active querying. Generally, when I need to remember something, there's a specific query I have and want to ask of an application - what's John's address? What time and where am I meeting Sally on Friday?

A while back I wrote a series of tools for asking this sort of question about email, but recently I've found myself wondering where I read something. Was it on some web page? Did it come through one of the RSS feeds that I read? Or was it just a file on my local system? To answer these sorts of question, I wrote a personal desktop indexing and search tool called wherewas, and I'd like to share with you how I did it.

The Plucene Search Engine

While we were working on data mining email, my company and I completed a Perl port of the Lucene search engine, called Plucene. Plucene is a toolkit which makes it really easy to build up indexes of documents, and then build applications which search these indexes. We're going to be indexing various files, web pages, emails and so on, with Plucene, and then later creating a search tool.

Plucene has one major advantage over most search libraries, which we'll exploit for wherewas: you can construct a document out of multiple fields, and use those fields in your search term. For instance, if you have a multilingual web site, you can specify the language of each document, and then restrict your search to pages in a particular language. We'll use this to index as much metadata about resources as we can, to give the user more "handles" on their search - if a user can say that they're looking for a document written by so-and-so in English which was originally a PDF file, they stand a better chance of narrowing down the search results than if they just know a few words in the document. So to Plucene, a web page might look like this:

    id: http://blog.simon-cozens.org/
    indexed: 2004-09-22
    modified: 2004-09-20
    type: html
    language: en
    charset: utf-8
    text: I'm sure I recently...

And we could search for this particular page by saying, for instance, recently +type:html.

Our job is to get any kind of file into this format of fields and values.

Let's look at the simplest case - indexing a plain text file on disk. We start by creating a Plucene::Index::Writer object to do the indexing for us, and a Plucene::Document object to represent what we're indexing:

    use Plucene::Index::Writer;
    use Plucene::Analysis::SimpleAnalyzer;
    use Plucene::Document;

    my $index = Plucene::Index::Writer->new(
        "/var/plucene/wherewas",
        Plucene::Analysis::SimpleAnalyzer->new(),
        1);

    my $doc = Plucene::Document->new();

Plucene::Index::Writer's constructor takes three arguments - first, the directory for storing the index; second, an analyser object which will split up documents into tokens to be indexed; third, a flag to say whether or not we're creating a new index. The simple analyser, Plucene::Analysis::SimpleAnalyzer, is good enough for most search engine purposes; other analysers might be more appropriate if, for instance, you're handling a lot of East Asian text which doesn't have obvious word breaks.

Now we need to read in the file and flesh out the Document object. We'll use the File::Slurp module to get the read_file routine, which reads a whole file into a scalar variable. We'll also use Plucene::Document::Field to create the fields of our document. There are several different types of field - we're going to use a Keyword field for a URL, which will enable us to locate the file again given our search result, and a Text field to store the contents of the file.

    use File::Slurp;
    my $file = read_file("book.txt");
    $doc->add(Plucene::Document::Field->Keyword(id =>
        "file:///home/simon/book.txt"));
    $doc->add(Plucene::Document::Field->Text(data => $file));

Finally, we index the document by passing it to the indexer:

    $index->add($doc);

This is how Plucene natively indexes files. We're going to abstract out the process of indexing documents so that wherewas can do the same thing for web pages, plain text files, PDF files, emails and so on. The document indexing part of wherewas became Plucene::SearchEngine::Index. We'll store the document as a plain Perl hash, and convert it to a Plucene::Document object later.

More metadata, and different front-ends

We mentioned getting additional metadata to help narrow down search results. From ordinary files, the natural source for this metadata is the filesystem and the information we get from stat. For web pages, we can use the headers from the web server.

Similarly, for plain text files we can merely extract the text, but for other types of file, we can extract some of the structure of the file as well. This means we need a design for wherewas that's going to allow us to switch in new file types and data sources. Here's the design that I came up with:

    # wherewas.png

Here we have a resource, somewhere or other, that we get to using either the URL frontend or the File frontend. These frontends gather metadata from the HTTP response headers or a filesystem stat respectively; then the resource is passed on to a backend for more specific analysis based on the type of the data. The two frontends provide an examine class method that can be used on the appropriate type of resource:

    my @docs = Plucene::SearchEngine::Index::File->examine("book.pdf");
    my @articles = Plucene::SearchEngine::Index::URL->examine
        ("http://blog.simon-cozens.org/?format=xml");

Note that we extract multiple documents, not just one - for instance, in the case of the blog site, we'll return a document object for each individual entry on the page, so that searches can return more finely-honed results. These objects that we return are not Plucene::Document objects yet - they're an intermediate form that makes it easier to store data: a data structure blessed into the class of the appropriate backend, such as Plucene::SearchEngine::Index::PDF. We'll convert them to Plucene::Document objects later.

We'll deal with the File handler first. This will inherit a bunch of helper methods from the Base class common to both frontends and backends. It needs to look at a filename, get an appropriate backend handler for it, create an initial document, and then start filling it with metadata. The backend handler can then examine the file and, if necessary, clone the document, returning multiple sub-documents.

How, though, do we know what sort of a file we've got? We could just look at the filename and see if we can tell from the extension, but that's not always ideal: some files, such as Unix mailboxes, don't necessarily have an extension, and in other cases the extension can be misleading. Instead, we'll use a module called File::MMagic to sniff the first few bytes of the file, and determine its MIME type. If that doesn't provide a useful handler, we can fall back on the extension.
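To show the shape of that "sniff first, fall back on the extension" logic, here's a minimal hand-rolled sketch. It is not File::MMagic and doesn't use it: the guess_type function, the three signatures and the little extension table are all made up for illustration, and a real magic database knows about far more formats.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A handful of illustrative signatures; File::MMagic knows far more.
my @magic = (
    [ qr/\A%PDF-/,        "application/pdf" ],
    [ qr/\AGIF8[79]a/,    "image/gif"       ],
    [ qr/\A\xFF\xD8\xFF/, "image/jpeg"      ],
);

# Hypothetical fallback table, keyed on lower-cased extension.
my %by_extension = (
    txt  => "text/plain",
    html => "text/html",
    pdf  => "application/pdf",
);

sub guess_type {
    my ($filename) = @_;

    # First choice: sniff the leading bytes of the file.
    if (open my $fh, "<:raw", $filename) {
        my $head = "";
        read $fh, $head, 16;
        close $fh;
        for my $rule (@magic) {
            my ($pattern, $type) = @$rule;
            return $type if $head =~ $pattern;
        }
    }

    # Fallback: trust the extension, if we recognise it.
    if ($filename =~ /\.(\w+)\z/ and exists $by_extension{ lc $1 }) {
        return $by_extension{ lc $1 };
    }
    return "application/octet-stream";
}

# A PDF with no extension is still recognised by its contents...
my ($fh, $path) = tempfile();
print {$fh} "%PDF-1.4\n";
close $fh;
print guess_type($path), "\n";
unlink $path;

# ...while a file we can't read falls back on its name.
print guess_type("no-such-file.html"), "\n";
```

The order matters: contents are consulted first because they're harder to get wrong than a name, and the extension is only a last resort before giving up with a generic type.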

    package Plucene::SearchEngine::Index::File;
    use base 'Plucene::SearchEngine::Index::Base';
    use File::MMagic;
    my $magic = File::MMagic->new;

    sub examine {
        my ($class, $filename, $encoding) = @_;
        return unless -r $filename;
        my $mime = $magic->checktype_filename($filename);

Now we will use a helper method, handler_for, which takes the best guess at the relevant handler for the given MIME type and filename extension, returning the name of a backend class:

        my $document = $class->handler_for($filename, $mime)->new();

And now we can add the data from the filename. Instead of building a hash directly, we'll use another helper method, add_data, which allows us to represent the Plucene document in a more Perl-friendly way.

        use File::Basename;
        use File::Spec::Functions qw(rel2abs);
        use File::stat;
        use Time::Piece;

        $document->add_data("filename", "Text", basename($filename));
        $document->add_data("mimetype", "Text", $mime);
        $document->add_data("id", "Keyword", "file://".rel2abs($filename));
        $document->add_data("modified", "Date",
            Time::Piece->new(stat($filename)->mtime));
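        # File::stat objects don't work with the special "_" filehandle,
        # so we stat the file by name again
        $document->add_data("owner", "Text",
            scalar getpwuid(stat($filename)->uid));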
        
        if ($encoding) { $document->add_data("encoding", "Text", $encoding) }

As you can see, add_data takes three parameters: the name of the field; the type of the field, as described in the section on Plucene above; and the data to be indexed. By using add_data and an intermediate representation of the Plucene document, we can add multiple bits of data to the same field, instead of having to specify what goes in the field once and for all. This will become useful when, for instance, we're parsing HTML documents and want to add data to the relevant field (header, body, and so on) as we parse each chunk.
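To make the idea of an intermediate representation concrete, here's a minimal sketch of how such a document could sit on top of a plain hash. The My::SketchDocument package, its field_text method and the internal layout are assumptions for illustration only; this is not the code in Plucene::SearchEngine::Index::Base.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the intermediate document class.
package My::SketchDocument;

sub new { return bless { fields => {} }, shift }

# Record one more value for a field, remembering the field's type.
sub add_data {
    my ($self, $name, $type, $value) = @_;
    return $self unless defined $value;      # nothing to index
    my $field = $self->{fields}{$name} ||= { type => $type, values => [] };
    push @{ $field->{values} }, $value;
    return $self;
}

# All the values collected so far for a field, joined into one string.
sub field_text {
    my ($self, $name) = @_;
    my $field = $self->{fields}{$name} or return;
    return join " ", @{ $field->{values} };
}

package main;

my $doc = My::SketchDocument->new;
$doc->add_data("text", "UnStored", "First chunk of the body.");
$doc->add_data("text", "UnStored", "Second chunk, parsed later.");
print $doc->field_text("text"), "\n";
```

Because each field holds a list of values, a parser can keep calling add_data as it goes, and the pieces only need to be joined up when the real Plucene::Document is built.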

Now that we've added all the metadata we can sensibly extract from the filesystem, we can pass the file to the backend, so that it can examine the file itself and index its contents:

        return $document->gather_data_from_file($filename);

This may return one or more objects, which we'll put in our Plucene index.

The Indexer and Query Tool

Once we've created a Plucene::SearchEngine::Index::URL which does the same sort of thing but using the metadata extracted from the HTTP response headers, the next stage is to turn the returned documents into Plucene::Document objects and place them into the Plucene index. The Plucene::SearchEngine::Index module will abstract this process away, making our command-line indexer look like the following:

    use constant WW => "/var/lib/plucene/wherewas-index";
    use Plucene::SearchEngine::Index;
    my $indexer = Plucene::SearchEngine::Index->new( dir => WW );

    my @docs;
    my $what = shift;
    if ($what =~ m{:/}) {
        @docs = Plucene::SearchEngine::Index::URL->examine($what)
    } else {
        @docs = Plucene::SearchEngine::Index::File->examine($what)
    }

    for my $doc (@docs) {
        $indexer->index($doc->document);
    }

If something looks like a URL, we use the URL frontend; otherwise we use the File frontend. The document method, provided by Plucene::SearchEngine::Index::Base, turns our intermediate representation into a real Plucene::Document object, which is then simply indexed. Plucene::SearchEngine::Index is responsible for providing a nice wrapper around Plucene::Index::Writer and sensible defaults for the analyser.

Plucene::SearchEngine::Index::Base also adds a few fields of its own to the record to be indexed: it adds the indexed field as a date when the indexing was done, and the type field, which is a user-friendly version of the name of the backend. So, for instance, we can retrieve things which went through Plucene::SearchEngine::Index::PDF by specifying type:pdf.

Now we can pass filenames and URLs to this, and their contents will be squirreled away for later retrieval:

    % ww-index http://news.bbc.co.uk/
    % ww-index book.pdf
    % ...

Of course, the best index in the world is no good if we can't retrieve data from it. The retrieval client is called ww, and uses the Plucene::SearchEngine::Query module to do its heavy lifting:

    #!/usr/bin/perl
    use constant WW => "/var/lib/plucene/wherewas-index"; # the index ww-index writes to

    use Plucene::SearchEngine::Query;
    my $searcher = Plucene::SearchEngine::Query->new( dir => WW );
    my $what = shift;
    my @docs = $searcher->search($what);
    for (sort {$b->[1] <=> $a->[1]} @docs) {
        my ($doc, $score) = @$_;
        my $id = $doc->get("id")->string;
        $id =~ s{^file://}{};
        # Pass the ID as an argument: URLs may contain "%" characters
        printf "Found at %s, score %i%%\n", $id, $score*100;
    }

@docs here is an array of arrays: the first element in each set is a Plucene::Document object, and the second is the score assigned by Plucene. Once we have sorted the result set according to score, we retrieve the ID from the document as a string. This leads to results like the following:

    % ww "pie"
    Found at http://www.justatheory.com/computers/conferences/oscon2004/notes.html 
     in http://planet.perl.org/rss10.xml, score 37%
    Found at /Users/simon/p6s.html, score 30%

    % ww "searching"
    Found at http://blog.simon-cozens.org/6786.html 
     in http://blog.simon-cozens.org/blosxom.cgi/xml, score 59%
    Found at /Users/simon/0406tpj.pdf, score 7%

Some Sample Backends

So we've seen much of how the frontend modules and the end-user code work; how about the analysis modules which deal with the actual text?

Before we look at these, we need to peek at some of the details we've skipped over in our explanation so far. How exactly does handler_for in our base class decide which backend to use?

The trick is that each backend will register the extensions and MIME types it expects to handle in some hashes, and handler_for will look up the extension and MIME type in these hashes. To avoid dealing with global variables directly, there's a register_handler method in Plucene::SearchEngine::Index::Base which hides the hashes as lexicals:

    {
        my %mime_handlers;
        my %extension_handlers;

        sub register_handler {
            my ($class, @specs) = @_;
            for my $spec (@specs) {
                if ($spec =~ m{/}) { # That is, looks like a MIME type
                    $mime_handlers{$spec} = $class;
                } else {
                    $extension_handlers{$spec} = $class;
                }
            }               
        }

        sub handler_for {
            ...
        }
    }

This means that our backend modules can register both MIME types and filename extensions that they support:

    package Plucene::SearchEngine::Index::PDF;
    use base 'Plucene::SearchEngine::Index::Base';

    __PACKAGE__->register_handler(".pdf", "application/pdf");

    package Plucene::SearchEngine::Index::HTML;
    use base 'Plucene::SearchEngine::Index::Base';

    __PACKAGE__->register_handler(".htm", ".html", "text/html");

and the appropriate information will be stored in the lexical variables inside Plucene::SearchEngine::Index::Base.

For the frontends to query those hashes, we have the handler_for method as follows:

    sub handler_for {
        my ($self, $filename, $mime) = @_;
        if (exists $mime_handlers{$mime}) { return $mime_handlers{$mime} }
        for my $spec (keys %extension_handlers) {
            # \Q...\E stops the "." in ".pdf" from matching any character
            if ($filename =~ /\Q$spec\E$/) { return $extension_handlers{$spec} }
        }
        return DEFAULT_HANDLER;
    }

This first tries a direct lookup on the MIME table, then uses each of the filename extensions registered as a regular expression to match against the filename given. If neither of these produces any results, then the default handler (the text file handler) is returned.
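If you'd like to play with the registration trick outside of wherewas, here's a self-contained sketch of the same idea. The My::Registry and My::Handler::* package names are invented for the example, and the sketch quotes each extension with \Q...\E so that the dot is matched literally.

```perl
use strict;
use warnings;

package My::Registry;   # hypothetical name, standing in for the Base class

use constant DEFAULT_HANDLER => "My::Handler::Text";

{
    # The hashes are lexicals: only the two subs below can see them.
    my (%mime_handlers, %extension_handlers);

    sub register_handler {
        my ($class, @specs) = @_;
        for my $spec (@specs) {
            if ($spec =~ m{/}) { $mime_handlers{$spec}      = $class }
            else               { $extension_handlers{$spec} = $class }
        }
    }

    sub handler_for {
        my ($self, $filename, $mime) = @_;
        return $mime_handlers{$mime}
            if defined $mime and exists $mime_handlers{$mime};
        for my $spec (keys %extension_handlers) {
            return $extension_handlers{$spec} if $filename =~ /\Q$spec\E$/;
        }
        return DEFAULT_HANDLER;
    }
}

package My::Handler::PDF;
our @ISA = ('My::Registry');
__PACKAGE__->register_handler(".pdf", "application/pdf");

package main;

# Matched by extension, by MIME type, and by neither:
print My::Registry->handler_for("book.pdf", "application/octet-stream"), "\n";
print My::Registry->handler_for("notes",    "application/pdf"), "\n";
print My::Registry->handler_for("README",   "text/plain"), "\n";
```

The backend only has to inherit from the registry class and call register_handler once at load time; nothing else in the program needs to know that the backend exists.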

Another question you may have is where all these backend modules come from, and how they get loaded up so that they can register themselves. Well, there are two ways to do this. Here's the one I don't recommend:

    package Plucene::SearchEngine::Index;
    use Plucene::SearchEngine::Index::HTML;
    use Plucene::SearchEngine::Index::PDF;
    use Plucene::SearchEngine::Index::Text;
    use Plucene::SearchEngine::Index::Email;
    use Plucene::SearchEngine::Index::Image;
    # ...

First, it's boring, and second, it means that we have to manually load up new modules when we install them. Instead, we'd rather have Perl look for all the Plucene::SearchEngine::Index::* modules that it can find, and load them up automatically. Thankfully, that's what the handy CPAN module Module::Pluggable was designed to do: to let extensions load up plug-in modules without having to know or care which ones are installed. Our module loader code looks like this:

    use Module::Pluggable (require => 1, 
        search_path => [ "Plucene::SearchEngine::Index" ]);

    __PACKAGE__->plugins;

Module::Pluggable provides the plugins method, which we've configured to look in the given namespace and then require all the modules that it finds there. Now when third-party modules are installed, they too can be loaded up and will therefore automatically register themselves on require with the hashes in Plucene::SearchEngine::Index::Base.
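To get a feel for what Module::Pluggable is saving us from, here's a much-simplified sketch of plug-in discovery done by hand. It is only an approximation, not how Module::Pluggable is implemented: the find_plugins function is made up for this example, it only looks one level below the namespace, and it uses File::Spec as a stand-in namespace because that's installed with every Perl.

```perl
use strict;
use warnings;
use File::Spec;

# Return the names of all modules installed directly under a namespace.
sub find_plugins {
    my ($namespace) = @_;
    my @parts = split /::/, $namespace;
    my %seen;
    for my $inc (grep { !ref $_ } @INC) {        # skip @INC hooks
        my $dir = File::Spec->catdir($inc, @parts);
        next unless -d $dir;
        opendir my $dh, $dir or next;
        for my $file (readdir $dh) {
            next unless $file =~ /\A(\w+)\.pm\z/;
            $seen{"${namespace}::$1"} = 1;
        }
        closedir $dh;
    }
    return sort keys %seen;
}

# Try to load each plug-in; one that fails to compile is simply skipped.
for my $plugin (find_plugins("File::Spec")) {
    (my $file = "$plugin.pm") =~ s{::}{/}g;
    my $ok = eval { require $file; 1 };
    print $ok ? "loaded $plugin\n" : "skipped $plugin\n";
}
```

In wherewas the namespace would be Plucene::SearchEngine::Index, and the act of requiring each module is what triggers its registration call.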

Now let's look at how the backends work; we'll take the Image backend as our first example. This is a third-party module that's available from the CPAN separately from the rest of Plucene::SearchEngine; it's a little unusual since you don't often expect to search image files for text, but this module uses Image::Info to extract the size and any comments from an image file. If you have a photo library which has good JFIF titles and tags, you can use wherewas without modification to index and search your image collection as well as your reading matter.

As we saw, the entry point to the backend is gather_data_from_file. For Plucene::SearchEngine::Index::Image, it looks like this:

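    use Image::Info qw(image_info dim);
    use Date::Parse qw(str2time);   # assuming str2time comes from Date::Parse
    use Time::Piece;

    sub gather_data_from_file {
        my ($self, $filename) = @_;
        my $info = image_info($filename);
        return if $info->{error};
        $self->add_data("size", "Text", scalar dim($info));
        $self->add_data("text", "UnStored", $info->{Comment});
        $self->add_data("subtype", "Text", $info->{file_ext});
        $self->add_data("created", "Date",
                Time::Piece->new(str2time($info->{LastModificationTime})));
        return $self;   # hand the filled-in document back to the frontend
    }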

image_info, from the Image::Info module, does the heavy lifting. All our module needs to do is add the size, text, subtype and created date fields into the object to be indexed.

It also needs to register what types of file it can deal with:

    __PACKAGE__->register_handler(qw(
        image/bmp           .bmp
        image/gif           .gif
        image/jpeg          .jpeg .jpg .jpe
        ...
    ));

Now when Plucene::SearchEngine::Index is use'd, it will call Module::Pluggable::plugins; this will in turn require all the other Plucene::SearchEngine::Index::* modules, including our image handling module, and as Perl requires them, it runs the code in them, so that the register_handler method is called, placing the MIME types and extensions into Plucene::SearchEngine::Index::Base's hashes. Whenever an image file comes along, the indexer knows which backend module to dispatch it to.

If we want to search only our image collection, we can search for

    % ww type:image 

Similarly, we can index HTML files, using the HTML::TreeBuilder Perl module to parse the HTML for us.

    sub gather_data_from_file {
        my ($self, $filename) = @_;
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_file($filename);

What can we add from an HTML file? The first thing we want to extract is all the data in the META tags, since these can contain keywords or a summary description of the page:

        for ($tree->look_down(_tag => "meta")) {
            next if $_->attr("http-equiv");
            # META tags carry their data in the "content" attribute
            next unless $_->attr("name") and $_->attr("content");
            $self->add_data($_->attr("name"), "Text", $_->attr("content"));
        }

Next we could squirrel away all the links, so that we can search documents by what they link to:

        for (@{$tree->extract_links("a")}) {
            $self->add_data("link", "Text", $_->[0]);
        }

Finally, we store the rest of the document as ordinary text.

        $self->add_data("text", "UnStored", $tree->as_trimmed_text);
        return $self;
    }

Once we have the HTML backend in place, the PDF backend is trivial. We use the pdftotext utility that comes with xpdf, which has an option to put all the metadata it finds in the PDF document into HTML META tags. This is ideal for feeding to the HTML backend:

    use File::Temp;
    sub gather_data_from_file {
        my ($self, $filename) = @_;
        my $html = tmpnam();
        system("pdftotext", "-htmlmeta", $filename, $html);
        return unless -e $html;
        $self->Plucene::SearchEngine::Index::HTML::gather_data_from_file($html);
        unlink $html;
        return $self;
    }

Finally, as one might be able to guess, the plain text backend simply sucks in the text and indexes it; we don't try to extract any metadata, since there usually isn't any.

At this point we have a personal desktop indexer which can index any HTML, text, PDF or image file, locally or remotely from the web, and we have a way to search for things that we remember from these files. This isn't bad, but it isn't quite clever enough.

Creating a web proxy

The original point of this indexer was to allow me to remember where I'd seen particular documents or bits of information, and most of the things I read on my computer are from the web. So ideally, I want wherewas to take note of what my browser is seeing. In order to make this work most easily, rather than messing about with browser histories, I decided to write a personal indexing proxy: as I view a web site, it first passes through to wherewas to index its data, then gets passed along to the browser. Again, this is something that's very easy to achieve in Perl.

The key is the HTTP::Proxy module, which allows us to quickly create web proxies. It does this by providing a set of filters: you choose the filters that you want to use for your proxy, or create your own, and push them onto a filter stack. As you make web requests through the proxy, the request goes out and is filtered through the request stack, and then the response comes back and is filtered through the response stack before being handed to the client. There's one filter, HTTP::Proxy::BodyFilter::save, which saves every file it sees to disk, so we'll use this as the basis of our indexing filter.

    package HTTP::Proxy::BodyFilter::index;
    use base 'HTTP::Proxy::BodyFilter::save';
    use Plucene::SearchEngine::Index;

First we'll add an option to the constructor to show where our index will be:

    sub init {
        my ($self, %options) = @_;
        if (exists $options{index}) {
            my $directory = delete $options{index};
            $self->{_indexer} = 
                Plucene::SearchEngine::Index->new( dir => $directory );
        }
        $self->SUPER::init(%options);
    }

Next we want to keep hold of the HTTP::Response object that gets passed to us at the beginning of the response, so that we can extract the headers from it:

    sub start {
        my ($self, $message) = @_;
        if ($message->isa("HTTP::Response")) {
            $self->{_response} = $message;
        }
        $self->SUPER::start($message);
    }

And at the end, we want to examine the headers and the file that we've produced, and index them, using the same sort of code as in Plucene::SearchEngine::Index::URL. (Indeed, in the future, the URL frontend may be rearranged so that applications like this can use its code through methods.)

    sub end {
        my $self = shift;
        $self->SUPER::end(); # Close the file.
        my $filename = $self->{_hpbf_save_filename};
        return unless $filename;
        my $response = $self->{_response};
        my $mime = $response->header("Content-Type");
        $mime =~ s/;.*//s;    # drop parameters such as "; charset=utf-8"
        my $o = Plucene::SearchEngine::Index::Base->handler_for($filename, $mime)
                    ->new();
        # Proceed gathering data as before, and then...
        my @docs = $o->gather_data_from_file($filename);
        # And then index them:
        $self->{_indexer}->index($_->document) for @docs;
        # Clean up the temporary file
        unlink $filename;
    }

And that's basically it for the filter; we can now write the actual proxy:

    use HTTP::Proxy;
    use HTTP::Proxy::BodyFilter::index;
    my $proxy = HTTP::Proxy->new(port => 3128);
    $proxy->push_filter( 
        response => HTTP::Proxy::BodyFilter::index->new(
            index => "/var/plucene/wherewas"
        )
    );
    $proxy->run;

With this running, and my computer configured to use a proxy on port 3128, everything the browser sees - and everything other web-accessing tools like my RSS aggregators download - will be quietly but surely indexed so that I can find them later. My goal of being able to recall everything I've read is one step closer to being achieved.

Future directions

wherewas is currently what I'd call "prototypeware" - it demonstrates an interesting idea, and it was interesting to code, but it's not quite ready for prime time.

There are several things that need to be done to make it more usable. The first is a revamp of the search interface. Since most of the time the results will come as web sites, it would be interesting to write a web-based query interface with direct links to search results. In this way, you could imagine wherewas becoming essentially a personalised web search engine.

Another area where wherewas needs improvement is in developing more indexers for different types of document. Some of them, such as a Microsoft Word document indexer, would be similar to the PDF indexer: a thin wrapper around existing conversion tools. Others will be more involved.

Because wherewas can index multiple documents from a single file, you can do some really interesting things: for instance, I can currently use wherewas to find data which was inside a PDF attachment to an email, itself inside a mailbox file. There's no reason why we can't index tarballs and zip archive files, and recursively index their contents.

Finally, Plucene itself needs to become somewhat faster for wherewas to take off; for a long time I had problems using the web proxy because I was downloading web pages faster than Plucene could index them, and the indexing process was really slowing down the browsing. Adding forking to the indexing part of the proxy helped the user experience, but flooded the machine with things to index. This is an ongoing need for Plucene's development, so I imagine that will be addressed in time.

But for the meantime, I hope I've demonstrated an interesting idea - personalised desktop and web indexing - and some interesting technologies - the Plucene search engine, the HTTP::Proxy proxy construction kit, and a few others. And now that I don't have to remember what I've been reading... what was it that I'm supposed to do next today, again?

This page was last checked for correctness on 2004-12-13. Contact Simon.