Programming

Software | Secret Software | Writing

Hash Crash Course

When I teach about hashes, I do what most Perl tutors and tutorials do: I introduce the hash as a "dictionary", a mapping between one thing and another. The classic example, for instance, is to have a set of English words mapped to French words:

    %french = (
        apple  => "pomme",
        pear   => "poire",
        orange => "Leon Brocard"
    );

But the more I look at my code - and more often, the more I look at how to tidy up other people's code - I realise that this is perhaps the least common use of a hash. Much more often, I'm using hashes in particular idioms which have very little in common with this concept of a mapping. Let's look at some of the ways that hashes are actually uesd in Perl code...

Counting

Many of the uses of hashes can be categorised as "answering questions about lists". When you have an array or list of values and you need to ask about its properties, you will often find yourself using a hash. We'll start simply, by counting the number of particular elements in a list. Here's how you'd do this naively:

    my $count = 0;
    for (@list) {
        $count++ if $_ eq "apple";
    }

And maybe you'd smarten this up with the use of the grep function:

    $count = grep $_ eq "apple", @list;

But when you come to need the number of pears in the list, then you have to do the same again:

    $count_apples = grep $_ eq "apple", @list;
    $count_pears  = grep $_ eq "pear",  @list;

Now we have two passes over the list, and the situation isn't going to get any prettier from here. What we want is basically a histogram of the data, and we can get that with a hash:

    my %histogram;
    $histogram{$_}++ for @list;

Each individual item is associated with its count, and we've only been through the list once.

In a recent case, I was looking at a list of tags associated with various photographs. To lay out the data for display, it was useful to know how many different tags there are in the list, and we can get that simply from the number of keys in the histogram:

    $unique = keys %histogram;

And then we can delete the duplicates and end up with a list of the unique tags:

    @unique = keys %histogram;

Fianlly we can show the five most popular tags from the list, like so:

    @popular = 
        (sort { $histogram{$b} <=> $histogram{$b} } @unique)[0..4];

This sorts the list of unique tags based on their popularity in the original list, and pulls out the top five.

Uniqueness

Another idiom to get the unique elements from a list is a variation on the "counting things" idiom; this time we don't care how many of each item there is, just that there's at least one. This allows us to say:

    for (@list) { $unique{$_} = 1 }
    @unique = keys %unique;

which reduces to:

    %unique = map { $_ => 1 } @list;
    @unique = keys %unique;

You can take this a stage further with a slightly non-standard idiom to do the whole thing in one operation:

    @unique = keys %{{ map { $_ => 1 } @list }};

keys needs to be given a hash, so this funny-looking line creates an anonymous hash ({ map ... @list }) and then turns it into a real hash (%{ $hash_ref }) to feed it to keys.

The advantage of this trick is that you can use a variation of it to work with lists which include objects as well as ordinary scalars. Here's an example of the problem:

    my @tags;
    push @tags, Memories::Tag->retrieve_random for 1..10;

Now, this should get 10 random Memories::Tag objects. But some of them might be the same, so we want to see the unique ones:

    %unique = map { $_ => 1 } @tags;
    @unique = keys %unique;

Unfortunately, if we then try to do anything with the tags, it all goes horribly wrong:

    print $unique[0]->name;

    # Can't locate object method "test" via package "Memories::Tag(0x1801380)"

What's happened is that hash keys can only be strings; once we put the object into a hash as the key, then it's not an object any more. Perl turns it into a string. There's no (easy) way to get the strings back into an object. What we need is some kind of data structure which would help us map these broken strings into the original objects. Thankfully, we have one in the form of the hash itself! So the code becomes:

    %unique = map { $_ => $_ } @tags;
    @unique = values %unique;

First we map the object - which is going to be smashed into a string - with another copy of the object - which is going to retain its object-ness. Then instead of getting the strings out of the hash, which are useless to us, we get the objects.

Of course, this doesn't quite work, because even if retrieve_random retrieves the same tag name twice, it will return two different objects with the same data. (Unless our underlying database abstraction layer is using caching, which we'll look at in a moment.)

What we need is to make the list unique based on a property of the tag, such as its name. This time we have:

    %unique = map { $_->name => $_ } @tags;
    @unique = values %unique;

OK. Now we have a list of objects which we've retrieved randomly, but which have unique names. The only problem is that we retrieved ten of them to start with, and then we whittled the list down to get rid of duplicates. What if we actually wanted ten?

The solution is to keep track of ones that we've already seen, so that we don't create any duplicates in the first place. The question "Have I seen this before?" is best answered using another hash-based idiom, like so:

    my @tags;
    my %seen;
    while (@tags < 10) {
        my $candidate = Memories::Tag->retrieve_random;
        next if $seen{$candidate->name}++;
        push @tags, $candidate;
    }

Once again we're collecting unique values by putting them into a hash. Let's suppose that the first candidate tag that comes along is japan. At this point the hash is empty, and so $seen{japan} is zero. Since it's zero, we don't go on to the next, but we still increment it. Now $seen{japan} stands at 1, and the tag goes into our list. If japan comes past again, $seen{japan} will be positive so we will call next, and it will not be placed into the list again. (Don't try this if you have less than ten distinct tags, of course!)

Caching

A special case of "have I seen this before?" comes when you want to create a cache: If I have seen this before, what was the answer I got last time? Here's how you do that.

We mentioned that the underlying database abstraction layer in our tags example might cache look-ups and return the same object if you requested the same tag multiple times:

    my %cache;

    sub retrieve {
        my ($self, $id) = @_;
        return $cache{$id} if exists $cache{$id};
        return $cache{$id} = $self->_hard_retrieve($id);
    }

This checks to see if we've seen the $id before; if we have, return the value from the hash. If not, work out what the value should be, then store it in the hash for next time, then return the whole lot.

Of course, for heavy-duty applications like a database abstraction layer, you need to do a little more, such as pruning the cache to make sure it doesn't get out of date or get so full of data that it eats all your memory. The Cache::Cache suite of modules from CPAN takes care of all of this for you, and the core Memoize module adds this kind of caching to a function without you specifically having to write the cache-getting and -setting code.

Searching

Finally, let's look at using hashes for searching.

If you've done any computer science, you'll probably know about a couple of common searching algorithms. (And even if you haven't, they're so common that you've probably reinvented them without knowing it.) One is linear search, where you start at the beginning of a list, and work your way towards the end, looking for the target item:

    my $index;
    for $index (0..@chambers) {
        last if $chambers[$index] == $bullet;
    }
    print "Found at index $index" if $index < @chambers;

Another is binary search, where you start in the middle of a sorted list, and work out whether you should be going higher or lower than the current element, like players in some demented version of Card Sharks. This is basically the way we look up names in a phone book - it's faster, but a bit more complicated to implement:

    my @names = qw(Able Baker Charlie Dog ...);
    my $index;

    my $target = "Roger";

    sub search {
        my ($lower, $upper) = (0, $#names);
        while ($lower <= $upper) {
            my $index = ($lower + $upper) / 2;
            if ($names[$index] lt $target) {
                $lower = $index + 1;
            } elsif ($names[$index] gt $target) {
                $upper = $index - 1;
            } else { return $index }
        }
        # Not found!
    }

This starts in the middle, at "Mike". "Go higher!" shouts the crowd, so it goes half-way between there and the end, to "Tare". Then it needs to go lower, so looks half-way between "Mike" and "Tare" and gets to "Peter". Now it needs to go higher again, between "Peter" and "Tare", and finds "Roger".

Four comparisons, not bad. Can we do it any better? How about... oh, none?

    my %search = map { $names[$_] => $_ } 0..$#names;
    print $names{"Roger"};

Now I'm cheating somewhat here because setting up the hash requires iterating over the whole array. But once you've done that, the searches are basically free.

Of course, if you need to look up an array index by its contents, then maybe you're doing something wrong in the first place, and should have used a hash to start with. For instance, we're reading in a configuration file like this:

    force-v3-sigs
    escape-from-lines
    lock-once
    load-extension rndlinux
    keyserver wwwkeys.eu.pgp.net
    keyserver the.earth.li

(Yes, that's from GnuPG.) Each line has a key and optionally a value, but you can have multiple values for each key. At the end of the day, we're going to want to look through and say "tell me all the keyservers" - but not just the keyservers, you want to read the whole config in and be able to say that about any key. That's a search problem, and a tricky one. Here's one solution:

    while (<>) {
        chomp;
        my ($key, $value) = split /\s+/, $_, 2;
        push @config, [$key, $value];
    }

Now you have to say:

    @keyservers = map { $_->[1] } grep { $_->[0] eq "keyserver" } @config;

This is actually a linear search in disguise. (grep does a linear search for you.)

With hashes, the issue is much simpler:

    while (<>) {
        chomp;
        my ($key, $value) = split /\s+/, $_, 2;
        push @{$config{$key}}, $value;
    }

This treats every key as a separate array reference, and pushes values into that array. Now to retrieve the list of keyservers, we just look up the array inside the hash:

    @keyservers = @{$config{"keyservers"}};

This process is very remniscient of the final pattern we're going to look at - using a hash as a portable symbol table.

Dispatch tables

Instead of having the configuration reader create a bunch of arrays - @keyservers, @load_extension and so on - we created a hash which held the arrays for us so that we can look them up indirectly but more efficiently. In effect, instead of using the Perl symbol table, we use a hash as a portable symbol table.

Let's suppose you have a script that does several related things: it manages your to-do list by adding, editing, listing and deleting to-do items:

    % todo add "Email Samuel about photos"
    Todo item 129 created
    % todo done 129
    Item 129 marked as done

You would expect the script to look like this:

    my $command = shift @ARGV;
    if    ($command eq "add")  { add(@ARGV)  }
    elsif ($command eq "list") { list(@ARGV) }
    elsif ($command eq "done") { done(@ARGV) }
    elsif ($command eq "edit") { edit(@ARGV) }
    ...
    else { die "Unknown command: $command" }

Well, that is quite tedious; we need to edit the program in several places every time we add a new command. We could use symbolic references - that is, tell Perl to call a function named $command:

    my $command = shift @ARGV;
    sub AUTOLOAD { die "Unknown command: $AUTOLOAD" }
    no strict 'refs';
    &{$command}(@ARGV);

But that's somewhat crazy: it allows the user to get at any subroutine in the main package, which we may not want, and in order to keep any error checking, we've had to assume that any undefined subroutine call comes from the command line.

The middle way is to copy the commands into a hash, mapped to a function reference:

    %commands = (
        add  => \&add,
        list => \&list,
        edit => \&edit,
        done => \&done,
    );
    my $command = shift @ARGV;
    if (!exists $commands{$command}) { die "Unknown command: $command" }
    $commands{$command}->(@ARGV);

This keeps strict happy, it's safe in the way it restricts what subroutines can be called, and it allows for error checking that doesn't mess everything else up. Mark Jason Dominus' Higher Order Perl shows how you can define commands at runtime if you use dispatch tables, something you can't do if you hard-code your dispatch.

Conclusion

We've looked at some of the most common hash-based patterns: using hashes for counting, uniqueness, searching, and dispatch - rather a lot more than just mapping from one thing to another. Of course, that is what a hash does at one level, but the uses of such a data structure are a lot more diverse than just that.

And that's how you improve your Perl programming - you take elements of the language which ostensibly do one thing, and you find that they're great for more complicated uses as well. Maybe after these ideas you'll be able to find a few more hash idioms of your own!

Latest articles

Development activity

This page was last checked for correctness on 2006-11-02. Contact Simon.