Software | Secret Software | Writing
Little Languages in Perl
I've been working on and off on a little project to help trap and classify spam in all its guises: not just junk mail, but junk web site comments, web sites, and so on. It's sort of a rewrite of the popular SpamAssassin project but expanded to work with other kinds of material than just email. I've called it SpamMonkey, and you can get it from the CPAN.
But I do want it to work with email too, and one powerful anti-spam tool for dealing with email is the relay black list. Originally used to keep tabs on mail servers which allowed anyone to send mail through them, there are now lists of all varieties of "naughty" hosts and netblocks. To use a relay black list, you look at where your mail came from by looking at the "Received" lines of the header, and look at all the hosts it came through on its journey to your inbox, and see if any of them are on the naughty list, which is published using the DNS system.
The only problem is that mail transport agents provide their information about where mail came from in any number of formats. There is a standard specified in RFC 2822, but it is a custom more honoured in the breach than in the practice. SpamAssassin has a great parser which deals with most of the common formats and makes a good stab at the rest, but we hardly want to be pulling in SpamAssassin in our SpamAssassin-replacement.
So I began looking at the relevant subroutine in SA for parsing Received headers.
It's called Mail::SpamAssassin:Message::Metadata::Received::parse_received_line,
and it's around 900 lines of Perl.
It's also almost entirely made up of regular expressions,
like this:
if (/^from .*?(?:\]\)|\)\]) .*?\(.*?authenticated.*?\).*? by/) {
$auth = 'Sendmail';
} elsif (/\) by .+ \(\d{1,2}\.\d\.\d{3}(?:\.\d{1,3})?\) \(authenticated as .+\) id /) {
$auth = 'CriticalPath';
} elsif (/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i) {
$auth = $1;
}
I didn't particularly want to just take the code and steal it wholesale; I wanted to try to make it a bit tidier in the process. To do this, I chose a solution off the top of my head...
Ad-hoc languages
I started rewriting these regular expressions as statements in a new little language I made up for the purpose. How do you design a language like this? It's very simple - you just start writing code in it. The representation that you naturally come up with when you write code is, by definition, the most natural representation of that code for you. Since I had a bunch of regular expressions to match some text, and some actions to do if they matched, I started writing code like this:
/^\(/ IGNORE "gateway noise";
/\sid\s+<?([^\s<>;]{3,})/ SET id = $1;
/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i SET auth = $1;
So there's the general form: regular expression test, followed by one or more actions. Notice that what I've done is to remove all the common code in the SpamAssassin case: if (...) { ... } and concentrate on only the parts which change from rule to rule.
The actions can be IGNORE with a reason, if this line isn't interesting - for instance, if it details a mail transfer inside a firewall - SET to assign a variable to a value, DONE if we've already got enough information out of the line, and UNPARSABLE if we don't understand the line at all.
And off I went merrily translating the code from SpamAssassin into my new language, until I can across something new. For efficiency, SpamAssassin does a general check (say, if the header contains "from") and only runs some specific checks if the general check succeeds. The Perl is like this:
if (/^from /) {
if (/ \(SquirrelMail authenticated user /) {
dbg("received-header: ignored SquirrelMail injection: $_");
return;
}
...
}
So suddenly we need another control structure: if the general condition succeeds, then test the specific conditions. For some reason I was in a bit of a Perl 6 mood when I was ad-hoc'ing my little language, and came out with:
GIVEN /^from/ {
/ \(SquirrelMail authenticated user / IGNORE "SquirrelMail injection";
...
}
I also needed a GIVEN NOT to cover Perl's unless control structure.
Two more little additions - the ability to say set var ||= $value and the usual Perl syntax var=~// to test against a variable instead of $_ - and our little language is done. What was 900 lines of code is now 200 lines of data. Now what?
Parsing
First, we've got to teach our program how to understand the language we've just made up - we have to parse it. One great tool to have in your armoury for parsing languages is Parse::RecDescent; however, our little language is so simple that we can parse it with a few regular expressions. For anything more complex, Parse::RecDescent would be coming out of the toolbox at this point. But what we have is an optional regular expression - with an optional referent var=~ - and then some commands. We'll worry about GIVEN in a moment. Here's how we get started:
sub parse_rules {
my $tree = [];
while (<DATA>) {
chomp; s/^\s+//;
s/^#.*//; next unless /\S/;
We're going to store the rules in an array, one element per rule. We start by tidying up the input line and getting rid of comments and blank lines. Now we'll check for the referent, and we can start defining our rule:
my $referent;
$referent = $1 if s/^(\S+)=~//;
my $current = {};
Now we have a bit of a problem; we want to parse a regular expression, and they're horrific: you have to look for matching beginning and end delimiters, you have to check whether or not the end delimiter appears with a backslash before it as in /^from Unknown\/Local \(/; you have to check for a possible s before a substitution regular expression and possible modifiers afterwards.
Thankfully, there's already a solution for this very task: the Text::Balanced module supplies a subroutine called extract_quotelike, which look after all of these concerns. We give it our input string and, if there's a regular expression at the beginning of the string, it will remove it and return it to us.
if (my $re= extract_quotelike($_)) {
$current->{regexp} = $re;
$current->{referent} = $referent if $referent;
}
Now if we've specified a referent var=~ then a regular expression is no longer optional, so we have another clause to check for this:
elsif ($referent) {
die "Syntax error on line $.: Referent with no regexp!\n";
}
And now everything else in the line is meant to be commands, so while there's text left, we pull commands out of the line:
my @actions;
while ($_) {
s/^\s+// and next;
if (s/^SET (\w+)\s*((?:\|\|)?=)\s*(.*?);//) {
push @actions, { action => "SET",
variable => $1, value => $3, operator => $2 };
next;
}
The first thing we check for is SET a=<whatever>. Remembering that we also allow SET a ||= <whatever>, we pull out the operator as well.
Once we've got this action out of the way, we call next to go around the loop again. But maybe it's not SET, maybe it's IGNORE:
elsif (s/^IGNORE\s*//) {
my $reason = extract_quotelike($_);
die "No semicolon after reason? on line $.\n" unless s/^\s*;//;
push @actions, { action => "IGNORE", reason => $reason};
next;
}
With IGNORE, we have the same problem as regular expressions; we might have a reason that looks like this: "This is a \"dangerous\" gateway". The very same solution works, however, as extract_quotelike can pull out quoted strings as well as regular expressions. (Hence its name.)
The rest of the parser is fairly straightforward:
} elsif (s/^(DONE|UNPARSABLE)\s*;//) {
push @actions, { action => $1 };
next;
}
die "Can't parse action '$_' at line $.\n";
}
if (@actions) { $current->{actions} = \@actions }
push @$tree, $current;
}
return $tree;
}
If we see DONE or UNPARSABLE, we create an action of that name. Otherwise, we don't know what the action means, and we complain. Then we put the actions into the current rule and add it to our list of rules.
This covers everything apart from the GIVEN block.
A Better Strategy
Let's just pause for a second, though, and look at these lines again:
} elsif (s/^(DONE|UNPARSABLE)\s*;//) {
push @actions, { action => $1 };
next;
}
Let's suppose we want to add a new action TAG, which associates some metadata with this Received line. We'd have to go back and write another elsif clause and another regular expression, until our whole parsing subroutine is built up of a confusing list of if statements and regular expressions - if you remember, though, this is exactly the problem we had with SpamAssassin, the problem we were trying to get away from!
Mark-Jason Dominus, in his superb book "Higher Order Perl" creates a configuration parser a little like our parser, but which is table-driven. Just like we're doing with our little language, Mark's suggestion is to turn code into data, like so:
my %commands = (
SET => \&parse_set,
IGNORE => \&parse_ignore,
DONE => \&named_action,
UNPARSABLE => \&named_action,
);
Our loop over the actions then looks like this:
while ($_) {
s/^\s+// and next;
if (/(\w+)/) {
my $command = $1;
if (exists $commands{$command}) {
$action = $commands{$command}->(); # Modifies $_;
push @actions, $action;
next;
} else {
die "Unknown command $command at line $.\n";
}
} else {
die "Syntax error: no command? at line $.\n"
}
}
Each action subroutine knows how to parse its "bit" of the input, and removes everything pertaining to its action, ready for the next time. The real advantage of this, apart from the main parsing loop being cleaner, is that it makes it very easy to add new commands. Mark's config parser even has a DEFINE command which modifies the parsing table at run-time so you can create your own macros and commands from within the little language. But we have deliberately kept our language small and not very expandable, so there isn't much need for this in our case.
GIVEN blocks
Now we have the guts of a parser, adding support for GIVEN is actually pretty easy; all we need to remember is that GIVEN blocks contain code that we need to parse. First, we check for a GIVEN and see if it has an inversion or a referent:
if (s/^GIVEN\s+//) {
my $inverse = s/^NOT\s+//;
my ($referent) = s/^(\S+)=~//;
Then we extract a regular expression, which this time is not optional:
my $re;
unless ($re= extract_quotelike($_)) {
die "Syntax error on line $.: given has no expression\n";
}
And we want an open brace:
die "Syntax error on line $.: improper given\n" unless /\s*{\s*$/;
Now, all that follows is a set of rules forming a subtree inside this GIVEN block. So, we just parse the rest of the rules recursively:
my $subtree = parse_rules();
push @$tree, {given => $re,
($referent ? (referent => $referent) : ()), subtree => $subtree, inverse => $inverse };
next;
}
Of course, there's one problem with this - we need to know when to stop. That's easy, though - we stop when we see a close brace at the beginning of a line and return what we have parsed so far:
if (/^}\s*$/) {
return $tree;
}
We can make this a bit nicer by keeping track of whether or not we're in a given block:
sub parse_rules {
my $in_given = shift;
Now when we make the recursive call, we can say:
my $subtree = parse_rules(1);
And when we check for the close brace, we can tell whether or not we're actually expecting to see a close brace at this point:
if (/^}\s*$/) {
return $tree if $in_given;
die "Syntax error on line $.: superfluous close bracket\n";
}
And that's it - what was 200 lines of rules is now a Perl data structure. Now what?
Compiled or interpreted?
Well, there are two things we could do at this point. We could use the parsed version as a set of instructions to be interpreted, a little like what the Perl interpreter does internally. We'd end up with a Received line parser like this:
my $rules = parse_rules();
sub parse_received {
my $rules = shift;
local $_ = shift;
for my $rule (@$rules) {
if ($rule->{given}) {
my $passes = ($rule->{referent} || $_) =~ $rule->{given};
$passes = !$passes if $rule->{inverse};
parse_received($rule->{subtree}, $_);
next;
}
elsif ($rule->{regexp}) {
next unless ($rule->{referent} || $_) =~ $rule->{regexp};
}
for (@{$rule->{actions}}) {
# Do the actions
}
}
}
As you can see, this uses the data structure directly to decide what to do with each rule. There are several problems with interpreting, though; chief among them, it's slow. There is, of course, another way - we could compile the rules.
Compilation
Compiling sounds scary, but actually all compiling means is turning the data structure into something we can execute. We're in a Perl program, executing Perl code, so we could turn our data structure back into Perl code. The Perl code will look suspiciously similar to the code in SpamAssassin that we were trying to get away from, but we'll never need to touch it because it will be generated from rules that are expressed a bit more sanely.
Let's again start with a skeleton, then add in the support for the different commands, before finally adding support for GIVEN. Here's the outline:
sub unparse_rules {
my $tree = shift;
my $output;
for (@$tree) {
if ($_->{regexp}) {
$output .= 'if (';
$output .= '$r->{'.$_->{referent}."}=~" if $_->{referent};
$output .= $_->{regexp}.") {\n";
} else { $output .= "do { \n"; }
# Deal with actions here
$output .= "};\n";
}
return $output;
}
We're using $r as the variable which will hold all the information we parse out of the Received header. $output is the Perl code we're producing. Our code is going to look like this if there's a regular expression:
if (/blah blah blah/) {
# actions
}
And, to make it nice and consistent, if there isn't a regular expression, we'll just say:
do {
# actions
}
The code to handle the actions is pretty simple:
for (@{$_->{actions}||[]}) {
if ($_->{action} eq "DONE") {
$output .= 'return tidy_up($r)';
} elsif ($_->{action} eq "UNPARSABLE" or $_->{action} eq "IGNORE") {
$output .= "return ";
if ($_->{reason}) { $output .= "{reason => $_->{reason} }" };
} elsif ($_->{action} eq "SET") {
$output .= '$r->{'.$_->{variable}."} ".
$_->{operator} . " ".$_->{value}
} else { die "Couldn't unparse action!\n" }
$output .= ";\n";
}
And then we process the given in a similar way to before - we unparse the regular expression just like normal, and then we call ourselves recursively:
if ($_->{given}) {
$output .= $_->{inverse} ? "unless " : "if ";
$output .= "(";
$output .= '$r->{'.$_->{referent}."}=~" if $_->{referent};
$output .= $_->{given}.") {\n";
$output .= unparse_rules($_->{subtree});
$output .= "}\n";
next;
}
(The actual code in Email::Received keeps track of the recusion level to space out the code neatly, but we haven't shown that here.)
So now we have a routine which reads the parsed rules and returns a hunk of Perl code which, when run, turns the Received header stored in $_ into a hash reference $r. Now what?
Topping, tailing and installing
First we need to add a little more code at the top of our generated routine to take parameters and set up $r:
sub generate_parse_received {
my $code = shift;
$code = q|
sub {
local $_ = shift;
s/\s+/ /gs;
my $r = {};
|. $code. q|
Plus some more at the bottom to return $r:
return tidy_up($r);
}
|;
}
Now what I had originally planned to do was to run this:
generate_parse_received(unparse_rules(parse_rules()));
and store the output into a file so that it could be run. But, well, we have Perl, we can compile the code on the fly:
package Email::Received;
use base 'Exporter';
our @EXPORT = qw(parse_received);
*parse_received = eval generate_parse_received(
unparse_rules(parse_rules()));
We've compiled the code written in our mini-language into a Perl subroutine, which we've dynamically installed into the symbol table when our module loads. Now when later in the code someone calls Email::Received::parse_received, they're calling the code we've generated.
Debugging
OK, so what? We've done away with 900 lines of horrible Perl by turning them into another language and then compiling that into Perl code that we never see. Is that really much of an improvement?
Well, yes. I took the tests for SpamAssassin's Received header parser and used them for Email::Received. Since this is the real world, they didn't all pass first time, but about half of them failed. How do we work out what's going wrong? In the original SpamAssassin code, debugging the forest of regular expressions and conditionals is a complete nightmare unless you're prepared to spend lots of quality time single-stepping through things with a debugger. I'm not.
But since we're generating the code on the fly, we can generate bits of "instrumentation", things which tell us what paths were taken. So, in our rule parser, I add a line like this every time it sees new input:
$current->{line} = "$.: $_";
And in the unparser, something like this, just before we do any actions:
$output .= 'push @{$r->{rules_fired}}, q{'.$_->{line}."};\n";
With these two lines, our test suite can now tell us the path that our data took through the maze of code:
if(!is_deeply(\@result, \@expected)) {
print "# Rules fired: \n"
for my $i (0..$#result) {
for (@{$result[$i]->{rules_fired}}) {
print "# [$i] $_\n";
}
}
}
For instance:
# Input was:
# Received: from hotmail.com (something.com [65.54.245.95])
# at just after 10pm by Daffy.timing.com on a Friday (CrazyMTA)
# (envelope-from <foo@example.com>) with TFTP
# Rules fired:
# [0] 11: /(?:return-path:? |envelope-(?:sender|from)[ =])(\S+)\b/i
# SET envfrom = $1;
# [0] 103: /^from (\S+) \((\S+) \[(${IP_ADDRESS})\]\).*? by (\S+) /
# SET helo = $1; SET ip = $3; SET by = $4; DONE;
Here we're told how we got the output, and what line of the rules caused it to happen. Which is an awful lot easier than padding through with a debugger manually.
Conclusion
Little languages - also known as domain specific languages - are a great way to make programming simpler by removing all the repetitive parts of the program and concentrating just on what changes. By creating your little language on an ad-hoc basis, you tend to settle on the most natural representation of the rules you're trying to set out.
This has been the foundation of the Unix operating system, which has a plethora of little languages for text manipulation (awk, sed), for calculation (bc, dc), for typesetting and graphics, (pic, troff, eqn), and of course, shell scripting. Perl has been the natural successor of this style of programming, and so it's only fitting that we turn to Perl in order to parse and compile the little languages we come up with.
As well as making our code neater, it makes the code easier to update and to test. Perl 6 promises to make it even easier to interpret and compile your own little languages as part of a Perl program, but I hope you'll agree it's easy enough, and beneficial enough, to do right now in Perl 5!