28 February 2016

From Regular Expressions to Grammars, Pt. 4

If you're new to Regular Expressions (at least as they're used in Perl 6), then I'd suggest starting with the 1st part of this series. Those of you with a solid grasp of regular expressions may want to skip ahead to last week's posting. Now, on with the show!

In Last Week's Episode

We were starting to develop a compiler in Perl 6 that would take a JavaScript expression like
 var a = 3; console.log( "Hey, did you know a = " + a + "?" );  
and turn it into Perl 6 code that compilers like Rakudo Perl can run. Before we get started it's probably a good idea to figure out what that code will look like. If you already know Perl 5, then code like this should look familiar to you.

my $a = 3;
say "Hey, did you know a = " ~ $a ~ "?";
We'll need to make sure that our regular expressions have captured the essence of the JavaScript. If you remember from last time, we captured our text with this set of regular expressions:
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
If you put this into a Perl 6 source file and run it, the output might look a little strange at first:
「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 Assignment-Expression => 「var a = 3」
    Variable => 「a 」
    Number => 「3」
 Function-Call => 「console.log( "Hey, did you know a = " + a + "?" )」
    String => 「"Hey, did you know a = " 」
    Variable => 「a 」
    String => 「"?" 」
If you'll ignore the 「」 marks for the moment, you can see that the matches are indented, almost like a file explorer window, with 'Assignment-Expression' being a directory, and 'Variable' and 'Number' being files inside that directory. That's not too far from the truth, actually. When I see this sort of structure, I find that it helps to visualize it like so, with just a bit of added syntax -

$/ => 「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 <Assignment-Expression> => 「var a = 3」
    <Variable> => 「a 」
    <Number> => 「3」
 <Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
    <String> => 「"Hey, did you know a = " 」
    <Variable> => 「a 」
    <String> => 「"?" 」
This makes it almost too easy to figure out how to print out text, and points out a tiny problem in our regular expression. Let's print out the number we've assigned a to, just to start out with. The first line tells us the root of the directory, or match, tree is $/.  If you add 'say $/;' to the end of your test file and rerun it, you'll see the entire expression printed out twice. That must mean that $/ is the entire match.

Going down one layer is just as easy as adding what's on the left side of the => arrow. Change the previous 'say' statement to 'say $/<Assignment-Expression>;', and look at how the output changes. It should now look like this:
「var a = 3」
  Variable => 「a 」
  Number => 「3」
Let's put our (invisible) markers back in, so we can see where to go...
$/<Assignment-Expression> => 「var a = 3」
  <Variable> => 「a 」
  <Number> => 「3」
We can now see that our target, the number 3, is just one layer further down. Again, we can add what's on the left-hand side of the expression, so let's do just that.

say $/<Assignment-Expression><Number>;
  「3」
And we have almost exactly what we want. The 「」 are in the way, so let's "cast" the value here back to a number. I've put "cast" in scare quotes because it's not quite what C/C++ programmers think of as "casting". What we want to do is roughly the equivalent of 'sscanf(str,"%d",&num)", but in Perl 6, the operation is much simpler.

say +$/<Assignment-Expression><Number>;
  3
Without getting into too much detail, $/ is an object that has an implicit number, string and boolean value hiding inside of it. Adding '+' to the front reveals the hidden number inside the $/ object.

From JavaScript to Perl

We're not too far off from being able to generate Perl 6 code from our JavaScript. Let's use what we've learned above with our first statement, the assignment.
say 'my $' ~ $/<Assignment-Expression><Variable> ~ ' = ' ~
      $/<Assignment-Expression><Number> ~ ';';

my $a = 3;
We've just used 7 lines of Perl 6 to turn code in one language into another language. And most of the Perl 6 code is reusable, because strings, numbers and JavaScript/C/Java-style variable names are common across most languages out there.

Last time, we learned how to create matches using regular expressions. This time we've learned how we can use what we've matched, and how to find what we want inside a say statement. The invisible matching markers are useful enough that I might actually write a module that puts them back into match expressions, it shouldn't be hard.

There is one problem with that scheme, and if we look at the <Function-Call> matches, it's pretty easy to see the problem.

$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String> => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String> => 「"?" 」
When we write "say $/<Function-Call><String>;", which <String> will we get? Before you run this, try to guess. Is it the first one, because Perl 6 won't replace a match object once it's been created? Is it the last one, because the last one "overwrites" the first one? Does the compiler simply "get confused" and prints nothing? Try it and see!

It actually returns both matches in a list, so you can reference either one. Our invisible markers now get to look like
$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String>[0] => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String>[1] => 「"?" 」
So if we want to print out the first string, we can write "say $/<Function-Call><String>[0];" and get back  「"Hey, did you know a = " 」 complete with the funky Japanese quotation marks. Thankfully there's a shortcut to getting rid of those, just like there was with the number 3.

say ~$/<Function-Call><String>[0];
  "Hey, did you know a = "

The '~" operator "stringifies" the match, just like '+' "numifies" the match that gets returned. So, you can probably write the final line yourself...

say 'say ' ~ $/<Function-Call><String>[0] ~ ' ~ '
  ' $' ~ $/<Function-Call><Variable> ~ ' ~ '
  $<Function-Call><String>[1] ~ ';';

say "Hey, did you know a = " ~ $a ~ "?";
And we've compiled our two lines of JavaScript into Perl 6.

Refactoring

What we've got works, but there's quite a bit of repetition. Here's what we've got so far.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

say 'my $' ~ $/<Assignment-Expression><Variable> ~
       ' = ' ~ $/<Assignment-Expression><Number> ~
       ';';

say 'say ' ~ $/<Function-Call><String>[0] ~
       ' ~ $' ~ $/<Function-Call><Variable> ~
      ' ~ ' ~ $/<Function-Call><String>[1] ~
      ';';

The rules look pretty good, the repetitions of <String> and <Variable> are pretty much unavoidable, but look at the 'say' statements. You'll see that <Assignment-Expression> and <Function-Call> repeat themselves several times. One way to get rid of this repetition is to create a temporary variable, but that could get ugly.

my $assignment-expression = $/<Assignment-Expression>;
say 'my $' ~ $assignment-expression<Variable> ~ ' = ' ~
    $assignment-expression<Number> ~ ';'
Instead, let's take advantage of Perl 6's subroutine signatures, and reuse the $/ variable name so we can reuse the code we wrote above, and just drop out the <Assignment-Expression> part. I'll name the subroutine after the rule, just to keep things straight. (You'll see why later.)
sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

say assignment-expression( $/<Assignment-Expression> ); 
Let's do the same for <Function-Call> as well, creating a function with the same name and $/ subroutine signature. It now fits neatly on one line, and only repeats the <String> bit because it has to.
sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ ' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
}

say function-call( $/<Function-Call> ); 

Objectification

I've made quite a few choices along the way to get us to this point in the narrative. Here's where we are after the last bout of refactoring:

my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
} say assignment-expression( $/<Assignment-Expression> ); say function-call( $/<Function-Call> );
Here's where this all pays off.  Let's pack up the last two 'say' calls first. We haven't given the top-level rule a name, so let's just call it ... well, 'top' for now.

sub top( $/ ) { assignment-expression( $/ ) ~ function-Call( $/ ) }

Pack up your Troubles 


We haven't done much with the rules sitting at the top of the file for a while, so let's work with those. In Perl 6, and for that matter programming in general, it's a good idea to package up your code for reuse. While Perl 6 lets us package up code with the 'class' keyword, the rules we have really aren't "code" in any sense. While they can be used in code, and we do use them, they don't really make any decisions on their own.

So we shouldn't use the 'class' keyword to package them up. Instead, there's another convenient type meant for packaging up a bunch of regular expressions and rules, called a 'grammar'. It looks just like the syntax for declaring a 'rule', and that's actually by design.


grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' };
}
You'll note that we gave our top-level rule a name as well, and just called it 'TOP' for the time being. If you're playing along at home, you've probably made the change and are wondering how the "'var a = 3;...' ~~ rule { ... }" thing plays out, because trying things like "'var a = 3;...' ~~ JavaScript;" won't quite work.

Grammars are just like classes, in that they're really just clumps of potential code. They can't be made to do work on their own, they have to be converted from potential to .. well, kinetic code. We can do that just like you do with any class.
my $javaScript = JavaScript.new;
And now we have a variable that we can work with. Now, let's put it to work. All grammar classes come with a built-in 'parse()' method that we can use to get at the regular expressions inside it. Let's modify our match statement to take advantage of that -

$javaScript.parse(
    'var a = 3; console.log( "Hey, did you know a = " + a + "?" );');
And our code should work again.

Taking Action

Now that we've bundled up all of our matching stuff into one tidy little class, it'd be nice if we could do the same for those subroutines. Let's try that here, and put our subroutines into their own namespace, just like we did with the rules. We'll have to change from 'sub' to 'method', and our 'top' method will have to use 'self.' to call the other methods.

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
        self.assignment-expression( $/<Assignment-Expression> ) ~
        self.function-call( $/<Function-Call> )
    }
}
And just like before, we can create the Actions object in one line
 my $actions = Actions.new;
And call the top almost like we did before.
say $actions.top( $/ );
We've changed things around quite a bit, so here's a look at where we stand.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}
my $j = JavaScript.new;

$j.parse('var a = 3; console.log( "Hey, did you know a = " + a + "?" );');

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
      self.assignment-expression( $/<Assignment-Expression> ) ~
      self.function-call( $/<Function-Call> )
    }
}

my $actions = Actions.new;
say $actions.top($/);

Don't worry, we're almost there. Now that we have a separate class for the actions, let's rename the methods to exactly match the grammar rules, so we don't forget what they are.
class Actions {
    method Assignment-Expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method TOP( $/ ) {
      self.Assignment-Expression( $/<Assignment-Expression> ) ~
      self.Function-Call( $/<Function-Call> )
    }
}
Furthermore, there's one last bit of magic that we can take advantage of. We're going to combine the $javascript and $actions objects like so.
say $javascript.parse('....', :actions($actions) );
The ':actions(...)' is just a fancy way of declaring an optional argument to the 'parse()' method. We're telling the regular expression engine that any time a rule like <Function-Call> or <TOP> matches, we'd like it to call the corresponding method in our class.

This almost works as-is, but if you run the code with these modifications, you'll see the parser returns the original match object, with those Japanese quote marks. So it seems like we're back at square one. Not quite.

Go ahead and add a temporary "say 'Hello!';" to one of the methods, just to confirm that they're getting called. This is important proof that the regex engine is working and properly parsing what it's going over. You can even use some of the tricks we learned above and write "say $/<Variable>;" to see if the match is getting run as you thought it should. Go ahead and play around, come back here when you're done.

Mixed Signals

What's happening is the methods are getting called, but their output is being lost. Let's capture the output and use the final (ha!) feature of the grammar, the Abstract Syntax Tree. Now, this might dredge up notions of sitting in classrooms watching boxes and lines being drawn on the chalkboard, but it's not really that bad. We've already seen one, in fact the output from say() is an AST.

Let's look at the other syntax tree, the one we're building in the background. Add '.ast' to the end of the "$javascript.parse(...).ast;" call, like that. This will show us the syntax tree we're building on our own. Or will it?

If you do this, you'll see it prints (Any), which generally is the equivalent of "failed match", but we know from previous testing that the match hasn't failed. So what's going on here? While our methods are getting run, and they return output, Perl 6 doesn't know what to do with the output, or where it fits in the AST it's been asked to build.

The key is a little thing called "make". Add this where we used to put 'say', at the start of the methods.
class Actions {
    method Assignment-Expression( $/ ) {
      make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      make 'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
    }

    method TOP( $/ ) {
      make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast
    }
}
Also, because Perl 6 is calling the methods for us, we don't need to call self.Function-Call(...) on our own, all we need to do is look at the syntax tree that Function-Call(...) returns to us. And there we have it. A complete, albeit tiny compiler. In case you've gotten lost with the editing, here's the final result.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}

class Actions {
  method Assignment-Expression( $/ ) {
    make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';' }

  method Function-Call( $/ ) {
    make 'say ' ~ $/<String>[0] ~
     ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'; }

  method TOP( $/ ) {
    make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast }
}

my $j = JavaScript.new;
my $a = Actions.new;
say $j.parse(
   'var a = 3; console.log( "Hey, did you know a = " + a + "?" );',
   :actions($a)).ast;

Where Do We Go From Here 

One simple but neat change you can do is expand the Assignment-Expression to accept both numbers and strings. We talked last time about alternatives in the rules, so this hint should be enough to get you started:
rule Assignment-Expression { var <Variable> '=' ( <Number> | <String> ) }
You'll have to modify the Assignment-Expression method a little bit to make this work. Or you could get crafty and realize that ( <Number> | <String> ) could be turned into its own little generic "Term" rule, "rule Term { <Number> | <String> }", add an action "method Term( $/ ) { make $/<Number> or $/<String> }" and only change one thing in Assignment-Expression.

Time and again when helping people out online, I've had to say that Perl 5 regular expressions aren't quite the tool they need, whether they're trying to find a bit of HTML in a document, rewriting an RTF file or pulling out a title from a LaTeX doc. I've had to say "Use HTML::Parser", or "Check out the RTF modules on CPAN" or "Try Parser::MCG" to tackle these thorny questions.

Perl 6 regular expressions can handle all of those tasks and much more. Plus, the techniques I've mentioned in this tutorial aren't specific to JavaScript. You can use these same techniques to parse any language that can be broken down into tokens. It may take some creative use of higher-level rules, but it can be done.

Your methods don't have to return Perl 6 text when they parse JavaScript. They could just as easily count up the number of function calls, flag lines of code that can cause problems or do inline optimization. Perl 6 is written in Perl 6, so you could even use these techniques to compile from Perl 6 to JavaScript.

Thank you, gentle reader, for making it this far with me. I hope you've learned something along the way, or at least been entertained by what vistas Perl 6 opens up. Next month I'll probably briefly return to Perl 5 and more real-world debugging.

No comments:

Post a Comment