Completely Useless Fun Project: Building The Parser
It’s been awhile but I’m ready to get back into it. I was traveling pretty extensively for thanksgiving and didn’t have that much time in front of my computer. I hope everyone had a great thanksgiving!
Anyway, we already spoke about the history, design and implementing a lexer.. This post is going to focus on building the parser.
Lexing, as we gathered a few weeks ago, is the lexical analysis part of the compiler. This is the process of deciding what each stream of characters mean. Are these characters some number? Are they a reserved keyword? Are the a variable name? The lexer will determine this.
The next part of the compiler, is syntactic analysis. This is what the parser does, analyze the syntax of your language. This is different than analyzing the semantics of your language, however, and that is what we will talk about next time.
Parsing a language
In order to parse a language, you must first take into account grammar. Grammar is the set of rules that make up valid statements (often referred to as the alphabet) in a language. This is how a compiler knows that
1 + 1 is valid, while knowing
1 + + 1 is not. It is important to know that a grammar does not know the meaning of a string, only the form it may take. The job of determining the meaning of the string comes from semantic parsing, which we will cover next week.
There are many different types of grammars, defined by noted linguist Noam Chomsky. These include regular, context-free, context-sensitive, and recursively enumerable. We actually spoke a little bit about regular grammars. Regular grammars are what powers Regular Expressions!
You may be wondering why we can’t use regular expression as our parser for building the entire programming language. The answer is not that simple. For simple languages a regular grammar may be good enough. But for more complex languages, regular grammars are hard to work with. This is because Regular grammars are either right-left or left-right (aka linear) whereas a grammar like context-free (which is what we will use to build Arbor’s parser) can basically be any combination of symbols.
Context Free Grammars
Context Free Grammars are the backbone of parsing a language. A Context Free Grammar is a set of production rules that describe all possible strings in a given language. These production rules are simple replacements and take the form
B -> ß. This rule means that when you encounter a
B replace it with a
You can make a whole bunch of rules for a Context-Free-Grammar (CFG). The beauty of it is that these can literally go on for ever. Take the following Grammar for example:
A -> aA A -> a B -> b B -> bB S -> AB
This languages could take any string with all
as first and all
bs last. This takes
abbbbbbbbbbbbbbbbbbbbbbbbbb as well as
aaaaaaaaaaaaabbbbbbbbbb. When the parser encounters an
a it replaces it with an
A, so now a string like
aaAbbb. Next it sees
aA and replaces that with
A so the string becomes
aAbbb. This is performed one more time so that it becomes
Abbb. Then, we do the same thing to the
AB. Next we recognize that
AB can be reduced to
S and so
S is our starting symbol and so we are done.
Using this, we can define a simple grammar that can perform any arithmetic operation:
S -> math math -> math '+' math math -> math '-' math math -> math '*' math math -> math '/' math math -> NUMBER
So here we define
NUMBER as any number, integer, or float. It is in all caps to show that it is not a grammar rule (non-terminal symbol), but a terminal symbol. A Non-Terminal Symbol is a symbol that can be reduced into a form that is a group of terminal symbols. Terminal symbols are the building blocks of the language. These are the lexemes you defined in the lexer. In other words, non-terminal symbols can be reduced to other symbols, either terminal or otherwise, while terminal symbols can not.
In our grammar,
NUMBER is the end of the road. We can’t reduce it to anything else.
1. It won’t become
Now is this grammar valid? A simple test says yes:
1 + 2 * 3 - 8 + 66 / 3 * 4 is valid as well as
1 + 2. Let’s see why, using
1 + 2 * 3 - 8 + 66 / 3 * 4 as an example.
If we run
1 + 2 * 3 - 8 + 66 / 3 * 4 through our lexer, the string becomes
NUMBER + NUMBER * NUMBER - NUMBER + NUMBER / NUMBER * NUMBER. One of our rules says that
math can become a
NUMBER so there we go. The parser turns the string into
math + math * math - math + math / math * math. Then the arithmetic rules come into play: we can reduce
math + math to
math + math * math - math + math / math * math becomes
math * math - math + math / math * math. We can do that for every rule in that string:
math * math - math + math / math * math ->
math - math + math / math * math ->
math + math / math * math ->
math / math * math ->
math * math ->
math. And we are done. That was a valid string.
Of course none of this really helps because it looks like you are actually losing data, which is sort of true. But the tools that we will use actually makes it easy to maintain the structure of the program, which I will talk about next week when talking about the Abstract Syntax Tree.
Types of CFG Parsers
There are are few different types of CFG parsers; General Parsers, Top Down parser, and Bottom up Parsers.
General parsers can parse any GFG, but they are incredibly inefficient for production use, so I will not say much more about them.
Top down parsers start at the start symbol (
S in my math example above) and moves towards the leaves (or terminal symbols)
Bottom Up Parsing, is what I did in the example above. It starts at the leaves (terminal symbol;
NUMBER above) and moves towards the root. It is exactly what I did above: I started with
NUMBER and reduced it to
math and kept going until it hit
Top Down and Bottom Up parsers are what we use to build language parsers, they are not general parsers so you have to define a grammar for them and that is why we use tools like Bison or Yacc to generate these for us. Furthermore, we can actually break them down into different types of parsing: LL, LR, SLR and LALR. These are a bit complicated to explain here, but here is a link to the wikipedia page. If you are curious, the parser we are going to use is a LALR bottom up parsing. This is what powers YACC.
Finally, we will get the meat and potatoes of describing a grammar in python for Arbor.
You should already have PLY install and defined the lexicon of the language. If you haven’t go back and read my post on the lexer. If you want to skip ahead and see what the “final” parser looks like, go to my github repo for arbor
The first thing we have to do when defining our parsing rules is to import yacc and the lexer tokens into our parsing file. We do this so that the parse understands our tokens. So far your parser file should look like this:
import ply.yacc as yacc from src.lexer import tokens
Next I define a start symbol. Admittedly, I did this because I forgot what the default start symbol is in Ply. That is really as easy as just doing
start = 's'.
At this point your code should look like this:
import ply.yacc as yacc from src.lexer import tokens start = 's'
Now we can get started on our parser rules. Ply actually makes this incredibly easy.
To define a rule, all we have to do is define a function that starts with
p_ and takes in a single parameter
p. Then we define a docstring (similar to the token functions in the lexer) that defines the rules. For example, here is what a function that defines one of our math rules from the earlier example would look like:
def p_mathAdd(p): '''math : math '+' math''' pass
This is exactly equal to
math -> math '+' math.
Yacc will take all of the functions that are prefixed with
p_ and generate your LALR tables for you. This makes the parsing so easy on you.
Now to design your language, first you have to keep in mind the structure. I like to break down my rules are generally two types:
expressions. I define an
expression as a statement that returns a value. So an
expression is a type of statement. From there I define my expressions. Most things in arbors are actually expressions from this. I think the only things that are pure statements are my
if/else constructs. Everything else is an expression.
Now of course my program is really just a collection of the different types of statements. So the first few rules I define are:
def p_start(p): '''s : statements''' pass def p_statements(p): '''statements : statements statement''' pass def p_statement(p): '''statement : expression SEMICOLON''' pass
As we recall, I define
s as my starting symbol. So
s can be transformed into
statements. Then statements is really just a combination of
statement is an
expression with a
SEMICOLON is a terminal symbol that I defined in me lexer that is, well, a semicolon.
Then I defined expression as a majority of my language. An expression are things a variable name, mathematic expression, function definition, function call, etc. Here is how I define my mathematical operators:
def p_bin_op(p): ‘'’expression : expression PLUS expression | expression SUB expression | expression MULTI expression | expression DIV expression’’’ pass
There are two ways you can combine two rules that reduce the same rule. One way, which may be easier to construct your AST, is to define them as separate functions. The above function could also be written as:
def p_add(p): '''expression : expression PLUS expression''' pass def p_minus(p): '''expression : expression MINUS expression''' pass ...
Or you can combine them in a single function, separated by a pipe (
|). It really comes down to how you like to do things and your personal preferences. I often use both methods, depending on what I am trying to achieve.
Accessing Values on the Parser
You may have noticed that you pass in a single parameter p into the parser functions. This
p object is the parser object. Using this object you can access the values of the expression that the parser is currently on.
You access this kind of like a list, where
p is the value you want the left hand expression to have (the part before the colon). If you have this rule:
expression : NAME,
p refers to
expression. After that,
p[n] refers to everything after the colon in order. In this rule:
expression : NAME PLUS NAME,
p would refer to the first
p refers to the
p refers to the second
NAME. You would assign what you want
expression to equal. For example, in the above rule, you may want
expression to equal to the sum of the two values. That would be done like this:
p = p + p. Notice you don’t return! Instead you just assign p to equal what you think it should return.
Note on errors:
You can handle parser errors by defining a
p_error function or by using the
error symbol. The parse object will have line number, but not column number, so you will have to figure that out. I wrote a function to do that that looks like:
def find_column(input, token): last_cr = input.rfind('\n',0,token.lexpos) if last_cr < 0: last_cr = 0 column = (token.lexpos - last_cr) return column
If p is none, that means you hit the end of the file unexpectedly.
When you are done defining your rules, you need to instantiate the parser. This is very similar to how the Lexer is instantiated:
parser = yacc.yacc(debug = True). You don’t need
debug=True although it definitely helps for debugging. After that, the parser object has a method called parse. You pass in the string you want to parse into this method and the parser will get to work. This method returns the value that was assigned to
p in the entry node. For me this is where I do
s : statements, but it could be wherever your start symbol is.
The wrap up
That’s all there is really to parsing. Parsing is a major part of building a compiler, and the concepts employed in computer language parsing can be used anywhere!
Lexing and Parsing really are the easy part of building a compiler. The fun parts come next: the AST and converting Arbor to LLVM bytecode. That is what I will talk about in the next post!
By the way, I would like to point out that I am a few weeks ahead in building the compiler for Arbor. I am a bit sloppy with my code right now for the Arbor compiler and all of my stuff is in the
parser branch (even though it should be broken down better). If you want to look at the latest and greatest, look at that branch!
As always the link to the Arbor project is here: https://github.com/radding/Arbor