I was discussing a few days ago about evaluating expressions and I decided to explain how you can build an evaluator. I will do this in a series of posts, getting one step more in each post. I will use C++, but the approaches are the same regardless the language.
Let’s consider this expression: 1+2*3. The value of this expression is 7. But how do you evaluate it in a language like C++ if you get it as a string? First of all this is a so called “infix” notation. There are also prefix and postfix notation. The terms infix, prefix and postfix refer to the position of the operator related to the operands:
- Prefix: operator operand1 operand2 (ex: + 1 2)
- Infix: operand1 operator operand2 (ex: 1 + 2)
- Postfix: operand1 operand2 operator (ex: 1 2 +)
The human understandable notation is infix. But it turns out that trying to parse a string with infix expression, from left to right and evaluate it is not possible. Because you cannot now what in advance and operators have different precedence; and there are parentheses too.
To solve the problem you’d have to build a helper structure representing the infix expression. There are two possibilities:
- Reverse Polish Notation (RPN) implies transforming the infix expression in a postfix expression and then evaluating it from left to right. 1 + 2*3 is transformed into 1 2 3 * +. You go from left to right until you find an operator, evaluate the expression and then replace it in the stack.
Abstract Syntax Tree(AST) is an abstract representation of an expression, with inner nodes representing operators and leafs representing numbers.
The RPN is harder to build and evaluate in my opinion, so I will focus on the approach with the AST.
We build an AST while parsing the expression. First, we’ll have to define the grammar for the expression. Otherwise we wouldn’t know what to parse.
EXP -> EXP + EXP | EXP - EXP | EXP * EXP | EXP / EXP | - EXP | (EXP) | number
First, this grammar is recursive, as you can see, but another important problem is that it does not represent the precedence of the operators. For this reasons, a better grammar is this:
EXP -> EXP + TERM | EXP - TERM | TERM TERM -> TERM * FACTOR | TERM / FACTOR | FACTOR FACTOR -> ( EXP ) | - EXP | number
These rules written above are called productions. The symbols used are:
- EXP, TERM, FACTOR are called non-terminal symbols
- +, -, /, *, (, ) number are called terminal symbols
- EXT is the start symbol
While the grammar has the correct operator precedence, it’s still recursive, or more precisely, left-recursive. You can see that EXP goes into EXP then operator + then TERM. You never reach to match operator + because you have start again and again with a new expression. There are techniques for eliminating this recursion and the result is:
EXP -> TERM EXP1 EXP1 -> + TERM EXP1 | - TERM EXP1 | epsilon TERM -> FACTOR TERM1 TERM1 -> * FACTOR TERM1 | / FACTOR TERM1 | epsilon FACTOR -> ( EXP ) | - EXP | number
‘epsilon’ here means ‘nothing’.
With the theory (well, this is just the tip of the iceberg, but should be a good start for you) in place we’ll have to do three things:
The first two steps will be done the same time, but I’ll take them one at a time and explain it in details.
Before you continue with the implementation details, I suggest you read more about both RPN and AST and grammars.
Here are several references: