I was discussing a few days ago about evaluating expressions and I decided to explain how you can build an evaluator. I will do this in a series of posts, getting one step more in each post. I will use C++, but the approaches are the same regardless the language.
Let’s consider this expression: 1+2*3. The value of this expression is 7. But how do you evaluate it in a language like C++ if you get it as a string? First of all this is a so called “infix” notation. There are also prefix and postfix notation. The terms infix, prefix and postfix refer to the position of the operator related to the operands:
- Prefix: operator operand1 operand2 (ex: + 1 2)
- Infix: operand1 operator operand2 (ex: 1 + 2)
- Postfix: operand1 operand2 operator (ex: 1 2 +)
The human understandable notation is infix. But it turns out that trying to parse a string with infix expression, from left to right and evaluate it is not possible. Because you cannot now what in advance and operators have different precedence; and there are parentheses too.
To solve the problem you’d have to build a helper structure representing the infix expression. There are two possibilities:
- Reverse Polish Notation (RPN) implies transforming the infix expression in a postfix expression and then evaluating it from left to right. 1 + 2*3 is transformed into 1 2 3 * +. You go from left to right until you find an operator, evaluate the expression and then replace it in the stack.
-
Abstract Syntax Tree
(AST) is an abstract representation of an expression, with inner nodes representing operators and leafs representing numbers.
The RPN is harder to build and evaluate in my opinion, so I will focus on the approach with the AST.
We build an AST while parsing the expression. First, we’ll have to define the grammar for the expression. Otherwise we wouldn’t know what to parse.
EXP -> EXP + EXP | EXP - EXP | EXP * EXP | EXP / EXP | - EXP | (EXP) | number
First, this grammar is recursive, as you can see, but another important problem is that it does not represent the precedence of the operators. For this reasons, a better grammar is this:
EXP -> EXP + TERM | EXP - TERM | TERM TERM -> TERM * FACTOR | TERM / FACTOR | FACTOR FACTOR -> ( EXP ) | - EXP | number
These rules written above are called productions. The symbols used are:
- EXP, TERM, FACTOR are called non-terminal symbols
- +, -, /, *, (, ) number are called terminal symbols
- EXT is the start symbol
While the grammar has the correct operator precedence, it’s still recursive, or more precisely, left-recursive. You can see that EXP goes into EXP then operator + then TERM. You never reach to match operator + because you have start again and again with a new expression. There are techniques for eliminating this recursion and the result is:
EXP -> TERM EXP1 EXP1 -> + TERM EXP1 | - TERM EXP1 | epsilon TERM -> FACTOR TERM1 TERM1 -> * FACTOR TERM1 | / FACTOR TERM1 | epsilon FACTOR -> ( EXP ) | - EXP | number
‘epsilon’ here means ‘nothing’.
With the theory (well, this is just the tip of the iceberg, but should be a good start for you) in place we’ll have to do three things:
The first two steps will be done the same time, but I’ll take them one at a time and explain it in details.
Before you continue with the implementation details, I suggest you read more about both RPN and AST and grammars.
Here are several references:
I’m wondering, what if I want it to parse and evaluate multiple values, each expression is separated by ‘;’.
So far I’ve reached to conclusion that it should be on term1. For some reason it parses and the tree is correct but it still doesn’t parse the second expression.
I’m not sure I understand to what two expression separated by a semicolon evaluate to. Let’s say you have ‘1+2*5;2*2;6+4’. What does that evaluate to?
Why don’t you post the grammar you are considering and I’ll try to see how you can evaluate it?
Hi, this is a great example. Can you give me any tips on enhancing the code to support operators, e.g. =, and the AND/OR clauses.
>Can you give me any tips on enhancing the code to support operators, e.g. =, and the AND/OR clauses.
Probably the starting point is to outline corresponding grammar for this purpose – I tried to put something down on this issue – and obtained something like :
Initial version:
LOGICAL_EXPR : LOGICAL_EXPR AND LOGICAL_EXPR
| LOGICAL_EXPR OR LOGICAL_EXPR
| NOT LOGICAL_EXPR
| ( LOGICAL_EXPR )
| COMPARE_EXP comparator COMPARE_EXP
COMPARE_EXP : COMPARE_EXPR + COMPARE_EXPR
| COMPARE_EXPR – COMPARE_EXPR
| COMPARE_EXPR * COMPARE_EXPR
| COMPARE_EXPR / COMPARE_EXPR
| – COMPARE_EXPR
| ( COMPARE_EXPR )
| function( COMPARE_EXPR )
| number
Final version ( hope that’s the good one … 🙂 ) :
LOGICAL_EXPR : LOGICAL_OP AND LOGICAL_OP
| LOGICAL_OP OR LOGICAL_OP
| LOGICAL_OP
LOGICAL_OP : ( LOGICAL_EXPR )
| NOT LOGICAL_EXPR
| COMPARE_EXP comparator COMPARE_EXP
COMPARE_EXPR : TERM MATH_EXPR
MATH_EXPR : + TERM MATH_EXPR
| – TERM MATH_EXPR
| nothing
TERM : FACTOR TERM1
TERM1 : * FACTOR TERM1 |
| / FACTOR TERM1
| nothing
FACTOR : (COMPARE_OP)
| function( COMPARE_OP )
| – COMPARE_OP
| number
Even the topic is quite old I’d be curious about some comments/observation on this issue … 🙂
Best regards
This is a great tutorial!! thanks for sharing this knowledge.
Brilliant Explanation!!
I am trying to construct AST tree for expressions like “Myfunc(a) + 4 * 5” where a is user-defined variable and Myfunc is also user defined function.
How to do this with the Parser class you have defined?
There is one issue: “FACTOR -> ( EXP ) | – EXP | number” should be “FACTOR -> ( EXP ) | – FACTOR | number”
Thank you Mr. Marius. I have some questions according to AST. I want to create a small software that can receive a source code as an input. This tool will have two module. In the beginning, I want to detect the loops in the source code. For example, if I have
for(i=0;i