# Evaluating Expressions – Part 3: Building the AST

In my previous post we’ve parsed an exception verifying whether it’s correct or not syntactically. But we still have to evaluate it. To be able to do that we’ll have to build an abstract syntax tree. This can be done by modifying the previous code and inserting semantic action. That means we do something more when we match productions.

An abstract syntax tree is a binary tree. The inner nodes will represent operators and leafs will be numerical values.

Here is how a node in the AST will look:

It is defined like this:

```enum ASTNodeType
{
Undefined,
OperatorPlus,
OperatorMinus,
OperatorMul,
OperatorDiv,
UnaryMinus,
NumberValue
};

class ASTNode
{
public:
ASTNodeType Type;
double      Value;
ASTNode*    Left;
ASTNode*    Right;

ASTNode()
{
Type = Undefined;
Value = 0;
Left = NULL;
Right = NULL;
}

~ASTNode()
{
delete Left;
delete Right;
}
};
```

For the expression 1+2*3, the AST will be:

We’ll build this tree by inserting semantic actions and adding nodes according to the following rules:

You’ll probably notice that based on these rules the AST shown above will be modified a little bit, with some additional nodes for operators + and *, having on the left a leaf node with the neutral element for the operation (zero for + and 1 for *), and on the right a node corresponding to a TERM or FACTOR. This won’t affect the evaluation.

The Parser class will change so that the functions corresponding to the non-terminal symbols EXP, EXP1, TERM, TERM1 and FACTOR will return an ASTNode* instead of void. That is the node created as a semantic action.

```class Parser
{
Token m_crtToken;
const char* m_Text;
size_t m_Index;

private:

ASTNode* Expression()
{
ASTNode* tnode = Term();
ASTNode* e1node = Expression1();

return CreateNode(OperatorPlus, tnode, e1node);
}

ASTNode* Expression1()
{
ASTNode* tnode;
ASTNode* e1node;

switch(m_crtToken.Type)
{
case Plus:
GetNextToken();
tnode = Term();
e1node = Expression1();

return CreateNode(OperatorPlus, e1node, tnode);

case Minus:
GetNextToken();
tnode = Term();
e1node = Expression1();

return CreateNode(OperatorMinus, e1node, tnode);
}

return CreateNodeNumber(0);
}

ASTNode* Term()
{
ASTNode* fnode = Factor();
ASTNode* t1node = Term1();

return CreateNode(OperatorMul, fnode, t1node);
}

ASTNode* Term1()
{
ASTNode* fnode;
ASTNode* t1node;

switch(m_crtToken.Type)
{
case Mul:
GetNextToken();
fnode = Factor();
t1node = Term1();
return CreateNode(OperatorMul, t1node, fnode);

case Div:
GetNextToken();
fnode = Factor();
t1node = Term1();
return CreateNode(OperatorDiv, t1node, fnode);
}

return CreateNodeNumber(1);
}

ASTNode* Factor()
{
ASTNode* node;
switch(m_crtToken.Type)
{
case OpenParenthesis:
GetNextToken();
node = Expression();
Match(')');
return node;

case Minus:
GetNextToken();
node = Factor();
return CreateUnaryNode(node);

case Number:
{
double value = m_crtToken.Value;
GetNextToken();
return CreateNodeNumber(value);
}

default:
{
std::stringstream sstr;
sstr << "Unexpected token '" << m_crtToken.Symbol << "' at position " << m_Index;
throw ParserException(sstr.str(), m_Index);
}
}
}

ASTNode* CreateNode(ASTNodeType type, ASTNode* left, ASTNode* right)
{
ASTNode* node = new ASTNode;
node->Type = type;
node->Left = left;
node->Right = right;

return node;
}

ASTNode* CreateUnaryNode(ASTNode* left)
{
ASTNode* node = new ASTNode;
node->Type = UnaryMinus;
node->Left = left;
node->Right = NULL;

return node;
}

ASTNode* CreateNodeNumber(double value)
{
ASTNode* node = new ASTNode;
node->Type = NumberValue;
node->Value = value;

return node;
}

void Match(char expected)
{
if(m_Text[m_Index-1] == expected)
GetNextToken();
else
{
std::stringstream sstr;
sstr << "Expected token '" << expected << "' at position " << m_Index;
throw ParserException(sstr.str(), m_Index);
}
}

void SkipWhitespaces()
{
while(isspace(m_Text[m_Index])) m_Index++;
}

void GetNextToken()
{
SkipWhitespaces();

m_crtToken.Value = 0;
m_crtToken.Symbol = 0;

if(m_Text[m_Index] == 0)
{
m_crtToken.Type = EndOfText;
return;
}

if(isdigit(m_Text[m_Index]))
{
m_crtToken.Type = Number;
m_crtToken.Value = GetNumber();
return;
}

m_crtToken.Type = Error;

switch(m_Text[m_Index])
{
case '+': m_crtToken.Type = Plus; break;
case '-': m_crtToken.Type = Minus; break;
case '*': m_crtToken.Type = Mul; break;
case '/': m_crtToken.Type = Div; break;
case '(': m_crtToken.Type = OpenParenthesis; break;
case ')': m_crtToken.Type = ClosedParenthesis; break;
}

if(m_crtToken.Type != Error)
{
m_crtToken.Symbol = m_Text[m_Index];
m_Index++;
}
else
{
std::stringstream sstr;
sstr << "Unexpected token '" << m_Text[m_Index] << "' at position " << m_Index;
throw ParserException(sstr.str(), m_Index);
}
}

double GetNumber()
{
SkipWhitespaces();

int index = m_Index;
while(isdigit(m_Text[m_Index])) m_Index++;
if(m_Text[m_Index] == '.') m_Index++;
while(isdigit(m_Text[m_Index])) m_Index++;

if(m_Index - index == 0)

char buffer[32] = {0};
memcpy(buffer, &m_Text[index], m_Index - index);

return atof(buffer);
}

public:
ASTNode* Parse(const char* text)
{
m_Text = text;
m_Index = 0;
GetNextToken();

return Expression();
}
};
```

Now the Parse() method will return the created abstract syntax tree. We will see how to evaluate the expression by traversing this tree in the next post.

### 2 Replies to “Evaluating Expressions – Part 3: Building the AST”

1. Brandon Morris says:

THANK YOU SO MUCH FOR SHARING THIS!!!

I know this is somewhat of an older post, and so I apologize for “necro’ing” a post from 2009 but I think the code is still very much relevant and useful. The only question I have though, is how does one go about eliminating the “extra” node creation? In a larger picture my problem is this: I’m writing a compiler and each iteration of my exp engine has been different, everything from a shunting-yard algorithm to a brute-force approach of a PQ tree; I’ve never really been happy with the thing until working with this code. It’s PERFECT it does everything I need to; it’s clean, it’s elegant, and with each OpNode being in a separate class (The way I have it) I can get very specific in very specific situations, if I want, with the intermediate code output. The only problem is it adds extra nodes.

The compiler I’m writing is essentially a C++ (With some stuff not being implemented, like templates). Anyhow, with these extra nodes it kinda reeks havoc on dealing with operator overloads on “dealing” with end-user class types during the compiler’s compile-time of the end-user code. If you draw it out on paper, you’ll see what I mean (I’d attach my handwritten drawn out tree if I could), but essentially if you take the statement:
x = a(3, 8 + b(), 4);

you wind up with one of the trees in the the deepest level being:
.*
/ \
b 1

(Ignore the ‘.’; I figured the parser of the site’s blog software wouldn’t line it up right with a new line starting with a whitespace.)

This is problematic as b was never specified by the end-user to multiply by 1, and if b has user-defined operators (Specifically a mult operator) the mult node will incorectly spit out intermediate code rep as if it were as it has no way of really knowing whether it was a legitimate end-user-typed sub-expr, or an “artificial” one. So my question is, how do I eliminate the extra factors and OpNodes that aren’t really specified in the original expression that this code creates? Where would I take a surgical scalpel and how would I apply it?

I have some ideas; one place I can see is the implicit assumption that at the end of each Expr1() and Term1() call, it creates a nodenumber. The other place I can see is at the end of each Expr() and each Term() it implicitly assumes to create an OpAdd and an OpMult respectively. These seem like the best places to slice out but I’m not sure how to square the code-flow once I’ve done that as I think without those assumptions the lexer is somewhat behind the code flow or in front of it or something. Any thoughts? Thanks in advance, and thanks again for sharing the code with such an excellently detailed article!

This site uses Akismet to reduce spam. Learn how your comment data is processed.