In my previous post I have provided some background theory for evaluating expressions with abstract syntax trees. As I was mentioning, the first step towards this goal is to parse the expression, make sure it is correct syntactically. This is what I’ll show you in this post.
Having the grammar defined, we’ll make one function for each non-terminal symbol (EXP, EXP1, TERM, TERM1, FACTOR).
Simply put the code will look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
void Expression() { Term(); Expression1(); } void Expression1() { switch(current_token) { case '+': GetNextToken(); Term(); Expression1(); break; case '-': GetNextToken(); Term(); Expression1(); break; } } |
However, I want to make it a little bit more organized, so the first thing to do will be defining a Token structure that will indicate the type of last extracted token and if the case its value (for numbers). A token is basically a symbol extracted (one at a time) from the input text. The possible tokens will be the arithmetical operators (‘+’, ‘-‘, ‘/’, ‘*’), the parentheses (‘(‘ and ‘)’), numbers and the end of the text.
Here is how I defined the token type and the token:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
enum TokenType { Error, Plus, Minus, Mul, Div, EndOfText, OpenParenthesis, ClosedParenthesis, Number }; struct Token { TokenType Type; double Value; char Symbol; Token():Type(Error), Value(0), Symbol(0) {} }; |
To be able to do the parsing, we’ll need some helper functions:
- SkipWhitespaces(), skips all whitespaces between two tokens:
1234void SkipWhitespaces(){while(isspace(m_Text[m_Index])) m_Index++;} - GetNextToken(), extracts the next token from the text; if an illegal token appears it throws an exception
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748void GetNextToken(){// ignore white spacesSkipWhitespaces();m_crtToken.Value = 0;m_crtToken.Symbol = 0;// test for the end of textif(m_Text[m_Index] == 0){m_crtToken.Type = EndOfText;return;}// if the current character is a digit read a numberif(isdigit(m_Text[m_Index])){m_crtToken.Type = Number;m_crtToken.Value = GetNumber();return;}m_crtToken.Type = Error;// check if the current character is an operator or parenthesesswitch(m_Text[m_Index]){case '+': m_crtToken.Type = Plus; break;case '-': m_crtToken.Type = Minus; break;case '*': m_crtToken.Type = Mul; break;case '/': m_crtToken.Type = Div; break;case '(': m_crtToken.Type = OpenParenthesis; break;case ')': m_crtToken.Type = ClosedParenthesis; break;}if(m_crtToken.Type != Error){m_crtToken.Symbol = m_Text[m_Index];m_Index++;}else{std::stringstream sstr;sstr << "Unexpected token '" << m_Text[m_Index] << "' at position " << m_Index;throw ParserException(sstr.str(), m_Index);}} - GetNumber() extracts a number from the input text from the current position; the purpose of this tutorial is didactical, so this function is quite simple: it reads integers and doubles with ‘.’ As the decimal point; it doesn’t read numbers in a format like 123.3E+2.
1234567891011121314151617double GetNumber(){SkipWhitespaces();int index = m_Index;while(isdigit(m_Text[m_Index])) m_Index++;if(m_Text[m_Index] == '.') m_Index++;while(isdigit(m_Text[m_Index])) m_Index++;if(m_Index - index == 0)throw ParserException("Number expected but not found!", m_Index);char buffer[32] = {0};memcpy(buffer, &m_Text[index], m_Index - index);return atof(buffer);}
With these defined, we can build the parser for the specified grammar.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 |
class Parser { Token m_crtToken; const char* m_Text; size_t m_Index; private: void Expression() { Term(); Expression1(); } void Expression1() { switch(m_crtToken.Type) { case Plus: GetNextToken(); Term(); Expression1(); break; case Minus: GetNextToken(); Term(); Expression1(); break; } } void Term() { Factor(); Term1(); } void Term1() { switch(m_crtToken.Type) { case Mul: GetNextToken(); Factor(); Term1(); break; case Div: GetNextToken(); Factor(); Term1(); break; } } void Factor() { switch(m_crtToken.Type) { case OpenParenthesis: GetNextToken(); Expression(); Match(')'); break; case Minus: GetNextToken(); Factor(); break; case Number: GetNextToken(); break; default: { std::stringstream sstr; sstr << "Unexpected token '" << m_crtToken.Symbol << "' at position " << m_Index; throw ParserException(sstr.str(), m_Index); } } } void Match(char expected) { if(m_Text[m_Index-1] == expected) GetNextToken(); else { std::stringstream sstr; sstr << "Expected token '" << expected << "' at position " << m_Index; throw ParserException(sstr.str(), m_Index); } } void SkipWhitespaces() { while(isspace(m_Text[m_Index])) m_Index++; } void GetNextToken() { // ignore white spaces SkipWhitespaces(); m_crtToken.Value = 0; m_crtToken.Symbol = 0; // test for the end of text if(m_Text[m_Index] == 0) { m_crtToken.Type = EndOfText; return; } // if the current character is a digit read a number if(isdigit(m_Text[m_Index])) { m_crtToken.Type = Number; m_crtToken.Value = GetNumber(); return; } m_crtToken.Type = Error; // check if the current character is an operator or parentheses switch(m_Text[m_Index]) { case '+': m_crtToken.Type = Plus; break; case '-': m_crtToken.Type = Minus; break; case '*': m_crtToken.Type = Mul; break; case '/': m_crtToken.Type = Div; break; case '(': m_crtToken.Type = OpenParenthesis; break; case ')': m_crtToken.Type = ClosedParenthesis; break; } if(m_crtToken.Type != Error) { m_crtToken.Symbol = m_Text[m_Index]; m_Index++; } else { std::stringstream sstr; sstr << "Unexpected token '" << m_Text[m_Index] << "' at position " << m_Index; throw ParserException(sstr.str(), m_Index); } } double GetNumber() { SkipWhitespaces(); int index = m_Index; while(isdigit(m_Text[m_Index])) m_Index++; if(m_Text[m_Index] == '.') m_Index++; while(isdigit(m_Text[m_Index])) m_Index++; if(m_Index - index == 0) throw ParserException("Number expected but not found!", m_Index); char buffer[32] = {0}; memcpy(buffer, &m_Text[index], m_Index - index); return atof(buffer); } public: void Parse(const char* text) { m_Text = text; m_Index = 0; GetNextToken(); Expression(); } }; |
The exception class is defined like this:
1 2 3 4 5 6 7 8 9 10 11 |
class ParserException : public std::exception { int m_Pos; public: ParserException(const std::string& message, int pos): std::exception(message.c_str()), m_Pos(pos) { } }; |
As you can see, the code for the grammar production is quite simple and straight forward. Now, let’s put it to the test.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
void Test(const char* text) { Parser parser; try { parser.Parse(text); std::cout << """ << text << ""t OK" << std::endl; } catch(ParserException& ex) { std::cout << """ << text << ""t " << ex.what() << std::endl; } } int main() { Test("1+2+3+4"); Test("1*2*3*4"); Test("1-2-3-4"); Test("1/2/3/4"); Test("1*2+3*4"); Test("1+2*3+4"); Test("(1+2)*(3+4)"); Test("1+(2*3)*(4+5)"); Test("1+(2*3)/4+5"); Test("5/(4+3)/2"); Test("1 + 2.5"); Test("125"); Test("-1"); Test("-1+(-2)"); Test("-1+(-2.0)"); Test(" 1*2,5"); Test(" 1*2.5e2"); Test("M1 + 2.5"); Test("1 + 2&5"); Test("1 * 2.5.6"); Test("1 ** 2.5"); Test("*1 / 2.5"); return 0; } |
The output for this testing program is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
"1+2+3+4" OK "1*2*3*4" OK "1-2-3-4" OK "1/2/3/4" OK "1*2+3*4" OK "1+2*3+4" OK "(1+2)*(3+4)" OK "1+(2*3)*(4+5)" OK "1+(2*3)/4+5" OK "5/(4+3)/2" OK "1 + 2.5" OK "125" OK "-1" OK "-1+(-2)" OK "-1+(-2.0)" OK " 1*2,5" Unexpected token ',' at position 6 " 1*2.5e2" Unexpected token 'e' at position 8 "M1 + 2.5" Unexpected token 'M' at position 0 "1 + 2&5" Unexpected token '&' at position 5 "1 * 2.5.6" Unexpected token '.' at position 7 "1 ** 2.5" Unexpected token '*' at position 4 "*1 / 2.5" Unexpected token '*' at position 1 |
Which is exactly what we expected: it validates correct expressions and throws an exception when the exception is incorrect.
In the next post I’ll show how to modify this code to build an abstract syntax tree.
1 Reply to “Evaluating Expressions – Part 2: Parse the Expression”