-
Simple Top-Down Parsing in Python (effbot.org)
In the early seventies, Vaughan Pratt published an elegant improvement to recursive-descent in his paper Top-down Operator Precedence. Pratt's algorithm associates semantics with tokens instead of grammar rules, and uses a simple 'binding power' mechanism to handle precedence levels. /.../ In this article, I'll briefly explain how the algorithm works, discuss different ways to implement interpreters and translators with it in Python, and finally use it to implement a parser for Python's expression syntax.
It seems that a closing paren is indistinguishable from "(end)": test("1)") yields (literal 1). Similarly for brackets, braces, and commas. I'm not sure how to improve error detection in the parser.
BTW, using ideas from this article, I've managed to write a regexp parser with postfix operators that wrap their arguments instead of modifying them.
Excellent article - thank you very much! That was "the last straw" that got me going ...
One thing that struck me: Both you and Crockford avoid introducing "," (comma) as an infix operator, to parse all kinds of lists as expressions, which would be very natural to me. Is there a general problem with doing so, or is it just a deliberate design decision?
I probably found the answer myself, it's overgeneration, right?! You lose control over substructures, like the ":" pairs in dicts, or the initializer part in C-style for-loops, if you allow to recurse in generic expression parsing.
Interesting article. BTW, the line `raise SyntaxError("unknown operator)' is lacking a closing double-quote.
Very nice article. However, I'm having trouble implementing postfix operators... is it possible with this approach?
More specifically, postfix operators with high precedence. For instance, say I want to write a parser of regular expressions. "ab*c" should be equivalent to "a(b*)c", not "(ab)*c"... the "b" somehow has to telepathically know that a "*" is coming up, and to combine with that, not with the "a" before it.
I stumbled upon that myself recently, but in that case, it was sufficient to treat the postfix operators as "modifiers" -- just let the operator's "led" method modify the left node, and then return that node instead of itself. More extensive tree modifications are also possible, but is more work.
Excellent suggestion, thanks. In my case the solution was to have the postfix "led" make a copy of its "left" argument, then modify the original to have a new "id", signifying the postfix operation, with the copy as its "first" argument. Works beautifully.
Hello,
any chance we will see c?ElementTree for parsing?
As in an easy-to-use library for top-down parsing, or something else?
As in easy-to-use (and fast) library for top-down parsing, YES!
Some errors :
At the start of the article, inside tokenize function, you have to read :
if number:
yield integer_literal_token(int(number))
and inside class end_token:
lbp = 0 # and not ldp = 0
Should be fixed now (I hope). Thanks! /F
Now I feel like inventing my own language ;)
btw, thanks for the PDF!
I've made a pdf from your article : http://kib2.free.fr/pdfs/SimpleTopDownParsing.pdf
Thanks Fredrick, very nice article.