I’m hoping that people can point me to practical introductions to PEG parsing. Article length (i.e., 25-50 pages) would be ideal, but book suggestions are welcome too. This may be obvious, but by “practical” I mean that I’m interested in material with a lot of examples and usable tips rather than theoretical accounts. A good example of the kind of thing I’m interested in is Roberto Ierusalimschy’s Mastering LPeg (PDF link).
Brief background on why: I’m a self-taught, amateur programmer, and I’m interested in using LPeg (especially within neovim) for text manipulation and scanning. I don’t have the kind of math or CS background to deal with the theory, but I find LPeg appealing as an alternative to vim regexes or Lua patterns.
Hey! I know a little about this. There’s a programming language called Janet that is very heavily inspired by Lua, and it has a PEG library built-in that was very heavily inspired by LPeg. It uses a different notation, but the ideas all translate, and it uses a lot of the same names/operators, just with a prefix flair.
I wrote a book with a chapter on PEGs that covers, well, kinda just the basics, but has an example of writing a bare-bones HTML parser.
I did Advent of Code last year in Janet because almost every problem starts out with a little parsing puzzle, and I thought it would be a good way to practice writing a bunch of diverse PEGs. Recently I’ve been recording myself solving some of them again, mostly to highlight the PEG-parsing parts. Pretty low-effort videos, but it might be interesting to watch the beginnings of them to see a few examples of PEGs. None of them are very complicated – AoC inputs are pretty easy to parse – but there’s some variety.
I’ve contributed a little bit to Janet’s PEG engine, and I think you might be delightfully surprised by how simple the execution model is – you’re basically just moving a cursor through a buffer and keeping track of a couple stacks. It kinda, I dunno, kinda demystified it a lot and made it easier to both write and debug PEGs.
Thanks so much. I will take a look at your book—it also gives me a good excuse to learn at least a little about Julia.
Guido van Rossum wrote a series of blog posts about PEG before using it for the CPython parser.
Thanks: if parsing Python isn’t practical, then I don’t know what is.
For anyone who wants the posts but doesn’t feel like fighting with Medium:
One outside the box option is to play live with a parser, the pest parser for rust has a live page that lets you write grammars and see what they parse out. There’s a drop down to the right of the test input box that lets you select which rule is being ‘looked for’. I find reading about parsers is a lot like reading about running or baseball or chess, it can be very helpful, but the real learning comes ‘on the field’
https://pest.rs/?g=N4Ig5gTghgtjURALhFANgBwBZQAQF5dhcByKEgOgpIC8TcAfUgQUuoC16BfAHQDsAJgEswQgC4EipAAxsSATm79%2BQgQFM%2BEwsQAU6bHibDRYgJQBqXFxAAaEEL4YArmOSoARgHYAzCC5A#editor
I’ll second Pest. It’s really a great library. I’ve used a number of times and had a pleasant experience every time!
Why do you like pest more than nom? (I’ve used nom but not pest).
To me the biggest issue with nom was the versions jumping. Using it on purpose once left the code un-update-able because one of the used methods disappeared in a later version. I think the jumping in major versions has also made the tertiary docs (ie blog posts, comments, etc) less useful because they may not refer to the latest version. So it’s not that nom is bad, it’s that nom is a bunch of parsers under one name. https://crates.io/crates/nom/versions
I’ve only used pest for a personal project and it was pretty straightforward and fun. The docs make sense, the code just works, and it saved me a lot of pain doing more manual work.
For serious serious stuff (read working at a major security vendor) I used
untrusted
https://github.com/briansmith/untrusted which is the best for parsing untrusted binary input.Agreed. I use Lua’s REPL with LPeg to experiment as I read, but I will try out the pest web interface as well.
Pest has the slight advantage of being less characters to define a grammar, but it’s not exactly the original PEG characters, I think | instead of / which is fine by me. Given that you are targeting neovim + Lua (w/ ~25 years of vi/vim experience I still find neovim to be … a bit of a hassle to get compiled and running) your method of using lpeg in the lua repl is probably ideal. I’d only try the pest fiddle if you wanted to quickly figure out why something wasn’t working.
Thanks for the follow-up. I’ll keep the fiddle in mind for debugging.
Re building neovim, I’ve had good lucky building nightly for a few years now with this (relatively minimal) script. Every now and again, I need to kickstart ninja by running
make distclean
in neovim’s source directory, but otherwise this works well for me on both macOS and Debian. (If you try it, remove thenbundleup
section. That invokes Paq with a local script to update plugins.) Then again, after 25 years, you may be perfectly happy with just vim. (I switched primarily because I like Lua a lot and Vimscript not at all.)Not sure if helpful or not but I wrote this article on using a PEG parser library to implement a hash-like (dictionary like) https://www.schneems.com/2018/04/30/how-to-implement-a-ruby-hash-like-grammar-in-parslet/
Thanks: I will take a look.
For LPEG specifically, I think Roberto’s new tutorial that you linked is the best, but here are a few older resources:
I don’t have an introduction, but I do have a repo of many LPEG based parsers that you can look through.
Agreed about Roberto’s tutorial being very good. Thanks for the other tips. Those names remind me of the Lua mailing list. (Speaking of which: long time, no talk. I hope you are good.)
Yes! I had not noticed it was you when I replied :)
Please, don’t. Use context free grammars (CFG) instead. I wrote a long answer to you. But decided to publish it as a blog post: https://lobste.rs/s/nybhsl/this_is_why_you_should_never_use_parser . (The post also lists situations, when PEG may be a good thing.)
jsparse is a simple implementation that’s useful to learn about PEGs. Two posts about it:
https://bluishcoder.co.nz/2007/10/31/javascript-packrat-parser.html https://bluishcoder.co.nz/2007/10/03/javascript-parser-combinators.html
Thanks for the tips. It will be fun to look the same basic concepts in libraries written in different languages.
Do you want to learn how to use (write a grammar using) a PEG parser generator, or do you want to learn how PEG parsing works? Those are different things :) I used a PEG tool to create a parser for a SQL dialect, but I only vaguely know the theory behind PEG.
Mostly I learned by reading the tool’s docs, then examining several existing grammar files, then a bunch of trial and error…
There are actually (at least) two answers to how PEG parsing works!
PEGs were invented by Bryan Ford who also invented packrat parsing, so the first PEG parsers were packrat parsers. But packrat parsers are (as their name suggests) memory-hungry.
LPeg works more like a backtracking regex engine, so it’s more frugal with memory but can’t always parse in linear time. However, the recognizer described by a PEG has less nondeterminism than a regex, so a PEG parser is less likely to get into an exponential backtracking search than a regex engine. Another way of thinking about it is that the kind of recognizer described by a PEG is very close to a recursive descent parser, so as the author of a grammar you have a lot of control over how much backtracking can occur, more than in most other grammar formalisms.
I want to learn how to write grammars (and scanners) using a PEG generator like LPeg.