I've recently become slightly obsessed about finding ways (new or otherwise) to make parsing text really really simple. I'm concerned there are wide gaps in the range of currently parsing tools, all of which are filled by pain.
It's also a nice distraction from the C++ language proposal I was working on which is stalled while I dig through more research. It turns out someone has already done something very similar to what I was thinking! So there will be a bit of a delay while I bottom that out properly.
Parsed the pain.
Parsing with regular expressions covers a decent amount of simple low-hanging fruit. I happen to be a big fan of regex but it definitely doesn't handle parsing 'structured documents' very well. Here 'structure' means some non-trivial pattern: perhaps matching braces, nested data or maybe a recursive structure.
This is by design. Regular expressions are, or were originally, a way of describing an expression in a 'regular grammar'. Its expressive power is actually very limited and text doesn't need to be that complex before it exceeds the expressiveness of a regular expression. This regex email address parser is just about readable, but kind of pushing the limits:
However, XML, HTML, pretty much all source code, every file format I've ever written - basically all the documents I care about - are not regular grammars.
The pain in context
The next step up from a regular grammar in the Chomsky hierarchy is a 'context free' grammar. Parsing a context-free grammar frequently involves writing a lexer and parser combination to do the work. The lexer breaks the character stream into 'tokens' and the parser translates the token stream into a more meaningful tree structure. Parsers in particular tend to be complex and lengthy pieces of code so you'd more often than not find yourself using a parser generator such as yacc, bison, or antlr to actually generate the code for you from a separate description of the grammar. This is all before you actually get to doing something useful with the tree the parser outputs.
Either way you cut it, this is a significant step up in pain from a regular expression. Your problem has suddenly jumped from a condensed one-liner to full-on procedurally-generated code. If the task you have in mind is just a bit more complex than a regex can handle your pain increases disproportional with this increase in complexity.
Sadly, even context-free grammars don't cut much in practice. There's a fair gap between the expressiveness of context-free grammar and the real world of nasty ambiguous context-sensitive languages. I'm thinking mainly of the context-sensitivity of C++ where the task of writing a parser is full of painful implementation details. Not to mention that there is a further major leap to get close to parsing the world of natural languages, such as English.
There are no shortage of parsing tasks in the "slightly more complex than a regex" category. Context-free grammars actually contain several sub-categories that are more restrictive but simpler to parse, such as LL and LR. So it's not really much of a surprise to discover that a typical 'regex' isn't actually a 'regular grammar expression' any more.
Perl's implementation of regex supports recursion, back references, and finite look-ahead which allow it handle some - maybe all - context-free documents. I recently re-read the Perl regex tutorial to remind myself of it, and had some fun scraping the web for tescos voucher codes. I think the expansion beyond supporting just regular grammars is very helpful, but I don't think it's really bridging the gap to context-free parsing in a particularly manageable and re-usable way.
So, if Perl's extended regex doesn't cut it, what are the alternatives? Well, here's a couple of thoughts.
I thought this was quite a nice find: sgrep ("structured grep"). It's similar to, but a separate from the familiar grep. There are binaries for most platforms on-line as well as being found in Cygwin, Ubuntu and probably most other Linux distros. At least in theory, it extends regular grammar pattern matching to support structure through the use of nested matching pairs and boolean operators.
Here's how you might scrap a html document for the content of all 'bold' tags:
$ cat my_webpage.html | sgrep '"" .. ""'
The .. infix operator matches the text region with the specified start and end text strings. It also support boolean operators like this:
$ cat my_webpage.html | sgrep '"".. ("" or "")'
If you dig through the manual you'll come across macros and other cool operators such as 'containing', 'join', 'outer' and so on. It seems easy to pick up and you can compose more complex expressions with macros.
I would go on about it for longer but sadly it's current implementation has a fairly major flaw - it has no support for regex! This feels like a bit of simultaneous forwards and backwards step. I'm not actually sure whether it's a fundamental flaw in the approach they've taken or whether the functionality is simple missing from the implementation. It's a bit of shame because I think it looks really promising, and if you are interested I'd recommend you take a moment to read a short article on their approach. I found it an interesting read and have since hit upon a handful of mini-parsing problems that I found sgrep very helpful with.
This was a recent discovery, and it now surprises me I hadn't come across it before. I think I didn't know of it because it's rather tightly bound to the realm of 'functional' languages, which isn't something I've spent that much time looking at until now. That's all changing though, as I think I'm becoming a convert.
It occured to me that a parser might be easier to write in a functional language: Parsing a grammar is kind of like doing algebra, and algebraic manipulation is the kind of thing functional languages are particularly good at. Googling these ideas turned up both Parser Combinators, an interesting parsing technique, and Haskell, a pure functional language where a 'parser' is really a part of the language itself.
Parse combinators are a simple concept: You write a set of micro-parsers (my name for them) that do very basic parsing duties. Each is just a single function that given a text string, returns a list of possible interpretations. Each interpretation is a pair of the interpreted object and the remaining text string. In Haskell, you'd write the type of all parsers (a bit like a template in C++) like this:
type Parser a = String -> [(a,String)]
For an unambiguous input string the parser will produce a list with just one item, ambiguous inputs will produce a list with more than one item, and an invalid input produces an empty list. An example micro-parser might just match a particular keyword at the start of the string.
Since all your parsers are of the same type, it's now simple to compose them together into more complex parsers. This is modular programming at its most explicit.
It's quite surprisingly how tiny and general the code to compose these parsers can be. You can reduce them to one-liners. Here's a few examples, again in Haskell:
-- Here, m and n are always Parser types.
-- p is a predicate, and b is a general function.
(m # n) cs = [((a,b),cs'') | (a,cs') <- m cs, (b,cs'') <- n cs']
(m ! n) cs = (m cs) ++ (n cs)
(m ? p) cs = [(a,cs') | (a,cs') <- m cs, p a]
(m >-> b) cs = [(b a, cs') | (a,cs') <- m cs]
(m -# n) cs = [(a,cs'') | (_,cs') <- m cs, (a,cs'') <- n cs']
(m #- n) cs = [(a,cs'') | (a,cs') <- m cs, (_,cs'') <- n cs']
I've taken these examples from "Parsing with Haskell", which is an excellent short paper and well worth a read.
Learning Haskell has been something of a revelation. I had glanced at Objective CAML and Lisp before, but I'm actually really quite shocked at how cool Haskell is and that it took me so long to find it.