<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Friendly patterns and algorithms &#187; regex</title>
	<atom:link href="http://www.palgorithm.co.uk/tag/regex/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.palgorithm.co.uk</link>
	<description>Discussion of algorithms for games, graphics and general engineering</description>
	<lastBuildDate>Mon, 31 May 2010 11:21:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>On parsing, regex, haskell and some other cool things</title>
		<link>http://www.palgorithm.co.uk/2009/03/on-parsing-regex-haskell-and-some-other-cool-things/</link>
		<comments>http://www.palgorithm.co.uk/2009/03/on-parsing-regex-haskell-and-some-other-cool-things/#comments</comments>
		<pubDate>Sat, 21 Mar 2009 21:05:06 +0000</pubDate>
		<dc:creator>Sam Martin</dc:creator>
				<category><![CDATA[thoughts]]></category>
		<category><![CDATA[haskell]]></category>
		<category><![CDATA[parser combinators]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[sgrep]]></category>

		<guid isPermaLink="false">http://www.palgorithm.co.uk/?p=142</guid>
		<description><![CDATA[I&#8217;ve recently become slightly obsessed with finding ways (new or otherwise) to make parsing text really really simple. I&#8217;m concerned there are wide gaps in the range of current parsing tools, all of which are filled by pain.
It&#8217;s also a nice distraction from the C++ language proposal I was working on which is stalled while [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve recently become slightly obsessed with finding ways (new or otherwise) to make <a href="http://en.wikipedia.org/wiki/Parsing">parsing</a> text really really simple. I&#8217;m concerned there are wide gaps in the range of current parsing tools, all of which are filled by pain.</p>
<p>It&#8217;s also a nice distraction from the C++ language proposal I was working on which is stalled while I dig through more research. It turns out someone has already done something very similar to what I was thinking! So there will be a bit of a delay while I bottom that out properly.</p>
<p><strong>Parsed the pain.</strong><br />
Parsing with regular expressions covers a decent amount of simple low-hanging fruit. I happen to be a big fan of regex but it definitely doesn&#8217;t handle parsing &#8216;structured documents&#8217; very well. Here &#8216;structure&#8217; means some non-trivial pattern: perhaps matching braces, nested data or maybe a recursive structure.</p>
<p>This is by design. Regular expressions are, or were originally, a way of describing an expression in a <a href="http://en.wikipedia.org/wiki/Regular_grammar">&#8216;regular grammar&#8217;</a>. Their expressive power is actually very limited, and text doesn&#8217;t need to be that complex before it exceeds what a regular expression can describe. This regex email address parser is just about readable, but kind of pushing the limits:</p>
<p><code><span class="stdin">\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b</span></code></p>
<p>However, XML, HTML, pretty much all source code, every file format I&#8217;ve ever written &#8211; basically all the documents I care about &#8211; are not regular grammars. </p>
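<p>To make the limitation concrete: a regular expression in the strict sense has only a fixed, finite amount of state, so it cannot keep count of arbitrarily deep nesting. Here is a minimal Haskell sketch of the kind of check that is out of its reach, a brace matcher that carries a depth counter. The name <code>balanced</code> is my own choice for illustration, not something from a library.</p>

```haskell
-- Check that '{' and '}' are properly nested by tracking the current depth.
balanced :: String -> Bool
balanced = go 0
  where
    go :: Int -> String -> Bool
    go depth []         = depth == 0                     -- everything opened must be closed
    go depth ('{' : cs) = go (depth + 1) cs              -- one level deeper
    go depth ('}' : cs) = depth > 0 && go (depth - 1) cs -- a close needs a matching open
    go depth (_   : cs) = go depth cs                    -- ignore all other characters

main :: IO ()
main = mapM_ (print . balanced) ["{a{b}c}", "{a{b}c", "}{"]
```

<p>The counter is unbounded, which is exactly the bit a finite automaton cannot do. Nested data and recursive structures need the same kind of memory.</p>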
<p><strong>The pain in context</strong><br />
The next step up from a regular grammar in the <a href="http://en.wikipedia.org/wiki/Formal_grammar#The_Chomsky_hierarchy">Chomsky hierarchy</a> is a &#8216;<a href="http://en.wikipedia.org/wiki/Context-free_grammar">context-free</a>&#8217; grammar. Parsing a context-free grammar frequently involves writing a lexer and parser combination to do the work. The lexer breaks the character stream into &#8216;tokens&#8217; and the parser translates the token stream into a more meaningful <a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">tree structure</a>. Parsers in particular tend to be complex and lengthy pieces of code so you&#8217;d more often than not find yourself using a <a href="http://en.wikipedia.org/wiki/Comparison_of_parser_generators">parser generator</a> such as <a href="http://dinosaur.compilertools.net/">yacc</a>, <a href="http://www.gnu.org/software/bison/">bison</a>, or <a href="http://www.antlr.org/">antlr</a> to actually generate the code for you from a separate description of the grammar. This is all before you actually get to doing something useful with the tree the parser outputs.</p>
<p>Either way you cut it, this is a significant step up in pain from a regular expression. Your problem has suddenly jumped from a condensed one-liner to full-on procedurally-generated code. If the task you have in mind is just a bit more complex than a regex can handle, your pain increases out of all proportion to the increase in complexity.</p>
<p>Sadly, even context-free grammars don&#8217;t cut it much in practice. There&#8217;s a fair gap between the expressiveness of a context-free grammar and the real world of nasty ambiguous context-sensitive languages. I&#8217;m thinking mainly of the context-sensitivity of C++ where the task of writing a parser is full of painful implementation details. Not to mention that there is a further major leap to get close to parsing the world of natural languages, such as English.</p>
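<p>For a feel of what the lexer and parser split looks like when written by hand, here is a small Haskell sketch for arithmetic with <code>+</code>, <code>*</code> and parentheses. The grammar, the <code>Token</code> and <code>Expr</code> types and all the function names are assumptions made up for this example; a real parser generator would produce something far larger from a grammar file.</p>

```haskell
import Data.Char (isDigit, isSpace)

-- Tokens produced by the lexer.
data Token = TNum Int | TPlus | TTimes | TLParen | TRParen
  deriving (Show, Eq)

-- Tree produced by the parser.
data Expr = Num Int | Add Expr Expr | Mul Expr Expr
  deriving (Show, Eq)

-- Lexer: character stream to token stream (Nothing on an unknown character).
tokenize :: String -> Maybe [Token]
tokenize [] = Just []
tokenize (c : cs)
  | isSpace c = tokenize cs
  | isDigit c = let (ds, rest) = span isDigit (c : cs)
                in (TNum (read ds) :) <$> tokenize rest
  | c == '+'  = (TPlus :) <$> tokenize cs
  | c == '*'  = (TTimes :) <$> tokenize cs
  | c == '('  = (TLParen :) <$> tokenize cs
  | c == ')'  = (TRParen :) <$> tokenize cs
  | otherwise = Nothing

-- Parser: recursive descent over the grammar
--   expr   ::= term ('+' term)*
--   term   ::= factor ('*' factor)*
--   factor ::= number | '(' expr ')'
expr :: [Token] -> Maybe (Expr, [Token])
expr ts = do
  (t, rest) <- term ts
  exprTail t rest

exprTail :: Expr -> [Token] -> Maybe (Expr, [Token])
exprTail lhs (TPlus : ts) = do
  (rhs, rest) <- term ts
  exprTail (Add lhs rhs) rest
exprTail lhs ts = Just (lhs, ts)

term :: [Token] -> Maybe (Expr, [Token])
term ts = do
  (f, rest) <- factor ts
  termTail f rest

termTail :: Expr -> [Token] -> Maybe (Expr, [Token])
termTail lhs (TTimes : ts) = do
  (rhs, rest) <- factor ts
  termTail (Mul lhs rhs) rest
termTail lhs ts = Just (lhs, ts)

factor :: [Token] -> Maybe (Expr, [Token])
factor (TNum n : ts) = Just (Num n, ts)
factor (TLParen : ts) = do
  (e, rest) <- expr ts
  case rest of
    TRParen : rest' -> Just (e, rest')
    _               -> Nothing
factor _ = Nothing

-- Run both stages and insist that every token is consumed.
parseExpr :: String -> Maybe Expr
parseExpr s = do
  ts <- tokenize s
  (e, rest) <- expr ts
  if null rest then Just e else Nothing

main :: IO ()
main = print (parseExpr "1 + 2 * (3 + 4)")
```

<p>Even for a three-rule grammar that is about sixty lines, and all it gives you is the tree. That is roughly the jump in effort being complained about here.</p>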
<p><strong>Pain relief</strong><br />
There is no shortage of parsing tasks in the &#8220;slightly more complex than a regex&#8221; category. Context-free grammars actually contain several sub-categories that are more restrictive but simpler to parse, such as <a href="http://en.wikipedia.org/wiki/LL_parser">LL</a> and <a href="http://en.wikipedia.org/wiki/LR_parser">LR</a>. So it&#8217;s not really much of a surprise to discover that a typical &#8216;regex&#8217; isn&#8217;t actually a &#8216;regular grammar expression&#8217; any more.</p>
<p>Perl&#8217;s implementation of regex supports recursion, back references, and finite look-ahead, which allow it to handle some &#8211; <a href="http://www.perlmonks.org/?node_id=308283">maybe all</a> &#8211; context-free documents. I recently re-read the Perl <a href="http://perldoc.perl.org/perlretut.html">regex tutorial</a> to remind myself of it, and had some fun scraping the web for <a href="http://www.google.co.uk/search?q=tescos+voucher+codes&#038;ie=utf-8&#038;oe=utf-8&#038;aq=t&#038;rls=com.ubuntu:en-GB:unofficial&#038;client=firefox-a">tescos voucher codes</a>. I think the expansion beyond supporting just regular grammars is very helpful, but I don&#8217;t think it&#8217;s really bridging the gap to context-free parsing in a particularly manageable and re-usable way.</p>
<p>So, if Perl&#8217;s extended regex doesn&#8217;t cut it, what are the alternatives? Well, here are a couple of thoughts.</p>
<p><strong>Structured grep</strong><br />
I thought this was quite a nice find: <a href="http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html">sgrep</a> (&#8220;structured grep&#8221;). It&#8217;s similar to, but separate from, the familiar <a href="http://en.wikipedia.org/wiki/Grep">grep</a>. There are binaries for most platforms on-line, and it can also be found in <a href="http://www.cygwin.com/">Cygwin</a>, <a href="http://www.ubuntu.com/">Ubuntu</a> and probably most other Linux distros. At least in theory, it extends regular grammar pattern matching to support structure through the use of nested matching pairs and boolean operators.</p>
<p>Here&#8217;s how you might scrape an HTML document for the content of all &#8216;bold&#8217; tags:</p>
<p><code><span class="prompt">$ </span><span class="stdin">cat my_webpage.html | sgrep '"&lt;b&gt;" .. "&lt;/b&gt;"'</span></code></p>
<p>The .. infix operator matches the text regions delimited by the specified start and end strings. It also supports boolean operators like this:</p>
<p><code><span class="prompt">$ </span><span class="stdin">cat my_webpage.html | sgrep '"&lt;a&gt;" .. ("&lt;/a&gt;" or "&lt;/b&gt;")'</span></code></p>
<p>If you dig through the <a href="http://www.cs.helsinki.fi/u/jjaakkol/sgrepman.html">manual</a> you&#8217;ll come across macros and other cool operators such as &#8216;containing&#8217;, &#8216;join&#8217;, &#8216;outer&#8217; and so on. It seems easy to pick up and you can compose more complex expressions with macros. </p>
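<p>To show the idea behind a start-and-end region match, here is a naive Haskell sketch that pulls out the text between a start string and the first end string that follows it. This is only an approximation for illustration and makes no attempt to reproduce how sgrep handles nesting, overlapping regions or the delimiters themselves; <code>breakOn</code> and <code>regions</code> are hypothetical names, and the <code>{{</code> and <code>}}</code> delimiters are arbitrary.</p>

```haskell
import Data.List (stripPrefix)

-- Split the input at the first occurrence of a non-empty needle,
-- returning the text before it and the text after it.
breakOn :: String -> String -> Maybe (String, String)
breakOn needle = go []
  where
    go acc s = case stripPrefix needle s of
      Just rest -> Just (reverse acc, rest)
      Nothing   -> case s of
        []       -> Nothing
        (c : cs) -> go (c : acc) cs

-- All non-overlapping regions that start after `open` and end before the
-- first following `close`. Both delimiters are assumed to be non-empty.
regions :: String -> String -> String -> [String]
regions open close text =
  case breakOn open text of
    Nothing -> []
    Just (_, afterOpen) ->
      case breakOn close afterOpen of
        Nothing            -> []
        Just (body, rest') -> body : regions open close rest'

main :: IO ()
main = mapM_ putStrLn (regions "{{" "}}" "a {{bold}} and {{brave}} move")
```

<p>A boolean &#8216;or&#8217; on the end string would amount to trying several <code>close</code> candidates and taking whichever comes first, which this sketch leaves out.</p>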
<p>I would go on about it for longer but sadly its current implementation has a fairly major flaw &#8211; it has no support for regex! This feels like a bit of a simultaneous forwards and backwards step. I&#8217;m not actually sure whether it&#8217;s a fundamental flaw in the approach they&#8217;ve taken or whether the functionality is simply missing from the implementation. It&#8217;s a bit of a shame because I think it looks really promising, and if you are interested I&#8217;d recommend you take a moment to read a <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.3355">short article</a> on their approach. I found it an interesting read and have since hit upon a handful of mini-parsing problems that I found sgrep very helpful with.</p>
<p><strong>Parser combinators</strong><br />
This was a recent discovery, and it now surprises me I hadn&#8217;t come across it before. I think I didn&#8217;t know of it because it&#8217;s rather tightly bound to the realm of &#8216;functional&#8217; languages, which isn&#8217;t something I&#8217;ve spent that much time looking at until now. That&#8217;s all changing though, as I think I&#8217;m becoming a convert.</p>
<p>It occurred to me that a parser might be easier to write in a functional language: Parsing a grammar is kind of like doing algebra, and algebraic manipulation is the kind of thing functional languages are particularly good at. Googling these ideas turned up both <a href="http://en.wikipedia.org/wiki/Parser_Combinator">Parser Combinators</a>, an interesting parsing technique, and <a href="http://www.haskell.org/">Haskell</a>, a pure functional language where a &#8216;parser&#8217; is really a part of the language itself.</p>
<p>Parser combinators are a simple concept: You write a set of micro-parsers (my name for them) that do very basic parsing duties. Each is just a single function that, given a text string, returns a list of possible interpretations. Each interpretation is a pair of the interpreted object and the remaining text string. In Haskell, you&#8217;d write the type of all parsers (a bit like a template in C++) like this:</p>
<p><code><span class="stdin">type Parser a = String -> [(a,String)]</span></code></p>
<p>For an unambiguous input string the parser will produce a list with just one item; an ambiguous input will produce a list with more than one item, and an invalid input produces an empty list. An example micro-parser might just match a particular keyword at the start of the string.</p>
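<p>As a rough sketch of what such micro-parsers could look like, here are three of them written against that <code>Parser</code> type. The names <code>keyword</code>, <code>item</code> and <code>anyPrefix</code> are my own for illustration; <code>anyPrefix</code> is deliberately silly and only exists to show an ambiguous result.</p>

```haskell
import Data.List (stripPrefix)

type Parser a = String -> [(a, String)]

-- Micro-parser: succeed only if the input starts with the given keyword.
keyword :: String -> Parser String
keyword kw cs = case stripPrefix kw cs of
  Just rest -> [(kw, rest)]
  Nothing   -> []

-- Micro-parser: consume any single character.
item :: Parser Char
item (c : cs) = [(c, cs)]
item []       = []

-- A deliberately ambiguous parser: split off every non-empty prefix.
anyPrefix :: Parser String
anyPrefix cs = [splitAt n cs | n <- [1 .. length cs]]

main :: IO ()
main = do
  print (keyword "let" "let x = 1")  -- one interpretation
  print (keyword "let" "var x = 1")  -- no interpretation
  print (anyPrefix "ab")             -- two interpretations
```

<p>Each result pairs what was recognised with whatever input is left over, which is what lets the next parser pick up where this one stopped.</p>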
<p>Since all your parsers are of the same type, it&#8217;s now simple to compose them together into more complex parsers. This is modular programming at its most explicit.</p>
<p>It&#8217;s quite surprising how tiny and general the code to compose these parsers can be. You can reduce them to one-liners. Here are a few examples, again in Haskell:</p>
<p><code><br />
<span class="prompt">-- Here, m and n are always Parser types.</span><br />
<span class="prompt">-- p is a predicate, and b is a general function.</span></p>
<p><span class="prompt">-- parse-and-then-parse</span><br />
<span class="stdin">(m # n) cs = [((a,b),cs'') | (a,cs') <- m cs, (b,cs'') <- n cs']</span></p>
<p><span class="prompt">-- parse-or-parse</span><br />
<span class="stdin">(m ! n) cs = (m cs) ++ (n cs)</span></p>
<p><span class="prompt">-- parse-if-result</span><br />
<span class="stdin">(m ? p) cs = [(a,cs') | (a,cs') <- m cs, p a]</span></p>
<p><span class="prompt">-- parse-and-transform-result</span><br />
<span class="stdin">(m >-> b) cs = [(b a, cs') | (a,cs') <- m cs]</span></p>
<p><span class="prompt">-- parse-and-ignore-left-result</span><br />
<span class="stdin">(m -# n) cs = [(a,cs'') | (_,cs') <- m cs, (a,cs'') <- n cs']</span></p>
<p><span class="prompt">-- parse-and-ignore-right-result</span><br />
<span class="stdin">(m #- n) cs = [(a,cs'') | (a,cs') <- m cs, (_,cs'') <- n cs']</span><br />
</code></p>
<p>I&#8217;ve taken these examples from &#8220;<a href="http://www.cs.lu.se/EDA120/assignment4/parser.pdf">Parsing with Haskell</a>&#8221;, which is an excellent short paper and well worth a read.</p>
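<p>To see the combinators above working together, here is a self-contained sketch that repeats their definitions with type signatures and builds a tiny parser out of them. The fixity declarations, the type signatures, the rename of the transform function to <code>f</code>, and the helpers <code>item</code>, <code>char</code>, <code>digit</code>, <code>wrapped</code> and <code>sumOfTwo</code> are all my own assumptions for this example and may not match what the paper uses.</p>

```haskell
import Data.Char (isDigit, digitToInt)

type Parser a = String -> [(a, String)]

-- Fixities are an assumption of this sketch; the post does not give any.
infixl 7 ?
infixl 6 #, -#, #-
infixl 5 >->
infixl 3 !

-- parse-and-then-parse
(#) :: Parser a -> Parser b -> Parser (a, b)
(m # n) cs = [((a, b), cs'') | (a, cs') <- m cs, (b, cs'') <- n cs']

-- parse-or-parse
(!) :: Parser a -> Parser a -> Parser a
(m ! n) cs = m cs ++ n cs

-- parse-if-result
(?) :: Parser a -> (a -> Bool) -> Parser a
(m ? p) cs = [(a, cs') | (a, cs') <- m cs, p a]

-- parse-and-transform-result
(>->) :: Parser a -> (a -> b) -> Parser b
(m >-> f) cs = [(f a, cs') | (a, cs') <- m cs]

-- parse-and-ignore-left-result
(-#) :: Parser a -> Parser b -> Parser b
(m -# n) cs = [(b, cs'') | (_, cs') <- m cs, (b, cs'') <- n cs']

-- parse-and-ignore-right-result
(#-) :: Parser a -> Parser b -> Parser a
(m #- n) cs = [(a, cs'') | (a, cs') <- m cs, (_, cs'') <- n cs']

-- Micro-parser: consume any single character.
item :: Parser Char
item (c : cs) = [(c, cs)]
item []       = []

-- One specific character.
char :: Char -> Parser Char
char c = item ? (== c)

-- One decimal digit, converted to its numeric value.
digit :: Parser Int
digit = item ? isDigit >-> digitToInt

-- Either a bare digit or a digit wrapped in parentheses, e.g. "7" or "(7)".
wrapped :: Parser Int
wrapped = digit ! (char '(' -# digit #- char ')')

-- Two such values in sequence, added together.
sumOfTwo :: Parser Int
sumOfTwo = wrapped # wrapped >-> uncurry (+)

main :: IO ()
main = do
  print (wrapped "(7)rest")
  print (sumOfTwo "3(4)")
  print (sumOfTwo "x")
```

<p>Notice that <code>wrapped</code> and <code>sumOfTwo</code> contain no explicit string handling at all. They are just smaller parsers glued together, which is the whole appeal.</p>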
<p>Learning Haskell has been something of a revelation. I had glanced at Objective CAML and Lisp before, but I&#8217;m actually really quite shocked at how cool Haskell is and that it took me so long to find it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.palgorithm.co.uk/2009/03/on-parsing-regex-haskell-and-some-other-cool-things/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
