Regular expressions do have limitations, but have you considered the following? The .NET framework is unique when it comes to regular expressions in that it supports balancing group definitions. As described above, properly balanced constructs cannot be described by a regular expression in the formal sense. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized. These constructs allow a .NET regular expression to emulate a restricted PDA (pushdown automaton) by essentially allowing simple versions of the stack operations: push, pop and empty.
These simple operations are pretty much equivalent to increment, decrement and compare to zero, respectively. This allows the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. That, in turn, allows .NET regular expressions to recognize individual properly balanced constructs.
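To make the counter idea concrete, here is a minimal sketch in Python of the same three operations applied to one kind of bracket. It illustrates the counter model only and is not how the .NET engine is implemented; the function name `is_balanced` and its parameters are made up for this example.

```python
def is_balanced(text, open_ch="(", close_ch=")"):
    """Check one kind of bracket using a single counter.

    The counter stands in for the restricted stack:
    increment = push, decrement = pop, compare to zero = empty.
    """
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1              # push
        elif ch == close_ch:
            if depth == 0:          # pop on an empty stack: unbalanced
                return False
            depth -= 1              # pop
    return depth == 0               # empty: every opener was closed


if __name__ == "__main__":
    for sample in ["(a(b)c)", "(()", ")(", ""]:
        print(repr(sample), is_balanced(sample))
```

A single counter is enough here because only one bracket type is tracked; mixing several bracket types that must match each other would need a real stack.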
And if you are comfortable with coding regexes, they are way faster to code than XPaths, and almost certainly less fragile to changes in what you are scraping.
On the other hand, what if someone runs an HTML minifier that removes all the whitespace the browser doesn't render? An XML parser won't care, and neither will a well-written XPath statement.
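The whitespace point can be tried out with a small sketch. It assumes well-formed markup, uses the limited XPath support in Python's standard `xml.etree.ElementTree`, and simulates the minifier with a simple substitution. The regex is deliberately naive and layout-dependent; a more careful pattern would survive this particular change, so treat this as an illustration rather than a benchmark. All names below are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

PRETTY = """<html>
  <body>
    <ul>
      <li>
        <a href="/one">One</a>
      </li>
      <li>
        <a href="/two">Two</a>
      </li>
    </ul>
  </body>
</html>"""

# Stand-in for a minifier: drop whitespace that sits between tags.
MINIFIED = re.sub(r">\s+<", "><", PRETTY)

# Naive pattern that silently depends on the newline and indentation
# between <li> and <a>.
NAIVE = re.compile(r'<li>\n\s+<a href="([^"]*)">')


def hrefs_by_regex(doc):
    return NAIVE.findall(doc)


def hrefs_by_xpath(doc):
    # The path addresses structure (an <a> directly inside an <li>), not layout.
    root = ET.fromstring(doc)
    return [a.get("href") for a in root.findall(".//li/a")]


if __name__ == "__main__":
    for name, doc in [("pretty", PRETTY), ("minified", MINIFIED)]:
        print(name, hrefs_by_regex(doc), hrefs_by_xpath(doc))
```

The path-based lookup returns the same links for both versions, while the layout-dependent pattern finds nothing once the whitespace is gone.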
Since a Type 2 (context-free) grammar is fundamentally more complex than a Type 3 (regular) grammar (see the Chomsky hierarchy), a true regular expression can't possibly parse arbitrarily nested markup such as HTML.
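A quick way to see the limitation in practice is to try matching a tag pair with Python's `re` module, which has no recursion or balancing constructs. The sample strings and the helper name `first_div_body` are made up for this sketch; it shows two common attempts failing, not a proof of the general result.

```python
import re

NESTED = "<div>outer <div>inner</div> tail</div>"
SIBLINGS = "<div>a</div><div>b</div>"

LAZY = re.compile(r"<div>(.*?)</div>")    # stops at the first </div>
GREEDY = re.compile(r"<div>(.*)</div>")   # runs to the last </div>


def first_div_body(pattern, doc):
    """Return what the pattern believes is the body of the first <div>."""
    match = pattern.search(doc)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Lazy matching cuts the outer element short on nested input.
    print(first_div_body(LAZY, NESTED))
    # Greedy matching swallows the boundary between sibling elements.
    print(first_div_body(GREEDY, SIBLINGS))
```

Neither quantifier can know which closing tag belongs to which opening tag, because that requires counting nesting depth.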
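That said, some regex engines go beyond regular languages: Perl's, for example, allows recursion and embedded code inside a pattern, which makes it effectively Turing complete.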