Entry
Regular expressions & parsing with quotes
Jul 5th, 2000 09:59
Nathan Wallace, unknown unknown, Hans Nowak, Snippet 66, Tim Peters
"""
Packages: text.regular_expressions
"""
"""
> I am trying to write code to parse Forth-ish strings. It should support
> quotes, too.
Hans, that's hardly an implementable definition <0.6 wink>.
> The following string
>
> 2 3 dup "hello world" . blah foo 44
>
> should parse to
>
> ['2', '3', 'dup', '"hello world"', '.', 'blah', 'foo', '44']
> ...
> 1) This parser also takes strings like
>
> "hello world"x y z
>
> ['"hello world"', 'x', 'y', 'z']
>
> which I don't want. "hello world"x should be an error. Q: How can I trap
> this?
>
> 2) I would like to include double quotes in strings, ala C, using \",
> thus allowing things like "\Hi!\" he said". Q: Can this be done using
> regular expressions?
See the attached. Given

    print parse(r'2 3 dup "hello world" . blah foo 44')
    print parse(r'spam "\Hi!\" he said" eggs')
    print parse(r'"hello world"x y z')

it prints

    ['2', '3', 'dup', '"hello world"', '.', 'blah', 'foo', '44']
    ['spam', '"\\Hi!\\" he said"', 'eggs']
    Traceback (innermost last):
      File "misc2/spam.py", line 33, in ?
        print parse(r'"hello world"x y z')
      File "misc2/spam.py", line 25, in parse
        raise ValueError("can't parse string at index " +
    ValueError: can't parse string at index ('"hello world"x y z', 0)
You're pushing regular expressions beyond what they're good at, though.
Study Friedl's book ("Mastering Regular Expressions", O'Reilly) if you're
determined to persist <wink>.
if-parsing-were-easy-regexps-wouldn't-suck-at-it-ly y'rs - tim
"""
import re

findtoken = re.compile(r"""
    \s*                     # skip leading ws
    (
        "                   # open quote
        [^"\\]*             # normal characters
        (?: \\.             # chew up backslash and following char
            [^"\\]*         # normal characters
        )*                  # repeat as needed to chew up the backslashes
        "                   # close quote
    |                       # or ...
        [^"\s]+             # buncha non-quote/ws
    )
    (?: \s+ | $)            # followed by ws or end-of-line
""", re.VERBOSE).match
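def parse(s):
    # Return the list of tokens in s; raise ValueError if no token
    # can be matched at the current position.
    result = []
    n = len(s)
    i = 0
    while i < n:
        m = findtoken(s, i)
        if not m:
            # repr() is the portable spelling of the old backtick syntax
            raise ValueError("can't parse string at index " +
                             repr((s, i)))
        result.append(m.group(1))
        i = m.end()
    return result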
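
A minimal sketch (written for Python 3) that isolates the quoted-string branch of findtoken, to show how the "normal characters, then backslash plus one character, repeated" shape steps over escaped quotes. The names `quoted` and `quoted_prefix` are hypothetical and not part of the original snippet; the pattern is the same branch written without re.VERBOSE.

```python
import re

# Hypothetical name: only the quoted-string branch of findtoken,
# written on one line instead of in re.VERBOSE form.
quoted = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

def quoted_prefix(text):
    """Return the quoted string at the start of text, or None if there is none."""
    m = quoted.match(text)
    return m.group(0) if m else None

# Plain string: no backslashes, so the repeated group matches zero times.
print(quoted_prefix(r'"hello world" rest'))
# Escapes: each backslash swallows the next character, so \" does not close the string.
print(quoted_prefix(r'"\Hi!\" he said" eggs'))
# The only other quote is escaped, so no closing quote is found and the match fails.
print(quoted_prefix(r'"unterminated \" quote'))
```

Because a backslash is always consumed together with the character after it, the closing quote can only match a quote that is not escaped.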
if __name__ == "__main__":
    import sys
    stuff = [
        r'2 3 dup "hello world" . blah foo 44',
        r'spam "\Hi!\" he said" eggs',
        r'"hello world"x y z',
    ]
    for s in stuff:
        try:
            print(parse(s))
        except ValueError:
            # parse() signals a bad token with ValueError
            print(sys.exc_info())
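
For comparison with the regular expression, here is a hedged sketch (Python 3) of a hand-written scanner for the same token rules: whitespace-separated words, double-quoted strings with backslash escapes, and an error when a closing quote is followed directly by another character. This is a swapped-in technique, not the approach from the post; the name `scan` and the error messages are hypothetical, and it is not guaranteed to agree with findtoken in every corner case (for example, a backslash followed by a newline).

```python
def scan(text):
    """Split text into tokens, keeping "quoted strings" intact.

    Raises ValueError on an unterminated string, on a character glued to
    a closing quote, or on a quote in the middle of a bare word.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        start = i
        if text[i] == '"':
            i += 1
            while i < n and text[i] != '"':
                # A backslash consumes the following character as well.
                i += 2 if text[i] == '\\' else 1
            if i >= n:
                raise ValueError("unterminated string at index %d" % start)
            i += 1  # step past the closing quote
            if i < n and not text[i].isspace():
                raise ValueError("junk after closing quote at index %d" % i)
        else:
            while i < n and not text[i].isspace():
                if text[i] == '"':
                    raise ValueError("unexpected quote at index %d" % i)
                i += 1
        tokens.append(text[start:i])
    return tokens

for sample in (r'2 3 dup "hello world" . blah foo 44',
               r'spam "\Hi!\" he said" eggs',
               r'"hello world"x y z'):
    try:
        print(scan(sample))
    except ValueError as exc:
        print("error:", exc)
```

The loop is longer than the pattern, but each failure has its own branch, so the error can name the position where scanning actually stopped.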