Regular expressions & parsing with quotes

Jul 5th, 2000 09:59
Nathan Wallace, Hans Nowak, Snippet 66, Tim Peters


"""
Packages: text.regular_expressions
"""

"""
> I am trying to write code to parse Forth-ish strings. It should support
> quotes, too.

Hans, that's hardly an implementable definition <0.6 wink>.

> The following string
>
> 2 3 dup "hello world" . blah foo 44
>
> should parse to
>
> ['2', '3', 'dup', '"hello world"', '.', 'blah', 'foo', '44']

> ...
> 1) This parser also takes strings like
>
> "hello world"x y z
>
> ['"hello world"', 'x', 'y', 'z']
>
> which I don't want. "hello world"x should be an error. Q: How can I trap
> this?
>
> 2) I would like to include double quotes in strings, a la C, using \",
> thus allowing things like "\Hi!\" he said". Q: Can this be done using
> regular expressions?

See the attached.  Given

print parse(r'2 3 dup "hello world" . blah foo 44')
print parse(r'spam "\Hi!\" he said" eggs')
print parse(r'"hello world"x y z')

it prints

['2', '3', 'dup', '"hello world"', '.', 'blah', 'foo', '44']
['spam', '"\\Hi!\\" he said"', 'eggs']
Traceback (innermost last):
  File "misc2/spam.py", line 33, in ?
    print parse(r'"hello world"x y z')
  File "misc2/spam.py", line 25, in parse
    raise ValueError("can't parse string at index " +
ValueError: can't parse string at index ('"hello world"x y z', 0)

You're pushing regular expressions beyond what they're good at, though.
Study Friedl's book ("Mastering Regular Expressions", O'Reilly) if you're
determined to persist <wink>.

if-parsing-were-easy-regexps-wouldn't-suck-at-it-ly y'rs  - tim
"""

import re

findtoken = re.compile(r"""
    \s*         # skip leading ws
    (
        "           # open quote
        [^"\\]*     # normal characters
        (?: \\.         # chew up backslash and following char
            [^"\\]*     # normal characters
        )*          # repeat as needed to chew up the backslashes
        "           # close quote
    |           # or ...
        [^"\s]+     # buncha non-quote/ws
    )
    (?: \s+ | $) # followed by ws or end-of-line
""", re.VERBOSE).match

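def parse(s):
    """Return the list of tokens in s; raise ValueError on malformed input."""
    result = []
    n = len(s)
    i = 0
    while i < n:
        # Anchor the match at i: a token must start exactly here.
        m = findtoken(s, i)
        if not m:
            raise ValueError("can't parse string at index " +
                             repr((s, i)))
        result.append(m.group(1))
        i = m.end()     # resume after the token and its trailing whitespace
    return result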


if __name__ == "__main__":

    import sys

    stuff = [
        r'2 3 dup "hello world" . blah foo 44',
        r'spam "\Hi!\" he said" eggs',
        r'"hello world"x y z',
    ]

    for s in stuff:
        try:
            print(parse(s))
        except ValueError:
            # Malformed input: show the exception info instead of aborting.
            print(sys.exc_info())
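
Tim's remark that this pushes regular expressions beyond what they're good at suggests trying the same job without them. The following is only a sketch of that alternative, not part of the original snippet: a hand-written scanner in Python 3 that aims to accept and reject the same three sample strings. The name scan_tokens and the wording of its error messages are made up for illustration, and it has not been checked against the regexp version beyond those samples.

```python
def scan_tokens(text):
    """Split text on whitespace, keeping "quoted strings" intact.

    Hypothetical helper (not from the original snippet).  A backslash
    inside quotes protects the following character, so \\" does not
    end the string.  Raises ValueError on malformed input.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1                      # skip whitespace between tokens
            continue
        start = i
        if text[i] == '"':
            i += 1                      # step past the opening quote
            while i < n and text[i] != '"':
                # A backslash consumes itself plus the next character.
                i += 2 if text[i] == "\\" else 1
            if i >= n:
                raise ValueError("unterminated string at index %d" % start)
            i += 1                      # step past the closing quote
            if i < n and not text[i].isspace():
                # Catches input such as "hello world"x
                raise ValueError("junk after closing quote at index %d" % i)
        else:
            while i < n and not text[i].isspace():
                if text[i] == '"':
                    raise ValueError("stray quote at index %d" % i)
                i += 1
        tokens.append(text[start:i])
    return tokens


if __name__ == "__main__":
    print(scan_tokens(r'2 3 dup "hello world" . blah foo 44'))
    print(scan_tokens(r'spam "\Hi!\" he said" eggs'))
```

Unlike the regexp, the scanner can say what went wrong and where (unterminated string, junk after a closing quote, stray quote), at the cost of more code to read and maintain.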