Translating regsub.splitx() to re.split()

Jul 5th, 2000 09:59
Nathan Wallace, Hans Nowak, Snippet 22, Tim Peters


"""
Packages: text.regular_expressions
"""

"""
> I have been using
>
>     regsub.splitx('AbbbC', r'\([a-z]\)\1*')
>
> to split something like 'AbbbC' into ['A', 'bbb', 'C'].
>
> How do I translate it into re.split()?

With great difficulty <wink>.  Interesting problem!

>  I tried
>
>     re.split(r'([a-z])+', 'AbbbC')      -> ['A', 'b', 'C']
>     re.split(r'(([a-z])+)', 'AbbbC')    -> ['A', 'bbb', 'b', 'C']
>     re.split(r'([a-z])\1*', 'AbbbC')    -> ['A', 'b', 'C']

Those are all functioning as documented.  The closest straightforward
approach is probably a working <wink> variant of the first two:
"""

import re

print(re.split(r'([a-z]+)', 'AbbbC'))     #  ['A', 'bbb', 'C']
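
One caveat worth spelling out: `([a-z]+)` matches any run of lowercase letters, not a run of one repeated letter, so it only approximates the backreference pattern from the question. A minimal check, assuming a made-up second input (`'AbbccC'` is not from the original post):

```python
import re

# 'AbbbC' comes from the question; 'AbbccC' is a hypothetical extra input.
single_run = re.split(r'([a-z]+)', 'AbbbC')
mixed_runs = re.split(r'([a-z]+)', 'AbbccC')

print(single_run)   # ['A', 'bbb', 'C']
print(mixed_runs)   # ['A', 'bbcc', 'C']  ('bb' and 'cc' stay glued together)
```

If adjacent runs of different letters must be kept apart, the backreference form is still needed.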

"""
>     re.split(r'(([a-z])\1*)', 'AbbbC')  -> it hangs

Note that \1 is *inside* group #1:  this is an illegal regexp.  The regexp
compiler should have caught that & complained instead of hanging.  It's
not unique to split; the simpler

    print re.match(r'((a)\1*)', 'aaa')

also hangs, and the simpler-still

    print re.match(r'(a\1*)', 'aaa')

at least yields an internal runtime error:

    pcre.error: ('Regex execution error', -2)

Rewriting as a legal regexp doesn't solve the original problem, though:

    re.split(r'(([a-z])\2*)', 'AbbbC')   -> ['A', 'bbb', 'b', 'C']

What you would *like* to do now is suppress the output from group #2, by
making it a non-capturing group, e.g.:

    r'((?:[a-z])\2*)'

But making it non-capturing also prevents a subsequent back-reference from
getting at it, so that doesn't work either.
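
For comparison, a small sketch of how these patterns fare on a recent CPython 3, where the re module checks group references at compile time rather than hanging. The helper name `compiles` is made up for this illustration, and the behavior of the 2000-era module was as described above, not as shown here.

```python
import re

def compiles(pattern):
    # True if re.compile accepts the pattern, False if it raises re.error.
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

open_group_1 = compiles(r'((a)\1*)')          # \1 sits inside the still-open group 1
open_group_2 = compiles(r'(a\1*)')            # same problem, simpler pattern
closed_group = compiles(r'(([a-z])\2*)')      # legal: group 2 is closed before \2
missing_group = compiles(r'((?:[a-z])\2*)')   # non-capturing, so there is no group 2

print(open_group_1, open_group_2, closed_group, missing_group)
# False False True False
```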

Three workarounds:

1) Post-process the results to weed out the group #2 output:

    results = re.split(r"(([a-z])\2*)", text)
    # results cycles as: text, group #1, group #2, text, ...
    # so the group #2 output sits at indices 2, 5, 8, ...; drop those
    del results[2::3]

2) Get the source for RegexObject.split (from lib/re.py), and modify it to
your liking (it's just a Python loop that calls search repeatedly and
appends partial results to a list as it goes along).

3) Use brute force to avoid backreferences:

    # build (a+|b+|...|z+)
    import string
    pat = "+|".join(string.ascii_lowercase)
    pat = "(" + pat + "+)"
    print(re.split(pat, 'AbbbC'))

regularly y'rs  - tim
"""
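
The first two workarounds can be sketched as small functions. This is only an illustration under some assumptions: the names `RUN`, `split_runs_postprocess` and `split_runs_loop` are made up here, and the second function is a hand-written loop in the spirit of workaround 2, not the actual source of RegexObject.split.

```python
import re

# Group 1 is the whole run, group 2 is its first letter.
RUN = re.compile(r'(([a-z])\2*)')

def split_runs_postprocess(text):
    # Workaround 1: split, then drop the group 2 entries.
    parts = RUN.split(text)
    # parts cycles as text, group 1, group 2, so group 2 is at 2, 5, 8, ...
    del parts[2::3]
    return parts

def split_runs_loop(text):
    # Workaround 2 in spirit: walk the matches and keep only group 1.
    parts = []
    pos = 0
    for m in RUN.finditer(text):
        parts.append(text[pos:m.start()])   # text before this run
        parts.append(m.group(1))            # the run itself
        pos = m.end()
    parts.append(text[pos:])                # trailing text after the last run
    return parts

print(split_runs_postprocess('AbbbC'))   # ['A', 'bbb', 'C']
print(split_runs_loop('AbbccC'))         # ['A', 'bb', '', 'cc', 'C']
```

Both versions keep the empty string that re.split reports between two adjacent runs; filter it out afterwards if it is not wanted.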