Entry
Translating regsub.splitx() to re.split()
Jul 5th, 2000 09:59
Nathan Wallace, unknown unknown, Hans Nowak, Snippet 22, Tim Peters
"""
Packages: text.regular_expressions
"""
"""
> I have been using
>
> regsub.splitx('AbbbC', r'\([a-z]\)\1*')
>
> to split something like 'AbbbC' into ['A', 'bbb', 'C'].
>
> How do I translate it into re.split()?
With great difficulty <wink>. Interesting problem!
> I tried
>
> re.split(r'([a-z])+', 'AbbbC') -> ['A', 'b', 'C']
> re.split(r'(([a-z])+)', 'AbbbC') -> ['A', 'bbb', 'b', 'C']
> re.split(r'([a-z])\1*', 'AbbbC') -> ['A', 'b', 'C']
Those are all functioning as documented. Could be the closest
straightforward way you can get is a working <wink> variant of the first
two:
"""
import re
print(re.split(r'([a-z]+)', 'AbbbC'))  # ['A', 'bbb', 'C']
"""
> re.split(r'(([a-z])\1*)', 'AbbbC') -> it hangs
Note that \1 is *inside* group #1: this is an illegal regexp. The regexp
compiler should have caught that & complained instead of hanging. It's
not unique to split; the simpler
    print(re.match(r'((a)\1*)', 'aaa'))
also hangs, and the simpler-still
    print(re.match(r'(a\1*)', 'aaa'))
at least yields an internal runtime error:
    pcre.error: ('Regex execution error', -2)
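For comparison, here is a small sketch of the same patterns on a current
interpreter. It goes beyond the original post and assumes Python 3's re
module rather than the pcre-based engine discussed above; there, a
backreference to a group that is still open is rejected when the pattern is
compiled, with re.error, instead of hanging. The helper name compiles() is
made up for illustration.

```python
import re

# Assumption: Python 3's re module, not the pcre-based engine from the thread.
def compiles(pattern):
    # Return True if re accepts the pattern, False if it raises re.error.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'((a)\1*)'))   # False: \1 sits inside group #1, still open
print(compiles(r'(a\1*)'))     # False: same problem, simpler pattern
print(compiles(r'((a)\2*)'))   # True: \2 refers to the closed inner group
```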
Rewriting as a legal regexp doesn't solve the original problem, though:
    re.split(r'(([a-z])\2*)', 'AbbbC') -> ['A', 'bbb', 'b', 'C']
What you would *like* to do now is suppress the output from group #2, by
making it a non-capturing group, e.g.:
    r'((?:[a-z])\2*)'
But making it non-capturing also prevents a subsequent back-reference from
getting at it, so that doesn't work either.
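A quick way to see both halves of that, sketched under the assumption of
Python 3's re module (not the engine from the original thread): with the
inner group made non-capturing there is no group #2 left for \2 to refer to,
so the pattern is rejected, while the capturing version compiles but leaks
group #2 into the split output.

```python
import re

# Assumption: Python 3's re module.
try:
    re.compile(r'((?:[a-z])\2*)')
    noncapturing_ok = True
except re.error:
    # No group #2 exists any more, so \2 has nothing to refer to.
    noncapturing_ok = False

print(noncapturing_ok)  # False

# The capturing version is legal, but group #2 shows up in the result.
print(re.split(r'(([a-z])\2*)', 'AbbbC'))  # ['A', 'bbb', 'b', 'C']
```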
Three workarounds:
1) Post-process the results to weed out the group #2 output:
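    # text is the string to split, e.g. 'AbbbC'
    results = re.split(r"(([a-z])\2*)", text)
    # remove every third result (the group #2 output at indices 2, 5, 8, ...)
    bad_indices = list(range(2, len(results), 3))
    bad_indices.reverse()  # delete from the end so earlier indices stay valid
    for i in bad_indices:
        del results[i]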
2) Get the source for RegexObject.split (from lib/re.py), and modify it to
your liking (it's just a Python loop that calls search repeatedly and
appends partial results to a list as it goes along).
3) Use brute force to avoid backreferences:
    # build (a+|b+|...|z+)
    import string
    pat = "+|".join(string.ascii_lowercase)
    pat = "(" + pat + "+)"
    print(re.split(pat, 'AbbbC'))
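
Workaround 2 can also be sketched without touching the library source. The
helper below is hypothetical: the name splitx is borrowed from the old regsub
function, and the code is not taken from lib/re.py. It assumes Python 3 and
a pattern that never matches the empty string. It walks the matches and keeps
each whole match as a delimiter, so inner groups never reach the output and
the plain backreference pattern can be used as-is.

```python
import re

def splitx(text, pattern):
    # Split text on pattern, keeping each whole match as its own field.
    # Hypothetical helper; assumes the pattern never matches an empty string.
    fields = []
    last_end = 0
    for m in re.finditer(pattern, text):
        fields.append(text[last_end:m.start()])  # text before the match
        fields.append(m.group(0))                # the whole delimiter
        last_end = m.end()
    fields.append(text[last_end:])               # trailing text
    return fields

print(splitx('AbbbC', r'([a-z])\1*'))  # ['A', 'bbb', 'C']
```

Because only m.group(0) is appended, it does not matter how many capturing
groups the pattern needs internally.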
regularly y'rs - tim
"""