Perl regex code to Python

Jul 5th, 2000 09:59
Nathan Wallace, unknown unknown, Hans Nowak, Snippet 18, Mike Fletcher


"""
Packages: text.regular_expressions
"""

"""
> I have been learning Perl in order to write some CGI scripts and
> various
> parsing scripts. I was wondering how the equivalent code would look in
> Python?
> 
> Given an HTML file with content data delimited via HTML comments.
> For
> example:
> 
> <HTML>
> <BODY>
> <!--version-->6.4<!--/version-->
> <B><!--product-->OpenGL<!--/product--></B>some more stuff...
> <!--description-->line1
> line2
> line3
> line4<!--/description-->
> </BODY>
> </HTML>
> 
> In Perl, I use the following regex to parse the content from between
> the HTML
> comments:
> 
> sub kbExtractContent {    # called as kbExtractContent($text, "description")
>    my ($text, $tag) = @_;
>    $text =~ /<!--${tag}-->(.+)<!--\/${tag}-->/s;
>    return $1;
> }
> 
> where $text contains the entire contents of an HTML file and
> "description" is
> the comment pattern that I am looking for.  The regex in the above
> case will
> look for any text between <!--description--> and <!--/description-->.
> The
> regex will span multiple lines via the 's' option.  The content text
> is
> returned via the $1.
> 
> How would you do the equivalent in Python?
"""

import re
TAGPATTERN = '<!--%s-->(.*?)<!--/%s-->'
#TAGPATTERN = '<%s.*?>(.*?)</%s.*?>'

def findtag( instuff, tagname, tagpattern=TAGPATTERN ):
	'''
	Finds the contents of the first tag matching
	tagpattern, which must have two string
	substitution "slots" (%s) into which to
	place copies of tagname.
	Returns None if nothing matches.
	'''
	# escape tagname so any regex metacharacters in it match literally
	name = re.escape( tagname )
	reg = re.compile( tagpattern %(name, name),
				re.IGNORECASE |re.DOTALL )
	result = reg.search( instuff )
	# the result is either a match object or the None object
	if result: # is a match object
		return result.group( 1)
	else: # is the None object
		return None # different result from empty content
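
"""
For illustration, here is a minimal sketch of the same approach run
against the sample document from the question.  The names SAMPLE and
extract are made up for this demo (extract is just a condensed stand-in
for findtag), and the sample text is assumed to be held in a string
rather than read from a file.
"""

```python
import re

# Sample document from the question, held in a string for the demo.
SAMPLE = """<HTML>
<BODY>
<!--version-->6.4<!--/version-->
<B><!--product-->OpenGL<!--/product--></B>some more stuff...
<!--description-->line1
line2
line3
line4<!--/description-->
</BODY>
</HTML>"""

TAGPATTERN = '<!--%s-->(.*?)<!--/%s-->'

def extract(text, tagname):
    # hypothetical condensed stand-in for findtag: same pattern, same flags
    match = re.search(TAGPATTERN % (tagname, tagname), text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

version = extract(SAMPLE, "version")          # '6.4'
product = extract(SAMPLE, "PRODUCT")          # 'OpenGL', found via IGNORECASE
description = extract(SAMPLE, "description")  # spans four lines via DOTALL
missing = extract(SAMPLE, "author")           # None, no such comment pair
```

"""
Returning None for a missing tag keeps that case distinct from a tag
that is present but empty, which would come back as ''.
"""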

"""
Note: I've used a non-greedy search on the contents (which is normally
what you want unless you're allowing nested comments of the same GI).
The DOTALL (or S) flag lets the . character match newlines too, so the
match can span multiple lines; IGNORECASE is just another "wouldn't you
want this".

As you will notice, this is far more verbose, but I find it easier to
understand at a glance than the Perl, which, even though I know what it
does, still doesn't quite resolve for me as to how it's being done.
"""