Where can I best learn how to parse out both HTML and Javascript tags to extract text from a page?

Oct 30th, 2000 23:49

Magnus Lyckå, Matthew Schinckel, Paul Allopenna
Python Documentation


If you want to (quickly) strip all HTML tags from a string of data, try
using the re module:
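import re

# filename is assumed to hold the path of the HTML file to read.
with open(filename, 'r') as f:
    data = f.read()

# Remove comments first, or a '>' inside a comment will be interpreted
# as the end of the (comment) tag. re.DOTALL lets '.' match newlines,
# so comments and tags that span several lines are removed too.
text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
text = re.sub(r'<.*?>', '', text, flags=re.DOTALL)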
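This will also strip any JavaScript, but only if the page has been made
'properly', that is, with the JavaScript enclosed in HTML comments.
Otherwise only the <script> tags themselves are removed and the script
source is left behind in the text.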
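For pages whose scripts are not wrapped in comments, one possible extension is to delete whole script and style elements before stripping the remaining tags. This is only a sketch: the function name strip_markup and the sample string are made up for illustration, and a regular expression like this will still be fooled by unusual markup (for example a '>' inside an attribute value).

```python
import re

def strip_markup(html):
    """Return html with comments, script/style blocks, and tags removed."""
    # DOTALL: '.' also matches newlines; IGNORECASE: <SCRIPT> as well as <script>.
    flags = re.DOTALL | re.IGNORECASE
    # 1. Comments first, so a '>' inside a comment cannot end a tag early.
    text = re.sub(r'<!--.*?-->', '', html, flags=flags)
    # 2. Whole <script>...</script> and <style>...</style> elements, body included.
    text = re.sub(r'<(script|style)\b.*?>.*?</\1\s*>', '', text, flags=flags)
    # 3. Any remaining tags.
    return re.sub(r'<.*?>', '', text, flags=flags)

# Hypothetical sample input, not taken from a real page.
sample = '<p>Hello <script>var a = "<b>";</script>world</p>'
print(strip_markup(sample))  # prints: Hello world
```

The order matters: running step 3 before step 2 would remove the script tags but leave the script body in the output.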
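If you want to know how it works, read the 're' chapter in the Library
Reference, which discusses 'non-greedy' regular expressions. A
non-greedy '.*?' matches as few characters as possible, so each pattern
stops at the first closing delimiter instead of the last one.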
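A small comparison may make the greedy/non-greedy difference concrete. The sample string and variable names below are invented for illustration.

```python
import re

# Hypothetical sample input.
sample = '<b>bold</b> and <i>italic</i>'

# Greedy: '.*' runs from the first '<' to the LAST '>', swallowing everything.
greedy = re.sub(r'<.*>', '', sample)

# Non-greedy: '.*?' stops at the FIRST '>', so each tag is matched separately.
lazy = re.sub(r'<.*?>', '', sample)

print(repr(greedy))  # ''
print(repr(lazy))    # 'bold and italic'
```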



© 1999-2004 Synop Pty Ltd