Where can I best learn how to parse out both HTML and Javascript tags to extract text from a page?

Oct 30th, 2000 23:49

Magnus Lyckå, Matthew Schinckel, Paul Allopenna
Python Documentation

If you want to (quickly) strip all HTML tags from a string of data, try using the re module:
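import re
# 'filename' is the path to the HTML file to process.
with open(filename, 'r') as f:
    data = f.read()
# Remove comments first, or a '>' inside a comment will be
# interpreted as the end of the (comment) tag.
# re.DOTALL lets '.' match newlines, so comments and tags that
# span several lines are matched too.
text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
# Non-greedy match: strip each remaining tag individually.
text = re.sub(r'<.*?>', '', text, flags=re.DOTALL)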
This will also strip any JavaScript, but only if the page wraps its JavaScript in HTML comments. Script code that sits outside a comment is left behind in the text.
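For pages that do not hide their scripts in comments, one possible workaround is to delete whole script elements before stripping tags. The following is only a sketch, not part of the original answer: SAMPLE, strip_tags and strip_tags_and_scripts are made-up names for illustration, and the script pattern assumes the JavaScript itself never contains the text "</script>".

```python
import re

# Hypothetical sample page: one script wrapped in an HTML comment, one not.
SAMPLE = """<html><body>
<script><!--
alert("hidden");
//--></script>
<p>Hello, <b>world</b>!</p>
<script>alert("visible");</script>
</body></html>"""


def strip_tags(data):
    """Two-step approach: remove comments, then remove tags."""
    # re.DOTALL lets '.' match newlines, so multi-line comments are matched.
    text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
    return re.sub(r'<.*?>', '', text, flags=re.DOTALL)


def strip_tags_and_scripts(data):
    """Same, but first drop every <script>...</script> element."""
    # Assumption: the script body never contains the literal "</script>".
    text = re.sub(r'<script\b.*?</script\s*>', '', data,
                  flags=re.DOTALL | re.IGNORECASE)
    return strip_tags(text)


print(strip_tags(SAMPLE).split())              # uncommented script text survives
print(strip_tags_and_scripts(SAMPLE).split())  # only the page text is left
```

With the sample above, the first call still returns the alert("visible"); line, while the second returns only "Hello, world!". For anything beyond a quick job, a real HTML parser is likely to be more robust than regular expressions.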
If you want to know how it works, read the 're' chapter in the Python Library Reference, which discusses the usefulness of 'non-greedy' regular expressions.
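To see why the '?' matters, here is a small comparison on a made-up string (the variable names are just for illustration). The greedy '.*' runs from the first '<' to the last '>' and swallows the text in between, while the non-greedy '.*?' stops at the nearest '>'.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.*' matches as much as possible, from the first '<' to the last '>'.
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: '.*?' matches as little as possible, so each tag is removed on its own.
non_greedy = re.sub(r'<.*?>', '', html)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'bold and italic'
```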

© 1999-2004 Synop Pty Ltd