Where can I best learn how to parse out both HTML and Javascript tags to extract text from a page?

Nov 30th, 2004 10:30
Chris Burkhardt, Joe Bloggs, Magnus Lyckå, Matthew Schinckel, Paul Allopenna,

If you want to (quickly) strip all HTML tags from a string of data, you
can use the re module:
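import re

# filename is the path to the HTML file you want to strip.
with open(filename, 'r') as f:
    data = f.read()

# Remove comments first, or a '>' inside a comment will be interpreted
# as the end of the (comment) tag. re.DOTALL lets '.' match newlines,
# so comments and tags that span several lines are handled too.
text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
text = re.sub(r'<.*?>', '', text, flags=re.DOTALL)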
This will also strip any JavaScript, but only if the JavaScript is
placed inside HTML comments. Script code that is not inside a comment
stays in the output, because only the <script> tags themselves are
removed.
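For pages whose scripts are not wrapped in comments, one possible
workaround is to remove whole <script> elements before stripping the
remaining tags. The following is only a sketch: the function name
strip_tags and the sample string are made up for illustration, and
regular expressions can still be fooled by unusual markup (for example
a '>' inside an attribute value).

```python
import re

def strip_tags(data):
    """Return data with comments, script blocks, and tags removed (rough sketch)."""
    # Remove comments first, so a '>' inside a comment is not taken as a tag end.
    text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
    # Remove each script element together with the code it contains.
    text = re.sub(r'<script\b.*?</script\s*>', '', text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Finally remove whatever tags remain.
    return re.sub(r'<.*?>', '', text, flags=re.DOTALL)

# Hypothetical sample input; note the '>' inside the script code.
sample = '<p>Hello </p><script>var a = 1 > 0;</script><b>world</b>'
print(strip_tags(sample))  # Hello world
```

The script pattern has to run before the generic tag pattern; otherwise
the <script> tags are removed on their own and the code between them is
left behind as text.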
If you want to know how these patterns work, read the 're' chapter in
the Python Library Reference, which discusses the usefulness of
'non-greedy' regular expressions.
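To see why the non-greedy form matters here, compare the two patterns
on a short string. This is just an illustration; the sample string and
variable names are invented for the example.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.*' grabs as much as it can, so a single match runs from the
# first '<' to the last '>' and the text in between is lost.
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: '.*?' stops at the first '>', so each tag is matched
# separately and the text between tags survives.
non_greedy = re.sub(r'<.*?>', '', html)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'bold and italic'
```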