Where can I best learn how to parse out both HTML and Javascript tags to extract text from a page?

Nov 30th, 2004 10:30
Chris Burkhardt, Joe Bloggs, Magnus Lyckå, Matthew Schinckel, Paul Allopenna


If you want to (quickly) strip all HTML tags from a string of data, try using the re module:
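import re

# 'filename' is the path to the HTML file to strip.
with open(filename, 'r') as f:
    data = f.read()

# Remove comments first, or a '>' inside a comment will be interpreted
# as the end of the (comment) tag. re.DOTALL lets '.' match newlines,
# so comments and tags that span several lines are removed as well.
text = re.sub(r'<!--.*?-->', '', data, flags=re.DOTALL)
text = re.sub(r'<.*?>', '', text, flags=re.DOTALL)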
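This will also strip any javascript, but only if the page has been made 'properly', that is, the javascript is within HTML comments. Script code that is not wrapped in comments is left behind as plain text, because only the <script> tags themselves are removed.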
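For pages where the javascript is not wrapped in comments, one possible workaround is to remove whole script elements before stripping the remaining tags. The following is only a sketch and not part of the original answer: the helper name strip_tags and the sample string are made up for illustration, and a regular expression like this will not cope with every malformed page.

```python
import re

def strip_tags(data):
    """Illustrative helper (hypothetical name): strip scripts, comments and tags."""
    # Remove whole <script>...</script> elements, including their contents.
    text = re.sub(r'<script\b.*?</script\s*>', '', data,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove comments before tags, so a '>' inside a comment is harmless.
    text = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
    # Remove every remaining tag.
    return re.sub(r'<.*?>', '', text, flags=re.DOTALL)

# Made-up sample input: a script that is not inside an HTML comment.
sample = '<p>Hello <script>var x = "<b>";</script><!-- a > b -->world</p>'
print(strip_tags(sample))
```

The script elements are removed first because their contents may themselves contain '<' and '>' characters that would otherwise confuse the tag pattern.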
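If you want to know how it works, read the 're' chapter in the library reference, as it discusses the usefulness of 'non-greedy' regular expressions. In short, the '?' after '.*' makes the match stop at the first '>' it finds, instead of running on to the last '>' in the data.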
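To see the difference between the greedy and non-greedy forms, compare the two patterns on a short string. This is a small illustration with a made-up input, not part of the original answer:

```python
import re

# Made-up sample input for illustration.
html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.*' runs on to the last '>' in the string.
greedy = re.sub(r'<.*>', '', html)

# Non-greedy: '.*?' stops at the first '>' after each '<'.
non_greedy = re.sub(r'<.*?>', '', html)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'bold and italic'
```

With the greedy pattern the single match spans from the first '<' to the last '>', so the text between the tags is lost along with them.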