Extracting fields from HTML table

Jul 5th, 2000 09:58
Nathan Wallace, unknown unknown, Hans Nowak, Snippet 5, Skip Montanaro


"""
Packages: text.html
"""

"""
> Any tips on how to extract fields from html tables? Rendering the
> table with lynx sort of works, but field delimiters are lost.

Try converting it to a canonical form first, then split it using the <th> or
<td> tags.  I have a simple script at home that massages a table into a
one-row-per-row form.  Let me know if you'd like it as a starting point.  For
example, once you get it looking something like

<table>
<tr><th>Header</th><td>data 1</td><td>data 2</td></tr>
<tr><th>Header</th><td>data 1</td><td>data 2</td></tr>
...
</table>

You should be able to search for lines containing "<tr>" and split them using
something like

    fields = re.split(r"(?:<[^>]+>\s*)+", line)

The fields list should look like

    ['', 'Header', 'data 1', 'data 2', '']

from which you can easily extract the interesting tidbits.
"""
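
The reply above describes two steps: massage the table into one-row-per-row
form, then split each row on runs of tags. The script mentioned in the reply
is not shown, so what follows is only a minimal sketch of how both steps might
look. The helper names canonical_rows and split_fields are hypothetical, and
the code assumes a simple table with no nested tables and a closing </tr> on
every row.

```python
import re

# Same pattern as in the reply: one or more consecutive tags, each optionally
# followed by whitespace, are treated as a single field delimiter.
TAG_RUN = re.compile(r"(?:<[^>]+>\s*)+")


def canonical_rows(html):
    """Return one string per <tr>...</tr>, with whitespace runs collapsed.

    Hypothetical helper: a naive regex pass that assumes rows are not nested
    and that every <tr> has a matching </tr>.
    """
    rows = re.findall(r"<tr\b[^>]*>.*?</tr>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    return [" ".join(row.split()) for row in rows]


def split_fields(line):
    """Split one canonical row into cell texts, dropping the empty ends.

    Adjacent tags merge into one delimiter, so empty cells are not preserved.
    """
    return [field.strip() for field in TAG_RUN.split(line) if field.strip()]


sample = """<table>
<tr>
  <th>Header</th>
  <td>data 1</td>
  <td>data 2</td>
</tr>
<tr><th>Other</th><td>data 3</td>
<td>data 4</td></tr>
</table>"""

for row in canonical_rows(sample):
    print(split_fields(row))
```

Because a run of adjacent tags counts as one delimiter, an empty cell such as
<td></td> disappears from the result and later columns shift left. For messy
real-world pages, a real parser such as the standard library's html.parser
module is likely to be more robust than regular expressions.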