Entry
Extracting fields from HTML table
Jul 5th, 2000 09:58
Nathan Wallace, unknown unknown, Hans Nowak, Snippet 5, Skip Montanaro
"""
Packages: text.html
"""
"""
> Any tips on how to extract fields from html tables? Rendering the
> table with lynx sort of works, but field delimiters are lost.
Try converting it to a canonical form first, then split it using the <th> or
<td> tags. I have a simple script at home that massages a table into a
one-row-per-row form. Let me know if you'd like it as a starting point. For
example, once you get it looking something like
<table>
<tr><th>Header</th><td>data 1</td><td>data 2</td></tr>
<tr><th>Header</th><td>data 1</td><td>data 2</td></tr>
...
</table>
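The script mentioned above isn't shown here, so the following is only a rough
sketch of one way to get a table into that one-row-per-line form. The function
name one_row_per_line and the sample markup are made up for illustration, and
the approach assumes simple tables without nesting.

```python
import re

def one_row_per_line(html):
    """Return the markup as a list of lines, one <tr>...</tr> per line."""
    # Collapse all whitespace runs (including newlines) to single spaces,
    # so rows that were broken across several lines are reunited.
    flat = re.sub(r"\s+", " ", html)
    # Remove the whitespace left between adjacent tags.
    flat = re.sub(r">\s+<", "><", flat)
    # Start a new line before every <tr ...> and after every </tr>.
    flat = re.sub(r"\s*(<tr\b)", r"\n\1", flat, flags=re.IGNORECASE)
    flat = re.sub(r"(</tr>)\s*", r"\1\n", flat, flags=re.IGNORECASE)
    return [line for line in flat.split("\n") if line.strip()]

# Hypothetical input with one row spread over several lines.
sample = """<table>
<tr>
  <th>Header</th>
  <td>data 1</td>
  <td>data
      2</td>
</tr>
</table>"""

lines = one_row_per_line(sample)
for line in lines:
    print(line)
# <table>
# <tr><th>Header</th><td>data 1</td><td>data 2</td></tr>
# </table>
```

Regular expressions like these will trip over nested tables or a literal "<"
inside cell text; for messy pages a real HTML parser is the safer route.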
You should be able to search for lines containing "<tr>" and split them using
something like
fields = re.split(r"(?:<[^>]+>\s*)+", line)
The fields list should look like
['', 'Header', 'data 1', 'data 2', '']
from which you can easily extract the interesting tidbits.
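Put together, the search-and-split step might look like the sketch below. The
helper names row_fields and table_rows and the sample lines are hypothetical;
the regular expression is the one suggested above, and the input is assumed to
already have one table row per line.

```python
import re

# One or more consecutive tags, each optionally followed by whitespace.
TAG_RUN = re.compile(r"(?:<[^>]+>\s*)+")

def row_fields(line):
    """Split one table-row line on runs of tags, dropping empty fields."""
    fields = TAG_RUN.split(line.strip())
    return [f.strip() for f in fields if f.strip()]

def table_rows(lines):
    """Return the fields of every line that contains a <tr> tag."""
    return [row_fields(line) for line in lines if "<tr" in line.lower()]

# Hypothetical input, already in one-row-per-line form.
sample_lines = [
    "<table>",
    "<tr><th>Header</th><td>data 1</td><td>data 2</td></tr>",
    "<tr><th>Other</th><td>data 3</td><td>data 4</td></tr>",
    "</table>",
]

for fields in table_rows(sample_lines):
    print(fields)
# ['Header', 'data 1', 'data 2']
# ['Other', 'data 3', 'data 4']
```

One thing to watch: because adjacent tags are merged into a single delimiter,
an empty cell such as <td></td> vanishes from the result instead of showing up
as an empty string, so column positions can shift in sparse tables.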
"""