How can I extract just the links in a web-page or html-document?

May 12th, 2000 07:10
unknown unknown, Fredrik Lundh


In addition to the question:

It would be nice if a relative link such as '<a
href="/source/test_pyt.tgz">Logo</a>' in a page at
http://www.test.site.org came out as
"http://www.test.site.org/source/test_pyt.tgz" in the end, too.
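For what it's worth, resolving a relative link against the page address is what urljoin does. A minimal sketch, assuming Python 3 (where urljoin lives in urllib.parse rather than urlparse); the second base URL and the file name logo.gif are made up for illustration:

```python
# Python 3 import path; in Python 2 the same function is urlparse.urljoin.
from urllib.parse import urljoin

base = "http://www.test.site.org"

# A site-relative href is resolved against the host of the base URL.
resolved = urljoin(base, "/source/test_pyt.tgz")
print(resolved)

# A page-relative href is resolved against the directory of the base URL.
# (Hypothetical page and file name, for illustration only.)
sibling = urljoin("http://www.test.site.org/docs/index.html", "logo.gif")
print(sibling)
```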

Solution:

from the eff-bot archives:

#
# extract anchors from an HTML document
#
# fredrik lundh, may 1999
#
# fredrik@pythonware.com
# http://www.pythonware.com
#

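# Python 2 only: htmllib and urlparse are not available in Python 3.
import htmllib
import formatter
import urllib, urlparse

class myParser(htmllib.HTMLParser):

    def __init__(self, base):
        # NullFormatter discards rendered output; we only want the anchors
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.anchors = []  # list of (url, link text) tuples
        self.base = base   # URL used to resolve relative links, or None

    def anchor_bgn(self, href, name, type):
        # called for each <a> start tag; start buffering the link text
        self.save_bgn()
        if href and self.base:
            # make relative links absolute
            self.anchor = urlparse.urljoin(self.base, href)
        else:
            # no base, or no href (e.g. <a name=...>): keep href as given
            self.anchor = href

    def anchor_end(self):
        # called for each </a>; fetch the text buffered since anchor_bgn
        text = self.save_end().strip()
        if self.anchor and text:
            self.anchors.append((self.anchor, text))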

if __name__ == '__main__':

    URL = "http://www.pythonware.com"

    f = urllib.urlopen(URL)

    p = myParser(URL)
    p.feed(f.read())
    p.close()

    print "anchors =", p.anchors
    print "title =", p.title

</F>

<!-- (the eff-bot guide to) the standard python library:
http://www.pythonware.com/people/fredrik/librarybook.htm
-->
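
A note for readers on Python 3: the htmllib module used above no longer exists there. The following is a rough sketch of the same idea using html.parser and urllib.parse from the standard library. It is not part of the original post; the names AnchorParser and extract_anchors are made up for this example, and it only collects anchors (it does not record the page title).

```python
# Python 3 sketch of the anchor extractor; not from the original post.
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collect (url, link_text) pairs from <a href=...> elements."""

    def __init__(self, base=None):
        super().__init__()
        self.base = base      # URL used to resolve relative links, or None
        self.anchors = []     # collected (url, text) tuples
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.base:
                href = urljoin(self.base, href)  # make relative links absolute
            self._href = href
            self._text = []

    def handle_data(self, data):
        # Only buffer text while inside an <a> that has an href attribute.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._text).strip()
            if self._href and text:
                self.anchors.append((self._href, text))
            self._href = None


def extract_anchors(html, base=None):
    """Hypothetical helper: parse an HTML string and return its anchors."""
    parser = AnchorParser(base)
    parser.feed(html)
    parser.close()
    return parser.anchors


if __name__ == "__main__":
    sample = '<p><a href="/source/test_pyt.tgz">Logo</a> <a name="top">x</a></p>'
    print(extract_anchors(sample, "http://www.test.site.org"))
```

To run it against a live page, the HTML could be fetched with urllib.request.urlopen(URL).read() and decoded to a string before being passed to extract_anchors; the right encoding depends on the page.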