Entry
How do I retrieve all links from an HTML page?
Jun 10th, 2001 02:42
Stefan Fischerländer, Michiel ten Hagen
The HTML source first has to be in $html. You can load it with something
like:
// fetching a URL this way requires allow_url_fopen to be enabled in php.ini
$html = file_get_contents("http://www.website.com");
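// replace all line breaks with a space, so that a tag split across
// two lines (e.g. "<a" and "href=...") does not get glued together
$html = str_replace(array("\r\n","\r","\n")," ",$html);
// and put in a new line break behind every closing anchor tag
// (str_ireplace also catches </A>)
$html = str_ireplace("</a>","</a>\n",$html);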
// split the string into single lines
// (split() was removed in PHP 7; explode() does the same job here)
$lines = explode("\n",$html);
// $lines now is an array of lines and each line ends with an anchor tag,
// except the last one, which holds whatever follows the final anchor
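// the ereg functions were removed in PHP 7, so the preg functions are used
foreach ($lines as $i => $line)
{
    // delete everything in front of the anchor tag
    $line = preg_replace('/.*<a\s/i', '<a ', $line);
    // now the line just consists of something like <a ...>...</a>
    // we extract the link within the href attribute ...
    if (preg_match('/href\s*=\s*["\']?([^"\'> ]*)/i', $line, $regs)) {
        // and put it into the $lines array
        $lines[$i] = $regs[1];
    } else {
        // no href here (e.g. the text after the last anchor): drop the entry
        unset($lines[$i]);
    }
}
// renumber the array from 0 after dropping entries
$lines = array_values($lines);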
Now all the links (URLs) are in the $lines array.
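
The same idea can also be written as a single pass with preg_match_all(). This is only a minimal sketch, not part of the original answer: the function name extract_links() and the sample markup are made up for illustration, and like the code above it will miss or misread unusual markup (for example, URLs that contain spaces).

```php
<?php
// Hypothetical helper (an assumption for this example): returns the href
// value of every anchor tag found in $html, in document order.
function extract_links($html)
{
    // "<a", whitespace, optionally other attributes, then href= with an
    // optional quote; the captured URL ends at a quote, a space or ">"
    $pattern = '/<a\s(?:[^>]*?\s)?href\s*=\s*["\']?([^"\'> ]*)/i';
    preg_match_all($pattern, $html, $matches);
    // $matches[1] holds the text captured by the first parenthesised group
    return $matches[1];
}

// Sample markup, invented for this illustration
$html = '<p><a href="http://www.example.com/">One</a> and '
      . "<A class='x' HREF='/two.html'>Two</A> and "
      . '<a name="top">no link here</a></p>';

print_r(extract_links($html));
```

Anchors without an href attribute, such as the named anchor in the sample, are skipped rather than producing an empty entry. For pages with messy markup, an HTML parser is likely to be more reliable than any regular expression.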