How do I retrieve all links from an HTML page?

Jun 10th, 2001 02:42
Stefan Fischerländer, Michiel ten Hagen


The HTML source has to be in $html. You can load it with something 
like this (fetching a URL this way requires allow_url_fopen to be enabled):
$html = file_get_contents("http://www.website.com");


// remove all line breaks (both \r and \n)
$html = str_replace(array("\r","\n"),"",$html);
// and put in a new line break behind every closing anchor tag,
// regardless of case (</a> or </A>)
$html = str_ireplace("</a>","</a>\n",$html);
// split the string into single lines (split() was removed in PHP 7)
$lines = explode("\n",$html);
	
// $lines now is an array of lines; every line except the last ends with an anchor tag
// (the eregi functions were removed in PHP 7, so preg_* is used here)
$links = array();
foreach($lines as $line)
{
	// delete everything in front of the anchor tag
	$line = preg_replace("/.*<a\s/i","<a ",$line);
	// now the line just consists of something like <a ...>...</a>
	// we extract the link within the href attribute ...
	if(preg_match("/^<a\s[^>]*?href\s*=\s*[\"']?([^\"'> ]*)/i",$line,$regs))
	{
		// and collect it; lines without a link are skipped
		$links[] = $regs[1];
	}
}
// put the links back into the $lines array
$lines = $links;

Now all the links (URLs) are in the $lines array.
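
To try the steps above without fetching a real page, they can be wrapped in a small function and run on a short test string. This is only a sketch: the function name extract_links and the sample HTML are made up for illustration, and it uses the preg_* functions on the assumption that a current PHP version is in use.

```php
<?php
// Hypothetical helper wrapping the steps above; the name is an assumption.
function extract_links($html)
{
    // remove line breaks, then start a new line after every closing anchor tag
    $html = str_replace(array("\r", "\n"), "", $html);
    $html = str_ireplace("</a>", "</a>\n", $html);

    $links = array();
    foreach (explode("\n", $html) as $line) {
        // delete everything in front of the last anchor tag on the line
        $line = preg_replace("/.*<a\s/i", "<a ", $line);
        // capture the href value, whether quoted or unquoted
        if (preg_match("/^<a\s[^>]*?href\s*=\s*[\"']?([^\"'> ]*)/i", $line, $regs)) {
            $links[] = $regs[1];
        }
    }
    return $links;
}

// Made-up sample input, for illustration only
$html = "<p>See <a href=\"http://example.com/a\">A</a>\n"
      . "and <A HREF='b.html'>B</A> or <a href=c.php>C</a>.</p>";
print_r(extract_links($html));
```

For this sample the array should contain http://example.com/a, b.html and c.php. Like the original, it is a regular-expression shortcut and not a full HTML parser, so unusual markup (for example a ">" inside an attribute value) can still confuse it.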