TSE: Search: Regular expression: HTML: URL: Which possible regular expression to extract anchor <A>?
Jul 15th, 2006 13:44
Knud van Eeden,
----------------------------------------------------------------------
--- Knud van Eeden --- 18 September 2004 - 02:12 am ------------------
===
You want to find occurrences of hyperlinks like
<A HREF="http://www.semware.com">this is a test</A>
===
An HTML anchor <A> can have the following format:
anchor =
               +-------<---------+
               |                 |
->-(<)->-(A)->-+->-[attribute]->-+->-(>)->[text]->(</A>)->-
attribute =
   +------<------+            +------<------+
   |             |            |             |
->-+->-(space)->-+->-[name]->-+->-(space)->-+->-(=)->-+
                              |             |         |
                              +------>------+         |
                                                      |
                                                      |
 +----------------------------------------------------+
 |
 |             +------<------+
 |             |             |
 +->-[value]->-+->-(space)->-+->-
               |             |
               +------>------+
name =
->-+->-[HREF]->--------+
   |                   |
   +->-[attribute 2]->-+-->--
   |                   |
   +->-[attribute 3]->-+
   ...                 ...
   |                   |
   +->-[attribute n]->-+
value =
   +->-[text (no quotes)]->-+
   |                        |
->-+->-(")->-[text]->-(")->-+->-
   |                        |
   +->-(')->-[text]->-(')->-+
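The diagrams above can be read almost directly as a regular expression. The following is a minimal sketch in Python rather than in TSE syntax, offered only as an illustration. The names VALUE, ATTRIBUTE, ANCHOR and find_anchors are hypothetical, and attribute names are assumed to consist of letters, digits, "_", ":" and "-".

```python
import re

# value = text without quotes | "text" | 'text'
VALUE = r"""(?:"[^"]*"|'[^']*'|[^\s"'>]+)"""

# attribute = space(s), name, optional spaces, "=", value
# (optional spaces before the value are also accepted here)
ATTRIBUTE = r"\s+[A-Za-z][\w:-]*\s*=\s*" + VALUE

# anchor = "<", "A", one or more attributes, ">", text, "</A>"
ANCHOR = re.compile(
    r"<a(?:" + ATTRIBUTE + r")+\s*>(.*?)</a>",
    re.IGNORECASE | re.DOTALL,
)


def find_anchors(html):
    """Return the full text of every anchor element matched by ANCHOR."""
    return [m.group(0) for m in ANCHOR.finditer(html)]
```

Under these assumptions, find_anchors() returns the example hyperlink from the top of this entry unchanged and skips elements such as <ABBR ...> that merely start with the letter A.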
===
One of the simplest ways is to search for all occurrences of HREF,
but that usually finds too much, because text that is not a hyperlink
is also matched.
--- cut here: begin --------------------------------------------------
PROC Main()
LFind( "HREF", "iwgv" )
END
<F12> Main()
--- cut here: end ----------------------------------------------------
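To see why the plain HREF search finds too much, here is a small sketch in Python (not TSE). The sample text and the names sample, word_hits and anchor_hits are made up for illustration, and the word search is only a rough analogue of the "i" and "w" options used above.

```python
import re

sample = "\n".join([
    '<LINK HREF="style.css" REL="stylesheet">',
    '<P>The HREF attribute holds the target address.</P>',
    '<A HREF="http://www.semware.com">this is a test</A>',
])

# Rough analogue of searching for the whole word HREF, ignoring case.
word_hits = [line for line in sample.splitlines()
             if re.search(r"\bHREF\b", line, re.IGNORECASE)]

# Simplified check: HREF must appear inside an <A ...> start tag.
anchor_hits = [line for line in sample.splitlines()
               if re.search(r"<a\s[^>]*\bhref\b", line, re.IGNORECASE)]
```

In this sample the word search reports all three lines, while only the last line is an actual hyperlink.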
---
A more involved regular expression you could use in TSE searches for
'<A ', followed by any characters, followed by "HREF", followed by
zero or more spaces, followed by "=", followed by zero or more spaces,
optionally followed by a single or double quote (you could leave the
search for this quote out of the regular expression, as it is not
really relevant for the result).
That regular expression will find most valid hyperlinks in an HTML
page (though possibly also some invalid ones).
===
--- cut here: begin --------------------------------------------------
PROC Main()
LFind( '<[aA] .@[hH][rR][eE][fF] @= @{{\"}|{' + "\'" + '}}?\c', "xv" )
END
<F12> Main()
--- cut here: end ----------------------------------------------------
===
Internet: see also:
---
TSE: Search/Replace: Regular expression: Link: Can you give overview
links regular expressions?
http://www.faqts.com/knowledge_base/view.phtml/aid/31433/fid/865
----------------------------------------------------------------------