TSE: Search: Regular expression: HTML: URL: Which possible regular expression to extract anchor <A>?

Jul 15th, 2006 13:44
Knud van Eeden


----------------------------------------------------------------------
--- Knud van Eeden --- 18 September 2004 - 02:12 am ------------------

===

You want to find occurrences of hyperlinks like

 <A HREF="http://www.semware.com">this is a test</A>

===

An HTML anchor <A> can have the following format:


anchor =

               +-------<---------+
               |                 |
->-(<)->-(A)->-+->-[attribute]->-+->-(>)->-[text]->-(</A>)->-


attribute =

   +------<------+            +------<------+
   |             |            |             |
->-+->-(space)->-+->-[name]->-+->-(space)->-+->-(=)->-+
                              |             |         |
                              +------>------+         |
                                                      |
                                                      |
 +----------------------------------------------------+
 |
 |             +------<------+
 |             |             |
 +->-[value]->-+->-(space)->-+->-
               |             |
               +------>------+


name =

->-+->-[HREF]->--------+
   |                   |
   +->-[attribute 2]->-+-->--
   |                   |
   +->-[attribute 3]->-+
   ...               ...
   |                   |
   +->-[attribute n]->-+


value =

     +->-[text (no quotes)]->-+
     |                        |
  ->-+->-(")->-[text]->-(")->-+->-
     |                        |
     +->-(')->-[text]->-(')->-+
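
As an illustration only, the diagrams above can be approximated by a
regular expression outside TSE. The following Python sketch is not part
of the original entry: the names VALUE, NAME, ATTR_RE, ANCHOR_RE and
parse_anchor are hypothetical, and the set of characters allowed in an
attribute name is an assumption. Unlike the diagram, it requires at
least one whitespace character before each attribute name and allows
whitespace after "=".

```python
import re

# value: "text", 'text', or unquoted text (the three branches of the diagram)
VALUE = r"""(?:"[^"]*"|'[^']*'|[^\s"'>]+)"""
# name: HREF or any other attribute name (assumed: letters, digits, hyphens)
NAME = r"[A-Za-z][A-Za-z0-9-]*"

# attribute: whitespace, name, optional whitespace, "=", optional whitespace, value
ATTR_RE = re.compile(rf"\s+({NAME})\s*=\s*({VALUE})")
# anchor: "<A", zero or more attributes, ">", text, "</A>"
ANCHOR_RE = re.compile(
    rf"<[aA]((?:\s+{NAME}\s*=\s*{VALUE})*)\s*>(.*?)</[aA]>", re.DOTALL
)


def parse_anchor(html):
    """Return (attributes, text) for the first anchor in html, or None."""
    match = ANCHOR_RE.search(html)
    if match is None:
        return None
    # Upper-case the names and strip the surrounding quotes from the values.
    attributes = {
        name.upper(): value.strip("\"'")
        for name, value in ATTR_RE.findall(match.group(1))
    }
    return attributes, match.group(2)


if __name__ == "__main__":
    print(parse_anchor('<A HREF="http://www.semware.com">this is a test</A>'))
```

The sketch follows the diagrams only; it does not try to handle every
form of anchor that a real HTML page can contain.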

===

One of the simplest ways is to search for all occurrences of HREF,
but that usually finds too much, because HREF also occurs outside
hyperlinks (for example in <LINK> tags).

--- cut here: begin --------------------------------------------------

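PROC Main()
 // i = ignore case, w = whole words, g = search from the start of the file,
 // v = view a list of all matching lines
 LFind( "HREF", "iwgv" )
END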

<F12> Main()

--- cut here: end ----------------------------------------------------
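
To see why this finds too much, here is a small sketch in Python (not
TSE) that mimics a case-insensitive whole-word search for HREF. The
sample lines and the names SAMPLE_LINES, HREF_WORD and lines_with_href
are made up for this illustration, and the Python word boundary is only
an approximation of the TSE "w" option.

```python
import re

# Hypothetical sample lines; only the first one is a hyperlink.
SAMPLE_LINES = [
    '<A HREF="http://www.semware.com">this is a test</A>',
    '<LINK REL="stylesheet" HREF="style.css">',
    '<P>Set the href attribute to the target address.</P>',
]

# \b...\b approximates the "w" (whole words) option,
# re.IGNORECASE approximates the "i" (ignore case) option.
HREF_WORD = re.compile(r"\bHREF\b", re.IGNORECASE)


def lines_with_href(lines):
    """Return the lines that contain HREF as a whole word, ignoring case."""
    return [line for line in lines if HREF_WORD.search(line)]


if __name__ == "__main__":
    for line in lines_with_href(SAMPLE_LINES):
        print(line)
```

All three sample lines are reported, although only the first one is an
anchor.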

---

A more involved regular expression you could use in TSE searches for
'<A ' followed by any characters, followed by "HREF", followed by zero
or more spaces, followed by "=", followed by zero or more spaces,
optionally followed by a single or double quote (you could leave the
search for this quote out of the regular expression, as it is not
really relevant for the result).
That regular expression will find most valid hyperlinks in an HTML page
(though possibly also some text that is not a valid hyperlink).

===

--- cut here: begin --------------------------------------------------

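PROC Main()
 // x = regular expression, v = view a list of all matching lines
 // .@ = zero or more of any character, ' @' = zero or more spaces,
 // {...}? = optional quote, \c = place the cursor at this position
 LFind( '<[aA] .@[hH][rR][eE][fF] @= @{{\"}|{' + "\'" + '}}?\c', "xv" )
END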

<F12> Main()

--- cut here: end ----------------------------------------------------
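
For readers who want to try the logic of this pattern outside TSE, the
following is a rough Python translation. It is a sketch, not the TSE
syntax: the names ANCHOR_HREF and href_value_start and the sample lines
are hypothetical, and the end of the match is assumed to play the role
of the TSE cursor marker.

```python
import re

# Python translation of the TSE pattern (an approximation, not TSE syntax):
#   TSE '.@'  ->  '.*'      zero or more of any character
#   TSE ' @'  ->  ' *'      zero or more spaces
#   optional double or single quote  ->  ["']?
#   TSE cursor marker: no direct equivalent; the end of the match is used.
ANCHOR_HREF = re.compile(r"""<[aA] .*[hH][rR][eE][fF] *= *["']?""")


def href_value_start(line):
    """Return the index just past the match (where the URL starts), or None."""
    match = ANCHOR_HREF.search(line)
    return match.end() if match else None


if __name__ == "__main__":
    samples = [
        '<A HREF="http://www.semware.com">this is a test</A>',  # hyperlink
        '<LINK REL="stylesheet" HREF="style.css">',             # not an anchor
        '<A NAME="top">the HREF= attribute</A>',                # false positive
    ]
    for line in samples:
        start = href_value_start(line)
        print(line, "->", None if start is None else line[start:])
```

The third sample shows the kind of invalid hit mentioned above: the
pattern only requires "HREF" somewhere after '<A ' on the same line, not
inside the tag itself.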

===

Internet: see also:

---

TSE: Search/Replace: Regular expression: Link: Can you give overview 
links regular expressions?
http://www.faqts.com/knowledge_base/view.phtml/aid/31433/fid/865

----------------------------------------------------------------------