How do I remove HTML comments?
How do I match HTML comments?

Apr 7th, 2008 22:59
ha mo, Mark Szlazak, anita wigginton,


An HTML comment declaration consists of <! followed by zero or more 
comments followed by >. Each comment starts with -- and includes 
all text up to and including the next occurrence of --. In a comment 
declaration, white space is allowed after each comment, but not before 
the first comment.
This means that the following are all legal HTML comments:
	1a. <!-- Hello -->
	1b. <!-- 
		 Hello!
	         The tag-pair <B>...</B> bolds any text inside.
	    -->       
	2a. <!-- Hello -- -- Goodbye -- >
	2b. <!-- Hello --
	      -- Goodbye -- >
	3.  <!---->
	4.  <!------ Hello -->
	5.  <!------> Hello -->
	6.  <!>
Note that a comment tag with just -- characters should always have a 
multiple of four - characters to be legal. However, not all HTML 
parsers follow this rule, and non-compliant sequences of - such as 
<!-----> may be allowed. People often use these sequences as separators 
in their source code.
In JavaScript 1.5 the following regular expression will match most 
single-line HTML comments:
	regX = /<!(?:--.*?--\s*)?>/g;
However, it fails on comments that span multiple lines. The dot 
metacharacter in .*? matches any character except a line terminator, 
and JavaScript 1.5 has no modifier that makes dot match every character 
(the s flag that does this arrived only with ECMAScript 2018). This 
causes the expression to fail on multiline comments like 1b.
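The failure is easy to reproduce. The following is a minimal sketch using the pattern quoted above; the variable names and the two sample strings are made up for illustration.

```javascript
// Pattern quoted from the text: the dot does not match line terminators.
const dotPattern = /<!(?:--.*?--\s*)?>/g;

// Hypothetical inputs: one comment on a single line, one spanning lines.
const singleLine = '<p>a</p><!-- Hello --><p>b</p>';
const multiLine = '<p>a</p><!--\n  Hello!\n--><p>b</p>';

console.log(singleLine.match(dotPattern)); // [ '<!-- Hello -->' ]
console.log(multiLine.match(dotPattern));  // null
```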
One way to overcome this is to replace . with an alternative grouping 
such as (?:.|\n), (?:[^-]|-[^-]) or (?:[^-]|-(?!-)), but the last 
two groupings won't catch the illegal <!-------> type comments mentioned
previously. Furthermore, alternative groupings are inefficient when 
compared to character classes. Unfortunately the character class [.\n] 
won't work, since inside a class the dot is a literal period rather 
than a metacharacter. Instead, use [\s\S], [\d\D] or [\w\W] to match 
any character. Also, a trailing \s* is added to the expression so that 
no empty lines are left behind when comments are removed.
	regX = /<!(?:--[\s\S]*?--\s*)?>\s*/g;
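The trade-off between these options can be checked with a short sketch. The property names and sample strings below are illustrative only, and the two alternative groupings are plugged into the same surrounding pattern as an assumption about how they would be used.

```javascript
// Three ways to let the comment body span lines (names are illustrative).
const patterns = {
  anyCharClass: /<!(?:--[\s\S]*?--\s*)?>/,
  dashThenNonDash: /<!(?:--(?:[^-]|-[^-])*?--\s*)?>/,
  dashNotBeforeDash: /<!(?:--(?:[^-]|-(?!-))*?--\s*)?>/,
};

const samples = {
  multiLine: '<!--\n  Hello!\n-->',
  oddDashes: '<!------->', // seven dashes: not strictly legal
};

// Record which pattern matches which sample.
const report = {};
for (const [name, re] of Object.entries(patterns)) {
  report[name] = {};
  for (const [label, text] of Object.entries(samples)) {
    report[name][label] = re.test(text);
  }
}

console.log(report);
// anyCharClass:      multiLine true, oddDashes true
// dashThenNonDash:   multiLine true, oddDashes false
// dashNotBeforeDash: multiLine true, oddDashes false
```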
The following function is passed an HTML string and returns a string 
with all HTML comments removed.
	function removeHTMLComments(html) {
		return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g,'');
	}
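As a usage sketch, the function can be tried on a small made-up fragment. The fragment and the variable names are assumptions for illustration, and JSON.stringify is used only to make the remaining newlines visible.

```javascript
function removeHTMLComments(html) {
  return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g, '');
}

// Hypothetical input: a two-line comment and an empty declaration.
const page = [
  '<P>Before</P>',
  '<!-- a comment',
  '     spanning two lines -->',
  '<P>After</P>',
  '<!>',
].join('\n');

const cleaned = removeHTMLComments(page);
console.log(JSON.stringify(cleaned));
// "<P>Before</P>\n<P>After</P>\n"
```

The trailing \s* swallows the line break after each comment, so no blank line is left where the comment used to be.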
A problem with this is that it will also remove code wrapped in <!-- 
--> inside <script...></script> or <style...></style> tags. Recall the 
comment trick used to hide scripts and styles from some old browsers, 
as in this example:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HEAD>
<TITLE>HTML Comment Example</TITLE>
<!-- Id: html-sgml.sgm,v 1.5 1995/05/26 21:29:50 connolly Exp  -->
<STYLE type="text/css"><!--
	STYLE BLOCK SHOULD REMAIN!
--></STYLE>
<SCRIPT LANGUAGE="JavaScript">
<!-- hide this stuff from other browsers
	SCRIPT BLOCK (1) THAT SHOULD NOT BE REMOVED!
// end the hiding comment -->
</SCRIPT>
</HEAD>
<BODY>
<!-- another --
  -- comment -->
<P>Not a <I>comment</I>, just regular old data characters.</P>
<SCRIPT LANGUAGE="JavaScript">
<!-- hide this stuff from other browsers
	SCRIPT BLOCK (2) THAT SHOULD NOT BE REMOVED!
// end the hiding comment -->
</SCRIPT>
<!>
</BODY>
</HTML>
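The damage can be seen on a short excerpt of the page above. This is only a sketch: the name removeHTMLCommentsNaive is made up here to label the comment-only version of the function.

```javascript
// Comment-only version: it knows nothing about SCRIPT or STYLE sections.
function removeHTMLCommentsNaive(html) {
  return html.replace(/<!(?:--[\s\S]*?--\s*)?>\s*/g, '');
}

const excerpt = [
  '<SCRIPT LANGUAGE="JavaScript">',
  '<!-- hide this stuff from other browsers',
  '\tSCRIPT BLOCK (1) THAT SHOULD NOT BE REMOVED!',
  '// end the hiding comment -->',
  '</SCRIPT>',
].join('\n');

const damaged = removeHTMLCommentsNaive(excerpt);
console.log(JSON.stringify(damaged));
// "<SCRIPT LANGUAGE=\"JavaScript\">\n</SCRIPT>"
```

The whole script body is gone, because to the pattern it is just one more comment.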
To avoid this problem the following alternation grouping is used to 
also match <SCRIPT...> </SCRIPT> and <STYLE...> </STYLE> sections.
	regX = /<(?:!(?:--[\s\S]*?--\s*)?(>)\s*|(?:script|style|SCRIPT|STYLE)[\s\S]*?<\/(?:script|style|SCRIPT|STYLE)>)/g;
Also, a function instead of a string is used as the replacement in the 
replace() method. For more on functions as replacements see:
http://www.faqts.com/knowledge_base/view.phtml/aid/15940
The first argument to this function is the string that matched the 
pattern. The second argument is the string that matched the capturing 
parenthesized subexpression. This subexpression, (>), is used as a flag 
to indicate a match that is NOT a SCRIPT or STYLE section. Comments in 
HTML sections are replaced by an empty string, but SCRIPT and STYLE 
sections are replaced by copies of themselves.
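The flag idea can be tried in isolation first. The toy pattern below is an assumption made up for illustration: group 1 takes part only in the digit alternative, so the callback can tell the two kinds of match apart.

```javascript
// Toy pattern: capture group 1 participates only when a digit matches.
const pattern = /(\d)|[a-z]/g;

const seen = [];
const result = 'a1b'.replace(pattern, (match, group1) => {
  seen.push([match, group1]);
  // Drop matches where the flag group took part, keep the rest unchanged.
  return group1 ? '' : match;
});

console.log(result); // ab
console.log(seen);   // [ [ 'a', undefined ], [ '1', '1' ], [ 'b', undefined ] ]
```

In current engines a group that did not take part in the match is passed as undefined, which is falsy, so a simple truthiness test is enough.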
Since legal identifier names in JavaScript can have a dollar sign ($) 
as the first character, one could give the function's arguments names 
that correspond to those used in replacement strings.
	function removeHTMLComments(html) {
		return html.replace(regX, function(m, $1) {
			return $1 ? '' : m;
		});
	}
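Putting the pieces together, here is a self-contained sketch run against a shortened version of the example page. The sample string is abridged for illustration, and the arrow function is only a stylistic substitute for the function expression above.

```javascript
const regX = /<(?:!(?:--[\s\S]*?--\s*)?(>)\s*|(?:script|style|SCRIPT|STYLE)[\s\S]*?<\/(?:script|style|SCRIPT|STYLE)>)/g;

function removeHTMLComments(html) {
  // $1 is '>' for a comment declaration and undefined for SCRIPT/STYLE.
  return html.replace(regX, (m, $1) => ($1 ? '' : m));
}

// Abridged sample: two real comments and one script-hiding comment.
const sample = [
  '<!-- another --',
  '  -- comment -->',
  '<P>Not a <I>comment</I>.</P>',
  '<SCRIPT LANGUAGE="JavaScript">',
  '<!-- hide this stuff from other browsers',
  'SCRIPT BLOCK THAT SHOULD NOT BE REMOVED!',
  '// end the hiding comment -->',
  '</SCRIPT>',
  '<!>',
].join('\n');

const cleaned = removeHTMLComments(sample);
console.log(cleaned);
```

The two-part comment and the empty <!> declaration are removed, while the script section, including its hiding comment, is copied through unchanged. Note that the pattern as written lists only all-lowercase and all-uppercase tag names, so a mixed-case tag such as <Script> would not be protected; adding the i flag would be one possible way to cover that case.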