How can I use perl at the command line to do a Global Search & Replace within a directory? I want to remove <font WHATEVER> ... </font> but not ...
Mar 24th, 2009 21:18
chat alarab, Anthony Boyd, Per M Knutsen, mike gifford
Per M Knutsen wrote on February 12, 2001:
Simple search-and-replace is easy with Perl. The one-liner I use is:
perl -pi -e 's/search/replace/g' filename
where /search/ becomes substituted by /replace/. To substitute the
pattern in all files within a directory, simply replace the filename
with the wildcard *, like this:
perl -pi -e 's/search/replace/g' *
The s/ operator means substitute, and the /g modifier means global
matching (i.e. substitute ALL instances of /search/ in the files indicated).
For a more challenging search-and-replace you will need to learn how to
use regular expressions. For example, if you want to replace the
pattern <FONT ...>Something</font> with Something, you could alter the
search-and-replace string like this:
perl -pi -e 's/<FONT.*>(.*)<\/FONT>/$1/gi' filename
For example, the following:
<FONT size=5 Color="blue">Something else</font>
Note that the third line is deleted altogether. The /i modifier makes
the search case-insensitive. To understand how this substitution works,
you will need to know a bit about Perl's regular expression syntax. For
the keen, I highly recommend Jeffrey Friedl's excellent book Mastering
Regular Expressions (O'Reilly). The most important thing to note here
is that the $1 variable refers back to what was matched within the
parentheses in the search string. You can use this feature to refer
back to several sub-patterns in your search pattern, each enclosed in a
separate pair of parentheses. Use $1, $2, etc. to do this.
Anthony Boyd wrote on February 10, 2003:
Please note that you will LOSE DATA if you try the Perl one-liner above.
First, it won't match font tags that occur over multiple lines. So if
the open tag is on line 1, and the close tag is on line 2, you have no
match. Thus, after running that Perl one-liner, you might still have
font tags in your HTML. Second, the pattern ".*" matches almost
EVERYTHING. The dot means "any character except a newline" and the star
means "as many as you can." So this line of HTML:
<font size=2>Hi there. <b>Hey! I'm bold!</b> Plain again.</font>
Will get hacked down to this:
 Plain again.
Why did we lose so much of the text? Because <FONT.*> means "match a
less-than symbol, followed by the letters FONT, then match as much as
you possibly can until you find the last possible greater-than symbol."
So <FONT.*> matches all the way to the closing bold tag. Ugh. You
don't want to lose text, and you do want it to find FONT tags that span
multiple lines. So you need to stop using ".*" and add a parameter
(-0777) which will make Perl look at all lines at once. Like this:
perl -0777 -pi -e 's/<\/?FONT[^>]*>//gi' filename
That means, "find a less-than sign (<) followed (optionally) by a
closing slash character, followed by the letters FONT, followed by
anything that is NOT a greater-than sign, followed by the greater-than
sign." In other words, only the opening & closing <FONT> tags. I
believe that will perfectly strip out 99.99% of the font tags in existence.