Entry
TSE: Search/Replace: Regular expression: How to replace number + dot + text in begin line? [parser]
Apr 10th, 2005 12:30
Knud van Eeden,
----------------------------------------------------------------------
--- Knud van Eeden --- 10 April 2005 - 09:25 pm ----------------------
---
I want to replace text like
1. cbcbcbcbcbcb
3. fgfgfgfgfgfg
15. kkkrororkk
The numbers stand at the beginning of a line, and I want to delete the
number, the dot and the space, but not the text that follows them.
---
---
Possible solution:
1. Make a backup of your file first
2. From the menu you could use the regular expression:
^{[0-9]#\.[ ]#}{[a-zA-Z]#}{.@}$
Replace this with the contents of the second and third tags:
\2\3
and use the option (x = regular expression):
x
-----------------------------------------------
3. As a macro, you could try:
--- cut here: begin --------------------------------------------------
PROC Main()
LReplace( "^{[0-9]#\.[ ]#}{[a-zA-Z]#}{.@}$", "\2\3", "x" )
END
--- cut here: end ----------------------------------------------------
---
---
If you run this on the following text:
--- cut here: begin --------------------------------------------------
1. cbcbcbcbcbcb "Some other text1"
3. fgfgfgfgfgfg "Some other text2"
15. kkkrororkk "Some other text3"
--- cut here: end ----------------------------------------------------
you get as a result:
--- cut here: begin --------------------------------------------------
cbcbcbcbcbcb "Some other text1"
fgfgfgfgfgfg "Some other text2"
kkkrororkk "Some other text3"
--- cut here: end ----------------------------------------------------
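For readers without TSE at hand, the same replacement can be cross-checked in Python. This is only a sketch under assumptions: it translates the TSE operators as # to +, @ to * and {...} tags to (...) groups (following the "as much as you can" reading given in the note), and the names strip_numbering and sample are illustrative, not part of TSE.

```python
import re

# Python spelling of the TSE pattern ^{[0-9]#\.[ ]#}{[a-zA-Z]#}{.@}$
# Assumed translation: {...} -> (...), # -> +, @ -> *
PATTERN = re.compile(r"^([0-9]+\.[ ]+)([a-zA-Z]+)(.*)$", re.MULTILINE)


def strip_numbering(text: str) -> str:
    """Drop a leading 'number, dot, spaces' prefix from every matching line."""
    # Keep only the second and third groups, like the TSE replacement \2\3.
    return PATTERN.sub(r"\2\3", text)


sample = (
    '1. cbcbcbcbcbcb "Some other text1"\n'
    '3. fgfgfgfgfgfg "Some other text2"\n'
    '15. kkkrororkk "Some other text3"\n'
)

print(strip_numbering(sample), end="")
```

Lines that do not start with a number, a dot and a space are left untouched, because the pattern simply does not match them.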
---
---
Note:
This regular expression reads as follows, in this order:
^ = start at the beginning of the line
[0-9]# = read 1 or more digits, as many as you can
\. = then read a literal dot
[ ]# = read 1 or more spaces, as many as you can
[a-zA-Z]# = read 1 or more alphabetic characters, upper case or
lower case, as many as you can
.@ = read 0 or more arbitrary characters, as many as you can
$ = until the end of the line
The {}{}{} braces delimit the first, second and third tags,
which are referred to as \1, \2 and \3.
Here you remove everything in the first tag
(by replacing the whole match with only the second and third tags, \2\3).
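To see what each tag captures, here is a small Python sketch. It assumes the same translation of the TSE pattern into Python syntax ((...) groups for {...} tags, + for #, * for @); the helper name split_tags is hypothetical.

```python
import re

# Python spelling (assumed) of ^{[0-9]#\.[ ]#}{[a-zA-Z]#}{.@}$
TAGGED = re.compile(r"^([0-9]+\.[ ]+)([a-zA-Z]+)(.*)$")


def split_tags(line: str):
    """Return the three tag contents (\\1, \\2, \\3) as a tuple, or None if no match."""
    match = TAGGED.match(line)
    return match.groups() if match else None


print(split_tags('15. kkkrororkk "Some other text3"'))
# prints ('15. ', 'kkkrororkk', ' "Some other text3"')
```

The first element is what gets thrown away; joining the second and third gives the replacement text.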
---
---
Note:
Syntax form:
---
--- cut here: begin --------------------------------------------------
-->---------------------->--(begin of line)-->----------------------+
                                                                    |
                                                                    v
                                                                    |
+---------------------------<---------------------------------------+
|
v                        +--------<--------+
|                        |                 |
+--------------------->--+-->--[digit]-->--+------------------------+
                                                                    |
                                                                    v
                                                                    |
+---------------------------<---------------------------------------+
|
v
|
+----------------------------->--[.]-->-----------------------------+
                                                                    |
                                                                    v
                                                                    |
+---------------------------<---------------------------------------+
|
v
|                          +------<------+
|                          |             |
+----------------------->--+-->--[ ]-->--+-->-----------------------+
                                                                    |
                                                                    v
                                                                    |
+---------------------------<---------------------------------------+
|
v
|           +---------------------<----------------------+
|           |                                            |
+-------->--+-->--+-->--[upper case character]-->--+-->--+-->-------+
                  |                                |                |
                  +-->--[lower case character]-->--+                v
                                                                    |
                                                                    |
+---------------------------<---------------------------------------+
|
v
|                     +-----------<-------------+
|                     |                         |
+------------------>--+-->--[any character]-->--+-->----------------+
                      |                         |                   |
                      +----------->-------------+                   |
                                                                    v
                                                                    |
+---------------------------<---------------------------------------+
|
v
|
|
+------------------------>--(end of line)-->---
--- cut here: end ----------------------------------------------------
---
---
Note:
A possible pseudo code for a parser program:
--- cut here: begin --------------------------------------------------
// ------------------------------------------
get next 'character'
//
// ------------------------------------------
//
// ^
//
if not ( 'character' equals 'begin of line' )
return
endif
//
get next 'character'
//
// ------------------------------------------
//
// [0-9]#
//
if not ( 'character' equals 'digit' )
return
endif
//
repeat
get next 'character'
until not ( 'character' equals 'digit' )
//
// ------------------------------------------
//
// \.
//
if not ( 'character' equals 'dot' )
return
endif
//
get next 'character'
//
// ------------------------------------------
//
// [ ]#
//
if not ( 'character' equals 'space' )
return
endif
//
repeat
get next 'character'
until not ( 'character' equals 'space' )
//
// ------------------------------------------
//
// [a-zA-Z]#
//
if not ( ( 'character' equals 'upper case character' ) or
( 'character' equals 'lower case character' ) )
return
endif
//
repeat
case:
when ( 'character' equals 'lower case character' )
get next 'character'
when ( 'character' equals 'upper case character' )
get next 'character'
endcase
until not ( ( 'character' equals 'lower case character' ) or
( 'character' equals 'upper case character' ) )
//
// ------------------------------------------
//
// .@
//
while not ( 'character' equals 'end of line' )
get next 'character'
endwhile
//
// ------------------------------------------
//
// $
//
// ------------------------------------------
--- cut here: end ----------------------------------------------------
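The pseudo code above can be tried out directly in Python. The following is a minimal sketch, not TSE code: it works on a string instead of the editor buffer, and the names parse_line, is_digit, is_letter and peek are illustrative assumptions.

```python
def is_digit(c: str) -> bool:
    # True only for the ASCII digits 0..9 (an empty string gives False).
    return "0" <= c <= "9"


def is_letter(c: str) -> bool:
    # True only for a..z and A..Z (an empty string gives False).
    return "a" <= c <= "z" or "A" <= c <= "Z"


def parse_line(line: str) -> bool:
    """Return True if line is: digits, dot, spaces, letters, then anything."""
    pos = 0  # starting at index 0 plays the role of '^'

    def peek() -> str:
        # Current character, or "" once the end of the line is reached.
        return line[pos] if pos < len(line) else ""

    # [0-9]#  : at least one digit, then as many as possible
    if not is_digit(peek()):
        return False
    while is_digit(peek()):
        pos += 1

    # \.      : a literal dot
    if peek() != ".":
        return False
    pos += 1

    # [ ]#    : at least one space, then as many as possible
    if peek() != " ":
        return False
    while peek() == " ":
        pos += 1

    # [a-zA-Z]# : at least one letter, then as many as possible
    if not is_letter(peek()):
        return False
    while is_letter(peek()):
        pos += 1

    # .@ and $ : whatever is left up to the end of the line always matches
    return True


print(parse_line('1. cbcbcbcbcbcb "Some other text1"'))  # prints True
print(parse_line("cbcbcbcbcbcb"))                        # prints False
```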
---
---
A possible TSE lexical analyzer + parser program:
---
Automating the creation of this program:
The idea is that each part of the regular expression can
be seen as a building block.
So you build your program by concatenating these
standard blocks
---
(e.g.
'repeat' block,
'while' block,
'case' or 'or' block,
'and' block,
'if' block,
...)
one after the other.
---
In this particular example you have
(by splitting the given regular expression
into its atomic parts)
^
|
v
|
[0-9]#
|
v
|
\.
|
v
|
[ ]#
|
v
|
[a-zA-Z]#
|
v
|
.@
|
v
|
$
---
or, expressed as blocks, an
'if' block
followed by a
'repeat' block
followed by an
'if' block
followed by a
'repeat' block
followed by a
'repeat' block with a 'case' block in it
followed by a
'while' block
followed by an
'if' block.
---
Thus the structure of the blocks concatenated looks like:
[if]
|
v
|
[repeat]
|
v
|
[if]
|
v
|
[repeat]
|
v
|
[repeat]
|
v
|
[while]
|
v
|
[if]
---
If you reach the end of the blocks while parsing,
your line has parsed OK; otherwise you jump back
to your starting position (using e.g. PopPosition()).
So for each of the elementary parsing actions
(looking at the regular expression) you should write a separate block,
which you store in a library.
When you have to create a program that is the equivalent
of a regular expression (e.g. if you want to parse
over several lines, which a regular expression sometimes
cannot do), you can then easily create it
just by putting the appropriate building blocks
one after the other.
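The building-block idea can be sketched in Python as a tiny library of functions. This is an illustration under assumptions, not the author's library: every name here (Block, begin_of_line, one_or_more, zero_or_more, literal, end_of_line, sequence) is hypothetical. Each block takes a text and a position and returns the new position, or None when it fails, so the caller keeps its original position and has nothing to undo (the role PopPosition() plays in the editor).

```python
from typing import Callable, Optional

# A block takes (text, pos) and returns the new position, or None on failure.
Block = Callable[[str, int], Optional[int]]


def begin_of_line() -> Block:
    # '^' : succeeds only at the very first position, consumes nothing.
    return lambda text, pos: pos if pos == 0 else None


def end_of_line() -> Block:
    # '$' : succeeds only when everything has been consumed.
    return lambda text, pos: pos if pos == len(text) else None


def literal(ch: str) -> Block:
    # A single fixed character such as '\.'
    return lambda text, pos: pos + 1 if text[pos:pos + 1] == ch else None


def zero_or_more(pred: Callable[[str], bool]) -> Block:
    # '@' : consume as many matching characters as possible, possibly none.
    def block(text: str, pos: int) -> Optional[int]:
        while pos < len(text) and pred(text[pos]):
            pos += 1
        return pos
    return block


def one_or_more(pred: Callable[[str], bool]) -> Block:
    # '#' : like zero_or_more, but at least one character must match.
    many = zero_or_more(pred)

    def block(text: str, pos: int) -> Optional[int]:
        end = many(text, pos)
        return end if end is not None and end > pos else None
    return block


def sequence(*blocks: Block) -> Block:
    # Concatenate blocks: run them one after the other, stop at first failure.
    def block(text: str, pos: int) -> Optional[int]:
        for step in blocks:
            pos = step(text, pos)
            if pos is None:
                return None
        return pos
    return block


# ^ [0-9]# \. [ ]# [a-zA-Z]# .@ $
numbered_line = sequence(
    begin_of_line(),
    one_or_more(lambda c: "0" <= c <= "9"),
    literal("."),
    one_or_more(lambda c: c == " "),
    one_or_more(lambda c: "a" <= c <= "z" or "A" <= c <= "Z"),
    zero_or_more(lambda c: True),
    end_of_line(),
)

line = '1. cbcbcbcbcbcb "Some other text1"'
print(numbered_line(line, 0) == len(line))  # prints True
print(numbered_line("no number", 0))        # prints None
```

Note that these blocks never backtrack. That is enough for this pattern, because each character class is followed by something it cannot match itself; a general regular expression engine needs more than this.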
---
--- cut here: begin --------------------------------------------------
PROC Main()
STRING s[255] = ""
//
PushPosition() // store starting position to possibly jump back to
//
// ------------------------------------------
//
// ^
//
if not ( CurrPos() == 1 )
Warn( "Please put the cursor at begin of line" )
PopPosition() // search failed, so go back to starting position
return()
endif
Warn( "begin of line" )
//
s = GetText( CurrPos(), 1 )
//
// ------------------------------------------
//
// [0-9]#
//
if not ( s IN '0'..'9' )
PopPosition() // search failed, so go back to starting position
return()
endif
//
repeat
//
Warn( "digit" )
//
Right()
s = GetText( CurrPos(), 1 )
//
until not ( s IN '0'..'9' )
//
// ------------------------------------------
//
// \.
//
if not ( s IN '.' )
PopPosition() // search failed, so go back to starting position
return()
endif
//
Warn( "dot" )
//
Right()
s = GetText( CurrPos(), 1 )
//
// ------------------------------------------
//
// [ ]#
//
if not ( s IN ' ' )
PopPosition() // search failed, so go back to starting position
return()
endif
//
repeat
//
Warn( "space" )
//
Right()
s = GetText( CurrPos(), 1 )
//
until not ( s IN ' ' )
//
// ------------------------------------------
//
// [a-zA-Z]#
//
if not ( ( s IN 'a'..'z' ) or ( s IN 'A'..'Z' ) )
PopPosition() // search failed, so go back to starting position
return()
endif
//
repeat
//
case s
when 'a'..'z'
//
Warn( "lower case character" )
//
Right()
s = GetText( CurrPos(), 1 )
//
when 'A'..'Z'
//
Warn( "upper case character" )
//
Right()
s = GetText( CurrPos(), 1 )
//
endcase
until not ( ( s IN 'a'..'z' ) or ( s IN 'A'..'Z' ) )
//
// ------------------------------------------
//
// .@
//
while not ( CurrPos() > CurrLineLen() )
//
Warn( "any character" )
//
Right()
s = GetText( CurrPos(), 1 )
//
endwhile
//
// ------------------------------------------
//
// $
//
Warn( "end of line" )
// ------------------------------------------
KillPosition()
Warn( "parsed line is OK" )
END
<F12> Main()
--- cut here: end ----------------------------------------------------
If you run this program with the cursor at the beginning of the line
--- cut here: begin --------------------------------------------------
1. cbcbcbcbcbcb "Some other text1"
--- cut here: end ----------------------------------------------------
it will display the following messages, one at a time:
--- cut here: begin --------------------------------------------------
begin of line
digit
dot
space
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
lower case character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
any character
end of line
parsed line is OK
--- cut here: end ----------------------------------------------------
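The long message list can be reproduced and condensed with a short Python sketch. It is an assumption-laden stand-in for the macro, not TSE code: it collects the labels the macro passes to Warn() in a list (leaving out the final "parsed line is OK"), and the names STAGES, trace and summarize are illustrative.

```python
from itertools import groupby


def digit(c):
    return "digit" if "0" <= c <= "9" else None


def dot(c):
    return "dot" if c == "." else None


def space(c):
    return "space" if c == " " else None


def letter(c):
    if "a" <= c <= "z":
        return "lower case character"
    if "A" <= c <= "Z":
        return "upper case character"
    return None


def any_char(c):
    return "any character"


# (classifier, minimum count, maximum count or None for unlimited)
STAGES = [
    (digit, 1, None),     # [0-9]#
    (dot, 1, 1),          # \.
    (space, 1, None),     # [ ]#
    (letter, 1, None),    # [a-zA-Z]#
    (any_char, 0, None),  # .@
]


def trace(line):
    """Return one label per parsing step, or None if the line does not match."""
    labels = ["begin of line"]
    pos = 0
    for classify, least, most in STAGES:
        count = 0
        while pos < len(line) and (most is None or count < most):
            label = classify(line[pos])
            if label is None:
                break
            labels.append(label)
            pos += 1
            count += 1
        if count < least:
            return None
    # The last stage accepts any character, so the whole line is consumed here.
    labels.append("end of line")
    return labels


def summarize(labels):
    """Collapse runs of identical labels into (label, run length) pairs."""
    return [(label, len(list(run))) for label, run in groupby(labels)]


for label, count in summarize(trace('1. cbcbcbcbcbcb "Some other text1"')):
    print(f"{count:2d} x {label}")
```

For the sample line this reports 12 lower case characters followed by 19 "any character" steps, which agrees with the list above.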
---
---
Internet: see also:
---
TSE: Search/Replace: Regular expression: Link: Can you give overview
links regular expressions?
http://www.faqts.com/knowledge_base/view.phtml/aid/31433/fid/865
----------------------------------------------------------------------