TSE: Text: Wordwrap: Hyphen: Remove: How to dehyphenate wordwrapped text? ['-' / join]
Dec 10th, 2005 16:32
Knud van Eeden,
----------------------------------------------------------------------
--- Knud van Eeden --- 19 May 2005 - 04:58 pm ------------------------
---
You may want to remove the word-wrap hyphens '-' at the end of
the lines.
===
A quick manual method is:
Steps: Overview:
 1. -Search for the hyphen at the end of the line
      \-$
 2. -Use the regular expression search option
      x
 3. -Delete this hyphen character by pressing the <Delete> key
      -
 4. -Join the next line by pressing the <Delete> key again
 5. -Repeat this, e.g. using a keyboard macro
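---
The same search-and-join could also be sketched outside TSE, e.g. in
Python. This is only a minimal sketch: the function name
remove_eol_hyphens is made up for illustration, and like the manual
steps it removes every hyphen at the end of a line without looking at
the next line.

```python
import re


def remove_eol_hyphens(text):
    """Delete each hyphen that ends a line, together with the line break."""
    # '-' followed by a newline corresponds to the editor search '\-$';
    # replacing both with nothing deletes the hyphen and joins the next line.
    return re.sub(r"-\n", "", text)


joined = remove_eol_hyphens("This is an example of de-\nhyphenation.\n")
```

Here joined holds the single line "This is an example of dehyphenation."
followed by a line break.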
---
To remove the hyphen in your wordwrapped text,
you could search for a hyphen '-' at the end of a line.
===
If the next line starts with a word (letters or digits), then you
remove that hyphen, and join the current line with the next line.
===
e.g.
before: text containing word-wrap hyphens '-' at the end of the lines.
This is typical of the text you get after scanning a magazine page
and then e.g. using OCR to extract the text from that page.
---
--- cut here: begin --------------------------------------------------
This is an example of de-
hyphenation.
This might be use-
ful, in OCR for ex-
ample.
--- cut here: end ----------------------------------------------------
---
after: the hyphens '-' at the end of the lines are now removed,
and the lines are joined.
---
--- cut here: begin --------------------------------------------------
This is an example of dehyphenation.
This might be useful, in OCR for example.
--- cut here: end ----------------------------------------------------
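---
The before/after example could be reproduced line by line, e.g. in
Python. This is a sketch under assumptions: the function name
join_hyphenated is hypothetical, and "starts with characters" is
taken to mean that the next line starts with a letter or a digit.

```python
def join_hyphenated(lines):
    """Join a line ending in '-' with the line after it.

    The join only happens when the following line starts with a
    letter or digit (an assumption about what counts as a word start).
    """
    result = []
    for line in lines:
        if result and result[-1].endswith("-") and line[:1].isalnum():
            # Drop the trailing hyphen and glue the two lines together.
            result[-1] = result[-1][:-1] + line
        else:
            result.append(line)
    return result


before = [
    "This is an example of de-",
    "hyphenation.",
    "This might be use-",
    "ful, in OCR for ex-",
    "ample.",
]
after = join_hyphenated(before)
```

Because a joined line is checked again, a word-wrap that runs over
several lines (use- / ful, in OCR for ex- / ample.) collapses into one
line, as in the "after" text above.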
===
Finding the right hyphen
To decide which hyphens can safely be removed, you might use the
following description of the text in a block:
---
Backus Naur Form: in words:
1. A block is one or more lines.
2. A line is either a wordwrap line or a non wordwrap line.
3. A wordwrap line is possibly terminated by a hyphen.
4. If the next line starts with a word
   (that is, starting with an alphabetic character,
    followed by zero or more alphabetic characters
    or digits,
    followed by either
     the end of the line, or
     a delimiter like
      space,
      dot,
      comma,
      semicolon,
      colon,
      question mark,
      exclamation mark,
      single quote,
      double quote)
   then it is assumed it was a wordwrap.
   So you remove the hyphen '-' at the end of the line,
   and you join the lines.
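---
The "next line starts with a word" test in the last rule could be
written as a regular expression, e.g. in Python. This is a sketch, not
the TSE syntax: the names NEXT_LINE_WORD and starts_with_word are
hypothetical, and it assumes (as in the syntax diagram for nextline)
that a word starts with an ASCII letter and continues with ASCII
letters or digits.

```python
import re

# A letter, then zero or more letters or digits, then either one of the
# listed delimiters or the end of the line.
NEXT_LINE_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[ .,;:?!'\"]|$)")


def starts_with_word(line):
    """Return True if the line begins with a word followed by a delimiter."""
    # match() anchors at the start of the line, so leading punctuation fails.
    return NEXT_LINE_WORD.match(line) is not None
```

With this test, "hyphenation." and "ful, in OCR for ex-" count as word
starts, while a line such as "- item" or "foo_bar" does not.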
===
Backus Naur Form
---
block =
     +--------<-------+
     |                |
-->--+-->--[line]-->--+-->--
---
line =
-->--+-->--[wordwrap line]-->------+
     |                             |
     |                             |
     +-->--[non wordwrap line]-->--+-->--
---
wordwrapline with hyphen =
->-[some characters]->-+
                       |
                       |
 +---------------------+
 |
 |
 |    +->-[a..z]->-+
 |    |            |
 +->--|            +->--(-)-->-[end of line]->-[next line]->-
      |            |
      +->-[A..Z]->-+
---
nextline =
                                         +->--(space)--->-+
                                         |                |
                  +---------<--------+   +->--(!)------->-+
                  |   +->-[a..z]->-+ |   |                |
   +->-[a..z]->-+ |   |            | |   +->--(.)------->-+
   |            | |   |            | |   |                |
->-|            +-+->-+->-[A..Z]->-+-+->-+->--(,)------->-+->-
   |            |     |            |     |                |
   +->-[A..Z]->-+     |            |     +->--(?)------->-+
                      +->-[0..9]->-+     |                |
                                         +->--(;)------->-+
                                         |                |
                                         +->--(:)------->-+
                                         |                |
                                         +->--(')------->-+
                                         |                |
                                         +->--(")------->-+
                                         |                |
                                         +->-[end line]->-+
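---
The two diagrams (wordwrapline with hyphen, nextline) could be combined
into one search over a whole block, e.g. in Python. This is a hedged
sketch, not TSE code: the names WORDWRAP_HYPHEN and dehyphenate_block
are hypothetical, only ASCII letters and digits are considered, and the
word on the next line is allowed to be a single letter (zero or more
further characters, as in the description in words).

```python
import re

WORDWRAP_HYPHEN = re.compile(
    # wordwrapline: a letter, then the hyphen, then the end of the line
    r"(?<=[A-Za-z])-\n"
    # nextline: a word followed by a delimiter or by the end of the line
    r"(?=[A-Za-z][A-Za-z0-9]*(?:[ !.,?;:'\"]|$))",
    re.MULTILINE,
)


def dehyphenate_block(text):
    """Remove word-wrap hyphens in a block and join the affected lines."""
    # Only the hyphen and the line break are consumed; the look-behind and
    # look-ahead merely check the characters around them.
    return WORDWRAP_HYPHEN.sub("", text)


block = (
    "This is an example of de-\n"
    "hyphenation.\n"
    "This might be use-\n"
    "ful, in OCR for ex-\n"
    "ample.\n"
)
cleaned = dehyphenate_block(block)
```

A hyphen that does not follow a letter (e.g. "pages 3-" followed by
"4 remain") is left alone by this sketch, because it does not fit the
wordwrapline diagram.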
---
---
Internet: see also:
---
TSE: File: Search: How to search/replace over multiple lines?
[dehyphenation / regular expression]
http://www.faqts.com/knowledge_base/view.phtml/aid/36302/fid/865
----------------------------------------------------------------------