Format text with regard to punctuation
How can I format natural-language text so that line breaks take punctuation into account? Vim's built-in gq
command and command-line tools such as fmt or par break lines without regard to punctuation. Here is an example:
fmt -w 40
does not give what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40
would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
Of course, when no punctuation mark falls within the given text width, it can fall back to the standard text formatting behavior.
The reason why I want this is to get a meaningful diff
of text where I can spot which sentence or subsentence changed.
Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That means I'll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the "raggedness" is less than 6 characters. For example, this is OK ("raggedness" is 3 characters):
Wait!
He said.
This is not OK ("raggedness" is more than 6 characters):
Wait!
He said to them.
The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.
Here is the code for this:
sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'
I used " _" (space + underscore) as a pair of dummy characters, assuming they do not occur in the text. The result looks quite good:
we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
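
If the width and the "worth" of a punctuation break need to be adjustable, the same pipeline could be wrapped in a small shell function. This is only a sketch: the name punct_fmt and its two arguments are made up for illustration, and it keeps the assumption that the text contains no underscores.

```shell
# punct_fmt [WIDTH] [PAIRS]  (hypothetical helper, not a standard tool)
# Reads text on stdin and formats it with fmt, favoring line breaks
# after . ? ! and , by temporarily padding them with dummy characters.
# Assumes the input text contains no underscores.
punct_fmt() {
    width=${1:-34}   # width passed to fmt -w
    pairs=${2:-3}    # number of " _" pairs; each pair is worth 2 characters

    # Build the padding string, e.g. " _ _ _" for 3 pairs.
    pad=''
    i=0
    while [ "$i" -lt "$pairs" ]; do
        pad="$pad _"
        i=$((i + 1))
    done

    # Pad each punctuation mark, format, then strip the dummies again.
    sed -e "s/\([.?!,]\)/\1$pad/g" |
        fmt -w "$width" |
        sed -e 's/ _//g' -e 's/_ //g'
}
```

Called as "punct_fmt 34 3 < input.txt", it should behave like the one-liner above; a larger second argument makes breaks at punctuation more attractive at the cost of more ragged lines.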