Format text with regard to punctuation

How can I format natural-language text so that line breaks take punctuation into account? Vim's built-in gq command and command-line tools such as fmt or par break lines without regard to punctuation. Here is an example.

fmt -w 40 does not give what I want:

we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way

A hypothetical smart_formatter -w 40 would give:

we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

Of course, there are cases where no punctuation mark falls within the given text width; the formatter can then fall back to the standard text formatting behavior.

The reason I want this is to get a meaningful diff of the text, where I can spot which sentence or clause changed.


Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That is, I'll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the "raggedness" is less than 6 characters. For example, this is OK (the "raggedness" is 3 characters):

Wait!
He said.

This is not OK (the "raggedness" is more than 6 characters):

Wait!
He said to them.

The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.

Here is the code for this:

sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'

I used " _" (space + underscore) as a pair of dummy characters, assuming that underscores do not occur in the text. The result looks quite good:

we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
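
For reuse, the pipeline above could be wrapped in a small shell function. This is only a sketch of the same trick: the name punct_fmt and its width argument are made up for illustration, and it still assumes that the input contains no underscores and that a fmt command is available.

```shell
# punct_fmt [WIDTH]: reflow stdin, preferring line breaks after punctuation.
# Hypothetical helper name; assumes "_" never occurs in the input text,
# because "_" is used as the dummy character.
punct_fmt() {
    width=${1:-34}
    # 1. Append three " _" pairs (6 dummy characters) to each punctuation mark.
    # 2. Let fmt reflow the padded text to the requested width.
    # 3. Strip the padding: " _" inside a line, "_ " left at the start of a
    #    line when fmt broke the padding across two lines.
    sed -e 's/\([.?!,]\)/\1 _ _ _/g' |
        fmt -w "$width" |
        sed -e 's/ _//g' -e 's/_ //g'
}
```

It reads from standard input, for example punct_fmt 34 < chapter.txt (chapter.txt being a placeholder file name). The width is the one handed to fmt, so it includes the dummy characters; the visible lines usually come out somewhat shorter.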