开发者

Word wrap algorithms for Japanese

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.

Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.

How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory wa开发者_JS百科y?


Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.

I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.


Below listed projects are useful to resolve Japanese wordwrap (or wordbreak from another point of view).

  • budou (Python): https://github.com/google/budou
  • mikan (JS): https://github.com/trkbt10/mikan.js
  • mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp

mikan has regex-based approach while budou uses natural language processing.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜