Word wrap algorithms for Japanese

2022-12-16 16:10 问答作者：

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.

Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.

How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory wa开发者_JS百科y?

Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.

I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.

Below listed projects are useful to resolve Japanese wordwrap (or wordbreak from another point of view).

budou (Python): https://github.com/google/budou
mikan (JS): https://github.com/trkbt10/mikan.js
mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp

mikan has regex-based approach while budou uses natural language processing.

继续阅读：algorithm cjk internationalization unicode word-wrap

Word wrap algorithms for Japanese

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？