Is there a tool for splitting German compound words in Java?
I am successfully splitting sentences into words with a StringTokenizer.
Is there a tool which is able to split compound words like Projektüberwachung into their parts Projekt and überwachung, or even some longer ones?
The reason for splitting the compound words is that I want to do text extraction. I want to convert phrases like Projektplanung und -überwachung into the two parts Projektplanung and Projektüberwachung. Splitting the compound word is my first step.
JWordSplitter
Randomly saw this on synaptic this morning. Here is the description from the site:
"jWordSplitter is a small Java library that splits compound words into their parts. This is especially useful for languages like German where an infinite number of new words can be formed by just appending nouns ("Donaudampfschifffahrtskapitän")."
Usage is as simple as this:
String word = "Donaudampfschifffahrtskapitän";
AbstractWordSplitter splitter = new GermanWordSplitter();
Collection<String> splittedWords = splitter.splitWord(word);
Unfortunately, there is no pre-built library in the download section, but it is easy to build. Here is a short description of how to do this in three simple steps; a minimal usage sketch follows the steps.
Check out the sources via SVN:
svn co https://jwordsplitter.svn.sourceforge.net/svnroot/jwordsplitter/trunk jwordsplitter
Open the Maven project, e.g. in NetBeans
Build the library, which includes the dictionary (jwordsplitter-3.2.jar, ~300 kB)
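Once the jar is built and on the classpath, a minimal self-contained program could look like the sketch below. The package name de.danielnaber.jwordsplitter is an assumption based on the project's documentation, and newer releases may expect a boolean flag in the GermanWordSplitter constructor, so treat this as a starting point rather than a definitive example.

// Minimal sketch, assuming the classes live in de.danielnaber.jwordsplitter
// (newer jWordSplitter releases may require new GermanWordSplitter(true) instead).
import java.util.Collection;

import de.danielnaber.jwordsplitter.AbstractWordSplitter;
import de.danielnaber.jwordsplitter.GermanWordSplitter;

public class SplitterDemo {
    public static void main(String[] args) throws Exception {
        AbstractWordSplitter splitter = new GermanWordSplitter();
        // Splits the compound into the parts found in the bundled dictionary.
        Collection<String> parts = splitter.splitWord("Donaudampfschifffahrtskapitän");
        System.out.println(parts);
    }
}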
I have always had a great dislike for the type of hyphenation in your example: Projektplanung und -überwachung. :-( So even though I agree with JB Nizet that without a list or dictionary of simple non-compound nouns there is no way to know, maybe there is a way to make an intelligent guess, in German at least. Let's reunite Projekt and -überwachung!
You could create a list of consonant clusters and note where these clusters divide, e.g. ktpl in the first word of the pair would divide as kt-pl. Geschwindigkeitsbegrenzung has tsb, which divides as ts-b. I haven't thought it all the way through, and additional meta-data may be necessary.
The algorithm would find the most "centrally located" consonant cluster in the word. E.g. it would ignore 'schw', 'nd', 'gr', and 'nz' and look to 'tsb' in Geschwindigkeitsbegrenzung.
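To make the idea concrete, here is a rough sketch of that guess in Java. The cluster list, its split offsets, and all names are illustrative assumptions, not real linguistic data; the additional meta-data mentioned above would be needed for anything serious.

// Rough sketch of the consonant-cluster guess described above.
// The cluster table is a hypothetical example, not real linguistic data.
import java.util.LinkedHashMap;
import java.util.Map;

public class ClusterSplitGuess {

    // Known clusters mapped to the offset at which they divide, e.g. "ktpl" -> 2 means kt-pl.
    private static final Map<String, Integer> CLUSTERS = new LinkedHashMap<>();
    static {
        CLUSTERS.put("ktpl", 2); // Projekt-planung
        CLUSTERS.put("tsb", 2);  // Geschwindigkeits-begrenzung
    }

    // Returns a two-element split guess, preferring the cluster closest to the
    // middle of the word, or null if no known cluster is found.
    public static String[] guessSplit(String word) {
        String lower = word.toLowerCase();
        int center = word.length() / 2;
        int bestPos = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : CLUSTERS.entrySet()) {
            int idx = lower.indexOf(e.getKey());
            while (idx >= 0) {
                int distance = Math.abs(idx + e.getKey().length() / 2 - center);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    bestPos = idx + e.getValue();
                }
                idx = lower.indexOf(e.getKey(), idx + 1);
            }
        }
        return bestPos < 0 ? null : new String[] { word.substring(0, bestPos), word.substring(bestPos) };
    }

    public static void main(String[] args) {
        // Expected guess: "Geschwindigkeits" + "begrenzung"
        String[] parts = guessSplit("Geschwindigkeitsbegrenzung");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}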
Lucene has a token filter that can decompose compound words. Perhaps this could suit your needs?
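The filter in question is Lucene's compound-word token filter family (e.g. DictionaryCompoundWordTokenFilter, which splits tokens against a dictionary of simple words). The sketch below is written against a roughly Lucene 5/6-era API; package names and constructor signatures differ between Lucene versions, so check the release you actually use.

// Sketch against a Lucene 5/6-style API; class locations vary across versions.
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class LuceneDecompoundDemo {
    public static void main(String[] args) throws Exception {
        // Dictionary of simple (non-compound) words; a real dictionary would be much larger.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("projekt", "überwachung"), true);
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("Projektüberwachung"));
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        // Emits the original token plus each dictionary part found inside it.
        while (stream.incrementToken()) {
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}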