Is there a tool for splitting German compound words in Java?
I am successfully splitting sentences into words with a StringTokenizer.
Is there a tool which is able to split compound words like Projektüberwachung into their parts Projekt and überwachung, or even some longer ones?
The reason for splitting the compound words is that I want to do text extraction. I want to convert phrases like Projektplanung und -überwachung into the two parts Projektplanung and Projektüberwachung. Splitting the compound word is my first step.
JWordSplitter
Randomly saw this on synaptic this morning. Here is the description from the site:
"jWordSplitter is a small Java library that splits compound words into their parts. This is especially useful for languages like German where an infinite number of new words can be formed by just appending nouns ("Donaudampfschifffahrtskapitän")."
Usage is as simple as this:
String word = "Donaudampfschifffahrtskapitän";
AbstractWordSplitter splitter = new GermanWordSplitter();
Collection<String> splittedWords = splitter.splitWord(word);
Unfortunately, there is no pre-built library in the download section, but it is easy to build. Here is a short description of how to do this in three simple steps; a minimal usage sketch follows the steps.
Check out the sources via SVN:
svn co https://jwordsplitter.svn.sourceforge.net/svnroot/jwordsplitter/trunk jwordsplitter
Open the Maven project, e.g. in NetBeans
Build the library, which includes the dictionary (jwordsplitter-3.2.jar, ~300 kB)
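Once the jar is built and on the classpath, a minimal self-contained program could look like the sketch below. The package name de.danielnaber.jwordsplitter is an assumption based on the project's documentation, and newer releases may expect a boolean flag in the GermanWordSplitter constructor, so treat this as a starting point rather than a definitive example.

// Minimal sketch, assuming the classes live in de.danielnaber.jwordsplitter
// (newer jWordSplitter releases may require new GermanWordSplitter(true) instead).
import java.util.Collection;

import de.danielnaber.jwordsplitter.AbstractWordSplitter;
import de.danielnaber.jwordsplitter.GermanWordSplitter;

public class SplitterDemo {
    public static void main(String[] args) throws Exception {
        AbstractWordSplitter splitter = new GermanWordSplitter();
        // Splits the compound into the parts found in the bundled dictionary.
        Collection<String> parts = splitter.splitWord("Donaudampfschifffahrtskapitän");
        System.out.println(parts);
    }
}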
I have always had a great dislike for the type of hyphenation in your example: Projektplanung und -überwachung. :-( So even though I agree with JB Nizet that without a list or dictionary of simple non-compound nouns there is no way to know, maybe there is a way to make an intelligent guess, in German at least. Let's reunite Projekt and -überwachung!
You could create a list of consonant clusters and note where these clusters divide, e.g. ktpl in the first word of the pair would divide as kt-pl. Geschwindigkeitsbegrenzung has tsb, which divides as ts-b. I haven't thought it all the way through, and additional meta-data may be necessary.
The algorithm would find the most "centrally located" consonant cluster in the word. E.g. it would ignore 'schw', 'nd', 'gr', and 'nz' and look to 'tsb' in Geschwindigkeitsbegrenzung.
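To make the idea concrete, here is a rough sketch of that guess in Java. The cluster list, its split offsets, and all names are illustrative assumptions, not real linguistic data; the additional meta-data mentioned above would be needed for anything serious.

// Rough sketch of the consonant-cluster guess described above.
// The cluster table is a hypothetical example, not real linguistic data.
import java.util.LinkedHashMap;
import java.util.Map;

public class ClusterSplitGuess {

    // Known clusters mapped to the offset at which they divide, e.g. "ktpl" -> 2 means kt-pl.
    private static final Map<String, Integer> CLUSTERS = new LinkedHashMap<>();
    static {
        CLUSTERS.put("ktpl", 2); // Projekt-planung
        CLUSTERS.put("tsb", 2);  // Geschwindigkeits-begrenzung
    }

    // Returns a two-element split guess, preferring the cluster closest to the
    // middle of the word, or null if no known cluster is found.
    public static String[] guessSplit(String word) {
        String lower = word.toLowerCase();
        int center = word.length() / 2;
        int bestPos = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : CLUSTERS.entrySet()) {
            int idx = lower.indexOf(e.getKey());
            while (idx >= 0) {
                int distance = Math.abs(idx + e.getKey().length() / 2 - center);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    bestPos = idx + e.getValue();
                }
                idx = lower.indexOf(e.getKey(), idx + 1);
            }
        }
        return bestPos < 0 ? null : new String[] { word.substring(0, bestPos), word.substring(bestPos) };
    }

    public static void main(String[] args) {
        // Expected guess: "Geschwindigkeits" + "begrenzung"
        String[] parts = guessSplit("Geschwindigkeitsbegrenzung");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}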
Lucene has a token filter that can decompose compound words. Perhaps this could suit your needs?
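The filter in question is Lucene's compound-word token filter family (e.g. DictionaryCompoundWordTokenFilter, which splits tokens against a dictionary of simple words). The sketch below is written against a roughly Lucene 5/6-era API; package names and constructor signatures differ between Lucene versions, so check the release you actually use.

// Sketch against a Lucene 5/6-style API; class locations vary across versions.
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class LuceneDecompoundDemo {
    public static void main(String[] args) throws Exception {
        // Dictionary of simple (non-compound) words; a real dictionary would be much larger.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("projekt", "überwachung"), true);
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("Projektüberwachung"));
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        // Emits the original token plus each dictionary part found inside it.
        while (stream.incrementToken()) {
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}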