Determine the language of a Unicode string in Java [duplicate]
If I have a string in Java, how can I determine which language it is written in? Does the Unicode specification allow us to do this?
There is no metadata in a Unicode string that specifies what language the string is in, or whether the string is even a word or phrase.
Based on the characters contained in the string, you may be able to guess what language is being used. For example, the Unicode range U+30A0–U+30FF contains Japanese Katakana characters, so if most of your string consists of characters within that range, you could make an educated guess that it's Japanese. This is not at all reliable, though. For instance, what if it's just random Katakana characters?
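To illustrate that kind of educated guess, here is a minimal sketch that counts which Unicode script the letters of a string belong to, using the standard `Character.UnicodeScript` enum (available since Java 7). The class name `ScriptGuesser` and the method `dominantScript` are made up for this example, and "most frequent script wins" is just one possible heuristic.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

public class ScriptGuesser {

    // Hypothetical helper: returns the script that most letters in the text
    // belong to, or UnicodeScript.UNKNOWN if the text contains no letters.
    public static UnicodeScript dominantScript(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);

        // Iterate over code points (not chars) so supplementary characters
        // are handled correctly; digits, spaces and punctuation are skipped.
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));

        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(dominantScript("\u30AB\u30BF\u30AB\u30CA")); // four Katakana letters
        System.out.println(dominantScript("Hello, world"));
    }
}
```

Keep in mind that this identifies a script, not a language. A result of `KATAKANA` hints at Japanese, but `LATIN` covers English, French, Turkish, Vietnamese and many others, so the same caveat about reliability applies.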
For reliable language detection, I would abandon the idea of using Unicode ranges as the basis and focus on language recognition algorithms instead.