How do I tell what language a plain-text file is written in? [closed]
Suppose we have a text file with the content: "Je suis un beau homme ..."
another with: "I am a brave man"
the third with a text in German: "Guten Morgen. Wie geht's?"
How do we write a function that would tell us, with some probability, that the text in the first file is in French, the text in the second is in English, and so on?
Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.
My comments
- There's one small comment I need to add. The text may contain phrases in different languages, either as part of the whole or as a result of a mistake. Classic literature has many examples, because members of the aristocracy were multilingual. So a probability describes the situation better, as most parts of the text are in one language while others may be written in another.
- Google API - Internet Connection. I would prefer not to use remote functions/services, as I need to do it myself or use a downloadable library. I'd like to do some research on that topic.
There is a package called JLangDetect which seems to do exactly what you want:
langof("un texte en français") = fr : OK
langof("a text in english") = en : OK
langof("un texto en español") = es : OK
langof("un texte un peu plus long en français") = fr : OK
langof("a text a little longer in english") = en : OK
langof("a little longer text in english") = en : OK
langof("un texto un poco mas largo en español") = es : OK
langof("J'aime les bisounours !") = fr : OK
langof("Bienvenue à Montmartre !") = fr : OK
langof("Welcome to London !") = en : OK
// ...
Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
Language detection by Google: http://code.google.com/apis/ajaxlanguage/documentation/#Detect
For larger corpora of text you usually use the distribution of letters, digraphs and even trigraphs, and compare it with the known distributions for the languages you want to detect.
However, a single sentence is very likely too short to yield any useful statistical measures. You may have more luck with matching individual words with a dictionary, then.
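To illustrate the distribution comparison described above, here is a minimal sketch that counts character trigrams and compares the counts by cosine similarity. The class and method names are hypothetical (they come from no library mentioned in this thread), and the choice of cosine similarity is an assumption; real reference profiles would be built from large corpora, not one-line samples.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: a character-trigram profile compared by cosine similarity.
final class TrigramProfile {
    private final Map<String, Integer> counts = new HashMap<>();

    // Build a profile by counting every overlapping 3-character window.
    static TrigramProfile of(String text) {
        TrigramProfile profile = new TrigramProfile();
        // Normalise case and whitespace, and pad so word edges form trigrams too.
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        for (int i = 0; i + 3 <= s.length(); i++) {
            profile.counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return profile;
    }

    // Cosine similarity between the two count vectors, in the range [0, 1].
    double similarity(TrigramProfile other) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int a = e.getValue();
            int b = other.counts.getOrDefault(e.getKey(), 0);
            dot += (double) a * b;
            normA += (double) a * a;
        }
        for (int v : other.counts.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the key of the reference profile that is closest to the input text.
    static String closest(String text, Map<String, TrigramProfile> references) {
        TrigramProfile input = of(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, TrigramProfile> e : references.entrySet()) {
            double score = input.similarity(e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

You would fill the reference map once per language, e.g. references.put("en", TrigramProfile.of(englishSample)), and then call TrigramProfile.closest(text, references). The similarity scores themselves can serve as a rough confidence measure, though they are not probabilities.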
NGramJ seems to be a bit more up-to-date:
http://ngramj.sourceforge.net/
It also has both character-oriented and byte-oriented profiles, so it should be able to identify the character set too.
For documents in multiple languages you need to identify the character set (ICU4J has a CharsetDetector that can do this), then split the text on something reasonable like multiple line breaks, or paragraphs if the text is marked up.
Try Nutch's Language Identifier. It is trained with n-gram profiles of languages, and the input text is matched against the profiles of the available languages. Interestingly, you can add more languages if you need to.
Look up Markov chains.
Basically you will need statistically significant samples of the languages you want to recognize. When you get a new file, see what the frequencies of specific syllables or phonemes are, and compare them to the pre-calculated samples. Pick the closest one.
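A rough sketch of the Markov chain idea, under some loud assumptions: it works on pairs of characters instead of syllables or phonemes (which are much harder to extract from plain text), it uses add-one smoothing over an assumed alphabet of all UTF-16 code units, and the class name is made up for this example. You train one model per language and pick the one that gives the new text the highest log-likelihood.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: a first-order Markov chain over characters.
final class MarkovLanguageModel {
    // Assumption: smooth over every possible UTF-16 code unit.
    private static final int ALPHABET_SIZE = 65536;

    private final Map<String, Integer> pairCounts = new HashMap<>();
    private final Map<Character, Integer> firstCounts = new HashMap<>();

    // Count each character and the character that follows it in the sample.
    void train(String sample) {
        String s = sample.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 1 < s.length(); i++) {
            pairCounts.merge(s.substring(i, i + 2), 1, Integer::sum);
            firstCounts.merge(s.charAt(i), 1, Integer::sum);
        }
    }

    // Sum of log P(next | current) over the text, with add-one smoothing
    // so that unseen transitions do not produce log(0).
    double logLikelihood(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        double total = 0;
        for (int i = 0; i + 1 < s.length(); i++) {
            int pair = pairCounts.getOrDefault(s.substring(i, i + 2), 0);
            int first = firstCounts.getOrDefault(s.charAt(i), 0);
            total += Math.log((pair + 1.0) / (first + ALPHABET_SIZE));
        }
        return total;
    }
}
```

Log-likelihoods are only comparable between models scored on the same text, so compare en.logLikelihood(text) with fr.logLikelihood(text) and not scores of different texts.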
Although a more complicated solution than you are looking for, you could use Vowpal Wabbit and train it with sentences from different languages.
In theory you could get back a language for every sentence in your documents.
http://hunch.net/~vw/
(Don't be fooled by the "online" in the project's subtitle - that's just mathspeak for learning without having to hold the whole training material in memory.)
If you are interested in the mechanism by which language detection can be performed, I refer you to the following article (python based) that uses a (very) naive method but is a good introduction to this problem in particular and machine learning (just a big word) in general.
For Java implementations, JLangDetect and Nutch as suggested by the other posters are pretty good. Also take a look at LingPipe, JTCL and NGramJ.
For the problem where you have multiple languages in the same page, you can use a sentence boundary detector to chop a page into sentences and then attempt to identify the language of each sentence. Assuming that a sentence contains only one (primary) language, you should still get good results with any of the above implementations.
Note: A sentence boundary detector (SBD) is theoretically language-specific (a chicken-and-egg problem, since you need one for the other). But for Latin-script languages (English, French, German, etc.) that primarily use periods (apart from exclamation marks etc.) to delimit sentences, you will get acceptable results even if you use an SBD designed for English. I wrote a rules-based English SBD that has worked really well for French text. For implementations, take a look at OpenNLP.
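As a small sketch of the sentence-splitting step, the following uses the JDK's own java.text.BreakIterator with an English locale in place of OpenNLP or a hand-written SBD; that substitution is my assumption, chosen only to keep the example dependency-free. The class name is hypothetical. Each returned sentence can then be passed to whichever language detector you use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: split text into sentences with an English-rules detector.
final class SentenceSplitter {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

For example, SentenceSplitter.split("I am a brave man. Je suis un homme. Guten Morgen.") yields three sentences, one per language.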
An alternative option to using the SBD is to use a sliding window of say 10 tokens (whitespace delimited) to create a pseudo-sentence (PS) and try and identify the border where the language changes. This has the disadvantage that if your entire document has n tokens, you will perform approximately n-10 classification operations on strings of length 10 tokens each. In the other approach, if the average sentence has 10 tokens, you would have performed approximately n/10 classification operations. If n = 1000 words in a document, you are comparing 990 operations versus 100 operations: an order of magnitude difference.
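The sliding-window variant can be sketched as follows. The class and method names are made up for illustration, and the window size is a parameter instead of a fixed 10. Note that the exact number of windows is n - windowSize + 1, so 1000 tokens with a window of 10 give 991 pseudo-sentences, in line with the "approximately n-10" estimate above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: overlapping pseudo-sentences of windowSize tokens.
final class SlidingWindow {
    static List<String> pseudoSentences(String text, int windowSize) {
        List<String> windows = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || windowSize <= 0) {
            return windows;
        }
        // Tokens are whitespace-delimited, as in the description above.
        String[] tokens = trimmed.split("\\s+");
        // Advance one token at a time; texts shorter than the window give no windows.
        for (int start = 0; start + windowSize <= tokens.length; start++) {
            windows.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + windowSize)));
        }
        return windows;
    }
}
```

Classifying each window and looking for the position where the detected language flips gives an estimate of the language border, at the cost of roughly ten times as many classifications as the sentence-based approach.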
If you have short phrases (under 20 characters), the accuracy of language detection is poor in my experience, particularly for proper nouns and for nouns that are the same across languages, like "chocolate". E.g., is "New York" an English word or a French word if it appears in a French sentence?
Do you have a connection to the internet? If you do, then the Google Language API would be perfect for you.
// This example request includes an optional API key which you will need to
// remove or replace with your own key.
// Read more about why it's useful to have an API key.
// The request also includes the userip parameter which provides the end
// user's IP address. Doing so will help distinguish this legitimate
// server-side traffic from traffic which doesn't come from an end-user.
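// Note: this sample URL calls the web search service; a language-detection
// request would target the API's language service instead.
URL url = new URL(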
"http://ajax.googleapis.com/ajax/services/search/web?v=1.0&"
+ "q=Paris%20Hilton&key=INSERT-YOUR-KEY&userip=USERS-IP-ADDRESS");
URLConnection connection = url.openConnection();
connection.addRequestProperty("Referer", "INSERT-YOUR-SITE-URL"); // replace with the URL of your site
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
while((line = reader.readLine()) != null) {
builder.append(line);
}
JSONObject json = new JSONObject(builder.toString());
// now have some fun with the results...
If you don't, there are other methods.
Bigram models perform well, are simple to write, simple to train, and require only a small amount of text for detection. The Nutch language identifier is a Java implementation we found and used with a thin wrapper.
We had problems with a bigram model for mixed CJK and English text (e.g. a tweet that is mostly Japanese but has a single English word). This is obvious in retrospect from looking at the math (Japanese has many more characters, so the probability of any given pair is low). I think you could solve this with some more complicated log-linear comparison, but I cheated and used a simple filter based on character sets that are unique to certain languages (e.g. if it contains only unified Han, then it's Chinese; if it contains some Japanese kana and unified Han, then it's Japanese).
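A minimal sketch of such a character-set filter, using the JDK's Character.UnicodeScript. This is not the code we actually ran; the class name and the "ja"/"zh" return values are made up for the example, and treating kana without any Han as Japanese is an extra assumption on top of the rule described above.

```java
import java.lang.Character.UnicodeScript;
import java.util.Optional;

// Hypothetical illustration: guess Chinese vs. Japanese from the scripts present.
final class CjkScriptFilter {
    static Optional<String> guess(String text) {
        boolean hasHan = false;
        boolean hasKana = false;
        // Iterate by code point so characters outside the BMP are handled correctly.
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(codePoint);
            if (script == UnicodeScript.HAN) {
                hasHan = true;
            } else if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                hasKana = true;
            }
            i += Character.charCount(codePoint);
        }
        if (hasKana) {
            return Optional.of("ja"); // kana present (assumption: also when no Han appears)
        }
        if (hasHan) {
            return Optional.of("zh"); // Han only
        }
        return Optional.empty(); // no CJK evidence; fall back to the bigram model
    }
}
```

Running the filter before the bigram model means a stray English word in a Japanese tweet no longer matters, because a single kana character is enough to decide.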