NLTK and language detection
How do I detect what language a text is written in using NLTK?
The examples I've seen use nltk.detect, but when I installed it on my Mac, I could not find this package.
Have you come across the following code snippet?
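import nltk  # needs the "words" corpus, e.g. nltk.download("words")
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
# text is assumed to be a list of word tokens
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
# words in the text that are not in the English word list
unusual = text_vocab.difference(english_vocab)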
from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
Or the following demo file?
https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py
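One hedged way to turn the vocabulary-difference snippet above into an actual language check is to measure what fraction of a text's alphabetic tokens fall outside a reference vocabulary. The sketch below uses a tiny hard-coded stand-in vocabulary so it runs without any corpus download; `unknown_ratio` and `SAMPLE_VOCAB` are names made up for this illustration, and in practice the vocabulary would come from nltk.corpus.words.words().

```python
import re

# Tiny stand-in vocabulary (an assumption for this sketch); in practice use
# set(w.lower() for w in nltk.corpus.words.words()).
SAMPLE_VOCAB = {"the", "cat", "sat", "on", "mat", "a", "is", "good", "day", "today"}


def unknown_ratio(text, vocab):
    """Return the fraction of distinct alphabetic tokens not found in vocab."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    tokens = {t.lower() for t in re.findall(r"[^\W\d_]+", text)}
    if not tokens:
        return 0.0
    return len(tokens - vocab) / len(tokens)


print(unknown_ratio("Today is a good day", SAMPLE_VOCAB))
print(unknown_ratio("Heute ist ein guter Tag", SAMPLE_VOCAB))
```

A low ratio suggests the text is plausibly in the vocabulary's language; where to put the cut-off is a judgment call and depends on how complete the word list is.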
This library is not from NLTK either but certainly helps.
$ pip install langdetect
Supported Python versions: 2.6, 2.7, 3.x.
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
https://pypi.python.org/pypi/langdetect?
P.S.: Don't expect this to always work correctly, especially on very short inputs:
>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
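Since the failures above all involve very short inputs, one possible guard is to refuse to guess when the text has too few letters. This is only a sketch: `detect_if_long_enough`, the `min_letters` threshold of 20, and the "unknown" fallback are assumptions for illustration. The detector is passed in as a plain callable so that langdetect's `detect` (or any other function from text to language code) can be plugged in.

```python
def detect_if_long_enough(text, detector, min_letters=20, fallback="unknown"):
    """Call detector(text) only when text contains at least min_letters letters."""
    letters = sum(1 for ch in text if ch.isalpha())
    if letters < min_letters:
        return fallback
    return detector(text)


# Stub detector, used only so this sketch runs without langdetect installed.
def fake_detector(text):
    return "en"


print(detect_if_long_enough("wow", fake_detector))
print(detect_if_long_enough("War doesn't show who's right, just who's left.", fake_detector))
```

With langdetect installed, the call would be `detect_if_long_enough(text, detect)`. The langdetect README also notes that detection is non-deterministic and suggests setting `DetectorFactory.seed = 0` before the first call if you want repeatable results.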
Although this is not part of NLTK, I have had great results with another Python-based library:
https://github.com/saffsd/langid.py
This is very simple to import and includes a large number of languages in its model.
Super late, but you could use the textcat classifier in nltk. There is a paper that discusses the algorithm.
It returns a language code in ISO 639-3, so I would use pycountry to get the full name.
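As far as I understand it, NLTK's TextCat follows the character n-gram approach of Cavnar and Trenkle: build a ranked profile of the most frequent n-grams in the text, then pick the language whose profile is closest under an "out-of-place" rank distance. The following is a simplified sketch of that idea, not NLTK's actual implementation; the function names, the toy training sentences, and the profile size are all assumptions for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ranked[:top_k])}


def out_of_place(text_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile cost a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in text_profile.items()
    )


def guess(text, lang_profiles):
    """Return the language whose profile has the smallest out-of-place distance."""
    profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(profile, lang_profiles[lang]))


# Toy training data (an assumption; real profiles are built from large corpora).
TRAINING = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat is on the mat",
    "deu": "der schnelle braune fuchs springt über den faulen hund und die katze ist auf der matte",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

print(guess("the dog is on the mat", PROFILES))
print(guess("der hund ist auf der matte", PROFILES))
```

With profiles this small the guesses are fragile, which is the same sparse-data problem that shows up with very short input phrases.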
For example, load the libraries:
import nltk
import pycountry
from nltk.stem import SnowballStemmer
Now let's look at two phrases and guess their language:
phrase_one = "good morning"
phrase_two = "goeie more"
tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)
guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)
English
Afrikaans
You could then pass them into other nltk functions, for example:
stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
walk
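One caveat with passing the guessed name straight to the stemmer: SnowballStemmer raises a ValueError for languages it does not support, and "afrikaans" is one of them. A small check before constructing the stemmer avoids that. This is a sketch under assumptions: `stemmer_language` is a made-up helper, and `SUPPORTED` is a hand-written subset so the code runs without NLTK; with NLTK you would pass `SnowballStemmer.languages` instead.

```python
def stemmer_language(language_name, supported):
    """Return the lowercased name if it is in supported, else None."""
    candidate = language_name.lower()
    return candidate if candidate in supported else None


# Hypothetical subset so the sketch runs without NLTK installed;
# with NLTK, pass SnowballStemmer.languages instead.
SUPPORTED = ("danish", "dutch", "english", "french", "german")

print(stemmer_language("English", SUPPORTED))
print(stemmer_language("Afrikaans", SUPPORTED))
```

When the helper returns None, you can skip stemming or fall back to a default instead of letting the exception propagate.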
Disclaimer: obviously this will not always work, especially for sparse data.
An extreme example:
guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
Konkani (individual language)