Detecting language of email body
I need to implement an automated email reply system.
For this system I need to check incoming emails and reply to each one in the same language in which it was received.
How can I do such a thing? Please suggest some ideas. Thanks in advance.
Appending one more query:
In the email headers there is one more header of the kind:
Content-Type: text/plain; charset=ISO-8859-1
How useful is it for determining the language of the email body?
For example (all headers taken from Gmail):
for Chinese subject and body
Content-Type: text/plain; charset=GB2312
for Korean subject and body
Content-Type: text/plain; charset=EUC-KR
for French/Italian subject and body
Content-Type: text/html; charset=ISO-8859-1
Also, can somebody direct me to a list that maps languages to charsets?
Thanks in advance
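To illustrate how far the charset header can go, here is a minimal Python sketch that reads the charset with the standard library's email package and looks it up in a small table. The CHARSET_HINTS table and the candidate_languages function are hypothetical and deliberately incomplete. Note that a charset such as ISO-8859-1 covers many Western European languages, and the table has no entry for UTF-8 because it can carry any language, so the header can only narrow the candidates, not identify the language.

```python
from email import message_from_string

# Hypothetical, incomplete table: charset -> languages it commonly carries.
# A charset narrows the candidates; it does not identify the language.
CHARSET_HINTS = {
    "gb2312": ["zh"],
    "euc-kr": ["ko"],
    "iso-8859-1": ["fr", "it", "de", "es", "en"],
}


def candidate_languages(raw_email):
    """Return the languages suggested by the Content-Type charset, if any."""
    msg = message_from_string(raw_email)
    # get_content_charset() returns the charset in lower case, or None.
    charset = msg.get_content_charset()
    return CHARSET_HINTS.get(charset, [])
```

For the Korean example above this returns a single candidate, while for ISO-8859-1 the body itself still has to be inspected.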
Google Translate can guess the language of a sample text. Have a look at the API; it could be a solution for your problem, provided you're connected to the internet anyway and don't mind sending fragments of mails to Google's servers.
For offline evaluation I found the Java Text Categorizing Library.
This answer is primarily for those who don't trust online services and cannot use GPL/LGPL software for various reasons. If those aren't problems, Andreas_D's answer is probably better.
It's an interesting problem. Here's how I'd approach it.
For every language you want to support, pick the twenty most common words in that language that are unique to that language (such as "and", "the", "because" and so forth for English). In other words, don't use blancmange or soufflé to identify French, since you may well get a message from a German chef.
Then just score your languages against the email to see which language has the highest occurrence of those words.
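A minimal Python sketch of that scoring step might look like the following. The MARKER_WORDS lists are illustrative assumptions (five words per language rather than twenty) and the function names are made up for this example; a real list would need checking against actual word-frequency data.

```python
import re
from collections import Counter

# Illustrative marker words only; a real system would use about twenty
# common words per language that do not occur in the other languages.
MARKER_WORDS = {
    "en": {"the", "and", "because", "with", "this"},
    "fr": {"le", "les", "et", "avec", "pour"},
    "de": {"der", "und", "ist", "nicht", "das"},
}


def score_languages(text):
    """Count how often each language's marker words occur in the text."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    return {
        lang: sum(words[w] for w in markers)
        for lang, markers in MARKER_WORDS.items()
    }


def rank_languages(text):
    """Return language codes ordered from most to least likely."""
    scores = score_languages(text)
    return sorted(scores, key=scores.get, reverse=True)
```

Very short emails may score zero for every language, so the caller should be prepared for ties.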
But I wouldn't use that to exclusively decide the language. Rather, I'd use it to select the order in which the translated messages appear in the reply. If an email was predominantly German but stood even a little chance of being French, I'd put the message out like this:
- German bit.
- French bit.
- English bit (see below).
Each "bit" would also contain a section at the start along the lines of "We have detected your most likely language as BLAH but, if this is not the case, scroll down for other likely languages".
And always have the fallback of English just in case you're dead wrong. I know it's linguocentric but I'm pretty certain the vast majority of Internet users are forced to deal with English (or its strange and slightly warped cousin, American) every day.
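The ordering and the English fallback could be sketched in Python roughly as follows. The build_reply function, the shape of the scores dictionary, and the templates dictionary are all assumptions made for illustration. Each template is expected to already start with its own "we detected your language as ..." notice, and the fallback language is assumed to have a template.

```python
SEPARATOR = "\n\n---\n\n"


def build_reply(scores, templates, fallback="en"):
    """Join reply sections, most likely language first, fallback last.

    scores:    hypothetical mapping of language code -> detection score
    templates: hypothetical mapping of language code -> reply text
    """
    # Keep only languages that scored at all and that we can answer in.
    ranked = [
        lang
        for lang in sorted(scores, key=scores.get, reverse=True)
        if scores[lang] > 0 and lang in templates
    ]
    # Always include the fallback in case the detection was wrong.
    if fallback not in ranked:
        ranked.append(fallback)
    return SEPARATOR.join(templates[lang] for lang in ranked)
```

With scores of 5 for German, 1 for French and 0 for English, the reply contains the German section, then the French one, then the English one.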
Where did the email senders get the email address? If it was on a web page, TV commercial, print advertisement, etc. in their own language, then you could give each supported language its own email address.
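If each language does get its own address, the lookup needs no language detection at all. Here is a small Python sketch using the standard library's email package; the addresses in ADDRESS_LANGUAGES and the language_from_recipient function are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical per-language mailboxes.
ADDRESS_LANGUAGES = {
    "support-de@example.com": "de",
    "support-fr@example.com": "fr",
    "support-en@example.com": "en",
}


def language_from_recipient(raw_email, default="en"):
    """Pick the reply language from the address the email was sent to."""
    msg = message_from_string(raw_email)
    # parseaddr splits "Name <addr>" into (name, addr).
    _, address = parseaddr(msg.get("To", ""))
    return ADDRESS_LANGUAGES.get(address.lower(), default)
```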