How can I deal with accented letters, German letters and other characters?
My Python script is working now, but I'm having a little trouble with its output.
Here is the script, followed by what it prints:
from BeautifulSoup import BeautifulSoup
import urllib
langCode={
"arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
"croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
"english":"en", "finnish":"fi", "french":"fr", "german":"de",
"greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
"korean":"ko", "norwegian":"no", "polish":"pl", "portugese":"pt",
"romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }
def setUserAgent(userAgent):
    urllib.FancyURLopener.version = userAgent
    pass

def translate(text, fromLang, toLang):
    setUserAgent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008070400 SUSE/3.0.1-0.1 Firefox/3.0.1")
    try:
        postParameters = urllib.urlencode({"langpair":"%s|%s" %(langCode[fromLang.lower()],langCode[toLang.lower()]), "text":text,"ie":"UTF8", "oe":"UTF8"})
    except KeyError, error:
        print "Currently we do not support %s" %(error.args[0])
        return
    page = urllib.urlopen("http://translate.google.com/translate_t", postParameters)
    content = page.read()
    page.close()
    htmlSource = BeautifulSoup(content)
    translation = htmlSource.find('span', title=text )
    return translation.renderContents()

print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")
Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos dÃas a ti amigo!
How do I manage the letters that aren't basic English letters? How would you recommend I solve this? I was thinking of a dictionary that replaces certain character sequences with the right character, but I'm sure Python has something like this already. Batteries included and whatnot. :P
Thanks.
Don't parse http://translate.google.com/translate_t since Google provides an AJAX service for this purpose. The translatedText in the JSON data returned by ajax.googleapis.com is already a unicode string.
import urllib2
import urllib
import sys
import json
LANG={
"arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
"croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
"english":"en", "finnish":"fi", "french":"fr", "german":"de",
"greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
"korean":"ko", "norwegian":"no", "polish":"pl", "portugese":"pt",
"romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }
def translate(text,lang1,lang2):
    base_url='http://ajax.googleapis.com/ajax/services/language/translate?'
    langpair='%s|%s'%(LANG.get(lang1.lower(),lang1),
                      LANG.get(lang2.lower(),lang2))
    params=urllib.urlencode( (('v',1.0),
                              ('q',text.encode('utf-8')),
                              ('langpair',langpair),) )
    url=base_url+params
    content=urllib2.urlopen(url).read()
    # Fall back through the older json module APIs if loads() is unavailable
    try: trans_dict=json.loads(content)
    except AttributeError:
        try: trans_dict=json.load(content)
        except AttributeError: trans_dict=json.read(content)
    return trans_dict['responseData']['translatedText']

print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")
This yields:
Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos días a ti amigo!
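To illustrate why the JSON route avoids the encoding problem, here is a small offline sketch written for Python 3 (an assumption; the code above is Python 2). The payload is made up: it only mirrors the responseData / translatedText shape that the function above reads, and the `\u00ed` escape is just one way JSON can carry a non-ASCII character. `extract_translation` is a hypothetical helper name.

```python
import json

# Hypothetical payload shaped like the structure read above
# (responseData -> translatedText); not captured from the real service.
sample_response = b'{"responseData": {"translatedText": "Buenos d\\u00edas a ti amigo!"}}'

def extract_translation(raw_bytes):
    """Parse the JSON body and return the translated text as a Unicode string."""
    payload = json.loads(raw_bytes.decode("utf-8"))
    return payload["responseData"]["translatedText"]

translation = extract_translation(sample_response)
print(translation)
```

The JSON parser hands back a Unicode string directly, so there is no HTML to scrape and no page encoding to guess.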
Parse the proper charset from the headers returned by urlopen() and pass it as the fromEncoding argument to the BeautifulSoup constructor.
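A minimal sketch of that idea, assuming Python 3 and using only the standard library so it runs without a network connection. The helper names `charset_from_content_type` and `decode_body` are hypothetical, and the header values are examples rather than what Google actually sends. It also shows how decoding UTF-8 bytes with the wrong charset produces the garbled "dÃas" seen in the question.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default

def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset its header declares."""
    return raw_bytes.decode(charset_from_content_type(content_type))

# Stand-in for page.read(): the translation encoded as UTF-8 bytes
body = "Buenos días a ti amigo!".encode("utf-8")

good = decode_body(body, "text/html; charset=UTF-8")
bad = decode_body(body, "text/html; charset=ISO-8859-1")  # wrong charset on purpose

print(good)
print(repr(bad))  # the two UTF-8 bytes of "í" become "Ã" plus a soft hyphen
```

With a live Python 3 response, `response.headers.get_content_charset()` should give the same value as the helper here. That charset is what would be passed to BeautifulSoup as fromEncoding (the parameter is named from_encoding in the newer bs4 package).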