Encoding problem with C# utilising HttpWebRequest
I am getting character codes (&#39; and &quot;) that are breaking my responses (showing 39; and uto;) when returning a string from an HttpWebRequest:
internal static void TranslateThis(Player player, string fromLang, string toLang, string text){
try
{
string translated = null;
HttpWebRequest hwr = (HttpWebRequest)HttpWebRequest.Create("http://translate.google.com/?langpair=" + fromLang + "|" + toLang + "&text=" + text.Replace(" ", "+") + "#");
HttpWebResponse res = (HttpWebResponse)hwr.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string html = sr.ReadToEnd();
int a = html.IndexOf("onmouseout=\"this.style.backgroundColor='#fff'\">") + 47;
int b = html.IndexOf("</span>",html.IndexOf("onmouseout=\"this.style.backgroundColor='#fff'\">") + 47);
translated = html.Substring(a, b - a);
if (translated.Length < (10 * text.Length)){
if (player == Player.Console)
{
player.ParseMessage(translated, true);
}
else
{
player.ParseMessage(translated, false);
}
} else {
player.Message("Usage: /translate [lang] [message]");
}
}
catch
{
player.Message("Usage: /translate [lang] [message]");
}
}
First of all, make sure you read the downloaded content with the correct encoding. See this SO answer for code on how to do this.
Basically, check both the HTTP headers and the meta tags for the encoding, and re-encode the content if necessary. Then call HttpUtility.HtmlDecode to get rid of any HTML-encoded characters. Now you are ready to start searching for whatever content you are trying to find.
I would also recommend using something like Html Agility Pack to parse the HTML instead of IndexOf.
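The header check and decode step described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the ResponseText class and its method names are made up for illustration, it only looks at the charset from the HTTP header (not the meta tags), it falls back to UTF-8 as an assumption, and it uses WebUtility.HtmlDecode from System.Net rather than HttpUtility.HtmlDecode so that no System.Web reference is needed.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, named here only for illustration.
static class ResponseText
{
    // Maps a charset name (e.g. from the Content-Type header) to an Encoding.
    // Falling back to UTF-8 for a missing or unknown charset is an assumption.
    public static Encoding EncodingFromCharset(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the charset value, so strip quotes and spaces.
            return Encoding.GetEncoding(charset.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Decodes the raw response bytes with the chosen encoding, then turns
    // HTML entities such as &#39; and &quot; back into plain characters.
    public static string DecodeBody(byte[] body, string charset)
    {
        string html = EncodingFromCharset(charset).GetString(body);
        return WebUtility.HtmlDecode(html);
    }
}
```

With an HttpWebResponse, the charset argument would typically come from the response's CharacterSet property, and the bytes from copying GetResponseStream() into a MemoryStream.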
It is hard to say exactly what your ParseMessage method expects, so this is just a guess:
The result you are getting from Google Translate is HTML, so if you want plain-text output you have to convert the HTML to text. You have successfully extracted the translation from the HTML page (for now, at least: the extraction will break as soon as Google Translate changes its output page even slightly, so it is not exactly fool- or future-proof). But the translation is still HTML-encoded and you need to decode it. For that, you can use the WebUtility.HtmlDecode
method (assuming you are using .NET Framework 4 or later). After the
translated = html.Substring(a, b - a);
line, add
translated = WebUtility.HtmlDecode(translated);
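Put together, the extract-then-decode step might look like the sketch below. The TranslationExtractor class and Extract method are hypothetical names; the marker string is copied from the question's code and reflects an assumption about Google's markup at the time. Unlike the original, it returns null when a marker is missing instead of letting Substring throw.

```csharp
using System;
using System.Net;

// Hypothetical helper, named here only for illustration.
static class TranslationExtractor
{
    // Marker taken from the question's code; the page markup is an
    // assumption and may change at any time.
    const string Marker = "onmouseout=\"this.style.backgroundColor='#fff'\">";

    // Returns the decoded text between the marker and the next </span>,
    // or null if either delimiter cannot be found.
    public static string Extract(string html)
    {
        int start = html.IndexOf(Marker, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }
        start += Marker.Length; // same offset as the hard-coded 47

        int end = html.IndexOf("</span>", start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null;
        }

        // Decode entities such as &#39; and &quot; in the extracted fragment.
        return WebUtility.HtmlDecode(html.Substring(start, end - start));
    }
}
```

Using Marker.Length instead of the literal 47 keeps the offset correct if the marker string is ever edited.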
Discussions with another developer got me to try this before the last lot of comments. Here is what ended up working:
internal static void TranslateThis(Player player, string fromLang, string toLang, string text){
try
{
string translated = null;
text = Regex.Replace(text, @"[^\w\.\'\s@-]", "");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://translate.google.com/?langpair=" + fromLang + "|" + toLang + "&text=" + text.Replace(" ", "+") + "#");
request.MaximumAutomaticRedirections = 4;
request.MaximumResponseHeadersLength = 4;
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF7);
String html = readStream.ReadToEnd() + "";
int a = html.IndexOf("onmouseout=\"this.style.backgroundColor='#fff'\">") + 47;
int b = html.IndexOf("</span>",html.IndexOf("onmouseout=\"this.style.backgroundColor='#fff'\">") + 47);
translated = html.Substring(a, b - a);
response.Close();
readStream.Close();
if (translated.Length < (10 * text.Length))
{
translated = translated.Replace("&#39;", "'");
translated = Regex.Replace(translated, @"[^\w\.\'\s@-]", "");
if (player == Player.Console)
{
player.ParseMessage(translated, true);
}
else
{
player.ParseMessage(translated, false);
}
}
else
{
player.Message("Usage: /translate [lang] [message]");
}
}
catch(Exception ex)
{
player.Message("Error:" + ex.ToString());
}
}