C# and HtmlAgilityPack encoding problem
WebClient GodLikeClient = new WebClient();
HtmlAgilityPack.HtmlDocument GodLikeHTML = new HtmlAgilityPack.HtmlDocument();
GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"));
So this code returns: "Skaitytojo klausimas psichologui: kas lemia homoseksualumÄ…? - Naujienų portalas Alfa.lt" instead of "Skaitytojo klausimas psichologui: kas lemia homoseksualumą? - Naujienų portalas Alfa.lt".
This webpage is encoded in 1257 (Baltic), but textBox1.Text = GodLikeHTML.DocumentNode.OuterHtml;
returns distorted text: the Baltic diacritics are turned into weird strings that are several characters long :(
And yes, I've tried the HtmlAgilityPack forums. They do suck.
P.S. I'm no programmer, but I work on a community project and I really need to get this code working. Thanks ;}
Actually the page is encoded with UTF-8.
GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.UTF8);
will work.
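To see why the title comes out as "homoseksualumÄ…", here is a small sketch (the class and method names are made up for illustration). It takes the UTF-8 bytes of the word and decodes them as code page 1257, which is what the question assumed the page was. I'm assuming modern .NET here, where legacy code pages have to be registered first; on .NET Framework that line should not be needed.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Decode raw bytes with the Baltic code page 1257.
    public static string DecodeAs1257(byte[] bytes)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages need this provider.
        // On .NET Framework, 1257 is built in and this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1257).GetString(bytes);
    }

    static void Main()
    {
        // "\u0105" is the letter a-with-ogonek; in UTF-8 it becomes the two bytes C4 85.
        byte[] utf8 = Encoding.UTF8.GetBytes("homoseksualum\u0105");

        // Wrong guess: C4 is read as A-with-diaeresis and 85 as an ellipsis.
        Console.WriteLine(DecodeAs1257(utf8) == "homoseksualum\u00C4\u2026"); // True

        // Right guess: the original word comes back intact.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == "homoseksualum\u0105"); // True
    }
}
```

So a single two-byte letter turning into two unrelated characters is a good hint that the page is really UTF-8.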
Or you could use the code in my SO answer, which detects the encoding from the HTTP headers or meta tags and re-encodes properly. (It also supports gzip to minimize your download.)
With the download class your code would look like:
HttpDownloader downloader = new HttpDownloader("http://www.alfa.lt",null,null);
GodLikeHTML.LoadHtml(downloader.GetPage());
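The HttpDownloader class itself is not shown here, so the following is not its code. It is only a rough sketch of the general idea: look for a charset in the Content-Type header first, then in a meta tag, and fall back to UTF-8. CharsetSniffer and its methods are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Pick an encoding from the Content-Type header value, else from a
    // charset declaration near the top of the body, else assume UTF-8.
    public static Encoding Detect(string contentTypeHeader, byte[] body)
    {
        string name = FindCharset(contentTypeHeader ?? "");
        if (name == null)
        {
            // Charset declarations are plain ASCII, so a rough ASCII decode
            // of the first few kilobytes is enough to find one.
            int headLength = Math.Min(body.Length, 4096);
            name = FindCharset(Encoding.ASCII.GetString(body, 0, headLength));
        }
        if (name == null) return Encoding.UTF8;
        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return Encoding.UTF8; } // unknown or unregistered charset
    }

    static string FindCharset(string text)
    {
        var m = Regex.Match(text, @"charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes("<html><head><meta charset=\"utf-8\"></head></html>");
        Encoding detected = Detect(null, body);
        Console.WriteLine(detected.WebName);
        // The decoded string can then be passed to HtmlDocument.LoadHtml.
        string html = detected.GetString(body);
        Console.WriteLine(html.Length);
    }
}
```

The regular expression is deliberately naive; a real page can declare its charset in ways this will miss, which is why the fallback is there.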
I had similar encoding problems. I fixed them, in the most current version of HtmlAgilityPack, by setting the following on my HtmlWeb instance.
var htmlWeb = new HtmlWeb();
htmlWeb.OverrideEncoding = Encoding.UTF8;
var doc = htmlWeb.Load("http://www.alfa.lt");
UTF8 didn't work for me, but after setting the encoding like this, most pages I was trying to scrape worked just fine:
web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1");
Perhaps it might help someone.
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
StreamReader reader = new StreamReader(WebRequest.Create(YourUrl).GetResponse().GetResponseStream(), Encoding.Default); //put your encoding
doc.Load(reader);
hope it helps :)
Try changing that to GodLikeHTML.Load(GodLikeClient.OpenRead("http://www.alfa.lt"), Encoding.GetEncoding(1257));
If none of those answers work, just use this: WebUtility.HtmlDecode("Your html text");
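One caveat, as far as I can tell: WebUtility.HtmlDecode turns HTML entities (like &#261;) back into characters, but it does not repair text that was already decoded with the wrong charset. A quick check, with EntityDemo being just an illustrative name:

```csharp
using System;
using System.Net;

static class EntityDemo
{
    static void Main()
    {
        // A numeric character reference is decoded: &#261; is U+0105 (a-with-ogonek).
        string fromEntity = WebUtility.HtmlDecode("homoseksualum&#261;");
        Console.WriteLine(fromEntity == "homoseksualum\u0105"); // True

        // Text mangled by a wrong charset contains no entities, so it comes back unchanged.
        string mojibake = "homoseksualum\u00C4\u2026";
        Console.WriteLine(WebUtility.HtmlDecode(mojibake) == mojibake); // True
    }
}
```

So this helps when the page uses entities for its diacritics, but probably not for the problem in the question.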
This seemed to remove the need to know anything about encoding for me:
using System;
using HtmlAgilityPack;
using System.Net;
using System.IO;
class Program
{
    static void Main(string[] args)
    {
        Console.Write("Enter the url to pull html documents from: ");
        string url = Console.ReadLine();
        HtmlDocument document = new HtmlDocument();
        var request = WebRequest.Create(url);
        var response = request.GetResponse();
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            document.LoadHtml(reader.ReadToEnd());
        }
    }
}
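A note on why this can work without naming an encoding: a StreamReader created without one assumes UTF-8 (and checks for a byte order mark), so it suits UTF-8 pages like the one in the question. It does not detect other charsets. A small sketch without any network access, where DefaultReaderDemo is a made-up name:

```csharp
using System;
using System.IO;
using System.Text;

static class DefaultReaderDemo
{
    // Read bytes the same way as above: a StreamReader with no explicit encoding.
    public static string ReadWithDefault(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes)))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // UTF-8 input survives intact.
        byte[] utf8 = Encoding.UTF8.GetBytes("homoseksualum\u0105");
        Console.WriteLine(ReadWithDefault(utf8) == "homoseksualum\u0105"); // True

        // A single-byte encoding such as ISO-8859-1 does not: 0xE9 (e-acute)
        // is not valid UTF-8 on its own, so a replacement character appears.
        byte[] latin1 = Encoding.GetEncoding("ISO-8859-1").GetBytes("caf\u00E9");
        Console.WriteLine(ReadWithDefault(latin1).Contains("\uFFFD")); // True
    }
}
```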
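This is my solution:
HtmlDocument doc = new HtmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
byte[] barr;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (MemoryStream buffer = new MemoryStream())
{
    // ContentLength can be -1 and a single Read may return fewer bytes than asked for,
    // so copy the whole response stream instead.
    response.GetResponseStream().CopyTo(buffer);
    barr = buffer.ToArray();
}
// Decode provisionally as UTF-8 so the declared charset can be read from the markup.
string data = Encoding.UTF8.GetString(barr);
// Fall back to UTF-8 if no encoding is detected.
Encoding encod = doc.DetectEncodingHtml(data) ?? Encoding.UTF8;
// Decode the original bytes again with the detected encoding.
string convstr = encod.GetString(barr);
doc.LoadHtml(convstr);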
Even simpler (WebClient seems not to have any OverrideEncoding feature):
using (WebClient webClient = new WebClient())
{
    webClient.Encoding = Encoding.UTF8;
    // do whatever you want...
}
(works for me in .NET Framework 4.8)