How to extract full url with HtmlAgilityPack - C#
Right now, with the code below, I am only extracting the URL exactly as it appears in the href attribute.
The extraction code:
foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    lsLinks.Add(link.Attributes["href"].Value.ToString());
}
The link in the HTML:
<a href="Login.aspx">Login</a>
The extracted URL:
Login.aspx
But I want the real, absolute link that the browser resolves, like:
http://www.monstermmorpg.com/Login.aspx
I could check whether the URL contains http and, if not, prepend the domain, but that may cause problems in some cases and I don't think it is a very wise solution.
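Roughly, what I mean is something like this (just a sketch, with the domain hard-coded); it goes wrong for things like ../ paths, protocol-relative URLs or javascript: links:
string href = link.Attributes["href"].Value;
if (!href.StartsWith("http", StringComparison.OrdinalIgnoreCase))
{
    // hard-coded domain, which is exactly the fragile part
    href = "http://www.monstermmorpg.com/" + href.TrimStart('/');
}
lsLinks.Add(href);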
C# 4.0, HtmlAgilityPack 1.4.0
Assuming you have the original URL, you can combine it with the parsed href something like this:
// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");
// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'
// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'
// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'
// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'
Note the use of AbsoluteUri rather than ToString(), because ToString() decodes the URL (to make it more "human-readable"), which is typically not what you want.
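For example, with an escaped URL the difference looks something like this (hypothetical URL; exact behavior can vary slightly between .NET versions):
var escaped = new Uri("http://example.com/my%20page.aspx");
Console.WriteLine (escaped.ToString());   // prints 'http://example.com/my page.aspx' (decoded)
Console.WriteLine (escaped.AbsoluteUri);  // prints 'http://example.com/my%20page.aspx'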
I can do it with checking the url whether containing http and if not add the domain value
That's what you should do. Html Agility Pack has nothing to help you with this:
var url = new Uri(
    new Uri(new Uri(baseUrl).GetLeftPart(UriPartial.Path)),
    link.Attributes["href"].Value
);
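Putting both pieces together with the loop from the question, a minimal sketch could look like this (the page URL is an assumption; Uri.TryCreate is used so hrefs that cannot be resolved are simply skipped):
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// the page you crawled (hypothetical URL)
var pageUrl = new Uri("http://www.monstermmorpg.com/SomePage.aspx");
var hdDoc = new HtmlWeb().Load(pageUrl.AbsoluteUri);
var lsLinks = new List<string>();

var anchors = hdDoc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode link in anchors)
    {
        string href = link.GetAttributeValue("href", "");
        Uri absolute;
        if (Uri.TryCreate(pageUrl, href, out absolute))
            lsLinks.Add(absolute.AbsoluteUri); // AbsoluteUri, not ToString()
    }
}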