Determine if a URL is in the header/footer of a web page given URL, page DOM, parent URL and other page URLs

Given a URL, the URL of the webpage it appears on, the DOM of that webpage, and a list of the rest of the URLs on the webpage, how can I reliably determine whether the URL is in the header or footer of the page, or in neither?

I'm using C#/.NET.

I know that no solution is perfect, since webpages are not semantically expressed and some websites deliberately obfuscate their pages, but I would like to build some logic that works for, say, 75% of webpages.

Also, are there other pieces of information that would be helpful to determine the location of the URL in the page?


I think the creative task here is to define what "header" and "footer" mean, for example "content less than x units away from the top" or "the last 200 characters on the page". Once you have settled on those definitions, you can parse the page based on those rules.
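
As a rough illustration of that idea, here is a minimal C# sketch that treats "header" and "footer" as the first and last fraction of the serialized `<body>` markup. The names `LinkRegion` and `LinkRegionClassifier`, and the 15% thresholds, are hypothetical choices made for this example and not tuned values. It also assumes the link appears as a double-quoted `href` attribute and only looks at the first occurrence, so a URL that is repeated in both the navigation and the footer would be reported as a header link.

```csharp
using System;

// Hypothetical result type for this sketch.
public enum LinkRegion { Header, Footer, Body, NotFound }

public static class LinkRegionClassifier
{
    // Classifies a URL by where its first href occurrence falls within the
    // serialized <body> markup. headerFraction and footerFraction are
    // illustrative thresholds (assumptions), not measured values.
    public static LinkRegion Classify(string html, string url,
        double headerFraction = 0.15, double footerFraction = 0.15)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrEmpty(url))
            return LinkRegion.NotFound;

        // Restrict the measurement to the body so <head> content does not skew it.
        int bodyStart = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (bodyStart < 0) bodyStart = 0;
        int bodyEnd = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (bodyEnd <= bodyStart) bodyEnd = html.Length;

        // Assumption: the link is written as href="..." with double quotes.
        int pos = html.IndexOf("href=\"" + url + "\"", bodyStart, StringComparison.Ordinal);
        if (pos < 0 || pos >= bodyEnd)
            return LinkRegion.NotFound;

        // Relative offset of the link inside the body, from 0.0 (top) to 1.0 (bottom).
        double relative = (double)(pos - bodyStart) / (bodyEnd - bodyStart);

        if (relative <= headerFraction) return LinkRegion.Header;
        if (relative >= 1.0 - footerFraction) return LinkRegion.Footer;
        return LinkRegion.Body;
    }
}
```

Calling `LinkRegionClassifier.Classify(html, "/privacy")` on a page whose last link points to `/privacy` would be expected to return `LinkRegion.Footer` under these thresholds. Character offsets are only a proxy for visual position, so the thresholds would need adjusting against real pages.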
