Extracting words surrounding anchor text
I need help using Java code to extract the words surrounding the anchor text in HTML content. In particular, I need the seven words before and after (on both sides of) the anchor text (HTML block-level elements and punctuation marks included).
You can use a regular expression for this.
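As a rough illustration of the regular-expression idea, here is a minimal sketch. It assumes that a "word" is any whitespace-separated token, so tags and punctuation that touch a word stay attached to it; that is only one reading of the question. The class name AnchorContext and the method name contexts are made up for this example, and a regex like this will not cope with every kind of real-world HTML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorContext {
    // Simplified pattern for <a ...>text</a>; not a full HTML parser.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Splits on whitespace only, so tags and punctuation remain part of the tokens.
    static List<String> tokens(String s) {
        String t = s.trim();
        if (t.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(t.split("\\s+")));
    }

    // For each anchor, returns "<n tokens before> [anchor text] <n tokens after>".
    static List<String> contexts(String html, int n) {
        List<String> out = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            List<String> before = tokens(html.substring(0, m.start()));
            List<String> after = tokens(html.substring(m.end()));
            List<String> b = before.subList(Math.max(0, before.size() - n), before.size());
            List<String> a = after.subList(0, Math.min(n, after.size()));
            out.add(String.join(" ", b) + " [" + m.group(1) + "] " + String.join(" ", a));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>one two three four five six seven eight "
                + "<a href=\"http://example.com/\">link text</a> "
                + "nine ten, eleven twelve thirteen fourteen fifteen sixteen</p>";
        for (String line : contexts(html, 7)) {
            System.out.println(line);
        }
    }
}
```

If tags should count as separate words, or should be dropped entirely, the tokens method is the place to change the splitting rule.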
Do you want to create a new URL() from the href and get only the host part? Take a look at this class.
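Assuming the class meant here is java.net.URL, getting the host part could look something like the sketch below. The class name HrefHost and the helper hostOf are hypothetical, and returning null for unparsable links is just one possible choice. Note that recent JDKs deprecate the URL constructors in favor of going through java.net.URI, so treat this as an illustration rather than the recommended form.

```java
import java.net.MalformedURLException;
import java.net.URL;

class HrefHost {
    // Returns the host of an absolute href, or null if it cannot be parsed as a URL.
    static String hostOf(String href) {
        try {
            return new URL(href).getHost();
        } catch (MalformedURLException e) {
            // Relative links such as "page.html" have no protocol and end up here.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.example.com/path/page.html?x=1")); // www.example.com
        System.out.println(hostOf("page.html")); // null
    }
}
```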
To find anchors and get the href attribute, I recommend the Jericho library. I always use it when I need to parse a piece of HTML/XML and don't need XML validation. It's fast and easy to use, and there are plenty of examples on the project's page.