Extracting everything but tags from a web page without a parser - using scanner and regex?

2023-01-14 22:29 Q&A author:

I'm working with the Android SDK, which is Java minus some things.

I have a solution that pulls out matches for two regex patterns from web pages. The problem I'm having is that it's finding matches inside HTML tags. I tried jTidy, but it was just too slow on Android. I'm not sure why, but my Scanner regex match solution beats it many times over.

Currently, I grab the page source into an InputStream:

InputStream in = uconn.getInputStream();

and then match and extract like this:

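Scanner scanner = new Scanner(in, "UTF-8");
// A horizon of 0 means no limit: search the rest of the input.
while (scanner.findWithinHorizon(extractPattern, 0) != null) {
    // Pull the capture group of interest out of the last match.
    String matchit = scanner.match().group(grp);
    // ... use matchit ...
}
scanner.close();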

It works very nicely and is fast.

My regex pattern is already kind of crazy: it's actually two patterns in an alternation, like this: (p1|p2)

Any ideas on how I do that "but not inside HTML tags" or exclude HTML tags at the start? If I can exclude HTML tags from my source that will likely speed up my interface significantly as I have a few other things I need to do with the raw data.

One thing you can do is add a lookahead for the closing angle bracket:

(p1|p2)(?![^<>]*+>)

The idea is, after you find a match you scan forward a bit; if you find a closing bracket without first seeing an opening bracket, the match must have occurred inside a tag, so reject it. But be aware that even in well-formed HTML there are many things that can mess you up, like SGML comments, CDATA sections, or even angle brackets in attribute values.
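As a rough illustration of the lookahead idea, here is a minimal sketch. The class name, the sample markup, and the pattern \d{4}|foo (a hypothetical stand-in for p1|p2) are all assumptions made up for the example; only the lookahead itself comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical stand-in for (p1|p2): a four-digit number or the word "foo".
    // The negative lookahead rejects a match if a '>' follows before any '<'.
    static final Pattern OUTSIDE_TAGS =
            Pattern.compile("(\\d{4}|foo)(?![^<>]*+>)");

    static List<String> findOutsideTags(String html) {
        List<String> hits = new ArrayList<>();
        Scanner scanner = new Scanner(html);
        // Horizon 0 = unbounded search through the rest of the input.
        while (scanner.findWithinHorizon(OUTSIDE_TAGS, 0) != null) {
            hits.add(scanner.match().group(1));
        }
        scanner.close();
        return hits;
    }

    public static void main(String[] args) {
        // "foo" and "1234" also occur inside the tag's attributes and should be skipped.
        String html = "<a href=\"foo.html\" id=\"1234\">foo 2023</a>";
        System.out.println(findOutsideTags(html)); // prints [foo, 2023]
    }
}
```

Only the occurrences in the element text are kept; the ones inside the attribute values are rejected because a closing bracket follows them before any opening bracket.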

Another approach would be to match the tags and ignore those matches:

((?:<[^<>]++>)++)|(p1|p2)

Then you test whether it was group #1 that matched:

MatchResult match = scanner.match();
if (match.start(1) != -1) {
    // keep searching
}
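Put together, the loop might look something like the sketch below. It assumes the tag run and the target patterns are combined as alternatives (tags in group 1, targets in group 2), and again uses \d{4}|foo as a hypothetical stand-in for p1|p2; the class name and sample markup are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class SkipTagsDemo {
    // Group 1: a run of one or more tags.
    // Group 2: hypothetical stand-in for (p1|p2).
    static final Pattern TAGS_OR_TARGET =
            Pattern.compile("((?:<[^<>]++>)++)|(\\d{4}|foo)");

    static List<String> findOutsideTags(String html) {
        List<String> hits = new ArrayList<>();
        Scanner scanner = new Scanner(html);
        while (scanner.findWithinHorizon(TAGS_OR_TARGET, 0) != null) {
            MatchResult match = scanner.match();
            if (match.start(1) != -1) {
                continue; // group 1 matched, so this was a run of tags: keep searching
            }
            hits.add(match.group(2));
        }
        scanner.close();
        return hits;
    }

    public static void main(String[] args) {
        String html = "<a href=\"foo.html\" id=\"1234\">foo 2023</a>";
        System.out.println(findOutsideTags(html)); // prints [foo, 2023]
    }
}
```

Because each tag is consumed whole by the first alternative, the target patterns never get a chance to match inside it.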

But again, as a general solution this is way too fragile, for the reasons I cited above. You should only use one of these solutions (or any regex solution) if you're sure it's compatible with the particular pages you're working on.

Why don't you use javax.xml.parsers to parse the HTML (treating it as XML)?
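For what it's worth, a minimal sketch of that suggestion could look like the following. It assumes the page is well-formed XML (XHTML, for instance), which a lot of real-world HTML is not; the class and method names are hypothetical, and whether this is fast enough on a given Android device is not something the sketch addresses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class XmlTextDemo {
    // Returns the concatenated text content of the document, with all tags dropped.
    // Assumption: the input is well-formed XML; anything else is rejected.
    static String textOnly(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOnly("<p id=\"1234\">foo <b>2023</b></p>")); // prints foo 2023
    }
}
```

The extracted text can then be searched with the plain (p1|p2) pattern, since attribute values and tag names are no longer part of it.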
