Extracting an "encompassing" string based on a term within the string

I have a Java function that extracts a string from the HTML page source of any website. The function accepts the site name along with a term to search for. The search term is always contained within JavaScript <script> tags. What I need to do is pull out the entire JavaScript (everything within the tags) that contains the search term.

Here is an example -

<script type="text/javascript">
    //Roundtrip
    rtTop = Number(new Date());

    document.documentElement.className += ' jsenabled';
</script>

For the JavaScript snippet above, my search term would be "rtTop". Once it is found, I want my function to return a string containing everything within the script tags.

Any novel solution? Thanks.


You could use a regular expression along the lines of

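import java.util.regex.Matcher;
import java.util.regex.Pattern;

String someHTML = ""; // placeholder: get your HTML from wherever
// DOTALL lets '.' match newlines, so the script body may span several lines
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>", Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
// group(1) throws IllegalStateException unless find() succeeded, so check it first
String result = myMatcher.find() ? myMatcher.group(1) : null;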


I wish I could just comment on JacobM's answer, but I think I need more stackCred.

You could use an HTML parser; that's usually the better solution. That said, for limited scopes, I often use regex. It's a mean beast, though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^>]*

That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
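
As a rough sketch of those two tweaks (the class and method names are made up for illustration, and the \b and CASE_INSENSITIVE flag are my own additions, not part of JacobM's pattern): the opening tag is relaxed to accept any attributes or none, and the parenthesized .*? makes the body available as group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RelaxedScriptTag {
    // <script\b[^>]*> accepts an opening tag with any attributes, or none at all.
    // (.*?) lazily captures the body up to the first closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the body of the first script element, or null if there is none. */
    static String firstScriptBody(String html) {
        Matcher m = SCRIPT.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

This only grabs the first script element on the page; it does not look for the keyword yet.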

* UPDATE * Borrowing from JacobM's answer, I'd change the pattern a little bit to handle multiple script elements.

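import java.util.regex.Matcher;
import java.util.regex.Pattern;

String someHTML = ""; // placeholder: get your HTML from wherever
String lKeyword = "rtTop";
// ((?!</).)* consumes characters only while no closing tag starts,
// so the match cannot run from one script element into the next
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)" + Pattern.quote(lKeyword) + "(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern, Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
String result = null;
if (myMatcher.find()) { // group() throws IllegalStateException without a successful match
    String lPreKeyword = myMatcher.group(3);  // script text before the keyword
    String lPostKeyword = myMatcher.group(5); // script text after the keyword
    result = lPreKeyword + lKeyword + lPostKeyword;
}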

An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
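
If the lookahead pattern feels too fragile, a simpler alternative (a different technique from the one above, sketched under my own assumptions) is to match each script element on its own and then check its body for the keyword with a plain contains(). ScriptFinder and findScriptContaining are hypothetical names, not anything from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFinder {
    // One match per script element: the lazy (.*?) stops at the first closing tag.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the body of the first script element whose text contains term. */
    static Optional<String> findScriptContaining(String html, String term) {
        Matcher m = SCRIPT_BLOCK.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            // Plain substring test, so the term needs no regex escaping.
            if (body.contains(term)) {
                return Optional.of(body);
            }
        }
        return Optional.empty();
    }
}
```

Calling ScriptFinder.findScriptContaining(someHTML, "rtTop") would then give back the body of the matching script, or an empty Optional when no script mentions the term. It is still regex over HTML, so the usual caveats apply.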
