How do I find an HTML div that contains specific text after a text prefix?

I have the following string:

<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4

and I want to know whether it contains text3 inside the div that follows prefix:

prefix<div>...text3...</div>

but I don't know how to write a regex for that, since I can't use [^<]+ because the divs can contain a strong tag inside.

Please help

EDIT:

  1. Div tags after prefix are guaranteed not to be nested
  2. The language is C#
  3. Text4 is very long, so the regex must not look past the closing div

EDIT2: I don't want to use an HTML parser; this can be achieved easily (and MUCH faster) with a regex. The HTML here is simple: no attributes in tags and no nested divs. Even some percentage of wrong answers is acceptable in my case.


If you turn off the "greedy" option (in .NET there is no such switch, so this means writing the quantifiers as .*? rather than .*), you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)

Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
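To see how the lazy version behaves in C#, here is a minimal sketch. The class name, the `\s*` between prefix and `<div>` (added so the space in the sample string is tolerated) and the second test string are assumptions for illustration, not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyDivDemo
{
    // Lazy form of prefix<div>.*text3.*</div>.
    // Assumption: \s* is added so "prefix <div>" (with a space) also matches.
    private static readonly Regex LazyPattern = new Regex(
        @"prefix\s*<div>.*?text3.*?</div>",
        RegexOptions.Singleline); // Singleline lets '.' match newlines as well

    public static bool Matches(string html) => LazyPattern.IsMatch(html);

    public static void Main()
    {
        string sample =
            "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4";

        // Hypothetical input: text3 sits outside any div, after the first one closes.
        string outside = "prefix <div> text1 </div> text3 <div> other </div>";

        Console.WriteLine(Matches(sample));  // True
        Console.WriteLine(Matches(outside)); // True, which is a false positive
    }
}
```

The second result shows the weakness: a lazy `.*?` is still free to run past `</div>` when that is the only way to reach text3, so it may also end up scanning the long text4.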

Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.

EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
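A rough sketch of the "capture the div tags and count them" idea in C# could look like the following. The class and method names are made up for illustration, and it assumes that only the first complete div after prefix matters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivDepthScanner
{
    // Matches opening and closing div tags; group 1 is "/" for a closing tag.
    private static readonly Regex DivTag =
        new Regex(@"<(/?)div\b[^>]*>", RegexOptions.IgnoreCase);

    // Assumption: returns true if needle occurs inside the first complete
    // (outermost) div that starts after the first occurrence of prefix.
    public static bool ContainsInDivAfterPrefix(string html, string prefix, string needle)
    {
        int start = html.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return false;
        start += prefix.Length;

        int depth = 0;          // number of div tags currently open
        int contentStart = -1;  // index just after the outermost opening tag

        // Tags are matched one at a time, so scanning stops as soon as
        // the outermost div closes and the rest of the input is not read.
        for (Match m = DivTag.Match(html, start); m.Success; m = m.NextMatch())
        {
            bool isClosing = m.Groups[1].Length > 0;
            if (!isClosing)
            {
                if (depth == 0) contentStart = m.Index + m.Length;
                depth++;
            }
            else if (depth > 0 && --depth == 0)
            {
                // Search only between the outermost <div> and its </div>.
                return html.IndexOf(needle, contentStart, m.Index - contentStart,
                                    StringComparison.Ordinal) >= 0;
            }
        }
        return false; // no complete div found after the prefix
    }

    public static void Main()
    {
        Console.WriteLine(ContainsInDivAfterPrefix(
            "prefix<div>a<div></div>text3</div> tail", "prefix", "text3")); // True
        Console.WriteLine(ContainsInDivAfterPrefix(
            "prefix <div> text1 </div> text3", "prefix", "text3"));         // False
    }
}
```

Counting depth handles the nested case that a single pattern struggles with, at the cost of a small loop around the regex.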


This is my new regex:

prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>

It seems to work OK.
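For anyone who wants to try it from C#, a minimal sketch follows. The class name is hypothetical, and `\s*` has been added after prefix as an assumption, because the sample string has a space between prefix and `<div>` that the pattern as posted would not accept.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDivMatcher
{
    // Same pattern as above, plus \s* after "prefix" (an assumption).
    // Each "<" is only consumed when it does not start "</div>",
    // so the match can never run past the closing tag of the first div.
    private static readonly Regex Pattern = new Regex(
        @"prefix\s*<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>");

    public static bool Matches(string html) => Pattern.IsMatch(html);

    public static void Main()
    {
        string sample =
            "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4";

        // Hypothetical input where text3 appears only after the div has closed.
        string outside = "prefix <div> text1 </div> text3 <div> other </div>";

        Console.WriteLine(Matches(sample));  // True
        Console.WriteLine(Matches(outside)); // False
    }
}
```

Because the negative lookahead `(?!/div>)` blocks the pattern at the first `</div>`, the long text4 after it is never examined for a given prefix.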


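For C# + HtmlAgilityPack you can do something like:

using System.Text.RegularExpressions;
using HtmlAgilityPack;

InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(InputString);

// XPath contains() needs two arguments; "." is the div's own text content
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");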

The prefix removal is still not a good way of dealing with it. Ideally you would use HtmlAgilityPack to find where prefix occurs in the DOM, translate that into a position in the string, and then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can avoid looking at text4 with a similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
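One possible reading of the substring idea, sketched with plain string operations instead of HtmlAgilityPack (so this is a swapped-in technique, not the DOM lookup described above). It assumes the simple HTML from the question: no attributes on tags and no nested divs. The class and method names are hypothetical.

```csharp
using System;

public static class PrefixWindow
{
    // Assumption: tags have no attributes and divs are not nested,
    // so the first "</div>" after the opening tag closes that div.
    public static bool ContainsInFirstDivAfterPrefix(string html, string prefix, string needle)
    {
        int prefixAt = html.IndexOf(prefix, StringComparison.Ordinal);
        if (prefixAt < 0) return false;

        int open = html.IndexOf("<div>", prefixAt + prefix.Length, StringComparison.Ordinal);
        if (open < 0) return false;
        int contentStart = open + "<div>".Length;

        int close = html.IndexOf("</div>", contentStart, StringComparison.Ordinal);
        if (close < 0) return false;

        // Substring(pos, len): only the div's content is inspected,
        // so whatever follows the closing tag (text4) is never read.
        string window = html.Substring(contentStart, close - contentStart);
        return window.Contains(needle);
    }

    public static void Main()
    {
        string sample =
            "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4";

        Console.WriteLine(ContainsInFirstDivAfterPrefix(sample, "prefix", "text3")); // True
        Console.WriteLine(ContainsInFirstDivAfterPrefix(
            "prefix <div> text1 </div> text3 text4", "prefix", "text3"));            // False
    }
}
```

Unlike the regex in the snippet above, this does not check whether prefix itself appears inside a tag, which may or may not matter for the input at hand.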


(Original answer, written before the extra details were provided.)
Here is a JavaScript + jQuery solution:

var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';

InputString = InputString.replace(/^.*?prefix/,'');

var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>');

console.log(MatchingDivs.get());

This makes use of jQuery's ability to accept a context as its second argument (though it appears the input needs to be wrapped in div tags for this to actually work).
