How do I find an HTML div that contains specific text after a text prefix?

2023-01-10 17:49 Q&A author:

I have the following string:

<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4

and want to know whether it contains text3 inside a div that comes after prefix:

prefix<div>...text3...</div>

but I don't know how to write a regex for that, since I can't use [^<]+ because the divs can contain a strong tag inside.

Please help

EDIT:

Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div

EDIT2: I don't want to use an HTML parser; this can be achieved easily (and MUCH faster) with a regex. The HTML here is simple: no attributes in tags, no nested divs. And even some percentage of wrong answers is acceptable in my case.

If you turn off the "greedy" option (C# has no such switch, so write the lazy quantifier .*? in place of .*), you should be able to just use something like prefix<div>.*?text3.*?</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*?text3.*?</div> instead.)

Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.

Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
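The counting idea can be sketched roughly as follows. This is only an illustration under the question's assumptions (tags are exactly <div> and </div>, no attributes); the class and method names are made up for the example, and it checks the first div block that appears anywhere after prefix rather than requiring it to follow immediately.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivDepthScanner
{
    // Matches only bare <div> and </div> tags (assumption: no attributes).
    private static readonly Regex DivTag = new Regex(@"</?div>");

    // Returns true if `needle` occurs inside the first outermost
    // <div>...</div> block that starts after `prefix`.
    public static bool ContainsInDivAfterPrefix(string html, string prefix, string needle)
    {
        int start = html.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return false;

        int depth = 0;        // how many divs are open right now
        int blockStart = -1;  // index just past the outermost <div>
        for (Match m = DivTag.Match(html, start + prefix.Length); m.Success; m = m.NextMatch())
        {
            if (m.Value == "<div>")
            {
                if (depth == 0) blockStart = m.Index + m.Length;
                depth++;
            }
            else if (depth > 0)
            {
                depth--;
                if (depth == 0)
                {
                    // Outermost div just closed: search its contents only,
                    // so the text after the closing tag is never scanned.
                    return html.IndexOf(needle, blockStart, m.Index - blockStart,
                                        StringComparison.Ordinal) >= 0;
                }
            }
        }
        return false; // no complete div block after the prefix
    }
}
```

With the string from the question, ContainsInDivAfterPrefix(input, "prefix", "text3") returns true, and it also handles prefix<div>a<div>b</div>text3</div>, because the search only happens once the depth drops back to zero.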

EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
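To make the problem concrete, here is a small C# check of the lazy pattern. The \s* after prefix and RegexOptions.Singleline are additions of this sketch (the sample has a space after prefix, and the input may span lines); the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyPatternDemo
{
    // Lazy quantifiers (.*?) stand in for "greedy turned off".
    private static readonly Regex Lazy =
        new Regex(@"prefix\s*<div>.*?text3.*?</div>", RegexOptions.Singleline);

    public static bool Matches(string html) => Lazy.IsMatch(html);

    public static void Main()
    {
        // text3 is inside the div after prefix, so a match is correct here.
        Console.WriteLine(Matches(
            "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4")); // True

        // text3 is OUTSIDE the first div, yet .*? is free to run past </div>,
        // so this is a false positive.
        Console.WriteLine(Matches(
            "prefix <div> text1 </div> text3 <div> other </div>")); // True
    }
}
```

The second call is the failure mode: a lazy .*? only prefers the shortest match, it does not forbid crossing a closing tag when that is the only way to match.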

This is my new regex:

prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>

It seems to work OK.
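For reference, a minimal C# sketch of how this pattern might be used. Two things here are assumptions of the sketch and not part of the regex above: \s* after prefix (the sample string has a space there, which the pattern as written would not accept) and non-capturing groups (?:...). The class name is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefixDivMatcher
{
    // Each group iteration consumes text up to a '<' that does NOT start "</div>",
    // so the match can step over <strong>...</strong> but never past the closing div.
    private static readonly Regex Pattern = new Regex(
        @"prefix\s*<div>(?:[^<]*<(?!/div>))*[^<]*text3(?:[^<]*<(?!/div>))*[^<]*</div>");

    public static bool IsMatch(string html) => Pattern.IsMatch(html);

    public static void Main()
    {
        // text3 inside the div that follows prefix.
        Console.WriteLine(IsMatch(
            "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4")); // True

        // text3 only appears after the div has closed.
        Console.WriteLine(IsMatch(
            "prefix <div> text1 </div> text3 <div> other </div>")); // False
    }
}
```

Because the negative lookahead (?!/div>) stops the scan at the first closing div, the long text4 tail is not searched when text3 is missing.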

For C# + HtmlAgilityPack you can do something like:

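using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Remove everything up to and including the first occurrence of prefix.
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(InputString);

// contains() needs two arguments; "." is the div's text. node is null if no div matches.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");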

The prefix removal is still not a good way of dealing with it. Ideally you'd use HtmlAgilityPack to find where prefix occurs in the DOM, translate that into a position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using a similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
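As a rough starting point, here is a sketch of a related but different approach: instead of mapping the prefix to a string position, it finds the text node that contains prefix and walks its following siblings to the next div. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are hypothetical. Note that LoadHtml still parses the whole input, including text4, so this does not address the speed concern.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PrefixSiblingSearch
{
    // Returns true if the first <div> sibling after the text node
    // containing `prefix` has `needle` somewhere in its text.
    public static bool DivAfterPrefixContains(string html, string prefix, string needle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First text node whose text contains the prefix.
        HtmlNode prefixNode = doc.DocumentNode.Descendants()
            .FirstOrDefault(n => n.NodeType == HtmlNodeType.Text
                                 && n.InnerText.Contains(prefix));
        if (prefixNode == null) return false;

        // Walk forward through siblings and stop at the first <div>;
        // nodes after that div are never inspected.
        for (HtmlNode sib = prefixNode.NextSibling; sib != null; sib = sib.NextSibling)
        {
            if (sib.NodeType == HtmlNodeType.Element && sib.Name == "div")
                return sib.InnerText.Contains(needle);
        }
        return false;
    }
}
```

For the string in the question, DivAfterPrefixContains(input, "prefix", "text3") should find the " prefix " text node between the two divs and then test the second div's InnerText.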

(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:

var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';

InputString = InputString.replace(/^.*?prefix/,'');

var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>');

console.log(MatchingDivs.get());

This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

Tags: html-agility-pack html-parsing regex
