How do I find whether an HTML div contains specific text after a text prefix?
I have the following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3
inside a div that comes after prefix:
prefix<div>...text3...</div>
but I don't know how to write a regex for that, since I can't use [^<]+ because the divs can contain a strong tag inside.
Please help.
EDIT:
- Div tags after prefix are guaranteed not to be nested
- Language is C#
- Text4 is very long, so the regex must not look past the closing div
EDIT2: I don't want to use an HTML parser; this can be achieved easily (and MUCH faster) with a regex. The HTML here is simple: no attributes in tags, no nested divs. And even some percentage of wrong answers is acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above, because a lazy .* can still run past a closing </div> if that is the only way to reach text3. It is probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language, so you can't use regular expressions to do everything you might want with it.
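To make that failure mode concrete, here is a minimal C# sketch. The class and method names are made up for illustration, and Singleline is my assumption so that . also matches newlines. It wraps the simple pattern with lazy quantifiers so that it can be tried on a string where text3 sits outside the div:

```csharp
using System.Text.RegularExpressions;

public static class LazyDotDemo
{
    // The simple pattern from this answer, with lazy quantifiers.
    // Singleline makes '.' match newlines as well.
    private static readonly Regex Naive = new Regex(
        @"prefix<div>.*?text3.*?</div>", RegexOptions.Singleline);

    // True whenever the naive pattern finds a match anywhere in the input.
    public static bool NaiveMatch(string input) => Naive.IsMatch(input);
}
```

Calling LazyDotDemo.NaiveMatch("prefix<div>text1</div> text3 <div>x</div>") returns true even though text3 is outside the div that follows prefix: the lazy .*? simply expands across the first </div> until it finds text3, and the second one expands until it reaches the next </div>.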
This is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
It seems to work OK.
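For reference, a minimal C# sketch of how the pattern above could be wrapped in a helper. The class name, method name, and parameters are hypothetical. I have made two small changes that are my own assumptions: the groups are non-capturing, and \s* is allowed between prefix and <div>, because the sample string in the question has a space there and the pattern as written would not match it.

```csharp
using System.Text.RegularExpressions;

public static class DivTextCheck
{
    // Hypothetical helper: true if `needle` occurs inside the <div> that
    // directly follows `prefix`. Assumes attribute-free, non-nested divs.
    public static bool ContainsAfterPrefix(string html, string prefix, string needle)
    {
        // "(?:[^<]*<(?!/div>))*" consumes text and tags, but refuses to step
        // over a '<' that starts "</div>", so the match cannot leave the div.
        const string inside = @"(?:[^<]*<(?!/div>))*[^<]*";

        // Regex.Escape keeps special characters in prefix/needle literal.
        string pattern = Regex.Escape(prefix) + @"\s*<div>" + inside
                       + Regex.Escape(needle) + inside + "</div>";

        return Regex.IsMatch(html, pattern);
    }
}
```

With the string from the question, DivTextCheck.ContainsAfterPrefix(input, "prefix", "text3") returns true, and the match ends at the closing </div>, so text4 is not part of it. Note that when text3 is absent the engine still has to scan the input for further occurrences of prefix.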
For C# + HtmlAgilityPack you can do something like:
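// requires: using System.Text.RegularExpressions; using HtmlAgilityPack;
// Strip everything up to and including the first "prefix" found outside a tag.
InputString = Regex.Replace(InputString, "^(?:[^<]+?|<[^>]*>)*?prefix", "");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
// XPath contains() needs two arguments: the node to search ('.') and the substring.
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");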
The prefix removal is still not a good way of dealing with it. Ideally you'd use HtmlAgilityPack to find where prefix occurs in the DOM, translate that into a position in the string, then take a substring(pos,len) (or equivalent) so that you look at only the relevant text (you can avoid looking at text4 with a similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
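A rough sketch of the substring idea in C#, with one substitution stated plainly: instead of locating prefix through the DOM, it uses plain String.IndexOf, which only makes sense under the question's assumptions (no attributes, no nested divs). The class and method names are hypothetical. It takes the first <div> that appears after prefix and never reads past its closing </div>:

```csharp
using System;

public static class PrefixWindow
{
    // Hypothetical helper: true if `needle` occurs between the first <div>
    // after `prefix` and that div's closing </div>.
    public static bool Contains(string html, string prefix, string needle)
    {
        // Locate the prefix; give up if it is missing.
        int p = html.IndexOf(prefix, StringComparison.Ordinal);
        if (p < 0) return false;

        // Locate the first opening div after the prefix.
        int open = html.IndexOf("<div>", p + prefix.Length, StringComparison.Ordinal);
        if (open < 0) return false;
        int start = open + "<div>".Length;

        // Locate the matching close; divs are assumed not to nest.
        int close = html.IndexOf("</div>", start, StringComparison.Ordinal);
        if (close < 0) return false;

        // Only the div's own content is inspected, never the text after it.
        string body = html.Substring(start, close - start);
        return body.Contains(needle);
    }
}
```

Unlike the regex approach, this does not check that only whitespace separates prefix from the <div>, and it treats the first textual occurrence of prefix as the marker, so it may be wrong where prefix also appears earlier in the input.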
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>');
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).