RegExp get string inside string
Let's presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp, we need to get text1 and text2, but not text3.
How to do this?
Thanks in advance.
EDIT: This is just an example. The text I'm parsing could be just plain text. The main thing I want to accomplish is to list all strings from a specific section of a document. I used this HTML code as an example because it closely resembles the thing I need to get.
(?siU)<h1>(.*)</h1>
would match all three strings, but how do I get only the first two?
EDIT2: Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need the quoted words from the first section but not from the second.
Yet again, (?siU)"(.*)"
returns quoted words from the whole text, and I need only those between the words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)
For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>)
for the first sample and
(?siU)"(.*)"(?=.+Section2)
for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \" in the patterns above.
Both patterns use a positive lookahead, which only accepts a match when the end marker (<div2> or Section2) still appears somewhere after it. Beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
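If you want to try the lookahead idea outside Rainmeter, here is a minimal Python sketch of the same two patterns. It is only an illustration, not Rainmeter syntax: Python's re module has no (?U) flag, so I am assuming the lazy quantifiers .*? and .+? are an acceptable stand-in, and the variable names are made up for the example.

```python
import re

html_sample = """<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>"""

text_sample = """Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End"""

# Assumption: (?U) from the original patterns is replaced by lazy quantifiers,
# because Python's re module does not support an "ungreedy" flag.
# The lookahead (?=...) requires the end marker to appear later in the text.
h1_pattern = re.compile(r'(?si)<h1>(.*?)</h1>(?=.+?<div2>)')
quote_pattern = re.compile(r'(?si)"(.*?)"(?=.+?Section2)')

headings = h1_pattern.findall(html_sample)
quoted = quote_pattern.findall(text_sample)

print(headings)
print(quoted)
```

Matches that come after the end marker are rejected because nothing that follows them satisfies the lookahead.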
Use a DOM library and call getElementsByTagName('div'), and you'll get a nodeList back. You can reference the first item with ->item(0) and then call getElementsByTagName('h1') using that div as the context node. Grab the text of each h1 with the ->nodeValue property.
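As a rough sketch of the DOM route, here is the same idea in Python with the standard-library xml.dom.minidom, rather than the ->item(0) style API mentioned above. Two assumptions on my part: the sample is wrapped in a <root> element so that it parses as XML, and the tag is looked up as div1 because that is what the question's sample uses.

```python
from xml.dom import minidom

# Assumption: a wrapping <root> element is added, since XML needs a single root.
markup = """<root>
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
</root>"""

doc = minidom.parseString(markup)

# Take the first matching section, then search for h1 elements only inside it.
first_section = doc.getElementsByTagName("div1")[0]
h1_nodes = first_section.getElementsByTagName("h1")

# In minidom the text lives in the element's child text node.
section_texts = [h1.firstChild.nodeValue for h1 in h1_nodes]

print(section_texts)
```

Because the second lookup is scoped to the first section's node, the h1 inside div2 is never visited.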