
Regexp - search for text which doesn't contain whole word

I have text similar to this:

<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>

and I need to extract this part of it using a regexp:

this is <b>the</b> text

The problem is that when I use a simple regexp like (<html>.*</p>), I get the whole text up to the last occurrence of </p>.

Can anyone help me?

thanks lennyd


You need a non-greedy match:

<html>.*?</p>

Also, you might want to consider using an HTML parser instead of regular expressions for this task.
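
For the parser route, here is a minimal sketch using Python's standard-library html.parser, assuming Python is an option for you (the question does not say which language is in use). The class name FirstParagraph and its attributes are made up for illustration; it collects only the contents of the first <p> element and does not handle nested paragraphs.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Hypothetical helper: collects the inner markup of the first <p>."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside the first <p>
        self.done = False    # True once the first </p> has been seen
        self.parts = []      # pieces of inner markup, in document order

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.inside:
            self.inside = True
        elif self.inside:
            # Keep inner tags such as <b> exactly as they were written.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.inside:
            return
        if tag == "p":
            self.inside = False
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


text = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"
parser = FirstParagraph()
parser.feed(text)
result = "".join(parser.parts)
print(result)  # this is <b>the</b> text
```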


By default, regular expression quantifiers are greedy, i.e. they match as much as possible and you get the match of maximum length. To get a non-greedy ("lazy") match that stops at the first possible end, use .*? instead of .*
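
To see the difference on the text from the question, here is a small sketch in Python's re module (Python is an assumption here; most Perl-style regex engines treat .* and .*? the same way). The variable names are only for illustration.

```python
import re

text = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy: .* grabs as much as it can, so the match ends at the LAST </p>.
greedy = re.search(r"<html>.*</p>", text).group(0)

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST </p>.
lazy = re.search(r"<html>.*?</p>", text).group(0)

# A capture group returns just the paragraph contents.
inner = re.search(r"<p>(.*?)</p>", text).group(1)

print(greedy)  # <html><p>this is <b>the</b> text</p> and <p>this is another text</p>
print(lazy)    # <html><p>this is <b>the</b> text</p>
print(inner)   # this is <b>the</b> text
```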


To capture the data between paragraph tags you can use a regexp with a positive look-ahead assertion, /<p>(.*)(?=<\/p>)/. Note that .* is greedier than .*? (with several paragraphs on one line it runs up to the last </p>) and works slower, but it may be helpful for you. Also make sure that your HTML is valid, which means:

  1. All paragraph tags are closed. Browsers close an open paragraph implicitly when they enter another block, but a regex will not.
  2. Paragraph tags are not nested :) Otherwise you will have problems with any regex.
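
As a rough illustration of the look-ahead pattern above, again assuming Python's re module: with the greedy .* the capture runs to the last </p>, and swapping in .*? (my variation, not part of the pattern as given) stops at the first one. Because the look-ahead does not consume </p>, findall can pick up every paragraph in turn.

```python
import re

text = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy capture with look-ahead: backtracks only as far as the LAST </p>.
greedy = re.search(r"<p>(.*)(?=</p>)", text).group(1)

# Lazy capture with the same look-ahead: stops before the FIRST </p>.
lazy = re.search(r"<p>(.*?)(?=</p>)", text).group(1)

# The look-ahead leaves </p> unconsumed, so findall yields each paragraph.
every = re.findall(r"<p>(.*?)(?=</p>)", text)

print(greedy)  # this is <b>the</b> text</p> and <p>this is another text
print(lazy)    # this is <b>the</b> text
print(every)   # ['this is <b>the</b> text', 'this is another text']
```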


Silly question, but still using pure regex: why not just strip any <..> inside the paragraphs, and THEN grab the phrases using something like [^<]?
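
One way to read this suggestion, sketched in Python under my own assumptions: first remove every tag that is not <p> or </p>, then capture runs of non-< characters between the paragraph tags. The exact patterns are my guess at what was meant, and note that this drops the <b> markup, so the result is plain text rather than the exact string asked for.

```python
import re

text = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Step 1: strip every tag except <p> and </p>.
# (?!/?p\b) rejects "p" and "/p" right after "<", so paragraph tags survive.
stripped = re.sub(r"<(?!/?p\b)[^>]*>", "", text)

# Step 2: with no inner tags left, [^<]* cannot run past the closing </p>.
phrases = re.findall(r"<p>([^<]*)</p>", stripped)

print(stripped)  # <p>this is the text</p> and <p>this is another text</p>
print(phrases)   # ['this is the text', 'this is another text']
```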
