
Regex needed to match anything within p tags

I need a regular expression to match anything that is within <p> tags. For example, if I had some text:

<p>Hello world</p>

The regex should match the "Hello world" part.


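In JavaScript:

var str = "<p>Hello world</p>";
// match() exposes the capture group; search() would only return the index of the match
var m = str.match(/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/);
var text = m ? m[1] : null; // "Hello world"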

In PHP:

$str = "<p>Hello world</p>";
// the captured text ends up in $matches[1]
preg_match_all("/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/", $str, $matches);

These will match something as complex as this:

< p style=  "font-weight: bold;" >Hello world  <  /  p >
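
To see what this pattern does and does not capture, it helps to run it over a few inputs. Here is a minimal Python sketch: the pattern is the one above, while the name p_texts is made up for illustration. Because the capture group is [^<]*, a paragraph that contains any nested tag is not matched at all.

```python
import re

# Same pattern as above, written as a Python raw string.
P_TEXT = re.compile(r"<\s*p[^>]*>([^<]*)<\s*/\s*p\s*>")

def p_texts(html):
    """Return the captured text of every <p> element the pattern matches."""
    return P_TEXT.findall(html)

print(p_texts('<p>Hello world</p>'))
# ['Hello world']
print(p_texts('< p style=  "font-weight: bold;" >Hello world  <  /  p >'))
# ['Hello world  ']  (trailing spaces are part of the capture)
print(p_texts('<p>Hello <b>world</b></p>'))
# []  (the nested <b> stops [^<]* before the closing tag)
```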


EDIT: Don't do it. Just don't.

See this question

If you insist, use <p>(.+?)</p>; the result will be in the first capture group. It is not perfect, but no regexp solution to the HTML parsing problem ever will be.

E.g. (in Python):

>>> import re
>>> r = re.compile('<p>(.+?)</p>')
>>> r.findall("<p>fo o</p><p>ba adr</p>")
['fo o', 'ba adr']


It seems that the solutions proposed above will fail to do at least one of the following:

  • return the text within <p>...</p> tags when it contains other tags such as <a> or <em>
  • distinguish between <p> and <path>
  • match tags with attributes, such as <p class="content">

Consider using this regex:

<p(|\s+[^>]*)>(.*?)<\/p\s*>

Resulting text will be captured in group 2.


Obviously, this solution won't work properly when the closing </p> tag is, for some reason, enclosed in a comment: <p> ... <!-- ... </p> ... -->
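
To make the behaviour concrete, here is a small Python sketch of the pattern above. paragraph_bodies is a hypothetical helper name, not part of any library. It shows attributes and nested tags being handled, <path> being skipped, and the comment case going wrong.

```python
import re

# The pattern from this answer; group 1 holds the attributes, group 2 the body.
P_ELEMENT = re.compile(r"<p(|\s+[^>]*)>(.*?)<\/p\s*>")

def paragraph_bodies(html):
    """Return group 2 (the body) of every match."""
    return [m.group(2) for m in P_ELEMENT.finditer(html)]

print(paragraph_bodies('<p class="content">Read <a href="#">this</a></p>'))
# ['Read <a href="#">this</a>']
print(paragraph_bodies('<path d="M0 0"></path><p>x</p>'))
# ['x']  (<path> is not mistaken for <p>)
print(paragraph_bodies('<p>one <!-- </p> --> two</p>'))
# ['one <!-- ']  (the commented-out </p> ends the match early)
```

Note that . does not match newlines by default, so a paragraph spanning several lines needs the pattern compiled with re.DOTALL.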


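You can use this in Python as a more robust solution, since it relies on an HTML parser rather than a regex (it needs the third-party bs4 and requests packages):

import bs4
import requests

link = "https://example.com/"  # placeholder: URL of the page to parse
page = requests.get(link)
page_content = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns a list of Tag objects; call .get_text() on each for its text
result = page_content.find_all('p')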
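
If installing third-party packages is not an option, a similar parser-based approach can be sketched with the standard library's html.parser module alone. This is a swapped-in alternative to the BeautifulSoup code above, not an equivalent of it; ParagraphCollector and paragraph_texts are hypothetical names, and the sketch only collects plain text, dropping nested tags and comments.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph texts
        self._in_p = False     # True while between <p> and </p>
        self._chunks = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._in_p:
            self._chunks.append(data)

def paragraph_texts(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

html = ('<p class="a">Hello <b>world</b></p>'
        '<path>no</path>'
        '<p>two<!-- </p> --> more</p>')
print(paragraph_texts(html))
# ['Hello world', 'two more']
```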


Regex:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>

This will work for any pair of tags.

E.g. <p class="foo">hello<br/></p>

The \1 makes sure that the opening tag matches the closing tag.

The content between the tags is captured in \2.
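
As a rough illustration of the backreference, here is a Python sketch using the same pattern; tag_pairs is a made-up helper name. It also shows a limit worth knowing about: when an element contains another element with the same tag name, the lazy (.*?) stops at the first closing tag.

```python
import re

# \1 must repeat whatever tag name group 1 captured.
ANY_PAIR = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>")

def tag_pairs(html):
    """Return (tag name, inner content) for every match."""
    return [(m.group(1), m.group(2)) for m in ANY_PAIR.finditer(html)]

print(tag_pairs('<p class="foo">hello<br/></p>'))
# [('p', 'hello<br/>')]
print(tag_pairs('<div><p>x</p></div>'))
# [('div', '<p>x</p>')]  (the inner <p> is consumed by the outer match)
print(tag_pairs('<div><div>a</div></div>'))
# [('div', '<div>a')]  (same-name nesting is cut at the first </div>)
```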


For anybody looking for a regex to match specific HTML tags, the one below allows nested tags, attributes, and stray whitespace inside the tags:

<\s*p[^>]*>(.*?)<\s*\/\s*p\s*>

This will match strings like the ones below, including the one from xzyfer's answer:

<p>I would like <b>all</b> the text!</p>
< p style=  "font-weight: bold;" >Hello world  <  /  p >

Link to the Regex on Regex101 here: https://regex101.com/r/kjpLII


If you would like to use the regex for other HTML tags instead of just p tags, change the p's in the regex to whichever tag you wish to match:

<\s*div[^>]*>(.*?)<\s*\/\s*div\s*>
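
Swapping the tag name by hand can be wrapped in a small function. The following Python sketch builds the same pattern for any tag name; tag_pattern is a hypothetical helper, not something from the answer.

```python
import re

def tag_pattern(tag):
    """Build the pattern above for an arbitrary tag name."""
    name = re.escape(tag)  # guard against regex metacharacters in the name
    return re.compile(r"<\s*" + name + r"[^>]*>(.*?)<\s*/\s*" + name + r"\s*>")

print(tag_pattern("p").findall('<p>I would like <b>all</b> the text!</p>'))
# ['I would like <b>all</b> the text!']
print(tag_pattern("div").findall('< div id="x" >a <p>b</p>  <  /  div >'))
# ['a <p>b</p>  ']
print(tag_pattern("p").findall('<pre>x</pre><p>y</p>'))
# ['x</pre><p>y']
```

The last call shows a caveat: since [^>]* follows the tag name directly, the pattern for p also accepts <pre> (or <path>) as an opening tag. Adding \b after the name, as in the backreference answer, is one way to avoid that.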