regex needed to match anything within p tags

2023-02-08 11:20

I need a regular expression to match anything that is within <p> tags. For example, if I had some text:

<p>Hello world</p>

The regex would match the "Hello world" part.

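In JavaScript:

var str = "<p>Hello world</p>";
// match() returns an array whose index 1 holds the first capture group;
// search() would return only the index of the match
var m = str.match(/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/);
var text = m ? m[1] : null;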

In PHP:

$str = "<p>Hello world</p>";
// pass $matches to receive the results; $matches[1] holds the captured texts
preg_match_all('/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/', $str, $matches);

These will match something as complex as this:

< p style=  "font-weight: bold;" >Hello world  <  /  p >
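As a rough check of what this pattern does and does not capture, here is a minimal Python sketch. It reuses the same pattern as the JavaScript and PHP snippets; the names P_TAG and p_contents are only illustrative. Because the capture group is [^<]*, the pattern finds nothing once the paragraph contains another tag.

```python
import re

# Same pattern as in the JavaScript and PHP snippets above.
P_TAG = re.compile(r"<\s*p[^>]*>([^<]*)<\s*/\s*p\s*>")

def p_contents(html):
    """Return the captured text (group 1) of every match."""
    return P_TAG.findall(html)

simple = p_contents("<p>Hello world</p>")
# Stray whitespace inside the tags is tolerated; trailing spaces in the text are kept.
spaced = p_contents('< p style=  "font-weight: bold;" >Hello world  <  /  p >')
# [^<]* stops at the first "<", so a paragraph with a nested tag does not match at all.
nested = p_contents("<p>I would like <b>all</b> the text!</p>")
```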

EDIT: Don't do it. Just don't.

See this question

If you insist, use <p>(.+?)</p> and the result will be in the first group. It is not perfect, but no regexp solution to the HTML parsing problem ever will be.

E.g. (in Python):

>>> import re
>>> r = re.compile('<p>(.+?)</p>')
>>> r.findall("<p>fo o</p><p>ba adr</p>")
['fo o', 'ba adr']
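Two of the imperfections are easy to see in a short sketch. The lazy quantifier matters, and "." does not cross line breaks unless re.DOTALL is set. The flag and the variable names here are additions for illustration, not part of the answer above.

```python
import re

lazy = re.compile(r"<p>(.+?)</p>")
greedy = re.compile(r"<p>(.+)</p>")
lazy_dotall = re.compile(r"<p>(.+?)</p>", re.DOTALL)

html = "<p>fo o</p><p>ba adr</p>"
multiline = "<p>first\nsecond</p>"

lazy_result = lazy.findall(html)              # one item per paragraph
greedy_result = greedy.findall(html)          # .+ runs on to the last </p>
no_match = lazy.findall(multiline)            # "." will not match the newline
with_dotall = lazy_dotall.findall(multiline)  # DOTALL lets "." match "\n"
```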

It seems that the solutions proposed above will fail in at least one of these ways:

they do not return the text within <p> ... </p> tags whenever it contains other tags such as <a>, or
they do not distinguish between <p> and <path>, or
they do not match tags with attributes, such as the style attribute in the example above.

Consider using this regex:

<p(|\s+[^>]*)>(.*?)<\/p\s*>

The resulting text will be captured in group 2.

Obviously, this solution won't work properly whenever the closing tag </p> is for some reason enclosed in comment tags <!-- ... -->
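The behaviour described above can be checked with a small Python sketch. The pattern is the one from this answer (the "/" needs no escaping in Python); the helper name p_contents is made up for the example.

```python
import re

# Group 1 is either empty or whitespace followed by attributes; group 2 is the content.
P_TAG = re.compile(r"<p(|\s+[^>]*)>(.*?)</p\s*>")

def p_contents(html):
    """Return group 2 (the text between the tags) for every match."""
    return [m.group(2) for m in P_TAG.finditer(html)]

nested = p_contents("<p>I would like <b>all</b> the text!</p>")
with_attrs = p_contents('<p style="font-weight: bold;">Hello world</p>')
# <path> is skipped: after "<p" the pattern requires ">" or whitespace.
not_path = p_contents('<path d="M0 0"></path><p>text</p>')
# Known weakness: a </p> inside a comment ends the match too early.
in_comment = p_contents("<p>before <!-- </p> --> after</p>")
```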

You can use this in Python as a comprehensive solution:

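import bs4
import requests

link = "https://example.com"  # placeholder URL; replace with the page to parse

page = requests.get(link)
page_content = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns every <p> element as a Tag object
result = page_content.find_all('p')
# get_text() gives the text inside each <p>, including text in nested tags
texts = [p.get_text() for p in result]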
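If installing third-party packages is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not BeautifulSoup itself, and the class and function names are hypothetical. It works on an inline string, so no network access is needed.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph texts
        self._parts = None    # text chunks of the open <p>, or None outside one

    def _flush(self):
        if self._parts is not None:
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an unclosed one
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

def paragraph_texts(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing <p> that was never closed
    return parser.paragraphs

sample = ('<p style="font-weight: bold;">Hello world</p><path></path>'
          '<p>I would like <b>all</b> the text!</p>')
texts = paragraph_texts(sample)
# Comments are reported separately by the parser, so a </p> inside one is ignored.
commented = paragraph_texts("<p>before <!-- </p> --> after</p>")
```

Unlike the regex approaches, this handles attributes, nested tags, <path>, and commented-out closing tags without special cases.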

Regex:

<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>

This will work for any pair of tags.

e.g. <p>hello</p>

The \1 makes sure that the opening tag matches the closing tag.

The content between the tags is captured in \2.
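A short Python sketch of the backreference in action. The pattern is taken from this answer unchanged; tag_pairs is an illustrative name. Since the tag name is group 1 and the content is group 2, findall returns (name, content) tuples.

```python
import re

ANY_TAG = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>")

def tag_pairs(html):
    """Return (tag name, content) tuples: groups 1 and 2 of each match."""
    return ANY_TAG.findall(html)

pairs = tag_pairs('<p class="x">hello</p><h1>title</h1>')
# \1 must repeat the opening tag name, so a mismatched closing tag gives no match.
mismatched = tag_pairs("<p>hello</div>")
# Matches do not overlap: the inner <b> pair is part of the outer content only.
nested = tag_pairs("<p>I would like <b>all</b> the text!</p>")
```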

For anybody looking into this regex, or any other regex for matching specific HTML tags, the regex below will work as needed:

<\s*p[^>]*>(.*?)<\s*\/\s*p\s*>

This will match strings like the ones below, as mentioned in xzyfer's answer:

<p>I would like <b>all</b> the text!</p>
< p style=  "font-weight: bold;" >Hello world  <  /  p >

Link to the Regex on Regex101 here: https://regex101.com/r/kjpLII

If you would like to use the Regex for other HTML tags instead of just p tags you can change the p's in the Regex to whichever HTML tag you wish to match:

<\s*div[^>]*>(.*?)<\s*\/\s*div\s*>
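Swapping the tag name by hand can also be done in code. The sketch below assumes a hypothetical helper, tag_pattern, that builds the same pattern for any tag name. It also shows a caveat of this pattern: because nothing follows the tag name but [^>]*, a name such as p also accepts longer names that start with p.

```python
import re

def tag_pattern(tag):
    """Compile the whitespace-tolerant pattern above for the given tag name."""
    name = re.escape(tag)  # guard against regex metacharacters in the name
    return re.compile(r"<\s*" + name + r"[^>]*>(.*?)<\s*/\s*" + name + r"\s*>")

div_text = tag_pattern("div").findall('<div class="box">inside</div>')
p_text = tag_pattern("p").findall("<p>I would like <b>all</b> the text!</p>")
# Caveat: "p" is a prefix of "pre", so the opening <pre> is accepted as a <p>.
prefix_clash = tag_pattern("p").findall("<pre>code</pre><p>text</p>")
```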
