Help with regular expression to scrape website

2023-01-15 08:59

I need to write a regular expression for the following (NB. ignore carriage returns, I've added them for readability):

<strong>Contact details</strong>
<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />
<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />

I am using

/([\+\d\(\)\s]+)/

Which matches the number blocks, and I can use an offset of 0-2 to identify them. The problem is that it is returning white space as well, which is screwing up my offsets. How do I say "it must contain at least one digit in the match"?

I did also try

/\<label\>Office:\<\/label\>&nbsp;([\+\d\(\)\s]+)\<br \/\>/

But that would return

+44 (0)12 3456 7890<br />
<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />
<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a>

It's not a good idea to parse HTML using regex; use a DOM-based parser instead.
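As a rough illustration of the parser-based approach, here is a minimal sketch in Python using the standard-library html.parser module. It is an event-driven parser rather than a full DOM, and the names ContactParser and contacts are made up for this example; the markup is the snippet from the question.

```python
from html.parser import HTMLParser

# Sample markup from the question, joined without line breaks.
HTML = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />'
    '<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />'
)


class ContactParser(HTMLParser):
    """Hypothetical helper: maps each <label> text to the text that follows it."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into the character U+00A0.
        super().__init__(convert_charrefs=True)
        self.contacts = {}
        self._in_label = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._in_label = True
        elif tag == "br":
            # A line break ends the value for the current label.
            self._current = None

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label:
            # "Office:" becomes the key "Office".
            self._current = data.strip().rstrip(":")
            self.contacts[self._current] = ""
        elif self._current is not None:
            self.contacts[self._current] += data


parser = ContactParser()
parser.feed(HTML)
parser.close()

# str.strip() also removes the non-breaking space left by &nbsp;.
contacts = {key: value.strip() for key, value in parser.contacts.items()}
print(contacts["Office"])
print(contacts["E-mail"])
```

Looking values up by label text avoids depending on match offsets at all.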

Your regex does not work because it's greedy. To make it non-greedy, change

([\+\d\(\)\s]+)

to

([\+\d\(\)\s]+?)
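To see the difference between greedy and non-greedy matching, here is a small Python sketch. It assumes a wildcard group such as (.+), which is not in the question but reproduces the kind of overrun shown there; the variable names are illustrative only.

```python
import re

# Sample markup from the question, joined without line breaks.
html = (
    '<label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />'
)

# Greedy: .+ grabs as much as possible, so the match runs to the LAST <br />.
greedy = re.search(r'<label>Office:</label>&nbsp;(.+)<br />', html).group(1)

# Non-greedy: .+? grabs as little as possible, stopping at the FIRST <br />.
lazy = re.search(r'<label>Office:</label>&nbsp;(.+?)<br />', html).group(1)

# A character class that cannot match '<' also stops at the first tag.
bounded = re.search(r'<label>Office:</label>&nbsp;([+\d()\s]+)<br />', html).group(1)

print(greedy)
print(lazy)
print(bounded)
```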

Also, +, ( and ) are treated literally inside a character class, so there is no need to escape them:

([+\d()\s]+?)
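For the "must contain at least one digit" part of the question, one possible approach is to require the match to end on a digit and to start on +, ( or a digit, so that runs of pure white space can never match. The sketch below is in Python; the pattern and the variable names are suggestions, not something taken from the question.

```python
import re

# Sample markup from the question, joined without line breaks.
html = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />'
    '<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />'
)

# Original pattern: also matches lone spaces, e.g. the one inside "<br />".
original = re.findall(r'[\+\d\(\)\s]+', html)

# Suggested pattern: first character is +, ( or a digit; last character is a
# digit. No escapes are needed for +, ( and ) inside the character class.
numbers = re.findall(r'[+(\d][+\d()\s]*\d', html)

print(numbers)
# ['+44 (0)12 3456 7890', '+44 (0)12 3456 7890', '+44 (0)1234 567890']
```

With the white-space-only matches gone, offsets 0-2 line up with Office, Direct and Mobile in this sample.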
