
Help with regular expression to scrape website

I need to write a regular expression for the following (NB. ignore carriage returns, I've added them for readability):

<strong>Contact details</strong>
<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />
<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />

I am using

/([\+\d\(\)\s]+)/

This matches the number blocks, and I can use an offset of 0-2 to identify them. The problem is that it is returning white space as well, which is screwing up my offsets. How do I say "it must contain at least one digit in the match"?

I did also try

/\<label\>Office:\<\/label\>&nbsp;([\+\d\(\)\s]+)\<br \/\>/

But that would return

+44 (0)12 3456 7890<br />
<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />
<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />
<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a>


It's not a good idea to parse HTML using regex; use a DOM-based parser instead.
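As a rough illustration of the parser route, here is a minimal Python sketch. It uses the standard library's event-based `html.parser` as a stand-in for a full DOM library, and it assumes the markup always has the shape shown in the question (a `<label>` followed by its value). The class name `ContactParser` and the variable names are made up for this example.

```python
from html.parser import HTMLParser


class ContactParser(HTMLParser):
    """Collects the text that follows each <label>, keyed by the label text."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes &nbsp;
        self.in_label = False
        self.current = None  # label whose value is currently being collected
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self.in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self.in_label = False

    def handle_data(self, data):
        if self.in_label:
            # "Office:" -> "Office"
            self.current = data.strip().rstrip(":")
            self.fields[self.current] = ""
        elif self.current is not None:
            # Text after the label (including text inside <a>) is the value.
            self.fields[self.current] += data


html = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />'
    '<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />'
)

parser = ContactParser()
parser.feed(html)
parser.close()

# strip() also removes the non-breaking space that &nbsp; decodes to.
contacts = {label: value.strip() for label, value in parser.fields.items()}
```

Here `contacts["Mobile"]` holds the mobile number, so no offsets are needed to tell the numbers apart.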

Your regex does not work because it's greedy. To make it non-greedy, change

([\+\d\(\)\s]+)

to

([\+\d\(\)\s]+?)

Also, +, ( and ) are treated literally inside a character class, so there is no need to escape them:

([+\d()\s]+?)
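
On the "must contain at least one digit" part of the question, one possible approach is to put a mandatory `\d` between two optional runs of the same character class. The sketch below is in Python purely for illustration (the question does not say which language is in use), and the names `NUMBER`, `LABELLED`, `numbers` and `by_label` are invented for this example.

```python
import re

html = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Direct:</label>&nbsp;+44 (0)12 3456 7890<br />'
    '<label>Mobile:</label>&nbsp;+44 (0)1234 567890<br />'
    '<label>E-mail:</label>&nbsp;<a href="mailto:you@me.com">you@me.com</a><br />'
)

# Any run of the allowed characters, but with at least one digit in it,
# so runs made only of white space no longer match.
NUMBER = re.compile(r"[+\d()\s]*\d[+\d()\s]*")
numbers = [m.strip() for m in NUMBER.findall(html)]

# Variant that anchors on the label, so each number is keyed by name
# instead of by offset. The character class cannot match "<", so the
# capture stops before "<br />".
LABELLED = re.compile(r"<label>(\w+):</label>&nbsp;([+\d()\s]*\d[+\d()\s]*)")
by_label = {label: value.strip() for label, value in LABELLED.findall(html)}
```

With the sample markup, `numbers` has three entries and `by_label` has the keys `Office`, `Direct` and `Mobile`; the e-mail line is skipped because it contains no digit.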