Help with regular expression to scrape website
I need to write a regular expression for the following (NB. ignore carriage returns, I've added them for readability):
<strong>Contact details</strong>
<p><label>Office:</label> +44 (0)12 3456 7890<br />
<label>Direct:</label> +44 (0)12 3456 7890<br />
<label>Mobile:</label> +44 (0)1234 567890<br />
<label>E-mail:</label> <a href="mailto:you@me.com">you@me.com</a><br />
I am using
/([\+\d\(\)\s]+)/
Which matches the number blocks, and I can use an offset of 0-2 to identify them. The problem is that it is returning whitespace as well, which is screwing up my offsets. How do I say "it must contain at least one digit in the match"?
I did also try
/\<label\>Office:\<\/label\> ([\+\d\(\)\s]+)\<br \/\>/
But that would return
+44 (0)12 3456 7890<br />
<label>Direct:</label> +44 (0)12 3456 7890<br />
<label>Mobile:</label> +44 (0)1234 567890<br />
<label>E-mail:</label> <a href="mailto:you@me.com">you@me.com</a>
It's not a good idea to parse HTML using regex; use a DOM-based parser instead.
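As a rough illustration of the parser-based route, here is a minimal sketch. The question doesn't name a language, so Python and its standard-library html.parser module are assumptions on my part, and ContactParser and parse_contacts are hypothetical names made up for this example. It simply records the first piece of non-blank text that follows each <label>:

```python
from html.parser import HTMLParser

# The markup from the question, joined into one string.
SAMPLE = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label> +44 (0)12 3456 7890<br />'
    '<label>Direct:</label> +44 (0)12 3456 7890<br />'
    '<label>Mobile:</label> +44 (0)1234 567890<br />'
    '<label>E-mail:</label> <a href="mailto:you@me.com">you@me.com</a><br />'
)


class ContactParser(HTMLParser):
    """Hypothetical helper: maps each label to the first text that follows it."""

    def __init__(self):
        super().__init__()
        self.in_label = False  # True while inside <label>...</label>
        self.current = None    # most recent label text, without the colon
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self.in_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self.in_label = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only text nodes
        if self.in_label:
            self.current = text.rstrip(":")
        elif self.current is not None and self.current not in self.fields:
            # Keep only the first non-blank text after the label.
            self.fields[self.current] = text


def parse_contacts(html):
    parser = ContactParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling parse_contacts(SAMPLE) gives a dict keyed by "Office", "Direct", "Mobile" and "E-mail", with the surrounding whitespace already stripped, so no offsets are needed.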
Your regex does not work because it's greedy. To make it non-greedy, change
([\+\d\(\)\s]+)
to
([\+\d\(\)\s]+?)
Also, +, ( and ) will be treated literally in a character class, so there is no need to escape them:
([+\d()\s]+?)
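If you do stay with a regex, one way to express "must contain at least one digit" is to require a digit at the end of the match and a non-space character at the start, so that whitespace-only runs can never match. The sketch below is only an illustration under assumptions: Python is used because the question doesn't name a language, and NUMBER and OFFICE are hypothetical names. The second pattern anchors on the label and keeps the \s* outside the capture group so the padding is not captured.

```python
import re

# The markup from the question, joined into one string.
SAMPLE = (
    '<strong>Contact details</strong>'
    '<p><label>Office:</label> +44 (0)12 3456 7890<br />'
    '<label>Direct:</label> +44 (0)12 3456 7890<br />'
    '<label>Mobile:</label> +44 (0)1234 567890<br />'
    '<label>E-mail:</label> <a href="mailto:you@me.com">you@me.com</a><br />'
)

# First character: '+', '(' or a digit. Last character: a digit.
# Whitespace is allowed only in between, so blank runs cannot match.
# Note this needs at least two characters, which is fine for phone numbers.
NUMBER = re.compile(r"[+(\d][+\d()\s]*\d")

# Label-anchored variant with the unescaped character class and a lazy
# quantifier; the \s* on either side sits outside the capture group.
OFFICE = re.compile(r"<label>Office:</label>\s*([+\d()\s]+?)\s*<br />")

numbers = NUMBER.findall(SAMPLE)             # every number block, in order
office = OFFICE.search(SAMPLE).group(1)      # just the Office number
```

With the sample above, numbers holds the three phone numbers in document order (offsets 0-2 as in the question), and office holds the Office number without leading or trailing whitespace.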