How to write regex to pull first and last name from source HTML?
I've been pulling my hair out trying to come up with a regex that will pull the first and last name from the following HTML. My regex-fu is not strong.
<span id="label_85110"><b>First Name</b></span>
<br/>
<span id="value_85110">AWeber- Email Parser</span>
<br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
<span id="value_86004">Submission</span>
<br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
<span id="value_85111">leslie@dakno.com</span>
<br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
<span id="value_85540">919-923-7017</span>
<br/>
</p>
@oliver1,
Please note that the keyword in Regular Expression is "Regular." Regular Expressions are used with Regular Languages.
Unfortunately, (X)HTML is not a Regular Language. Rather, it is a Context Free Language.
You cannot write a RegEx which can properly parse an arbitrary Context Free Language. This is a mathematically proven result: a regular expression cannot keep track of arbitrarily deep nesting, such as matching each opening tag with its own closing tag.
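To make the nesting problem concrete, here is a small Python sketch. The <div> markup is a hypothetical input invented for illustration, not taken from the question; it only shows that neither a lazy nor a greedy pattern pairs tags correctly in general.

```python
import re

# Hypothetical nested markup (not from the question).
nested = "<div>outer <div>inner</div> tail</div>"

# A lazy pattern stops at the first closing tag, so the outer element is cut short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Hypothetical sibling markup: a greedy pattern runs to the last closing tag
# and swallows both elements as if they were one.
siblings = "<div>a</div><div>b</div>"
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))
print(repr(greedy))
```

A real parser avoids both failures because it tracks which closing tag belongs to which opening tag.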
The Solution: Use XPath
Instead you should use an XML parser; you are already using XHTML, which means you could use XPath (although you're missing a <p> at the beginning of your code snippet).
How can any parser, RegEx or query identify the first names and last names? The best I see is "<span> elements which come after a <br />", which is pretty weak.
You can nonetheless write an XPath query to find "<span> elements which come after a <br />".
//br/following-sibling::span/text()
... but that also finds the values of Email and Phone, so you'll want only the first two results.
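If no full XPath engine is at hand, the same "span after a br" selection can be approximated with Python's standard library. This is only a sketch: xml.etree.ElementTree does not implement the following-sibling axis, so the loop below emulates it by hand. The <root> wrapper and the restored opening <p> are assumptions added so the snippet parses as XML, and spans_after_br is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <root> element and with the
# missing opening <p> restored so that it is well-formed XML.
xml_source = """<root>
<p>
<span id="label_85110"><b>First Name</b></span>
<br/>
<span id="value_85110">AWeber- Email Parser</span>
<br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
<span id="value_86004">Submission</span>
<br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
<span id="value_85111">leslie@dakno.com</span>
<br/>
</p>
</root>"""


def spans_after_br(root):
    """Collect the text of every <span> that has an earlier <br/> sibling."""
    values = []
    for parent in root.iter():
        seen_br = False
        for child in parent:
            if child.tag == "br":
                seen_br = True
            elif child.tag == "span" and seen_br:
                values.append(child.text)
    return values


values = spans_after_br(ET.fromstring(xml_source))
# Only the first two results are the names; the rest are Email, Phone, and so on.
first_name, last_name = values[:2]
print(first_name, "|", last_name)
```

Relying on position like this is exactly as weak as the query it imitates: it breaks as soon as the field order changes.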
Alternatively, you could use the id attributes on the <span> elements:
//span[@id='value_85110']/text()|//span[@id='value_86004']/text()
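As a rough illustration of the id-based lookup, here is a Python sketch using only the standard library. It assumes the snippet has been made well-formed (the <root> wrapper and leading <p> are additions). ElementTree supports the [@id='...'] predicate but not text() or the | union, so each id is looked up separately; value_by_id is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# First two fields of the question's markup, wrapped in a hypothetical <root>
# element with the missing opening <p> restored.
xml_source = """<root>
<p>
<span id="label_85110"><b>First Name</b></span>
<br/>
<span id="value_85110">AWeber- Email Parser</span>
<br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
<span id="value_86004">Submission</span>
<br/>
</p>
</root>"""

root = ET.fromstring(xml_source)


def value_by_id(root, span_id):
    """Return the text of the <span> with the given id, or None if absent."""
    span = root.find(f".//span[@id='{span_id}']")
    return span.text if span is not None else None


first_name = value_by_id(root, "value_85110")
last_name = value_by_id(root, "value_86004")
print(first_name, "|", last_name)
```

This only works as long as the numeric ids stay stable between documents, which the question does not tell us.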
If You Can Modify The HTML
Ideally, my suggestion is to make your XHTML more semantic:
<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">Aweber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">leslie@dakno.com</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>
Enhance it with CSS (instead of using <b> and <br/> all over the place)...
label {
  font-weight: bolder;
  display: block;
  margin-top: 5px;
}
span {
  display: block;
  margin-bottom: 5px;
}
... and then use an XPath query like so:
//span[@class='first-name'] | //span[@class='last-name']
Disclaimer: This is just an answer to the problem, not an endorsement of using regex for this purpose.
<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)[\S\s]+?Phone[\S\s]+?<\/p>
Then just grab groups 1 and 2 for each match. I tested this with Firefox's JavaScript flavor of regex.
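For anyone who wants to try the pattern outside a browser, here is a sketch in Python. The pattern was only reported as tested in JavaScript; the assumption here is that Python's re module treats this particular syntax the same way. The html string is the question's snippet, and record and names are illustrative variable names.

```python
import re

# The markup from the question, exactly as posted (no opening <p>).
html = """<span id="label_85110"><b>First Name</b></span>
<br/>
<span id="value_85110">AWeber- Email Parser</span>
<br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
<span id="value_86004">Submission</span>
<br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
<span id="value_85111">leslie@dakno.com</span>
<br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
<span id="value_85540">919-923-7017</span>
<br/>
</p>"""

# Same pattern as above, split across two raw strings for readability.
record = re.compile(
    r"<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)(?:<[^>]+?>|\s)+?"
    r"Last Name(?:<[^>]+?>|\s)+([^<]*)[\S\s]+?Phone[\S\s]+?<\/p>"
)

# Group 1 is the first name, group 2 the last name, one tuple per record.
names = [(m.group(1), m.group(2)) for m in record.finditer(html)]
print(names)
```

Because the pattern consumes everything through the Phone paragraph, finditer yields one match per complete record if several are concatenated.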
From a philosophical standpoint XPath is probably a more robust solution if you have an XPath-capable HTML parser or if you are sure that you are working with valid XML, which what you posted is not (missing a document root node and an opening <p> tag at the beginning).
It depends a little on the syntax of your actual regex library or tool, but basically use something like this:
<span id="value_85110">([^<]+)</span>
Then you can access the first match group via some API.
Extract the last name in the same way.
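A minimal sketch of this per-field approach in Python, anchoring on the value_85110 and value_86004 ids from the question's HTML. It assumes the attribute order and quoting never change; span_value is a hypothetical helper name.

```python
import re

# First two fields of the markup from the question.
html = """<span id="label_85110"><b>First Name</b></span>
<br/>
<span id="value_85110">AWeber- Email Parser</span>
<br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
<span id="value_86004">Submission</span>
<br/>
</p>"""


def span_value(html, span_id):
    """Return the text inside the <span> with exactly this id, or None."""
    pattern = r'<span id="' + re.escape(span_id) + r'">([^<]+)</span>'
    match = re.search(pattern, html)
    return match.group(1) if match else None


first_name = span_value(html, "value_85110")
last_name = span_value(html, "value_86004")
print(first_name, "|", last_name)
```

Here match.group(1) plays the role of "the first match group via some API".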
Btw, some may argue: 'regex are the wrong tool for extracting data from HTML !!elf!1!'
Well, that is up to the poster. He is asking for a regular expression, and we don't know the details. Perhaps for his restricted use case everything else is overkill (e.g. a one-time analysis where the input data is guaranteed to always use the posted skeleton).