How to write regex to pull first and last name from source HTML?

2023-01-13 18:12

I've been pulling my hair out trying to come up with a regex that will pull the first and last name from the following HTML. My regex fu is not strong.

<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
    <span id="value_85111">leslie@dakno.com</span>
    <br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
    <span id="value_85540">919-923-7017</span>
    <br/>
</p>

@oliver1,

Please note that the keyword in Regular Expression is "Regular." Regular Expressions are used with Regular Languages.

Unfortunately, (X)HTML is not a Regular Language. Rather, it is a Context Free Language.

You cannot write a RegEx which can properly parse a Context Free Language; this is a mathematically proven reality.
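A small illustration of why nesting defeats a pattern, sketched in Python. The sample strings are made up for this example and are not taken from the question:

```python
import re

# Hypothetical inputs (not from the question): one nested pair, one pair of siblings.
nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

lazy = re.compile(r"<div>(.*?)</div>")
greedy = re.compile(r"<div>(.*)</div>")

# The lazy pattern stops at the first closing tag, cutting the outer element short.
lazy_on_nested = lazy.search(nested).group(1)

# The greedy pattern runs to the last closing tag, gluing two siblings together.
greedy_on_siblings = greedy.search(siblings).group(1)

print(repr(lazy_on_nested))      # 'outer <div>inner'
print(repr(greedy_on_siblings))  # 'a</div><div>b'
```

Neither quantifier can track how many elements are currently open, which is exactly what matching nested markup requires.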

The Solution: Use XPath

Instead, you should use an XML parser. You are already using XHTML, which means you could use XPath (although you're missing an opening <p> at the beginning of your code snippet).

How can any parser, RegEx or query identify the first names and last names? The best I see is "<span> elements which come after a <br/>", which is pretty weak.

You can nonetheless write an XPath query to find "<span> elements which come after a <br/>":

//br/following-sibling::span/text()

... but that also finds the values of Email and Phone, so you'll want only the first two results.
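Python's standard xml.etree.ElementTree supports only a subset of XPath and does not implement the following-sibling axis, so the sketch below emulates that query by hand. It assumes the snippet is wrapped in a made-up <root> element and given the missing opening <p>; the helper name spans_after_br is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, as posted (no opening <p>, no root element).
SNIPPET = """<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
    <span id="value_85111">leslie@dakno.com</span>
    <br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
    <span id="value_85540">919-923-7017</span>
    <br/>
</p>"""

# Assumption: prepend "<root><p>" and append "</root>" so the text is well-formed XML.
root = ET.fromstring("<root><p>" + SNIPPET + "</root>")


def spans_after_br(tree_root):
    """Hand-rolled stand-in for //br/following-sibling::span/text()."""
    values = []
    for parent in tree_root.iter():
        seen_br = False
        for child in parent:
            if child.tag == "br":
                seen_br = True
            elif child.tag == "span" and seen_br:
                values.append((child.text or "").strip())
    return values


all_values = spans_after_br(root)
first_name, last_name = all_values[:2]  # keep only the first two results
print(first_name, "|", last_name)
```

The slice at the end is the fragile part: it relies purely on the order of the fields in the page.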

Alternatively, you could use the id attributes on the <span> elements:

//span[@id='value_85110']/text()|//span[@id='value_86004']/text()
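As a rough sketch, the id-based lookup could be run from Python's standard library like this. ElementTree understands the [@id='...'] predicate but not text(), so the text is read from the element instead. The <root> wrapper, the added opening <p>, and the helper name value_by_id are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# First two fields of the snippet from the question, as posted.
SNIPPET = """<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>"""

# Assumption: supply the missing opening <p> and a root element before parsing.
root = ET.fromstring("<root><p>" + SNIPPET + "</root>")


def value_by_id(tree_root, span_id):
    """Return the stripped text of the <span> with the given id, or None."""
    node = tree_root.find(f".//span[@id='{span_id}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


first_name = value_by_id(root, "value_85110")
last_name = value_by_id(root, "value_86004")
print(first_name, "|", last_name)
```

This only works as long as the numeric ids stay the same from page to page, which the question does not guarantee.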

If You Can Modify The HTML

Ideally, my suggestion is to make your XHTML more semantic:

<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">AWeber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">leslie@dakno.com</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>

Enhance it with CSS (instead of using <b> and <br/> all over the place)...

label {
    font-weight:bolder;
    display:block;
    margin-top:5px;
}
span {
    display:block;
    margin-bottom:5px;
}

... and then use an XPath query like so:

//span[@class='first-name'] | //span[@class='last-name']
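With the semantic markup, the same kind of lookup needs no knowledge of ids or element order. A minimal sketch with Python's ElementTree, assuming the fragment is wrapped in a made-up <root> element; note that [@class='...'] is an exact match on the whole attribute value, just as in the XPath query:

```python
import xml.etree.ElementTree as ET

# The semantic markup suggested in the answer.
SEMANTIC = """<label for="first-name-1">First Name</label>
<span id="first-name-1" class="first-name">AWeber- Email Parser</span>
<label for="last-name-1">Last Name</label>
<span id="last-name-1" class="last-name">Submission</span>
<label for="email-address-1">Email</label>
<span id="email-address-1" class="email-address">leslie@dakno.com</span>
<label for="phone-number-1">Phone</label>
<span id="phone-number-1" class="phone-number">919-923-7017</span>"""

# Assumption: wrap the fragment in a root element so it parses as XML.
root = ET.fromstring("<root>" + SEMANTIC + "</root>")

# One lookup per class name; findtext returns None if no element matches.
names = {
    cls: root.findtext(f".//span[@class='{cls}']")
    for cls in ("first-name", "last-name")
}
print(names)
```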

Disclaimer: This is just an answer to the problem, not an endorsement of using regex for this purpose.

<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)[\S\s]+?Phone[\S\s]+?<\/p>

Then just grab groups 1 and 2 for each match. Tested this with Firefox's JavaScript flavor of regex.
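The same pattern can also be tried from Python; a minimal sketch, assuming the input is the exact snippet from the question. The variable names are made up for this example.

```python
import re

# The snippet from the question, as posted.
SNIPPET = """<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>
<p>
<span id="label_85111"><b>Email</b></span>
<br/>
    <span id="value_85111">leslie@dakno.com</span>
    <br/>
</p>
<p>
<span id="label_85540"><b>Phone</b></span>
<br/>
    <span id="value_85540">919-923-7017</span>
    <br/>
</p>"""

# The pattern from the answer, unchanged; group 1 is the first name, group 2 the last.
RECORD = re.compile(
    r"<span[^>]*?><b>First Name(?:<[^>]+?>|\s)+([^<]*?)"
    r"(?:<[^>]+?>|\s)+?Last Name(?:<[^>]+?>|\s)+([^<]*)"
    r"[\S\s]+?Phone[\S\s]+?<\/p>"
)

# finditer yields one match per record if the page contains several.
records = [(m.group(1), m.group(2)) for m in RECORD.finditer(SNIPPET)]
print(records)
```

The pattern depends on the fields appearing in the order First Name, Last Name, Phone, so any change to the page skeleton will silently stop it from matching.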

From a philosophical standpoint, XPath is probably a more robust solution if you have an XPath-capable HTML parser or if you are sure that you are working with valid XML, which what you posted is not (it is missing a document root node and an opening tag at the beginning).

It depends a little on the syntax of your actual regex library or tool, but basically use something like this:

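<span id="value_85110">([^<]+)</span>

Then you can access the first match group via some API. Note that the pattern targets the value_ span; matching the label_ span would capture the label text "First Name" rather than the name itself.

Extract the last name the same way, using the id value_86004.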
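A short sketch of this approach in Python, applied to the value spans from the question (ids value_85110 and value_86004). The helper name value_for is hypothetical, and the sketch assumes the ids are fixed and the span contains no nested tags.

```python
import re

# First two fields of the snippet from the question, as posted.
SNIPPET = """<span id="label_85110"><b>First Name</b></span>
<br/>
    <span id="value_85110">AWeber- Email Parser</span>
    <br/>
</p>
<p>
<span id="label_86004"><b>Last Name</b></span>
<br/>
    <span id="value_86004">Submission</span>
    <br/>
</p>"""


def value_for(html, span_id):
    """Capture the text of <span id="..."> up to the next '<', or None if absent."""
    pattern = r'<span id="' + re.escape(span_id) + r'">([^<]+)</span>'
    match = re.search(pattern, html)
    return match.group(1).strip() if match else None


first_name = value_for(SNIPPET, "value_85110")
last_name = value_for(SNIPPET, "value_86004")
print(first_name, "|", last_name)
```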

Btw, some may argue: 'regex are the wrong tool for extracting data from HTML !!elf!1!'

Well, that is up to the poster. He is asking for a regular expression. And we don't know the details. Perhaps for his restricted use case everything else is overkill. (e.g. one time analysis and it is guaranteed that input data always uses the posted skeleton etc.)
