Not Another Parse-HTML-With-Regex Question

2022-12-17 12:57

I've read a few questions on here re parsing HTML with regex, and I understand that this is, on the whole, a terrible idea.

Having said this, I have a very specific problem that I think regex might be the answer to. I've been fumbling around trying to work out the answer, but I'm new (as of today) to regex, and I was hoping some kind-hearted person might be able to help me out.

I have an array of strings that always follow the format

STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE

What I'm hoping to achieve is to be left with just the 'somewhere' and the 'name of thing', so that I can output just <a href="somewhere">name of thing</a>.

The array of strings comes from an RSS feed of links on my Facebook profile, if you happen to be interested.

Many, many thanks for any help.

Jack

I understand completely where you're coming from on the pragmatism scale.

However, PHP does have a very nice, straightforward HTML parser, and it seems simple enough to get working that I'd hesitate not to recommend it.
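The answer does not name the parser, so the following is only a sketch under the assumption that PHP's built-in DOMDocument is meant. The helper name simplify_link() is hypothetical, and the encoding prefix is a common workaround for UTF-8 input rather than anything from the original post.

```php
<?php
// Sketch only: simplify_link() is a hypothetical helper, not part of the original answer.
function simplify_link(string $html): ?string
{
    $doc = new DOMDocument();
    // Feed markup is rarely a complete HTML page; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The encoding prefix is an assumption that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Take the first <a> element, if there is one.
    $anchor = $doc->getElementsByTagName('a')->item(0);
    if ($anchor === null) {
        return null;
    }

    // Re-escape the values before rebuilding the reduced link.
    $href = htmlspecialchars($anchor->getAttribute('href'), ENT_QUOTES);
    $text = htmlspecialchars($anchor->textContent, ENT_QUOTES);
    return '<a href="' . $href . '">' . $text . '</a>';
}

echo simplify_link('STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE'), PHP_EOL;
```

For the sample string from the question this should print <a href="somewhere">name of thing</a>; a string without any link returns null.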

I don't know PHP, but you can use the following (extremely brittle) regex:

<a href="(.+?)" title=".+?" target="_blank">(.+?)</a>

This will capture the URL and the text of the link.

If you want to be somewhat more flexible, you could allow any attributes, like this:

<a .*?href="(.+?)".*?>(.+?)</a>
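Since the answer gives only the bare pattern, here is a rough sketch of how it might be used from PHP. The helper name extract_link() and the choice of ~ as the delimiter (so that the / in </a> needs no escaping) are assumptions for illustration.

```php
<?php
// Sketch only: extract_link() is a hypothetical helper wrapping the more flexible pattern.
function extract_link(string $html): ?array
{
    $pattern = '~<a .*?href="(.+?)".*?>(.+?)</a>~';
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    return ['href' => $m[1], 'text' => $m[2]];
}

$link = extract_link('STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE');
if ($link !== null) {
    echo '<a href="' . $link['href'] . '">' . $link['text'] . '</a>', PHP_EOL;
}
```

With the sample string this prints <a href="somewhere">name of thing</a>. The pattern still assumes double-quoted attributes and no nested tags inside the link text.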

$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
// preg_match() returns 1 on a match, 0 on no match, and false on error.
$success = preg_match('/.*href=\"([^\"]+)\".*>([^<]+)<.*/i', $str, $matches);
if ($success) {
    echo $matches[1], PHP_EOL; // the href value
    echo $matches[2], PHP_EOL; // the link text
} else {
    echo "Parsing failed.";
}

The parenthesized groups capture portions of the match into the $matches array. If the pattern matches the string at all, then $matches[1] contains your href and $matches[2] contains your link text.

Inside the parentheses, I'm defining the segments you're interested in with negated character classes. The first one is [^\"]+, which is one or more of any character except a double quote. The second is [^<]+, which is one or more of any character except a less-than sign. This ensures that, if the markup is consistently in the format you provided, you have well-defined boundaries on either side of the portions you're interested in.

SLaks's regex may have some problems with links that have no attributes other than href. Here is my take:

~<a.+?href="(.+?)".*?>(.+?)</a>~i
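The question mentions an array of strings, so one possible way to apply this pattern to the whole array is sketched below. The helper name simplify_links() and the sample data are made up for illustration, and strings that do not match are simply skipped.

```php
<?php
// Sketch only: simplify_links() is a hypothetical helper, not part of the original answer.
function simplify_links(array $items): array
{
    $pattern = '~<a.+?href="(.+?)".*?>(.+?)</a>~i';
    $links = [];
    foreach ($items as $item) {
        // Keep only strings in which the pattern finds a link.
        if (preg_match($pattern, $item, $m) === 1) {
            $links[] = '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        }
    }
    return $links;
}

// Sample data in the format described in the question.
$items = [
    'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE',
    'a string without any link',
];
print_r(simplify_links($items));
```

Only the first link in each string is used, which matches the one-link-per-string format described in the question.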

I've tested with my own Facebook feed and could load it with SimpleXML. Well, partly. The RSS feed cannot be loaded directly, but if you fetch the feed with MagPie first, you can then load the description element with SimpleXML like this:

$xml = simplexml_load_string($description); // load description
$link = $xml->xpath('//a');                 // find all links inside
$href = (string) $link[0]['href'];          // get URL
$text = (string) $link[0];                  // and link text

As long as Facebook does not break the HTML inside the description, it is safe to use SimpleXML. If they break it, SimpleXML will complain.
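To handle the case where the markup is broken without a flood of warnings, something along these lines might work. This is a sketch, not the code used against the real feed: the helper name first_link() is hypothetical, and wrapping the description in a <root> element is an added assumption so that text before and after the link still forms a single XML document.

```php
<?php
// Sketch only: first_link() is a hypothetical helper around the SimpleXML approach.
function first_link(string $description): ?array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // Assumption: wrap in a dummy root so surrounding text does not break well-formedness.
    $xml = simplexml_load_string('<root>' . $description . '</root>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        return null; // the markup is not well-formed XML
    }

    $links = $xml->xpath('//a');
    if (empty($links)) {
        return null; // no link in the description
    }
    return [
        'href' => (string) $links[0]['href'],
        'text' => (string) $links[0],
    ];
}

var_dump(first_link('STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE'));
var_dump(first_link('<a href="somewhere">never closed'));
```

The second call returns null because the unclosed tag makes the input invalid XML. HTML-only entities such as &nbsp; would fail the same way, which is the trade-off of using an XML parser on HTML.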
