Not Another Parse-HTML-With-Regex Question

I've read a few questions on here re parsing HTML with regex, and I understand that this is, on the whole, a terrible idea.

Having said this, I have a very specific problem that I think regex might be the answer to. I've been fumbling around trying to work out the answer, but I'm new (today) to regex, and I was hoping some kind-hearted person might be able to help me out.

I have an array of strings that always follow the format

STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE

What I'm hoping to achieve is to be left with just the 'somewhere' and the 'name of thing', so that I can output just <a href="somewhere">name of thing</a>.

The array of strings comes from an RSS feed of links on my Facebook profile, if you happen to be interested.

Many, many thanks for any help.

Jack


I understand completely where you're coming from on the pragmatism scale.

However, PHP does have a very nice, straightforward HTML parser, and it seems simple enough to get working that I'd be remiss not to recommend it.
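
For what it's worth, here is a minimal sketch of what that could look like. I'm assuming the parser in question is PHP's bundled DOM extension (DOMDocument); the function name extractLink is just something I made up for illustration.

```php
<?php
// Assumption: "HTML parser" means PHP's DOM extension (DOMDocument).
// extractLink() is a hypothetical helper name, not part of any library.
function extractLink(string $html): ?array
{
    $doc = new DOMDocument();

    // The input is only a fragment, so keep libxml's warnings quiet.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Take the first <a> element, if there is one.
    $anchor = $doc->getElementsByTagName('a')->item(0);
    if ($anchor === null) {
        return null;
    }

    return [$anchor->getAttribute('href'), $anchor->textContent];
}

$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
$link = extractLink($str);
if ($link !== null) {
    // Rebuild the anchor with only the href attribute.
    echo '<a href="' . htmlspecialchars($link[0]) . '">' . htmlspecialchars($link[1]) . '</a>';
}
```

This doesn't care about the order of the attributes or the surrounding text, which is the main reason to prefer it over a pattern.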


I don't know PHP, but you can use the following (extremely brittle) regex:

<a href="(.+?)" title=".+?" target="_blank">(.+?)</a>

This will capture the URL and the text of the link.

If you want to be somewhat more flexible, you could allow any attributes, like this:

<a .*?href="(.+?)".*?>(.+?)</a>
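
In PHP these patterns would need delimiters. A rough sketch, assuming preg_match and ~ as the delimiter (the helper captureLink and the second sample string are my own additions, just to show where the strict pattern gives up):

```php
<?php
// Both patterns from above, wrapped in ~ delimiters for preg_match.
$strict   = '~<a href="(.+?)" title=".+?" target="_blank">(.+?)</a>~';
$flexible = '~<a .*?href="(.+?)".*?>(.+?)</a>~';

// Hypothetical helper: returns [url, text], or null if the pattern does not match.
function captureLink(string $pattern, string $subject): ?array
{
    return preg_match($pattern, $subject, $m) === 1 ? [$m[1], $m[2]] : null;
}

// The format from the question.
$full = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
// An invented variant with the attributes in a different order.
$reordered = 'STUFF HERE<a target="_blank" href="somewhere">name of thing</a>STUFF HERE';

var_dump(captureLink($strict, $full));        // matches
var_dump(captureLink($strict, $reordered));   // null: the strict pattern needs the exact attribute order
var_dump(captureLink($flexible, $reordered)); // matches
```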


$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
// Group 1 captures the href value; group 2 captures the link text.
$success = preg_match('/.*href=\"([^\"]+)\".*>([^<]+)<.*/i', $str, $matches);
if ($success) {
    echo $matches[1]; // the href value
    echo $matches[2]; // the link text
} else {
    echo "Parsing failed.";
}

The parenthesized groups isolate portions of the match for the $matches array. If the pattern matches the string at all, then $matches[1] will contain your href and $matches[2] will contain your link text.

Inside the parentheses, I'm defining the meat of the segments you're interested in with negated character classes. The first one is [^\"]+, which is one or more of any character except a double quote. The latter is [^<]+, which is one or more of any character except a less-than sign. This ensures that, if the markup is consistently in the format you provided, you have well-defined boundaries on either side of the portions you're interested in.
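
To make the point about boundaries concrete, here is a small comparison. It is only a sketch: the variable names are mine, and the greedy pattern is a deliberately bad alternative I'm including for contrast.

```php
<?php
$tag = '<a href="somewhere" title="something" target="_blank">';

// Greedy dot: .+ runs on to the LAST double quote in the string.
preg_match('/href="(.+)"/', $tag, $greedy);

// Negated class: [^"]+ cannot cross a double quote, so it stops at the first one.
preg_match('/href="([^"]+)"/', $tag, $bounded);

echo $greedy[1], "\n";  // somewhere" title="something" target="_blank
echo $bounded[1], "\n"; // somewhere
```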


SLaks' regex may have some problems with links that have no attributes other than href. Here is my take:

~<a.+?href="(.+?)".*?>(.+?)</a>~i
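
Since the question mentions an array of strings, this is roughly how the pattern could be applied to each item. Treat it as a sketch: simplifyLinks is a name I made up, and the sample items other than the first are invented.

```php
<?php
// Hypothetical helper: rewrites each item to a bare <a href="...">text</a>,
// skipping items in which the pattern finds no link.
function simplifyLinks(array $items): array
{
    $pattern = '~<a.+?href="(.+?)".*?>(.+?)</a>~i';
    $out = [];
    foreach ($items as $item) {
        if (preg_match($pattern, $item, $m)) {
            // $m[1] is the URL, $m[2] is the link text.
            $out[] = '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        }
    }
    return $out;
}

$items = [
    'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE',
    'STUFF HERE<a href="elsewhere">other thing</a>STUFF HERE',
    'no link in this one',
];

print_r(simplifyLinks($items));
```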


I've tested with my own Facebook feed and could load it with SimpleXML. Well, partly. The RSS feed cannot be loaded directly, but if you fetch the feed with MagPie first, you can then load the description element with SimpleXML like this:

$xml = simplexml_load_string($description); // load description
$link = $xml->xpath('//a');                 // find all links inside
$href = (string) $link[0]['href'];          // get URL
$text = (string) $link[0];                  // and link text

As long as Facebook does not break the HTML inside the description, it is safe to use SimpleXML. If they break it, SimpleXML will complain.
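
If you want to handle that complaint yourself rather than have warnings printed, something along these lines might do. It is a sketch under assumptions: firstLink is a made-up name, and the sample descriptions are invented, well-formed (or deliberately broken) fragments with a single root element.

```php
<?php
// Hypothetical helper: returns the first link's href and text,
// or null if the description is not well-formed or contains no link.
function firstLink(string $description): ?array
{
    // Collect parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($description);
    if ($xml === false) {
        libxml_clear_errors();
        return null;
    }

    $links = $xml->xpath('//a');
    if (empty($links)) {
        return null;
    }

    return [
        'href' => (string) $links[0]['href'],
        'text' => (string) $links[0],
    ];
}

// Invented sample data: one well-formed description, one with a missing </a>.
$good   = '<div>STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE</div>';
$broken = '<div><a href="somewhere">name of thing</div>';

var_dump(firstLink($good));
var_dump(firstLink($broken)); // NULL
```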
