Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:

<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>

I want only the anchor text for the link : "i_want_this"

"variable_text" varies according to the filename so I need to ignore that.

I am using this regex:

<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>

Of course, this matches the complete link rather than just the anchor text.


PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to learn a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.

For your use, know that regex quantifiers are greedy by default. That means that when you specify that you want something, followed by anything (any number of repetitions), followed by something else, the "anything" part will match as much as it can. It stops at the last place where that second something occurs, not the first.
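To make the greedy behaviour concrete, here is a small sketch. It is in Python rather than PHP purely for illustration (the quantifier syntax is the same), and the two-link input string is made up for the demo:

```python
import re

# Hypothetical input with two links, so greedy and lazy matching differ.
html = '<a href="/a">first</a> <a href="/b">second</a>'

# Greedy: each .* takes as much as it can, so the match stretches
# across both links and the capture ends up being the last anchor text.
greedy = re.search(r'<a href=".*">(.*)</a>', html)

# Lazy: each .*? stops at the earliest position that still lets the
# rest of the pattern match, so the capture is the first anchor text.
lazy = re.search(r'<a href=".*?">(.*?)</a>', html)

greedy_text = greedy.group(1)
lazy_text = lazy.group(1)
print(greedy_text)  # second
print(lazy_text)    # first
```

With only one link in the input both versions capture the same text, which is why the greedy pattern can look correct until a second link shows up.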

To be more clear, what you want is this:

  1. <a href="
  2. any character, any number of times (regex = .* )
  3. ">
  4. any character, any number of times (regex = .* )
  5. </a>

Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups: anything inside parentheses is captured as a group that you can reference later (a reference to a group from inside the pattern itself is called a back reference).

I would also look into named subpatterns. With those, you can reference your capture with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.

So all that being said, here's the "lazy web" for your regex:

<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
// [^"]* and .*? keep the match inside the first link if the input contains several
$regex = '/(<a href="[^"]*">)(?P<target>.*?)(<\/a>)/';
preg_match($regex, $str, $matches);

print $matches['target'];
?>

//This should output "i_want_this"

Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. SimpleXML requires well-formed XML (or XHTML), so this only works if the tags we see here are snippets of a larger document that is itself well-formed.
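As a rough sketch of the parser route: the snippet from the question happens to be well-formed XML, so a strict XML parser can read it directly. This uses Python's standard xml.etree.ElementTree as a stand-in for SimpleXML (it is not SimpleXML itself), and it assumes the real page is equally well-formed:

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it is well-formed, so a strict parser accepts it.
snippet = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'

# fromstring() raises xml.etree.ElementTree.ParseError on malformed input.
root = ET.fromstring(snippet)

# root is the <h3> element; find() returns its first direct <a> child.
link = root.find("a")

anchor_text = link.text
href = link.get("href")
print(anchor_text)  # i_want_this
print(href)         # /en/browse/file/variable_text
```

No pattern is needed for the href at all here, so it does not matter what the variable part of the URL is.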


I'm sure someone will have a more elegant solution, but I think this will do what you want.

Where:

$subject = "<h3><b>File</b> : <a href=\"/en/browse/file/variable_text\">i_want_this</a></h3>";

Option 1:

$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);

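Option 2 (POSIX ereg; note that ereg() was deprecated in PHP 5.3 and removed in PHP 7, so prefer Option 1 on current versions):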

$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);


Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
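For what the parser approach can look like, here is a minimal sketch. Since the question does not name a language, it uses Python; the standard library there ships an event-based HTML parser rather than a full DOM, so that is what is shown. The class name AnchorTextCollector is made up for this example:

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text found inside each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty entry whenever an <a> tag opens.
        if tag == "a":
            self.in_anchor = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside an <a> element.
        if self.in_anchor:
            self.texts[-1] += data


html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'
collector = AnchorTextCollector()
collector.feed(html)
collector.close()

anchor_texts = collector.texts
print(anchor_texts)  # ['i_want_this']
```

Unlike the regex, this does not care what the href looks like or whether the tag carries extra attributes.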

Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.

$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);


The thing to remember is that a regex match returns everything the pattern matched, not just the part you are interested in. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'

If you put what you want in parentheses, you can reference it by its group index:

string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'

Perl will have you use $1 instead of [1] like this:

$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";

Hope that helps.


I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.

#<a href="/en/browse/file/.+?">(.*?)</a>#

I used # as a delimiter because it means the forward slashes don't need escaping, which makes the pattern clearer. It'll also help if you put the pattern in single quotes instead of double quotes so you don't have to escape anything at all.

If you want to limit the variable part to digits instead, you can use:

#<a href="/en/browse/file/[0-9]+">(.*?)</a>#

If it should have exactly 5 digits:

#<a href="/en/browse/file/[0-9]{5}">(.*?)</a>#

If it should have between 3 and 6 digits:

#<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>#

If it should have at least 2 digits:

#<a href="/en/browse/file/[0-9]{2,}">(.*?)</a>#
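A quick way to check what each quantifier accepts is to run the patterns against a few inputs. This is only a sketch in Python (the {n}, {n,m} and {n,} syntax is the same as in PCRE); the helper name anchor_text and the sample links are invented for the test:

```python
import re


def anchor_text(pattern, html):
    """Return the first capture group, or None if the pattern does not match."""
    m = re.search(pattern, html)
    return m.group(1) if m else None


# Hypothetical link template; {} is replaced with the variable part of the URL.
link = '<a href="/en/browse/file/{}">name</a>'

exactly_five = r'<a href="/en/browse/file/[0-9]{5}">(.*?)</a>'
three_to_six = r'<a href="/en/browse/file/[0-9]{3,6}">(.*?)</a>'
at_least_two = r'<a href="/en/browse/file/[0-9]{2,}">(.*?)</a>'

print(anchor_text(exactly_five, link.format("12345")))   # name
print(anchor_text(exactly_five, link.format("123456")))  # None
print(anchor_text(three_to_six, link.format("12")))      # None
print(anchor_text(at_least_two, link.format("12")))      # name
print(anchor_text(at_least_two, link.format("1")))       # None
```

Note that {2,} accepts two digits, so it means "two or more", not "more than two".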


This should work:

<a href="[^"]*">([^<]*)

This says: take EVERYTHING you find until you meet "

[^"]*

Same idea here: take everything until you meet <

[^<]*

The parentheses around [^<]*

([^<]*)

group it, so you can collect that data in PHP! If you look up preg_match in the PHP manual, you will see many fine examples there!

Good luck!

And for your concrete example:

<a href="/en/browse/file/variable_text">([^<]*)

I use

[^<]* 

because in some examples...

.*? 

can be extremely slow! You shouldn't use that if you can use

[^<]*
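Here is a small sketch comparing the two styles on the link from the question, written in Python just for illustration. It does not measure speed. It only shows that both capture the same text for this input, plus one caveat of the negated class; the nested-tag string is a made-up example:

```python
import re

html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>'

lazy_pattern = r'<a href=".*?">(.*?)</a>'
negated_pattern = r'<a href="[^"]*">([^<]*)'

# For the link in the question, both patterns capture the same anchor text.
lazy_text = re.search(lazy_pattern, html).group(1)
negated_text = re.search(negated_pattern, html).group(1)
print(lazy_text)     # i_want_this
print(negated_text)  # i_want_this

# Caveat: [^<]* stops at the first "<", so markup inside the link cuts it short.
nested = '<a href="/x">some <b>bold</b> text</a>'
nested_negated = re.search(negated_pattern, nested).group(1)
nested_lazy = re.search(lazy_pattern, nested).group(1)
print(repr(nested_negated))  # 'some '
print(repr(nested_lazy))     # 'some <b>bold</b> text'
```

So the negated class is a fine choice when the anchor text is known to be plain text, as it is in the question.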


You should use the tool Expresso for creating regular expressions. It's pretty handy: http://www.ultrapico.com/Expresso.htm
