Perl regex using negative lookbehind? Can't seem to figure out how to do this properly
I'm trying to get this to work with Perl's regex but can't seem to figure it out. I want to grab any URL that has ".website." in it, except ones like this (with "en" preceding ".website."):
$linkhtml = 'http://en.search.website.com/?q=beach&' ;
This is an example of a URL that I would want the regex to match, while the one above should be rejected:
$linkhtml = ' http://exsample.website.com/?q=beach&' ;
Here is my attempt at it. Any advice on what I'm doing wrong is appreciated:
$re2='(?<!en)'; # Negative lookbehind: not immediately preceded by "en"
$re4='(.*)'; # Any number of characters
$re6='(\.)'; # Literal period
$re7='(website)'; # The word "website"
$re8='(\.)'; # Literal period
$re9='(.*)'; # Any number of characters
$re=$re4.$re2.$re6.$re7.$re8.$re9;
if ($linkhtml =~ /$re/)
Negative lookbehind assertions don't work well if the content you are trying to match after the assertion is so general that it would match the assertion itself. Consider:
perl -wle'print "en.website" =~ qr/(?<!en\.)web/' # doesn't match
perl -wle'print "en.website" =~ qr/(?<!en\.)[a-z]/' # does match: [a-z] matches the leading 'e', which is not preceded by 'en.'
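To see why the attempt in the question lets the "en" URL through, here is a small sketch. The pattern is the question's regex with the capture groups dropped, the sub name attempt_matches is made up for illustration, and the third URL is a hypothetical one added for contrast.

```perl
use strict;
use warnings;

# Same shape as the attempt in the question, minus the capture groups.
my $re = qr/.*(?<!en)\.website\..*/;

# Hypothetical helper: returns 1 if the question's pattern matches, else 0.
sub attempt_matches {
    my ($url) = @_;
    return $url =~ $re ? 1 : 0;
}

my @urls = (
    'http://en.search.website.com/?q=beach&',   # from the question (unwanted)
    'http://exsample.website.com/?q=beach&',    # from the question (wanted)
    'http://en.website.com/?q=beach&',          # hypothetical, for contrast
);

for my $url (@urls) {
    printf "%-42s %s\n", $url, attempt_matches($url) ? 'matches' : 'no match';
}
# The first two URLs match; only the third is rejected.
```

The lookbehind only inspects the two characters immediately before ".website.", and in "en.search.website" those are "ch", not "en", so the assertion passes.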
The best thing to do here is what David suggested: use two patterns to screen out the good and bad values:
my @matches = grep {
/$pattern1/ and not /$pattern2/
} @strings;
...where pattern1 matches all URLs, and pattern2 matches just the 'en' URLs.
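As a rough, runnable version of that grep: the two patterns below are my own guesses at what would fit the question, not something from the original post. In particular, $pattern2 assumes the unwanted URLs are the ones whose host begins with the "en." label.

```perl
use strict;
use warnings;

# Assumed patterns (not from the original answer):
my $pattern1 = qr{\.website\.};           # any URL containing ".website."
my $pattern2 = qr{^\s*https?://en\.}i;    # host starts with the "en." label

my @strings = (
    'http://en.search.website.com/?q=beach&',
    'http://exsample.website.com/?q=beach&',
);

# Keep strings that match the first pattern but not the second.
my @matches = grep {
    /$pattern1/ and not /$pattern2/
} @strings;

print "$_\n" for @matches;   # prints only the exsample URL
```

If "en" can appear as a later host label, $pattern2 would need to be loosened accordingly.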
I'd just do it in two steps: first use a generic regular expression to check for any URL (or rather, anything that looks like a URL). Then check each result that matches that against another regex that looks for "en" occurring in the host before wordpress, and discard anything that matches.
Here's the final solution, in case anyone new to regex (as I am) comes across this in the future with a similar problem. In my case I wrapped this in a for loop so it would go through an array, but that depends on your needs.
First, let's filter out the URLs that have "en" in them, as these aren't URLs we want:
$re1='(.*)'; # Any number of characters
$re2='(en)'; # Word 1
$re3='(.*)'; # Any number of characters
$re=$re1.$re2.$re3;
if ($linkhtml =~ /$re/)
{
#do nothing, as we don't want a link with "en" in it
}
else {
### find urls with ".website."
$re1='(.*)'; # Any number of characters
$re2='(\.)'; # period
$re3='(website)'; # Word 1
$re4='(\.)'; # period
$re5='(.*)'; # Any number of characters
$re=$re1.$re2.$re3.$re4.$re5;
if ($linkhtml =~ /$re/) {
#match to see if it is a link that has ".website." in it
## do something with the data as it matches, such as:
print "$linkhtml\n";
}
}
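For the array case, the same two checks can be sketched more compactly. The sub name wanted_links and the third sample URL are hypothetical, and /en/ is just a shorter way of writing the (.*)(en)(.*) test above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the links that pass both checks.
sub wanted_links {
    my @links = @_;
    my @wanted;
    for my $linkhtml (@links) {
        next if $linkhtml =~ /en/;        # skip anything containing "en"
        push @wanted, $linkhtml if $linkhtml =~ /\.website\./;
    }
    return @wanted;
}

my @found = wanted_links(
    'http://en.search.website.com/?q=beach&',
    'http://exsample.website.com/?q=beach&',
    'http://garden.website.com/',         # hypothetical URL
);
print "$_\n" for @found;   # prints only the exsample URL
```

Note that this filter drops any link with "en" anywhere in it, which is why the hypothetical "garden" URL is skipped too. If that is too broad, the first test could be narrowed to the host part.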