Turn a set of URLs into a regex pattern (optional patterns)

Using an arbitrary set of URLs (e.g. http://api.longurl.org/v2/services), what is the best way to turn this list into a regex?

Is this an appropriate regex?

(((easyuri|eepurl|eweri)\.com)|((migre|mke|myloc)\.me)|etc...)

Can you do multiple levels of optional patterns like that?


I see several ways to accomplish this:

  1. Use XPath and try to select a node given the current URL.
  2. Parse the XML into a dictionary and test whether your current URL exists as a key.
  3. Store the domains from the XML in a database, index the URL field, and query for your current URL.
  4. If performance is not an issue: match the current URL against the entire XML file as text.
  5. Perhaps there are more ideas.

Building a regex from the XML does not seem like a good idea to me, since all the other solutions appear far easier to develop.
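
As a rough illustration of option 2, here is a minimal sketch in PHP. It assumes the XML has already been converted to a plain list of domains; the sample list and the isShortener() helper are hypothetical names made up for this example, not part of any library.

```php
<?php
// Hypothetical sample; in practice this list would come from the parsed XML.
$urlShorteners = array('0rz.tw', '1link.in', 'migre.me');

// Flip the list so each domain becomes an array key (hash lookup).
$shortenerSet = array_flip($urlShorteners);

// isShortener() is an illustrative helper, not a library function.
function isShortener($url, array $shortenerSet)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component could be extracted
    }
    return isset($shortenerSet[strtolower($host)]);
}

var_dump(isShortener('http://migre.me/abc', $shortenerSet));    // bool(true)
var_dump(isShortener('http://example.com/abc', $shortenerSet)); // bool(false)
```

Note that parse_url() only reports a host when the URL includes a scheme (or starts with //), so bare strings such as migre.me/abc would need a scheme prepended first.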


OP'S ANSWER:

Well, it turns out that this does work:

/((?:easyuri|eepurl|eweri)\.com)|((?:migre|mke|myloc)\.me)/

Run against this:

easyuri.com eepurl.comer eweri.us migre.me mke.memo myloc.em

You get this:

    [0] => Array
    (
        [0] => easyuri.com
        [1] => eepurl.com
        [2] => migre.me
        [3] => mke.me
    )
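
Note that the pattern is unanchored, so it also matches inside longer names: eepurl.com is found within eepurl.comer, and mke.me within mke.memo. If that is not wanted, one possible tightening (a sketch, not the only way to do it) is to wrap the alternation in lookarounds that reject a neighbouring word character or hyphen:

```php
<?php
// Same alternation as above, wrapped in lookarounds so a match cannot start
// or end in the middle of a longer hostname label.
$pattern = '/(?<![\w-])(?:(?:easyuri|eepurl|eweri)\.com|(?:migre|mke|myloc)\.me)(?![\w-])/';
$text = 'easyuri.com eepurl.comer eweri.us migre.me mke.memo myloc.em';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // easyuri.com and migre.me only
```

This still accepts a trailing dot, so a host such as migre.me.example.com would match on its migre.me prefix; extracting the host first and comparing it whole avoids that.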

But the easiest way would just be something like this:

/0rz\.tw|1link\.in|1url\.com|2\.gp|2big\.at|etc\.\.\./
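
A flat pattern like that could be built roughly as follows. This is a sketch that assumes the domain list is already in a PHP array; the sample values are just the ones shown above.

```php
<?php
// Sample list; the real one would come from the services XML.
$urlShorteners = array('0rz.tw', '1link.in', '1url.com', '2.gp', '2big.at');

// Escape regex metacharacters (notably the dots) and the '/' delimiter.
$escaped = array_map(function ($domain) {
    return preg_quote($domain, '/');
}, $urlShorteners);

$flatRegex = '/' . implode('|', $escaped) . '/';
echo $flatRegex . "\n"; // /0rz\.tw|1link\.in|1url\.com|2\.gp|2big\.at/
```

Using preg_quote() avoids hand-escaping each dot and keeps the pattern valid if a domain ever contains another special character.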

Regex helps you complicate things more than is possible with other methods. ;P

Here's the PHP I eventually used to create the regex:

This assumes that you have already fetched http://api.longurl.org/v2/services with cURL and converted the XML to an array called $urlShorteners, like: $urlShorteners = array('0rz.tw', '1link.in', 'etc...');

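// Split each domain into labels and reverse them so the TLD comes first,
// e.g. '0rz.tw' becomes array('tw', '0rz').
$urls = array();
foreach ($urlShorteners as $url) {
    $urls[] = array_reverse(explode('.', $url));
}

// Group the remaining labels of each domain under its TLD.
$tldKeys = array();
foreach ($urls as $url) {
    $tldKeys[array_shift($url)][] = $url;
}

// Build one alternative per TLD, e.g. ((?:easyuri|eepurl|eweri)\.com).
$optionPattern = array();
foreach ($tldKeys as $tld => $doms) {
    if ($tld !== '') {
        $subPattern = array();
        foreach ($doms as $subDomain) {
            // Restore the original label order and escape the dots between labels.
            $subPattern[] = implode('\.', array_reverse($subDomain));
        }
        if (count($subPattern) > 1) {
            $optionPattern[] = '((?:' . implode('|', $subPattern) . ')\.' . $tld . ')';
        } else {
            $optionPattern[] = '(' . $subPattern[0] . '\.' . $tld . ')';
        }
    }
}
$regex = '/' . implode('|', $optionPattern) . '/';
echo $regex . "\n";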