
What does this Regex string mean?

I'm trying to debug some PHP, but I'm not so hot on regex. Can someone please translate this for me (if it even is regex)?

public static function fetch($number)
    {
        $number = str_replace(" ", "", $number);
        $html = file_get_contents('http://w2.brreg.no/enhet/sok/detalj.jsp?orgnr=' . $number);
        preg_match_all('/\<td style="width.*\<b\>(.*)[: ]*\<\/b\>/msU', $html, $keys);
        preg_match_all('/\<\/b\>.*\<td.*\>(.*)\<\/td\>/msU', $html, $values);

        if (!$keys[1])
        {
            return null;
        }

Kept the PHP snippet for context, if it helps :D Thanks :)


I'm only translating the first one; the second one is similar.

/                  # regex delimiter
\<td style="width  # match <td style="width  (unnecessary escaping of < !)
.*                 # match anything (as few characters as possible, see below)
\<b\>              # match <b> (again, unnecessary escaping!)
(.*)               # match anything (lazily) and capture it
[: ]*              # match any number of colons or spaces
\<\/b\>            # match </b>
/msU               # regex delimiter; multiline option (unnecessary, since the
                   # pattern has no ^ or $ anchors), dot-all option (dot matches
                   # newline) and ungreedy option (quantifiers are lazy by default).

EDIT: U is not the Unicode option, but the ungreedy option. My mistake. The regex isn't that bad after all :)
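
To see what the U modifier does in isolation, here is a minimal sketch. The input string is made up for illustration and is not taken from the real page.

```php
<?php
// Hypothetical sample input, not taken from the real page.
$html = '<b>one</b><b>two</b>';

// Default behaviour: .* is greedy and runs to the last </b>.
preg_match('/<b>(.*)<\/b>/', $html, $greedy);

// With the U modifier, .* becomes lazy and stops at the first </b>.
preg_match('/<b>(.*)<\/b>/U', $html, $ungreedy);

// Same effect without U, using an explicit lazy quantifier.
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);

echo $greedy[1], PHP_EOL;   // one</b><b>two
echo $ungreedy[1], PHP_EOL; // one
echo $lazy[1], PHP_EOL;     // one
```

So `.*` under U and `.*?` without U behave the same way here.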

I'd suggest using these regexes instead:

/<td style="width.*?<b>(.*?)[: ]*<\/b>/s
/<\/b>.*?<td.*?>(.*?)<\/td>/s
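
A quick way to check that the rewritten key pattern captures the same text as the original is to run both against the same input. This is only a sketch: the sample row is an assumption about what the page looks like, not the real markup.

```php
<?php
// Hypothetical sample row; the real page layout is an assumption.
$html = <<<'HTML'
<tr>
  <td style="width: 30%"><b>Name: </b></td>
  <td>Example AS</td>
</tr>
HTML;

$original  = '/\<td style="width.*\<b\>(.*)[: ]*\<\/b\>/msU';
$suggested = '/<td style="width.*?<b>(.*?)[: ]*<\/b>/s';

preg_match_all($original, $html, $a);
preg_match_all($suggested, $html, $b);

// Both patterns should capture the key without the trailing ": ".
var_dump($a[1] === $b[1]);
print_r($b[1]);
```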


More or less, it returns the {extracted} part from <td style="width ..."><b>{extracted}: </b>


To help understand regular expressions I recommend downloading Expresso (for Windows) which is a free (but registration required) expression parser and testing tool.


I believe it's trying to match the following structure:

<td width=.....><b>key:</b></td><td>value</td>

It parses the string twice: once for the keys, which are taken from the first column, and a second time for the values, which are taken from the second column.
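
The two passes can be sketched like this, using the two patterns from the question (the second one ending in `\<\/td\>`). The table rows are made up for illustration, and pairing the results with array_combine is my own addition, not something the original code is known to do.

```php
<?php
// Hypothetical two-column table; the real page markup is an assumption.
$html = <<<'HTML'
<tr>
  <td style="width: 30%"><b>Name: </b></td>
  <td>Example AS</td>
</tr>
<tr>
  <td style="width: 30%"><b>City :</b></td>
  <td>Oslo</td>
</tr>
HTML;

// Pass 1: text inside <b>...</b> in the first column (the keys).
preg_match_all('/\<td style="width.*\<b\>(.*)[: ]*\<\/b\>/msU', $html, $keys);

// Pass 2: text of the <td> that follows each </b> (the values).
preg_match_all('/\<\/b\>.*\<td.*\>(.*)\<\/td\>/msU', $html, $values);

// Pair them up; this fails if the two lists differ in length.
$pairs = array_combine($keys[1], $values[1]);
print_r($pairs);
```

Note that the pairing only works while both passes find the same number of matches, which is one reason this approach is fragile.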


If you want some advice: your regex may not work as expected. In your case, it's better to use XPath.

See this snippet:

$str = "
<html>
    <body>
        <table>
        <tr>
            <td style='width:500px'><b>foo : </b> bar</td>
            <td style='width:200;vertical-align:'><b>baz :</b> qux</td>
        </tr>
        </table>
    </body>
</html>
";

$xml = simplexml_load_string($str);

$results = array();
foreach($xml->xpath('//td[@style][b]') as $row) {
    $value = trim(sprintf("%s", $row));
    $key = trim((string)$row->b, ' :');
    $results[$key] = $value;
}

var_dump($results);

This prints:

array(2) {
  ["foo"]=>
  string(3) "bar"
  ["baz"]=>
  string(3) "qux"
}
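
One caveat: simplexml_load_string needs well-formed XML, and real pages often aren't (an unclosed `<br>` is enough to break it). A possible variant, assuming the same made-up markup as above plus a `<br>`, is to parse with DOMDocument::loadHTML and query with DOMXPath. This is a sketch, not a tested scraper for the actual site.

```php
<?php
// Hypothetical markup; the <br> makes it invalid XML but acceptable HTML.
$html = "
<table>
<tr>
    <td style='width:500px'><b>foo : </b> bar<br></td>
    <td style='width:200'><b>baz :</b> qux</td>
</tr>
</table>
";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = array();
foreach ($xpath->query('//td[@style][b]') as $cell) {
    // Key: text of the first <b> child, without colons and spaces.
    $key = trim($cell->getElementsByTagName('b')->item(0)->textContent, ' :');

    // Value: the cell's own text nodes, i.e. everything outside <b>.
    $value = '';
    foreach ($cell->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $value .= $child->nodeValue;
        }
    }
    $results[$key] = trim($value);
}

var_dump($results);
```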