PHP: regex search a pattern in a file and pick it up

I am really confused by regular expressions in PHP.

I can't read a whole tutorial right now because I have a bunch of HTML files that I need to extract links from ASAP. I came up with the idea of automating it with PHP, since that is the language I know.

So I think I can use this script:

$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) { 
    // $matches[2] = array of link addresses 
    // $matches[3] = array of link text - including HTML code 
} 

My problem is with $regexp.

The pattern I need to match looks like this:

href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF

I want to find and extract /content/r807215r37l86637/fulltext.pdf from lines like the one above; the files contain many of them.

Any help?

==================

Edit:

The title attributes matter to me: every link I want has this title:

title="Download PDF"


Once again: regular expressions are a poor fit for parsing HTML.

Save your sanity and use the built-in DOM libraries.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
    $data[] = $node->getAttribute("href");
}

Edit: Updated the code based on ircmaxell's comment.
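
For anyone who wants to try the DOM approach end to end, here is a minimal sketch. The $html sample and the second link (/content/abc123/fulltext.pdf) are made up for illustration, and libxml_use_internal_errors() is used in place of the @ operator so that parse warnings from sloppy markup are collected rather than hidden; that swap is my assumption about what is preferable, not part of the answer above.

```php
<?php
// Hypothetical sample markup standing in for the contents of one HTML file.
$html = <<<HTML
<html><body>
<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF (1 MB)</a>
<a href="/content/r807215r37l86637/" title="Abstract">Abstract</a>
<a title="Download PDF" href="/content/abc123/fulltext.pdf">PDF</a>
</body></html>
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select every <a> whose title is exactly "Download PDF" and keep its href.
$xpath = new DOMXPath($dom);
$data = array();
foreach ($xpath->query("//a[@title='Download PDF']") as $node) {
    $data[] = $node->getAttribute("href");
}

print_r($data);
```

With the sample above, $data should end up holding the two fulltext.pdf paths, regardless of whether title comes before or after href in the tag.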


That's easier with phpQuery or QueryPath:

foreach (qp($html)->find("a") as $a) { 
    if ($a->attr("title") == "Download PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}

With regular expressions, it depends on how consistent the source markup is:

preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);

Matching a fixed title="..." attribute is doable, but fragile: the pattern above assumes that title comes after href and before the closing bracket of the tag.
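
If the attribute order is not guaranteed, one possible workaround is a lookahead that checks for the title anywhere inside the tag before capturing the href. This is only a sketch: the $input sample and the /content/abc123/ link are invented for illustration, and it still assumes double-quoted attribute values and no ">" inside them.

```php
<?php
// Hypothetical sample input: title before href, title after href, and a non-PDF link.
$input = '<a title="Download PDF" href="/content/abc123/fulltext.pdf">PDF</a>'
       . '<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
       . '<a href="/content/r807215r37l86637/" title="Abstract">Abstract</a>';

// (?=[^>]*\stitle="Download PDF") asserts the title appears somewhere in the same tag;
// [^>]*\shref="([^"]+)" then captures the href wherever it sits in that tag.
$pattern = '#<a\b(?=[^>]*\stitle="Download PDF")[^>]*\shref="([^"]+)"#i';

preg_match_all($pattern, $input, $m);

// $m[1] holds the captured href values, in document order.
print_r($m[1]);
```

The lookahead does not consume any characters, so the order of title and href within the tag no longer matters. Unquoted or single-quoted attributes would still slip through, which is one more reason to prefer a DOM parser.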


Try something like this. If it does not work, show some examples of the links you want to parse.

<?php
$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#'; 

if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) { 
  foreach ($matches as $match) {
    printf("Url: %s<br/>", $match[1]);
  }
} 

Edit: updated so that it matches "Download PDF" entries only.


The best way is to use DomXPath to do the search in one step:

$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);

$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}

Or even:

$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}
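
Since the question mentions a bunch of files, the XPath query above could be wrapped in a small helper and run over a whole directory. This is a rough sketch: the function names extractPdfLinks() and extractPdfLinksFromDir() are hypothetical, and it assumes the files end in .html and live in a single directory.

```php
<?php
// Hypothetical helper: returns the href of every link whose title contains "Download PDF".
function extractPdfLinks($html)
{
    // Collect parse warnings from messy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query('//a[contains(@title, "Download PDF")]/@href') as $attr) {
        $links[] = $attr->value;
    }
    return $links;
}

// Hypothetical helper: runs extractPdfLinks() over every *.html file in $dir.
// Returns an array mapping each file path to the list of links found in it.
function extractPdfLinksFromDir($dir)
{
    $all = array();
    $files = glob(rtrim($dir, '/') . '/*.html') ?: array();
    foreach ($files as $file) {
        $html = file_get_contents($file);
        if ($html === false || $html === '') {
            continue; // unreadable or empty file: nothing to parse
        }
        $all[$file] = extractPdfLinks($html);
    }
    return $all;
}
```

Calling extractPdfLinksFromDir() with the directory that holds the pages would return the links grouped by file, which makes it easy to see which page each PDF came from.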


href="([^"]+)" will get you all the links of that form.
