PHP: regex search a pattern in a file and pick it up
I am really confused by regular expressions in PHP.
I can't work through a whole tutorial right now, because I have a bunch of HTML files that I need to extract links from ASAP. I decided to automate it with PHP, since that is the language I know.
So I think I can use this script:
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
My problem is with $regexp.
The pattern I need to match looks like this:
href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF"
I want to search for lines like the one above, which occur many times in my files, and extract the /content/r807215r37l86637/fulltext.pdf part.
Any help?
==================
Edit:
The title attributes are important to me: every link I want has
title="Download PDF"
Once again: regular expressions are a poor tool for parsing HTML.
Save your sanity and use the built-in DOM libraries.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
$data[] = $node->getAttribute("href");
}
Edit: Updated the code based on ircmaxell's comment.
That's easier with phpQuery or QueryPath:
foreach (qp($html)->find("a") as $a) {
if ($a->attr("title") == "Download PDF") {
print $a->attr("href");
print $a->innerHTML();
}
}
With regexps it depends on some consistency of the source:
preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);
Looking for a fixed title="..." attribute is doable, but more difficult, because the pattern depends on where the attribute sits before the closing bracket.
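One way to loosen that dependence on attribute order is a lookahead. The following is only a sketch under assumptions: the sample input is made up, attribute values are assumed to be double-quoted, and the tag is assumed to contain no ">" inside an attribute value.

```php
<?php
// Hypothetical sample input: the attributes appear in both orders.
$input = '<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
       . '<a title="Download PDF" href="/content/abc/fulltext.pdf">PDF</a>'
       . '<a href="/other.html" title="Something else">x</a>';

// The lookahead (?=...) checks that the title occurs somewhere inside the tag
// without consuming any text, so href may come before or after it.
$regexp = '#<a\b(?=[^>]*\btitle="Download PDF")[^>]*\bhref="([^"]+)"#i';

preg_match_all($regexp, $input, $m);
$links = $m[1]; // captured href values, in document order

print_r($links);
```

This still breaks on single-quoted or unquoted attributes, which is the usual argument for a DOM parser.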
Try something like this. If it does not work, show some examples of the links you want to parse.
<?php
$address = "file.txt";
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#';
if (preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[1] holds the captured href value
        printf("Url: %s<br/>", $match[1]);
    }
}
Edit: Updated so that it searches for "Download PDF" entries only.
The best way is to use DOMXPath to do the search in one step:
$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);
$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
$links[] = $node->getAttribute("href");
}
Or even:
$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
$links[] = $attr->value;
}
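Since the question mentions a whole bunch of files, the XPath query can be wrapped in a function and run over each file in turn. This is a rough sketch, not a definitive implementation: the function name extractPdfLinks and the "pages" directory are hypothetical, and it matches the title exactly rather than with contains().

```php
<?php
// Hypothetical helper: returns the href of every <a> whose title is "Download PDF".
function extractPdfLinks(string $html): array
{
    if (trim($html) === '') {
        return array(); // loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings internal instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query('//a[@title="Download PDF"]/@href') as $attr) {
        $links[] = $attr->value;
    }
    return $links;
}

// Assumed layout: the saved pages sit in a "pages" directory next to this script.
$files = glob(__DIR__ . '/pages/*.html');
foreach ($files ?: array() as $file) {
    $html = file_get_contents($file);
    if ($html === false) {
        continue; // unreadable file, skip it
    }
    foreach (extractPdfLinks($html) as $link) {
        echo $file, "\t", $link, "\n";
    }
}
```

Using libxml_use_internal_errors() instead of the @ operator keeps the warnings available through libxml_get_errors() if you ever need to inspect them.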
href="([^"]+)"
will get you all the links of that form.
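Note that a pattern like this captures every double-quoted href, not only the "Download PDF" ones, so some filtering is needed afterwards. A minimal sketch, assuming the character class excludes the closing quote and that the wanted links all end in /fulltext.pdf (the sample input is made up):

```php
<?php
// Hypothetical sample input.
$input = '<a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
       . '<a href="/about.html" title="About">About</a>';

// Capture every double-quoted href value, whatever the tag or title.
preg_match_all('/href="([^"]+)"/i', $input, $m);
$allLinks = $m[1];

// Assumption: the links of interest all end in "/fulltext.pdf".
$pdfLinks = array_values(array_filter($allLinks, function ($href) {
    return preg_match('#/fulltext\.pdf$#', $href) === 1;
}));

print_r($pdfLinks);
```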