PHP: regex search a pattern in a file and pick it up

2023-02-10 04:35 问答作者：

I am really confused with regular expressions for PHP.

Anyway, I cant read the whole tutorial thing now because I have a bunch of files in html which I have to find links in there ASAP. I came up with the idea to automate it with a php code which it is the language I know.

开发者_运维技巧

so I think I can user this script :

$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) { 
    // $matches[2] = array of link addresses 
   // $matches[3] = array of link text - including HTML code 
}

My problem is with $regexp

My required pattern is like this:

href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF

I want to search and get the /content/r807215r37l86637/fulltext.pdf from above lines which I have many of them in the files.

any help?

==================

edit

title attributes are important for me and all of them which I want, are titled

title="Download PDF"

Once again regexp are bad for parsing html.

Save your sanity and use the built in DOM libraries.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
    $data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
    $data[] = $node->getAttribute("href");
}

Edit Updated code based on ircmaxell comment.

That's easier with phpQuery or QueryPath:

foreach (qp($html)->find("a") as $a) { 
    if ($a->attr("title") == "PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}

With regexps it depends on some consistency of the source:

preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);

Looking for a fixed title="..." attribute is doable, but more difficult as it depends on the position before the closing bracket.

try something like this. If it does not work, show some examples of links you want to parse.

<?php
$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#'; 

if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) { 
  foreach ($matches as $match) {
    printf("Url: %s<br/>", $match[1]);
  }
}

edit: updated so it searches for Download "PDF entries" only

The best way is to use DomXPath to do the search in one step:

$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);

$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}

Or even:

$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}

href="([^]+)" will get you all the links of that form.

继续阅读：php regex search

PHP: regex search a pattern in a file and pick it up

edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

edit

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？