Using a regular expression to extract URLs from links in an HTML document
I need to capture all the links in a given HTML document.
Here is a sample:
<div class="infobar">
... some code goes here ...
<a href="/link/some-text">link 1</a>
<a href="/link/another-text">link 2</a>
<a href="/link/blabla">link 3</a>
<a href="/link/whassup">link 4</a>
... some code goes here ...
</div>
I need to get all the links inside div.infobar that start with /link/.
I tried this:
preg_match_all('#<div class="infobar">.*?(href="/link/(.*?)") .*?</div>#is', $raw, $x);
but it gives me only the first match.
Thanks for any advice.
I would suggest using DOMDocument for this rather than a regex. Consider the following simple code:
$content = '
<div class="infobar">
<a href="/link/some-text">link 1</a>
<a href="/link/another-text">link 2</a>
<a href="/link/blabla">link 3</a>
<a href="/link/whassup">link 4</a>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
// Get all divs
$divs = $dom->getElementsByTagName("div");
foreach ($divs as $div) {
    // Check the class attribute of each div
    $cl = $div->getAttribute("class");
    if ($cl == "infobar") {
        // Find all anchors and append their hrefs to our $links array
        $hrefs = $div->getElementsByTagName("a");
        foreach ($hrefs as $href) {
            $links[] = $href->getAttribute("href");
        }
    }
}
var_dump($links);
OUTPUT
array(4) {
[0]=>
string(15) "/link/some-text"
[1]=>
string(18) "/link/another-text"
[2]=>
string(12) "/link/blabla"
[3]=>
string(13) "/link/whassup"
}
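The loop above collects every href in the div, whatever its path. If only the hrefs starting with /link/ are wanted, DOMXPath can express both conditions in a single query. This is a minimal sketch, assuming the same markup plus two made-up links (/other/skip-me and /link/outside) that are there only to show the filtering:

```php
<?php
// Sample markup; /other/skip-me and /link/outside are hypothetical
// additions used only to demonstrate the filtering.
$content = '
<div class="infobar">
<a href="/link/some-text">link 1</a>
<a href="/other/skip-me">not wanted</a>
<a href="/link/whassup">link 4</a>
</div>
<a href="/link/outside">outside the div</a>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Anchors inside div.infobar whose href starts with /link/
$nodes = $xpath->query('//div[@class="infobar"]//a[starts-with(@href, "/link/")]');

$links = array();
foreach ($nodes as $node) {
    $links[] = $node->getAttribute('href');
}
print_r($links);
```

Note that @class="infobar" is an exact comparison, so a div with several classes (for example class="infobar wide") would need a contains() test instead.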
Revising my previous answer. You'll need to do it in two steps:
//This first step grabs the contents of the div.
preg_match('#(?<=<div class="infobar">).*?(?=</div>)#is', $raw, $x);
//And here, we grab all of the links.
preg_match_all('#href="/link/(.*?)"#is', $x[0], $x);
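Put together, the two steps might look like the sketch below. The $raw sample, the $block, $matches and $slugs variable names, and the extra /link/before anchor are assumptions made for illustration; a capture group is used in place of the lookarounds, and the first match is checked before the second step runs.

```php
<?php
// Hypothetical input: one /link/ anchor outside the div, two inside it.
$raw = '<a href="/link/before">outside</a>
<div class="infobar">
<a href="/link/some-text">link 1</a>
<a href="/link/another-text">link 2</a>
</div>';

$slugs = array();
// Step 1: isolate the contents of the div (lazy match up to the first </div>).
if (preg_match('#<div class="infobar">(.*?)</div>#is', $raw, $block)) {
    // Step 2: collect every /link/ href inside that block.
    preg_match_all('#href="/link/([^"]*)"#i', $block[1], $matches);
    $slugs = $matches[1]; // only the part after /link/
}
print_r($slugs);
```

Because the first pattern stops at the first </div>, a div nested inside div.infobar would cut the block short; that is one reason a DOM parser is usually the safer choice.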
Another option is the Simple HTML DOM parser (http://simplehtmldom.sourceforge.net/):
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Try this (I added a +):
preg_match_all('#<div class="infobar">.*?(href="/link/(?:.*?)")+ .*?</div>#is', $raw, $x);