Regex problem - when scraping HTML segment
I'm trying to use a regex to scrape the contents between the anchors
"<h2>Highlights</h2>"
and
"</div><div class="FloatClear"></div><div id="SalesMarquee">"
within the HTML segment below.
But when I tried this regex, it returned nothing:
<h2>Highlights<\/h2>\t?\n?\s?\S?(.*?)<\/div>
I think it may have something to do with the empty spaces within the HTML source...
Can any regex gurus give me the magic expression for grabbing everything between any two given HTML anchors, like the ones mentioned above (one that can also cope with any whitespace within the HTML source)?
BTW I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...
Many thanks
HTML segment:
<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
<li>asdasda as asdasdasdas </li>
<li>asdasd asdasdas asdsad asdasd asa</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
<div id="SalesMarqueeTemplate" style="display: none;">
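Use an HTML DOM parser such as Simple HTML DOM Parser:
// Load the library (simple_html_dom.php ships with Simple HTML DOM Parser)
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}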
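For readers who want to see the parser approach applied to the Highlights block itself, here is a minimal sketch. It swaps in Python's standard-library `html.parser` instead of Simple HTML DOM, purely as an illustration; the class name `HighlightsParser` and the shortened sample markup are assumptions, not part of the original page or of any library.

```python
from html.parser import HTMLParser


class HighlightsParser(HTMLParser):
    """Collects the text of each <li> inside <div id="Highlights"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while we are inside the Highlights div
        self.in_li = False   # True while we are inside an <li> of that div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside Highlights
            elif dict(attrs).get("id") == "Highlights":
                self.div_depth = 1           # entering the Highlights div
        elif tag == "li" and self.div_depth > 0:
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data           # text may arrive in several chunks


# Shortened sample based on the segment in the question (assumed markup).
html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>asdasda as asdasdasdas </li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
<ul><li>not a highlight</li></ul>
</div>
"""

parser = HighlightsParser()
parser.feed(html)
parser.close()
items = [item.strip() for item in parser.items]
print(items)
```

Because the parser tracks tags rather than raw characters, line breaks and indentation in the source do not affect the result.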
I agree with Naveed. Here is a similar post: Robust and Mature HTML Parser for PHP
The following PCRE regex should work.
/<h2>.*<\/h2>(.*)<\/div>/is
The last two characters are modifiers: i for ignore case and s for dot-all mode. Dot-all mode makes the dot match newlines as well.
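To see why dot-all mode matters for the pattern in the question, here is a small sketch. It uses Python's `re` module as a stand-in for the PCRE engine (an assumption; `re.DOTALL` plays the role of the s modifier) and a shortened copy of the HTML segment.

```python
import re

# Shortened copy of the HTML segment from the question.
html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
</ul>
</div>
<div class="FloatClear"></div>
"""

# The asker's original pattern, unchanged.
pattern = r"<h2>Highlights<\/h2>\t?\n?\s?\S?(.*?)<\/div>"

# Without dot-all, "." stops at every line break, so "(.*?)" can never
# reach "</div>" on a later line and the search finds nothing.
without_s = re.search(pattern, html)

# With dot-all, "." also matches newlines and the pattern matches.
with_s = re.search(pattern, html, re.DOTALL)

print(without_s)
print(repr(with_s.group(1)))
```

Note that even with dot-all enabled, the `\S?` in the original pattern consumes the `<` of `<ul>`, so the captured text starts with `ul>`. Dropping the `\t?\n?\s?\S?` part avoids that.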
Edit: You'll probably want this regex instead:
/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/is
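As a rough check of the pattern above, the sketch below runs it against a shortened copy of the segment, again using Python's `re` as a stand-in for PCRE (an assumption). The `lazy` variant, with `(.*?)` and `\s*`, is not from the answer; it is a suggested alternative in case the page contains more than one `FloatClear` div, where the greedy `(.*)` would run on to the last one.

```python
import re

# Shortened copy of the HTML segment from the question.
html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
"""

# Pattern from the answer; IGNORECASE and DOTALL correspond to /is.
greedy = re.compile(
    r'<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical lazy variant: stops at the first closing div that is
# followed (after optional whitespace) by the FloatClear div.
lazy = re.compile(
    r'<h2>Highlights</h2>(.*?)</div>\s*<div class="FloatClear">',
    re.IGNORECASE | re.DOTALL,
)

match = greedy.search(html)
inner = match.group(1).strip()

# Pull the individual list items out of the captured block.
items = re.findall(r"<li>(.*?)</li>", inner)
print(items)
```

In a text-box setting where only the pattern can be entered, the equivalent of the lazy variant would be `/<h2>Highlights<\/h2>(.*?)<\/div>\s*<div class="FloatClear">/is`, assuming the script accepts PCRE delimiters and modifiers.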
Try adding an 'm' modifier (for 'multiline') to the regexes provided by hlindset:
/<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">/ism
Here it is in action:
- http://www.rubular.com/r/td1IUBvg26
Documentation on all modifiers is available by googling "pcre pattern modifiers".
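One caveat about the modifiers: in PCRE the m modifier only changes where ^ and $ can match, so for a pattern without those anchors it should make no difference. It is the s modifier that lets the dot cross line breaks. The sketch below illustrates this with Python's `re` as a stand-in for PCRE (an assumption; `re.M` and `re.S` mirror m and s).

```python
import re

text = (
    "<h2>Highlights</h2>\n"
    "<ul>\n"
    "<li>1234</li>\n"
    "</ul>\n"
    "</div>\n"
    '<div class="FloatClear"></div>\n'
)

pattern = r'<h2>Highlights<\/h2>(.*)<\/div>.*<div class="FloatClear">'

# Same pattern with /is and with /ism.
with_is = re.search(pattern, text, re.I | re.S).group(1)
with_ism = re.search(pattern, text, re.I | re.S | re.M).group(1)

# The pattern has no ^ or $, so adding "m" changes nothing.
same_capture = with_is == with_ism

# "m" only matters once an anchor is involved: ^ then matches at the
# start of every line instead of only at the start of the string.
anchored_plain = re.findall(r"^<li>", text)
anchored_multiline = re.findall(r"^<li>", text, re.M)

print(same_capture, anchored_plain, anchored_multiline)
```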