
Parse HTML DOM using PHP

I have this HTML code:

<marquee  align="left" id="LatestNewsM" SCROLLAMOUNT="4" loop="infinite" direction="right">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test test test</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test sample text sample</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">text text 222 another text</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
            ...........
            .....
</marquee>

and this PHP code:

$homepage = file_get_contents('http://www.site.com');

How can I search the content and get only the text inside the <font> tags?


You have a few options: regex (which ThiefMaster advised against), strpos and substr, or a DOM/XML parser.

If you go with regex, you might end up with something like this:

/<font[^>]*>.*<\/font>/i

When run on data like this:

> Hello, this is my brutal <font>font
> <font>tag</font> right</font> it is

You will end up with this if the match is greedy:

<font>font <font>tag</font> right</font>

or with this if it is ungreedy (.*? instead of .*):

<font>font <font>tag</font>

You can use a negative lookahead and do a better job, but it's still not a good solution. (This example only shows why; the regex is kept as simple as possible.)
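
To see the difference for yourself, here is a minimal sketch that runs both variants with preg_match. The variable names ($data, $greedy, $lazy) are mine, and the sample text is the one above joined onto a single line, because without the s modifier the dot does not match a newline.

```php
<?php
// Sample data from the answer, joined onto one line.
$data = 'Hello, this is my brutal <font>font <font>tag</font> right</font> it is';

// Greedy: .* runs on to the LAST </font> in the string.
preg_match('/<font[^>]*>.*<\/font>/i', $data, $greedy);

// Ungreedy: .*? stops at the FIRST </font>, which closes the inner tag.
preg_match('/<font[^>]*>.*?<\/font>/i', $data, $lazy);

echo $greedy[0], PHP_EOL; // <font>font <font>tag</font> right</font>
echo $lazy[0], PHP_EOL;   // <font>font <font>tag</font>
```

Neither result is the text of a single font tag, which is the whole problem with nested markup.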

If you go with strpos and substr, you'll have to walk through the document character by character and parse it yourself (matching opening and closing tags, skipping attributes), or you can try:

$closing = 0; // start searching at offset zero
$opening = strpos($dataset, '<font', $closing); // next opening tag
$closing = strpos($dataset, '</font', $opening); // its closing tag, searched from the opening tag

and so on until you parse it all.
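
Spelled out as a loop, the idea might look like the sketch below. It assumes the font tags are not nested and are written in lowercase (strpos is case-sensitive; stripos would relax that). The sample $dataset is a shortened, made-up version of the markup in the question.

```php
<?php
// Hypothetical sample input modelled on the question's markup.
$dataset = '<marquee>'
    . '<font dir="rtl" class="StringTheme">test test test</font>'
    . '<img src="sep.gif" align="middle">'
    . '<font dir="rtl" class="StringTheme">test sample text sample</font>'
    . '</marquee>';

$texts  = [];
$offset = 0;

while (($opening = strpos($dataset, '<font', $offset)) !== false) {
    // Skip the attributes: the content starts after the tag's closing '>'.
    $start = strpos($dataset, '>', $opening);
    if ($start === false) {
        break; // malformed opening tag
    }
    $start++;

    $closing = strpos($dataset, '</font', $start);
    if ($closing === false) {
        break; // unterminated tag
    }

    $texts[] = substr($dataset, $start, $closing - $start);
    $offset  = $closing + strlen('</font'); // continue after this closing tag
}

print_r($texts);
```

With nested font tags this breaks in the same way the ungreedy regex does, so it only suits flat markup like the marquee in the question.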

If you go with a DOM/XML parser, keep in mind that file_get_contents or file() loads the whole file into memory, as most DOM/XML parsers do. I would go with XMLReader, which streams the document instead of loading the whole file into memory, parsing it, and building the tree; it's more efficient.
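
A rough sketch of the XMLReader route follows. XMLReader expects well-formed XML, so the made-up sample here closes its img tags; the page in the question would need cleaning up first. For a file or URL you would call open() instead of XML(), and that is where the streaming pays off.

```php
<?php
// Hypothetical, well-formed sample: note the self-closed <img/> tags.
$xml = '<marquee>'
    . '<font dir="rtl">test test test</font><img src="sep.gif"/>'
    . '<font dir="rtl">test sample text sample</font><img src="sep.gif"/>'
    . '</marquee>';

$reader = new XMLReader();
$reader->XML($xml); // parse from a string; open() reads from a file or URL

$texts = [];
while ($reader->read()) {
    // Collect the text of every <font> element as the cursor reaches it.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'font') {
        $texts[] = $reader->readString();
    }
}
$reader->close();

print_r($texts);
```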

P.S. It's quite late here (3:00 AM), so excuse any misspelled words. Thank you. :)


These will be useful:
http://php.net/manual/en/function.strip-tags.php - to delete all tags from text
http://php.net/manual/en/book.simplexml.php - to parse XML

If the HTML were well-formed XML (currently it is not, because the 'img' tags are not closed), something like this could be used:

$xml = new SimpleXMLElement($data);
$fonts = $xml->xpath('/marquee/font');
foreach ($fonts as $font) {
    echo (string) $font, PHP_EOL; // cast the element to its text content
}
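
For markup that is not well-formed, one alternative is DOMDocument::loadHTML, which uses libxml's HTML parser and tolerates unclosed img tags. This is a sketch, not part of the snippet above: the sample $data is a shortened, made-up version of the question's markup, and the libxml_* calls only keep the parser's warnings about unknown tags such as marquee from being printed.

```php
<?php
// Hypothetical sample modelled on the question's markup (unclosed <img>).
$data = '<marquee align="left" id="LatestNewsM">'
    . '<font dir="rtl" class="StringTheme">test test test</font>'
    . '<img src="sep.gif" align="middle">'
    . '<font dir="rtl" class="StringTheme">test sample text sample</font>'
    . '<img src="sep.gif" align="middle">'
    . '</marquee>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$texts = [];
foreach ($doc->getElementsByTagName('font') as $font) {
    $texts[] = trim($font->textContent);
}

print_r($texts);
```

In the question's setup, $data would be the $homepage string returned by file_get_contents.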