Parse HTML DOM using PHP

2023-02-25 17:24

I have this html code:

<marquee  align="left" id="LatestNewsM" SCROLLAMOUNT="4" loop="infinite" direction="right">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test test test</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">test sample text sample</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">

            <font dir="rtl" valign="top" class="StringTheme" style="font-size:14px;">text text 222 another text</font>  
            <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
            ...........
            .....
</marquee>

and this PHP code:

$homepage = file_get_contents('http://www.site.com');

How can I search the content and get only the text inside the <font> tags?

You have a few options: regex (which, as ThiefMaster mentioned, is best avoided), strpos and substr, or a DOM/XML parser.

If you go with regex, you might end up with something like this:

/<font[^>]*>.*<\/font>/i

When run on data like this:

> Hello, this is my brutal <font>font <font>tag</font> right</font> it is

You will end up with (if greedy)

<font>font <font>tag</font> right</font>

or if ungreedy

<font>font <font>tag</font>

You can use a negative lookahead and do a better job, but it's still not a good solution (this example is here to show you why; the regex is kept as simple as possible).
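The greedy and ungreedy results can be reproduced with a short script. This is only a sketch: the variable names are made up for illustration, the sample string is the one from above on a single line, and the ungreedy variant is written as `.*?`.

```php
<?php
// Sample data from the answer, kept on one line because "." does not
// match newlines unless the /s modifier is used.
$data = 'Hello, this is my brutal <font>font <font>tag</font> right</font> it is';

// Greedy: ".*" runs to the LAST </font> in the string.
preg_match('/<font[^>]*>.*<\/font>/i', $data, $greedy);

// Ungreedy: ".*?" stops at the FIRST </font>, cutting the outer tag short.
preg_match('/<font[^>]*>.*?<\/font>/i', $data, $lazy);

echo $greedy[0], PHP_EOL; // <font>font <font>tag</font> right</font>
echo $lazy[0], PHP_EOL;   // <font>font <font>tag</font>
```

Neither result respects the nesting, which is the point of the example.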

If you go with strpos and substr, you'll have to walk through the document and parse it yourself (matching opening and closing tags, skipping attributes), or you can try something like:

$opening = strpos($dataset, '<font', $closing);  // $closing starts at offset zero
$closing = strpos($dataset, '</font', $opening); // search from the opening tag onwards

and so on until you parse it all.
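Spelled out as a loop, that idea might look like the sketch below. The function name `extractFontText` is hypothetical, and the code assumes flat markup like the question's: no nested <font> tags and no ">" inside attribute values. It uses stripos instead of strpos so that `<FONT>` is matched too.

```php
<?php
// Hypothetical helper: collects the text between each <font ...> and the
// next </font>. Assumes the tags are not nested.
function extractFontText(string $dataset): array
{
    $texts = [];
    $closing = 0; // the search for the first opening tag starts at offset zero

    while (($opening = stripos($dataset, '<font', $closing)) !== false) {
        // Skip the attributes: the text starts after the tag's closing ">".
        $start = strpos($dataset, '>', $opening);
        if ($start === false) {
            break; // malformed: opening tag never ends
        }
        $start++;

        $closing = stripos($dataset, '</font', $start);
        if ($closing === false) {
            break; // malformed: no matching closing tag
        }

        $texts[] = trim(substr($dataset, $start, $closing - $start));
    }

    return $texts;
}

$sample = '<marquee><font dir="rtl" class="StringTheme">test test test</font>'
        . '<img src="x.gif" align="middle">'
        . '<font dir="rtl" class="StringTheme">test sample text sample</font></marquee>';

print_r(extractFontText($sample));
```

On the nested example from the regex discussion this still returns the wrong text, so it is only suitable when you control or trust the shape of the markup.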

If you go with a DOM/XML parser, consider this: file_get_contents or file() loads the whole file into memory, and so do most DOM/XML parsers, which then parse it and build a tree. I would go with XMLReader, which streams the document instead of loading the whole file into memory; it's more efficient.

P.S. It's quite late here (3:00 AM), so excuse any misspelled words. Thank you. :)
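For a page that fits comfortably in memory, a tree-based parser is also workable. The sketch below uses PHP's bundled DOM extension (DOMDocument), not the XMLReader approach recommended above. It is shown because its HTML loader tolerates markup that is not well-formed XML, such as the unclosed img tags in the question. The sample markup is an abbreviated copy of the question's snippet, and `$fontTexts` is just an illustrative name.

```php
<?php
// Abbreviated copy of the markup from the question (img tags left unclosed).
$html = <<<'HTML'
<marquee align="left" id="LatestNewsM">
    <font dir="rtl" class="StringTheme">test test test</font>
    <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
    <font dir="rtl" class="StringTheme">test sample text sample</font>
    <img src="/Portal/images/LightVersionWeb2/jazeeraTicSep.gif" align="middle">
</marquee>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the text content of every <font> element in document order.
$fontTexts = [];
foreach ($doc->getElementsByTagName('font') as $font) {
    $fontTexts[] = trim($font->textContent);
}

print_r($fontTexts);
```

In the question's setting, `$html` would be the `$homepage` string returned by file_get_contents.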

These may be useful:
http://php.net/manual/en/function.strip-tags.php - to strip all tags from text
http://php.net/manual/en/book.simplexml.php - to parse XML

If the HTML were well-formed XML (currently it is not: the 'img' tags are not closed), something like this could be used:

// $data must be well-formed XML whose root element is <marquee>
$xml = new SimpleXMLElement($data);
$fonts = $xml->xpath('/marquee/font');
foreach ($fonts as $font) {
    echo (string) $font, PHP_EOL;
}
