PHP5 webpage scan (simple DOM parser || file_get_contents() + regexp)... resource-wise

I was thinking about a script that would scan 10+ websites for specific content inside a specific div. Let's say it would be moderately used, some 400 searches a day.

Which of the two in the title would better support the load, take fewer resources, and give better speed:

Creating a DOM from each of the websites, then iterating over each for the specific div id,

OR

creating a string from the website with file_get_contents(), then regexping the needed string out of it.

To be more specific about the kind of operation I would need to execute, consider the following.

Additional question: is a regexp capable of searching for the following occurrence of a given string:

<div id="myId"> needed string </div>

to identify the tag with the given ID and return ONLY what is between the tags?

Please answer only yes/no as to whether it's possible; if it is, I'll open a separate question about the syntax so it's not all bundled here.
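For concreteness, a minimal sketch of the kind of pattern meant here (syntax details deferred to that separate question; this assumes the target div contains no nested divs and a double-quoted id attribute, and example.com stands in for a real URL):

<?php
// Sketch only: assumes no nested <div> inside the target and a
// double-quoted id attribute; example.com is a placeholder URL.
$html = file_get_contents('http://example.com/');

if (preg_match('~<div id="myId">(.*?)</div>~s', $html, $m)) {
    echo trim($m[1]); // "needed string"
}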


For 400 searches a day, it makes little difference which method you use, performance-wise.

In any case, the fastest method would be file_get_contents + strpos + substr, unless your locate-and-extract logic gets complex. Depending on the specific regular expression, a regexp may or may not be faster than DOM, but it likely is. DOM will probably be a more reliable method than regular expressions, but that depends on how well-formed your pages are (libxml2 does not exactly mimic the browsers' parsing).
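As a rough sketch of that strpos + substr approach (assuming the opening tag occurs exactly once and the div contains no nested divs; example.com is a placeholder URL):

<?php
// Sketch of the strpos + substr extraction described above.
// Assumes the opening tag occurs once and the div has no nested <div>s.
$html  = file_get_contents('http://example.com/'); // placeholder URL
$open  = '<div id="myId">';
$start = strpos($html, $open);

if ($start !== false) {
    $start += strlen($open);
    $end    = strpos($html, '</div>', $start); // first closing tag after the match
    if ($end !== false) {
        echo substr($html, $start, $end - $start);
    }
}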


  1. Yes

  2. Speed will depend on your server and the pages in question; either way, execution time will be negligible compared to the time spent downloading the pages to scan.

  3. If you go with DOM / XPath, the whole thing is doable in 3 lines of code (see the sketch below).
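Roughly those 3 lines, plus setup (warnings are suppressed because scraped pages are rarely well-formed; example.com is a placeholder URL):

<?php
// DOM + XPath sketch: load the page, then pull the text of the div by id.
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents('http://example.com/')); // placeholder URL
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(//div[@id="myId"])');

The string(...) wrapper in the XPath expression makes evaluate() return the text content of the first matching div as a plain PHP string.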
