PHP5 webpage scan (simple DOM parser || file_get_contents() + regexp)... resource-wise

I was thinking about a script that would scan 10+ websites for specific content inside a specific div. Let's say it would be moderately used, some 400 searches a day.

Which of the two in the title would better support the load, take fewer resources, and give better speed:

Creating a DOM from each of the websites, then iterating over each for the specific div id,

OR

creating a string from the website with file_get_contents(), then regexping the needed string out of it.

To be more specific about the kind of operation I would need to execute, consider the following.

Additional question: is a regexp capable of searching for the following occurrence of a given string:

<div id="myId"> needed string </div>

to identify the tag with the given ID and return ONLY what is between the tags?

Please answer only yes/no as to whether it's possible; if it is, I'll open a separate question about the syntax so it's not all bundled here.
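For concreteness, a minimal sketch of the kind of pattern meant here (syntax details deferred to that separate question; this assumes the target div contains no nested divs and a double-quoted id attribute, and example.com stands in for a real URL):

<?php
// Sketch only: assumes no nested <div> inside the target and a
// double-quoted id attribute; example.com is a placeholder URL.
$html = file_get_contents('http://example.com/');

if (preg_match('~<div id="myId">(.*?)</div>~s', $html, $m)) {
    echo trim($m[1]); // "needed string"
}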


For 400 searches a day, it makes little difference which method you use, performance-wise.

In any case, the fastest method would be file_get_contents + strpos + substr, unless your locate-and-extract logic gets complex. Depending on the specific regular expression, a regexp may or may not be faster than DOM, but it likely is. DOM will probably be a more reliable method than regular expressions, but that depends on how well-formed your pages are (libxml2 does not exactly mimic the browsers' parsing).
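As a rough sketch of that strpos + substr approach (assuming the opening tag occurs exactly once and the div contains no nested divs; example.com is a placeholder URL):

<?php
// Sketch of the strpos + substr extraction described above.
// Assumes the opening tag occurs once and the div has no nested <div>s.
$html  = file_get_contents('http://example.com/'); // placeholder URL
$open  = '<div id="myId">';
$start = strpos($html, $open);

if ($start !== false) {
    $start += strlen($open);
    $end    = strpos($html, '</div>', $start); // first closing tag after the match
    if ($end !== false) {
        echo substr($html, $start, $end - $start);
    }
}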


  1. Yes

  2. Speed will depend on your server and the pages in question; either way, execution time will be negligible compared to the time spent downloading the pages to scan.

  3. If you go with DOM / XPath, the whole thing is doable in 3 lines of code (see the sketch below).
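Roughly those 3 lines, plus setup (warnings are suppressed because scraped pages are rarely well-formed; example.com is a placeholder URL):

<?php
// DOM + XPath sketch: load the page, then pull the text of the div by id.
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents('http://example.com/')); // placeholder URL
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(//div[@id="myId"])');

The string(...) wrapper in the XPath expression makes evaluate() return the text content of the first matching div as a plain PHP string.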
