Perl, mod_perl2 or CGI for a web-scraping service?

I'm going to design an open-source web service which should collect ("web-scrape") some data from multiple - currently three - web sites.

The web sites do not expose any web service or API; they just publish web pages.

Data will be collected 'live' on any client's request from all the web sites in parallel, and will then be parsed to XML to be returned to the client.

The server operating system will be Linux.

The clients will initially be just an Android application of mine.

There may be about 100 or more concurrent clients, if the project is successful... ;-).

Currently my preference is to adopt:

  • Perl (for the service language)
  • mod_perl2 with ModPerl::Registry (for a fast Perl interpreter embedded in Apache)
  • Perl module CHI::Driver::FastMmap (for a modern and fast cache handler)
  • Perl module Coro (for an async event loop to issue many requests in parallel)

Since I suppose the specifications of the project can be of general use and interest, and since I am running into many problems with the combined use of Coro and mod_perl2, I ask:

Are my adoption preferences well matched?

Do you see any incompatibilities or potential problems?

Do you have any suggestions to improve (in this order):

  • compatibility among components
  • neatness of the implementation
  • ease of maintainability
  • performance


You probably don't want to develop with mod_perl for any new project anymore. You really want to use something Plack-based, or maybe even Plack itself. If you want to use Coro, an AnyEvent-based backend such as Twiggy may make the most sense (though you may want to put a reverse proxy in front of it).
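To make the Plack suggestion a little more concrete, here is a minimal sketch of what such a service could look like as a PSGI application. It uses only core Perl. The `collect_items` function, the `q` query parameter and the XML layout are assumptions made for illustration; the real scraping, caching and parallel fetching are left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real scraping step. A real service would
# fetch and parse the three sites here, ideally in parallel.
sub collect_items {
    my ($query) = @_;
    return (
        { site => 'site-a', title => "result for $query" },
        { site => 'site-b', title => 'Tom & Jerry <live>' },
    );
}

# Escape the characters that are special in XML text and attributes.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# A PSGI application is a code reference that takes the request
# environment and returns [ status, headers, body ]. Because this
# assignment is the last statement in the file, its value is also what
# a PSGI loader such as plackup receives.
my $app = sub {
    my ($env) = @_;

    # Naive extraction of the "q" parameter; no URL-decoding is done.
    my ($query) = ( $env->{QUERY_STRING} // '' ) =~ /(?:^|&)q=([^&]*)/;
    $query //= '';

    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<items>\n};
    for my $item ( collect_items($query) ) {
        $xml .= sprintf qq{  <item site="%s">%s</item>\n},
            xml_escape( $item->{site} ), xml_escape( $item->{title} );
    }
    $xml .= "</items>\n";

    return [ 200, [ 'Content-Type' => 'application/xml; charset=utf-8' ], [$xml] ];
};
```

Saved as `app.psgi`, this should run under any PSGI server, for example with `plackup -s Twiggy app.psgi` once Plack and Twiggy are installed. Under an event-loop server like Twiggy, a blocking scrape inside the handler would stall other clients, so the real fetch step would need to be non-blocking (PSGI allows a delayed response for that purpose).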


Are you happy sticking with Apache?
If so, forget Coro and let Apache handle concurrency; preload your modules and configuration, and write a super-efficient Apache request handler. (That's the way I go whenever Apache 2 with mod_perl2 is available.)
If not, start learning Plack, which is server-agnostic.
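For the Apache route, a preloaded mod_perl2 response handler might look roughly like the sketch below. The package name `My::Scraper::Handler`, the `%CONFIG` contents and the `/scrape` location are assumptions for illustration, and the module only loads inside Apache with mod_perl2 installed.

```perl
package My::Scraper::Handler;    # hypothetical module name

use strict;
use warnings;

use Apache2::RequestRec ();      # $r->content_type
use Apache2::RequestIO  ();      # $r->print
use Apache2::Const -compile => qw(OK);

# Anything loaded or built at this level runs once when the module is
# loaded (at server startup if it is preloaded), not on every request.
my %CONFIG = ( sites => [ 'site-a', 'site-b', 'site-c' ] );    # placeholder

sub handler {
    my $r = shift;

    # Placeholder body: a real handler would scrape, cache and
    # serialise the results here.
    $r->content_type('application/xml');
    $r->print(qq{<?xml version="1.0" encoding="UTF-8"?>\n});
    $r->print( sprintf qq{<items sites="%d"/>\n}, scalar @{ $CONFIG{sites} } );

    return Apache2::Const::OK;
}

1;

# Matching httpd.conf fragment (the /scrape path is an assumption):
#
#   PerlModule My::Scraper::Handler
#   <Location /scrape>
#       SetHandler modperl
#       PerlResponseHandler My::Scraper::Handler
#   </Location>
```

`PerlModule` loads the handler when the server starts, so each request only pays for the `handler` call itself; Apache's own worker processes then provide the concurrency across clients.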

If you choose the first route, I'd recommend avoiding traditional CGI and instead adopting CGI::Application, which keeps almost all of the lightness and speed of plain CGI but gives you a much nicer, more modern development environment and framework (and it is Plack-compatible).
