
Schedule sending HTTP requests to a particular site

I want some way to be notified whenever a new result appears for a search query on a particular site. The site does not provide any feature for this (RSS, alerts, etc.). One way I think to accomplish this would be to send an HTTP request (for the search) and process the HTTP response, sending a mail for any new result that comes up. The search parameters can be static or, better, taken from a source (like a CSV file). Does anyone know of an existing solution, preferably online, that can accomplish this?

Thanks, Jeet


Try iHook. It allows you to schedule HTTP requests to public web resources (as frequently as every minute) and receive rule-based email notifications. You can create notification rules around the response status code and response body (via JSON expression and CSS selector).


That would depend on the particular site you want to query.


I know of no open-source solution that does this "out of the box", so I believe you'd need to write a custom spider/crawler to accomplish your task; it would need to provide the following services:

  1. Scheduling - when the crawl should occur. Typically the 'cron' system service on Unix-like systems or the Task Scheduler on Windows is used.

  2. Retrieval - retrieving targeted pages. Using either a scripting language like Perl or a dedicated system tool like 'curl' or 'wget'.

  3. Extraction / Normalization - removing everything from the target (retrieved page) except the content of interest. Needed to compensate for changing sections of the target that are not germane to the task, like dates or advertising. Typically accomplished via a scripting language that supports regular expressions (for trivial cases) or an HTML parser library (for more specialized extractions).

  4. Checksumming - converting the target into a unique identifier determined by its content. Used to determine changes to the target since the last crawl. Accomplished by a system tool (such as the Linux 'cksum' command) or a scripting language.

  5. Change detection - comparing the previously saved checksum for the last retrieved target with the newly computed checksum for the current retrieval. Again, typically using a scripting language.

  6. Alerting - informing users of identified changes. Typically via email or text message.

  7. State management - storing target URIs, extraction rules, user preferences and target checksums from the previous run. Either configuration files or databases (like MySQL) are used.

Please note that this list of services attempts to describe the system in the abstract, and so it sounds a lot more complicated than the actual tool you create will be. I've written several systems like this before, so I expect a simple solution written in Perl (utilizing standard Perl modules) and running on Linux would require a hundred lines or so for a couple of target sites, depending on extraction complexity.
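To make that shape concrete, here is a minimal sketch of services 2-7 in Perl, using common CPAN modules (LWP::UserAgent, Digest::MD5, MIME::Lite). The URL, state-file path and mail address are placeholders I've made up for illustration, and the extraction step is deliberately trivial (strip tags, collapse whitespace); a real version would apply per-site extraction rules with an HTML parser as described above.

    #!/usr/bin/perl
    # check_search.pl -- minimal change-detection sketch; the URL, state file
    # and mail address are placeholders, not a real configuration.
    # 1. Scheduling: run from cron, e.g.  */15 * * * * /usr/bin/perl check_search.pl
    use strict;
    use warnings;
    use LWP::UserAgent;           # 2. retrieval
    use Digest::MD5 qw(md5_hex);  # 4. checksumming
    use MIME::Lite;               # 6. alerting
    use Encode qw(encode_utf8);

    my $url        = 'http://example.com/search?q=my+query';
    my $state_file = '/var/tmp/search_checksum.txt';
    my $mail_to    = 'you@example.com';

    # 2. Retrieval
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($url);
    die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;

    # 3. Extraction / normalization (trivial case: strip tags, collapse
    #    whitespace; a real extractor would use an HTML parser such as
    #    HTML::TreeBuilder to keep only the result list)
    my $content = $res->decoded_content;
    $content =~ s/<[^>]*>//g;
    $content =~ s/\s+/ /g;

    # 4. Checksumming
    my $new_sum = md5_hex(encode_utf8($content));

    # 7. State management: read the checksum saved by the previous run
    my $old_sum = '';
    if (open my $in, '<', $state_file) {
        $old_sum = <$in> // '';
        chomp $old_sum;
        close $in;
    }

    # 5. Change detection and 6. Alerting
    if ($new_sum ne $old_sum) {
        MIME::Lite->new(
            To      => $mail_to,
            Subject => "Search results changed: $url",
            Data    => "The page content changed since the last check (checksum $new_sum).\n",
        )->send;    # defaults to the local sendmail
    }

    # Save the new checksum for the next run
    open my $out, '>', $state_file or die "Cannot write $state_file: $!";
    print {$out} "$new_sum\n";
    close $out;

The query URL (or a list of them) could just as easily be read from a CSV file, as the question suggests, and storing a parsed list of result links instead of a whole-page checksum would let the alert name the new results rather than merely reporting that something changed.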
