
Advice on the best way to spider, crawl, or collect audio content from the internet

What I'm actually trying to do is figure out how BEEMP3.COM works.

Because of the site's speed, I doubt they scrape other sites/sources on the spot. They probably use some sort of database (PostgreSQL or MySQL) to store the "results" and then just query the search terms.
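To make that guess concrete, here is a minimal sketch of a pre-built index that is queried by search term. It uses Python's built-in `sqlite3` instead of PostgreSQL or MySQL purely so it runs without a server; the `tracks` table, its columns, and the function names are all hypothetical, not anything known about how BEEMP3.COM is built.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the link index. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        " url TEXT PRIMARY KEY,"
        " title TEXT NOT NULL)"
    )
    return conn


def add_track(conn, url, title):
    """Store one MP3 link; a URL that is already indexed is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO tracks (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()


def search(conn, term):
    """Return (title, url) rows whose title contains the search term."""
    rows = conn.execute(
        "SELECT title, url FROM tracks WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

A search then only touches the local table, which would explain the speed. A substring `LIKE` match is the simplest option; a real site would more likely use a full-text index.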

My question is: how do you think they crawl/spider, or otherwise get, the MP3 files/content? They must either have some algorithm to spider the internet or use the Google "index of mp3" trick to find hosts with the raw MP3 files.

Any comments and tips or ideas are appreciated :)


QueryPath is a great tool for building a web spider.

I'm guessing they find MP3s using a combination approach - they have a list of "seed sites" (gathered from Google, Usenet, or manually inserted) that they use as starting points for the search, and then set spiders running against them.

You need to write a script that will:

  • Take a webpage as a starting point
  • Fetch the webpage data (use cURL)
  • Use a regular expression to extract (a) any links (b) any links to mp3 files
  • Place any MP3 links into a database
  • Add the list of links to other webpages to a queue for processing through the above method
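The steps above can be sketched roughly as follows. This is a Python illustration rather than PHP with cURL, and the names `fetch`, `extract_links`, and `crawl` are made up for the example. It collects MP3 links into a set instead of a database, and it ignores things a real spider needs (robots.txt, rate limiting, per-host limits).

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Simple href matcher; a regex is fragile on messy HTML but mirrors the steps above.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url):
    """Download a page and return its body as text (the role cURL plays above)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return (page_links, mp3_links) found in html, resolved against base_url."""
    pages, mp3s = [], []
    for href in HREF_RE.findall(html):
        url = urldefrag(urljoin(base_url, href)).url
        if not url.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, and similar
        if url.lower().split("?")[0].endswith(".mp3"):
            mp3s.append(url)
        else:
            pages.append(url)
    return pages, mp3s


def crawl(seed_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from seed_url; returns the set of MP3 URLs found."""
    queue = deque([seed_url])
    seen = {seed_url}
    found = set()
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # unreachable page: move on to the next one
        fetched += 1
        pages, mp3s = extract_links(html, url)
        found.update(mp3s)
        for link in pages:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found
```

Passing the fetch function in as `fetch_page` keeps the queue logic testable without network access. The `max_pages` cap is there because an uncapped crawl of the open web never finishes.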

You'll also need to re-check your MP3 links regularly to erase any bad links.
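One possible way to do that re-check, again as a Python sketch with hypothetical function names: send a `HEAD` request for each stored URL and treat anything that errors out or does not return a 2xx status as dead. Some servers reject `HEAD`, so a real checker might fall back to a ranged `GET` before deleting a link.

```python
from urllib.request import Request, urlopen


def is_alive(url, timeout=10):
    """Return True if a HEAD request for url ends in a 2xx response."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers HTTP errors, DNS failures, and timeouts;
        # ValueError covers malformed URLs.
        return False


def split_dead_links(urls, check=is_alive):
    """Partition urls into (alive, dead) lists using the given check function."""
    alive, dead = [], []
    for url in urls:
        (alive if check(url) else dead).append(url)
    return alive, dead
```

The `dead` list is what you would delete from the database on each scheduled run.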


Alternatively, you can crawl existing MP3 search sites like beemp3.com, extract all the direct download links, and save them to your database. You need only two things: (I) Simple HTML DOM, and (II) an application that can take the extracted links and put them into your database.

See what I did at http://kenyaforums.com/bongomp3_external_link_search_engine_at_kenyaforums_com.php

Feel free to ask if anything is unclear.
