Python data scraping

2023-03-31 12:00 Q&A author:

I want to download a couple songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.

The problem is that when I open the site with urllib2 with my video ID plugged in, http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ, I get the page, but they are tricky about it and load the info after the initial page load with some JS/AJAX stuff. So when I try to scrape the URL of the download link, it literally isn't on the page because it hasn't been loaded yet.

Anyone know how I can maybe trigger this js loader in my python script, or something?

Here is the relevant empty html BEFORE the content that I want is loaded into it.

<div id="link_box" style="display:none">
   <div id="link_box_title" style="font-weight:bold; text-decoration:underline">
   </div>
   <div class="row">
    <div id="link_box_bb_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="BBCodeLink" onclick="sAll(this)" />
   </div>
   <div class="row">
    <div id="link_box_html_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="HTMLLink" onclick="sAll(this)" />
   </div>
   <div class="row">
    <div id="link_box_direct_code_title" style="font-weight:bold">
    </div>
    <input type="text" id="DirectLink" onclick="sAll(this)" />
   </div>
  </div>
  <div id="v-ads">
  </div>
  <div id="dl_link">
  </div>
  <div id="progress">
  </div>
  <div id="loader">
   <img src="ajax-loader-b.gif" alt="loading.." width="16" height="11" />
  </div>
 </div>
 <div class="clear">
 </div>
</div>

The API is JSON-based, so the contents of the HTML files won't give you any clue about where to find the files. A good idea when exploring web services like this one is to open the Network tab in Chrome's developer tools and see which URLs it requests when you interact with the page. That exercise showed me that two URLs in particular seem interesting:

http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DKMU0tzLwhbE&xy=trve&r=1314700829128
http://www.youtube-mp3.org/api/itemInfo/?video_id=KMU0tzLwhbE&adloc=&r=1314700829314

The first url appears to be queuing a file for processing, the second to get the status of the processing job.

The second URL takes a video_id GET parameter that is the ID of the video on YouTube (http://www.youtube.com/watch?v=KMU0tzLwhbE) and returns the status of the decoding job. Its second and third parameters (adloc and r) seem irrelevant for this purpose, which you can verify by loading the URL with and without them.

The content of the page is:

info = { "title" : "Developers", 
         "image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg", 
         "length" : "3", "status" : "serving", "progress_speed" : "", 
         "progress" : "", "ads" : "", 
         "h" : "a0aa17294103c638fa7f5e0606f839d3" };

This is JSON data wrapped in a JavaScript assignment. The interesting bit is "a0aa17294103c638fa7f5e0606f839d3", which looks like a hash that the web service uses to refer to the decoded mp3 file. Also check out how the download link on the front page looks:

http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0aa17294103c638fa7f5e0606f839d3
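As a rough sketch of how that response could be handled in Python: the payload is a JavaScript assignment rather than bare JSON, so the object has to be cut out before json.loads will accept it. The helper names parse_item_info and build_download_url are made up for illustration, and the sample payload is the one shown above.

```python
import json

# Sample response copied from the itemInfo URL above.
SAMPLE_PAYLOAD = '''info = { "title" : "Developers",
         "image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg",
         "length" : "3", "status" : "serving", "progress_speed" : "",
         "progress" : "", "ads" : "",
         "h" : "a0aa17294103c638fa7f5e0606f839d3" };'''


def parse_item_info(payload):
    """Cut the {...} object out of 'info = {...};' and parse it as JSON."""
    start = payload.index('{')
    end = payload.rindex('}') + 1
    return json.loads(payload[start:end])


def build_download_url(video_id, file_hash):
    """Assemble the download link in the format seen on the front page."""
    return 'http://www.youtube-mp3.org/get?video_id=%s&h=%s' % (video_id, file_hash)


info = parse_item_info(SAMPLE_PAYLOAD)
print(info['status'])
print(build_download_url('KMU0tzLwhbE', info['h']))
```

Parsing the object with json is a bit more robust than a regex if the site ever changes the spacing around the keys, but it still assumes the response keeps its current shape.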

Now we have all the missing pieces of the puzzle. First, we take the URL of a YouTube video (http://www.youtube.com/watch?v=iKP7DZmqdbU), URL-quote it, and feed it to the API using this URL:

http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DiKP7DZmqdbU&xy=trve
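For reference, this is roughly how the quoting step looks in Python 3 (on Python 2, quote lives in urllib instead of urllib.parse). The variable names are just for illustration.

```python
from urllib.parse import quote

yt_url = 'http://www.youtube.com/watch?v=iKP7DZmqdbU'

# quote() leaves '/' alone by default but escapes ':', '?' and '='.
push_url = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve' % quote(yt_url)
print(push_url)
```

The default behaviour of quote happens to match the URL the browser sends, which is why the slashes stay unescaped.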

Then wait a few moments until the decoding job is done, checking its status at:

http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU
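Instead of waiting a fixed amount of time, one option is to poll the info URL until the job reports "serving". The sketch below is only an illustration: wait_for_hash is a made-up helper, the fetch argument stands in for whatever function downloads a URL and returns its text, and the "converting" status in the demo is invented (only "serving" shows up in the response above).

```python
import json
import time


def wait_for_hash(fetch, video_id, attempts=10, delay=3.0):
    """Poll itemInfo until the job reports "serving"; return the hash or None.

    fetch is any callable that takes a URL and returns the response text.
    """
    info_url = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s' % video_id
    for attempt in range(attempts):
        payload = fetch(info_url)
        # The response looks like 'info = {...};', so parse only the object.
        info = json.loads(payload[payload.index('{'):payload.rindex('}') + 1])
        if info.get('status') == 'serving' and info.get('h'):
            return info['h']
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Offline demonstration with canned responses instead of real HTTP calls.
# The "converting" status is invented; only "serving" appears in the post.
canned = iter([
    'info = { "status" : "converting", "h" : "" };',
    'info = { "status" : "serving", "h" : "2e4b61b6ddc8bf83f5a0e4e4ee0635bb" };',
])
file_hash = wait_for_hash(lambda url: next(canned), 'iKP7DZmqdbU', delay=0)
print(file_hash)
```

With real requests, fetch could be something like lambda url: urlopen(url).read().decode('utf-8'). Keeping the delay at a few seconds avoids hammering the site.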

Take the hash found in the info url to construct the download url:

http://www.youtube-mp3.org/get?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb

Note that the webmaster of the site may not want to be scraped and may take countermeasures if people start to (in the webmaster's eyes) abuse the site. For example, it seems to use referer protection, so clicking the links in this post won't work; you have to copy them and load them in a new browser window.
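If the referer check does get in the way of a script, one thing to try is sending a Referer header yourself. This is a guess at what the site expects, not something confirmed: the sketch assumes the check is satisfied by a Referer pointing at the site's own front page.

```python
from urllib.request import Request  # Python 3; urllib2.Request on Python 2

download_url = ('http://www.youtube-mp3.org/get'
                '?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb')

# Assumption: the site only checks that the Referer points at its own domain.
request = Request(download_url, headers={'Referer': 'http://www.youtube-mp3.org/'})
print(request.get_header('Referer'))

# To actually fetch the file, pass the request to urlopen:
#     from urllib.request import urlopen
#     mp3_bytes = urlopen(request).read()
```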

Test code:

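# Runs on both Python 2 and Python 3.
from re import findall
from time import sleep
try:
    from urllib import urlopen, quote       # Python 2
except ImportError:
    from urllib.request import urlopen      # Python 3
    from urllib.parse import quote

yt_code = 'gijypDkEqUA'

yt_url = 'http://www.youtube.com/watch?v=%s' % yt_code
push_url_fmt = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve'
info_url_fmt = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s'
download_url_fmt = 'http://www.youtube-mp3.org/get?video_id=%s&h=%s'

# Queue the video for conversion.
push_url = push_url_fmt % quote(yt_url)
urlopen(push_url).read()

# Give the conversion job some time to finish.
sleep(10)

# Fetch the job status and pull the file hash out of the response.
info_url = info_url_fmt % yt_code
data = urlopen(info_url).read().decode('utf-8')  # assumes a UTF-8 response
res = findall('"h" : "([^"]*)"', data)
if res:
    print('Download here: ' + download_url_fmt % (yt_code, res[0]))
else:
    print('No hash found; the job may still be running.')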

You could use Selenium to interact with the JS stuff and then combine it with BeautifulSoup, or do everything with Selenium, whichever you prefer.

http://seleniumhq.org/

Selenium is a tool for browser automation and has bindings for a few languages, including Python. It takes a running instance of Firefox/IE/Chrome and lets you script it (I suggest using the Selenium WebDriver for this simple problem, not the whole Selenium server).
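A rough sketch of the WebDriver approach, assuming the selenium package and Firefox with its driver are installed. The function name fetch_download_link is made up, and the CSS selector is a guess: the HTML in the question only shows an empty dl_link div, so the assumption here is that the page's JS puts an anchor inside it once the conversion finishes.

```python
def fetch_download_link(video_id, timeout=60):
    """Open the converter page in Firefox and return the download link's href."""
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get('http://www.youtube-mp3.org/?c#v=%s' % video_id)
        # Assumption: the page's JS inserts an <a> into the #dl_link div.
        link = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#dl_link a'))
        )
        return link.get_attribute('href')
    finally:
        driver.quit()
```

Calling fetch_download_link('lV7r8PiuecQ') would then return the link, or raise a TimeoutException if nothing shows up within the timeout. If you would rather parse the result with BeautifulSoup, driver.page_source gives you the HTML after the JS has run.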

You're going to have to work through http://www.youtube-mp3.org/client.js and figure out the exact information that is being passed around. This could allow you to post a request, parse the response, and download from the correct scraped URL.
