Beautifulsoup and AJAX-table problem

2023-03-03 03:55 Q&A author:

I am making a script that scrapes the games from the Team Liquid database of international StarCraft 2 games (http://www.teamliquid.net/tlpd/sc2-international/games).

However, I have come across a problem. My script loops through all the pages, but I think the Team Liquid site uses some kind of AJAX to update the table. As a result, when I use BeautifulSoup I can't get the right data.

So I loop through these pages:

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC

http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC etc...

When you open these yourself you see different pages; however, my script keeps getting the same first page every time. I think this is because, when you open the other pages, a loading indicator appears briefly while the table is updated with the games for the correct page. So I guess BeautifulSoup is too fast and needs to wait until the table has finished loading and updating.

So my question is: how can I make sure it takes the updated table?

I now use this code to get the contents of the table, after which I put the contents in a .csv:

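from urllib.request import urlopen
from bs4 import BeautifulSoup

# url is one of the page URLs listed above
html = urlopen(url).read().lower()
bs = BeautifulSoup(html, "html.parser")
# Find the games table by its id, then collect all of its rows
table = bs.find("table", id="tblt_table")
rows = table.find_all("tr")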

When you try to scrape a site that uses AJAX, it's best to see what the JavaScript code actually does. In many cases it simply retrieves XML or HTML, which can be even easier to scrape than the non-AJAX content. It just requires looking at some source code.

In your case, the site retrieves the HTML code for the table control by itself (instead of refreshing the whole page) from a special URL and dynamically replaces it in the browser DOM. Looking at http://www.teamliquid.net/tlpd/tabulator/ajax.js, you'd see this URL is formatted like this:

http://www.teamliquid.net/tlpd/tabulator/update.php?tabulator_id=1811&tabulator_page=1&tabulator_order_col=1&tabulator_order_desc=1&tabulator_Search&tabulator_search=

So all you need to do is to scrape this URL directly with BeautifulSoup and advance the tabulator_page counter each time you want the next page.
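The approach above could be sketched roughly as follows. This is only an illustration under a few assumptions: page_url and fetch_rows are hypothetical helper names, the tabulator_id of 1811 is simply the value from the example URL (the table you want may use a different id), the bare tabulator_Search parameter is left out on the guess that the server does not need it, and the response is assumed to contain ordinary tr/td markup.

```python
import csv
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL above.
BASE = "http://www.teamliquid.net/tlpd/tabulator/update.php"


def page_url(page, tabulator_id=1811):
    """Build the update.php URL for one page of the table (hypothetical helper)."""
    params = {
        "tabulator_id": tabulator_id,  # assumption: value from the example URL
        "tabulator_page": page,
        "tabulator_order_col": 1,
        "tabulator_order_desc": 1,
        "tabulator_search": "",
    }
    return BASE + "?" + urlencode(params)


def fetch_rows(page):
    """Download one page of the table and return its rows as lists of cell text."""
    # Third-party import kept local so page_url works without bs4 installed.
    from bs4 import BeautifulSoup

    html = urlopen(page_url(page)).read()
    soup = BeautifulSoup(html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in soup.find_all("tr")
    ]


if __name__ == "__main__":
    with open("games.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for page in range(1, 4):  # first three pages only, as a demonstration
            writer.writerows(fetch_rows(page))
```

Because each request goes straight to the URL the page's own script calls, there is nothing to wait for: the response already contains the rows for the requested page.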

You can't with just BeautifulSoup; it doesn't execute JavaScript for you.

You might have more luck with Selenium, assuming you don't want to parse the relevant JavaScript yourself and make the calls that the AJAX code would be making to get the data.

For sites with dynamic content loaded through AJAX and JavaScript, I have used PhantomJS. It doesn't require opening a browser window because it is itself a fully scriptable web browser. PhantomJS is fast and includes native support for various web standards such as DOM handling, CSS selectors, JSON, and Canvas.

If you aren't a JavaScript ninja, you should look at CasperJS, which is built on top of PhantomJS. It eases the process of defining a full navigation scenario and provides useful high-level functions.

Here is an example of how CasperJS works:

CasperJs and Jquery with chained Selects

It seems that the cause of your problem is that neither BeautifulSoup nor urllib is able to execute the JavaScript inside the page.

You could use Selenium to open the page in a real browser, then extract the HTML once it is ready and parse it with BeautifulSoup.
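A minimal sketch of that Selenium route might look like the code below. Several things here are assumptions rather than tested facts about the site: games_url and rendered_rows are hypothetical helper names, Firefox with a matching driver is assumed to be installed, and waiting for a row inside #tblt_table may return before the AJAX update has actually finished, so a stricter wait condition could be needed in practice.

```python
BASE = "http://www.teamliquid.net/tlpd/sc2-international/games"


def games_url(page, table_id=948):
    """Page URL in the '#tblt-<id>-<page>-1-DESC' form quoted in the question."""
    return f"{BASE}#tblt-{table_id}-{page}-1-DESC"


def rendered_rows(page, timeout=10):
    """Render one page in a real browser and return the table rows as text."""
    # Third-party imports kept local so games_url works without them installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # A fresh browser per page is slow, but it avoids relying on the site
    # reacting to a fragment-only navigation inside an existing session.
    driver = webdriver.Firefox()  # assumption: Firefox and its driver exist
    try:
        driver.get(games_url(page))
        # Assumption: a row inside #tblt_table means the table is usable.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#tblt_table tr"))
        )
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    table = soup.find("table", id="tblt_table")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


if __name__ == "__main__":
    for row in rendered_rows(2):
        print(row)
```

This is considerably slower than requesting the table data directly, but it does not depend on knowing which URL the page's script calls.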
