BeautifulSoup and AJAX-table problem
I am making a script that scrapes the games of the Team Liquid database of international StarCraft 2 games. (http://www.teamliquid.net/tlpd/sc2-international/games)
However, I have come across a problem. My script loops through all the pages, but I think the Team Liquid site uses some kind of AJAX to update the table. As a result, I can't get the right data with BeautifulSoup.
So I loop through these pages:
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-1-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-2-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-3-1-DESC
http://www.teamliquid.net/tlpd/sc2-international/games#tblt-948-4-1-DESC etc...
When you open these yourself you see different pages, but my script keeps getting the same first page every time. I think this is because, when you open one of the other pages, a loading indicator appears briefly while the table is updated with the games for the correct page. So I guess BeautifulSoup is too fast and needs to wait until the table has finished loading and updating.
So my question is: how can I make sure it takes the updated table?
I now use this code to get the contents of the table, after which I put the contents in a .csv:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read().lower()
bs = BeautifulSoup(html, "html.parser")
# Find the games table by its id, then collect all of its rows.
table = bs.find("table", id="tblt_table")
rows = table.find_all("tr")
When you try to scrape a site using AJAX, it's best to see what the javascript code actually does. In many cases it simply retrieves XML or HTML, which would be even easier to scrape than the non-AJAXy content. It just requires looking at some source code.
In your case, the site retrieves the HTML code for the table control by itself (instead of refreshing the whole page) from a special URL and dynamically replaces it in the browser DOM. Looking at http://www.teamliquid.net/tlpd/tabulator/ajax.js, you'd see this URL is formatted like this:
http://www.teamliquid.net/tlpd/tabulator/update.php?tabulator_id=1811&tabulator_page=1&tabulator_order_col=1&tabulator_order_desc=1&tabulator_Search&tabulator_search=
So all you need to do is to scrape this URL directly with BeautifulSoup and advance the tabulator_page counter each time you want the next page.
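A minimal sketch of that approach, assuming the endpoint still accepts the parameters shown in the URL above. The helper names (update_url, extract_rows, scrape) are hypothetical, the bare tabulator_Search parameter is left out on the assumption that it is optional, and tabulator_id=1811 is simply the value from the example URL and may differ for the table you want.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL above.
BASE = "http://www.teamliquid.net/tlpd/tabulator/update.php"


def update_url(page, tabulator_id=1811):
    """Build the update.php URL for one page of the table."""
    params = {
        "tabulator_id": tabulator_id,   # assumption: value from the example URL
        "tabulator_page": page,         # the counter to advance for each page
        "tabulator_order_col": 1,
        "tabulator_order_desc": 1,
        "tabulator_search": "",
    }
    return BASE + "?" + urlencode(params)


def extract_rows(html):
    """Return the text of every cell, one list per table row."""
    # Imported here so update_url() works even without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in soup.find_all("tr")
    ]


def scrape(first_page, last_page):
    """Yield the rows of each page in turn (performs network requests)."""
    for page in range(first_page, last_page + 1):
        html = urlopen(update_url(page)).read()
        yield from extract_rows(html)
```

Calling list(scrape(1, 4)) would then collect the rows of the first four pages, ready to be written out with the csv module.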
You can't with just BeautifulSoup; it doesn't execute JavaScript for you.
You might have more luck with Selenium, assuming you don't want to read the relevant JavaScript yourself and make the same calls the AJAX code makes to get the data.
For sites with dynamic content loaded through AJAX and JavaScript, I have used PhantomJS. It doesn't require opening a browser window because it is itself a fully scriptable headless web browser. PhantomJS is fast and includes native support for various web standards such as DOM handling, CSS selectors, JSON and Canvas.
If you aren't a JavaScript ninja, you should look at CasperJS, which is built on top of PhantomJS. It eases the process of defining a full navigation scenario and provides useful high-level functions.
Here is an example of how CasperJS works:
CasperJs and Jquery with chained Selects
It seems that the cause of your problem is that neither BeautifulSoup nor urllib can execute the JavaScript inside the page.
You could use Selenium to open the page in a real browser, then extract the HTML once it is ready and parse it with BeautifulSoup.
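A rough sketch of that combination, assuming Selenium and a browser driver are installed. The helper names (hash_url, rendered_rows) are hypothetical, and waiting for the element with id tblt_table is only an approximate readiness signal: the table may already exist before the AJAX update finishes, so a stricter wait condition might be needed in practice.

```python
def hash_url(page, table_id=948):
    """Build the browser-facing URL for one page of the games table."""
    return ("http://www.teamliquid.net/tlpd/sc2-international/games"
            f"#tblt-{table_id}-{page}-1-DESC")


def rendered_rows(driver, page, timeout=10):
    """Load one page in the browser and return the <tr> tags of the table."""
    # Imported here so hash_url() works without Selenium or bs4 installed.
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Go through a blank page first so each URL triggers a full page load
    # rather than a fragment-only navigation within the same document.
    driver.get("about:blank")
    driver.get(hash_url(page))

    # Assumption: the table being present means the page is ready to parse.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, "tblt_table"))
    )

    # Hand the browser-rendered HTML over to BeautifulSoup.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", id="tblt_table")
    return table.find_all("tr")
```

A driver can be created with selenium.webdriver.Firefox() (or another browser class), passed to rendered_rows for each page number, and closed with driver.quit() at the end.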