How to serve up HTML snapshots of an AJAX app with a headless browser, from PHP?

I'm having real trouble working out how to fire up a headless browser to serve up static HTML snapshots of a site that uses JavaScript (sammy.js, to be specific) to deliver the AJAX content.

I'm working off Google's specification for making AJAX apps crawlable:

http://code.google.com/web/ajaxcrawling/docs/getting-started.html

which for the most part is great and very clear, and I'm having no problems picking up the ?_escaped_fragment_ URLs.
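
For reference, picking those up is straightforward; something like the following, where serve_snapshot() is just a made-up stand-in for whatever actually builds the page:

    <?php
    // Minimal sketch of the detection step. serve_snapshot() is a
    // hypothetical placeholder for the snapshot-building logic.
    if (isset($_GET['_escaped_fragment_'])) {
        // Googlebot requests example.com/?_escaped_fragment_=/about
        // in place of example.com/#!/about; PHP has already URL-decoded
        // the value when populating $_GET.
        $route = $_GET['_escaped_fragment_'];
        serve_snapshot($route); // hypothetical: echoes the snapshot HTML
        exit;
    }
    // No _escaped_fragment_ parameter: serve the normal AJAX app shell.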

Most of the templating is done server-side, so I was tempted to just write a PHP snapshot-building file that reuses the same regex matches from the sammy app code (there are a lot of routes) to include the various template files. However, a lot of the action happens in the JavaScript app, so that would mean mirroring all of that processing in PHP, and then maintaining both files side by side, cross-language, which is a lot of work!
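
To give an idea of what that mirroring involves, here's a rough sketch. The patterns and template paths are invented, but every sammy route would need a matching entry in a PHP table like this:

    <?php
    // Rough sketch of the mirroring approach: the same route patterns
    // the sammy.js app matches, duplicated as a PHP regex => template
    // map. Patterns and template paths are illustrative only.
    $route = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '/';

    $routes = array(
        '#^/articles/(\d+)$#' => 'templates/article.php',
        '#^/tags/([\w-]+)$#'  => 'templates/tag.php',
        '#^/?$#'              => 'templates/home.php',
    );

    foreach ($routes as $pattern => $template) {
        if (preg_match($pattern, $route, $matches)) {
            include $template; // template reads $matches for its parameters
            exit;
        }
    }

    header('HTTP/1.1 404 Not Found');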

Now, I've read that you can use a headless browser to 'render' the page and execute all the JavaScript (matching the #!/ route and delivering the correct content for the request), then return the entire DOM contents as HTML to be served to googlebot.
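
If I've understood it right, the PHP side of that would just be glue: shell out to the renderer and pass its output through. Something like this, where render-page is a made-up stand-in for whatever headless tool actually does the rendering:

    <?php
    // Sketch only: hand the route to a headless renderer and return
    // its output. `render-page` is a hypothetical command-line wrapper
    // that loads the URL, runs the JavaScript, and prints the final DOM.
    $route = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '/';
    $url   = 'http://example.com/#!' . $route;

    $html = shell_exec('render-page ' . escapeshellarg($url));

    if ($html === null || $html === '') {
        header('HTTP/1.1 500 Internal Server Error');
        exit;
    }

    header('Content-Type: text/html; charset=utf-8');
    echo $html; // this is the snapshot googlebot indexes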

I've searched long and hard and can't find any step-by-step guides on running headless browsers from PHP (for total Java newbs), which I suppose means I just don't know what to search for.

What I'm wondering: is it even more work to set up and use a headless browser to serve up these HTML snapshots? And if so, is it worth doing anyway?

Also, if there are any guides you could point me to, that'd be great!

Thanks!

Joss


I think you're better off replicating on the server what you've got on the client side. Though it might seem like an inefficient undertaking, it's at least got a clear and limited scope.

Most of the reputable headless browsers are designed as testing tools for application development. Accordingly, they are very open-ended in their structure, which is a good thing if you're responsible for the QA of an application, but not so much if you want to do just one specific thing with it.

I used Selenium-RC to do just one specific thing on a particular project, and found that dealing with all the Selenium-related concerns quickly became a project unto itself. Though Selenium-RC could certainly accomplish what you're trying to do, it just seems like a big commitment given the specificity of what you're looking to accomplish.

(Being a complete Java amateur myself, I can't really comment on HTMLUnit, but on spec alone, it seems like it's probably more appropriate for your needs than Selenium-RC. It wouldn't surprise me though if using it had some of the same setup and management demands.)

So back to the alternative of duplicating everything in PHP...

Keep in mind that the HTML snapshots don't need to be exactly identical to what renders in-browser: as long as you've got the core content and the key navigational links, Googlebot will have almost everything it needs. Do you also need every single page on your site indexed? Or could you identify the handful of routes that matter most, and just provide snapshots of those? You could also use web analytics or server log data to better inform snapshot priorities.
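
As a rough sketch of what I mean (the route list and helpers here are invented):

    <?php
    // Illustrative only: build full snapshots for a short list of
    // high-priority routes, and fall back to a bare-bones page (core
    // content plus navigation links) for everything else.
    $route = isset($_GET['_escaped_fragment_']) ? $_GET['_escaped_fragment_'] : '/';

    $priority_routes = array('/', '/about', '/articles'); // hypothetical list

    if (in_array($route, $priority_routes, true)) {
        serve_snapshot($route);          // hypothetical full snapshot builder
    } else {
        include 'templates/minimal.php'; // hypothetical stripped-down page
    }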


To anybody wondering - I've since worked out how to do exactly what was needed using node.js. I'll publish it on GitHub soon and update the question...
