Making AJAX Applications Crawlable? How to build a simple web service on Google App Engine to produce HTML Snapshots?

2023-01-11 23:26 Q&A author:

Real World Problem:

I have my app hosted on Heroku, which (to my knowledge) cannot offer a way to run a headless (GUI-less) browser, such as HTMLUnit, to generate HTML Snapshots of my AJAX content for Googlebot to index.

My Proposed Solution:

If you haven't already, I suggest reading Google's Full Specification for Making AJAX Applications Crawlable.

Imagine I have:

a Sinatra app hosted on Heroku on the domain http://example.com
the app has tabs along the top of the page TabA, TabB and TabC
under each tab is SubTab1, SubTab2, SubTab3
on load, if the URL is http://example.com#!tab=TabA&subtab=SubTab3, client-side JavaScript reads location.hash and loads the TabA, SubTab3 content via AJAX.

Note: the Hash Bang (#!) is part of the Google spec.
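
As a rough illustration of the hash-bang handling described above, here is a minimal Python sketch of turning such a URL into tab state. In the real app this logic lives in client-side JavaScript reading location.hash; the function name parse_hashbang is hypothetical and only shows the parsing step.

```python
from urllib.parse import parse_qs, urlsplit


def parse_hashbang(url):
    """Return the key/value state encoded after '#!' in a URL, or {} if none."""
    fragment = urlsplit(url).fragment  # e.g. "!tab=TabA&subtab=SubTab3"
    if not fragment.startswith("!"):
        return {}  # a plain anchor such as "#top" carries no AJAX state
    pairs = parse_qs(fragment[1:], keep_blank_values=True)
    # Keep the last value if a key is repeated.
    return {key: values[-1] for key, values in pairs.items()}


state = parse_hashbang("http://example.com#!tab=TabA&subtab=SubTab3")
print(state["tab"], state["subtab"])
```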

I would like to build a simple "web service" hosted on Google App Engine (GAE) that:

Accepts a URL param e.g. http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Runs HTMLUnit to open http://example.com#!tab=TabA&subtab=SubTab3 and run the client-side JavaScript on the server.
HTMLUnit returns the DOM once everything is complete (or something like 45 seconds has passed).
The returned content could be sent back via JSON/JSONP, or alternatively a URL could be returned pointing to a file generated and stored on the Google App Engine server (for file-based "cached" results)... open to suggestions here. If a URL to a file were returned, you could use cURL to fetch the source code (aka an HTML Snapshot).
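
The request handling in the steps above could be sketched roughly as follows. This is not HTMLUnit code: the render argument is a hypothetical stand-in for whatever headless browser produces the DOM, the in-memory dict replaces the file-based cache mentioned above, and the function name and CACHE_TTL_SECONDS value are assumptions made for illustration.

```python
import json
import time
from urllib.parse import parse_qs

CACHE_TTL_SECONDS = 3600  # assumption: how long a snapshot counts as fresh


def handle_snapshot_request(query_string, render, cache, now=time.time):
    """Return (status, json_body) for a query string such as 'url=http%3A...'.

    render: hypothetical callable taking a URL and returning rendered HTML.
    cache:  dict mapping URL -> (timestamp, html).
    """
    params = parse_qs(query_string)  # also URL-decodes the 'url' value
    target = params.get("url", [""])[0]
    if not target.startswith(("http://", "https://")):
        return 400, json.dumps({"error": "missing or invalid url parameter"})

    cached = cache.get(target)
    if cached and now() - cached[0] < CACHE_TTL_SECONDS:
        html = cached[1]  # serve the stored snapshot
    else:
        html = render(target)  # the headless browser would run here
        cache[target] = (now(), html)
    return 200, json.dumps({"url": target, "html": html})
```

Passing the renderer in as an argument keeps the sketch independent of any particular headless browser and makes the caching behaviour easy to try out with a fake.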

My http://example.com app would need to manage the call to http://htmlsnapshot.appspot.com... basically:

Catch Googlebot's call to http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3 (the Googlebot crawler escapes certain characters, e.g. %26 = &).
Send a request from the backend to http://htmlsnapshot.appspot.com?url=http://example.com#!tab=TabA&subtab=SubTab3 (url param should be URLEncoded)
Render the returned HTML Snapshot to the frontend.
Google indexes the content and we rejoice!
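
The URL handling in the first two steps above can be sketched in Python as below. The app in question is Sinatra, so this only illustrates the logic; both function names are hypothetical. It also shows why the url param has to be encoded: a raw # would start a fragment, which the client never sends to the server, so the snapshot service would never see the tab state.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def escaped_fragment_to_hashbang(url):
    """Map a crawler URL with _escaped_fragment_ back to its '#!' form.

    Returns None when the URL carries no _escaped_fragment_ parameter.
    """
    parts = urlsplit(url)
    fragment = None
    rest = []
    # parse_qsl URL-decodes values, so %26 becomes & again.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "_escaped_fragment_":
            fragment = value
        else:
            rest.append((key, value))  # keep any unrelated query params
    if fragment is None:
        return None
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(rest), "!" + fragment)
    )


def snapshot_request_url(service_base, page_url):
    """Build the snapshot-service URL with page_url safely URL-encoded."""
    return service_base + "?" + urlencode({"url": page_url})


page = escaped_fragment_to_hashbang(
    "http://example.com/?_escaped_fragment_=tab=TabA%26subtab=SubTab3"
)
print(page)
print(snapshot_request_url("http://htmlsnapshot.appspot.com", page))
```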

I don't have any experience with Google App Engine or Java or HTMLUnit.

I might be able to figure it out... and will post my results if I do.

Otherwise I feel this is a VERY good opportunity for someone to write a kick-ass blog post that outlines a novice's step-by-step guide to setting up a web service like this.

This will introduce more people to the excellent (and free!) Google App Engine. Also it will undoubtedly encourage more people to adopt Google's specs for crawlable AJAX content... something we can all benefit from!

As Google's specification gains more acceptance the "hurdle" of setting up a Headless Browser is going to send many devs Googling for answers! Get in now with an answer for fame and glory! (edit: at the very least I will sing your praises).

Hit me up on twitter @_chrisjacob if you would like to discuss solutions.

I have successfully used HTMLUnit on App Engine. My GWT code to do this is available in the gwt-platform project; the results I got were similar to those of the HTMLunit-AppEngine test application by Amit Manjhi.

It should be relatively easy to use GWTP's current HTMLUnit support to do exactly what you describe, although you could likely do it in a simpler app. One problem I see is that App Engine requests have a 30-second timeout, so you can't have a page that takes HTMLUnit longer than that to process.

UPDATE: It's been a while, but I finally closed the long-standing issue about making GWT applications crawlable using GWTP. The documentation is not entirely there, but check out the issue: http://code.google.com/p/gwt-platform/issues/detail?id=1
