Google's Indexing XSLT Pages

My site has been created with XML as a data store and XSLT as the template layer. It appears that Google is not very good at indexing sites that are XML/XSLT based. Are there any efficient, easy-to-implement software components that can render the XSLT server-side just for the Googlebot indexer? It would be even better if they worked with PHP.


Take a look at the PHP XSLT processor.

http://php.net/manual/en/class.xsltprocessor.php

Use as follows:

<?php
# Build a small XML document in a string
$sXml  = "<xml>";
$sXml .= "<sudhir>hello sudhir</sudhir>";
$sXml .= "</xml>";

# Load the XML
$XML = new DOMDocument();
$XML->loadXML($sXml);

# Load the stylesheet and prepare the processor
$xslt = new XSLTProcessor();
$XSL = new DOMDocument();
$XSL->load('xsl/index.xsl', LIBXML_NOCDATA);
$xslt->importStylesheet($XSL);

# Transform and print the result
print $xslt->transformToXML($XML);
?>

(From http://php.net/manual/en/book.xsl.php)

UPDATE

You asked in the comment how to intercept a request from a specific user agent (e.g. Googlebot). There are various ways to do this, depending on the web server technology you are using.
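The simplest way is to check the user agent directly in PHP. A minimal sketch, assuming the crawler identifies itself with a user-agent string containing "Googlebot" (the file names data.xml and renderxslt.php are placeholders for your own):

<?php
// Sketch: serve the server-side transformation to the crawler,
// and the raw XML (styled client-side) to everyone else.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'Googlebot') !== false) {
    // Crawler: render the XSLT on the server (see renderxslt.php below)
    include 'renderxslt.php';
} else {
    // Browsers: send the XML and let the client apply the stylesheet
    header('Content-Type: application/xml');
    readfile('data.xml');
}
?>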

On Apache, one method would be to use mod_rewrite to internally divert the request to a PHP script containing code similar to the above. That script retrieves the XML from the originally requested URL and sends the transformed output to the client. The rewrite rule would use a RewriteCond that tests the HTTP_USER_AGENT variable for Googlebot's signature. Here is an example of the rule (untested, but you should get the idea):

RewriteCond %{HTTP_USER_AGENT} ^(.*)Googlebot(.*)$ [NC]
RewriteRule ^(.*\.xml.*)$ /renderxslt.php?url=$1 [L]

Briefly, the condition matches any user agent containing the string "Googlebot" (the [NC] flag makes the match case-insensitive), and the rewrite rule matches any URL containing ".xml", passing the full path to the renderxslt.php page as a query-string parameter.
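For completeness, here is a sketch of what renderxslt.php could look like, reusing the XSLTProcessor code from above. The stylesheet path and the ?url= parameter match the rule above; everything else (the sanity check in particular) is only illustrative:

<?php
// renderxslt.php — sketch; the rewrite rule above passes the
// originally requested XML path in ?url=...
$requested = isset($_GET['url']) ? $_GET['url'] : '';

// Crude guard against escaping the document root
if ($requested === '' || strpos($requested, '..') !== false) {
    header('HTTP/1.1 400 Bad Request');
    exit;
}

// Load the XML the crawler originally asked for
$XML = new DOMDocument();
$XML->load($_SERVER['DOCUMENT_ROOT'] . '/' . ltrim($requested, '/'));

// Apply the same stylesheet the browser would use
$XSL = new DOMDocument();
$XSL->load('xsl/index.xsl', LIBXML_NOCDATA);

$xslt = new XSLTProcessor();
$xslt->importStylesheet($XSL);

// Send the rendered HTML to the crawler
header('Content-Type: text/html; charset=utf-8');
print $xslt->transformToXML($XML);
?>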

A port of mod_rewrite exists for IIS too (http://www.isapirewrite.com/).

Alternatively, with IIS you could use an ASP.NET HTTP module to intercept the request, again checking Request.UserAgent (or Request.ServerVariables["HTTP_USER_AGENT"]) for Google's signature. You can then proceed in a similar manner to the above by reading the HTML generated by your PHP script, or alternatively by using the ASP.NET XML control:

<asp:Xml ID="Xml1" runat="server" DocumentSource="~/cdlist.xml" TransformSource="~/listformat.xsl"></asp:Xml>


Why not just exclude the directory that holds your xsl files in your robots.txt?
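For example, assuming the stylesheets live under /xsl/, the robots.txt entry would be:

User-agent: *
Disallow: /xsl/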
