Cannot access web pages using crawlers after removing index.php from URL

I removed index.php from my application's URLs, following the approach commonly described around the web, but I have run into a strange problem since then.

I can access the website in a browser like this: http://www.oakquotes.com/quotes/author/etc-etc (notice the lack of index.php). But when I try to access the same URL with a crawler, I get an HTTP 403 Forbidden error.

Here is the robots.txt file:

User-agent: *
Allow:/quotes/topic
Allow:/quotes/author
Disallow:

Sitemap: http://www.oakquotes.com/Sitemap.xml
Sitemap: http://www.oakquotes.com/author_sitemap.xml
Sitemap: http://www.oakquotes.com/topic_sitemap.xml

I think the culprit is the .htaccess rule that I wrote to remove index.php from the URL. Here is the .htaccess code:

<IfModule mod_rewrite.c>
    # For security reasons, Option followsymlinks cannot be overridden.
    #  Options +FollowSymlinks
    Options +SymLinksIfOwnerMatch
    RewriteEngine On
    RewriteBase /
    RewriteCond $1 !^(index\.php|images|robots\.txt|Sitemap\.xml|topic_sitemap\.xml|author_sitemap\.xml|search\.html|style|js|system|application|quotes/authors|quotes/topic|application/controllers|application/views)
    RewriteRule ^(.*)$ ./index.php/$1 [L]
</IfModule>

Am I missing a step? Please help me in this regard. Thanks.


With a regular browser, you also get a 403 error. The reason a web page is still displayed is the following:

A basic authentication will always return a 403 error. On most servers, a global ErrorDocument rule is defined for 403, pointing to something like 403.html. When a 403 error is triggered, the server internally looks up the error document 403.html. That document does not exist, so your RewriteRule matches the lookup and the server returns the rendered index.php page. This is why you see a web page even though the response is a 403. To complicate things further, because 403.html does not exist, looking it up triggers a 404 (page not found) as well. That is the problem with globally defined ErrorDocuments. In the same way, a 500 error will trigger a 404 because no 500.html file exists.

Try defining your own ErrorDocument handling in your .htaccess and you'll see the difference.

ErrorDocument 403 "Access denied"

This rule prints an error message when a 403 error is triggered and stops index.php from being rendered.
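
To make the placement concrete, here is a minimal sketch of how the ErrorDocument line could sit next to the rewrite rules from the question. It assumes the host allows ErrorDocument in .htaccess (AllowOverride FileInfo). The extra RewriteCond and the file names 403.html, 404.html and 500.html are assumptions for illustration, in case the global configuration points at such files. This has not been tested on the asker's server.

```apache
# Inline message: Apache sends this text directly for a 403,
# so no error file has to be looked up.
ErrorDocument 403 "Access denied"

<IfModule mod_rewrite.c>
    Options +SymLinksIfOwnerMatch
    RewriteEngine On
    RewriteBase /
    # Assumption: keep lookups of globally configured error pages
    # (hypothetical names) from being rewritten to index.php.
    RewriteCond %{REQUEST_URI} !^/(403|404|500)\.html$
    # Original exclusion list from the question, unchanged.
    RewriteCond $1 !^(index\.php|images|robots\.txt|Sitemap\.xml|topic_sitemap\.xml|author_sitemap\.xml|search\.html|style|js|system|application|quotes/authors|quotes/topic|application/controllers|application/views)
    RewriteRule ^(.*)$ ./index.php/$1 [L]
</IfModule>
```

With this in place, a request that is really forbidden should show "Access denied" in the browser as well, which makes it easier to see that the 403 is not specific to the crawler.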
