prevent crawler from following POST form action
I have a simple form on my site:
<form method="POST" action="Home/Import"> ... </form>
I get tons of error reports because crawlers send HEAD
requests to Home/Import.
Note that the form uses POST.
Questions
- Why do crawlers try to crawl those actions?
- Is there anything I can do to prevent it? (I already have Home in robots.txt)
- What is a good way to deal with those invalid (but correct) HEAD requests?
Details:
I use the Post-Redirect-Get pattern, if that matters. Platform: ASP.NET MVC 3.0 (C#) on IIS 7.5
1) A crawler typically makes HEAD requests to get the MIME type of the response.
2) A HEAD request shouldn't invoke the action handler for a POST. If I saw a lot of HEAD requests to a resource I didn't want crawled, I would give the crawler a link I do want it to crawl. Most crawlers read robots.txt.
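You can restrict request methods at the web server level. For Apache, for example (note that Apache's Limit directives treat HEAD as part of GET, so check that this really rejects HEAD on your setup):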
<LimitExcept GET POST>
deny from all
</LimitExcept>
You can also handle this in robots.txt by adding:
Disallow: /Home/Import
HEAD requests are used to get information about a page, such as its last-modified time or size, without fetching the whole page. It is an efficiency measure. Your script should not throw errors because of HEAD requests; those errors are probably caused by missing validation in your code. Your code could check whether the HTTP request method is HEAD and do something different.
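To illustrate the "check the method and do something different" idea, here is a minimal, framework-agnostic sketch in Python rather than the asker's ASP.NET MVC code. The function name handle_import, the "file" field, and the /Home/ImportDone redirect target are all hypothetical; in ASP.NET MVC the verb should be available as Request.HttpMethod.

```python
def handle_import(method, form):
    """Return (status, headers, body) for a request to the import endpoint.

    `method` is the HTTP verb; `form` is a dict of posted fields.
    All names here are illustrative assumptions, not the asker's code.
    """
    if method.upper() == "HEAD":
        # A HEAD request carries no form data, so skip the import logic
        # entirely and answer with headers only.
        return 200, {"Content-Type": "text/html"}, b""

    # Validate instead of assuming the posted fields are present.
    if "file" not in form:
        return 400, {"Content-Type": "text/plain"}, b"missing field: file"

    # ... the real import work would go here ...

    # Post-Redirect-Get: send the browser to a result page afterwards.
    return 303, {"Location": "/Home/ImportDone"}, b""
```

The point is only that the HEAD branch returns before any code that expects posted data can run, so a crawler's probe can no longer trigger an error report.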
Four years late, but still answering question #1: Google does indeed try to crawl POST forms, both by sending a plain GET to the URL and by sending actual POST requests. See their blog on this. The reason lies in the nature of the web: bad web developers hide their content links behind POST search forms. To reach that content, search engines have to improvise.
About #2: The reliability of robots.txt varies.
And about #3: The ultra-clean way would probably be to return HTTP status 405 Method Not Allowed, if HEAD requests in particular are your problem. I'm not sure browsers will like this, though.
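As a rough sketch of the 405 approach, here is a tiny WSGI application in Python (not ASP.NET MVC) that accepts only POST. The name import_app and the redirect target are hypothetical. The Allow header is included because the HTTP specification requires it on 405 responses.

```python
def import_app(environ, start_response):
    """WSGI app for a POST-only endpoint; every other verb gets a 405."""
    if environ["REQUEST_METHOD"] != "POST":
        # Tell the client which methods are acceptable for this resource.
        start_response(
            "405 Method Not Allowed",
            [("Allow", "POST"), ("Content-Length", "0")],
        )
        return [b""]

    # ... the real import work would go here ...

    # Post-Redirect-Get: redirect to a (hypothetical) result page.
    start_response(
        "303 See Other",
        [("Location", "/Home/ImportDone"), ("Content-Length", "0")],
    )
    return [b""]
```

A crawler sending HEAD to this endpoint gets a clean 405 with an empty body instead of an unhandled error.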