CGI: Using GET instead of POST when client can only use a URL string
My situation is that I have a POST cgi script that generates and returns a media file (mp3). One of the clients of this script wants to use an iOS media player (MPMoviePlayer) object that only takes an NSURL (basically the URL string) as input. The problem is that on iOS the POST parameters cannot be sent using just an NSURL. iOS can of course make POST requests using other objects (NSURLRequest), but the script takes a while to run, so it is not acceptable to run the request, save the file to disk, and then pass the file to the media player object.
At first I thought maybe we should change to GET, and although it wouldn't be good RESTful design, it wouldn't be so bad as long as I set up a robots.txt. But I found a similar question on SO where the opinions were unequivocal that GET is a bad idea if you are changing server state with a cgi script, even if it would make access easier:
Using GET instead of POST to delete data behind authenticated pages
I don't see an easy way out of this short of rewriting a media player object. Can anyone suggest an alternative to changing the script to GET while still using the url based player?
The scariest thing about using GET is not security or malicious hackers, because most of those issues affect POST as well. I'm mainly worried about new 'fixes' in search engine robots and the like that ignore robots.txt. Is there anything else?
Also, if there is a justification for why GET might be acceptable here, I would be interested in that answer as well. I was wondering if the search engine/bot issue is a non-issue here, because we won't have an HTML form that submits a GET request anywhere (the iOS app will do it from within the app), and cgi scripts do not define the request method they are used with (although they can detect it and abort).
I agree that POST is the correct HTTP method in this case, when the client request contains parameters that result in a file being created on the server. Besides being good practice, the browser pre-fetching problem you've linked to is a good cautionary tale of what could go wrong (Google has since reverted this "feature").
But, assuming you're going forward with GET instead of rewriting to use POST, I'll give some advice.
Regarding search engine crawling, this is a legitimate concern. I have been doing a lot of work with the robots exclusion standard recently, and in my experience it's a well-observed standard followed by all legitimate crawlers, like the search engines. At this point, it's 15+ years old, basically unchanged, with a very simple syntax. Although Google and company are crawling more and more things these days (new file types, inside search forms), I can't imagine them disobeying an explicit Disallow rule. In any case, definitely implement and test robots.txt for these URL patterns, and that should address the search engine issue.
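For example, a minimal robots.txt rule, assuming the generator script lives under a path like /cgi-bin/generate_mp3 (substitute whatever URL pattern your script actually uses):

```
User-agent: *
Disallow: /cgi-bin/generate_mp3
```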
I don't know if these URLs are exposed to the user in any way (browser address bar, web page link they can right-click) or are merely part of the HTTP traffic -- but if it's only the latter, you don't have much to worry about. The way deep URLs like these typically get indexed, if they're not followed through links on your site, is by users sharing the links via email, Twitter, etc. If somebody tweets a link, and googlebot crawls it, Google will obey robots.txt and not index it, but it may still be a problem for you that it's findable in Twitter. So, as much as possible, make any sensitive URLs operate only in the background (I'm sure 99.9% of your users won't bother sniffing TCP to get a URL to share).
As far as mitigating other issues with GET, I'd suggest using a nonce as part of your GET URL, along with a session key. This is a pretty standard thing in web apps to protect against CSRF, and it'll make these GET URLs only usable once by the client session (if that's too restrictive, you can make them available to the session longer ... use a time-based hash instead of a nonce). When the nonce is expired, return a 404 or 30X redirect.
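A minimal sketch of the nonce idea in Python, assuming you already have some session mechanism that gives you a session ID; the in-memory dict and the issue_nonce/consume_nonce names are purely illustrative, not from any particular framework:

```python
import secrets
import time

# Illustrative in-memory store: {nonce: (session_id, expiry_timestamp)}.
# A real deployment would use whatever session/cache backend you already have.
_nonces = {}

NONCE_TTL = 300  # seconds the one-time URL stays valid

def issue_nonce(session_id):
    """Create a single-use token to embed in the GET URL for this session."""
    nonce = secrets.token_urlsafe(32)
    _nonces[nonce] = (session_id, time.time() + NONCE_TTL)
    return nonce

def consume_nonce(session_id, nonce):
    """Return True once per nonce, only for the issuing session and before expiry."""
    entry = _nonces.pop(nonce, None)  # pop() makes the nonce single-use
    if entry is None:
        return False
    owner, expires = entry
    return owner == session_id and time.time() < expires
```

The CGI script would call something like consume_nonce() before generating the mp3, and return the 404 or 30X response described above when it fails.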
Something else you can do is, instead of having the CGI return the .mp3 file stream directly, have it do an HTTP redirect to a second URL that returns the file. Depending on whether you want this second URL to be a "public" URL, you could use the permanent (301) or temporary (302/303) response code. Search engines won't index the initial GET URL, since it results in a redirect, and it won't persist in a browser address bar for a user to copy/paste.
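A rough sketch of that redirect approach as a Python CGI script; the /media/ path, the output directory, and the generate_mp3_to helper are assumptions for illustration, not your actual code:

```python
#!/usr/bin/env python3
# Sketch: generate the file, then redirect the client to a second URL that serves it.
import os
import sys
import uuid
from urllib.parse import parse_qs

def generate_mp3_to(path, params):
    """Placeholder for the existing generation code."""
    ...

params = parse_qs(os.environ.get("QUERY_STRING", ""))
token = uuid.uuid4().hex
generate_mp3_to("/var/www/media/%s.mp3" % token, params)

# 303 See Other tells the client to fetch the result with a plain GET;
# 302 also works if you need to support older clients.
sys.stdout.write("Status: 303 See Other\r\n")
sys.stdout.write("Location: /media/%s.mp3\r\n" % token)
sys.stdout.write("\r\n")
```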
Ensuring these URLs are one-time, one-session only will address the security/load issues. The only exception I can think of is the browser pre-fetching case.
On prefetching ... the HTML specs are moving toward elective link prefetching with rel="prefetch" (currently, rel="next" accomplishes this in a link tag, but there is no equivalent to specify that an a tag may/should be prefetched) rather than a "default prefetch everything" behavior. A related issue is that there could also be pre-fetching by proxies (for instance, an office network's Squid, or mobile browsers like Opera Mini that use a proxy to fetch and recompress images), but I don't know of any that currently do this. So, I don't think you have anything to really worry about in this area, but if you want to be paranoid about it, the way out is to follow the rules and just use POST when modifying server state.
So, the key point here is that the GET request generates the mp3 on-the-fly, which means whether this is viable depends on a couple of features of the application:
if the mp3 is re-usable, the app should generate the mp3 either in advance or on request and then cache the result. Subsequent GET requests then simply serve the cached data, which is a correct use of GET (see the sketch after this list).
if the mp3 is custom to the user in that one download instance, then a POST request is more appropriate, as this is a request to the system to generate a resource in a certain context. This context is supplied by the data in the POST. Bear in mind that REST is not religious about the meaning of POST - you can use it to do a variety of things like this without straying outside the model.
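A minimal sketch of that caching approach, keying the cache on the request parameters so repeated GETs with the same inputs never regenerate the file; the cache directory and the generate_mp3 helper are assumptions for illustration:

```python
import hashlib
import os
from urllib.parse import urlencode

CACHE_DIR = "/var/cache/mp3"  # assumed location

def generate_mp3(params, path):
    """Placeholder for the existing generation code."""
    ...

def cached_mp3_path(params):
    """Return a file path for these parameters, generating it only on a cache miss."""
    # Key the cache on a canonical, sorted encoding of the parameters.
    key = hashlib.sha256(urlencode(sorted(params.items())).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".mp3")
    if not os.path.exists(path):
        generate_mp3(params, path)
    return path
```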
If you are concerned, instead, about download by robots, then there are a few tricks that you could employ:
Throttle access to the resource per IP. Multiple requests from the same address could be limited to one every minute, or whatever seems appropriate. This would give you some defence against denial-of-service attacks (a rough sketch follows this list).
Block access to the resource for known robots. There are lists of IP addresses of the common indexers, and you could use these to return 403 (Forbidden) responses on GET of your resource.
Filter access to the resource based on HTTP request headers. For example, you could look at the User-Agent and ensure that this is one of your "supported" browsers. This is probably the least advisable.
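A rough sketch of the first two tricks, assuming a CGI-style environment where REMOTE_ADDR is available; the blocklist prefixes and the one-request-per-minute window are illustrative choices, not recommendations:

```python
import os
import time

BLOCKED_PREFIXES = ("66.249.",)  # example crawler range; maintain your own list
MIN_INTERVAL = 60                # seconds between requests per IP
_last_seen = {}                  # in-memory {ip: last_request_time}; use a shared store in production

def check_access(environ=os.environ):
    """Return an HTTP status line to send, or None if the request may proceed."""
    ip = environ.get("REMOTE_ADDR", "")
    if ip.startswith(BLOCKED_PREFIXES):
        return "403 Forbidden"
    now = time.time()
    if now - _last_seen.get(ip, 0) < MIN_INTERVAL:
        return "429 Too Many Requests"  # or 503 with a Retry-After header
    _last_seen[ip] = now
    return None
```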
Hope that helps.
GET should be fine.
The important part here is that the client doesn't request a server state change.
See the specification:
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.
GET should be OK. You don't need to blindly follow a specification that was written for the general case. In your situation it would be much easier to use GET, so why not use it?