Escaping CRLF in HTTP multipart/form-data content type (iOS)
I'm trying to post a file using the multipart/form-data content type, and I have a question:
Shouldn't I escape CRLFs when I write the content of a file? I found a piece of code on the web and I think it might be wrong:
NSMutableURLRequest* req = [NSMutableURLRequest requestWithURL: url];
[req setHTTPMethod: @"POST"];
NSString* contentType = @"multipart/form-data; boundary=AaB03x";
[req setValue:contentType forHTTPHeaderField: @"Content-type"];
NSData* boundary = [@"\r\n--AaB03x\r\n" dataUsingEncoding:NSUTF8StringEncoding];
NSMutableData *postBody = [NSMutableData data];
[postBody appendData: boundary];
[postBody appendData: [@"Content-Disposition: form-data; name=\"datafile\"; filename=\"t.jpg\"\r\n" dataUsingEncoding:NSUTF8StringEncoding]];
[postBody appendData: [@"Content-Type: image/jpeg\r\n\r\n" dataUsingEncoding:NSUTF8StringEncoding]];
[postBody appendData: imageData];
[postBody appendData: boundary];
[req setHTTPBody:postBody];
This is wrong because imageData might contain \r\n sequences, right? If so, is there a way to escape CRLFs in raw data? Or am I missing something?
Thanks in advance!
This is an interesting question. Looking at the multipart media type RFC (RFC 2046), it appears that it is up to the composing agent to make sure that the boundary does not appear in the encapsulated data. In addition, it states the following:
NOTE: Because boundary delimiters must not appear in the body parts being encapsulated, a user agent must exercise care to choose a unique boundary parameter value. The boundary parameter value in the example above could have been the result of an algorithm designed to produce boundary delimiters with a very low probability of already existing in the data to be encapsulated without having to prescan the data.
I interpret this to mean that in order to be sure that the boundary value doesn't appear in the encapsulated data, you would have to scan the data for the boundary value. Because this is an unacceptably expensive operation in most cases, it's expected that user agents will simply choose a value that has a very low probability of occurring in the data.
Consider the probability of the boundary in your example occurring in a random string of bytes (which, for the sake of argument, we will assume represents a JPEG image). The full string that would need to be matched in order to end your image data early would be "\r\n--AaB03x": 10 bytes, or 80 bits. Starting from any given position, the chance that the next 10 bytes are that sequence is one in 2^80. A 1 MB JPEG file contains 2^23 bits, so even if we count every bit as a possible starting position, the chance of the file containing the sequence is less than 2^23/2^80, or one in 2^57 (more than one hundred quadrillion). Since a match can only start on a byte boundary, and there are only 2^20 of those, the real figure is smaller still.
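The estimate can be written out as a union bound. This is a sketch under the same idealization as the text: the payload is treated as N uniformly random bytes, which real JPEG data only approximates, and the delimiter is k fixed bytes. The symbols N and k are introduced here for illustration.

```latex
\[
  P(\text{delimiter occurs})
  \;\le\; (N - k + 1)\cdot 256^{-k}
  \;<\; 2^{20}\cdot 2^{-80}
  \;=\; 2^{-60}
  \qquad (N = 2^{20},\ k = 10).
\]
```

This counts only byte-aligned starting offsets, so it comes out tighter than the 2^23/2^80 figure above. Both are upper bounds, and either way the probability is negligible.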
So, I think the answer is that to be 100% sure, you would have to check the data for the boundary sequence, and then use a different one if that boundary sequence exists in the data. But in practice, the chances of the boundary sequence occurring are small enough that it's not worth it.
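For anyone who does want the 100% guarantee, the check described above might look roughly like the following. This is a minimal sketch, not part of the original post: the helper name UniqueBoundaryForData and the "Boundary-" prefix are made up for illustration, and it assumes the payload is already in memory as an NSData.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns a boundary string
// whose full delimiter, "\r\n--<boundary>", does not occur anywhere in payload.
static NSString *UniqueBoundaryForData(NSData *payload) {
    for (;;) {
        // A UUID makes an accidental match very unlikely on the first attempt.
        // "Boundary-" plus a 36-character UUID stays under the 70-character limit.
        NSString *candidate =
            [NSString stringWithFormat:@"Boundary-%@", [NSUUID UUID].UUIDString];
        NSData *delimiter = [[@"\r\n--" stringByAppendingString:candidate]
            dataUsingEncoding:NSUTF8StringEncoding];

        // Linear scan of the payload for the delimiter bytes.
        NSRange hit = [payload rangeOfData:delimiter
                                   options:0
                                     range:NSMakeRange(0, payload.length)];
        if (hit.location == NSNotFound) {
            return candidate;  // Safe to use with this payload.
        }
        // Otherwise loop and try a fresh candidate.
    }
}
```

The scan costs one pass over the data per candidate. With a random boundary the first candidate will almost always pass, which is why most clients skip the scan entirely.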
Technically speaking, it is wrong because the trailing \r\n should not be part of the boundary delimiter as defined in RFC 2046. In the RFC's grammar the delimiter is just \r\n--AaB03x; the trailing \r\n comes after the delimiter and any optional transport-padding. In practice it shouldn't matter, because you're going to put it after the boundary anyway.
Also, I take it that only the whole sequence has to be avoided, not subsequences such as a lone \r\n.
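To make the framing concrete, here is one way the body could be assembled so that each piece lines up with the RFC 2046 grammar. It is only a sketch under stated assumptions: the helper name MultipartBodyForFile and its parameters are invented for illustration, and it handles a single file part with no other form fields.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): wraps one file in a
// multipart/form-data body. Field name, file name, and MIME type are
// parameters chosen for illustration.
static NSData *MultipartBodyForFile(NSData *fileData, NSString *boundary,
                                    NSString *fieldName, NSString *fileName,
                                    NSString *mimeType) {
    NSMutableData *body = [NSMutableData data];
    void (^append)(NSString *) = ^(NSString *text) {
        [body appendData:[text dataUsingEncoding:NSUTF8StringEncoding]];
    };

    // First delimiter line: "--" plus the boundary, ended by CRLF.
    append([NSString stringWithFormat:@"--%@\r\n", boundary]);
    // Part headers, each ended by CRLF; an empty line ends the header block.
    append([NSString stringWithFormat:
        @"Content-Disposition: form-data; name=\"%@\"; filename=\"%@\"\r\n",
        fieldName, fileName]);
    append([NSString stringWithFormat:@"Content-Type: %@\r\n\r\n", mimeType]);
    // The raw bytes go in unescaped, CRLFs included.
    [body appendData:fileData];
    // Close-delimiter: CRLF, "--", the boundary, and a trailing "--".
    append([NSString stringWithFormat:@"\r\n--%@--\r\n", boundary]);
    return body;
}
```

The matching request header would be Content-Type: multipart/form-data; boundary= followed by the same boundary string. Note that the file bytes are appended as-is: a \r\n inside the data is harmless unless it is immediately followed by "--" and the boundary.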