Couple of questions on NSFileHandle, Obj-C
I'm working with files in Objective-C. My application needs to read some large text files (e.g. 5 MB) encoded in UTF-16. The first problem: how do I find the size of the file I'm going to read from?
The second problem: when I read the file in a single pass I get the right text, but when I seek or read a second time, I no longer get my original text. Here is my code:
NSFileHandle *sourceFile;
NSData *d1;
NSString *st1, *st2 = @"";
sourceFile = [NSFileHandle fileHandleForReadingAtPath:filePath]; // my file's size is 5 MB
for (int i = 0; i < 500; i++) {
    d1 = [sourceFile readDataOfLength:20];
    st1 = [[NSString alloc] initWithData:d1 encoding:NSUTF16StringEncoding]; // converting my raw data into a UTF-16 string
    st2 = [st2 stringByAppendingFormat:@"%@", st1];
    st1 = @"";
}
[sourceFile closeFile];
After this runs, st2 holds a string that starts with readable characters (as in the original file) but then turns into a mess of unreadable characters (e.g. 䠆⠆䀆䀆䀆ㄆ䌆✆⨆䜆). I've been up all night trying to figure it out, but couldn't :(
@Neovibrant: Sorry to prove you wrong, but UTF-16 is not always 2 bytes (16 bits) per character. As the Wikipedia article shows, characters at U+10000 and above take 4 bytes, so watching out for an even offset is not sufficient: you can still cut a 4-byte character in half. The best way is always to use the correct encoding and leave it to the framework to determine the size of a character.
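One way to follow that advice, as a minimal sketch: for a file of a few megabytes, let Foundation decode the whole file in one call, so no character can be split across reads. The helper name ReadUTF16File is hypothetical, and the sketch assumes the file fits comfortably in memory.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): decodes the entire
// file as UTF-16 in a single call, so there are no chunk boundaries
// that could cut a character in half.
// Returns nil and fills *error if the file cannot be read or decoded.
static NSString *ReadUTF16File(NSString *path, NSError **error) {
    return [NSString stringWithContentsOfFile:path
                                     encoding:NSUTF16StringEncoding
                                        error:error];
}
```

If the files grow too large to hold in memory, chunked reading is still possible, but each chunk boundary then has to respect character boundaries.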
To get the file size you can simply use the NSFileManager:
NSFileManager *fileManager = [NSFileManager defaultManager];
NSError *error = nil;
NSDictionary *fileAttributes = [fileManager attributesOfItemAtPath:filePath error:&error];
// fileAttributes is nil if the lookup failed; error then describes why
unsigned long long size = [fileAttributes fileSize];
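If a file handle is already open, the size can also be taken from the handle itself. This is only a sketch: the helper name FileSizeOfHandle is hypothetical, and it relies on seekToEndOfFile returning the new offset, which equals the file length in bytes.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): returns the size in
// bytes of the file behind an open handle, then rewinds to the start so
// later reads begin at byte 0.
static unsigned long long FileSizeOfHandle(NSFileHandle *handle) {
    unsigned long long size = [handle seekToEndOfFile]; // offset of the end = file length
    [handle seekToFileOffset:0];                        // rewind for subsequent reads
    return size;
}
```

Note that the helper moves the read position, so call it before you start reading or save and restore the offset yourself.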
The second problem comes from the UTF-16 encoding. In UTF-16, a character is represented by 2 or 4 bytes (http://en.wikipedia.org/wiki/UTF-16).
Let's assume you have a UTF-16 (big-endian) text file containing the text Hello. The bytes will be:
00 48 │ 00 65 │ 00 6C │ 00 6C │ 00 6F
H │ e │ l │ l │ o
Everything is fine if you start reading from byte 0 (or any even index): you get the expected result. But if you start reading from an odd byte (like 1), every character comes out wrong because the bytes are shifted:
48 00 │ 65 00 │ 6C 00 │ 6C 00 │ 6F
䠀 │ 攀 │ 氀 │ 氀 │ ?
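The shift can be reproduced with a small sketch. The helper name DecodeUTF16BE is hypothetical; it assumes ARC and big-endian data without a byte order mark, matching the byte layout above, and it drops a trailing odd byte so that only whole 2-byte units are decoded.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): decodes big-endian
// UTF-16 bytes starting at the given byte offset. A trailing odd byte is
// dropped so that only complete 2-byte units reach the decoder.
static NSString *DecodeUTF16BE(NSData *data, NSUInteger offset) {
    NSUInteger length = (data.length - offset) & ~(NSUInteger)1;
    NSData *slice = [data subdataWithRange:NSMakeRange(offset, length)];
    return [[NSString alloc] initWithData:slice
                                 encoding:NSUTF16BigEndianStringEncoding];
}
```

With the ten bytes of Hello from the example, offset 0 decodes to Hello, while offset 1 decodes to the shifted characters 䠀攀氀氀. Naming the byte order explicitly also avoids relying on a byte order mark, which normally appears only at the very start of a file and therefore not in later chunks.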