
Couple of questions on NSFileHandle, Obj-C

I'm working with files in Obj-C. My application has to read some large text files (e.g. 5 MB) that are encoded as UTF-16. The first problem: how do I detect the size of the file I'm going to read from?

The second problem: when I read the file only once, I get the right text, but when I seek or read another time, I no longer get my original text. Here is my code segment:

NSFileHandle *sourceFile;
NSData *d1;
NSString *st1, *st2 = @"";

sourceFile = [NSFileHandle fileHandleForReadingAtPath:filePath]; // my file's size is 5 MB

for (int i = 0; i < 500; i++) {
    d1 = [sourceFile readDataOfLength:20];
    st1 = [[NSString alloc] initWithData:d1 encoding:NSUTF16StringEncoding]; // converting my raw data into a UTF16 string
    st2 = [st2 stringByAppendingFormat:@"%@", st1];
    st1 = @"";
}

[sourceFile closeFile];

After this runs, st2 holds a string that starts with readable characters (as in the original file) but then turns into a mess of unreadable characters (e.g. 䠆⠆䀆䀆䀆ㄆ䌆✆⨆䜆). I haven't slept all night trying to figure it out, but couldn't :(


@Neovibrant: Sorry to prove you wrong, but UTF-16 is not always 2 bytes (16 bits) per character. As the Wikipedia article shows, every character above U+FFFF takes 4 bytes (a surrogate pair). So watching for an even offset is not sufficient, because an even offset can still cut a 4-byte character in half. The best way is always to use the correct encoding and leave it to the framework to determine the size of a character.
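
One way to leave character sizes entirely to the framework is to decode the file in a single call instead of decoding fixed-size pieces. The following is only a minimal sketch: the helper name ReadUTF16File is hypothetical (it is not from the question), and it assumes the file is small enough to hold in memory, which should be the case for about 5 MB.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original post.
// Decodes the whole file in one call, so no character (2-byte or 4-byte)
// can be cut at a chunk boundary.
static NSString *ReadUTF16File(NSString *path) {
    NSError *error = nil;
    NSString *text = [NSString stringWithContentsOfFile:path
                                               encoding:NSUTF16StringEncoding
                                                  error:&error];
    if (text == nil) {
        // Reading or decoding failed; report why instead of failing silently.
        NSLog(@"Could not read %@: %@", path, error);
    }
    return text;
}
```

Calling ReadUTF16File(filePath) would then replace the whole read loop from the question.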


To get the file size you can simply use the NSFileManager:

NSFileManager *fileManager = [[[NSFileManager alloc] init] autorelease];
NSDictionary *fileAttributes = [fileManager attributesOfItemAtPath:filePath error:nil];
unsigned long long size = [fileAttributes fileSize];
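
If a file handle is already open, the size can probably also be obtained from the handle itself, since seekToEndOfFile returns the new offset. This is a sketch only: the helper name FileHandleSize is made up for illustration, and these seek methods are marked as deprecated in recent SDKs in favour of the error-returning variants.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original answer.
// Returns the size in bytes of the file behind an open handle and
// leaves the read position where it was.
static unsigned long long FileHandleSize(NSFileHandle *handle) {
    unsigned long long saved = [handle offsetInFile];   // remember the current position
    unsigned long long size = [handle seekToEndOfFile]; // offset of the end == file size
    [handle seekToFileOffset:saved];                    // restore the position
    return size;
}
```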

The second problem is caused by the UTF-16 encoding. In UTF-16, a character is represented by 2 or 4 bytes (http://en.wikipedia.org/wiki/UTF-16).

Let's assume you have a text file in UTF-16 with the text Hello. The bytes will be:

00 48 │ 00 65 │ 00 6C │ 00 6C │ 00 6F
   H  │    e  │     l │     l │     o

Everything is fine if you start reading from byte 0 (or any even index): you'll get the expected result. But if you start reading from an odd byte (like 1), all characters will be garbled because the bytes are shifted:

48 00 │ 65 00 │ 6C 00 │ 6C 00 │ 6F
   䠀 │     攀 │    氀 │    氀 │  ?
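
If the file still has to be read in pieces, one option is to collect the raw bytes first and decode them only once at the end, so chunk boundaries never reach the decoder. The sketch below assumes ARC and uses a hypothetical helper name, ReadUTF16InChunks. It also sidesteps a related concern: chunks after the first do not begin with a byte order mark, which may affect how they are decoded on their own.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original answer. Assumes ARC.
// Reads the file chunk by chunk but decodes the accumulated bytes once,
// so the chunk size (even or odd) does not affect the decoded text.
static NSString *ReadUTF16InChunks(NSString *path, NSUInteger chunkSize) {
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    if (handle == nil) {
        return nil; // file missing or unreadable
    }

    NSMutableData *buffer = [NSMutableData data];
    while (YES) {
        NSData *chunk = [handle readDataOfLength:chunkSize];
        if ([chunk length] == 0) {
            break; // empty data signals end of file
        }
        [buffer appendData:chunk]; // keep raw bytes, do not decode yet
    }
    [handle closeFile];

    // Single decode over all bytes, including the byte order mark if present.
    return [[NSString alloc] initWithData:buffer encoding:NSUTF16StringEncoding];
}
```

The trade-off is that the whole file ends up in memory anyway, so this mainly helps when the chunks arrive from somewhere that cannot be read in one call.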