Writing small integers to binary files in R
I have a question about writing to binary files in R. I am working on data compression and want to write integers that fit in two bytes to a binary file, but R stores them in 4 bytes. Is there a data type that stores numbers in two bytes or one byte (something like a short integer in C)?
If not, when you use writeBin with a small integer (one that fits in 1 byte, for example) and size=1, does the program write all 4 bytes of the integer (including the zero bytes), or does it convert it to 1 byte?
This is a very important and urgent problem for me and your help would be greatly appreciated. If you know of a comprehensive reference on writing to binary files, please let me know. Thanks!
(Sometimes, when I use writeBin with a small value for size, I get an error saying that the size is not defined on my machine. How can I fix that? What is the best way to write integers to files for compression purposes (to get the smallest possible files)? Does the raw data type help?)
You may be making your life too complicated. R uses compression by default in save(); have you measured whether that is already good enough? An example:
R> vec <- rep(1L, 100) ## 100 integer elements
R> object.size(vec)
440 bytes ## so there must be a 40 byte overhead
R> str(vec)
int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
R> save( vec, file="/tmp/vec.RData")
R> file.info("/tmp/vec.RData")[1:3]
size isdir mode
/tmp/vec.RData 64 FALSE 644 ## stored to 64 bytes!
R>
You could argue that repeated values are ideal for compression, but that may well hold for your dataset too.
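To measure this on less favorable data, a minimal sketch along these lines could compare the default save() output with a one-byte-per-element writeBin() dump. The random test vector and the temporary file names are assumptions for illustration, not part of the original question.

```r
# Hypothetical test data: 10,000 random integers that each fit in one byte.
set.seed(1)
vec <- sample(0:255, 10000, replace = TRUE)

rdata_path <- tempfile(fileext = ".RData")
bin_path   <- tempfile(fileext = ".bin")

save(vec, file = rdata_path)        # save() compresses by default
writeBin(vec, bin_path, size = 1)   # one byte per element, no compression

rdata_size <- file.size(rdata_path) # depends on how compressible the data is
bin_size   <- file.size(bin_path)   # exactly length(vec) bytes

print(c(RData = rdata_size, bin = bin_size))
```

Running the same comparison on a sample of your real data should show whether the extra effort of a custom binary format pays off.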
Otherwise, maybe try the CRAN package ff which supports one and two-byte types.
Lastly, if you want full control, you could use C or C++ to assign to shorter integer types, or even char types. There is a package I could recommend for interfacing C++...
The size argument to writeBin should be 1, 2, or 4 for integers - 8 works too but not for compression ;-)
Do you really need size=3?
writeBin writes each integer value using only as many bytes as you specify with size. If the integer doesn't fit, the high bits are silently dropped.
For signed values (the default):
size=1 for integer values between [-128, 127]
size=2 for integer values between [-32768, 32767]
Or, if you read them in with signed=FALSE:
size=1 for integer values between [0, 255]
size=2 for integer values between [0, 65535].
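One way to see exactly which bytes are produced is to pass an empty raw vector as the connection, in which case writeBin returns the bytes instead of writing a file. The sketch below assumes little-endian byte order, set explicitly so the result does not depend on the platform; the example values are arbitrary.

```r
# Write two integers as 2-byte values into a raw vector rather than a file.
bytes <- writeBin(c(1L, 258L), raw(), size = 2, endian = "little")

print(bytes)          # 01 00 02 01  (258 is 0x0102, low byte first)
print(length(bytes))  # 4: two bytes per integer, not four
```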
Example of writing too large values for the specified size:
writeBin(254:257, "foo.bin", size=1)
readBin("foo.bin", "int", 4, size=1, signed=FALSE) # 254 255 0 1
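For values that do fit, a round trip with size=2 might look like the following sketch. The vector and the temporary file are illustrative assumptions; the byte order is fixed to little-endian so the file reads back the same way on any machine.

```r
# Values in [0, 65535] fit in two bytes when read back as unsigned.
x <- c(0L, 1L, 32767L, 40000L, 65535L)
path <- tempfile(fileext = ".bin")

writeBin(x, path, size = 2, endian = "little")
n_bytes <- file.size(path)   # 2 bytes per element

unsigned <- readBin(path, "integer", n = length(x), size = 2,
                    signed = FALSE, endian = "little")
signed   <- readBin(path, "integer", n = length(x), size = 2,
                    signed = TRUE, endian = "little")

print(n_bytes)   # 10
print(unsigned)  # 0 1 32767 40000 65535
print(signed)    # 0 1 32767 -25536 -1
```

The same bytes come back as different numbers depending on signed, so the reader has to know which convention the writer intended.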