Serializing simple data structures in C++: evaluating a 1d versus 2d representation
I am trying to decide on a simple yet efficient serialization approach for some largish data structures and am hoping to find some feedback regarding the simple tests I have just written to help me make the decision.
The data structure in question comprises several 2-dimensional vectors of floats:
vector<vector<float> > data;
The component vectors are mostly the same size, but this is not guaranteed. There is, however, very limited variation in the size of the component vectors. For example, typically 99% of the component vectors will be the same size, with the remainder being some multiple of the base size. It may be possible to truncate the longer or pad the shorter component vectors, but this may have a negative impact on program performance in some cases.
These 2-d vectors typically contain several hundred thousand component vectors while each component vector itself typically contains 40 elements.
This data is read in once when the program is first booted, and thereafter the information is iterated over repeatedly, but never modified.
In terms of how to serialize this data, there seem to be two very obvious choices:
- Convert the 2-d vectors to 1-d vectors and serialize the 1-d vectors with a simple header. This should be very fast to read and write, and roughly the same as a 2-d vector to iterate over, albeit with a slightly different setup. Unfortunately this falls down in the case where the component vectors have variable length; iteration would also require extra book-keeping (a rough sketch of this layout follows the list).
- Keep the 2-d structure and use a slightly more complex header. This will probably be somewhat slower to read and write, but we only do this once anyway. This allows us to account for the variable length component vectors and means no additional book-keeping during iteration. Existing code that uses this data need not be altered.
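For what it's worth, here is a minimal sketch of what option 1 might look like if the variable-length rows are handled with a small offset table rather than truncation or padding. The names (FlatTable, flatten, row_begin, row_size) are purely illustrative and are not taken from the pastebin code:

#include <cstddef>
#include <vector>

// Hypothetical flattened layout for option 1: one contiguous buffer of floats
// plus a per-row offset table. The offsets also cope with rows of unequal
// length, at the cost of the extra book-keeping mentioned above.
struct FlatTable {
    std::vector<float>       values;   // all rows stored back to back
    std::vector<std::size_t> offsets;  // offsets[i] = start of row i; offsets.back() == values.size()
};

FlatTable flatten(const std::vector<std::vector<float> >& data)
{
    FlatTable t;
    t.offsets.reserve(data.size() + 1);
    t.offsets.push_back(0);
    for (std::size_t i = 0; i < data.size(); ++i) {
        t.values.insert(t.values.end(), data[i].begin(), data[i].end());
        t.offsets.push_back(t.values.size());
    }
    return t;
}

// Row access during iteration needs one extra lookup compared with data[i][j].
inline const float* row_begin(const FlatTable& t, std::size_t i)
{
    return &t.values[0] + t.offsets[i];
}
inline std::size_t row_size(const FlatTable& t, std::size_t i)
{
    return t.offsets[i + 1] - t.offsets[i];
}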
I have written a simple test program to try and get a general feel for how these choices might play out in practice:
run-serializer.cpp: http://pastebin.com/AzWivQDU
Serializer.h: http://pastebin.com/dvbqN1EY
Serializer.cpp: http://pastebin.com/0eDCg7sE
$ g++ -o run-serializer run-serializer.cpp Serializer.cpp
$ ./run-serializer 500
Initializing tests...
Total elems: 11441408
Test binfile: test.bin
Binary write: 0.038397
Average runtimes for 500 iterations:
Binary 1d read: 0.0228143
Binary 2d read: 0.156756
1d iter time: 0.125914
2d iter time: 0.126896
This seems to indicate that, on my laptop and given my example implementation, the 2d read takes about 7x longer than the 1d read (0.157 s vs 0.023 s). Not too surprising, and for my purposes this is a reasonable amount of time. It also indicates that, as expected, there is very little difference between the 1d iteration time and the 2d iteration time, but that perhaps the 1d version is winning out by a smidgen.
Given the (potentially) variable length of the component vectors, I'm inclined to go with the existing 2-d approach. The initial loading process is noticeably slower, but still quite minor from the perspective of the application. Iteration seems to not be an issue either way and the 2-d approach means I won't have to alter any of the other code currently utilizing these data structures.
So I guess my real question is:
- Are these tests broadly reasonable, and do the conclusions I've drawn from them make sense?
Apologies for being long-winded.
There is some room for improvement.
Timers are more consistent when used at a coarser granularity. Don't add up the times of individual iterations; take the time of the full operation and divide by the number of iterations afterward. This also keeps code dealing with time at the higher levels of concern, where it makes the most sense. Low-level code can compute it, but it can't do anything with it.
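A minimal sketch of that pattern, assuming a C++11 compiler for <chrono>; run_once() is a stand-in for whichever read or iterate routine is being measured:

#include <chrono>
#include <iostream>

// Time the whole batch once and divide afterward, rather than accumulating
// per-iteration timings inside the loop.
void benchmark(void (*run_once)(), int iterations)
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        run_once();                                  // no timing calls in here
    const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();

    const double total = std::chrono::duration<double>(stop - start).count();
    std::cout << "average over " << iterations << " runs: "
              << total / iterations << " s\n";
}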
Using wall time over the full duration of the executable (or N invocations of the executable) really isn't that bad of an option. It is crude, but it should average out with a large enough sample.
Separate executables could reduce bias from static initialization, perhaps using conditional compilation for each of the different strategies.
Run the expected test sizes many times independently instead of running extraordinarily large sizes fewer times. Re-running the tests within the same instance of the executable could give you misleading results. I could see a scenario where an OS would make re-use of memory cheaper than first-time allocations, for instance.
Depending on the system, I would expect memory allocation or disk contention to dominate the run time. Laptops usually have slower hard drives, so disk contention has a bigger impact. Buffering your reads with a decorator pattern is simple and effective for these sorts of problems.
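One rough sketch of an explicitly buffered binary read (not from the posted code). Note that pubsetbuf() must be called before open(), and whether the stream honours the supplied buffer is implementation-defined, so treat this as a hint rather than a guarantee:

#include <cstddef>
#include <fstream>
#include <vector>

// Read `count` floats from a binary file through a large explicit stream buffer.
std::vector<float> read_floats(const char* path, std::size_t count)
{
    std::vector<char> streambuf(1 << 20);                   // 1 MiB buffer
    std::ifstream in;
    in.rdbuf()->pubsetbuf(&streambuf[0], streambuf.size()); // before open()
    in.open(path, std::ios::binary);

    std::vector<float> data(count);
    in.read(reinterpret_cast<char*>(&data[0]), count * sizeof(float));
    return data;
}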
If your requirements don't demand exceptionally fast start times, I would agree that the 2d version is appropriate; you are implementing closer to the actual requirements. If they did demand fast start times, bigger headers and lazy loading would probably be more appropriate.
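For completeness, a rough sketch of what that lazy-loading variant might look like with a bigger header (row count plus per-row offset and length). LazyTable and its members are hypothetical and not part of the posted code:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

// The header is read up front; row payloads are only pulled from disk the
// first time they are requested.
class LazyTable {
public:
    explicit LazyTable(const char* path) : in_(path, std::ios::binary)
    {
        std::uint64_t rows = 0;
        in_.read(reinterpret_cast<char*>(&rows), sizeof(rows));
        offsets_.resize(rows);
        lengths_.resize(rows);
        cache_.resize(rows);
        if (rows == 0) return;
        in_.read(reinterpret_cast<char*>(&offsets_[0]), rows * sizeof(std::uint64_t));
        in_.read(reinterpret_cast<char*>(&lengths_[0]), rows * sizeof(std::uint64_t));
    }

    // Returns row i, reading it from the file on first access only.
    const std::vector<float>& row(std::size_t i)
    {
        if (cache_[i].empty() && lengths_[i] != 0) {
            cache_[i].resize(lengths_[i]);
            in_.seekg(static_cast<std::streamoff>(offsets_[i]));
            in_.read(reinterpret_cast<char*>(&cache_[i][0]),
                     lengths_[i] * sizeof(float));
        }
        return cache_[i];
    }

private:
    std::ifstream in_;
    std::vector<std::uint64_t> offsets_, lengths_;
    std::vector<std::vector<float> > cache_;
};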