Parallel Binary Deserialization?
I have a solution where I need to read objects into memory very quickly; however, the binary stream may be cached in memory in compressed form to save time on disk I/O.
I've tinkered with different solutions; unsurprisingly, XmlTextWriter and XmlTextReader weren't great, and neither was the built-in binary serialization. Protobuf-net is excellent but still a little too slow. Here are some stats:
File size XML: 217 KB
File size binary: 87 KB
Compressed binary: 26 KB
Compressed XML: 26 KB
Deserialize with XML (XmlTextReader): 8.4 sec
Deserialize with binary (protobuf-net): 6.2 sec
Deserialize with binary, without string interning (protobuf-net): 5.2 sec
Deserialize with binary from memory: 5.9 sec
Time to decompress binary file into memory: 1.8 sec
Serialize with XML (XmlTextWriter): 11 sec
Serialize with binary (protobuf-net): 4 sec
Serialize with binary, length-prefixed (protobuf-net): 3.8 sec
That got me thinking: it seems (correct me if I'm wrong) that the major culprit in deserialization is the actual byte-to-object conversion rather than the I/O. If that's the case, it should be a candidate for the new Parallel Extensions.
Since I'm a bit of a novice when it comes to binary I/O, I'd appreciate some input before I commit time to a solution, though :)
For simplicity's sake, say we want to deserialize a list of objects with no optional fields. My first idea was simply to store each object with a length prefix, read each object's byte[] into a list of byte[], and use PLINQ to do the byte[] -> object deserialization.
However, with that method I still need to read the byte[]s on a single thread, so perhaps one could read the whole binary stream into memory instead (how large a binary file is feasible for that, by the way?) and store at the beginning of the file how many objects there are, plus each object's length and offset. Then I should be able to create ArraySegments or something and do the chunking in parallel as well.
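Roughly, this is the kind of thing I have in mind. This is only a sketch: the Item type, the length-prefixed layout, and the helper name are placeholders, and it assumes a seekable stream (e.g. a MemoryStream holding the decompressed data):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using ProtoBuf;

[ProtoContract]
public class Item
{
    [ProtoMember(1)] public string Name { get; set; }
}

public static class ParallelDeserializer
{
    public static List<Item> DeserializeAll(Stream stream)
    {
        // Single-threaded pass: read each length-prefixed record into a byte[].
        var chunks = new List<byte[]>();
        using (var reader = new BinaryReader(stream))
        {
            while (stream.Position < stream.Length)
            {
                int length = reader.ReadInt32();
                chunks.Add(reader.ReadBytes(length));
            }
        }

        // Parallel pass: byte[] -> object via protobuf-net, preserving order.
        return chunks.AsParallel()
                     .AsOrdered()
                     .Select(bytes => Serializer.Deserialize<Item>(new MemoryStream(bytes)))
                     .ToList();
    }
}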
So what do you guys think, is it feasible?
I do things like this quite a lot, and nothing really beats using BinaryReader to read things in. As far as I know, there is no faster way than BinaryReader.ReadInt32 to read in a 32-bit integer.
You may also find that the overhead of making it parallel and joining the results back together is too much. If you really want to go the parallel route, I would advise using multiple threads to read multiple files, rather than multiple threads to read one file in multiple blocks.
You could also play around with the block size to make it match disk block size, but there are so many levels of abstraction in between your application and the disk that could make that a waste of time.
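As a rough sketch of the multiple-files idea, reusing protobuf-net from the question (the file names and the Item type are placeholders, and I'm assuming each file holds one serialized List<Item>):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using ProtoBuf;

var files = new[] { "items1.bin", "items2.bin", "items3.bin" };

// One file per worker; each worker deserializes its whole file.
var lists = files.AsParallel()
                 .Select(path =>
                 {
                     using (var fs = File.OpenRead(path))
                     {
                         return Serializer.Deserialize<List<Item>>(fs);
                     }
                 })
                 .ToList();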
A binary file can be read simultaneously by several threads. To do that, it must be opened with appropriate access/share modifiers, and each thread can then be given its own offset and length within the file. So reading in parallel is not a problem.
Let us assume that you stick to a simple binary format: each object is prefixed with its length. Knowing that, you can "scroll" through the file and determine the offset at which each deserializing thread should start.
The deserializing algorithm can look like this:
1. Analyze the file (divide it into several relatively large chunks; chunk borders should coincide with object borders).
2. Spawn the necessary number of deserializer threads and "instruct" each one with the appropriate offset and length to read.
3. Combine the results of all deserializer threads into one list.
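A minimal sketch of steps 2 and 3, assuming the length-prefixed format above; the Item type, the Chunk class, the file name, and the AnalyzeFile helper (step 1, not shown) are placeholders. Requires System.IO, System.Linq, System.Threading.Tasks and protobuf-net.

class Chunk { public long Offset; public long Length; }

static List<Item> DeserializeChunk(string path, long offset, long length)
{
    var result = new List<Item>();

    // Each worker opens its own stream with FileShare.Read, so several
    // threads can read the same file at different offsets at the same time.
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new BinaryReader(fs))
    {
        fs.Seek(offset, SeekOrigin.Begin);
        long end = offset + length;
        while (fs.Position < end)
        {
            int size = reader.ReadInt32();
            byte[] bytes = reader.ReadBytes(size);
            result.Add(Serializer.Deserialize<Item>(new MemoryStream(bytes)));
        }
    }
    return result;
}

string path = "objects.bin";            // placeholder file name
List<Chunk> chunks = AnalyzeFile(path); // step 1: compute chunk offsets/lengths (not shown)

// Step 2: one task per chunk.
var tasks = chunks.Select(c => Task.Factory.StartNew(
                      () => DeserializeChunk(path, c.Offset, c.Length)))
                  .ToArray();
Task.WaitAll(tasks);

// Step 3: merge the results into one list.
var allItems = tasks.SelectMany(t => t.Result).ToList();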
"That got me thinking: it seems (correct me if I'm wrong) that the major culprit in deserialization is the actual byte conversion rather than the I/O."
Don't assume where the time is being spent; get yourself a profiler and find out.
When I deserialize a list of objects from XML larger than 1 MB, it takes less than 2 seconds with this code:
// Deserializes an XML string into a List<T> via XmlSerializer.
public static List<T> FromXML<T>(this string s) where T : class
{
    var ls = new List<T>();
    var xml = new XmlSerializer(typeof(List<T>));

    // Dispose the readers when done.
    using (var sr = new StringReader(s))
    using (var xmltxt = new XmlTextReader(sr))
    {
        if (xml.CanDeserialize(xmltxt))
        {
            ls = (List<T>)xml.Deserialize(xmltxt);
        }
    }
    return ls;
}
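For example, assuming a hypothetical MyItem class and an XML string produced by serializing a List<MyItem>:

List<MyItem> items = xmlString.FromXML<MyItem>();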
Try this and see if it is better for the XML case.