Tips for designing a serialization file format that will permit easy merging
Say I'm building a UML modeling tool. There's some hierarchical organization to the data, and model elements need to be able to refer to others. I need some way to save model files to disk. If multiple people might be working on the files simultaneously, the time will come to merge these model files. Also, it would be nice to compare two revisions in source control and see what has changed. This seems like it would be a common problem across many domains.
For this to work well using existing difference and merge tools, the file format should be text, separated onto multiple lines.
What are some existing serialization formats that do a good job (or poor job) addressing such problems? Or, if designing a custom file format, what are some tips / guidelines / pitfalls?
Bonus question: Any additional guidance if I want to eventually split the model up into multiple files, each separately source controlled?
I solved that problem long ago for Octave/MATLAB; now I need something for C#. The task was to merge two Octave structs into one. I found no merge tool and no fitting serializer, so I had to come up with something myself.
The most important design decision was to split the struct tree into lines, each holding the complete path to a leaf and the content of that leaf.
The basic idea was:
- Serialize the Struct to Lines, where each line represents a basic Variable (Matrix, string, float,...)
- An array or matrix of struct will have the index in the path.
- concatenate the two resulting text files, sort the lines
- detect collisions and do collision handling (very easy, because the same properties will be positioned directly under each other after the line sorting)
- deserialize the result
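The merge steps above can be sketched roughly as follows. This is a minimal Python illustration, not the original Octave code: `merge_lines` is a hypothetical name, lines are assumed to have the form `path=value` with no `=` inside the path, and the collision handling shown (keep the first value, report the clash) is only one possible policy.

```python
def merge_lines(lines_a, lines_b):
    """Merge two flattened files; return (merged_lines, conflicts).

    Assumes every line looks like 'path=value' and that '=' never
    appears inside a path. Hypothetical helper for illustration only.
    """
    merged, conflicts = [], []
    prev_path = prev_value = None
    # Concatenate and sort: lines with the same path end up adjacent.
    for line in sorted(lines_a + lines_b):
        path, _, value = line.partition("=")
        if path == prev_path:
            if value != prev_value:
                # Same property, different content: a real collision.
                conflicts.append((path, prev_value, value))
            # Identical duplicates are simply dropped. On a collision the
            # first value in sort order is kept, which is an arbitrary choice.
            continue
        merged.append(line)
        prev_path, prev_value = path, value
    return merged, conflicts


file_a = ["root.f=isfloat|3.1416", "root.t=ischar|Textstring"]
file_b = ["root.f=isfloat|2.5", "root.s.a=isfloat|3", "root.t=ischar|Textstring"]
merged, conflicts = merge_lines(file_a, file_b)
```

A real tool would probably hand each entry in `conflicts` to the user or to a three-way merge against a common ancestor instead of silently keeping one side.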
Example:
>> s1
s1 =
  scalar structure containing the fields:
    b =
      2x2 struct array containing the fields:
        bruch
    t = Textstring
    f = 3.1416
    s =
      scalar structure containing the fields:
        a = 3
        b = 4
will be serialized to
root.b(1,1).bruch=txt2base('isfloat|[ [ 0, 4 ] ; [ 1, 0 ] ; ]');
root.b(1,2).bruch=txt2base('isfloat|[ [ 1, 6 ] ; [ 1, 0 ] ; ]');
root.b(2,1).bruch=txt2base('isfloat|[ [ 2, 7 ] ; [ 1, 0 ] ; ]');
root.b(2,2).bruch=txt2base('isfloat|[ [ 7 ] ; [ 1 ] ; ]');
root.f=txt2base('isfloat|[3.1416]');
root.s.a=txt2base('isfloat|[3]');
root.s.b=txt2base('isfloat|[4]');
root.t=txt2base('ischar|Textstring');
The advantage of this method is that it is very easy to implement and the result is human-readable. First you have to write the two functions base2txt and txt2base, which convert basic types to strings and back. Then you just walk the tree recursively and, for each struct property, write the path to the property (here separated by ".") and its content on one line.
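To make the recursion concrete, here is a rough Python sketch of the same idea, using nested dicts and lists in place of Octave structs. The names base2txt and txt2base come from the description above, but their bodies here are assumptions: only strings and numbers are handled, and the type tags simply mirror the example lines.

```python
def base2txt(value):
    """Encode a leaf as 'typetag|text' (assumed encoding, strings and numbers only)."""
    if isinstance(value, str):
        # Limitation: a string containing a newline would break the
        # one-line-per-leaf format; a real version would need escaping.
        return "ischar|" + value
    return "isfloat|" + repr(value)


def txt2base(text):
    """Decode a 'typetag|text' string back into a Python value."""
    tag, _, payload = text.partition("|")
    return payload if tag == "ischar" else float(payload)


def flatten(value, path="root"):
    """Yield one 'path=content' line per leaf of a nested dict/list tree."""
    if isinstance(value, dict):
        # Sorted keys give a stable line order from one save to the next.
        for key in sorted(value):
            yield from flatten(value[key], f"{path}.{key}")
    elif isinstance(value, list):
        # Array elements carry their 1-based index in the path.
        for index, item in enumerate(value, start=1):
            yield from flatten(item, f"{path}({index})")
    else:
        yield f"{path}={base2txt(value)}"


model = {"t": "Textstring", "f": 3.1416, "s": {"a": 3, "b": 4}}
lines = list(flatten(model))
```

Deserializing is the reverse walk: split each line at the first "=", follow the path to create the nested containers, and store `txt2base` of the remainder at the leaf.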
The big disadvantage is that my implementation, at least, is very slow.
The answer to the second question, whether something like this already exists: I don't know. I searched for a while without finding anything, so I don't think so.
Some guidelines:
The format should be designed so that when only one thing has changed in a model, there is only one corresponding change in the file. Some counterexamples:
- It's no good if the file format uses arbitrary reference IDs that change every time you edit and save the model.
- It's no good if array items are stored with their indices listed explicitly, since inserting items into the middle of an array will cause all the following indices to get shuffled down. That will cause those items to show up in a 'diff' unnecessarily.
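The index problem is easy to demonstrate. The sketch below is illustrative only: the `item[...]=` line format and the helper names are made up, and Python's standard `difflib` stands in for whatever diff tool is actually used. It serializes the same list with and without explicit indices, inserts one element in the middle, and counts the lines a diff would flag.

```python
import difflib


def with_indices(items):
    """Hypothetical format that writes each item's index explicitly."""
    return [f"item[{i}]={name}" for i, name in enumerate(items)]


def without_indices(items):
    """Hypothetical format where order is implied by line position."""
    return [f"item={name}" for name in items]


def changed_lines(old, new):
    """Count the lines that a line-based diff marks as added or removed."""
    return sum(1 for line in difflib.ndiff(old, new) if line[:2] in ("+ ", "- "))


before = ["ClassA", "ClassB", "ClassC", "ClassD"]
after = ["ClassA", "ClassX", "ClassB", "ClassC", "ClassD"]  # one insertion

indexed_changes = changed_lines(with_indices(before), with_indices(after))
plain_changes = changed_lines(without_indices(before), without_indices(after))
print(indexed_changes, plain_changes)
```

Without explicit indices the diff shows only the single inserted line, while with them every line after the insertion point is reported as changed.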
Regarding references: if IDs are created serially, then two people editing the same revision of the model could end up creating new elements with the same ID. This will become a problem when merging.
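One common way around the clash, offered here as an assumption and not something the guidelines above prescribe, is to generate IDs that do not depend on a shared counter, for example random UUIDs. A small Python sketch comparing the two approaches (the class and function names are hypothetical):

```python
import uuid


class SerialIdAllocator:
    """Hands out elem-1, elem-2, ... continuing from the last ID in the file."""

    def __init__(self, last_used):
        self.last_used = last_used

    def new_id(self):
        self.last_used += 1
        return f"elem-{self.last_used}"


def new_random_id():
    """Random 128-bit ID; needs no coordination between editors."""
    return f"elem-{uuid.uuid4()}"


# Two people check out the same revision, whose highest ID is elem-7.
alice = SerialIdAllocator(last_used=7)
bob = SerialIdAllocator(last_used=7)
serial_clash = alice.new_id() == bob.new_id()      # both produce elem-8

# With random IDs a clash is possible in principle but vanishingly unlikely.
random_clash = new_random_id() == new_random_id()
```

The trade-off is readability: random IDs are long and meaningless in a diff, so some formats prefer IDs derived from an element's name or path where those are unique.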