How to store dynamic data (unknown number of fields) to a file?
I need to store some data in a file. FOR THE MOMENT each record (data set) consists of:
- a string (variable length),
- an array of integers (variable length),
- an array of bytes (variable length),
- some integer values.
It wouldn't be difficult at all to save all this stuff in a binary file. However, I know for sure that (unfortunately) my data format will change over time, and I want the possibility to add more fields to each "record". So, obviously, my file format cannot be fixed. I suppose the best solution would be to save my data in a (DB) table, but I don't want to mess with the big guns (SQL, ADO, BDE, Nexus...). I need a rudimentary library (if possible a single PAS file) that can do that. Since the purpose of this is storing data rather than working with data, can it be done without a DB table?
Requirements for this library:
- it needs to easily support more than 1 million rows
- really lightweight
- single PAS file if possible
- MANDATORY: easy to install in a new machine (together with the project in which it compiles)
- MANDATORY: in order to use it I don't need to redistribute anything
- MANDATORY: in order to use it the user doesn't have to install/setup stuff
- can be freeware/shareware
- it doesn't have to support SQL queries or similar advanced stuff
I use Delphi 7.
Take a look at our Synopse Big Table unit.
With its recent upgrade, it could fit your need perfectly.
Here is how you create your field layout:
var
  Table: TSynBigTableRecord;
  FieldText, FieldInt: TSynTableFieldProperties;
begin
  Table := TSynBigTableRecord.Create('FileName.ext', 'TableName');
  FieldText := Table.AddField('text', tftWinAnsi, [tfoIndex]);
  FieldInt := Table.AddField('Int', tftInt32, [tfoIndex, tfoUnique]);
  Table.AddFieldUpdate;
end;
For storing an array of bytes or integers, just use the tftWinAnsi or, even better, the tftBlobInternal kind of field (this is a true variable-length field), then map it to or from a dynamic array, just like a RawByteString.
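For illustration, the mapping can be as simple as the helpers below (the names are mine, not part of the library; under Delphi 7 a plain AnsiString stands in for RawByteString):

type
  TIntegerDynArray = array of Integer;

// pack a dynamic array of Integer into a string-typed blob
function IntArrayToBlob(const A: TIntegerDynArray): AnsiString;
begin
  SetLength(Result, Length(A) * SizeOf(Integer));
  if Length(A) > 0 then
    Move(A[0], Result[1], Length(Result));
end;

// unpack the blob back into a dynamic array of Integer
procedure BlobToIntArray(const Blob: AnsiString; var A: TIntegerDynArray);
begin
  SetLength(A, Length(Blob) div SizeOf(Integer));
  if Length(A) > 0 then
    Move(Blob[1], A[0], Length(A) * SizeOf(Integer));
end;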
You can safely add fields later; the data file will be converted for you.
There are several ways of handling the data, but I've implemented a variant-based way of using a record, with true late-binding:
var
  vari: Variant;
  aID: Integer;
begin
  // initialize the variant
  vari := Table.VariantVoid;
  // create the record content, and add it to the database
  vari.text := 'Some text';
  vari.int := 12345;
  aID := Table.VariantAdd(vari);
  if aID = 0 then
    ShowMessage('Error adding record');
  // how to retrieve it
  vari := Table.VariantGet(aID);
  Assert(vari.ID = aID);
  Assert(vari.INT = 12345);
  Assert(vari.Text = 'Some text');
end;
About speed, you can't find anything faster, IMHO. Creating 1,000,000 records with some text and an integer value, with both fields indexed and the integer field set as unique, takes less than 880 ms on my laptop. It will use very little disk space, because all storage is variable-length encoded (similar to Google's Protocol Buffers).
It needs only two units and works with Delphi 6 up to XE (it is already Unicode ready, so using this unit you could safely upgrade to a newer Delphi version whenever you want). There is nothing to install, and it adds only a few KB to your executable. It's a small but powerful NoSQL engine written in pure Delphi, with the ease of use of a database (i.e. a defined field layout) and the speed of an in-memory engine, with no limit in size.
And it's fully open source, with a permissive license.
Note that we also provide a SQLite3 wrapper, but it's another project. Slower but more powerful, with SQL support and an integrated Client/Server ORM.
Use Synopse BigTable (http://synopse.info/), a key=>value database, where the value in this case is the serialization of your data (JSON, binary, XML, ...).
This is CRAZY fast, lightweight and free.
I don't think you need a database for this. If you use a database, I don't see how it solves the problem of your data structure changing.
I would personally store to the YAML format, which is very easily extensible. That requires quite a bit of work (linking to some libyaml), so a very lightweight alternative would be to store to INI files. These are easily extensible while maintaining compatibility with old files.
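For instance, with the standard TMemIniFile, a record-per-section layout stays compatible in both directions: old readers ignore keys they don't know, and new readers fall back to defaults for missing keys (the section and key names here are made up):

uses
  IniFiles, SysUtils;

procedure SaveRec(const FileName: string; RecNo: Integer;
  const Text: string; Count: Integer);
var
  Ini: TMemIniFile;
  Section: string;
begin
  Ini := TMemIniFile.Create(FileName);
  try
    Section := 'Rec' + IntToStr(RecNo);
    // one section per record; a new field is just a new key
    Ini.WriteString(Section, 'Text', Text);
    Ini.WriteInteger(Section, 'Count', Count);
    Ini.UpdateFile;
  finally
    Ini.Free;
  end;
end;

// reading tolerates older files by supplying defaults:
//   Count := Ini.ReadInteger(Section, 'Count', 0);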
You can quite easily roll your own extensible binary format. Write each record to a block, where each block has a short header that includes its length.
When you read the data, you read up to the end of the block; if you are expecting more data, you simply stop reading and use default values for it. If you have read all the data you know about but are not yet at the end of the block, the file must have come from a later version of your program, and you just skip to the end of the block. Perhaps you warn that the file contained data you didn't know about.
Extensibility is achieved by always writing data out in the same order as previous versions. Any new data goes at the end of each block.
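Here is a minimal sketch of that scheme, assuming each record holds just a name and one integer value (the field layout is only an example):

// each block: [length: Integer][fields written in a fixed order]
procedure WriteBlock(Stream: TStream; const Name: AnsiString; Value: Integer);
var
  StartPos, EndPos: Int64;
  BlockLen, Len: Integer;
begin
  StartPos := Stream.Position;
  BlockLen := 0; // placeholder, fixed up below
  Stream.WriteBuffer(BlockLen, SizeOf(BlockLen));
  Len := Length(Name);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    Stream.WriteBuffer(Name[1], Len);
  Stream.WriteBuffer(Value, SizeOf(Value));
  // fields added in later versions are appended here, at the end
  EndPos := Stream.Position;
  BlockLen := EndPos - StartPos - SizeOf(BlockLen);
  Stream.Position := StartPos;
  Stream.WriteBuffer(BlockLen, SizeOf(BlockLen));
  Stream.Position := EndPos;
end;

procedure ReadBlock(Stream: TStream; out Name: AnsiString; out Value: Integer);
var
  BlockEnd: Int64;
  BlockLen, Len: Integer;
begin
  Stream.ReadBuffer(BlockLen, SizeOf(BlockLen));
  BlockEnd := Stream.Position + BlockLen;
  Name := '';
  Value := 0; // defaults cover fields missing from older files
  if Stream.Position + SizeOf(Len) <= BlockEnd then
  begin
    Stream.ReadBuffer(Len, SizeOf(Len));
    SetLength(Name, Len);
    if Len > 0 then
      Stream.ReadBuffer(Name[1], Len);
  end;
  if Stream.Position + SizeOf(Value) <= BlockEnd then
    Stream.ReadBuffer(Value, SizeOf(Value));
  Stream.Position := BlockEnd; // skip fields added by newer versions
end;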
In order of implementation effort, I suggest:
1. CSV or INI files (TMemIniFile, or TJvCsvDataSet). This is the least work for you. You can support millions of lines in a single file, but the memory consumed will be enormous. I have envisioned a "data writer" component to replace my TJvCsvDataSet with something that only appends records and does not load them into memory. This would let you write out to CSV files, and even read them back row by row, without loading them all at once. This approach might be ideal for you: a simple CSV reader/writer class that is NOT a dataset object.
2. One-XML-tag-per-line files. This is more flexible than INI files and can be hierarchical, whereas INI files are not. Neither SAX nor DOM is required if you simply open a file stream and append a single line of text of this form, ending in CR+LF (see the sketch after this list):
<logitem attrib1="value1" attrib2="value2" />
3. Some kind of binary NoSQL DB, like bsddb, CouchDB, etc.
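Here is a sketch of the append-one-tag-per-line idea from item 2 (the names are illustrative, and real code should also escape &, < and " inside the attribute values):

procedure AppendLogItem(const FileName, Value1, Value2: string);
var
  FS: TFileStream;
  Line: string;
begin
  if FileExists(FileName) then
    FS := TFileStream.Create(FileName, fmOpenWrite or fmShareDenyWrite)
  else
    FS := TFileStream.Create(FileName, fmCreate);
  try
    FS.Seek(0, soFromEnd); // always append at the end
    Line := '<logitem attrib1="' + Value1 + '" attrib2="' + Value2 + '" />' + #13#10;
    FS.WriteBuffer(Line[1], Length(Line));
  finally
    FS.Free;
  end;
end;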
Any time you need to store variable length data into a binary format, you should store the data's length in front of the actual data.
Since you also need to add new fields later, you should store the number of fields per record (or at least an end-of-record marker at the end) so you can maintain the correct position while moving around the file during reading and seeking operations.
As for the actual record data, I would suggest a type-length-data format for each field, so you can add new fields without knowing their data types ahead of time, and so code can recognize and read/skip individual fields as needed regardless of content (i.e., if an old app tries to read a file with newer fields, it can skip what it does not recognize).
You will end up with something like this as a start, which you can then expand and optimize as needed:
const
  cTypeUnknown = $00;
  cTypeString = $01;
  cTypeInteger = $02;
  cTypeByte = $03;
  cTypeArray = $80;
  cTypeStringArray = cTypeString or cTypeArray;
  cTypeIntegerArray = cTypeInteger or cTypeArray;
  cTypeByteArray = cTypeByte or cTypeArray;
type
  Streamable = class
  public
    procedure Read(Stream: TStream); virtual; abstract;
    procedure Write(Stream: TStream); virtual; abstract;
  end;

  Field = class(Streamable)
  public
    function GetType: Byte; virtual; abstract;
  end;

  FieldClass = class of Field;

  StringField = class(Field)
  public
    Data: String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  StringArrayField = class(Field)
  public
    Data: array of String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerField = class(Field)
  public
    Data: Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerArrayField = class(Field)
  public
    Data: array of Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteField = class(Field)
  public
    Data: Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteArrayField = class(Field)
  public
    Data: array of Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;
  AnyField = class(ByteArrayField)
  public
    FieldType: Byte; // 'Type' is a reserved word in Delphi
    function GetType: Byte; override;
  end;

  TRecord = class(Streamable) // 'Record' is a reserved word in Delphi
  public
    Fields: array of Field;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  RecordArray = class(Streamable)
  public
    Records: array of TRecord;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;
// WriteBuffer/ReadBuffer take untyped parameters, so the
// values are passed directly, not through pointers
procedure WriteByte(Stream: TStream; Value: Byte);
begin
  Stream.WriteBuffer(Value, SizeOf(Byte));
end;

function ReadByte(Stream: TStream): Byte;
begin
  Stream.ReadBuffer(Result, SizeOf(Byte));
end;

procedure WriteInteger(Stream: TStream; Value: Integer);
begin
  Stream.WriteBuffer(Value, SizeOf(Integer));
end;

function ReadInteger(Stream: TStream): Integer;
begin
  Stream.ReadBuffer(Result, SizeOf(Integer));
end;
procedure WriteString(Stream: TStream; Value: String);
var
  S: UTF8String;
begin
  S := UTF8Encode(Value);
  WriteInteger(Stream, Length(S));
  if Length(S) > 0 then
    Stream.WriteBuffer(S[1], Length(S));
end;

function ReadString(Stream: TStream): String;
var
  S: UTF8String;
begin
  SetLength(S, ReadInteger(Stream));
  if Length(S) > 0 then
    Stream.ReadBuffer(S[1], Length(S));
  Result := UTF8Decode(S);
end;
function StringField.GetType: Byte;
begin
  Result := cTypeString;
end;

procedure StringField.Read(Stream: TStream);
begin
  Data := ReadString(Stream);
end;

procedure StringField.Write(Stream: TStream);
begin
  WriteString(Stream, Data);
end;
function StringArrayField.GetType: Byte;
begin
  Result := cTypeStringArray;
end;

procedure StringArrayField.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Data, ReadInteger(Stream));
  for I := 0 to High(Data) do
    Data[I] := ReadString(Stream);
end;

procedure StringArrayField.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Data));
  for I := 0 to High(Data) do
    WriteString(Stream, Data[I]);
end;
function IntegerField.GetType: Byte;
begin
  Result := cTypeInteger;
end;

procedure IntegerField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Integer));
  Data := ReadInteger(Stream);
end;

procedure IntegerField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Integer));
  WriteInteger(Stream, Data);
end;
function IntegerArrayField.GetType: Byte;
begin
  Result := cTypeIntegerArray;
end;

procedure IntegerArrayField.Read(Stream: TStream);
var
  NumBytes: Integer;
begin
  // the length prefix is a byte count, so readers that do not
  // know this field type can still skip it correctly
  NumBytes := ReadInteger(Stream);
  Assert((NumBytes mod SizeOf(Integer)) = 0);
  SetLength(Data, NumBytes div SizeOf(Integer));
  if Length(Data) > 0 then
    Stream.ReadBuffer(Data[0], NumBytes);
end;

procedure IntegerArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data) * SizeOf(Integer));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data) * SizeOf(Integer));
end;
function ByteField.GetType: Byte;
begin
  Result := cTypeByte;
end;

procedure ByteField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Byte));
  Data := ReadByte(Stream);
end;

procedure ByteField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Byte));
  WriteByte(Stream, Data);
end;
function ByteArrayField.GetType: Byte;
begin
  Result := cTypeByteArray;
end;

procedure ByteArrayField.Read(Stream: TStream);
begin
  SetLength(Data, ReadInteger(Stream));
  if Length(Data) > 0 then
    Stream.ReadBuffer(Data[0], Length(Data));
end;

procedure ByteArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data));
end;
function AnyField.GetType: Byte;
begin
  Result := FieldType;
end;
procedure TRecord.Read(Stream: TStream);
const
  PlainTypes: array[cTypeString..cTypeByte] of FieldClass =
    (StringField, IntegerField, ByteField);
  ArrayTypes: array[cTypeString..cTypeByte] of FieldClass =
    (StringArrayField, IntegerArrayField, ByteArrayField);
var
  I: Integer;
  RecType, PlainType: Byte;
begin
  SetLength(Fields, ReadInteger(Stream));
  for I := 0 to High(Fields) do
  begin
    RecType := ReadByte(Stream);
    PlainType := RecType and (not cTypeArray);
    if (PlainType >= cTypeString) and (PlainType <= cTypeByte) then
    begin
      if (RecType and cTypeArray) <> cTypeArray then
        Fields[I] := PlainTypes[PlainType].Create
      else
        Fields[I] := ArrayTypes[PlainType].Create;
    end else
    begin
      // unknown field type: read it as a raw byte array so it can be
      // skipped, and keep the type byte so it round-trips on Write
      Fields[I] := AnyField.Create;
      AnyField(Fields[I]).FieldType := RecType;
    end;
    Fields[I].Read(Stream);
  end;
end;
procedure TRecord.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Fields));
  for I := 0 to High(Fields) do
  begin
    WriteByte(Stream, Fields[I].GetType);
    Fields[I].Write(Stream);
  end;
end;
procedure RecordArray.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Records, ReadInteger(Stream));
  for I := 0 to High(Records) do
  begin
    Records[I] := TRecord.Create;
    Records[I].Read(Stream);
  end;
end;
procedure RecordArray.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Records));
  for I := 0 to High(Records) do
    Records[I].Write(Stream);
end;
I can think of something like one INI or XML file per "record", with the INI files stored in a virtual file system like our SolFS. I don't know, however, what you mean by "easily support more than 1 million rows", i.e. what operations must be supported. If you plan mainly random access to a smaller number of records, then parsing the text files is not a big deal. Otherwise you might want to look at some binary format, such as binary XML. I can say that the TMCDataTree class of our other product, MsgConnect, supports hierarchical INI files saved in binary format.