Problems with portability: aligning data, endianness issues, etc
I'm writing a toy database management system, and running up against some alignment and endianness issues.
First, allow me to explain the data being stored and where it is stored. Some definitions first: the layout of a record is broken up into a Record Directory and Record Data.
[Field count=N] [Field offset[0]] [...] [Field offset[N-1]] [Data for fields 0 to N]
The field count and offsets combined are called the Record Directory.
The data is called the Record Data.
The field count is of type uint16_t. Each field offset is of type uint16_t. The data fields can be treated as a variable-length byte buffer pointed to by (uint8_t *), with a length of at least N bytes. The field count cannot exceed 4095 (0x0FFF).
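To make the layout concrete, here is a minimal C sketch of reading the record directory. The helper names (record_field_count, record_field_offset, record_field) are my own, and I assume values are stored in host byte order; memcpy is used to avoid unaligned loads:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the record layout described above:
   [field count] [offset 0] ... [offset N-1] [field data]
   All multi-byte values are assumed to be in host byte order. */

/* Read the field count from the start of a record buffer. */
static uint16_t record_field_count(const uint8_t *record) {
    uint16_t n;
    memcpy(&n, record, sizeof n);   /* memcpy tolerates unaligned buffers */
    return n;
}

/* Read field i's offset from the record directory. */
static uint16_t record_field_offset(const uint8_t *record, uint16_t i) {
    uint16_t off;
    memcpy(&off, record + sizeof(uint16_t) * (1u + i), sizeof off);
    return off;
}

/* Pointer to field i's data within the record. */
static const uint8_t *record_field(const uint8_t *record, uint16_t i) {
    return record + record_field_offset(record, i);
}
```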
The records are stored in a Page:
Pages are of size: 4096 bytes.
Pages need to store 2 bytes of data for each record.
The last 6 bytes of the page stores the running free space offset, and data for a slot directory. The metadata is irrelevant to the question, so I will not bore anyone with the details.
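For reference, a small sketch of the page-size arithmetic under the numbers above (4096-byte page, 2 bytes of slot data per record, 6 bytes of trailing metadata). page_free_bytes is a hypothetical helper, not part of our actual code:

```c
#include <stdint.h>

/* Hypothetical constants matching the page layout described above. */
#define PAGE_SIZE        4096u
#define SLOT_ENTRY_SIZE  2u     /* 2 bytes of slot data per record */
#define PAGE_FOOTER_SIZE 6u     /* free-space offset + slot-directory metadata */

/* Usable bytes left for record data, given the number of records on the
   page and the current running free-space offset. */
static uint32_t page_free_bytes(uint16_t record_count, uint16_t free_offset) {
    uint32_t reserved = PAGE_FOOTER_SIZE + SLOT_ENTRY_SIZE * (uint32_t)record_count;
    uint32_t limit = PAGE_SIZE - reserved;   /* first byte taken by slots/footer */
    return free_offset < limit ? limit - free_offset : 0;
}
```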
We store records on the page by writing them at the running free space offset and then advancing it. Records can later be altered or deleted, which leaves unused space fragments on the page. This space is not reused until compaction.
At the moment, we store a fragment byte of 0x80 in unused space (since the field count cannot exceed 0x0FFF, the first byte of a record will never be 0x80 on a big-endian machine).
However, this becomes a problem at compaction time. We scan until we hit the first byte that is not 0x80 and treat that as the start of the free space. Unfortunately, this is not portable and only works on big-endian machines.
To restate the issue: the problem is distinguishing between the byte sequences 0x80 0x80 0x00 and 0x80 0x00 0x80, where the leading two bytes, interpreted as a uint16_t, may decode to a valid field count on one endianness and a fragment marker on the other.
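A tiny sketch of why the 0x80 marker is ambiguous: the same two bytes decode to different values depending on byte order. load_le16 and load_be16 are illustrative helpers, not part of our code:

```c
#include <stdint.h>

/* Decode two bytes as a little-endian and a big-endian uint16_t,
   to illustrate the ambiguity of a 0x80 leading byte. */
static uint16_t load_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

static uint16_t load_be16(const uint8_t *p) {
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}
```

The pair {0x80, 0x00} reads as 128 on a little-endian machine, a perfectly valid field count, but as 0x8000 on a big-endian one, which can never be a field count since counts stay at or below 0x0FFF.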
I want to try aligning records on even bytes, but I don't have the foresight to see whether this would be a correct workaround for the issue.
At any given time, the free space offset should always sit on an even byte boundary. This means after inserting a record, you advance the free space pointer to the next even boundary.
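Rounding the free space offset up to an even boundary is a one-liner; align_up2 here is a hypothetical helper name:

```c
#include <stdint.h>

/* Round an offset up to the next 2-byte boundary, so every record
   (and every fragment marker) starts on an even byte. */
static uint16_t align_up2(uint16_t off) {
    return (uint16_t)((off + 1u) & ~1u);
}
```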
The problem then becomes one of marking the fragments. Fragments are created when a record is deleted or altered (growing or shrinking by some number of bytes). I wanted to store what I would call 2-byte fragment markers, 0xFFFF, but that doesn't seem possible when altering.
This is where I'm stuck. Sorry for the long-winded explanation. My partner and I (this is an academic assignment) have battled this data-ambiguity problem several times, and it keeps resurfacing under different solutions.
Any insight would help. I hope the problem statement can be followed.
I would try this:
- Align records to at least 2-byte boundaries.
- Scan the free space as an array of uint16_t rather than char, and look for length & 0x8000.
If you let the machine interpret integers as such instead of trying to scan them as characters, endianness shouldn't be an issue here (at least until you want to read your database on a different machine than the one that wrote it).
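As a sketch of that idea (an assumed helper, not tested against your exact layout): with everything 2-byte aligned, read each word in host order and test the high bit, which a valid field count can never have since counts stay at or below 0x0FFF:

```c
#include <stdint.h>
#include <string.h>

/* Return nonzero if the 16-bit word at (even) offset `off` in the page
   is a fragment marker, i.e. has its high bit set. Reading the word as
   a native uint16_t sidesteps the byte-order ambiguity entirely. */
static int is_fragment_word(const uint8_t *page, uint16_t off) {
    uint16_t w;
    memcpy(&w, page + off, sizeof w);   /* off must be even */
    return (w & 0x8000u) != 0;
}
```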