
Storing multiple messages in one protocol buffer binary file

I have repeating messages which I want to store in a single file. Currently I have to wrap this repeating message in another message. Is there a way around this?

package foo;

message Box {
  required int32 tl_x = 1;
  required int32 tl_y = 2;
  required int32 w = 3;
  required int32 h = 4;
}

message Boxes {
  repeated Box boxes = 1;
}


Here's what "Techniques" section of the Protocol Buffers documentation says about repeated messages:

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)

There's also a conventional way of implementing this in C++ and Java. Take a look at this Stack Overflow thread for details: Are there C++ equivalents for the Protocol Buffers delimited I/O functions in Java?
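In Java, that length-prefix approach might look roughly like the sketch below. This is only an illustration: the class and file names are made up, Box is the class generated from the .proto in the question, and it uses protobuf-java's CodedOutputStream / CodedInputStream as the quoted documentation suggests.

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.CodedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedBoxes {

    // Write each Box prefixed with its serialized size (encoded as a varint).
    static void write(List<Box> boxes, File file) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(file)) {
            CodedOutputStream out = CodedOutputStream.newInstance(fos);
            for (Box box : boxes) {
                out.writeUInt32NoTag(box.getSerializedSize());
                box.writeTo(out);
            }
            out.flush();
        }
    }

    // Read the size, then limit the parser to exactly that many bytes,
    // so no intermediate copy of the message bytes is needed.
    static List<Box> read(File file) throws IOException {
        List<Box> boxes = new ArrayList<>();
        try (FileInputStream fis = new FileInputStream(file)) {
            CodedInputStream in = CodedInputStream.newInstance(fis);
            while (!in.isAtEnd()) {
                int size = in.readRawVarint32();
                int limit = in.pushLimit(size);
                boxes.add(Box.parseFrom(in));
                in.popLimit(limit);
            }
        }
        return boxes;
    }
}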


Protobuf doesn't support this functionality directly. It can serialize a single message, but the serialized bytes don't carry the message's type (Box or Boxes) or its length. So if you want to store multiple messages, you have to write the type and length of each message yourself. A write algorithm (in pseudocode) could look like this (a Java sketch of both algorithms follows the pseudocode):

for every message {
    write(type_of_message) // 1 byte long
    write(length_of_serialized_message) // 4 bytes long
    write(serialized_message)
}

Load algorithm:

while(!end_of_file) {

    type = read(1) // 1 byte
    length = read(4) // 4 bytes
    buffer = read(length)
    switch (type) {
      case 1:
         deserialise_message_1(buffer)
      case 2:
         deserialise_message_2(buffer)
    }
}
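For example, here is a rough Java sketch of both algorithms. The type tags (1 = Box, 2 = Boxes), the class name and the helper names are just an illustrative convention, not anything prescribed by protobuf.

import com.google.protobuf.Message;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class TypedMessageFile {

    // Write: 1-byte type tag, 4-byte big-endian length, then the payload.
    static void writeMessage(DataOutputStream out, int type, Message message) throws IOException {
        byte[] payload = message.toByteArray();
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read: loop until the stream is exhausted, dispatching on the type tag.
    static void readAll(DataInputStream in) throws IOException {
        while (true) {
            int type;
            try {
                type = in.readUnsignedByte();
            } catch (EOFException e) {
                break; // clean end of file
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            switch (type) {
                case 1: {
                    Box box = Box.parseFrom(payload);
                    // handle box
                    break;
                }
                case 2: {
                    Boxes boxes = Boxes.parseFrom(payload);
                    // handle boxes
                    break;
                }
                default:
                    throw new IOException("unknown message type " + type);
            }
        }
    }
}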


I was just working on this problem and ended up going with Parquet. Parquet works perfectly for storing a bunch of Protobuf messages in a file and makes it easier to work with them later on.

This bit of code will create the Parquet file:

// Imports assuming the Apache Parquet artifacts (parquet-protobuf);
// older releases shipped these classes under the parquet.* packages instead.
import com.google.protobuf.Message;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;

Path path = new Path("/tmp/mydata.parq");
CompressionCodecName codecName = CompressionCodecName.SNAPPY;
int blockSize = 134217728;   // 128 MB row groups
int pageSize = 1048576;      // 1 MB pages
boolean enableDictionary = true;
boolean validating = false;

ProtoParquetWriter<Message> writer
    = new ProtoParquetWriter<>(
        path,
        Box.class,
        codecName,
        blockSize,
        pageSize,
        enableDictionary,
        validating
    );

for (Message message : messages) {
    writer.write(message);
}

writer.close();
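For reading the file back, parquet-protobuf also provides a ProtoParquetReader. A rough sketch, assuming the reader hands back message builders rather than built messages (the exact API may differ between versions):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.proto.ProtoParquetReader;

ParquetReader<Box.Builder> reader =
    ProtoParquetReader.<Box.Builder>builder(new Path("/tmp/mydata.parq")).build();

Box.Builder builder;
while ((builder = reader.read()) != null) {
    Box box = builder.build();
    // use box
}
reader.close();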

It might not suit your use case but I thought that it was worth a mention here.


In Java you can use delimited messages (writeDelimitedTo / parseDelimitedFrom). For C++, see Are there C++ equivalents for the Protocol Buffers delimited I/O functions in Java?

Basically, in C++, following that thread (includes and a concrete Box instance added here; the generated header name depends on your .proto file name):

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>
#include "foo.pb.h"  // header generated from the .proto above; actual name depends on your proto file

const unsigned bufLength = 256;
unsigned char buffer[bufLength];
foo::Box protoMessage;  // a concrete generated message; the abstract Message base class cannot be instantiated

google::protobuf::io::ArrayOutputStream arrayOutput(buffer, bufLength);
google::protobuf::io::CodedOutputStream codedOutput(&arrayOutput);

// Write the length prefix first, then the message itself.
codedOutput.WriteLittleEndian32(protoMessage.ByteSize());
protoMessage.SerializeToCodedStream(&codedOutput);

and in Python you would need to work out the framing yourself.
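For completeness, here is roughly what the delimited approach looks like in Java. The class, method, and file names are just for illustration; Box is the class generated from the question's .proto.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DelimitedBoxes {

    // writeDelimitedTo prefixes each message with its length as a varint.
    static void write(List<Box> boxes, String file) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            for (Box box : boxes) {
                box.writeDelimitedTo(out);
            }
        }
    }

    // parseDelimitedFrom returns null once the stream is exhausted.
    static List<Box> read(String file) throws IOException {
        List<Box> boxes = new ArrayList<>();
        try (FileInputStream in = new FileInputStream(file)) {
            Box box;
            while ((box = Box.parseDelimitedFrom(in)) != null) {
                boxes.add(box);
            }
        }
        return boxes;
    }
}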
