Good conventions for embedding schema of a flat file

2022-12-23 11:26

We receive lots of data as flat files: delimited or fixed-length records. It's sometimes hard to find out what the files actually contain.

Are there any well-established practices for embedding the schema at the beginning or the end of a file to make the file self-explanatory?

Just to get an idea, imagine something like this:

<data name="test" records="2" type="fixed">
   <field name="foo" start="0" length="2" type="numeric"/>
   <field name="bar" start="2" length="4" type="text"/>
</data>
11test
12ing

We would parse the XML at the beginning and use it to read the records.
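A minimal sketch of that reader, assuming the header is written as well-formed XML (quoted attribute values, self-closing field elements) and ends with a line containing only `</data>`. The function name `read_self_describing` and the choice to convert `numeric` fields to `int` are illustrative assumptions, not part of any standard.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the XML header followed by two fixed-length records.
SAMPLE = (
    '<data name="test" records="2" type="fixed">\n'
    '   <field name="foo" start="0" length="2" type="numeric"/>\n'
    '   <field name="bar" start="2" length="4" type="text"/>\n'
    '</data>\n'
    '11test\n'
    '12ing\n'
)

def read_self_describing(stream):
    """Read the XML header up to </data>, then slice each record by the field specs."""
    header_lines = []
    for line in stream:
        header_lines.append(line)
        if line.strip() == "</data>":
            break
    root = ET.fromstring("".join(header_lines))
    fields = [
        (f.get("name"), int(f.get("start")), int(f.get("length")), f.get("type"))
        for f in root.findall("field")
    ]
    records = []
    for line in stream:  # continues right after the header
        line = line.rstrip("\n")
        if not line:
            continue
        record = {}
        for name, start, length, ftype in fields:
            raw = line[start:start + length]
            # Assumption: "numeric" means an integer; text fields drop padding.
            record[name] = int(raw) if ftype == "numeric" else raw.rstrip()
        records.append(record)
    return dict(root.attrib), records

meta, records = read_self_describing(io.StringIO(SAMPLE))
```

Here `meta` holds the attributes of the `data` element and `records` is a list of dictionaries keyed by field name.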

As far as I'm aware, no, or at least nothing widely adopted.

The only widely accepted convention I know of is to make the first row of the data file the column names, at least for delimited records. For fixed-length files it's harder, especially if your data can contain multiple record types (which I've found to be far more likely with fixed-length than with delimited files).

From where I sit, I'd suggest that you can't really embed the definition in the file. I'm assuming you're getting the data from external sources, so you're unlikely to get help from them, and even if you do, you immediately create new challenges: for example, you can no longer easily open the files in Excel if necessary.

Thinking a bit laterally, you could, if using XML, embed the file in the definition (as one big lump of CDATA). This is slightly more practical because it puts a wrapper around your external data rather than asking that the data itself be modified. I'm not sure how practical this is, but it feels better to me than the other way round.
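For delimited files, the header-row convention can be read directly with Python's standard `csv` module. This is only a sketch with made-up sample data; note that the header carries column names but no types, so every value comes back as a string.

```python
import csv
import io

# Hypothetical sample: the first row holds the column names.
SAMPLE_CSV = "foo,bar\n11,test\n12,ing\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
column_names = reader.fieldnames  # taken from the first row
rows = list(reader)               # one dict per data row, all values are strings
```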
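A rough sketch of the wrapper idea, using only the Python standard library. The element names (`data`, `field`, `records`) and the helper names `wrap_flat_file` and `unwrap_flat_file` are assumptions made for illustration. One known limitation: a payload containing `]]>` cannot go into a single CDATA section, so the sketch rejects it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def wrap_flat_file(payload, name, fields):
    """Wrap raw flat-file text in an XML envelope that carries the field layout."""
    if "]]>" in payload:
        raise ValueError("payload contains ']]>' and cannot go in one CDATA section")
    lines = ["<data name=%s type=\"fixed\">" % quoteattr(name)]
    for fname, start, length, ftype in fields:
        lines.append("  <field name=%s start=%s length=%s type=%s/>" % (
            quoteattr(fname), quoteattr(str(start)),
            quoteattr(str(length)), quoteattr(ftype)))
    # The original file content goes in untouched, inside CDATA.
    lines.append("  <records><![CDATA[%s]]></records>" % payload)
    lines.append("</data>")
    return "\n".join(lines)

def unwrap_flat_file(document):
    """Return (field layout, original payload) from a wrapped document."""
    root = ET.fromstring(document)
    layout = [
        (f.get("name"), int(f.get("start")), int(f.get("length")), f.get("type"))
        for f in root.findall("field")
    ]
    return layout, root.find("records").text

FIELDS = [("foo", 0, 2, "numeric"), ("bar", 2, 4, "text")]
PAYLOAD = "11test\n12ing\n"

wrapped = wrap_flat_file(PAYLOAD, "test", FIELDS)
layout, recovered = unwrap_flat_file(wrapped)
```

The external file stays byte-for-byte what the supplier sent (apart from XML line-ending normalization), and the wrapper can be stripped again whenever the plain file is needed.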

Have you looked at Protocol Buffers for inspiration?

I don't know about any established practice, but your idea of just prepending the schema to the data seems fine. Apache Avro is a data serialization tool similar to Protocol Buffers and Thrift. Avro's object container files store the schema in the file header, ahead of the data, so each file is self-describing.
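To illustrate the schema-first idea without depending on an Avro library, here is a small sketch that writes a JSON schema as the first line of a stream and the records after it. This is not Avro's actual container format; the layout, the schema keys, and the function names `write_with_schema` and `read_with_schema` are assumptions chosen for the example.

```python
import io
import json

def write_with_schema(stream, schema, records):
    """Write the schema as the first line, then one JSON array per record."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        values = [record[f["name"]] for f in schema["fields"]]
        stream.write(json.dumps(values) + "\n")

def read_with_schema(stream):
    """Read the schema line first and use it to name the values of each record."""
    schema = json.loads(stream.readline())
    names = [f["name"] for f in schema["fields"]]
    records = [dict(zip(names, json.loads(line))) for line in stream if line.strip()]
    return schema, records

SCHEMA = {"name": "test", "fields": [{"name": "foo", "type": "int"},
                                     {"name": "bar", "type": "string"}]}
RECORDS = [{"foo": 11, "bar": "test"}, {"foo": 12, "bar": "ing"}]

buffer = io.StringIO()
write_with_schema(buffer, SCHEMA, RECORDS)
buffer.seek(0)
schema_back, records_back = read_with_schema(buffer)
```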

I also wanted to mention the PADS project. They have a schema language designed to let you describe "ad hoc" data formats. I believe they currently only have C and ML implementations, which may be a problem. On the other hand, their schema language was designed to handle a wide variety of formats, so it still might be worth using over your own XML-based format.
