开发者

What's the best separator/delimiter character(s) for a plaintext db file? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.

Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.

Closed 3 years ago.

Improve this question

What's the best separator/delimiter character(s) for a plaintext db file?

I considered using |, ,, <TAB>, ;, etc. But they all seem to be possible to break when the nearby entries have special enough characters.

So, the experienced database users, what delimite开发者_开发问答r character(s) do you suggest to use?


Well, there are few separator characters in US-ASCII, hex 1c, 1d, 1e and 1f. The plain text shouldn't contain them.

1c  FS  ␜  ^\  File Separator
1d  GS  ␝  ^]  Group Separator
1e  RS  ␞  ^^  Record Separator
1f  US  ␟  ^_  Unit Separator


No matter which character you choose as your separator, you'll want to escape any instance of that character in your data.

Perhaps tilde(~), or go to a high-ASCII character.

Either way, if there's any chance that it could sneak into your data, you'd want to escape it before writing to your plaintext file.


I think the best way to join string with a three cherries '@@@'.


For a particular data warehousing situation where we had control over the source file, but escaping and qualifying were onerous, we were able to make the business decision that one extended ASCII character would be stripped from the data (if it ever occurs, which it hasn't).

On creation of the delimited source file, we stripped out any instances of █ (alt+219) in the data and use that character for the delimiter. Bonus, that character is really easy to spot.


Actually, it depends on the type of data you are trying to separate, we needed a separator for the machine events data and a couple of them were proposed:

=) or ^_^.

We chose ^_^ because it actually worked based on the number of samples tested and it also looks cute!


Personally I like using « as a delimiter character to split data in CSV files, I don't think I've ever found a naturally occurring instance of « and » personally, so here are my two cents about it.


I usually prefer non-printable characters like "\u0001", for instance I use this as a column delimiter in most of my Azure Data Analytics U-SQL Scripts. That is assuming you can use a multi-character custom delimiter


You could use the special separator characters (hex 1c -> 1f), yet they are non-printable, and some technologies have issues processing data containing them.

So, plan B, if your data is in UTF-8, you could pick a random UTF-8 character that is extremely unlikely to appear in any source data you receive.

Yet, even then, if you want to be sure you'll not run into issues, you better always scan your entire dataset for this character, and if it appears, simply pick another UTF-8 character.

I tend to hate encapsulation with a passion, and avoid it whenever possible, as explained in my post under the chapter 'encapsulation' here: https://theonemanitdepartment.wordpress.com/2014/12/15/the-absolute-minimum-everyone-working-with-data-absolutely-positively-must-know-about-file-types-encoding-delimiters-and-data-types-no-excuses/


If you can't control the data being put into it, don't use a plain text db. There can be no generally right answer here. Without context or constraints this is a false question.

To wit: If I said I was only going to accept lower case letters as data, I could use any other symbol as a separator. Even, say, the number 9, and I'd be fine. No symbol other than a lower case character would be better than any other.

Conversely, if said I could accept any character, then I don't have any characters left for a separator, and I'd be left with a very sorry database that could only store a single value.

If you have to try too hard to get your db into plain text, you probably want a binary db. Have you looked at sqlite? It's pretty darned easy to use, is available in many contexts, and comes with a ton of benefits over a plain text db.


If you have the option of a string as column separator, use "" as delimiter. You can make up any string for that matter and gives you flexibility.


I've used an ePUB convertor before and the delimiter char was the notational quote character, anywhere it had been used it would be rewritten to file as @, simple but effective even if it did destroy the sample material being produced.


I propose the interrobang character "‽". More details: https://en.wikipedia.org/wiki/Interrobang

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜