Understanding this CSV header

2023-02-03 08:30 Q&A author:

I need to parse a CSV file which has this header:

Company;Registered office;Notifying party;Domicile or Registered office;Holdings of voting rights;;;;;;Publication

;;;;directly held;;additionally counted;;total;;in Germany;;in foreign countries

;;;;percentage;single rights;percentage;single rights;percentage;single rights;Official stock exchange

I was wondering whether this is a standard header format, because I expected to have all the fields listed one after another, like (in the first row) "Holdings of voting rights-directly held-percentage;Holdings of voting rights-directly held-single rights", while I see that information spread over three lines.

Currently my file has 6 header lines (the three shown and another three in a different language). How can I detect it if one day more header lines are added? The file continues with the following line (the first data line) and so on. The first line of real data isn't always the same.

BBS Kraftfahrzeugtechnik AG;Schiltach;Baumgartner, Heinrich;Deutschland;62,5;;37,5;;100,0;;Börsenzeitung;04.04.2002

I'm also looking for Java libraries that can parse CSV files.

I disagree with those who claim that only commas are allowed. Wikipedia, for example, gives the case of German CSV, which uses semicolons as the field separator (because commas are used as the decimal separator). I think MS Excel is also pretty flexible about which delimiters it accepts. It's just programmers' minds that tend to gravitate towards the most simplistic case.
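To illustrate the semicolon and decimal-comma combination, here is a minimal sketch. The class name GermanCsvLine is made up for this example, and it assumes fields are never quoted and never contain a semicolon themselves.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

final class GermanCsvLine {
    // Split on semicolons; the -1 limit keeps trailing empty fields.
    // Assumption: no quoted fields and no semicolons inside values.
    static String[] split(String line) {
        return line.split(";", -1);
    }

    // Parse a number written with a decimal comma, e.g. "62,5".
    static double parseGermanDecimal(String field) {
        try {
            return NumberFormat.getInstance(Locale.GERMANY)
                    .parse(field.trim())
                    .doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a German-style number: " + field, e);
        }
    }
}
```

Applied to the sample data line from the question, split yields 12 fields, and the field at index 4 ("62,5") parses to 62.5.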

For CSV parsing I recommend Ostermiller Utils.

Q> how can I detect, if a day they'll add some more header lines?
A> You can't. The only things you can rely on are either a dynamic layout (where you know the column names in advance) or a static layout (where you assume that a given column is always the n-th).
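For the dynamic-layout case, a rough sketch could look up a column's position by its known name instead of hard-coding the index. ColumnLocator is a hypothetical helper, and it assumes the name sits in a single header line and is separated by semicolons.

```java
final class ColumnLocator {
    // Returns the zero-based index of the named column in one header line,
    // or -1 if the name does not occur. Comparison ignores case and
    // surrounding whitespace.
    static int indexOf(String headerLine, String columnName) {
        String[] cells = headerLine.split(";", -1);
        for (int i = 0; i < cells.length; i++) {
            if (cells[i].trim().equalsIgnoreCase(columnName.trim())) {
                return i;
            }
        }
        return -1;
    }
}
```

With the first header line from the question, "Notifying party" is found at index 2. A result of -1 is a hint that the layout has changed.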

Despite CSV (Comma Separated Values) files having the word comma in their name, I've seen some very weird stuff in the enterprise world.

I would suggest creating your own representation of the data. It sounds like you may be reading in multiple files all formatted a bit differently?

I would approach the problem in a modular fashion. Have importers for the different formats, and bring the data into a normalized representation that you then do what you want with.
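As a rough sketch of that idea: one importer interface, one implementation per file format, and a normalized record type. All names here (Holding, HoldingImporter, SemicolonHoldingImporter) and the column positions are assumptions made up for illustration, based on the sample line in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Normalized record; the class and field names are hypothetical.
final class Holding {
    final String company;
    final String notifyingParty;
    final double totalPercentage;

    Holding(String company, String notifyingParty, double totalPercentage) {
        this.company = company;
        this.notifyingParty = notifyingParty;
        this.totalPercentage = totalPercentage;
    }
}

// One implementation per incoming file format.
interface HoldingImporter {
    List<Holding> importLines(List<String> lines);
}

// Importer for the semicolon-separated layout shown in the question.
final class SemicolonHoldingImporter implements HoldingImporter {
    private final int headerLineCount;

    SemicolonHoldingImporter(int headerLineCount) {
        this.headerLineCount = headerLineCount;
    }

    @Override
    public List<Holding> importLines(List<String> lines) {
        List<Holding> result = new ArrayList<>();
        for (int i = headerLineCount; i < lines.size(); i++) {
            String[] f = lines.get(i).split(";", -1);
            // Skip blank or short lines and rows without a total percentage.
            if (f.length < 9 || f[8].trim().isEmpty()) {
                continue;
            }
            // Assumed positions: 0 = company, 2 = notifying party, 8 = total percentage.
            // Assumes a decimal comma and no thousands separators.
            double total = Double.parseDouble(f[8].trim().replace(',', '.'));
            result.add(new Holding(f[0].trim(), f[2].trim(), total));
        }
        return result;
    }
}
```

The rest of the program then only depends on Holding and HoldingImporter, so a change in the file layout stays confined to one importer class.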

This is all assuming that these files contain the same type of data and that you have no control over the files you are receiving.

Even if this is not the case, abstracting the data from its representation and putting that in a separate project would be useful.

I would also recommend the use of OpenCSV.

Yes, you have a legitimate CSV file. I read it into Excel successfully, and suspect I would have no problem with OpenOffice. For Excel, I saved it as a .txt file, but then had to tell Excel in the opening dialogue that it was delimited by semicolons.

This is "standard" in the sense that it is separating columns by a delimiter (semicolons are OK, as are tabs and of course commas) and rows by new lines.

The reason you were given this format is that the second and third header lines don't map one-to-one onto the first line. "Holdings of voting rights" spans 6 columns. Underneath it, on the second header line, "directly held" spans 2 columns, as do "additionally counted" and "total". The third header line breaks each entry of the second header line down into "percentage" and "single rights".
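If you want the single-line header you expected, the three lines can be merged per column. The sketch below is only one possible heuristic, with a made-up class name HeaderFlattener: in every header line except the last, an empty cell is assumed to continue the span of the nearest non-empty cell to its left, and the parts are joined with "-".

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderFlattener {
    // Merge several header lines into one composite name per column.
    // Assumption: a blank cell continues the span of the cell to its left,
    // except in the last header line, where blanks stay blank.
    static List<String> flatten(List<String> headerLines) {
        List<String[]> rows = new ArrayList<>();
        int width = 0;
        for (String line : headerLines) {
            String[] cells = line.split(";", -1);
            rows.add(cells);
            width = Math.max(width, cells.length);
        }
        List<String> names = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            names.add("");
        }
        for (int r = 0; r < rows.size(); r++) {
            String[] cells = rows.get(r);
            boolean fillBlanks = r < rows.size() - 1;
            String carried = "";
            for (int col = 0; col < width; col++) {
                String cell = col < cells.length ? cells[col].trim() : "";
                if (!cell.isEmpty()) {
                    carried = cell;      // start of a new span
                } else if (fillBlanks) {
                    cell = carried;      // continue the span from the left
                }
                if (!cell.isEmpty()) {
                    String prev = names.get(col);
                    names.set(col, prev.isEmpty() ? cell : prev + "-" + cell);
                }
            }
        }
        return names;
    }
}
```

For the three header lines in the question, column 4 becomes "Holdings of voting rights-directly held-percentage" and column 5 becomes "Holdings of voting rights-directly held-single rights". The columns under "Publication" are less clear-cut in the sample, so their merged names should be checked by hand.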

I don't think you will easily be able to find when the headers stop and the data begins. This is a semantic problem -- one of meaning. It is easier for a human, though!
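If a heuristic is acceptable, one option is to treat the first line that contains a purely numeric or date-like field as the start of the data. This is only a sketch under strong assumptions: header cells are never purely numeric, every data row has at least one number like "62,5" or a date like "04.04.2002", and DataStartFinder is a name invented for this example.

```java
import java.util.List;
import java.util.regex.Pattern;

final class DataStartFinder {
    // Matches a number with an optional decimal comma ("62,5", "100")
    // or a date in dd.MM.yyyy form ("04.04.2002").
    private static final Pattern DATA_LIKE =
            Pattern.compile("\\d+(,\\d+)?|\\d{2}\\.\\d{2}\\.\\d{4}");

    // Returns the index of the first line with a data-like field, or -1.
    static int firstDataLine(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            for (String field : lines.get(i).split(";", -1)) {
                if (DATA_LIKE.matcher(field.trim()).matches()) {
                    return i;
                }
            }
        }
        return -1;
    }
}
```

It will misfire as soon as a header cell is a bare number, so comparing the result against the expected header count and logging any difference is safer than trusting it blindly.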

This is not a CSV file. You need to get the specification for the file from whoever is generating it.

CSV files are Comma-Separated Values, with one record per line. It's a loose specification with regard to how to escape commas and escape characters. Excel puts double quotes around values and escapes a double quote inside a value by doubling it.
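The Excel-style quoting described above can be sketched as a small state machine. QuotedLineParser is a made-up name, the delimiter is a parameter so it works for commas or semicolons, and the sketch assumes a record never contains a line break inside a quoted value.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedLineParser {
    // Split one line into fields, honoring double-quoted values and
    // doubled double-quotes ("") as an escaped quote character.
    static List<String> parse(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

For real files a maintained library is the safer choice, since it also handles quoted line breaks and encoding details that this sketch leaves out.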

With regard to CSV parsing libraries, I would highly recommend OpenCSV.

Also see: Can you recommend a Java library for reading (and possibly writing) CSV files?

There is no standard header format. It can be seen as a convention that the first line is a comma-separated list of values representing the column headers.

In your case, your table has three header lines (my guess based on counting cells and comparing with the content of your data example).

It is still CSV, but you have to know in advance which line is the first line holding actual data. There is no clue given by the format itself.

As far as CSV headers go, there is no standard format. In all cases, we assume that the first line is a header. Although if the header spans multiple lines (which I am seeing for the first time here), you would need to know the number of header lines before you start parsing this file. At least that is a start.

The next assumption in CSV files is generally that one line is one row or record. So headers and data are usually separated by a newline. In your case, I am not sure how the file is generated or how it is meant to be used.
