Tricky file parsing. Inconsistent Delimiters

2022-12-23 16:35

I need to parse a file with the following format.

0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL 
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS

The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there is no solid delimiter: a space works as the delimiter for most of the line, but not for the title.

Another issue is that the comments field does not always have an entry. When it does, the content contains spaces.

So I can get the first two, and the last fourteen. I need some help figuring out how to parse the middle six fields.

This file was generated by an older program that I cannot change. I am using PHP to parse this file.

I would also ask myself 'How good does this have to be' and 'How many records are there'?

If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as accurate as you can, but expect to miss some titles and build in a feedback mechanism so your users can help you fix the errors (and make them easy to fix in your new format).

On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.

(In my first job, we spent six weeks on a data conversion project to convert 150 records. Not a good use of time.)

Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)

BTW, are you sure that what looks like a space actually is a space? There are other "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
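One way to check this is to dump every byte outside printable ASCII and then normalize the separators before splitting. The following is only a sketch: normalizeSpaces() and unusualBytes() are hypothetical helper names, and it assumes the file is otherwise plain ASCII, so a stray 0xA0 byte can safely be treated as a non-breaking space.

```php
<?php
// List the distinct bytes outside printable ASCII, as hex, so odd separators become visible.
function unusualBytes(string $line): array
{
    preg_match_all('/[^\x20-\x7E]/', $line, $m);
    return array_values(array_map('bin2hex', array_unique($m[0])));
}

// Assumption: apart from the separators, the line is plain ASCII.
function normalizeSpaces(string $line): string
{
    // UTF-8 non-breaking space (C2 A0) first, then the single-byte form (A0), then tabs.
    $line = str_replace(["\xC2\xA0", "\xA0", "\t"], ' ', $line);
    // Collapse runs of spaces and trim both ends.
    return trim(preg_replace('/ {2,}/', ' ', $line));
}
```

If unusualBytes() returns an empty array for every line, the separators really are ordinary spaces and this step can be dropped.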

You need to analyze your data by hand and find out what the year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, that is something you can start with.
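If the publishers do come from a limited list, the text that follows the year can be split into publisher and comments by prefix matching. This is a minimal sketch: splitPublisher() is a hypothetical helper, and the list of known publishers is something you would have to compile from the data yourself.

```php
<?php
// Split "<publisher> [comments]" using a list of known publisher names.
// Returns null when no known publisher starts the string, so the caller can flag the record.
function splitPublisher(string $tail, array $knownPublishers): ?array
{
    // Try longer names first so a multi-word name wins over its own prefix.
    usort($knownPublishers, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });
    foreach ($knownPublishers as $name) {
        // The name must be the whole string or be followed by a space.
        if ($tail === $name || strpos($tail, $name . ' ') === 0) {
            return [
                'publisher' => $name,
                'comments'  => trim((string) substr($tail, strlen($name))),
            ];
        }
    }
    return null;
}
```

Records for which this returns null are good candidates for manual review, and each one reviewed tells you which name to add to the list.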

While I don't see any way other than guessing a bit, I'd go about it something like this:

I'd strip off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM

From there I'd try to locate the Edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) & 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it looks like 7th, 51st etc. for all editions.)

Finally I'd assume I could somewhat reliably guess the year 64 at the start of the second string and so further narrow down the Publisher(/Comment) part.
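This first step could look roughly like the sketch below. peelKnownFields() is a hypothetical name, and the default of 15 trailing tokens is taken from the sample row in the question (fourteen numbers plus the closing code); adjust it if your file differs.

```php
<?php
// Peel the reliable fields off both ends of a line and return what is left in the middle.
function peelKnownFields(string $line, int $tailCount = 15): ?array
{
    $tokens = preg_split('/\s+/', trim($line));
    // Need ID, ISBN, at least one middle token, and the full tail.
    if (count($tokens) < 2 + 1 + $tailCount) {
        return null;
    }
    $id   = array_shift($tokens);
    $isbn = array_shift($tokens);
    // array_splice removes the last $tailCount tokens from $tokens and returns them.
    $tail = array_splice($tokens, -$tailCount);

    return [
        'id'     => $id,
        'isbn'   => $isbn,
        'middle' => implode(' ', $tokens),
        'tail'   => $tail,
    ];
}
```

For the sample row, the 'middle' entry is the string quoted above, which then goes on to the edition and year guessing.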

The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to two strings, one containing Author/Title and the other Publisher(/Comments).
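The edition and year steps can be combined into one pattern, sketched here. It assumes the edition is always an ordinal such as 1st or 51st and is immediately followed by a two-digit year; splitMiddle() is a hypothetical name. Requiring the pair together means a bare number in a title, such as 1984, does not match on its own.

```php
<?php
// Split the middle part into author/title, edition, year, and publisher(/comments).
// Returns null when no "<ordinal> <two-digit year>" pair is found.
function splitMiddle(string $middle): ?array
{
    // The leading (.+) is greedy, so the LAST ordinal+year pair in the string is used.
    $pattern = '/^(.+)\s(\d+(?:st|nd|rd|th))\s(\d{2})\s(.+)$/i';
    if (!preg_match($pattern, $middle, $m)) {
        return null;
    }
    return [
        'authorTitle'       => $m[1],
        'edition'           => $m[2],
        'year'              => $m[3],
        'publisherComments' => $m[4],
    ];
}
```

Anything that comes back null, or whose author/title part looks suspicious, belongs in the manual pile.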

All in all it should limit the manual part a bit.

Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
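CSV is one such format, since fputcsv() quotes any field that contains the delimiter, so the original ambiguity cannot come back. A small sketch, where saveRecordsAsCsv() is a hypothetical helper and each record is assumed to be an associative array with the same keys:

```php
<?php
// Write parsed records to a CSV file, using the keys of the first record as the header row.
// Returns the number of data rows written.
function saveRecordsAsCsv(array $records, string $path): int
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    $written = 0;
    foreach ($records as $record) {
        if ($written === 0) {
            // Separator, enclosure and escape are passed explicitly rather than relying on defaults.
            fputcsv($handle, array_keys($record), ',', '"', '\\');
        }
        fputcsv($handle, array_values($record), ',', '"', '\\');
        $written++;
    }
    fclose($handle);
    return $written;
}
```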

I don't know whether the PCRE engine can return multiple captures from a single repeated group, so every numeric field is written out:

([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\ (.+)\ (\d+(?:st|nd|rd|th))\ (\d{2})\ ([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d{1})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\ (\w{3})

It does look quite ugly and doesn't fix your author-title problem, but it matches the rest quite well. As for that problem, I don't see any solution other than a lookup table for authors, or using another service to look up title and author via the ISBN.

That's assuming that, unlike in your example above, the authors are not represented by just their first name. Also double-check every exception that might occur with the above regex, as titles may contain 1st or the like.
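In PHP such a pattern can be applied with preg_match(). The sketch below builds a variant of the expression piece by piece so the repeated numeric groups stay readable; it uses escaped dots, allows th and multi-digit editions, and captures the year. parseBookLine() is a hypothetical name, and the ISBN part still only accepts the 1-3-5-1 digit grouping of the sample.

```php
<?php
// Apply the full-line pattern; returns the 21 captured fields in order, or null on no match.
function parseBookLine(string $line): ?array
{
    $num  = '(\d+\.\d{2})';                          // decimal with two places
    $four = implode(' ', array_fill(0, 4, $num));    // four decimals separated by spaces

    $pattern = '/^([A-Z0-9]{7}) (\d-\d{3}-\d{5}-\d) (.+) (\d+(?:st|nd|rd|th)) (\d{2}) ([^\d.]+) '
             . $four . ' (\d) ' . $four . ' (\d) ' . $four . ' (\w{3})$/';

    if (!preg_match($pattern, trim($line), $m)) {
        return null; // header lines and unexpected rows end up here
    }
    array_shift($m); // drop the full match, keep only the groups
    return $m;
}
```

For the sample row, index 2 holds the combined author/title text and index 5 holds the publisher together with any comments, so the lookup-table step still has to be applied to those two fields.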
