开发者

Tricky file parsing. Inconsistent Delimeters

I need to parse a file with the following format.

0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL 
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS 

The ID and ISBN are not a problem, the title is. There is no set length for these fields, and there are no solid delimiters- the space can be used for most of the file.

Another issue is that there is not always an entry in 开发者_StackOverflow社区the comments field. When there is, there are spaced within the content.

So I can get the first two, and the last fourteen. I need some help figuring out how to parse the middle six fields.

This file was generated by an older program that I cannot change. I am using php to parse this file.


I would also ask myself 'How good does this have to be' and 'How many records are there'?

If, for example, you are parsing this list to put up a catalog of books to sell on a website - you probably want to be as good as you can, but expect that you will miss some titles and build in feedback mechanism so your users can help you fix the issue ( and make it easy for you to fix it in your new format).

On the other hand, if you absolutely have to get it right because you will loose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close, and then doing a human review of the entire file.

(In my first job, we spend six weeks on a data conversion project to convert 150 records - not a good use of time).


Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)

BTW. are you sure that what looks like space actually is a space? There are more "invisible" characters (like non-break space). I know, not a good idea, but apparently author of that format was pretty creative...


You need to analyze you data by hand and find out what year, edition and publisher look like. For example if you find that year is always two digits and publisher always comes from some limited list, this is something you can start with.


While I don't see any way other then guessing a bit I'd go about it something like this:

I'd scale off what I know I can parse out reliably. Leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM

From there I'd try locate the Edition and split the string into two at that position after storing and removing the Edition leaving you with ABE WOMAN IN THE DUNES (INT'L ED) & 64 RANDOM, another option is to try with the year but of course Titles such as 1984 might present a problem . (Guessing edition is of course assuming it's 7th, 51st etc for all editions).

Finally I'd assume I could somewhat reliable guess the year 64 at the start of the second string and further limit the Publisher(/Comment) part.

The rest is pure guesswork unless you got a list of authors/publishers somewhere to match against as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to 2 strings containing Author/Title in one and Publisher(/Comments) in the other.

All in all it should limit the manual part a bit.

Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)


I don't know if the pcre engine allows multiple groups from within selection, therefore:

([A-Z0-1]{7})\ (\d-\d{3}-\d{5}-\d)\ (.+)\ (\d(?:st|nd|rd))\ \d{2}\ ([^\d.]+)\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d{1})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d)\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\d+.\d{2})\ (\w{3})

It does look quite ugly and doesn't fix your author-title problem but it matches quite good for the rest of it. Concerning your problem I don't see any solution but having a lookup table for authors or using other services to lookup title and author via the ISBN.

Thats if unlike in your example above the authors are not just represented by their first name. Also double check all exception that might occur with the above regex as titles may contain 1st or alike.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜