Flexible parsing of text with regular expressions in Java or Python
I am working on some code to parse text into XML. I am currently using Java and JAXB to handle the XML and the in-program representation of my data. I need to set up an easily expandable and adaptable method to parse the information from my text files into my Java classes. The data will for the most part stay the same, but I need to be able to support later changes in the text input format. (I am parsing airline pilot flight schedules, and I want to support the schedules of other airlines down the road.) Regular expressions seem like the way to go, but the little I have worked with Java regular expressions makes them seem a poor solution compared to Python's, specifically because of named captures. However, I know less about Python than I do about Java!
So, I am looking for a modular system to parse text data that I can easily adapt, extend, and distribute later on. I am willing to learn more Python if I need it, but my time and abilities are limited. Any suggestions? An example of the text I am parsing follows.
=================================================================================================
8122 TU REPORT AT 06.45/N EFFECTIVE JUN 08-JUN 29
1 CAPT, 1 F/O
DAY FLT. EQP DEPARTS ARRIVES BLK. BLK. DUTY CR. LAYOVER MO TU WE TR FR SA SU
TU 180 320 PHX 0745 SAN 0857* 1.12 -- -- -- -- -- --
TU 005 320 SAN 0950 PHX 1106 1.16 -- 8 -- -- -- -- --
TU 592 L 320 PHX 1215 MCI 1652 2.37 -- 15 -- -- -- -- --
Radisson A/P 5.05 8.22 5.05 MCI 12.18 -- 22 -- -- -- -- --
(816) 464-2423 -- 29 --
WE 403 B 320 MCI 0610 PHX 0657 2.47
WE 149 320 PHX 0859 CMH 1547 3.48
Holiday Inn City Center 6.35 9.37 6.35 CMH 15.13
(614) 221-3281
TH 335 B 320 CMH 0800 PHX 0913 4.13
TH 343 L 320 PHX 1029 PVR 1508 2.39
Marriott Casamagna 6.52 9.23 6.52 PVR 15.52
52-322-2260000
TRANS: Hotel Shuttle
FR 621 320 PVR 0815 PHX 0839 2.24
2.24 3.39 2.24
CREDIT HRS. 21.00 BLK. HRS. 20.56 LDGS: 8 TAFB 74.24
=================================================================================================
Those look like fixed-width fields, which are probably best handled with simple string splitting. The only thing you might need regular expressions for is determining what type of record you are looking at, although the indentation level also looks useful for that.
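As a rough illustration of that idea, here is a minimal Java sketch that uses a regular expression only to decide the record type and a plain substring helper to pull out columns. The record kinds, the patterns, and any column offsets passed to `field` are guesses made from the sample in the question, not a known layout, so treat them as placeholders.

```java
import java.util.regex.Pattern;

class RecordTypes {
    // Hypothetical record kinds inferred from the sample; not an airline specification.
    enum Kind { SEPARATOR, FLIGHT_LEG, SUMMARY, OTHER }

    private static final Pattern SEPARATOR = Pattern.compile("^=+$");
    // A leg line appears to start with a two-letter day code and a flight number.
    private static final Pattern FLIGHT_LEG =
        Pattern.compile("^(MO|TU|WE|TH|FR|SA|SU)\\s+\\d+\\b.*");
    private static final Pattern SUMMARY = Pattern.compile("^CREDIT HRS\\..*");

    // Decide what kind of record a line is; the content is parsed elsewhere.
    static Kind classify(String line) {
        String s = line.trim();
        if (SEPARATOR.matcher(s).matches()) return Kind.SEPARATOR;
        if (FLIGHT_LEG.matcher(s).matches()) return Kind.FLIGHT_LEG;
        if (SUMMARY.matcher(s).matches()) return Kind.SUMMARY;
        return Kind.OTHER;
    }

    // Fixed-width slice of columns [start, end), clipped to the line length.
    // The offsets a caller passes in are assumptions until checked against real files.
    static String field(String line, int start, int end) {
        if (start >= line.length()) return "";
        return line.substring(start, Math.min(end, line.length())).trim();
    }
}
```

Something like `RecordTypes.field(line, 3, 6)` would then read one column of a leg line, once the real offsets are known.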
You should be fine with Java regular expressions, and it should be a trivial exercise to support named captures. After all, it is just a matter of mapping capture group numbers to names. I even have code for this around somewhere, but can't share it for copyright reasons.
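For what it's worth, a minimal sketch of that number-to-name mapping could look like the following. The pattern and the field names (`day`, `flight`, `code`, and so on) are my own guesses from the sample data, not an official layout. Note also that since Java 7, `java.util.regex` supports named groups directly with `(?<name>...)` and `Matcher.group(String)`; the last constant shows that form.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedCaptures {
    // Numbered groups only; NAMES[i] labels group i + 1. Names are illustrative guesses.
    static final Pattern LEG = Pattern.compile(
        "^([A-Z]{2})\\s+(\\d+)\\s+(?:([A-Z])\\s+)?(\\d{3})\\s+"
      + "([A-Z]{3})\\s+(\\d{4})\\s+([A-Z]{3})\\s+(\\d{4})\\*?\\s+(\\d+\\.\\d{2}).*$");
    static final String[] NAMES =
        {"day", "flight", "code", "eqp", "from", "dep", "to", "arr", "block"};

    // Returns name -> captured text, or an empty map when the line does not match.
    static Map<String, String> parse(Pattern pattern, String[] names, String line) {
        Matcher m = pattern.matcher(line.trim());
        if (!m.matches()) return Collections.emptyMap();
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            out.put(names[i], m.group(i + 1)); // null if an optional group did not participate
        }
        return out;
    }

    // Java 7+ alternative: the names live inside the pattern itself.
    static final Pattern LEG_NAMED =
        Pattern.compile("^(?<day>[A-Z]{2})\\s+(?<flight>\\d+)\\b.*$");
}
```

Keeping the names in a separate array means the same `parse` helper can serve every airline's pattern, as long as each pattern ships with its own name list.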
You could put regular expressions to parse the individual parts of such listings in a text file and make those part of your configuration. Regular expressions are compiled at run-time, so this should be fairly dynamic.
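One way this might be sketched in Java is with a `java.util.Properties` file holding one `name=regex` entry per record type. The keys, the patterns, and the `SAMPLE` text standing in for a per-airline file are all hypothetical. The patterns use `[0-9]` and `[.]` rather than `\d` and `\.` because backslash is an escape character in the properties format and would otherwise have to be doubled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

class PatternConfig {
    // Reads "name=regex" entries and compiles each one at run time.
    static Map<String, Pattern> load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Pattern> patterns = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            patterns.put(name, Pattern.compile(props.getProperty(name)));
        }
        return patterns;
    }

    // Stand-in for the contents of a hypothetical per-airline configuration file.
    static final String SAMPLE =
          "leg=^(?<day>[A-Z]{2}) +(?<flight>[0-9]+) .*$\n"
        + "summary=^CREDIT HRS[.] +(?<credit>[0-9]+[.][0-9]{2}) .*$\n";
}
```

Supporting another airline would then mean shipping a different properties file, for example loaded with `PatternConfig.load(new FileReader(path))`, rather than changing the code.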
If you want a more flexible system (albeit at the cost of a pre-compilation step), have a look at parser generators like JavaCC or ANTLR. These let you write context-free grammars, which are considerably more powerful than regular expressions.
In Python, you could try Gelatin.