
Flexible parsing of text with regular expressions in Java or Python

I am working on some code to parse text into XML. I am currently using Java and JAXB to handle the XML and the in-program representation of my data. I need to set up an easily expandable and adaptable way to parse the information from my text files into my Java classes. The data will mostly stay the same, but I need to be able to support later changes in the text input format. (I am parsing airline pilot flight schedules, and I want to support the schedules of other airlines down the road.) Regular expressions seem like the way to go, but the little I have done with Java regular expressions makes them seem a poor solution compared to Python's, specifically because of named captures. However, I know less about Python than I do about Java!

So, I am looking for a modular system to parse text data that I can easily adapt, extend, and distribute later on. I am willing to learn more Python if I need it, but my time and abilities are limited. Any suggestions? An example of the text I am parsing follows.

=================================================================================================
 8122 TU             REPORT AT 06.45/N             EFFECTIVE JUN 08-JUN 29
      1 CAPT, 1 F/O
   DAY  FLT.  EQP DEPARTS   ARRIVES    BLK.   BLK.  DUTY   CR.     LAYOVER   MO TU WE TR FR SA SU
   TU   180   320 PHX 0745  SAN 0857* 1.12                                      -- -- -- -- -- --
   TU   005   320 SAN 0950  PHX 1106  1.16                                   --  8 -- -- -- -- --
   TU   592 L 320 PHX 1215  MCI 1652  2.37                                   -- 15 -- -- -- -- --
             Radisson A/P                     5.05  8.22  5.05  MCI  12.18   -- 22 -- -- -- -- --
             (816) 464-2423                                                  -- 29 --            
   WE   403 B 320 MCI 0610  PHX 0657  2.47                                  
   WE   149   320 PHX 0859  CMH 1547  3.48                                  
             Holiday Inn City Center          6.35  9.37  6.35  CMH  15.13  
             (614) 221-3281                                        
   TH   335 B 320 CMH 0800  PHX 0913  4.13                                  
   TH   343 L 320 PHX 1029  PVR 1508  2.39                                  
             Marriott Casamagna               6.52  9.23  6.52  PVR  15.52  
             52-322-2260000 TRANS: Hotel Shuttle                   
   FR   621   320 PVR 0815  PHX 0839  2.24                                  
                                              2.24  3.39  2.24              
      CREDIT HRS.  21.00     BLK. HRS. 20.56    LDGS:  8     TAFB    74.24  
=================================================================================================


Those look like fixed-width fields, which are probably a good choice for simple string splitting. The only thing it looks like you could use regular expressions for is to determine what type of record you are looking at, although it looks like the indentation level is also useful for determining that.
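As a rough illustration of the fixed-width approach, here is a minimal Python sketch that cuts one flight-leg line into named fields by column position. The column offsets, the field names, and the `split_fixed_width` helper are all assumptions read off the sample listing in the question, not a documented format; in particular, the meaning of the single-letter column (called `code` here) is a guess.

```python
# Column positions are assumptions taken from the sample listing in the question.
# Another airline's layout would get its own table like this one.
FLIGHT_LEG_FIELDS = {
    "day": slice(3, 5),
    "flight": slice(8, 11),
    "code": slice(12, 13),        # single letter such as L or B; meaning unknown
    "equipment": slice(14, 17),
    "origin": slice(18, 21),
    "departs": slice(22, 26),
    "destination": slice(28, 31),
    "arrives": slice(32, 36),
    "block": slice(38, 42),
}


def split_fixed_width(line, fields):
    """Cut a line into named, whitespace-stripped fields by column position."""
    return {name: line[cols].strip() for name, cols in fields.items()}


leg = split_fixed_width("   TU   592 L 320 PHX 1215  MCI 1652  2.37", FLIGHT_LEG_FIELDS)
print(leg["origin"], leg["departs"], leg["destination"], leg["arrives"])
```

Keeping the offsets in a plain dictionary means a new input format only needs a new table, not new parsing code.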


You should be fine with Java regular expressions, and it should be a trivial exercise to support named captures. After all, it is just a matter of mapping capture group numbers to names. I even have code for this around somewhere, but I can't share it for copyright reasons.

You could put regular expressions to parse the individual parts of such listings in a text file and make those part of your configuration. Regular expressions are compiled at run-time, so this should be fairly dynamic.
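The idea of keeping the patterns in configuration could look roughly like the sketch below, written in Python because the question singles out its named captures. The `PATTERNS` table stands in for a configuration file; the pattern text, the group names, and the `classify` helper are hypothetical and were written only against the sample listing in the question.

```python
import re

# Stand-in for a configuration file: record type -> regular expression.
# The patterns and group names are assumptions based on the sample listing.
PATTERNS = {
    "header": r"^\s*(?P<number>\d{4})\s+(?P<days>[A-Z]{2})\s+"
              r"REPORT AT (?P<report>\d{2}\.\d{2})",
    "leg": r"^\s+(?P<day>[A-Z]{2})\s+(?P<flight>\d{3})\s+"
           r"(?:(?P<code>[A-Z])\s+)?(?P<equipment>\d{3})\s+"
           r"(?P<origin>[A-Z]{3})\s+(?P<departs>\d{4})\s+"
           r"(?P<destination>[A-Z]{3})\s+(?P<arrives>\d{4})",
}

# Compiled once at start-up; changing the configuration changes the parser.
COMPILED = {name: re.compile(text) for name, text in PATTERNS.items()}


def classify(line):
    """Return (record_type, named_fields) for the first pattern that matches."""
    for name, pattern in COMPILED.items():
        match = pattern.match(line)
        if match:
            return name, match.groupdict()
    return None, {}


kind, fields = classify("   TU   592 L 320 PHX 1215  MCI 1652  2.37")
print(kind, fields["origin"], fields["destination"])
```

Lines that match no pattern, such as the hotel and phone lines, come back as `(None, {})`, so the caller can decide whether to skip them or add another pattern to the configuration.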

If you want a more flexible system (albeit at the cost of a pre-compilation step), have a look at parser generators like JavaCC or ANTLR. These let you write context-free grammars, which are considerably more powerful than regular expressions.


In Python, you could try Gelatin.
