How do you go about batch processing poorly formatted text files?
People complain a lot about XML but, compared to EDI and some of the proprietary file formats I've dealt with in my career, I think XML is bliss. The work I did on importing data files from Automotive Comparative Raters, each with its own creative file format, still gives me nightmares.
That being said, I'm curious how other programmers approach automated parsing of poorly formatted text files. Do you have a language preference? Are there any automation tools that you find invaluable? How do you make your code reusable?
A solution I learned about quite recently is using a standalone lexer. You get to use structured regular expressions and you avoid the constraints of a full-blown parser generator.
Here are some examples with ocamllex (the lexer generator provided with OCaml):
- an ocamllex tutorial with some examples.
- processing of loosely formatted GenBank text files (another link illustrates the point better, but it is hindered by a JavaScript dialog).
Obviously lexer generators are also available in other languages if using OCaml is an issue for you.
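For instance, the standalone-lexer idea can be roughed out in Python with nothing but the `re` module. This is only a minimal sketch, not a port of ocamllex: the token names and patterns below are hypothetical and would have to be adapted to the file at hand. The point is that each token gets its own named regular expression, and anything unexpected is reported as `JUNK` instead of aborting the run.

```python
import re

# Hypothetical token set for a loosely formatted record file.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
    ("JUNK", r"."),  # any other single character
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Yield (kind, lexeme) pairs, dropping runs of spaces and tabs."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        yield kind, match.group()


if __name__ == "__main__":
    for token in tokenize("LOCUS 42 ##\n"):
        print(token)
```

Whatever consumes the token stream can then decide what to do with `JUNK` tokens (log them, skip the line, and so on), which keeps the messy-input handling in one place.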
Perl or Python. Build up functionality slowly, keep the worst files as test cases, and drink lots of coffee.
When I need to parse poorly formatted text, I use Perl and Marpa, a general BNF parser. Look at the text, find patterns, and describe them as BNF rules, e.g.
pattern_name ::= pattern_symbol1 pattern_symbol2 ...
or for lexeme patterns,
lexeme ~ lexeme_symbol1 lexeme_symbol2 ...
You can use single-quoted strings and character classes to describe lexemes in the BNF grammar text. Feed the BNF to Marpa, define semantic actions, and evaluate the parse, or just process the AST to get the results.
Examples of Perl scripts using Marpa to parse poorly formatted text here at SO:
Parse values from a block of text based on specific keys
Problem Category = "Human Endeavors "
Problem Subcategory = "Space Exploration"
Problem Type = "Failure to Launch"
Software Version = "9.8.77.omni.3"
Problem Details = "Issue with signal barrier chamber."
extracted from:
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
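For comparison, here is a rough Python sketch of the same extraction. It uses plain regular expressions rather than Marpa, and it assumes the set of keys is known in advance (the `KEYS` list and the `extract_fields` name are my own, not from the linked answer). Each value is taken to be whatever sits between one `key:` and the next; values are stripped, so the trailing space in the first value is dropped.

```python
import re

# Assumption: the keys are known up front.
KEYS = [
    "Problem Category",
    "Problem Subcategory",
    "Problem Type",
    "Software Version",
    "Problem Details",
]
KEY_RE = re.compile("(" + "|".join(re.escape(k) for k in KEYS) + "):")


def extract_fields(text):
    """Map each known key to the text between it and the next known key."""
    matches = list(KEY_RE.finditer(text))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        fields[current.group(1)] = text[current.end():end].strip()
    return fields


if __name__ == "__main__":
    blob = (
        "Problem Category: Human Endeavors Problem Subcategory: Space "
        "ExplorationProblem Type: Failure to LaunchSoftware Version: "
        "9.8.77.omni.3Problem Details: Issue with signal barrier chamber."
    )
    for key, value in extract_fields(blob).items():
        print(f'{key} = "{value}"')
```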
Parsing of parenthesis with sed using regex
key1
key2
key3
key4
key5
key6
key7
extracted from
dummy
(key1)
(key2)dummy(key3)
dummy(key4)dummy
dummy(key5)dummy))))dummy
dummy(key6)dummy))(key7)dummy))))
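For input like this, where the parentheses of interest are not nested, a plain regular expression is probably enough. A minimal Python sketch (again not the Marpa solution; the function name is hypothetical) that ignores the stray closing parentheses:

```python
import re

# Matches "(" + one or more non-parenthesis characters + ")".
# Stray ")" characters never match because no "(" precedes them.
KEY_RE = re.compile(r"\(([^()]+)\)")


def extract_keys(text):
    """Return the contents of every non-nested (...) group, in order."""
    return KEY_RE.findall(text)


if __name__ == "__main__":
    sample = "\n".join([
        "dummy",
        "(key1)",
        "(key2)dummy(key3)",
        "dummy(key4)dummy",
        "dummy(key5)dummy))))dummy",
        "dummy(key6)dummy))(key7)dummy))))",
    ])
    print("\n".join(extract_keys(sample)))
```

Once the input can contain nested parentheses, this stops being sufficient and a real parser becomes the better tool.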
How to extract corporate bonds informations using machine learning
ABC 2.5 19
XYZ 6.5 15
extracted from
<[/] Trading 10mm ABC 2.5 19 05/06 mkt can use 50mm>
<XYZ 6.5 15 10-2B 106-107 B3 AAA- 1.646MM 2x2>
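As a rough illustration, the two sample lines above can also be handled in Python with a single regular expression. This is neither Marpa nor machine learning, just a sketch under the assumption that a bond always appears as an upper-case ticker of 2 to 5 letters, a coupon, and a two-digit maturity year separated by single spaces. That assumption is mine, drawn only from these two lines, and real trader chatter will certainly break it.

```python
import re

# Assumed shape: TICKER COUPON YY, e.g. "ABC 2.5 19".
BOND_RE = re.compile(r"\b([A-Z]{2,5}) (\d+(?:\.\d+)?) (\d{2})\b")


def extract_bonds(lines):
    """Return (ticker, coupon, maturity_year) for the first bond on each line."""
    bonds = []
    for line in lines:
        match = BOND_RE.search(line)
        if match:
            ticker, coupon, year = match.groups()
            bonds.append((ticker, float(coupon), int(year)))
    return bonds


if __name__ == "__main__":
    messages = [
        "<[/] Trading 10mm ABC 2.5 19 05/06 mkt can use 50mm>",
        "<XYZ 6.5 15 10-2B 106-107 B3 AAA- 1.646MM 2x2>",
    ]
    for ticker, coupon, year in extract_bonds(messages):
        print(ticker, coupon, year)
```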
Hope this helps.
I know I'll receive scathing responses for this, but I like Java as an all-around language. In the case of file parsing, generic regexes (I know, now I have 2 problems...) work well for me.