How to parse text with delimiters? [duplicate]
Possible Duplicate:
How to parse this output and separate each field/word
I want to parse the following data so that I get the output specified below.
Input:
RTRV-ALM-EQPT::ALL:RA01;

 SIMULATOR 09-11-20 13:52:15
M RA01 COMPLD
 "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
 "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
 "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
 "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
 "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
 "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
 "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","
;
Output:
1) RTRV-ALM-EQPT::ALL:RA01;
2) SIMULATOR
3) 09-11-20
4) 13:52:15
5) M
6) RA01
7) COMPLD
8) "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
9) "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
10) "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
11) "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
12) "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
13) "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
14) "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","
The best approach is probably not to think of converting the first text to the second text.
Rather, think of first parsing the first text into a set of Java objects representing what the data actually is. For example, the second/third line of your input might be represented by a Test class with "area", "day" and "time" properties. (Only you can come up with a sensible model, based on your knowledge of what everything means.)
Then once you've got a nice in-memory representation of the file information, you can consider printing out to text as in the second case. It should be very easy now to just print out the various fields and properties from your Java objects, rather than trying to transform the input text on the fly.
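As a rough illustration of that idea, here is a minimal sketch of a class for one of the quoted lines. The class name AlarmRecord and every field name in it are hypothetical guesses at what each position means, and the sketch assumes the description is always the last field and is wrapped in \" ... \". Only someone who knows the real format can confirm or correct this.

```java
// Hypothetical model of one quoted alarm line. The field names are guesses
// at what each position means; adjust them to the real format.
final class AlarmRecord {
    final String slot;          // e.g. SLOT-1-1-1
    final String type;          // e.g. CMP
    final String severity;      // e.g. MN
    final String condition;     // e.g. T-FANCURRENT-1-HIGH
    final String serviceEffect; // e.g. NSA
    final String date;          // e.g. 01-10-09
    final String time;          // e.g. 00-00-00
    final String description;   // e.g. Fan-T

    private AlarmRecord(String[] f, String[] typeAndSeverity, String description) {
        this.slot = f[0].trim();
        this.type = typeAndSeverity[0].trim();
        this.severity = typeAndSeverity[1].trim();
        this.condition = f[2].trim();
        this.serviceEffect = f[3].trim();
        this.date = f[4].trim();
        this.time = f[5].trim();
        this.description = description.trim();
    }

    // Parses one token including its surrounding double quotes.
    static AlarmRecord parse(String token) {
        if (token.length() < 2 || !token.startsWith("\"") || !token.endsWith("\"")) {
            throw new IllegalArgumentException("Not a quoted token: " + token);
        }
        String body = token.substring(1, token.length() - 1);

        // Assumption: the description starts at :\" and ends at the last \"
        int descStart = body.indexOf(":\\\"");
        int descEnd = body.lastIndexOf("\\\"");
        if (descStart < 0 || descEnd < descStart + 3) {
            throw new IllegalArgumentException("No description found: " + token);
        }
        String description = body.substring(descStart + 3, descEnd);

        // Everything before the description is comma separated; keep empty fields.
        String[] f = body.substring(0, descStart).split(",", -1);
        if (f.length < 6) {
            throw new IllegalArgumentException("Too few fields: " + token);
        }
        String[] typeAndSeverity = f[1].split(":", 2);
        if (typeAndSeverity.length != 2) {
            throw new IllegalArgumentException("Expected TYPE:SEVERITY but got: " + f[1]);
        }
        return new AlarmRecord(f, typeAndSeverity, description);
    }

    public static void main(String[] args) {
        AlarmRecord r = parse(
            "\"SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\\\"Processor Failure\\\",\"");
        System.out.println(r.slot + " " + r.severity + " " + r.description);
    }
}
```

With objects like this, printing the numbered output becomes a matter of walking over the fields, and malformed lines are rejected early instead of silently producing wrong text.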
Assuming the files are relatively small and can therefore be read into memory, try something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "RTRV-ALM-EQPT::ALL:RA01;\n"+
            "\n"+
            " SIMULATOR 09-11-20 13:52:15\n"+
            "M RA01 COMPLD\n"+
            " \"SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\\\"Fan-T\\\",\"\n"+
            " \"SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\\\"Battery-T\\\",\"\n"+
            " \"SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\\\"Processor Failure\\\",\"\n"+
            " \"SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\\\"Laser-T\\\",\"\n"+
            " \"SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\\\" Laser-T\\\",\"\n"+
            " \"SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\\\"Laser-T\\\",\"\n"+
            " \"SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\\\"Laser-T\\\",\"\n"+
            ";";
        // Either a double-quoted string (backslash escapes allowed) or a run of non-space chars.
        // Note the four backslashes inside the character class: they exclude a literal backslash.
        Matcher m = Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|\\S+").matcher(text);
        int n = 0;
        // Print every match with a running number.
        while(m.find()) {
            System.out.println((++n)+") "+m.group());
        }
    }
}
Output:
1) RTRV-ALM-EQPT::ALL:RA01;
2) SIMULATOR
3) 09-11-20
4) 13:52:15
5) M
6) RA01
7) COMPLD
8) "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
9) "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
10) "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
11) "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
12) "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
13) "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
14) "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","
15) ;
The only difference is that there is a 15th match: the trailing ; which I believe you left out of your expected output.
The raw regex (without all the escapes) looks like this:
"(?:\\.|[^\\"])*"|\S+
and matches:
" # match a double quote
(?: # open non-capturing group 1
\\. # match a backslash followed by any char (except line breaks)
| # OR
[^\\"] # match any char except a backslash and a double quote
)* # close non-capturing group 1 and repeat it zero or more times
" # match a double quote
| # OR
\S+ # match one or more characters other than white space chars
In other words: match a quoted string, or match a word consisting solely of non-whitespace characters.
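If you want the tokens as data rather than printed lines, the same pattern can be wrapped in a small helper. This is only a sketch: the class name Tokenizer and the methods tokenize and unquote are made up for illustration, and unquote assumes that a backslash always escapes exactly the one character that follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the pattern explained above.
final class Tokenizer {
    // A quoted string with backslash escapes, or a run of non-whitespace chars.
    private static final Pattern TOKEN =
        Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|\\S+");

    // Returns every match in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Strips the surrounding quotes and resolves escapes such as \" and \\.
    // Tokens that are not quoted are returned unchanged.
    static String unquote(String token) {
        boolean quoted = token.length() >= 2
            && token.charAt(0) == '"'
            && token.charAt(token.length() - 1) == '"';
        if (!quoted) {
            return token;
        }
        String body = token.substring(1, token.length() - 1);
        return body.replaceAll("\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        for (String t : tokenize("M RA01 COMPLD \"SLOT-1-1-1,CMP:MN,:\\\"Fan-T\\\",\" ;")) {
            System.out.println(unquote(t));
        }
    }
}
```

Keeping tokenizing and unquoting separate means the numbered output from the question can still be produced from the raw tokens, while later processing works on the unescaped values.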
For parsing any input you must know its structure.
- Are the first four lines always present?
- What is the format of each of these four lines?
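Once those questions are answered, the assumptions can be written down as checks so that unexpected input fails loudly. The sketch below guesses at two of the leading lines from the sample: a "name date time" line and an "M tag status" line. The class name HeaderCheck, both patterns and the meaning given to each group are assumptions made for illustration, not a description of the real format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical structure checks for the leading lines of the sample input.
final class HeaderCheck {
    // Assumed shape of " SIMULATOR 09-11-20 13:52:15": name, date, time.
    private static final Pattern HEADER = Pattern.compile(
        "\\s*(\\S+)\\s+(\\d{2}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s*");

    // Assumed shape of "M RA01 COMPLD": the letter M, a tag, a status word.
    private static final Pattern STATUS = Pattern.compile(
        "\\s*M\\s+(\\S+)\\s+(\\S+)\\s*");

    // Returns {name, date, time} or throws if the line has another shape.
    static String[] parseHeader(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected header line: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Returns {tag, status} or throws if the line has another shape.
    static String[] parseStatus(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected status line: " + line);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] header = parseHeader(" SIMULATOR 09-11-20 13:52:15");
        String[] status = parseStatus("M RA01 COMPLD");
        System.out.println(String.join(" | ", header));
        System.out.println(String.join(" | ", status));
    }
}
```

If some of these lines turn out to be optional, the checks are the place to encode that, rather than letting a generic tokenizer accept anything.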