
Best approach to parse text files that contain multiple types of delimiters?

I need to parse some text files that have different types of delimiters (tildes, spaces, commas, pipes, caret characters).

The order of the elements also differs depending on the delimiter, e.g.:

comma: A, B, C, D, E
caret: B, C, A, E, D
tilde: C, A, B, D, E 

The delimiter is the same within the file but different from one file to another. From what I can tell, there are no delimiters within the data elements.

What's a good approach to do this in plain ol' Java?


I like to read the first two lines of a file, and then test the delimiters. If you split on a delimiter, and both lines return the same number of pieces (and more than one), then you've probably guessed the correct one. Here's an example program which checks the file names.txt.

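import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
    File file = new File("etc/names.txt");

    String delim = getDelimiter(file);
    System.out.println("Delim is " + delim + " (" + (int) delim.charAt(0) + ")");
}

// Candidates are tried in order; add "|", "^" or "~" here if your files use them.
private static final String[] DELIMS = new String[] { "\t", ",", " " };

private static String getDelimiter(File file) throws IOException {
    String line0;
    String line1;
    // Read the first two lines once instead of reopening the file per delimiter.
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        line0 = br.readLine();
        line1 = br.readLine();
    }
    if (line0 != null && line1 != null) {
        for (String delim : DELIMS) {
            // split() takes a regex, so quote the delimiter ("|" and "^" are metacharacters).
            String regex = Pattern.quote(delim);
            int pieces = line0.split(regex).length;
            if (pieces > 1 && pieces == line1.split(regex).length) {
                return delim;
            }
        }
    }
    throw new IllegalStateException("Failed to find delimiter for file " + file);
}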


I might start by playing with Java's StringTokenizer. This takes a string, and lets you find each token that is separated by a delimiter.

Here is one example from the net.

But you want to tokenize things from a file. In that case, you might want to play with Java's StreamTokenizer, which lets you parse input from a file stream.
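
As a minimal sketch of the StringTokenizer idea, splitting one tilde-delimited line might look like this. The class name `TokenizeLine` and the sample line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeLine {
    // Splits one line on a single known delimiter character and trims each token.
    static List<String> tokens(String line, char delimiter) {
        // StringTokenizer takes its delimiter set as a String.
        StringTokenizer st = new StringTokenizer(line, String.valueOf(delimiter));
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("C~A~B~D~E", '~')); // prints [C, A, B, D, E]
    }
}
```

One thing to watch: StringTokenizer skips empty fields, so two delimiters in a row do not produce an empty token. If empty fields matter, String.split with a negative limit may be a better fit.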

edit

If you don't know the delimiters in advance, you could do a few things:

  1. Delimit based on all possible delimiters. If your data itself doesn't contain any delimiters, then this would work (i.e., look for both "," and ";", provided that your data itself doesn't have either of those characters).
  2. If you have an idea of what your data is supposed to look like (supposed to be integers, or supposed to be single characters) then your code could try different delimiters (try "," first, then try ";", etc) until it parsed a line of text "correctly".
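
The second option could be sketched roughly as follows, assuming every field is supposed to be an integer. The class name `DelimiterByContent`, the candidate list, and the "all integers" check are assumptions for illustration; substitute whatever "parsed correctly" means for your data.

```java
import java.util.regex.Pattern;

class DelimiterByContent {
    // Candidate delimiters, tried in order (an assumed list; adjust to your files).
    static final String[] CANDIDATES = { ",", ";", "|", "^", "~" };

    // Returns the first candidate that splits the line into two or more
    // fields that all look like integers, or null if none does.
    static String guess(String line) {
        for (String delim : CANDIDATES) {
            // split() takes a regex, so quote characters such as "|" and "^".
            String[] parts = line.split(Pattern.quote(delim), -1);
            if (parts.length > 1 && allIntegers(parts)) {
                return delim;
            }
        }
        return null;
    }

    // The assumed definition of "correct": every field is an optional minus sign plus digits.
    static boolean allIntegers(String[] parts) {
        for (String p : parts) {
            if (!p.trim().matches("-?\\d+")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(guess("10;20;30")); // prints ;
        System.out.println(guess("1|2|3"));    // prints |
        System.out.println(guess("abc"));      // prints null
    }
}
```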


If it's the same delimiter throughout the file, write a function for one delimiter, call it d, and when handling other files, replace their delimiter with d. Rinse. Repeat. :)

Another approach: have your parsing function accept a file name and a delimiter as parameters. This assumes the parsing logic is the same for all files.

If your files look completely different, then delimiters are the least of your problems.


If it's the same delimiter throughout the file, then you can probably pass in the delimiter when you load the file to parse.

For example:

    void someFunction(char delimiter) {
        // do whatever you want to do with the file; you can use StringTokenizer for this purpose
    }

Each time you load a file, call this function with that file's delimiter as the argument.
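
A slightly fuller sketch of such a function, assuming it should simply return each line split into its fields. The class name `ParseWithDelimiter` and the choice of a Reader parameter are mine, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParseWithDelimiter {
    // Reads every non-blank line from the input and splits it on the caller-supplied delimiter.
    static List<String[]> parse(Reader input, char delimiter) throws IOException {
        // split() takes a regex, so quote the delimiter ("|" and "^" are metacharacters).
        String regex = Pattern.quote(String.valueOf(delimiter));
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(input)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                rows.add(line.split(regex, -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader so the example runs without a file.
        List<String[]> rows = parse(new StringReader("a|b|c\nd|e|f\n"), '|');
        System.out.println(rows.size() + " rows, first field " + rows.get(0)[0]);
    }
}
```

For a real file you would pass something like `new FileReader(fileName)` together with the delimiter for that file.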

Hope this helps.. :-)


You could write a class that parses a file something like this:

interface MyParser {
  // Implementations take (char delimiter, List<String> fields) in their constructor;
  // a Java interface cannot declare a constructor itself.
  Map<String,String> parseFile(InputStream file) throws IOException;
}

You'd pass the delimiter and an ordered list of fields to the constructor, then ask it to parse a file. You'd get back a map of field names (from the ordered list) to values.

The implementation of parseFile would probably use split with the delimiter and then iterate through the array returned by split and the list of fields in step, creating the map as it went.
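
Here is a rough sketch of that idea. The class name `FieldMapParser` is hypothetical, and I've assumed the input is UTF-8 and that you want one map per line (a file usually holds more than one record), so the file-level method returns a list of maps.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class FieldMapParser {
    private final String delimiterRegex;
    private final List<String> fields;

    FieldMapParser(char delimiter, List<String> fields) {
        // split() takes a regex, so quote the delimiter ("|" and "^" are metacharacters).
        this.delimiterRegex = Pattern.quote(String.valueOf(delimiter));
        this.fields = fields;
    }

    // Maps one line's values to the field names, walking both in step.
    // Extra values beyond the field list are ignored.
    Map<String, String> parseLine(String line) {
        String[] values = line.split(delimiterRegex, -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.size() && i < values.length; i++) {
            record.put(fields.get(i), values[i].trim());
        }
        return record;
    }

    // Returns one field-name -> value map per non-blank line (UTF-8 is an assumption).
    List<Map<String, String>> parseFile(InputStream file) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        BufferedReader br = new BufferedReader(new InputStreamReader(file, StandardCharsets.UTF_8));
        String line;
        while ((line = br.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                records.add(parseLine(line));
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Caret files list their fields as B, C, A, E, D (taken from the question).
        FieldMapParser parser = new FieldMapParser('^', Arrays.asList("B", "C", "A", "E", "D"));
        InputStream in = new ByteArrayInputStream("b1^c1^a1^e1^d1\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.parseFile(in).get(0).get("A")); // prints a1
    }
}
```

You would create one parser per delimiter, each with its own field order, and the rest of the program can then look up values by name regardless of which file they came from.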


One possible approach is to use the Java Compiler Compiler (https://javacc.dev.java.net/). With this you can write a set of rules for what you will accept and what delimiters might appear at any one time. The engine can be given rules to work around order issues depending on the delimiter in use. And the file could, if necessary, switch delimiters along the way.


If the exact order of the fields is known for each delimiter, I'd just create a parser that returns a Record object for each line, something like below.

This does include a lot of hard-coded values, but I'm not sure how flexible you need this to be. I would consider this more of a scripty/hacky solution than something you could extend. If you don't know the delimiter, you could test the first line of the file with String.split() and see whether the number of columns matches the expected count.

    import java.util.StringTokenizer;

    class MyParser
    {
        public static Record parseLine(String line, char delimiter)
        {
            // StringTokenizer takes its delimiters as a String, not a char
            StringTokenizer st1 = new StringTokenizer(line, String.valueOf(delimiter));
            //You could easily use an array instead of these dumb variables
            String temp1,temp2,temp3,temp4,temp5;

            temp1 = st1.nextToken();
            // .. etc. for temp2 to temp5 ..

            Record ret = new Record();
            switch (delimiter)
            {
                case '^': // caret order in the question is B, C, A, E, D
                    ret.B = temp1;
                    ret.C = temp2;
                    // ...etc...
                    break;
                case '~':
                    // ...etc...
                    break;
            }
            return ret;
        }
    }

    class Record
    {
        String A;
        String B;
        String C;
        String D;
        String E;
    }
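
The column-count check mentioned above could look something like this. It is only a sketch: the class name `DelimiterByColumnCount` and the candidate list are assumptions, and the expected count of 5 comes from the five fields A to E in the question.

```java
import java.util.regex.Pattern;

class DelimiterByColumnCount {
    // Candidate delimiters, tried in order (assumed; taken from the question's list).
    static final String[] CANDIDATES = { ",", "^", "~", "|", " " };

    // Returns the first candidate that splits the line into exactly
    // `expected` columns, or null if none does.
    static String detect(String firstLine, int expected) {
        for (String delim : CANDIDATES) {
            // split() takes a regex, so quote the delimiter; -1 keeps trailing empty columns.
            int columns = firstLine.split(Pattern.quote(delim), -1).length;
            if (columns == expected) {
                return delim;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect("B^C^A^E^D", 5)); // prints ^
    }
}
```

Space is tried last here because a line such as "A, B, C, D, E" also splits into five pieces on spaces.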


You can use StringTokenizer as mentioned earlier. Yes, you will need to specify a string containing all the possible delimiters. Don't forget to pass true for the returnDelims argument of the constructor. That way the delimiters come back as tokens too, so you will know which one is used in the file and can then parse the data accordingly.
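
A small sketch of that, assuming the delimiter set from the question (tilde, space, comma, pipe, caret). The class and method names are made up for illustration.

```java
import java.util.StringTokenizer;

class DetectWithTokenizer {
    // All delimiters the files might use (taken from the question).
    static final String DELIMS = "~ ,|^";

    // Returns the first delimiter character found in the line, or null if there is none.
    static String firstDelimiter(String line) {
        // The third argument (returnDelims = true) makes each delimiter come back
        // as its own one-character token.
        StringTokenizer st = new StringTokenizer(line, DELIMS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() == 1 && DELIMS.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDelimiter("B^C^A^E^D")); // prints ^
    }
}
```

This relies on the data not containing any of the delimiter characters; a value with a space in it would be reported as space-delimited.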


One way to find the delimiter in the file is to use some kind of regex. A simple case would be to find the first character that isn't a letter or a digit: [^A-Za-z0-9]

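import java.util.regex.Matcher;
import java.util.regex.Pattern;

static String getDelimiter(String str) {
  Pattern p = Pattern.compile("([^A-Za-z0-9])");
  Matcher m = p.matcher(str.trim()); //remove whitespace as first char(s)
  if(m.find())
   return m.group(0); //first character that is not a letter or digit
  else
   return null;
 }

public static void main(String[] args) {
  String[] str = {" A, B, C, D", "A B C D", "A;B;C;D"};
  for(String s : str){
   String delim = getDelimiter(s);
   if(delim == null) continue; //no delimiter found in this line
   //quote the delimiter: split() takes a regex, and | or ^ are metacharacters
   String[] data = s.split(Pattern.quote(delim));
   //do clever stuff with the array
  }
 }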

In this case I've loaded the data from an array instead of reading from a file. When reading from a file feed the first line to the getDelimiter method.


Most of the open source CSV parsing libraries allow you to change the delimiter characters, and also have behavior built in to handle escaping. Opencsv seems to be the popular one nowadays, but I haven't used it yet. I was pretty happy with the Ostermiller csv library last time I had to do a lot of csv parsing.
