How do you dynamically identify unknown delimiters in a data file?

2023-01-20 02:15 问答作者：

I have three input data files. Each uses a different delimiter for th开发者_如何学Goe data contained therein. Data file one looks like this:

apples | bananas | oranges | grapes

data file two looks like this:

quarter, dime, nickel, penny

data file three looks like this:

horse cow pig chicken goat

(the change in the number of columns is also intentional)

The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:

def count_chars(s):
    valid_seps=[' ','|',',',';','\t']
    cnt = {}
    for c in s:
        if c in valid_seps: cnt[c] = cnt.get(c,0) + 1
    return cnt

infile = 'pipe.txt' #or 'comma.txt' or 'space.txt'
records = open(infile,'r').read()
print count_chars(records)

It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.

But I can't think of a better way to do this.

Any suggestions?

How about trying Python CSV's standard: http://docs.python.org/library/csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print dialect.delimiter
# returns ','

If you're using python, I'd suggest just calling re.split on the line with all valid expected separators:

>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.

If you must identify the separator, your best bet is to count everything excluding spaces. If there are almost no occurrences, then it's probably space, otherwise, it's the max of the mapped characters.

Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.

We can determine the delimiter right most of the time based on some prior information (such as list of common delimiter) and frequency counting that all the lines give the same number of delimiter

def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters= [',',';','\t',' ','|',':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
                return d
    return ','

Often n=2 lines should be enough, check more lines for a more robust answers. Of course there are cases (often artificial ones) those lead to a false detection but it is unlikely happened in practice.

Here I use an efficient python implementation of head function that only read n-first line of a file. See my answer on How to read first N-lines of a file

I ended up going with the regex, because of the problem of spaces. Here's my finished code, in case anyone's interested, or could use anything else in it. On a tangential note, it would be neat to find a way to dynamically identify column order, but I realize that's a bit more tricky. In the meantime, I'm falling back on old tricks to sort that out.

for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
            #couldn't quite figure out a way to make this a single block 
            #(rather than three separate if/elifs. But you can see the split is
            #generalized already, so if anyone can come up with a better way,
            #I'm all ears!! :)
            for row in open(infile,'r').readlines():
                if infile.find('comma') > -1: 
                    datefmt = "%m/%d/%Y"
                    last, first, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                elif infile.find('space') > -1: 
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, dobraw, color = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]

                elif infile.find('pipe') > -1:
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                    #There is also a way to do this with csv.Sniffer, but the 
                    #spaces around the pipe delimiter also confuse sniffer, so
                    #I couldn't use it.
                else: raise ValueError(infile + "is not an acceptable input file.")

With Python 3, using standard library 'csv':

import csv


def get_delimiter(file_path: str) -> str:
    with open(file_path, 'r') as csvfile:
        delimiter = str(csv.Sniffer().sniff(csvfile.read()).delimiter)
        return delimiter

继续阅读：csv parsing python text-files textinput

How do you dynamically identify unknown delimiters in a data file?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？