How to use Python's csv module to split double-pipe-delimited data

2023-03-12 20:10 Q&A author:

I have got data which looks like:

"1234"||"abcd"||"a1s1"

I am trying to read and write it using Python's csv reader and writer. Since the csv module's delimiter is limited to a single character, is there any way to retrieve the data cleanly? I cannot afford to remove the empty columns, as this is a very large data set that has to be processed within a time limit. Any thoughts would be helpful.

The docs and experimentation confirm that only single-character delimiters are allowed.

Since csv.reader accepts any object that supports the iterator protocol, you can use a generator expression to replace each || with |, and then feed this generator to the reader:

import csv

def read_this_funky_csv(source):
    # be sure to pass a source object that supports
    # iteration (e.g. a file object, or a list of csv text lines)
    # note: this also rewrites any '||' that appears inside a quoted field
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')

This code is quite efficient, since it operates on one CSV line at a time; it only requires that no single line of your CSV source exceeds your available RAM :)
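
As a usage sketch, the generator approach can be fed any iterable of lines. Here io.StringIO stands in for a real file object, and the two sample rows are made up for illustration. It assumes the data never contains || inside a quoted field, since the blanket replace would rewrite that too.

```python
import csv
import io

def read_this_funky_csv(source):
    # Collapse each '||' to '|' lazily, one line at a time.
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')

# io.StringIO stands in for an open file; the rows are illustrative only.
sample = io.StringIO('"1234"||"abcd"||"a1s1"\n"5678"||""||"z9"\n')
rows = list(read_this_funky_csv(sample))
print(rows)
```

An empty field between two delimiters survives the replace, so the second row still comes back with three columns.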

>>> import csv
>>> reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
>>> for row in reader:
...     assert not ''.join(row[1::2])
...     row = row[0::2]
...     print(row)
...
['1234', 'abcd', 'a1s1']
>>>
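
The question also asks about writing. One possible counterpart to the slicing trick above is to interleave empty fields, so that csv.writer with delimiter='|' emits || between real values. This is only a sketch: the helper name write_double_pipe_rows is hypothetical, and with the default quoting a field is quoted only when it contains the delimiter, so the output will not reproduce the always-quoted style of the sample data.

```python
import csv
import io

def write_double_pipe_rows(rows, dest):
    # Hypothetical helper: put an empty field between every pair of real
    # fields so the single '|' delimiter appears twice in a row.
    writer = csv.writer(dest, delimiter='|', lineterminator='\n')
    for row in rows:
        padded = []
        for value in row:
            padded.extend([value, ''])
        writer.writerow(padded[:-1])  # drop the trailing filler

# io.StringIO stands in for an output file; the rows are illustrative only.
out = io.StringIO()
write_double_pipe_rows([['1234', 'abcd', 'a1s1'], ['x|y', 'z']], out)
print(out.getvalue())
```

Rows written this way can be read back with the delimiter='|' reader and the row[0::2] slice.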

Unfortunately, the delimiter is represented by a single character in the module's C implementation, so it cannot be anything other than a single character in Python. The good news is that the empty values produced by each || can be ignored:

import csv

reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
# Iterate through the rows produced by the reader.
for x in reader:
    # Each '||' leaves an empty string between two real fields, so the
    # values you want sit at the even indexes; odd indexes are discarded.
    values = [x[i] for i in range(len(x)) if i % 2 == 0]
    print(values)

There are other ways to accomplish this (a function could be written, for one), but this gives you the logic which is needed.
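
For instance, the logic could be wrapped in a generator function along these lines. This is a minimal sketch: the name read_double_pipe is hypothetical, and raising ValueError on a stray single | is an assumption about how malformed rows should be handled.

```python
import csv

def read_double_pipe(source):
    # Hypothetical helper: parse with '|' and keep every other field.
    for row in csv.reader(source, delimiter='|'):
        # The odd positions are the empty strings created by '||'.
        if any(row[1::2]):
            raise ValueError('unexpected single | in row: %r' % (row,))
        yield row[0::2]

# Illustrative input lines; any iterable of lines (e.g. a file) works.
rows = list(read_double_pipe(['"1234"||"abcd"||"a1s1"', '"x"||""||"y"']))
print(rows)
```

Because it is a generator, rows are produced one at a time and the whole data set never has to be held in memory.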

If your data literally looks like the example (the fields never contain '||' and are always quoted), and you can tolerate the quote marks or are willing to slice them off later, just use str.split:

>>> '"1234"||"abcd"||"a1s1"'.split('||')
['"1234"', '"abcd"', '"a1s1"']
>>> list(s[1:-1] for s in '"1234"||"abcd"||"a1s1"'.split('||'))
['1234', 'abcd', 'a1s1']

The csv module is only needed if the delimiter can appear within the fields, or to strip optional quotes around fields.
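
To illustrate the difference, here is a small comparison on a made-up line whose first field contains the delimiter. It is a sketch of the failure mode, not a statement about what the real data contains.

```python
import csv

line = '"a||b"||"c"'  # made-up line: the first field contains '||'

# Plain split cuts the quoted field in two.
split_fields = line.split('||')
print(split_fields)

# csv respects the quotes; every other field is the filler left by '||'.
row = next(csv.reader([line], delimiter='|'))
csv_fields = row[0::2]
print(csv_fields)
```

The split version yields three broken pieces, while the csv version recovers the two intended fields.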
