
Python - CSV: Large file with rows of different lengths

In short, I have a 20,000,000-line CSV file with rows of different lengths. This is due to archaic data loggers and proprietary formats; we get the end result as a CSV file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:

  • Keep the first 8 columns and my last 2 columns, to have a consistent CSV file
  • Add a new column to the csv file, either in the first or the last position.

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50


Read a row with csv, then:

newrow = row[:8] + row[-2:]

then add your new field and write it out (also with csv).
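
Put together, a minimal sketch of that approach might look like this (Python 2, matching the rest of this thread; the output file name and the value of the extra field are placeholders):

import csv

# Keep the first 8 and last 2 fields of every row, append one new field,
# and write the result back out.  'extra_value' is a placeholder for
# whatever your new column should contain.
with open('thebigfile.csv', 'rb') as src, open('consistent.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        newrow = row[:8] + row[-2:]
        newrow.append('extra_value')   # new column in the last position
        writer.writerow(newrow)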


You can open the file as a textfile and read the lines one at a time. Are there quoted or escaped commas that don't "split fields"? If not, you can do

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        ... use the `consist` list as warranted ...

I suspect that where I have + ['onemore'] you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.

Don't send each line separately with an insert to the DB -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list -- each time that list's length hits, say, 1000, use an executemany to add all those entries.
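
For example, with psycopg2 (the usual Python driver for Postgres) the batching could look roughly like this; the connection string, the table name, and the make_consistent_rows() generator are placeholders for illustration, not anything from the question:

import psycopg2

# Sketch only: psycopg2 is assumed as the Postgres driver, and the table
# name, column count, and connection parameters below are placeholders.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

insert_sql = "INSERT INTO raw_table VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
batch = []
for consist in make_consistent_rows():   # hypothetical source of the 11-field lists
    batch.append(consist)
    if len(batch) >= 1000:
        cur.executemany(insert_sql, batch)
        batch = []
if batch:                                # flush the final partial batch
    cur.executemany(insert_sql, batch)
conn.commit()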

Edit: to clarify, I don't recommend using csv to process a file you know is not in "proper" csv format: processing it directly gives you more direct control (especially as and when you discover other irregularities beyond the varying number of commas per line).


I would recommend using the csv module. Here's some code based on CSV processing that I've done elsewhere.

from __future__ import with_statement
import csv
import sys

def process( reader, writer ):
    for row in reader:
        # keep the first 8 and the last 2 fields of each row
        data = row[:8] + row[-2:]
        writer.writerow( data )

def main( infilename, outfilename ):
    with open( infilename, 'rU' ) as infile:
        reader = csv.reader( infile )
        with open( outfilename, 'w') as outfile:
            writer = csv.writer( outfile )
            process( reader, writer )

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "syntax: python process.py filename outname"
        sys.exit(1)
    main( sys.argv[1], sys.argv[2] )


Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let the unhappy data into your database, you will never get all of it out.

Remember oddities about CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
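
For instance, a rough sketch of letting csv.Sniffer guess the dialect from the start of the file (the file name and sample size are arbitrary):

import csv

# Let csv.Sniffer guess the dialect from a sample of the file, then reuse
# that dialect for the real pass.
with open('thebigfile.csv', 'rb') as f:
    sample = f.read(64 * 1024)          # the first 64 KB is usually enough
    dialect = csv.Sniffer().sniff(sample)
    f.seek(0)
    for row in csv.reader(f, dialect):
        pass                            # inspect / validate rows here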

I recommend you:

  • run a 'tail' command to look at the last few lines.
  • if it appears well behaved, run the whole file through the csv reader to see whether it breaks. Make a quick histogram of "fields per line" (see the sketch after this list).
  • Think about "valid" ranges and character types and rigorously check them as you read. Especially watch for unusual unicode or characters outside of the printable range.
  • Seriously consider if you want to keep the extra, odd-ball values in a "rest of the line" text field.
  • Toss any unexpected lines into an exception file.
  • Fix up your code to handle the new patterns in the exceptions file. Rinse. Repeat.
  • Finally, run the whole thing again, actually dumping data into the database.
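
A quick "fields per line" histogram, for instance, could be as small as this sketch (the file name is a placeholder, and it assumes Python 2 like the rest of this answer):

import csv
from collections import defaultdict

# Count how many lines have each number of fields, so you can see at a
# glance which row lengths occur and how common each one is.
counts = defaultdict(int)
with open('thebigfile.csv', 'rb') as f:
    for fields in csv.reader(f):
        counts[len(fields)] += 1

for length in sorted(counts):
    print "%3d fields: %d lines" % (length, counts[length])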

Your development will go faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.

Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:

while not eof:
    out = []
    for chunk in range(1000):
        try:
            fields = next(reader)
        except StopIteration:
            break
        except csv.Error:
            print str(reader.line_num) + ", 'failed to parse'"
            continue
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()        # no extra whitespace
            assert not fields[5].strip(printable_chars)  # no odd chars
            ...
        except AssertionError:
            print str(reader.line_num) + ", 'failed checks'"
            continue
        new_rec = [reader.line_num]             # new first item
        new_rec.extend(fields[:8])              # first eight
        new_rec.extend(fields[-2:])             # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if database:
        cursor.executemany("INSERT INTO raw_table VALUES (%s, ...)", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.
