
Python - CSV: Large file with rows of different lengths

In short, I have a 20,000,000-line CSV file with rows of different lengths. This is due to archaic data loggers and proprietary formats; we get the end result as a CSV file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:

  • Keep the first 8 columns and my last 2 columns, to have a consistent CSV file
  • Add a new column to the csv file, either in the first or the last position.

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50


Read a row with csv, then:

newrow = row[:8] + row[-2:]

then add your new field and write it out (also with csv).
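
Put together, a minimal sketch of that approach might look like this (Python 2, matching the rest of this thread; the output file name and the value of the extra field are placeholders):

import csv

# Keep the first 8 and last 2 fields of every row, append one new field,
# and write the result back out.  'extra_value' is a placeholder for
# whatever your new column should contain.
with open('thebigfile.csv', 'rb') as src, open('consistent.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        newrow = row[:8] + row[-2:]
        newrow.append('extra_value')   # new column in the last position
        writer.writerow(newrow)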


You can open the file as a textfile and read the lines one at a time. Are there quoted or escaped commas that don't "split fields"? If not, you can do

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        ... use the `consist` list as warranted ...

I suspect that where I have + ['onemore'] you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.

Don't send each line separately with an insert to the DB -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list -- each time that list's length hits, say, 1000, use an executemany to add all those entries.
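
For example, with psycopg2 (the usual Python driver for Postgres) the batching could look roughly like this; the connection string, the table name, and the make_consistent_rows() generator are placeholders for illustration, not anything from the question:

import psycopg2

# Sketch only: psycopg2 is assumed as the Postgres driver, and the table
# name, column count, and connection parameters below are placeholders.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

insert_sql = "INSERT INTO raw_table VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
batch = []
for consist in make_consistent_rows():   # hypothetical source of the 11-field lists
    batch.append(consist)
    if len(batch) >= 1000:
        cur.executemany(insert_sql, batch)
        batch = []
if batch:                                # flush the final partial batch
    cur.executemany(insert_sql, batch)
conn.commit()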

Edit: to clarify, I don't recommend using csv to process a file you know is not in "proper" csv format: processing it directly gives you more direct control (especially as and when you discover other irregularities beyond the varying number of commas per line).


I would recommend using the csv module. Here's some code based on CSV processing that I've done elsewhere.

from __future__ import with_statement
import csv
import sys

def process( reader, writer ):
    for row in reader:
        # keep the first 8 and the last 2 fields of each row
        data = row[:8] + row[-2:]
        writer.writerow( data )

def main( infilename, outfilename ):
    with open( infilename, 'rU' ) as infile:
        reader = csv.reader( infile )
        with open( outfilename, 'w') as outfile:
            writer = csv.writer( outfile )
            process( reader, writer )

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "syntax: python process.py filename outname"
        sys.exit(1)
    main( sys.argv[1], sys.argv[2] )


Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let the unhappy data into your database, you will never get all of it out.

Remember oddities about CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
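
For instance, a rough sketch of letting csv.Sniffer guess the dialect from the start of the file (the file name and sample size are arbitrary):

import csv

# Let csv.Sniffer guess the dialect from a sample of the file, then reuse
# that dialect for the real pass.
with open('thebigfile.csv', 'rb') as f:
    sample = f.read(64 * 1024)          # the first 64 KB is usually enough
    dialect = csv.Sniffer().sniff(sample)
    f.seek(0)
    for row in csv.reader(f, dialect):
        pass                            # inspect / validate rows here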

I recommend you:

  • run a 'tail' command to look at the last few lines.
  • if it appears well behaved, run the whole file through the csv reader to see whether it breaks. Make a quick histogram of "fields per line" (see the sketch after this list).
  • Think about "valid" ranges and character types and rigorously check them as you read. Especially watch for unusual unicode or characters outside of the printable range.
  • Seriously consider if you want to keep the extra, odd-ball values in a "rest of the line" text field.
  • Toss any unexpected lines into an exception file.
  • Fix up your code to handle the new patterns in the exceptions file. Rinse. Repeat.
  • Finally, run the whole thing again, actually dumping data into the database.
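
A quick "fields per line" histogram, for instance, could be as small as this sketch (the file name is a placeholder, and it assumes Python 2 like the rest of this answer):

import csv
from collections import defaultdict

# Count how many lines have each number of fields, so you can see at a
# glance which row lengths occur and how common each one is.
counts = defaultdict(int)
with open('thebigfile.csv', 'rb') as f:
    for fields in csv.reader(f):
        counts[len(fields)] += 1

for length in sorted(counts):
    print "%3d fields: %d lines" % (length, counts[length])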

Your development will go faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.

Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:

while not eof:
    out = []
    for chunk in range(1000):
        try:
            fields = next(reader)
        except StopIteration:
            break
        except csv.Error:
            print str(reader.line_num) + ", 'failed to parse'"
            continue
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()        # no extra whitespace
            assert not fields[5].strip(printable_chars)  # no odd chars
            ...
        except AssertionError:
            print str(reader.line_num) + ", 'failed checks'"
            continue
        new_rec = [reader.line_num]             # new first item
        new_rec.extend(fields[:8])              # first eight
        new_rec.extend(fields[-2:])             # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if database:
        cursor.executemany("INSERT INTO raw_table VALUES (%s, ...)", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.
