Python: Comparing specific columns in two csv files
Say that I have two CSV files (file1 and file2) with contents as shown below:
file1:
fred,43,Male,"23,45",blue,"1, bedrock avenue"
file2:
fred,39,Male,"23,45",blue,"1, bedrock avenue"
I would like to compare these two CSV records to see if columns 0,2,3,4, and 5 are the same. I don't care about column 1.
What's the most pythonic way of doing this?
EDIT:
Some example code would be appreciated.
EDIT2:
Please note the embedded commas need to be handled correctly.
I suppose the best way is to use Python's csv library: http://docs.python.org/library/csv.html.
UPDATE (example added):
import csv

# newline='' is how the csv docs recommend opening files in Python 3
with open('data1.csv', newline='') as f1, open('data2.csv', newline='') as f2:
    row1 = next(csv.reader(f1, delimiter=',', quotechar='"'))
    row2 = next(csv.reader(f2, delimiter=',', quotechar='"'))

# Compare column 0 and columns 2 onward; column 1 is ignored
if row1[0] == row2[0] and row1[2:] == row2[2:]:
    print("eq")
else:
    print("different")
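To see why the csv module matters for the embedded commas mentioned in the question, here is a minimal sketch. It assumes Python 3 and keeps the two sample records in memory with io.StringIO instead of reading file1 and file2 from disk; the variable names are only for illustration.

```python
import csv
import io

# The two sample records from the question, held in memory instead of on disk.
line1 = 'fred,43,Male,"23,45",blue,"1, bedrock avenue"\n'
line2 = 'fred,39,Male,"23,45",blue,"1, bedrock avenue"\n'

# A naive split breaks the quoted fields apart at their embedded commas.
naive = line1.strip().split(",")

# csv.reader honours the quotes, so each record has exactly six fields.
row1 = next(csv.reader(io.StringIO(line1)))
row2 = next(csv.reader(io.StringIO(line2)))

# Same comparison as above: column 0 plus columns 2 onward.
same = row1[0] == row2[0] and row1[2:] == row2[2:]
print(len(naive), len(row1), same)
```

The naive split yields more pieces than there are columns, so index-based comparisons on it would line up the wrong fields.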
>>> import csv
>>> csv1 = csv.reader(open("file1.csv", "r"))
>>> csv2 = csv.reader(open("file2.csv", "r"))
>>> while True:
...     try:
...         line1 = next(csv1)
...         line2 = next(csv2)
...         equal = (line1[0]==line2[0] and line1[2]==line2[2] and line1[3]==line2[3] and line1[4]==line2[4] and line1[5]==line2[5])
...         print(equal)
...     except StopIteration:
...         break
...
True
Update
3 years later, I think I'd rather write it this way.
import csv
interesting_cols = [0, 2, 3, 4, 5]
with open("file1.csv", 'r') as file1,\
     open("file2.csv", 'r') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)
    for line1, line2 in zip(reader1, reader2):
        equal = all(x == y
            for n, (x, y) in enumerate(zip(line1, line2))
            if n in interesting_cols
        )
        print(equal)
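One thing to watch with zip is that it stops at the shorter file, so extra rows in the longer one are silently skipped. Here is a rough sketch of a variant that reports those rows as mismatches, using itertools.zip_longest and operator.itemgetter. The function name compare_columns and the second sample row are made up for illustration, and it assumes every row has at least six fields (a shorter row would raise IndexError).

```python
import csv
import io
from itertools import zip_longest
from operator import itemgetter


def compare_columns(lines1, lines2, cols=(0, 2, 3, 4, 5)):
    """Return one boolean per row pair; a missing row counts as a mismatch."""
    pick = itemgetter(*cols)  # pulls out only the columns of interest
    results = []
    for row1, row2 in zip_longest(csv.reader(lines1), csv.reader(lines2)):
        if row1 is None or row2 is None:
            # One file ran out of rows before the other.
            results.append(False)
        else:
            results.append(pick(row1) == pick(row2))
    return results


# In-memory stand-ins for the two files; the second row is hypothetical.
file1 = io.StringIO(
    'fred,43,Male,"23,45",blue,"1, bedrock avenue"\n'
    'wilma,40,Female,"1,2",red,"2, bedrock avenue"\n'
)
file2 = io.StringIO('fred,39,Male,"23,45",blue,"1, bedrock avenue"\n')

print(compare_columns(file1, file2))
```

Any iterable of lines works as input, so open file objects can be passed in place of the StringIO objects.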
I would read both records, eliminate column 1, and then compare what's left. (This works in Python 3.)
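import csv

def records_match(path1, path2):
    # Read the first record of each file
    with open(path1, "r") as f1, open(path2, "r") as f2:
        r1 = next(csv.reader(f1))
        r2 = next(csv.reader(f2))
    # Drop column 1 from both records, then compare what's left
    r1.pop(1)
    r2.pop(1)
    return r1 == r2

print(records_match("file1.csv", "file2.csv"))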
# Include required modules (only pandas is used below)
import pandas as pd
# Include required csv files
df_TrainSet = pd.read_csv('../data/ldp_TrainSet.csv')
df_DataSet = pd.read_csv('../data/ldp_DataSet.csv')
# First test
[c for c in df_TrainSet if c not in df_DataSet.columns]
# Second test
[c for c in df_DataSet if c not in df_TrainSet.columns]
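This example checks whether every column name in one CSV file is also present in the other; each list comes back empty when nothing is missing. Note that it compares column names (the header row), not the values in the rows.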