Marking duplicates in a CSV file

I'm stumped by the problem illustrated in the sample below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

I want to detect duplicates in the "PHONE" column and mark each subsequent duplicate by putting the "ID" of the first occurrence in its "REF" column and "Yes" in its "DISCARD" column:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how do I go about it? I tried this code but my logic wasn't right, of course.

import csv
myfile = open(r"C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open(r"C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open(r"C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.


The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

import csv

phones = {}
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in phones:
                # a duplicate: point REF at the first occurrence's ID
                ref = phones[phone]
                discard = "Yes"
            else:
                phones[phone] = id
            writer.writerow((id, name, phone, ref, discard))


Sounds like homework. Since this is a CSV file (and thus changing the record size is next to impossible), you are best off loading the whole file into memory and manipulating it there before writing it out to a new file. Create a list of the file's rows. Then build a map from phone number (the key) to id (the value). Before each insert, look the number up: if it already exists, update the row containing the duplicate phone number; if it isn't already in the map, insert the (phone, id) pair. A minimal sketch of that approach follows below.
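
A minimal sketch of that approach, assuming the column layout from the question (the input and output paths are placeholders):

import csv

# read the whole file into memory as a list of rows
with open(r'c:\temp\input.csv', 'rb') as fin:
    rows = list(csv.reader(fin))

header, body = rows[0], rows[1:]

firstids = {}  # phone number -> ID of its first occurrence
for row in body:
    phone = row[2]
    if phone in firstids:
        row[3] = firstids[phone]  # REF points at the first occurrence
        row[4] = 'Yes'            # DISCARD
    else:
        firstids[phone] = row[0]

with open(r'c:\temp\output.csv', 'wb') as fout:
    csv.writer(fout).writerows([header] + body)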


One thing I do know: you don't have to read the entire file into memory to accomplish this.

import csv
myfile = r"C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open(r"C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

reader = csv.reader(open(myfile, "rb"))
# pass the header row through untouched
dest.writerow(next(reader))
for row in reader:
    # setdefault sets the value to the second argument if it hasn't been set, and then
    # returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    if firstid != row[0]:
        row[4] = "Yes"
    dest.writerow(row)


from operator import itemgetter
from itertools import groupby

import csv
verum = list(csv.reader(open('data.csv','rb')))

verum.sort(key=itemgetter(2,0))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = next(grp)
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first))
        for x in grp:
            # all others get the first item's id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print line

Outputs:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']

Writing the data back is left to the reader ;-)
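
For completeness, one possible version of that last step, continuing from the snippet above ('fixed.csv' is just a placeholder name):

dest = csv.writer(open('fixed.csv', 'wb'), dialect='excel')
# header first, then the corrected rows in their original ID order
dest.writerow(['ID', 'NAME', 'PHONE', 'REF', 'DISCARD'])
dest.writerows(sorted(grouper(verum), key=itemgetter(0)))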


I work with large 40k-plus-record CSV files, and the easiest way to get rid of dupes is with Access:

1. Create a new database.
2. Tables tab: Get External Data.
3. Save the table.
4. Queries tab: new Find Duplicates wizard (match on the phone field, show all fields and count).
5. Save the query (export as .txt, but name it dupes.txt).
6. Import the query result as a new table; do not import the field with the dupe count.
7. Query: Find Unmatched (match by the phone field, show all fields in the result; save the query, then export as .txt, but name it unique.txt).
8. Import the unique file into the existing table (dupes).
9. You can now save and export again into whatever file type you need and not have any dupes.
