Finding Duplicates (Regex)
I have a CSV containing a list of 500 members with their phone numbers. I tried diff tools, but none of them seem to find duplicates.
Can I use regex to find duplicate rows by members' phone numbers?
I'm using TextMate on Mac.
Many thanks
What duplicates are you searching for? The whole lines or just the same phone number?
If it is the whole line, then try this:
sort phonelist.txt | uniq -c | sort -n
Each line is prefixed with its count, and the lines that occur more than once end up at the bottom.
If it is just the phone number in some column, then use this:
awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n
Replace the '4' with the number of the column that holds the phone number, and the ';' with the separator your file actually uses. The sort before uniq is needed because uniq only collapses adjacent identical lines.
Or give us a few example lines from this file.
EDIT:
If the data format is name,mobile,phone,uniqueid,group, then run the following on the command line (use $2 instead of $3 to check the mobile column):
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n
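For anyone who prefers a script to a shell pipeline, the same per-column count can be sketched in Python. This is only a rough equivalent: the function name duplicate_phone_counts and the three sample rows are made up for illustration, and it assumes the phone number sits in the third column (index 2) with no header row.

```python
import csv
import io
from collections import Counter


def duplicate_phone_counts(csv_text, column=2, delimiter=","):
    """Return {phone: count} for every value that occurs more than once.

    `column` is 0-based, so 2 means the third field (assumed to be `phone`).
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # Count every value in the chosen column, skipping rows that are too short.
    counts = Counter(row[column].strip() for row in reader if len(row) > column)
    return {phone: n for phone, n in counts.items() if n > 1}


# Made-up sample rows in the name,mobile,phone,uniqueid,group layout.
sample = (
    "alice,0771,555-1000,u1,staff\n"
    "bob,0772,555-2000,u2,staff\n"
    "carol,0773,555-1000,u3,guest\n"
)

print(duplicate_phone_counts(sample))  # {'555-1000': 2}
```

For a real file, the text could be read with open("phonelist.txt").read() and passed in the same way.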
Yes. For one way to do it, look here. But you would probably not want to do it this way.
You can simply parse this file and check which rows are duplicated. I think regex
is the worst solution for this problem.
What language are you using? In .NET, with little effort you could load the CSV file into a DataTable and find/remove the duplicate rows. Afterwards, write your DataTable back to another CSV file.
Heck, you can load this file into Excel, sort by a field, and find the duplicates manually. 500 isn't THAT many.
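The load, de-duplicate, write-back idea is not specific to .NET. Here is a minimal Python sketch of it, not the DataTable code itself: write_unique_rows is a hypothetical helper, the sample rows are invented, and it assumes that "duplicate" means "same value in the third column" and that the first row with a given number is the one to keep.

```python
import csv
import io


def write_unique_rows(src, dst, column=2):
    """Copy CSV rows from src to dst, keeping the first row per phone number.

    src and dst are open file-like objects; returns the number of rows kept.
    """
    seen = set()
    kept = 0
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        if len(row) > column:
            phone = row[column].strip()
            if phone in seen:
                continue  # a row with this number was already written
            seen.add(phone)
        writer.writerow(row)  # rows too short to have the column pass through
        kept += 1
    return kept


# Made-up sample rows standing in for the real file.
src = io.StringIO(
    "alice,0771,555-1000,u1,staff\n"
    "bob,0772,555-2000,u2,staff\n"
    "carol,0773,555-1000,u3,guest\n"
)
dst = io.StringIO()
kept = write_unique_rows(src, dst)
print(kept)            # 2
print(dst.getvalue())
```

With real files, both would be opened with newline="" as the csv module documentation recommends, for example open("phonelist.txt", newline="") for reading and open("unique.csv", "w", newline="") for writing.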
Use Perl.
Load the CSV file into an array, and match the column you want to check (phone numbers) for duplicates, then store the values into another array, then check for duplicates in that array, using:
my %seen;
my @unique = grep !$seen{$_}++, @array2;
After that, all you need to do is loop over the unique array (phone numbers) and, inside that loop, loop over the first array (lines). Compare each line's phone number with the one from the unique array, and if it matches, output that line into another CSV file.
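A related sketch in Python, for comparison. It swaps the two nested loops for a dictionary keyed by phone number, and it reports only the rows whose number occurs more than once, which is an assumption about the output the asker wants. The name rows_with_shared_phone and the sample rows are invented for the example.

```python
import csv
import io
from collections import defaultdict


def rows_with_shared_phone(csv_text, column=2):
    """Return every row whose phone number appears on more than one row."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > column:
            # Group rows by the value in the phone column (0-based index).
            groups[row[column].strip()].append(row)
    # Keep only groups with at least two rows, in first-seen order.
    return [row for rows in groups.values() if len(rows) > 1 for row in rows]


# Made-up sample rows in the name,mobile,phone,uniqueid,group layout.
sample_rows = (
    "alice,0771,555-1000,u1,staff\n"
    "bob,0772,555-2000,u2,staff\n"
    "carol,0773,555-1000,u3,guest\n"
)

for row in rows_with_shared_phone(sample_rows):
    print(",".join(row))
```

Grouping in one pass avoids rescanning all lines for every phone number, though with 500 rows either approach is fast enough.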