Mongodb import and deciphering changed rows
I have a large csv file which contains over 30million rows. I need to load this file on a daily basis and identify which of the rows have changed. Unfortunately there is no unique key field but it's possible to use four of the fields to make it unique. Once I have identified the changed rows I will then want to export the data. I have tried using a traditional SQL Server solution but the performance is so slow it's not going to work. Therefore I have been looking at Mongodb - this has managed to import the file in about 20 minutes (which is fine). Now I don't have any experience using Monogdb and more importantly knowing best practices. So, my idea is the following:
As a one off - Import data into a collection using the mongoimport.
Copy all of the unique id's generated by mongo and put them in a separate collection.
Import new data into 开发者_开发知识库the existing collection using upsert fields which should create a new id for each new and changed row.
Compare the 'copy' to the new collection to list out all the changed rows.
Export changed data.
This to me will work but I am hoping there is a much better way to tackle this problem.
Use unix sort and diff.
Sort the file on disk
sort -o new_file.csv -t ',' big_file.csv
sort -o old_file.csv -t ',' yesterday.csv
diff new_file.csv old_file.csv
Commands may need some tweeking.
You can also use mysql to import the file via
http://dev.mysql.com/doc/refman/5.1/en/load-data.html (LOAD FILE)
and then create KEY (or primary key) on the 4 fields.
Then load yesterday's file into a different table and then use a 2 sql statements to compare the files...
But, diff will work best!
-daniel
精彩评论