Challenge: merging CSV files intelligently
We are in the middle of changing our web store platform and need to import product data from different sources.
We currently have several different CSV files from different IT systems/databases, because each system is missing some information. Fortunately the product IDs are the same, so it's possible to relate the data using those IDs.
We need to merge this data into one big CSV file so we can import it into our new e-commerce site.
My question: is there a general approach when you need to merge CSV files with related data into one CSV file? Are there any applications or tools that help you out?
If you use R, you can load the CSV files and then perform any query you like, as if you were working with tables in a database.
For example, you could use a command like the following for each CSV file:
table1 <- read.csv("D:/table1.csv", header=T, sep=",", na.strings=c(".", "NA", "", "?"))
Then load the sqldf package:
library(sqldf)
And perform queries:
table3 <- sqldf("SELECT X1, X2, X3 FROM table1 JOIN table2 USING(X1)")
Finally, export the result as a new CSV:
write.csv(table3, file = "D:/temp/table3.csv", row.names = FALSE)
Yes. Use Python and the csv module.
Read, create the merged set of results, and write.
import csv

products = {}

# Read the first file and index each row by the product id column.
with open("one file.csv", newline="") as source:
    rdr = csv.DictReader(source)
    for row in rdr:
        products[row['somekeycolumn']] = row  # or maybe some rearrangement

# Read the second file and fold its columns into the matching rows.
with open("another file.csv", newline="") as source:
    rdr = csv.DictReader(source)
    for row in rdr:
        if row['adifferentkey'] in products:
            products[row['adifferentkey']].update(row)  # maybe update?
        else:
            products[row['adifferentkey']] = row  # or maybe some rearrangement

# Write the merged rows, sorted by key, to the result file.
with open("result.csv", "w", newline="") as target:
    wtr = csv.DictWriter(target, fieldnames=['the', 'heading', 'columns'],
                         extrasaction='ignore')  # fieldnames: the real merged column names
    wtr.writeheader()
    for name in sorted(products):
        wtr.writerow(products[name])
Something like that.
Sure, any programming language/tool that is able to manipulate files can do the job (Python/Perl/PHP/Ruby/awk, among others). Here is an example with the awk *nix tool, processing two CSV files using field 1 as the common key:
awk -F"," -v OFS="," 'FNR==NR {   # first file: cache each whole line in d, keyed by field 1
  d[$1] = $0
  next
}
{                                 # second file: print the cached file1 line, then the file2 line
  print d[$1], $0
}' file1 file2
I wrote TxtSushi to do SQL selects and some other simple transformations on flat files. For example, you can do something like the following:
tssql -table t1 tbl1.csv -table t2 tbl2.csv \
  'select * from t1 inner join t2 on t1.id=t2.id'
There are many example scripts here. It is distributed as a package on http://hackage.haskell.org, so installation depends on the Haskell Platform.
Why not just load each CSV file into a common set of input tables and then use SQL to merge into the permanent tables from there? Sample input tables:
InputCSVFiles
    FileID    int PK auto number/identity
    FileName  string
    FileHash  string

InputCSVItems
    ItemID    int PK auto number/identity
    FileID    FK
    ProductID
    Col1
    Col2
    ...
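A minimal DDL sketch of those two staging tables (the column types and identity syntax here are assumptions and will vary by database) could be:

CREATE TABLE InputCSVFiles (
    FileID    INT IDENTITY(1,1) PRIMARY KEY,
    FileName  VARCHAR(255) NOT NULL,
    FileHash  VARCHAR(64)  NOT NULL UNIQUE   -- lets you reject a file that was already loaded
);

CREATE TABLE InputCSVItems (
    ItemID    INT IDENTITY(1,1) PRIMARY KEY,
    FileID    INT NOT NULL REFERENCES InputCSVFiles(FileID),
    ProductID VARCHAR(50) NOT NULL,
    Col1      VARCHAR(255),
    Col2      VARCHAR(255)
    -- ... plus one column per field you need to carry over
);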
Just load each file into the tables; you can use the file hash to prevent duplicate file uploads. You can then use SQL to identify duplicates and use INSERT ... SELECT statements to populate the permanent tables as necessary. I'm not sure how many duplicate products there are, but dealing with them in SQL is fairly easy with self joins, COALESCE(), ROW_NUMBER(), GROUP BY, MIN(), etc. (see the ROW_NUMBER() sketch after the INSERT example below).
INSERT INTO Product
    (ProductID, col1, col2, ...)
SELECT
    ProductID, col1, col2, ...
FROM InputCSVItems
...
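For the de-duplication itself, a sketch along these lines works (the "keep the row from the most recently loaded file" rule is only an assumption; use whatever tie-break fits your data):

-- Keep one row per ProductID, here preferring the most recently loaded file.
WITH ranked AS (
    SELECT i.ProductID, i.Col1, i.Col2,
           ROW_NUMBER() OVER (PARTITION BY i.ProductID
                              ORDER BY i.FileID DESC) AS rn
    FROM InputCSVItems i
)
INSERT INTO Product (ProductID, col1, col2)
SELECT ProductID, Col1, Col2
FROM ranked
WHERE rn = 1;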
I once wrote a "dsv2sql" utility that does exactly what you need. Here is a sample command:
./dsv2sql --delimiter=, --output=csv --uid=id one.csv two.csv > combined_one_two.csv
Naturally input CSV files with headers are expected.
This is a CSV merge utility; it can optionally compare only a column range when merging, and it supports merging by detecting overlapping rows or by adding only unique rows.