Searching CSV with 1.6 Million lines (150MB) file?
I have a CSV containing 1.6 million lines of data and at around 150MB, it contains product data. I have another CSV containing 2000 lines, which contains a list of product in the big CSV. They relate to each other by a unique id. The idea is to add 开发者_JAVA技巧the product data in the CSV with 2000 lines.
The databank.csv has headers ID
, Product Name
, Description
, Price
.
The sm_list.csv
has header ID
.
The result is to output a csv with products in sm_list.csv
, with the corresponding data in databank.csv
... 2000 rows long.
My original solution reads in all of the sm_list
, and reads databank line by line. It searches sm_list
for the ID
in the line read in from databank. This leads to 2000x1.6Million = 3200 million comparisons!
Could you please provide a basic algorithm outline to complete this task in the most efficient way?
Assuming you know to how read/write CSV files in MATLAB (several questions here on SO shows how), here is an example:
%# this would be read from "databank.csv"
prodID = (1:10)'; %'
prodName = cellstr( num2str(prodID, 'Product %02d') );
prodDesc = cellstr( num2str(prodID, 'Description %02d') );
prodPrice = rand(10,1)*100;
databank = [num2cell(prodID) prodName prodDesc num2cell(prodPrice)];
%# same for "sm_list.csv"
sm_list = [2;5;7;10];
%# find matching rows
idx = ismember(prodID,sm_list);
result = databank(idx,:)
%# ... export 'result' to CSV file ...
The result of the above example:
result =
[ 2] 'Product 02' 'Description 02' [19.251]
[ 5] 'Product 05' 'Description 05' [14.651]
[ 7] 'Product 07' 'Description 07' [4.2652]
[10] 'Product 10' 'Description 10' [ 53.86]
have to be using matlab? If you just input all that data into a database, it'll be easier. A simple select tableA.ID, tableB.productname... where tableA.id = tableB.id
will do it.
精彩评论