开发者

Searching CSV with 1.6 Million lines (150MB) file?

I have a CSV containing 1.6 million lines of data and at around 150MB, it contains product data. I have another CSV containing 2000 lines, which contains a list of product in the big CSV. They relate to each other by a unique id. The idea is to add 开发者_JAVA技巧the product data in the CSV with 2000 lines.

The databank.csv has headers ID, Product Name, Description, Price . The sm_list.csv has header ID.

The result is to output a csv with products in sm_list.csv, with the corresponding data in databank.csv... 2000 rows long.

My original solution reads in all of the sm_list, and reads databank line by line. It searches sm_list for the ID in the line read in from databank. This leads to 2000x1.6Million = 3200 million comparisons!

Could you please provide a basic algorithm outline to complete this task in the most efficient way?


Assuming you know to how read/write CSV files in MATLAB (several questions here on SO shows how), here is an example:

%# this would be read from "databank.csv"
prodID = (1:10)';     %'
prodName = cellstr( num2str(prodID, 'Product %02d') );
prodDesc = cellstr( num2str(prodID, 'Description %02d') );
prodPrice = rand(10,1)*100;
databank = [num2cell(prodID) prodName prodDesc num2cell(prodPrice)];

%# same for "sm_list.csv"
sm_list = [2;5;7;10];

%# find matching rows
idx = ismember(prodID,sm_list);
result = databank(idx,:)

%# ... export 'result' to CSV file ...

The result of the above example:

result = 
    [ 2]    'Product 02'    'Description 02'    [19.251]
    [ 5]    'Product 05'    'Description 05'    [14.651]
    [ 7]    'Product 07'    'Description 07'    [4.2652]
    [10]    'Product 10'    'Description 10'    [ 53.86]


have to be using matlab? If you just input all that data into a database, it'll be easier. A simple select tableA.ID, tableB.productname... where tableA.id = tableB.id will do it.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜