Matlab: Join datasets by not exact but similar values

2023-03-18 16:26 Q&A author:

I have two example datasets, A and B below, that I want to join in Matlab to create C. The keys will be 'product' and 'year', but the problem is that the product number in dataset B only matches the one in A by the first 4 digits. Is there a way to join 'almost' matching numbers in this way?

 
A       
product tariff  year
202341  2       1999
202341  4       2000
202341  20      2008
202355  9       1999
202355  16      2000
438811  0       1999
438891  8       1999
438891  3       2001
671212  15      2005
671260  10      2005

and

B       
product avg_tariff  year
2023    5,5         1999
2023    10          2000
2023    20          2008
4388    4           1999
4388    3           2001
6712    12,5        2005

are joined to produce matrix C

C           
product tariff  year    avg_tariff
202341  2       1999    5,5
202341  4       2000    10
202341  20      2008    20
202355  9       1999    5,5
202355  16      2000    10
438811  0       1999    4
438891  8       1999    4
438891  3       2001    3
671212  15      2005    12,5
671260  10      2005    12,5

Thanks in advance

Oscar

Since this question is related to a previous one of yours I answered, I will reuse the code and update it to the new data:

a.csv

product tariff  year
202341  2       1999
202341  4       2000
202341  20      2008
202355  9       1999
202355  16      2000
438811  0       1999
438891  8       1999
438891  3       2001
671212  15      2005
671260  10      2005

b.csv

product avg_tariff  year
2023    5.5         1999
2023    10          2000
2023    20          2008
4388    4           1999
4388    3           2001
6712    12.5        2005
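
Note that the question's B writes avg_tariff with a decimal comma (5,5 and 12,5), while b.csv above uses a decimal point. If the raw file still contains commas, one possible way to convert the column is sketched below. It assumes the column was read as text, and the variable name raw is only illustrative.

```matlab
% avg_tariff column of B as text, with decimal commas as in the question
raw = {'5,5'; '10'; '20'; '4'; '3'; '12,5'};

% swap the decimal comma for a point before converting to double
avg_tariff = str2double(strrep(raw, ',', '.'));
```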

MATLAB code

(using the dataset class from the Statistics Toolbox):

%# read A, and build dataset
fid = fopen('a.csv','rt');
C = textscan(fid, '%s%f%f', 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',1);
fclose(fid);
dA = dataset({C{1} 'product'}, {C{2} 'tariff'}, {C{3} 'year'});

%# read B, and build dataset
fid = fopen('b.csv','rt');
C = textscan(fid, '%s%f%f', 'Delimiter',' ', 'MultipleDelimsAsOne',true, 'HeaderLines',1);
fclose(fid);
dB = dataset({C{1} 'product'}, {C{2} 'avg_tariff'}, {C{3} 'year'});

%# keep the full product code, then truncate the join key to its first 4 characters
dA.productLong = dA.product;
dA.product = cellfun(@(s) s(1:4), dA.product, 'UniformOutput',false);

%# inner join (keep only rows that exist in both datasets)
ds = join(dA, dB, 'keys',{'product' 'year'}, 'type','inner', 'MergeKeys',true);

%# restore the long product number as first column, and sort by it
ds.product = ds.productLong;
ds.productLong = [];
ds = sortrows(ds, 'product')

The result is as expected:

ds = 
    product         tariff    year    avg_tariff
    '202341'         2        1999     5.5      
    '202341'         4        2000      10      
    '202341'        20        2008      20      
    '202355'         9        1999     5.5      
    '202355'        16        2000      10      
    '438811'         0        1999       4      
    '438891'         8        1999       4      
    '438891'         3        2001       3      
    '671212'        15        2005    12.5      
    '671260'        10        2005    12.5
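
In newer MATLAB releases the same idea can probably be expressed with the table type and innerjoin instead of dataset. The following is only a sketch under some assumptions: the data is built inline rather than read from a.csv and b.csv, the product codes are stored as numbers, every code in A has exactly 6 digits, and the helper column name prefix is made up for illustration.

```matlab
% Data from the question, built inline (assumption: numeric product codes)
A = table([202341;202341;202341;202355;202355;438811;438891;438891;671212;671260], ...
          [2;4;20;9;16;0;8;3;15;10], ...
          [1999;2000;2008;1999;2000;1999;1999;2001;2005;2005], ...
          'VariableNames', {'product','tariff','year'});
B = table([2023;2023;2023;4388;4388;6712], ...
          [5.5;10;20;4;3;12.5], ...
          [1999;2000;2008;1999;2001;2005], ...
          'VariableNames', {'product','avg_tariff','year'});

% Derive the 4-digit key; valid only if every code in A has exactly 6 digits
A.prefix = floor(A.product / 100);

% Rename B's first variable (product) so both tables share the key name
B.Properties.VariableNames{1} = 'prefix';

% Inner join on (prefix, year), then drop the helper column and sort
C = innerjoin(A, B, 'Keys', {'prefix','year'});
C.prefix = [];
C = sortrows(C, {'product','year'});
```

Working with a numeric prefix avoids the string truncation step, at the cost of assuming a fixed code length.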

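Alternatively, load both files with textscan and treat every column, including product, as a string:

fidA = fopen('A.txt','rt');
fidB = fopen('B.txt','rt');
% skip the header row and collapse the runs of spaces between columns
A = textscan(fidA,'%s%s%s','Delimiter',' ','MultipleDelimsAsOne',true,'HeaderLines',1);
B = textscan(fidB,'%s%s%s','Delimiter',' ','MultipleDelimsAsOne',true,'HeaderLines',1);
fclose(fidA);
fclose(fidB);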

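Build a composite key for each row, keeping only the first 4 characters of product in A:

rowKeyA = cell(size(A{1}));
for i = 1:numel(A{1})
   rowKeyA{i} = [A{1}{i}(1:4),A{3}{i}]; % product(1:4),year
end
rowKeyB = cell(size(B{1}));
for i = 1:numel(B{1})
   rowKeyB{i} = [B{1}{i},B{3}{i}]; % product,year
end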

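Now find the matches between rowKeyA and rowKeyB and print the joined rows:

for i = 1:length(rowKeyA)
    j = find(strcmp(rowKeyB,rowKeyA{i}),1);
    if ~isempty(j)
       % product, tariff and year from A, then avg_tariff from the matching row of B
       fprintf('%s %s %s %s\n',A{1}{i},A{2}{i},A{3}{i},B{2}{j});
    end
end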
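
The lookup loop can likely be replaced by a single ismember call on the composite keys. The sketch below is one way to do that, not the method used above. It uses a small subset of the question's data typed inline as cell arrays of strings, plus one hypothetical row ('999999') that is not in the question and is there only to show the no-match case. The '_' separator and all variable names are illustrative choices.

```matlab
% Columns as cell arrays of char, as textscan with '%s' would return them.
% The last row of A is hypothetical (not in the question's data).
prodA = {'202341'; '202341'; '202355'; '438811'; '999999'};
yearA = {'1999';   '2000';   '1999';   '1999';   '1999'};
prodB = {'2023'; '2023'; '4388'};
yearB = {'1999'; '2000'; '1999'};
avgB  = {'5.5';  '10';   '4'};

% Composite keys: 4-character product prefix, a separator, then the year
prefixA = cellfun(@(s) s(1:4), prodA, 'UniformOutput', false);
keyA = strcat(prefixA, '_', yearA);
keyB = strcat(prodB, '_', yearB);

% idx(i) is the row of B that matches row i of A, or 0 if there is none
[found, idx] = ismember(keyA, keyB);

% Copy avg_tariff across for matched rows; unmatched rows stay empty
avgA = cell(size(prodA));
avgA(found) = avgB(idx(found));
```

The separator keeps the product and year parts of a key from running together, which matters if the prefixes could ever have different lengths.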
