Record linking data frames with function RLBigDataLinkage in package RecordLinkage in R
I would like to link two data frames that use different numeric keys, but similar strings. Specifically, one data frame uses a numeric key GVKEY
and a company name CONML
> head(temp.compustat[order(temp.compustat$CONML, decreasing = T), ])
GVKEY CONML
225994 13023 ZZZZ Best Co Inc
211017 11696 Zytrex Corp
213816 11951 Zytec Systems Inc
309886 29163 Zytec Corp
373950 129441 Zynex Inc
383184 145228 ZymoGenetics Inc
> dim(temp.compustat)
[1] 31354 2
The other data frame uses a different numeric key companyid
and a company name company
that may be slightly different than CONML
in the first database.
> head(temp.dealscan[ order(temp.dealscan$company, decreasing = T), ])
companyid company
70473 18192 Zytec Corp
32025 16969 Zynaxis Inc
19714 92271 ZYGO Teraoptix Inc
80473 13185 Zygo Corp
1901 24303 Zycon Corp SDN Bhd
33993 21219 Zycon Corp
> dim(temp.dealscan)
[1] 85818 2
(I am sorting in reverse, because the DealScan databases has blanks and * at the beginning of some entries). It seems that the function RLBigDataLinkage
in package RecordLinkage
is the solution, but I can't get even unsupervised linking to work. Here's my error.
> library(RecordLinkage)
> rpairs <- RLBigDataLinkage(dataset1 = temp.compustat, dataset2 = temp.dealscan, exclude = 1, strcmp = 2, strcmpfun = "levenshtein")
> result <- epiClassify(rpairs, threshold.upper = 0.5)
Error in if (max <= min) stop("must have max > min") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nData1 * nData2 : NAs produced by integer overflow
Coincident with me getting started yesterday, I saw this question about using the Levenshtein distance function manually. Is this approach a better option for me? These data frames are fairly large, so it seems that RLBigData
should be the correct approach (the package authors recommend it for > 1000 entries). Than开发者_开发问答ks!
This is due to a bug in the package code which is fixed now. The fixed version will be available from the project site on R-Forge (https://r-forge.r-project.org/projects/recordlinkage) probably until tomorrow morning. Note that it depends on version 1.5.4 of the data.table package, which is currently not available on CRAN, but on R-Forge(https://r-forge.r-project.org/projects/datatable/).
精彩评论