Transforming character strings in R

I have to merge two data frames in R. The two data frames share a common id variable, the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lowercase. Furthermore, the name parts appear in reverse order. Here is a sample from the data frames:

DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu" 

DataFrame2$Name:
"Kathleen VAN BREMPT" 
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"

Is there a way in R to make these two variables usable as an identifier for merging the data frames?

Best, Thomas


You can use gsub to swap the order of the name parts:

> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER" 
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul" 
> 

This works by matching first anything up to the first space and then anything after that, and switching them around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
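As a minimal sketch of the steps just described (swap, lowercase, then match()), using ASCII-only sample names. The vectors first_last and last_first are made up for illustration and stand in for DataFrame2$Name and DataFrame1$Name:

```r
# Names in "First LAST" order, as in DataFrame2 (illustrative sample values)
first_last <- c("Kathleen VAN BREMPT", "Iuliu WINKLER")
# Names in "Last First" order, as in DataFrame1 (deliberately in another row order)
last_first <- c("Winkler Iuliu", "Van Brempt Kathleen")

# Move everything before the first space to the end of the string
swapped <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", first_last, perl = TRUE)

# Compare case-insensitively; match() returns, for each swapped name,
# the position of the identical name in last_first (NA if there is none)
idx <- match(tolower(swapped), tolower(last_first))
idx
```

Here idx tells you which row of the "Last First" data corresponds to each row of the "First LAST" data, so it can serve as a join key.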

Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
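To make those two pitfalls concrete, here is a small sketch. The name "Anna Maria EXAMPLE" is hypothetical, and replacing ß with "ss" is only one possible normalisation, shown here as an assumption rather than a general fix:

```r
# Pitfall 1: two first names. The regex splits at the FIRST space only,
# so the second first name ends up in front of the surname.
swapped <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", "Anna Maria EXAMPLE", perl = TRUE)
swapped  # not the "EXAMPLE Anna Maria" one would want

# Pitfall 2: "ß" versus "ss". The strings differ even when both are lowercase.
a <- "gr\u00e4\u00dfle"   # "gräßle", written with Unicode escapes
b <- "gr\u00e4ssle"       # "grässle"
same_raw <- a == b

# One possible normalisation: replace every "ß" with "ss" before comparing
same_norm <- gsub("\u00df", "ss", a, fixed = TRUE) == b
```

An exact key only works after this kind of clean-up; otherwise approximate matching is the more forgiving route.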

Barry


Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):

DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)","\\2 \\1",DataFrame2$Name,perl=TRUE)
DataFrame2$agnum <-sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name) )
DataFrame1$num <-1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x="num", by.y="agnum")

Output:

  num              Name.x              Name.y             revname
1   1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2   2     Gräßle Ingeborg    Ingeborg GRÄSSLE    GRÄSSLE Ingeborg
3   3    Gauzès Jean-Paul    Jean-Paul GAUZÈS    GAUZÈS Jean-Paul
4   4       Winkler Iuliu       Iuliu WINKLER       WINKLER Iuliu

The third step would not be necessary if DataFrame1 had row names that were still sequentially numbered (as they are by default). The merge statement would then be:

merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")
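One caveat worth guarding against: agrep can return zero or several indices for a single name, in which case sapply returns a list instead of a vector and the merge key breaks. A hedged sketch of a guard follows; match_one is a hypothetical helper, not part of base R, and the sample names are made up:

```r
# Hypothetical helper: index of the single approximate match, or NA
# when agrep finds no candidate or more than one
match_one <- function(pattern, candidates) {
  hits <- agrep(pattern, candidates, ignore.case = TRUE)
  if (length(hits) == 1L) hits else NA_integer_
}

candidates <- c("Van Brempt Kathleen", "Winkler Iuliu", "Winkler Iulia")
queries    <- c("VAN BREMPT Kathleen", "WINKLER Iuliu", "NOBODY Known")

# vapply guarantees an integer vector with exactly one entry per query
idx <- vapply(queries, match_one, integer(1),
              candidates = candidates, USE.NAMES = FALSE)
idx
```

Rows that come back as NA are ambiguous or unmatched and are probably best checked by hand before merging.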

-- David.


You could add a column to each data frame that holds a lowercase version of the original name:

DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)

Then merge on that column (this only works once the name parts are in the same order in both data frames):

MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
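For illustration, a self-contained sketch of the lowercase-key merge with the name order swapped first. The data frames df1 and df2 and their Group and Votes columns are invented for the example:

```r
# Toy data standing in for the two data frames (columns are hypothetical)
df1 <- data.frame(Name  = c("Van Brempt Kathleen", "Winkler Iuliu"),
                  Group = c("A", "B"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name  = c("Iuliu WINKLER", "Kathleen VAN BREMPT"),
                  Votes = c(10, 20),
                  stringsAsFactors = FALSE)

# Lowercase key; df2 is first brought into "Last First" order
df1$NameLower <- tolower(df1$Name)
df2$NameLower <- tolower(gsub("([^\\s]*)\\s(.*)", "\\2 \\1",
                              df2$Name, perl = TRUE))

# Merge on the shared key; the clashing Name columns become Name.x and Name.y
merged <- merge(df1, df2, by = "NameLower")
merged
```

Exact keys like this are strict: any spelling difference (accents, ß versus ss) still drops the row from the result.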


In addition to the answer using gsub to rearrange the names, you might also want to look at the agrep function, which looks for approximate matches. You can use it with sapply to find the matching rows from one data frame in the other, e.g.:

> sapply( c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE )
  newyork NEWJersey   Vormont 
       32        30        45 