开发者

Data cleaning in Excel sheets using R

I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engin开发者_如何学编程eering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?


The car package has a recode function. See it's help page for worked examples.

In fact an argument could be made that this should be a closed question:

Why is recode in R not changing the original values?

How to recode a variable to numeric in R?

Recode/relevel data.frame factors with different levels

And a few more questions easily identifiable with a search: [r] recode

EDIT: I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.

my_recode <- function(fac, levslist){ nfac <- factor(fac);
    inlevs <- levels(nfac);
    othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
      # levslist of the form ::::    list(
      #     animal = c("cow", "pig"),
      #     bird = c("eagle", "pigeon") )
 levels(nfac)<- c(levslist, all_others =othrlevs); nfac}

 df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"), 
              stringsAsFactors = FALSE)
 df$type <- my_recode(df$name, list(
     animal = c("cow", "pig"),
     bird = c("eagle", "pigeon") ) )
 df
#-----------
    name       type
1    cow     animal
2    pig     animal
3  eagle       bird
4 pigeon       bird
5  zebra all_others


You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)

If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.

It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.

The videos at the official web site are useful.


There are no one size fits all solutions for these types of problems. From what I understand you have Branch Names that are inconsistently labelled.

You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.

The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using letters() so you can see what I mean

your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))

Now that we have a sorted list of unique values, we can create a listing of recodes:

Let's say we wanted to rename A through G as just A

your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"

And you'd repeat the process above eliminating or group the unique names as appropriate.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜