How to drop columns by name in a data frame
I have a large data set and I would like to read specific columns or drop all the others.
data <- read.dta("file.dta")
I select the columns that I'm not interested in:
var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]
and then I'd like to do something like:
for(i in 1:length(var.out)) {
paste("data$", var.out[i], sep="") <- NULL
}
to drop all the unwanted columns. Is this the optimal solution?
You should use either indexing or the subset
function. For example:
R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
Then you can use the which
function and the -
operator for column indexing:
R> df[ , -which(names(df) %in% c("z","u"))]
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
Or, much simpler, use the select
argument of the subset
function: you can then use the -
operator directly on a vector of column names, and you can even omit the quotes around the names!
R> subset(df, select=-c(z,u))
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
Note that you can also select the columns you want instead of dropping the others:
R> df[ , c("x","y")]
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
R> subset(df, select=c(x,y))
x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
Do not use -which()
for this; it is extremely dangerous. Consider:
dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
dat[ , -which(names(dat) %in% c("z","u"))] ## works as expected
dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...
Instead use subset or the !
operator:
dat[ , !names(dat) %in% c("z","u")] ## works as expected
dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want
I have learned this from painful experience. Do not overuse which()!
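To make the failure mode explicit (a minimal illustration, my addition): which() returns integer(0) when nothing matches, and negating an empty index still selects zero columns.
dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
idx <- which(names(dat) %in% c("foo", "bar"))
idx            # integer(0): no names matched
dat[ , -idx]   # same as dat[ , integer(0)]: a data frame with 5 rows and 0 columns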
First, you can use direct indexing (with boolean vectors) instead of re-accessing column names if you are working with the same data frame; it is safer, as pointed out by Ista, and quicker to write and to execute. All you need is:
var.out.bool <- !names(data) %in% c("iden", "name", "x_serv", "m_serv")
and then, simply reassign data:
data <- data[,var.out.bool] # or...
data <- data[,var.out.bool, drop = FALSE] # You will need this option to avoid the conversion to an atomic vector if there is only one column left
Second, and quicker to write, you can assign NULL directly to the columns you want to remove:
data[c("iden", "name", "x_serv", "m_serv")] <- list(NULL) # You need list() to respect the target structure.
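(Side note, my addition: for a single column, a plain NULL assignment is enough; the list() wrapper matters when several columns are dropped in one call.)
data$iden <- NULL        # drop one column
data[["name"]] <- NULL   # equivalent form, using the [[ accessor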
Finally, you can use subset(), but it is not really meant for use inside programs (even its help page warns about it). Specifically, the drawback for me is that to use the drop feature of subset() directly, you have to write the column names unquoted:
subset( data, select = -c("iden", "name", "x_serv", "m_serv") ) # WILL NOT WORK
subset( data, select = -c(iden, name, x_serv, m_serv) ) # WILL
As a bonus, here is a small benchmark of the different options; it clearly shows that subset() is the slowest, and that the first, reassignment method is the fastest:
re_assign(dtest, drop_vec) 46.719 52.5655 54.6460 59.0400 1347.331
null_assign(dtest, drop_vec) 74.593 83.0585 86.2025 94.0035 1476.150
subset(dtest, select = !names(dtest) %in% drop_vec) 106.280 115.4810 120.3435 131.4665 65133.780
subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]) 108.611 119.4830 124.0865 135.4270 1599.577
subset(dtest, select = -c(x, y)) 102.026 111.2680 115.7035 126.2320 1484.174
The code is below:
library(microbenchmark)

dtest <- data.frame(x=1:5, y=2:6, z = 3:7)
drop_vec <- c("x", "y")
null_assign <- function(df, names) {
df[names] <- list(NULL)
df
}
re_assign <- function(df, drop) {
df <- df [, ! names(df) %in% drop, drop = FALSE]
df
}
res <- microbenchmark(
re_assign(dtest,drop_vec),
null_assign(dtest,drop_vec),
subset(dtest, select = ! names(dtest) %in% drop_vec),
subset(dtest, select = names(dtest)[! names(dtest) %in% drop_vec]),
subset(dtest, select = -c(x, y) ),
times=5000)
plt <- ggplot2::qplot(y=time, data=res[res$time < 1000000,], colour=expr)
plt <- plt + ggplot2::scale_y_log10() +
ggplot2::labs(colour = "expression") +
ggplot2::scale_color_discrete(labels = c("re_assign", "null_assign", "subset_bool", "subset_names", "subset_drop")) +
ggplot2::theme_bw(base_size=16)
print(plt)
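A small usage note (my addition, not part of the original answer): printing the microbenchmark result produces the summary table quoted above (its header row was lost in formatting here), and the package also provides an autoplot() method as a ready-made alternative to building the plot by hand.
print(res)              # summary of the timings per expression
ggplot2::autoplot(res)  # plot of the raw timings, method provided by microbenchmark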
You can also try the dplyr
package:
R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
R> library(dplyr)
R> dplyr::select(df, -c(x, y)) # remove columns x and y
z u
1 3 4
2 4 5
3 5 6
4 6 7
5 7 8
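If the names to drop live in a character vector, newer dplyr versions (1.0 or later) recommend the all_of() selection helper; a small sketch against the same df:
library(dplyr)

drop_cols <- c("x", "y")
select(df, -all_of(drop_cols))   # same result as above: keeps z and u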
Here's a quick solution. Say you have a data frame X with three columns A, B, and C:
> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6))
> X
A B C
1 1 3 5
2 2 4 6
If you want to remove a column, say B, just use grep() on colnames(X) to get the column index, which you can then use to omit the column.
> X<-X[,-grep("B",colnames(X))]
Your new X data frame would look like the following (this time without the B column):
> X
A C
1 1 5
2 2 6
The beauty of grep() is that you can match multiple columns with one regular expression. If X had five columns (A, B, C, D, E):
> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
A B C D E
1 1 3 5 7 9
2 2 4 6 8 10
Take out columns B and D:
> X<-X[,-grep("B|D",colnames(X))]
> X
A C E
1 1 5 9
2 2 6 10
EDIT: Considering the grepl suggestion of Matthew Lundberg in the comments below:
> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
A B C D E
1 1 3 5 7 9
2 2 4 6 8 10
> X<-X[,!grepl("B|D",colnames(X))]
> X
A C E
1 1 5 9
2 2 6 10
If I try to drop a column that doesn't exist, nothing should happen:
> X<-X[,!grepl("G",colnames(X))]
> X
A C E
1 1 5 9
2 2 6 10
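One caveat worth adding (my note, not part of the original answer): grep() and grepl() do substring matching, so a bare pattern like "B" also hits names such as "B2" or "AB". Anchoring the pattern avoids that:
X <- data.frame(B = 1:2, B2 = 3:4, AB = 5:6)
X[ , !grepl("B", colnames(X))]         # matches all three names, so every column is dropped
X[ , !grepl("^(B|D)$", colnames(X))]   # anchored pattern drops only the column named exactly "B"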
I tried to delete a column while using the data.table
package and got an unexpected result. I think the following might be worth posting, just as a little cautionary note.
[ Edited by Matthew ... ]
DF = read.table(text = "
fruit state grade y1980 y1990 y2000
apples Ohio aa 500 100 55
apples Ohio bb 0 0 44
apples Ohio cc 700 0 33
apples Ohio dd 300 50 66
", sep = "", header = TRUE, stringsAsFactors = FALSE)
DF[ , !names(DF) %in% c("grade")] # all columns other than 'grade'
fruit state y1980 y1990 y2000
1 apples Ohio 500 100 55
2 apples Ohio 0 0 44
3 apples Ohio 700 0 33
4 apples Ohio 300 50 66
library('data.table')
DT = as.data.table(DF)
DT[ , !names(DT) %in% c("grade")] # not expected !! not the same as DF !!
[1] TRUE TRUE FALSE TRUE TRUE TRUE
DT[ , !names(DT) %in% c("grade"), with=FALSE] # that's better
fruit state y1980 y1990 y2000
1: apples Ohio 500 100 55
2: apples Ohio 0 0 44
3: apples Ohio 700 0 33
4: apples Ohio 300 50 66
Basically, the syntax for data.table
is NOT exactly the same as data.frame
. There are in fact lots of differences, see FAQ 1.1 and FAQ 2.17. You have been warned!
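For completeness, data.table's own idiom for dropping columns is assignment of NULL by reference with := (a brief sketch using the table from above):
library(data.table)
DT <- as.data.table(DF)

DT[ , grade := NULL]                 # delete one column by reference (no copy of the table is made)
DT[ , c("y1980", "y1990") := NULL]   # delete several columns given as a character vector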
A one-liner that drops columns c1 and c2 by name:
df2 <- df[!names(df) %in% c("c1", "c2")]
I changed the code to:
# read data
dat <- read.dta("file.dta")
# vars to delete
var.in <- c("iden", "name", "x_serv", "m_serv")
# what I'm keeping
var.out <- setdiff(names(dat), var.in)
# keep only the ones I want
dat <- dat[var.out]
Anyway, juba's answer is the best solution to my problem!
If you know exactly the names of the columns to drop from the original data frame called "df":
cols_to_drop <- c("A", "B", "C")
df_clean <- df[ , !(names(df) %in% cols_to_drop)]
Source: https://www.listendata.com/2015/06/r-keep-drop-columns-from-data-frame.html
Here is another solution that may be helpful to others. The code below selects a small number of rows and columns from a large data set. The columns are selected as in one of juba's answers, except that I use paste() to build a set of sequentially numbered column names:
df = read.table(text = "
state county city region mmatrix X1 X2 X3 A1 A2 A3 B1 B2 B3 C1 C2 C3
1 1 1 1 111010 1 0 0 2 20 200 4 8 12 NA NA NA
1 2 1 1 111010 1 0 0 4 NA 400 5 9 NA NA NA NA
1 1 2 1 111010 1 0 0 6 60 NA NA 10 14 NA NA NA
1 2 2 1 111010 1 0 0 NA 80 800 7 11 15 NA NA NA
1 1 3 2 111010 0 1 0 1 2 1 2 2 2 10 20 30
1 2 3 2 111010 0 1 0 2 NA 1 2 2 NA 40 50 NA
1 1 4 2 111010 0 1 0 1 1 NA NA 2 2 70 80 90
1 2 4 2 111010 0 1 0 NA 2 1 2 2 10 100 110 120
1 1 1 3 010010 0 0 1 10 20 10 200 200 200 1 2 3
1 2 1 3 001000 0 0 1 20 NA 10 200 200 200 4 5 9
1 1 2 3 101000 0 0 1 10 10 NA 200 200 200 7 8 NA
1 2 2 3 011010 0 0 1 NA 20 10 200 200 200 10 11 12
", sep = "", header = TRUE, stringsAsFactors = FALSE)
df
df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep=''))]
df2
# C1 C2 C3
# 5 10 20 30
# 6 40 50 NA
# 7 70 80 90
# 8 100 110 120
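A minor side note (my addition): seq_along(1:3) is just 1:3, so the same column names can be built a little more directly with paste0():
cols <- paste0("C", 1:3)                         # "C1" "C2" "C3"
df2  <- df[df$region == 2, names(df) %in% cols]  # same result as above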
I can't answer your question in the comments due to a low reputation score.
The following code will give you an error, because paste() returns a character string:
for(i in 1:length(var.out)) {
paste("data$", var.out[i], sep="") <- NULL
}
Here is a possible solution:
for (i in 1:length(var.out)) {
  # write the line of code you want as a character string
  text_to_source <- paste0("data$", var.out[i], " <- NULL")
  # evaluate the text as if it were typed code
  eval(parse(text = text_to_source))
}
or just do:
for (i in 1:length(var.out)) {
  data[var.out[i]] <- NULL
}
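The loop can also be skipped entirely by using the list(NULL) assignment shown in an earlier answer:
data[var.out] <- list(NULL)   # drops every column named in var.out in one step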
df <- mtcars
# remove vs and am because they are categorical;
# in the dataset, vs is column 8 and am is column 9
dfnum <- df[ , -c(8, 9)]
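Dropping by name rather than by position is a little more robust if the column order ever changes (same idea as the earlier answers):
dfnum <- df[ , !names(df) %in% c("vs", "am")]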