How can I trim leading and trailing white space?
I am having some trouble with leading and trailing white space in a data.frame.
For example, I look at a specific row
in a data.frame
based on a certain condition:
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummy开发者_Go百科LI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame
. After looking through my code history and trying to figure out what went wrong I tried:
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
All I have changed in the command is an additional white space after Austria.
Further annoying problems obviously arise. For example, when I like to merge two frames based on the country column. One data.frame
uses "Austria "
while the other frame has "Austria"
. The matching doesn't work.
- Is there a nice way to 'show' the white space on my screen so that I am aware of the problem?
- And can I remove the leading and trailing white space in R?
So far I used to write a simple Perl script which removes the whites pace, but it would be nice if I can somehow do it inside R.
As of R 3.2.0 a new function was introduced for removing leading/trailing white spaces:
trimws()
See: Remove Leading/Trailing Whitespace
Probably the best way is to handle the trailing white spaces when you read your data file. If you use read.csv
or read.table
you can set the parameterstrip.white=TRUE
.
If you want to clean strings afterwards you could use one of these functions:
# Returns string without leading white space
trim.leading <- function (x) sub("^\\s+", "", x)
# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)
# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country
:
myDummy$country <- trim(myDummy$country)
To 'show' the white space you could use:
paste(myDummy$country)
which will show you the strings surrounded by quotation marks (") making white spaces easier to spot.
To manipulate the white space, use str_trim() in the stringr package. The package has manual dated Feb 15, 2013 and is in CRAN. The function can also handle string vectors.
install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)
(Credit goes to commenter: R. Cotton)
A simple function to remove leading and trailing whitespace:
trim <- function( x ) {
gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}
Usage:
> text = " foo bar baz 3 "
> trim(text)
[1] "foo bar baz 3"
Ad 1) To see white spaces you could directly call print.data.frame
with modified arguments:
print(head(iris), quote=TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 "5.1" "3.5" "1.4" "0.2" "setosa"
# 2 "4.9" "3.0" "1.4" "0.2" "setosa"
# 3 "4.7" "3.2" "1.3" "0.2" "setosa"
# 4 "4.6" "3.1" "1.5" "0.2" "setosa"
# 5 "5.0" "3.6" "1.4" "0.2" "setosa"
# 6 "5.4" "3.9" "1.7" "0.4" "setosa"
See also ?print.data.frame
for other options.
Use grep or grepl to find observations with white spaces and sub to get rid of them.
names<-c("Ganga Din\t", "Shyam Lal", "Bulbul ")
grep("[[:space:]]+$", names)
[1] 1 3
grepl("[[:space:]]+$", names)
[1] TRUE FALSE TRUE
sub("[[:space:]]+$", "", names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"
Removing leading and trailing blanks might be achieved through the trim() function from the gdata package as well:
require(gdata)
example(trim)
Usage example:
> trim(" Remove leading and trailing blanks ")
[1] "Remove leading and trailing blanks"
I'd prefer to add the answer as comment to user56's, but I am yet unable so writing as an independent answer.
Another option is to use the stri_trim
function from the stringi
package which defaults to removing leading and trailing whitespace:
> x <- c(" leading space","trailing space ")
> stri_trim(x)
[1] "leading space" "trailing space"
For only removing leading whitespace, use stri_trim_left
. For only removing trailing whitespace, use stri_trim_right
. When you want to remove other leading or trailing characters, you have to specify that with pattern =
.
See also ?stri_trim
for more info.
Another related problem occurs if you have multiple spaces in between inputs:
> a <- " a string with lots of starting, inter mediate and trailing whitespace "
You can then easily split this string into "real" tokens using a regular expression to the split
argument:
> strsplit(a, split=" +")
[[1]]
[1] "" "a" "string" "with" "lots"
[6] "of" "starting," "inter" "mediate" "and"
[11] "trailing" "whitespace"
Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
I created a trim.strings ()
function to trim leading and/or trailing whitespace as:
# Arguments: x - character vector
# side - side(s) on which to remove whitespace
# default : "both"
# possible values: c("both", "leading", "trailing")
trim.strings <- function(x, side = "both") {
if (is.na(match(side, c("both", "leading", "trailing")))) {
side <- "both"
}
if (side == "leading") {
sub("^\\s+", "", x)
} else {
if (side == "trailing") {
sub("\\s+$", "", x)
} else gsub("^\\s+|\\s+$", "", x)
}
}
For illustration,
a <- c(" ABC123 456 ", " ABC123DEF ")
# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF"
# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456 " "ABC123DEF "
# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] " ABC123 456" " ABC123DEF"
Use dplyr/tidyverse mutate_all
with str_trim
to trim the entire data frame:
myDummy %>%
mutate_all(str_trim)
library(tidyverse)
set.seed(335)
df <- mtcars %>%
rownames_to_column("car") %>%
mutate(car = ifelse(runif(nrow(mtcars)) > 0.4, car, paste0(car, " "))) %>%
select(car, mpg)
print(head(df), quote = T)
#> car mpg
#> 1 "Mazda RX4 " "21.0"
#> 2 "Mazda RX4 Wag" "21.0"
#> 3 "Datsun 710 " "22.8"
#> 4 "Hornet 4 Drive " "21.4"
#> 5 "Hornet Sportabout " "18.7"
#> 6 "Valiant " "18.1"
df_trim <- df %>%
mutate_all(str_trim)
print(head(df_trim), quote = T)
#> car mpg
#> 1 "Mazda RX4" "21"
#> 2 "Mazda RX4 Wag" "21"
#> 3 "Datsun 710" "22.8"
#> 4 "Hornet 4 Drive" "21.4"
#> 5 "Hornet Sportabout" "18.7"
#> 6 "Valiant" "18.1"
Created on 2021-05-07 by the reprex package (v0.3.0)
The best method is trimws().
The following code will apply this function to the entire dataframe.
mydataframe<- data.frame(lapply(mydataframe, trimws),stringsAsFactors = FALSE)
Benchmarking of the main approaches in this thread. This is not capturing all weird cases, but so far we are still lacking the example where str_trim
removes whitespace and trimws
doesn't (see Richard Telford's comment to this answer). Doesn't seem to matter - the gsub option seems to be fastest :)
x <- c(" lead", "trail ", " both ", " both and middle ", " _special")
## gsub function from https://stackoverflow.com/a/2261149/7941188
## this is NOT the function from user Bernhard Kausler, which uses
## a much less concise regex
gsub_trim <- function (x) gsub("^\\s+|\\s+$", "", x)
res <- microbenchmark::microbenchmark(
gsub = gsub_trim(x),
## https://stackoverflow.com/a/30210713/7941188
trimws = trimws(x),
## https://stackoverflow.com/a/15007398/7941188
str_trim = stringr::str_trim(x),
times = 10^5
)
res
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> gsub 20.201 22.788 31.43943 24.654 28.4115 5303.741 1e+05 a
#> trimws 38.204 41.980 61.92218 44.420 51.1810 40363.860 1e+05 b
#> str_trim 88.672 92.347 116.59186 94.542 105.2800 13618.673 1e+05 c
ggplot2::autoplot(res)
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> stringr_1.4.0
I tried trim(). It works well with white spaces as well as the '\n'.
x = '\n Harden, J.\n '
trim(x)
myDummy[myDummy$country == "Austria "] <- "Austria"
After this, you'll need to force R not to recognize "Austria "
as a level. Let's pretend you also have "USA"
and "Spain"
as levels:
myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))
It is a little less intimidating than the highest voted response, but it should still work.
精彩评论