Using regular expressions in R to categorize data
I have a file with two columns: one holds the content type of HTTP objects (text/html, application/rar, etc.) and the other holds the size in bytes.
Content Type Size
video/x-flv 100
image/jpeg 150
text/html 160
application/octet-stream 200
application/x-shockwave-flash ...
text/html; charset=utf-8
application/x-javascript; charset=utf-8 ...
As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8, and so on. I would like to create another column that categorizes them more generically, so that these two would both just be web/javascript, and so on.
Content Type Size Category
video/x-flv 100 web/video
image/jpeg 150 web/image
text/html 160 web/html
application/octet-stream 200 web/binary
application/x-shockwave-flash ... web/flash
text/plain web/plaintext
application/x-javascript web/javascript
video/x-msvideo web/video
text/xml web/xml
text/css web/css
text/html; charset=utf-8 web/html
video/quicktime web/video
application/x-javascript; charset=utf-8 web/javascript
How would I accomplish this in R? I presume I need to use regular expressions of some sort for this.
There are several ways you can simplify your variable. Here I will use the stringr package for its string manipulation functions:
R> library(stringr)
First, copy your content type variable into a new character variable:
R> d <- data.frame(type=c("video/x-flv", "image/jpeg","video/x-msvideo", "application/x-javascript; charset=utf-8", "application/x-javascript"))
R> d$type2 <- as.character(d$type)
This just gives you:
type type2
1 video/x-flv video/x-flv
2 image/jpeg image/jpeg
3 video/x-msvideo video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
Then you can work on your new variable. You can manually replace one specific type value with another:
R> d$type2[d$type2 == "video/x-flv"] <- "video"
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
You can use regexp matching to replace all the values that match a pattern, for example "video":
R> d$type2[str_detect(d$type2, ".*video.*")] <- "video"
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5 application/x-javascript application/x-javascript
Or you can use regexp replacement to clean certain values, for example by removing the ";" and everything after it in your content types:
R> d$type2 <- str_replace(d$type2, ";.*$", "")
R> d
type type2
1 video/x-flv video
2 image/jpeg image/jpeg
3 video/x-msvideo video
4 application/x-javascript; charset=utf-8 application/x-javascript
5 application/x-javascript application/x-javascript
Be careful about the order of these steps, though, as the result depends heavily on it.
If you had to do it by hand, you could assign your factor levels to the corresponding categories. In this example, I group the first 13 letters of the alphabet as "1" and the second half of the letters as "2".
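To make the point about ordering concrete, here is a minimal sketch. It uses base R's sub() rather than stringr so it runs without extra packages, and the vector name types and the target label "web/javascript" are only illustrative. It applies the same two steps in both orders:

```r
# Illustrative input, taken from the content types in the question
types <- c("video/x-flv",
           "application/x-javascript; charset=utf-8",
           "application/x-javascript")

# Order A: exact replacement first, then strip the "; charset=..." suffix
a <- types
a[a == "application/x-javascript"] <- "web/javascript"
a <- sub(";.*$", "", a)

# Order B: strip the suffix first, then do the exact replacement
b <- sub(";.*$", "", types)
b[b == "application/x-javascript"] <- "web/javascript"

print(a)  # the charset variant was not recoded in order A
print(b)  # both javascript rows are recoded in order B
```

In order A the row carrying the charset suffix is missed by the exact comparison and ends up as plain "application/x-javascript", whereas order B recodes both rows. Cleaning first and categorizing afterwards is usually the safer sequence.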
> x <- as.factor(sample(letters, 100, replace = TRUE))
> x
[1] d n p n k l a x c n v p l o u e z m y x t r q b l n y s s m d u l l a d k
[38] t a p x s g w i p l b s o t b s h h v c b j o p h f j m v d r m x o d l e
[75] l f y l u e w f e e o s w s m v a z q l a t f z x s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> levels(x)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> levels(x) <- c(rep(1, 13), rep(2, 13))
> x
[1] 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1
[38] 2 1 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1
[75] 1 1 2 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2
Levels: 1 2
> levels(x)
[1] "1" "2"
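If your variable is a factor and levels(obj) returns exactly these levels, in this order:
"video/x-flv" "image/jpeg" "video/x-msvideo" "application/x-javascript; charset=utf-8"
... you would recode the levels like so. The replacement vector is matched by position, so check levels(obj) first: by default factor() sorts the levels alphabetically, which may not be the order in which the values appear in your data.
levels(obj) <- c("web/video", "web/image", "web/video", "web/javascript")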
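Because positional recoding depends on the order of the levels, a variant that may be less error-prone is to give levels<- a named list, where each name is the new category and each element lists the old levels it absorbs. This is a sketch on a made-up factor obj built from the four values above, not the asker's actual data:

```r
# Hypothetical factor built from the four content types above
obj <- factor(c("video/x-flv", "image/jpeg", "video/x-msvideo",
                "application/x-javascript; charset=utf-8"))

# Default levels are sorted, so they differ from the order of the data
print(levels(obj))

# Recode by name instead of by position: new level = old level(s)
levels(obj) <- list(
  "web/video"      = c("video/x-flv", "video/x-msvideo"),
  "web/image"      = "image/jpeg",
  "web/javascript" = "application/x-javascript; charset=utf-8"
)

print(as.character(obj))
```

Any old level that is not mentioned in the list becomes NA, so this form also makes unhandled content types easy to spot.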
Assume that DF is our data frame. Define a regular expression, re, to match the strings of interest and then use strapply in the gsubfn package to extract them, prefixing "web/" to each. In the strapply statement we convert DF[[1]] to character just in case it's a factor rather than a character vector. NULL entries were not matched, so let's assume those are "web/binary". Finally, expand any occurrences of "plain" to "plaintext":
> library(gsubfn)
> re <- "(video|image|html|flash|plain|javascript|xml|css).*"
> short <- strapply(as.character(DF[[1]]), re, ~ paste("web", x, sep = "/"))
> DF$short <- sapply(short, function(x) if (is.null(x)) "web/binary" else x)
> DF$short <- sub("plain", "plaintext", DF$short)
> DF
Content short
1 video/x-flv web/video
2 image/jpeg web/image
3 text/html web/html
4 application/octet-stream web/binary
5 application/x-shockwave-flash web/flash
6 text/plain web/plaintext
7 application/x-javascript web/javascript
8 video/x-msvideo web/video
9 text/xml web/xml
10 text/css web/css
11 text/html; charset=utf-8 web/html
12 video/quicktime web/video
13 application/x-javascript; charset=utf-8 web/javascript
There is more info in the gsubfn package documentation.
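If installing gsubfn is not an option, roughly the same idea can be sketched in base R with regexpr() and regmatches(). This is an alternative technique, not the strapply approach above; the vector content is a small illustrative subset of the question's data, and the fallback to "web/binary" for unmatched rows is the same assumption as before:

```r
# Illustrative subset of the content types from the question
content <- c("video/x-flv", "text/html; charset=utf-8",
             "application/octet-stream", "text/plain",
             "application/x-javascript")

re <- "(video|image|html|flash|plain|javascript|xml|css)"

# regexpr() returns -1 where nothing matched; regmatches() only returns
# the matched pieces, so fill them into a vector of NAs by position
pos <- regexpr(re, content)
key <- rep(NA_character_, length(content))
key[pos > 0] <- regmatches(content, pos)

# Unmatched entries are assumed to be binary, as in the answer above
short <- ifelse(is.na(key), "web/binary", paste("web", key, sep = "/"))
short <- sub("plain", "plaintext", short)

print(data.frame(content, short))
```

The keyword list in re is the part to extend when new content types show up in the data.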