Using regular expression in R to categorize data

2023-02-12 21:05 Q&A author:

I have a file with two columns, one has the content type of HTTP objects like text/html, application/rar etc and the other has the bytes size.

Content Type                                     Size
video/x-flv                                       100
image/jpeg                                        150
text/html                                         160
application/octet-stream                          200  
application/x-shockwave-flash                     ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8           ...

As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8. I would like to create another column that categorizes them more generically, so that these two would both become web/javascript, and so on.

 Content Type                                      Size      Category
    video/x-flv                                       100       web/video
    image/jpeg                                        150       web/image
    text/html                                         160       web/html
    application/octet-stream                          200       web/binary
    application/x-shockwave-flash                     ...       web/flash
    text/plain                                                  web/plaintext
    application/x-javascript                                    web/javascript
    video/x-msvideo                                             web/video
    text/xml                                                    web/xml
    text/css                                                    web/css
    text/html; charset=utf-8                                    web/html
    video/quicktime                                             web/video
    application/x-javascript; charset=utf-8                     web/javascript

How would I accomplish this in R? I presume I need to use regular expressions of some sort.

There are several ways you can simplify your variable. Here I will use the string manipulation functions from the stringr package:

R> library(stringr)

First, copy your content type variable into a new character variable:

R> d <- data.frame(type=c("video/x-flv", "image/jpeg","video/x-msvideo", "application/x-javascript; charset=utf-8", "application/x-javascript"))
R> d$type2 <- as.character(d$type)

This just gives you:

                                     type                                   type2
1                             video/x-flv                             video/x-flv
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

Then you can work on your new variable. You can manually replace a specific type value with another:

R> d$type2[d$type2 == "video/x-flv"] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                         video/x-msvideo
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

You can use regexp matching to replace every value that matches a pattern, for example "video":

R> d$type2[str_detect(d$type2, ".*video.*")] <- "video"
R> d
                                     type                                   type2
1                             video/x-flv                                   video
2                              image/jpeg                              image/jpeg
3                         video/x-msvideo                                   video
4 application/x-javascript; charset=utf-8 application/x-javascript; charset=utf-8
5                application/x-javascript                application/x-javascript

Or you can use regexp replacement to clean certain values, for example by removing everything from the ";" onward in your content types:

R> d$type2 <- str_replace(d$type2, ";.*$", "")
R> d
                                     type                    type2
1                             video/x-flv                    video
2                              image/jpeg               image/jpeg
3                         video/x-msvideo                    video
4 application/x-javascript; charset=utf-8 application/x-javascript
5                application/x-javascript application/x-javascript

Be careful about the order of these steps, though, since the result depends heavily on it.
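To illustrate why the order matters, here is a minimal sketch that applies the same two steps in both orders. It uses base R's sub() in place of str_replace() so that it runs without extra packages; the vector names a and b and the replacement value "javascript" are only illustrative.

```r
types <- c("video/x-flv", "image/jpeg", "video/x-msvideo",
           "application/x-javascript; charset=utf-8",
           "application/x-javascript")

# Order A: strip the ";..." parameters first, then test for an exact value.
a <- sub(";.*$", "", types)
a[a == "application/x-javascript"] <- "javascript"

# Order B: the exact test runs first, so the "; charset=utf-8" row is
# not matched and keeps its long name after the parameters are stripped.
b <- types
b[b == "application/x-javascript"] <- "javascript"
b <- sub(";.*$", "", b)

a
b
```

With order A both javascript rows end up as "javascript"; with order B the fourth row is left as "application/x-javascript".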

If you had to do it by hand, you could assign your factor levels to the corresponding categories. In this example, I group the first 13 letters of the alphabet as "1" and the remaining 13 letters as "2".

> x <- as.factor(sample(letters, 100, replace = TRUE))
> x
  [1] d n p n k l a x c n v p l o u e z m y x t r q b l n y s s m d u l l a d k
 [38] t a p x s g w i p l b s o t b s h h v c b j o p h f j m v d r m x o d l e
 [75] l f y l u e w f e e o s w s m v a z q l a t f z x s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> levels(x)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> levels(x) <- c(rep(1, 13), rep(2, 13))
> x
  [1] 1 2 2 2 1 1 1 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1
 [38] 2 1 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1
 [75] 1 1 2 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 1 1 2 1 2 2 2
Levels: 1 2
> levels(x)
[1] "1" "2"

If your content type column is a factor whose levels(obj) are, in this order:

"video/x-flv" "image/jpeg" "video/x-msvideo" "application/x-javascript; charset=utf-8"

... you would recode the levels like so (the new values must follow the order returned by levels(obj)):

levels(obj) <- c("web/video", "web/image", "web/video", "web/javascript")
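Note that factor() sorts its levels by default, so they may not be in order of appearance. A sketch that avoids depending on the level order is to assign a named list, where each name is a new level and each element lists the old levels it absorbs. The factor obj below is rebuilt here purely for illustration.

```r
obj <- factor(c("video/x-flv", "image/jpeg", "video/x-msvideo",
                "application/x-javascript; charset=utf-8"))

# Default levels are sorted, not kept in order of appearance.
levels(obj)

# Map old levels to new ones by name instead of by position.
levels(obj) <- list(
  "web/video"      = c("video/x-flv", "video/x-msvideo"),
  "web/image"      = "image/jpeg",
  "web/javascript" = "application/x-javascript; charset=utf-8"
)

as.character(obj)
```

Any old level that is not mentioned in the list becomes NA, so every level you want to keep should be listed.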

Assume that DF is our data frame. Define a regular expression, re, to match the strings of interest, and then use strapply in the gsubfn package to extract them, prefixing "web/" to each. In the strapply statement we convert DF[[1]] to character in case it's a factor rather than a character vector. NULL entries are the ones that were not matched, so let's assume those are "web/binary". Finally, expand any occurrences of "plain" to "plaintext":

> library(gsubfn)
> re <- "(video|image|html|flash|plain|javascript|xml|css).*"
> short <- strapply(as.character(DF[[1]]), re, ~ paste("web", x, sep = "/"))
> DF$short <- sapply(short, function(x) if (is.null(x)) "web/binary" else x)
> DF$short <- sub("plain", "plaintext", DF$short)
> DF
                                   Content          short
1                              video/x-flv      web/video
2                               image/jpeg      web/image
3                                text/html       web/html
4                 application/octet-stream     web/binary
5            application/x-shockwave-flash      web/flash
6                               text/plain  web/plaintext
7                 application/x-javascript web/javascript
8                          video/x-msvideo      web/video
9                                 text/xml        web/xml
10                                text/css        web/css
11                text/html; charset=utf-8       web/html
12                         video/quicktime      web/video
13 application/x-javascript; charset=utf-8 web/javascript

There is more info on the gsubfn package at http://gsubfn.googlecode.com.
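If you would rather not install gsubfn, the same idea can be sketched in base R with regexpr() and regmatches(). This is an alternative under the same assumptions as above (unmatched types are treated as "web/binary"), not the gsubfn method itself; the names types, key and category are illustrative.

```r
# Sample content types taken from the question.
types <- c("video/x-flv", "image/jpeg", "text/html",
           "application/octet-stream", "application/x-shockwave-flash",
           "text/plain", "application/x-javascript", "video/x-msvideo",
           "text/xml", "text/css", "text/html; charset=utf-8",
           "video/quicktime", "application/x-javascript; charset=utf-8")

re <- "video|image|html|flash|plain|javascript|xml|css"

# regexpr() returns -1 where nothing matched; regmatches() returns
# only the matched substrings, so fill them in at the matching positions.
hit <- regexpr(re, types)
matched <- hit != -1
key <- rep(NA_character_, length(types))
key[matched] <- regmatches(types, hit)

# Unmatched types fall back to "web/binary"; "plain" becomes "plaintext".
category <- ifelse(is.na(key), "web/binary", paste0("web/", key))
category <- sub("plain$", "plaintext", category)

data.frame(Content = types, short = category)
```

Keeping the fallback explicit in ifelse() makes it easy to swap "web/binary" for another default, or for NA, if unmatched types should be flagged rather than grouped.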
