开发者

R and regexp: Extract name source from news

I have a scrap news by R like the following:

> View(mydf$title)
<name of the news> <dash> <source name>    
Матч КХЛ перенесен на 2 дня - Газета.Ru
Всероссийская универсиада 2010 - Interfax Russia
Звезда хоккея снялся в клипе популярного рэпера. ВИДЕО - Ura.ru
Трактор – Тролейбус 2:1 14.04.2011 – YouTube

I need to split mydf$title on the title news and name of source (- Газета.ru, - Interfax Russia, - Ura.ru, etc)

I try this library(stringr):

mydf$sourse <- str_extract(mydf$title, '\\- [A-Za-zА-Яа-я0-9." ]{0,}$')
mydf$sourse <- str_extract(mydf$title, "\\-[:space:[:alpha:][:punct:][:space:]]{0,}$")
mydf$sourse <- str_extract(mydf$title, '\\-\\s[A-Za-zА-Яа-я0-9[:punct:]\\s]{0,}')
mydf$sourse <- str_开发者_JS百科extract(mydf$title, "\\s-\\s[\\w+\\s.]{0,}$")
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[:alpha:][:print:]$")

But does not work very well. How do I split a string optimally? Thanks for the tips. Спасибо.

Note: mydf is data.frame:

> str(mydf)
'data.frame':   100 obs. of  6 variables:
 $ title      : Factor w/ 100 levels...
 $ link       : Factor w/ 100 levels...
 $ guid.text  : Factor w/ 100 levels...
 $ guid..attrs: Factor w/ 1 level...
 $ pubDate    : Factor w/ 100 levels...
 $ description: Factor w/ 100 levels...


Try using strsplit, but I note that your separator is in fact two different types of dash:

strsplit(mydf$title, split=" [–-] ", useBytes=TRUE)

This will give you a list of elements. (As you can see, I couldn't get the encoding to be correct on my machine, but even so, it's clear that the news agency is always the last element of each list. The only other issue that you will have to deal with then is that sometimes the source can also inlude a dash. If this happens you will have to use paste to combine all but the last element of each list.

[[1]]
[1] "<U+041C><U+0430><U+0442><U+0447> <U+041A><U+0425><U+041B> <U+043F><U+0435><U+0440><U+0435><U+043D><U+0435><U+0441><U+0435><U+043D> <U+043D><U+0430> 2 <U+0434><U+043D><U+044F>"
[2] "<U+0413><U+0430><U+0437><U+0435><U+0442><U+0430>.Ru"                                                                                                                           

[[2]]
[1] "<U+0412><U+0441><U+0435><U+0440><U+043E><U+0441><U+0441><U+0438><U+0439><U+0441><U+043A><U+0430><U+044F> <U+0443><U+043D><U+0438><U+0432><U+0435><U+0440><U+0441><U+0438><U+0430><U+0434><U+0430> 2010"
[2] "Interfax Russia"                                                                                                                                                                                       

[[3]]
[1] "<U+0417><U+0432><U+0435><U+0437><U+0434><U+0430> <U+0445><U+043E><U+043A><U+043A><U+0435><U+044F> <U+0441><U+043D><U+044F><U+043B><U+0441><U+044F> <U+0432> <U+043A><U+043B><U+0438><U+043F><U+0435> <U+043F><U+043E><U+043F><U+0443><U+043B><U+044F><U+0440><U+043D><U+043E><U+0433><U+043E> <U+0440><U+044D><U+043F><U+0435><U+0440><U+0430>. <U+0412><U+0418><U+0414><U+0415><U+041E>"
[2] "Ura.ru"                                                                                                                                                                                                                                                                                                                                                                                  

[[4]]
[1] "<U+0422><U+0440><U+0430><U+043A><U+0442><U+043E><U+0440>"                               
[2] "<U+0422><U+0440><U+043E><U+043B><U+0435><U+0439><U+0431><U+0443><U+0441> 2:1 14.04.2011"
[3] "YouTube"   


Perhaps you are overcomplicating things:

strsplit(c("before - after", "123 - 456"), " - ", fixed=TRUE)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜