R and regexp: Extract the source name from news titles
I have scraped some news with R. The titles look like this:
> View(mydf$title)
<name of the news> <dash> <source name>
Матч КХЛ перенесен на 2 дня - Газета.Ru
Всероссийская универсиада 2010 - Interfax Russia
Звезда хоккея снялся в клипе популярного рэпера. ВИДЕО - Ura.ru
Трактор – Тролейбус 2:1 14.04.2011 – YouTube
I need to split mydf$title into the news title and the name of the source (- Газета.ru, - Interfax Russia, - Ura.ru, etc.).
I tried the following patterns with library(stringr):
mydf$sourse <- str_extract(mydf$title, '\\- [A-Za-zА-Яа-я0-9." ]{0,}$')
mydf$sourse <- str_extract(mydf$title, "\\-[:space:[:alpha:][:punct:][:space:]]{0,}$")
mydf$sourse <- str_extract(mydf$title, '\\-\\s[A-Za-zА-Яа-я0-9[:punct:]\\s]{0,}')
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[\\w+\\s.]{0,}$")
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[:alpha:][:print:]$")
But none of them works very well. How do I split the string optimally? Thanks for any tips.
Note: mydf is a data.frame:
> str(mydf)
'data.frame': 100 obs. of 6 variables:
$ title : Factor w/ 100 levels...
$ link : Factor w/ 100 levels...
$ guid.text : Factor w/ 100 levels...
$ guid..attrs: Factor w/ 1 level...
$ pubDate : Factor w/ 100 levels...
$ description: Factor w/ 100 levels...
Try using strsplit, but note that your separator is in fact two different types of dash (a hyphen and an en dash):
strsplit(as.character(mydf$title), split=" [–-] ", useBytes=TRUE)  # title is a factor, so convert it to character first
This will give you a list of elements, shown below. (As you can see, I couldn't get the encoding to be correct on my machine.) Even so, it's clear that the news agency is always the last element of each list. The only other issue you will have to deal with is that sometimes the title itself also includes a dash, as in the fourth item. When this happens you will have to use paste to combine all but the last element of each list.
[[1]]
[1] "<U+041C><U+0430><U+0442><U+0447> <U+041A><U+0425><U+041B> <U+043F><U+0435><U+0440><U+0435><U+043D><U+0435><U+0441><U+0435><U+043D> <U+043D><U+0430> 2 <U+0434><U+043D><U+044F>"
[2] "<U+0413><U+0430><U+0437><U+0435><U+0442><U+0430>.Ru"
[[2]]
[1] "<U+0412><U+0441><U+0435><U+0440><U+043E><U+0441><U+0441><U+0438><U+0439><U+0441><U+043A><U+0430><U+044F> <U+0443><U+043D><U+0438><U+0432><U+0435><U+0440><U+0441><U+0438><U+0430><U+0434><U+0430> 2010"
[2] "Interfax Russia"
[[3]]
[1] "<U+0417><U+0432><U+0435><U+0437><U+0434><U+0430> <U+0445><U+043E><U+043A><U+043A><U+0435><U+044F> <U+0441><U+043D><U+044F><U+043B><U+0441><U+044F> <U+0432> <U+043A><U+043B><U+0438><U+043F><U+0435> <U+043F><U+043E><U+043F><U+0443><U+043B><U+044F><U+0440><U+043D><U+043E><U+0433><U+043E> <U+0440><U+044D><U+043F><U+0435><U+0440><U+0430>. <U+0412><U+0418><U+0414><U+0415><U+041E>"
[2] "Ura.ru"
[[4]]
[1] "<U+0422><U+0440><U+0430><U+043A><U+0442><U+043E><U+0440>"
[2] "<U+0422><U+0440><U+043E><U+043B><U+0435><U+0439><U+0431><U+0443><U+0441> 2:1 14.04.2011"
[3] "YouTube"
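The recombination step described above might look roughly like the sketch below. It is only an illustration: split_title is a hypothetical helper name, the sample titles are English stand-ins for the ones in the question, and the en dash is written as "\u2013" to sidestep the encoding trouble. It also assumes every title is non-empty and contains at least one separator.

```r
# Hypothetical helper (not from the original answer): keep the last piece as
# the source and paste everything before it back together as the title.
split_title <- function(x) {
  # strsplit() needs character input, so convert factors first.
  parts <- strsplit(as.character(x), " [\u2013-] ")  # "\u2013" is the en dash
  data.frame(
    # All pieces except the last, rejoined with a plain " - ".
    title = vapply(parts, function(p) paste(head(p, -1), collapse = " - "),
                   character(1)),
    # The last piece is taken to be the news source.
    source = vapply(parts, function(p) tail(p, 1), character(1)),
    stringsAsFactors = FALSE
  )
}

# English stand-ins for the titles in the question.
titles <- factor(c(
  "KHL match postponed by 2 days - Gazeta.Ru",
  "Traktor \u2013 Trolleybus 2:1 14.04.2011 \u2013 YouTube"
))
res <- split_title(titles)
print(res)
```

Note that rejoining with " - " turns an en dash inside the title into a plain hyphen, and that a title with no separator at all would end up entirely in the source column.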
Perhaps you are overcomplicating things:
strsplit(c("before - after", "123 - 456"), " - ", fixed=TRUE)
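One caveat: a fixed " - " does not match the en dash used in the fourth title of the question, and it splits at every hyphen rather than only the last one. If that matters, a sketch along the following lines may help. It uses base R's sub() with a greedy group, so the match is pushed to the last separator. The sample strings and variable names here are made up for illustration.

```r
# Made-up sample input; "\u2013" is the en dash.
x <- c("before - after", "one \u2013 two \u2013 three")

# The first (.*) is greedy, so it swallows everything up to the LAST
# " - " or " \u2013 "; the second group is whatever follows it.
pattern <- "^(.*) [\u2013-] (.*)$"

title_part  <- sub(pattern, "\\1", x, perl = TRUE)  # text before the last separator
source_part <- sub(pattern, "\\2", x, perl = TRUE)  # text after the last separator

print(data.frame(title = title_part, source = source_part))
```

Strings without any separator are returned unchanged by both calls, so they would need separate handling.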