Removing HTML code in R using gsub

I have a portion of HTML code in R like the one below:

"</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

I want to use gsub to remove the unwanted HTML code so that the output will be:

XXXX YYYY ZZZZ AAAA

I tried <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> as shown here, but it fails. Why?

How can I do it in R? Thanks.


I suggest you heed the warnings of @Ramnath and @Iterator and use a parser instead, but here is the best I can do with your string and regex:

(First, add a missing </a> to the end of your input string.)

x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

The code:

x1 <- gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", x)
x1
[1] "</a>  XXXX</a>  YYYY</a>  ZZZ</a> AAAA</a>"

gsub("</a>", "", x1)
[1] "  XXXX  YYYY  ZZZ AAAA"
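
As for why the original pattern fails: as far as I can tell, it is because [A-Z] is case-sensitive while the tag names here are lowercase, and because the pattern only matches paired tags, which the <img> tags are not. Below is a minimal base-R sketch of that check, plus a one-pass alternative that drops every tag and then tidies the whitespace. The helper name strip_tags and the shortened sample string are my own, for illustration only. It assumes no attribute value contains a literal ">", so it is no substitute for a real parser.

```r
# Shortened sample in the same shape as the question's string (illustrative only).
x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

# The pattern from the question, with backslashes doubled for an R string.
orig_pat <- "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>"
grepl(orig_pat, x, perl = TRUE)                      # FALSE: tag names are lowercase
grepl(orig_pat, x, perl = TRUE, ignore.case = TRUE)  # TRUE, but <img> has no closing tag

# Hypothetical helper: replace every tag with a space, then collapse whitespace.
strip_tags <- function(html) {
  no_tags <- gsub("<[^>]+>", " ", html)        # "<...>" becomes a single space
  trimws(gsub("[[:space:]]+", " ", no_tags))   # squeeze runs of whitespace, trim ends
}

strip_tags(x)
# [1] "XXXX YYYY AAAA"
```

Replacing each tag with a space rather than an empty string keeps adjacent pieces of text from running together; the second gsub then removes the extra spaces this introduces.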