Removing HTML code in R using gsub
I have a portion of HTML code in R like the one below:
"</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"
I want to use gsub to remove the unwanted HTML code so that the output will be:
XXXX YYYY ZZZZ AAAA
I tried the pattern <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> as shown here, but it fails. Why?
How can I do it in R? Thanks.
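On the "why": the exact gsub call is not shown, so this is only a likely explanation, sketched on a shortened, made-up input. Three things tend to trip up that pattern in R. Inside an R string every backslash must be doubled (a bare "\b" is a backspace character and "\1" is an octal escape). [A-Z] only matches uppercase tag names unless ignore.case = TRUE is set. And the pattern requires a matching closing tag, which <img> never has.

```r
# Shortened, hypothetical stand-in for the input: one <img> and one <a> element.
x <- "<img src=\"a.gif\" width=\"8\"> <a href=\"group.php?g=1\">XXXX</a>"

# The pattern from the question written as an R string: backslashes are doubled.
pat <- "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>"

# 1. Case-sensitive: [A-Z] never matches the lowercase tag names,
#    so nothing is replaced and the input comes back unchanged.
strict <- gsub(pat, "\\2", x, perl = TRUE)

# 2. Case-insensitive: the <a>...</a> pair is replaced by its content,
#    but <img> has no closing tag, so the pattern cannot match it.
relaxed <- gsub(pat, "\\2", x, perl = TRUE, ignore.case = TRUE)

print(strict)
print(relaxed)
```

So even with the escaping and case fixed, this pattern only handles elements that come in open/close pairs, which is why the unpaired <img> tags (and a stray closing tag) survive.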
I suggest you heed the warnings of @Ramnath and @Iterator and use a parser instead, but here is the best I can do with your string and regex:
(First add the missing </a> to the end of your input string.)
x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"
The code:
x1 <- gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", x)
x1
[1] "</a> XXXX</a> YYYY</a> ZZZ</a> AAAA</a>"
gsub("</a>", "", x1)
[1] " XXXX YYYY ZZZ AAAA"
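If a regex really has to do the job, a shorter alternative is to drop anything that looks like a tag in one pass and then tidy the whitespace. This is a minimal sketch, not a robust HTML stripper: it assumes no ">" occurs inside attribute values and that there are no comments or script blocks. The helper name strip_tags is made up for illustration. It works on the original input, without appending a closing tag by hand.

```r
# Rebuild the input string from the question piece by piece for readability.
img <- "<img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">"
x <- paste0(
  "</a> ", img,
  " <a href=\"group.php?g=1\">XXXX</a> ", img,
  " <a href=\"category.php?c=100050\">YYYY</a> ", img,
  " <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> ", img,
  "AAAA"
)

# Hypothetical helper: remove every <...> run, then normalise whitespace.
strip_tags <- function(s) {
  # Replace each tag (opening, closing, or unpaired) with a single space
  # so that neighbouring words are never glued together.
  no_tags <- gsub("<[^>]+>", " ", s)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("[[:space:]]+", " ", no_tags))
}

print(strip_tags(x))
# [1] "XXXX YYYY ZZZZ AAAA"
```

For anything beyond a small, well-behaved fragment like this one, the parser advice above still applies.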