Remove anything within a pair of angle brackets using gsub in R
Suppose I have a string like the one below:
<a>b<c>
I want to remove both <a> and <c>, but I can't use gsub("<.*>","","<a>b<c>") as this will remove the b also.
I asked a similar question before, but on second thought, I think I should learn how to deal with this kind of problem in general. Thanks.
Don't allow a closing bracket > in the stuff between the brackets:
z <- "<a>b<c>"
gsub("<[^>]+>","",z)
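To see why the negated character class helps, here is a small sketch comparing it with the greedy pattern from the question. The variable names greedy and negated are only illustrative.

```r
z <- "<a>b<c>"

# Greedy: .* runs on to the LAST ">", so the whole string is one match.
greedy <- gsub("<.*>", "", z)

# Negated class: [^>]+ cannot cross a ">", so each tag is matched separately.
negated <- gsub("<[^>]+>", "", z)

print(greedy)   # ""
print(negated)  # "b"
```

Note that [^>]+ needs at least one character between the brackets, so an empty pair <> would be left in place; [^>]* would remove it too.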
You can use a non-greedy regex, e.g. /<.*?>/.
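In R the pattern is passed as a plain string, without the surrounding slashes. A minimal sketch, assuming you are happy to request the PCRE engine explicitly with perl = TRUE so the lazy quantifier is unambiguous:

```r
z <- "<a>b<c>"

# .*? matches as few characters as possible, so each match ends at the
# first ">" after its "<". perl = TRUE selects the PCRE engine.
lazy <- gsub("<.*?>", "", z, perl = TRUE)

print(lazy)  # "b"
```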
This will only work for simple HTML and is easily subverted. Consider the following tag, which a simple regular expression cannot remove cleanly because the attribute value itself contains a >:
<span title="Help > Index">
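A quick sketch of how this goes wrong. The surrounding text and the closing </span> are made up for illustration; only the opening tag comes from the example above.

```r
# Hypothetical input: the opening tag from above plus invented content.
html <- '<span title="Help > Index">text</span>'

# The match for the opening tag stops at the ">" inside the attribute value,
# so part of the attribute is left behind in the output.
stripped <- gsub("<[^>]+>", "", html)

print(stripped)  # " Index\">text"
```

The leftover fragment  Index"> shows that the pattern has no notion of quoted attribute values.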
One more idea, often quite useful in noisy settings (i.e. when it comes nearer to making a tokenizer):
strsplit("<a>b<c>",split='<|>')[[1]][3]
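It may help to look at the whole token vector before indexing into it. This is only a sketch; the positions depend on the shape of the input, so [3] is specific to this string.

```r
# Split on either bracket; a match at the very start yields a leading "".
tokens <- strsplit("<a>b<c>", split = "<|>")[[1]]

print(tokens)     # "" "a" "b" "c"
print(tokens[3])  # "b"
```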