Remove anything within a pair of angle brackets using gsub in R

Suppose I have a string like the one below:

<a>b<c>

I want to remove both <a> and <c>, but I can't use gsub("<.*>","","<a>b<c>") as this will remove the b also.

I asked a similar question before, but on second thought, I think I should learn how to deal with this kind of problem in general. Thanks.


Don't allow a closing bracket > in the stuff between the brackets:

z <- "<a>b<c>"
gsub("<[^>]+>","",z)
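
To see why this behaves differently from the greedy pattern in the question, here is a small side-by-side comparison. The variable names `greedy` and `negated` are only illustrative.

```r
z <- "<a>b<c>"

# Greedy: .* runs on to the last ">", so the whole string is one match.
greedy <- gsub("<.*>", "", z)

# Negated class: [^>]+ cannot cross a ">", so each tag is matched on its own.
negated <- gsub("<[^>]+>", "", z)

print(greedy)   # ""
print(negated)  # "b"
```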


You can use a non-greedy regex, e.g. /<.*?>/.
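
In R the pattern is passed as a plain string, without the surrounding slashes. A minimal sketch, assuming `perl = TRUE` so that the lazy quantifier follows PCRE semantics:

```r
z <- "<a>b<c>"

# .*? matches as few characters as possible, so it stops at the first ">".
lazy <- gsub("<.*?>", "", z, perl = TRUE)

print(lazy)  # "b"
```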

This will only work for simple HTML and can be easily subverted. Consider the following HTML, which cannot easily be removed using regular expressions.

<span title="Help > Index">
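
As a quick illustration of the problem, here is what the negated-class pattern does to that tag. The match ends at the `>` inside the attribute value, so part of the tag is left behind.

```r
html <- '<span title="Help > Index">'

# The match stops at the first ">", which sits inside the title attribute.
leftover <- gsub("<[^>]+>", "", html)

print(leftover)  # ' Index">'
```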


One more idea, often quite useful in noisy settings (i.e. when it comes nearer to making a tokenizer):

strsplit("<a>b<c>",split='<|>')[[1]][3]
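
To make the index `3` less mysterious, this sketch prints the full token vector first. Note that `strsplit` yields an empty first element when the string starts with a delimiter, and that picking a fixed position only works when the input layout is known in advance.

```r
# Split on either bracket and keep the tokens of the first (only) string.
tokens <- strsplit("<a>b<c>", split = "<|>")[[1]]

print(tokens)     # "" "a" "b" "c"
print(tokens[3])  # "b"
```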