Need a modified behavior for non-greedy grep
I am attempting to clean out a ton of spam that was injected into a client's blog. One of the issues is that the hack that did the injection left behind malformed links embedded inside other links, so I am having trouble grabbing them in a concise way.
My thought was to dump all of the links in the posts table into a text file, then remove the valid ones from that list, and from there create a bash script that removed the malicious ones one line at a time. I was trying to use a non-greedy grep to dump the links, otherwise in cases where there was more than one link in the post it would go from the start of the first link to the end of the last one. This is the line I was using:
grep -Po "<a href=[\'\"][^\'\"]*[\'\"]>.*?</a>" wp_posts.sql>full-link-list.txt
The problem is happening when it tries to parse links embedded within other links. For instance, I get this:
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a>
from a section like this:
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a> do you buy viagra | buy cialis phentermine | cheap levitra online</a>
Not all links are broken like this though, and if I clean out the ones output from the command above I think it will make it very difficult to track down the debris. What I think I need is either something that grabs the whole block (i.e. matching opening <a href with the same number of closing </a>), or just the smallest inner match possible (i.e. greedy from the inside out) and I then do it in multiple passes, but I am open to other suggestions too. Any thoughts on this? Thanks!
I think the inside-out approach is your best bet. Assuming there are no other tags inside the <a> elements, it should be as simple as changing the .*? to [^<>]*. And, as you said, making multiple passes.
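To make the inside-out idea concrete, here is a rough sketch of the multi-pass loop, run against the sample line from the question. The file names (sample.txt, work.txt, innermost-links.txt) are made up for illustration, and deleting each extracted link from a working copy between passes is my assumption about how the passes would be chained, not something the question specifies. It also assumes GNU grep with -P support.

```shell
#!/usr/bin/env bash
# Sample line from the question. All three file names are hypothetical
# scratch files used only for this sketch.
cat > sample.txt <<'EOF'
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a> do you buy viagra | buy cialis phentermine | cheap levitra online</a>
EOF

# [^<>]* cannot cross a tag boundary, so only links with no tags inside match.
pattern="<a href=['\"][^'\"]*['\"]>[^<>]*</a>"

cp sample.txt work.txt
: > innermost-links.txt

# Each pass records the current innermost links, then deletes them from the
# working copy so the enclosing link becomes innermost on the next pass.
while grep -qP "$pattern" work.txt; do
    grep -Po "$pattern" work.txt >> innermost-links.txt
    sed -E "s#${pattern}##g" work.txt > work.tmp && mv work.tmp work.txt
done

cat innermost-links.txt
```

On this sample the first pass picks up only the clinesite.com link, and the second pass picks up what is left of the blogtorn.com link once the inner one is gone. Note that deleting an inner link also drops its link text from the working copy, so you would want to run this on a copy of the dump rather than the original.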
While it is possible in many regex flavors to match the whole nested structure in one pass, every flavor does it differently, and it's always ugly.
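For comparison, this is roughly what the one-pass version can look like in PCRE, which is the flavor grep -P uses. It is only a sketch: it relies on PCRE's (?R) recursion and a possessive quantifier, it assumes every opening <a href=...> has its own closing </a>, and it assumes no other tags appear inside the links, so it will skip links that are malformed in other ways. The file name nested-sample.txt is hypothetical.

```shell
#!/usr/bin/env bash
# Sample line from the question, written to a hypothetical scratch file.
cat > nested-sample.txt <<'EOF'
<a href="http://blogtorn.com/images/">where <a href="http://clinesite.com/images/">buy n viagra </a> do you buy viagra | buy cialis phentermine | cheap levitra online</a>
EOF

# (?R) recurses into the whole pattern, so each nested <a href=...> must be
# closed by its own </a>. [^<>]++ consumes plain text without backtracking.
nested_pattern="<a href=['\"][^'\"]*['\"]>(?:[^<>]++|(?R))*</a>"

grep -Po "$nested_pattern" nested-sample.txt
```

Here the whole outer link, including the nested one, comes back as a single match. Other regex flavors spell recursion differently or do not support it at all, which is why the multi-pass approach is the more portable choice.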