
delete html tag, but not the tag content

I have a bunch of Word docs which were "saved as" filtered html. The html files contain extraneous ole-links which I need to delete. For example, I want to replace:

<h3><a name="OLE_LINK25">My Section Title</a></h3>

with

<h3>My Section Title</h3>

Any suggestions for how I might do this in an automated way?


Jsoup could help to remove all anchor tags with name starting with "OLE".

// doc is an already-parsed org.jsoup.nodes.Document (for example, from Jsoup.parse)
Elements anchors = doc.select("a[name^=OLE]");
for (Element anchor : anchors) {
    // unwrap() removes the <a> tag itself but leaves its children in place,
    // so any other content inside the parent header is preserved
    anchor.unwrap();
}


You could try something like this (untested, make sure to test first):

sed -E -i".backup" 's/<([^ >]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g' *.html

This replaces every occurrence of <TAG name="OLE....">WHATEVER_HERE</TAG> with just WHATEVER_HERE in all *.html files, provided the opening tag, its content, and the closing tag sit on one line and WHATEVER_HERE contains no nested tags. It also backs up each file, copying FILENAME.html to FILENAME.html.backup, before editing it.
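Since the command is untested, it may be worth previewing the substitution on a sample line before editing files in place. Below is a minimal sketch, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed both do). The function name strip_ole_tags is made up for illustration.

```shell
#!/bin/sh
# strip_ole_tags is a hypothetical helper name, not part of sed.
# It reads HTML on stdin and writes the result to stdout, so no file is modified.
strip_ole_tags() {
    # \1 = tag name, \2 = tag content; the whole element is replaced by its content
    sed -E 's/<([^ >]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g'
}

# Preview on the example line from the question
printf '%s\n' '<h3><a name="OLE_LINK25">My Section Title</a></h3>' | strip_ole_tags
```

With the example line from the question this should print <h3>My Section Title</h3>. Once the output looks right on a few real files (strip_ole_tags < somefile.html), the in-place command can be run on the whole directory.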

If necessary, download sed for Windows, or GNU sed.
