
RegEx to close an unclosed tag

I have received some invalid XML data from a poll provider and would like to clean up several unclosed tags before processing.

The data currently looks like this:

<questions>
<question number="1">
<title>What is your name?</title>
<answer>John Doe<answer> <!-- this is the problem -->
</question>
<question number="2">
...
</question>

Is there a way with regular expressions to clean this and go ahead and close that <answer> tag?

I have this: "<answer>.*?(?<closingtag><answer>)" to find the occurrences, but how do I do a specific replacement on that <closingtag> named group?

Sorry for this very basic question, but I am struggling a bit with my regular expression.

Thanks,

Hal


If the problem is always a missing / (that is, there is a matching tag, but it's not currently a closing one), you could do something like this:

Find: <([^/>]+)>([^<]*?)<\1>

Replace with: <\1>\2</\1>

This looks for the same opening tag appearing twice in a row with only text between the two occurrences (self-closing tags are excluded), and replaces the match with the opening tag, the content, and then a closing version of the tag.

There are some caveats, of course - if a tag has an attribute that includes a /, or if the value of the unclosed tag includes < (or other tags) this regex wouldn't work.
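For reference, here is a minimal sketch of that find/replace in Python, assuming the data matches the sample in the question. The function name close_repeated_tags is made up for illustration. The question's (?<closingtag>...) syntax is the .NET flavor; Python spells named groups (?P<name>...), but plain numbered backreferences are enough here.

```python
import re

# Find: <tag>text<tag>, where the second <tag> should have been </tag>.
# Group 1: tag name plus any attributes (no "/" or ">").
# Group 2: the text between the two tags (no "<").
UNCLOSED = re.compile(r"<([^/>]+)>([^<]*?)<\1>")


def close_repeated_tags(xml_text: str) -> str:
    """Turn <tag>text<tag> into <tag>text</tag>; leave everything else alone."""
    return UNCLOSED.sub(r"<\1>\2</\1>", xml_text)


broken = (
    '<question number="1">\n'
    "<title>What is your name?</title>\n"
    "<answer>John Doe<answer>\n"
    "</question>"
)
print(close_repeated_tags(broken))
```

The same caveats apply: a "/" inside an attribute value, or a "<" inside the unclosed element's content, makes the pattern miss the tag.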


Programmatic repair of human error in XML is asking for trouble. Taken to the extreme, you might as well skip XML validation altogether. Take just one example:

<questions>
<question number="1">
<title>What is your name?</title>
<answer>John Doe<answer> <!-- this is the problem -->
</question>
<question number="2">
...
</question>

Repair...

<answer>John Doe</answer> 

Or...

<answer>John</answer><answer> Doe</answer>

Or...

<answer>John Doe</answer><answer> </answer>

Can you see where this is headed?
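To make the ambiguity concrete, here is a small sketch using Python's standard xml.etree.ElementTree. It wraps each of the three candidate repairs in a stripped-down version of the document and parses it. The wrap and answers helpers, and the closing </questions> tag, are assumptions added for illustration. All three candidates parse without complaint, so a parser cannot tell you which one the author meant.

```python
import xml.etree.ElementTree as ET

# The three candidate repairs listed above.
candidates = [
    "<answer>John Doe</answer>",
    "<answer>John</answer><answer> Doe</answer>",
    "<answer>John Doe</answer><answer> </answer>",
]


def wrap(answer_markup: str) -> str:
    """Embed the answer markup in a minimal, otherwise well-formed document."""
    return (
        '<questions><question number="1">'
        "<title>What is your name?</title>"
        f"{answer_markup}"
        "</question></questions>"
    )


def answers(answer_markup: str) -> list:
    """Parse the wrapped document and return the text of every <answer>."""
    root = ET.fromstring(wrap(answer_markup))
    return [a.text for a in root.iter("answer")]


for candidate in candidates:
    # Each repair is well-formed, yet each yields different data.
    print(answers(candidate))
```

The original, unrepaired markup raises ET.ParseError when passed through the same helper, which is the validation doing its job.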
