Fixing unescaped XML entities in Java with Regex?

I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.

The (current) problem is that ampersand characters are not always escaped properly, so I need to convert & into &amp;

If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.

Where <characters> is some set of characters defining an entity between the initial & and the closing ;. In particular, < and > are not literals that would otherwise denote an XML element.

Now, when parsing, if I see &<characters> I don't know whether I'll run into a ;, a (space), end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.

I think that I need the power of a pushdown automaton (PDA) to do this; I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?

Remember: there could be multiple replacements per line.

(I'm aware of this question, but it does not provide the answer that I am looking for.)


Here's the regex you're looking for: &([^;\\W]*([^;\\w]|$)) (written as it appears inside a Java string literal), and the corresponding replacement string would be &amp;$1. It matches &, followed by zero or more word characters (it needs to allow zero to match the stand-alone ampersand), followed by a non-word character that is not a semicolon (or the end of the input). The capturing group lets the replacement put everything after the & back, so only the & itself becomes &amp;.

Here's some sample code using it:

String s = "&amp; & &nsbp; &tc., &tc. &tc";
final String regex = "&([^;\\W]*([^;\\w]|$))";
final String replacement = "&amp;$1";
final String t = s.replaceAll(regex, replacement);

After running this in a sandbox, I get the following result for t:

&amp; &amp; &nsbp; &amp;tc., &amp;tc. &amp;tc

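As you can see, the original &amp; and &nsbp; remain unchanged, because any name followed by a semicolon is treated as an entity reference. However, if you try it with "&&", you get &amp;&, and if you try it with "&&&", you get &amp;&&amp;, which I take as a symptom of the look-ahead problem you were alluding to: the character that terminates one match is consumed, so an & in that position is never examined. However, if you replace the line: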

final String t = s.replaceAll(regex, replacement);

with:

final String t = s.replaceAll(regex, replacement).replaceAll(regex, replacement);

It works with all of those strings and any others that I could think of. (In a finished product, you'd presumably write a single routine that would do this double replaceAll invocation.)
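The "single routine" mentioned above might look roughly like the following. This is only a sketch: the class name AmpersandEscaper and the method name escape are made up for illustration, and it simply applies the same regex and replacement twice with a precompiled Pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the answer above.
final class AmpersandEscaper {
    // Same regex as above: '&', then word characters, then either a
    // non-word character other than ';' or the end of the input.
    private static final Pattern BARE_AMP =
        Pattern.compile("&([^;\\W]*([^;\\w]|$))");
    private static final String REPLACEMENT = "&amp;$1";

    static String escape(String s) {
        String once = BARE_AMP.matcher(s).replaceAll(REPLACEMENT);
        // The second pass picks up any '&' that the first pass consumed as
        // the terminating character of the previous match (e.g. in "&&").
        return BARE_AMP.matcher(once).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(escape("&&"));            // &amp;&amp;
        System.out.println(escape("&&&"));           // &amp;&amp;&amp;
        System.out.println(escape("&amp; & &tc."));  // &amp; &amp; &amp;tc.
    }
}
```

Compiling the pattern once avoids recompiling it on every String.replaceAll call, which matters if the routine runs once per line of a large file.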


I think you can also use a negative look-ahead to match only those & characters that are not followed by word characters and a semicolon (e.g. &(?!\w+;)). Here's an example:

import java.util.Arrays;
import java.util.regex.Pattern;

public class HelloWorld {
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|\\w+);)");

    public static void main(String[] args) {
        for (String s : Arrays.asList(
            "http://www.example.com/?a=1&b=2&amp;c=3/",
            "Three in a row: &amp;&&amp;",
            "&lt; is <, &gt; is >, &apos; is ', etc."
        )) {
            System.out.println(
                UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;")
            );
        }
    }
}

// Output:
// http://www.example.com/?a=1&amp;b=2&amp;c=3/
// Three in a row: &amp;&amp;&amp;
// &lt; is <, &gt; is >, &apos; is ', etc.


Start by understanding the grammar around entities: http://www.w3.org/TR/xml/#NT-EntityRef

Then look at the JavaDoc for FilterInputStream: http://download.oracle.com/javase/6/docs/api/java/io/FilterInputStream.html

Then implement one that reads the actual input character-by-character. When it sees an ampersand, it switches into "entity mode" and looks for a valid entity reference (& Name ;). If it finds one before the first character that isn't allowed in Name, then it writes it to the output verbatim. Otherwise it writes &amp; followed by everything after the ampersand.
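A minimal sketch of that "entity mode" scan follows. To keep it short, it copies from a Reader to a Writer instead of subclassing FilterInputStream, and it approximates the XML Name production with ASCII letters, digits and a few punctuation characters. Both choices, and the names EntityScanner and repair, are assumptions made for illustration. Numeric character references such as &#38; are not recognised here and would be escaped.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch; not a complete implementation of the XML grammar.
final class EntityScanner {
    // Assumption: rough ASCII approximation of NameStartChar / NameChar.
    private static boolean isNameStart(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
    }

    private static boolean isNameChar(int c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '.' || c == '-';
    }

    static void repair(Reader in, Writer out) throws IOException {
        StringBuilder pending = null; // non-null while in "entity mode"
        int c;
        while ((c = in.read()) != -1) {
            if (pending != null) {
                if (isNameChar(c)) {          // still inside a possible Name
                    pending.append((char) c);
                    continue;
                }
                boolean validRef = c == ';' && pending.length() > 0
                        && isNameStart(pending.charAt(0));
                // Valid reference: copy verbatim. Otherwise escape the '&'.
                out.write(validRef ? "&" : "&amp;");
                out.write(pending.toString());
                pending = null;
                if (validRef) {
                    out.write(';');
                    continue;
                }
                // Not a reference: fall through and treat c as ordinary input.
            }
            if (c == '&') {
                pending = new StringBuilder(); // enter entity mode
            } else {
                out.write(c);
            }
        }
        if (pending != null) {                 // input ended inside entity mode
            out.write("&amp;");
            out.write(pending.toString());
        }
    }

    static String repair(String s) {
        try {
            StringWriter out = new StringWriter();
            repair(new StringReader(s), out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The only state kept is the text seen since the last unresolved ampersand, so the scan works in a single pass over the input.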


Instead of trying to do something generic over all possible bad data, just deal with the occurrences of bad data one at a time. Chances are that whatever is generating the XML is messing up one or two characters but not all of them. This is an assumption, of course.

Try just replacing all & with &amp; EXCEPT when the & is followed by amp;. If the next improperly encoded character you run into is <, then replace them all with &lt;. Keep the rule set small and manageable, only dealing with things you know are wrong.
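As a rough illustration of the first rule, a negative look-ahead can skip any & that already starts &amp;. The class and method names below are hypothetical. Note that this narrow rule deliberately knows nothing about other entities, so an existing &lt; would be turned into &amp;lt;, which is only acceptable under the assumption that the input contains no other entity references.

```java
// Hypothetical helper illustrating the single narrow rule described above.
final class NarrowAmpFix {
    static String fix(String s) {
        // Replace every '&' that is not immediately followed by "amp;".
        return s.replaceAll("&(?!amp;)", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(fix("a & b &amp; c")); // a &amp; b &amp; c
        System.out.println(fix("x &lt; y"));      // x &amp;lt; y (other entities are not protected)
    }
}
```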

If you try to do too much, you may end up replacing things you didn't intend to and messing the data up yourself.

I also want to note that the best solution is to encourage whoever is producing the XML to fix the encoding on their end. This may be awkward to ask, but if you explain to them, professionally, that they are not generating valid XML, they may be willing to fix the bug(s). This would have the added benefit that the next person who has to consume it won't need to write some crazy custom code to work around a problem that should be solved at the source. Consider it at least. The worst thing that can happen is that you ask, they say no, and you are right where you are now.


Sorry for stirring up an old thread:
I faced the same problem and the workaround I used was in 3 steps:

  1. Identify valid entity references and 'hide' them from regex
  2. Replace non-escaped characters using regex
  3. Restore previously 'hidden' entity references

The hiding is done by enclosing entities in custom character sequences. e.g. "#||<ENTITY_NAME>||#"

To illustrate, say we have this XML snippet with unescaped character &:

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&happy; at the same time!
    its still &lt; ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

Step1:
We use the regex replace "&(amp|apos|gt|lt|quot);" with "#||$1||#". This is because the entity references predefined by the XML specification are amp, lt, gt, apos and quot. The string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&happy; at the same time!
    its still #||lt||# ecstatic
    It is two & three words
    Short form is 2&three
    Now for some invalid entity refs: &amp, &gt, and &lt too.
</VALUE>

Only the valid entity references were hidden. &happy; was left untouched.

Step2:
Do the regex replace "[&]" with "&amp;". The string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one #||amp||# two
    I am sad&amp;happy; at the same time!
    its still #||lt||# ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>

Step3:
Do the regex replace "#\|\|([a-z]+)\|\|#" with "&$1;". The final corrected string now looks like this:

<NAME>Testname</NAME>
<VALUE>
    random words one &amp; two
    I am sad&amp;happy; at the same time!
    its still &lt; ecstatic
    It is two &amp; three words
    Short form is 2&amp;three
    Now for some invalid entity refs: &amp;amp, &amp;gt, and &amp;lt too.
</VALUE>
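
In Java, the three steps above could be sketched as follows. The class and method names are hypothetical, the "#||...||#" marker is the one used in this answer, and the sketch assumes that marker never occurs in the real content.

```java
// Hypothetical sketch of the hide / replace / restore steps.
final class HideAndRestore {
    static String fix(String xml) {
        // Step 1: hide the five predefined entity references.
        String hidden = xml.replaceAll("&(amp|apos|gt|lt|quot);", "#||$1||#");
        // Step 2: every '&' still present is unescaped, so escape it.
        String escaped = hidden.replace("&", "&amp;");
        // Step 3: restore the hidden entity references.
        return escaped.replaceAll("#\\|\\|([a-z]+)\\|\\|#", "&$1;");
    }

    public static void main(String[] args) {
        System.out.println(fix("one &amp; two & three &happy; &lt"));
        // one &amp; two &amp; three &amp;happy; &amp;lt
    }
}
```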


Downsides: The custom character sequence used to hide the valid entities must be chosen carefully, so that no valid content happens to contain the same sequence. The chances of that are small, but admittedly this is not a foolproof solution.


I used the UNESCAPED_AMPERSAND solution above but I had to change the regex to

private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

adding |#x[0-9a-fA-F]+ to account for hex character references.
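
A quick check of the extended pattern, as a sketch (the class name CharRefDemo and the sample strings are made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical demo of the pattern with the hex alternative added.
final class CharRefDemo {
    private static final Pattern UNESCAPED_AMPERSAND =
        Pattern.compile("&(?!(#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    static String escape(String s) {
        return UNESCAPED_AMPERSAND.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Decimal and hex character references are preserved; the bare '&' is escaped.
        System.out.println(escape("A &#38; B &#x26; C & D"));
        // A &#38; B &#x26; C &amp; D
    }
}
```

Without the extra alternative, &#x26; would not be recognised, because # is not a word character and #\d+ requires a digit immediately after the #.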

(I wanted to comment on that solution but apparently I can't.)
