Finding encoding issues in Java Project/Source

I'm currently working on a Java project where part of my job is to watch over code quality. As tools I use Jenkins in combination with Sonar. These tools are great, and they have helped me track issues quickly and continuously.

One issue I can't get under control is that some people commit files in an encoding other than UTF-8.

When code like this:

if (someString == "something") {
    resultString = "string with encoding problem: �";
}

... gets committed, Sonar will help me find the "String Literal Equality" issue. But as you can see in the second line, there is also an encoding problem: "�" should have been an "ü".

Is there any way to find these kinds of problems with Sonar/FindBugs/PMD?

Please advise! Thank you.

PS: Of course I've tried to explain the issue to my co-developers in person as well as via email. I even changed their project/workspace encoding myself... But somehow they still succeed in committing code like this.


I agree with @bmargulies: it's a valid Unicode character (U+FFFD, the replacement character), but a PMD rule could still help. Here is a proof-of-concept rule with a hard-coded list of disallowed characters:

import net.sourceforge.pmd.AbstractJavaRule;
import net.sourceforge.pmd.ast.ASTLiteral;

import org.apache.commons.lang3.StringUtils;

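public class EncodingRule extends AbstractJavaRule {

    // Characters that must not appear in string literals.
    // U+FFFD is the Unicode replacement character.
    private static final String badChars = "\uFFFD";

    public EncodingRule() {
    }

    // Called for every literal in the AST; only string literals are checked.
    @Override
    public Object visit(final ASTLiteral node, final Object data) {
        if (node.isStringLiteral()) {
            final String image = node.getImage();
            // Report the literal if it contains any of the bad characters.
            if (StringUtils.containsAny(image, badChars)) {
                addViolationWithMessage(data, node, "Disallowed char in '"
                        + image + "'");
            }
        }
        return super.visit(node, data);
    }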

}

It might be useful to invert the condition and use an allowedChars whitelist containing the ASCII characters plus your local characters. (There is some more detail on custom PMD rules in this answer.)
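A minimal sketch of that inverted check, kept free of PMD so it can be tried on its own. The class name AllowedChars, the method names, and the choice of German umlauts as "local chars" are all assumptions made for illustration; inside a rule, the result would be used where containsAny is called above.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of PMD: whitelist check for string literals.
final class AllowedChars {

    // Assumed "local" characters (ä ö ü Ä Ö Ü ß), written as escapes so this
    // file itself stays pure ASCII regardless of the editor's encoding.
    private static final String EXTRA_ALLOWED =
            "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF";

    // Printable ASCII (0x20..0x7E) plus the extra characters are allowed.
    static boolean isAllowed(int codePoint) {
        boolean printableAscii = codePoint >= 0x20 && codePoint <= 0x7E;
        return printableAscii || EXTRA_ALLOWED.indexOf(codePoint) >= 0;
    }

    // Returns the first code point outside the whitelist, if there is one.
    static OptionalInt firstDisallowed(String text) {
        return text.codePoints().filter(cp -> !isAllowed(cp)).findFirst();
    }
}
```

A whitelist like this also catches "just an incorrect character" cases that a U+FFFD blacklist misses, at the cost of having to list every legitimate non-ASCII character up front.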


You can write Checkstyle and PMD extensions in Java that walk the AST and discover things. The problem is that, by then, the code will already have been converted from some encoding to Unicode. That blot character is a particular Unicode character (U+FFFD, the replacement character) used to substitute for characters that can't be mapped in the current encoding, so you could look for it. It won't help you if the encoding confusion results in a ? or just an incorrect character. It may also be challenging to get Sonar to apply your custom rules.
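Because the AST only ever sees text that has already been decoded, one alternative is to check the raw bytes of each source file before any tool reads them. The following is only a sketch using the standard java.nio.charset API; the class name Utf8Check and its method are made up for illustration, and how it would be wired into Jenkins or a commit hook is left open.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of raw file content.
final class Utf8Check {

    // True if the bytes decode as strict UTF-8 and the decoded text
    // does not already contain a U+FFFD replacement character.
    static boolean isCleanUtf8(byte[] bytes) {
        try {
            String text = StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of
                    // silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return text.indexOf('\uFFFD') < 0;
        } catch (CharacterCodingException e) {
            return false; // not valid UTF-8
        }
    }
}
```

The bytes could come from java.nio.file.Files.readAllBytes for each .java file. A file saved as ISO-8859-1 with an "ü" in it will usually fail this check, since a lone 0xFC byte is not valid UTF-8, but some byte sequences from other encodings happen to be valid UTF-8 and would slip through.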


Here is the same concept as palacsint's answer, but in XPath:

  Blacklist: flag any string that contains X or Y
  //Literal[matches(@Image,"[XY]")]

  Whitelist: flag any string that does not match X or Y
  //Literal[not(matches(@Image,"[XY]"))]

  Blacklist: flag any string that contains X, using the Unicode escape
  //Literal[matches(@Image,"[\u0058]")]

Using XPath may be a lot more concise than doing it in Java.

Here are some tutorials on writing custom PMD rules with XPath, in case you or someone else reading this answer is not familiar with them.

http://www.techtraits.ca/custom-pmd-rules-using-xpath/

http://blog.code-cop.org/2010/05/custom-pmd-rules.html
