Tidy breaks links with non-Latin characters

2022-12-30 13:20

I use the Java library Tidy to sanitize HTML. Some of the code contains links with Russian letters, for example:

<a href="http://example.com/Русский">link with Russian letters</a>

I understand that "Русский" must be escaped, but I get this HTML from users, and my job is to convert it to XHTML.

I think Tidy tries to escape the non-Latin letters, but as a result I get:

<a href="http://example.com/%420%443%441%441%43A%438%439">link with Russian letters</a>

This is not correct. The correct version is:

<a href="http://example.com/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9">link with Russian letters</a>

The Java code is:

private static Tidy tidy; // shared instance, created lazily by getTidy()

private static Tidy getTidy() {
    if (null == tidy) {
      tidy = new Tidy();
      tidy.setQuiet(true);
      tidy.setShowErrors(0);
      tidy.setShowWarnings(false);
      tidy.setXHTML(true);
      tidy.setOutputEncoding("UTF-8");
    }
    return tidy;
}

public static String sanitizeHtml(String html, URI pageUri) {
    boolean escapeMedia = false;
    String ret = "";
    try {
      Document doc = getTidy().parseDOM(new StringReader("<body>" + html + "</body>"), null);

      // here I make some processing

      // string output
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Node node = doc.getElementsByTagName("body").item(0);
      getTidy().pprint(node, out);
      // decode with the same encoding Tidy was told to write, not the platform default
      ret = out.toString("UTF-8").trim();
    }
    catch (Exception e) {
      ret = html;
      e.printStackTrace();
    }

    return ret;
}

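This behaviour is hard-coded and is probably a bug. When Tidy escapes non-ASCII characters in a URL, it writes each character's UTF-16 code unit in hex (Р, U+0420, becomes %420) when it should percent-encode the character's UTF-8 bytes (%D0%A0). See org/w3c/tidy/AttrCheckImpl.java.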
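
To make the difference concrete, here is a minimal sketch in plain Java (no Tidy dependency). The class and method names (UrlEscapeDemo, escapeByCodeUnit, escapeAsUtf8) are made up for illustration, and escapeByCodeUnit is only a guess at the logic that would produce the output shown in the question, not a copy of Tidy's source. One possible workaround, under that assumption, is to run something like escapeAsUtf8 over the href values before the HTML reaches Tidy, so that Tidy only ever sees ASCII URLs.

```java
import java.nio.charset.StandardCharsets;

final class UrlEscapeDemo {

    // Assumed reconstruction of the faulty behaviour: every char above 0x7E is
    // written as '%' followed by the hex value of its UTF-16 code unit.
    static String escapeByCodeUnit(String url) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c > 0x7e) {
                sb.append('%').append(Integer.toHexString(c).toUpperCase());
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Percent-encodes the UTF-8 bytes of every code point above 0x7E and
    // leaves everything else (including existing %XX escapes) untouched.
    static String escapeAsUtf8(String url) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs as one unit
            if (cp > 0x7e) {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append(String.format("%%%02X", b & 0xff));
                }
            } else {
                sb.append((char) cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://example.com/Русский";
        System.out.println(escapeByCodeUnit(url)); // the form Tidy produced
        System.out.println(escapeAsUtf8(url));     // the form the question expects
    }
}
```

For the URL from the question, the first method reproduces the broken %420%443... form and the second yields the %D0%A0%D1%83... form given as correct.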
