How do I handle Unicode user input in Scala safely (esp. XML entities)?

2022-12-15 18:22

On my website I have a form that takes some textual user input. Everything works fine for "normal" characters. However, when Unicode characters are entered... well, the plot thickens.

The user enters something like:

やっぱ死にかけてる？

This arrives at the server as text containing XML entity refs:

&#12420;&#12387;&#12401;&#27515;&#12395;&#12363;&#12369;&#12390;&#12427;&#65311;

Now, when I want to serve this back to the client in HTML, how do I do it?

If I simply output the string as it is, there is a risk of a script injection attack. If I try to encode it with scala.xml.Text, it gets converted to:

&amp;#12420;&amp;#12387;&amp;#12401;&amp;#27515;&amp;#12395;&amp;#12363;&amp;#12369;&amp;#12390;&amp;#12427;&amp;#65311;


Is there a better ready-made solution in Scala that can detect entity refs and leave them unescaped, yet still escape XML tags?

Parse the string containing the entity references as a fragment of XML. Then, to output the Unicode characters safely in XML, you can be paranoid and emit numeric entity references for them, as the escape function below does:

scala> import xml.parsing.ConstructingParser
import xml.parsing.ConstructingParser

scala> import io.Source
import io.Source

scala> val d = ConstructingParser.fromSource(Source.fromString("<dummy>&#12420;</dummy>"), true).document()
d: scala.xml.Document = <dummy>や</dummy>

scala> val t = d(0).text
t: String = や

scala> import xml._
import xml._

scala> def escape(xmlText: String): NodeSeq = {
     |   def escapeChar(c: Char): xml.Node =
     |     if (c > 0x7F || Character.isISOControl(c))
     |       xml.EntityRef("#" + Integer.toString(c, 10))
     |     else
     |       xml.Text(c.toString)
     | 
     |   new xml.Group(xmlText.map(escapeChar(_)))
     | }
escape: (xmlText: String)scala.xml.NodeSeq

scala> <foo>{escape(t)}</foo>                            
res3: scala.xml.Elem = <foo>&#12420;</foo>

OK, I am trying this simple hack. Comments welcome:

def secureEscape(text: String): String = {
  val s = new StringBuilder()
  // Iterate over the characters directly; the old text.elements iterator
  // was removed from the standard library.
  for (c <- text) c match {
    case '<' => s.append("&lt;")
    case '>' => s.append("&gt;")
    case _   => s.append(c)
  }
  s.toString
}

This escapes only < and >. The & character is left alone, so the entity refs pass through intact.

I run the incoming form input through this function and then serve the result to the client without further processing.
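What the question asks for (leave numeric entity refs alone, escape everything else) could be sketched roughly as follows. The names EntityAwareEscape and escapeKeepingRefs are made up for illustration. The sketch assumes the result is placed in HTML element text or a quoted attribute value, not in a script or URL context.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of any library.
object EntityAwareEscape {
  // Decimal (&#12420;) or hexadecimal (&#x3084;) numeric character references.
  private val NumericRef: Regex =
    """&#(?:[0-9]{1,7}|[xX][0-9a-fA-F]{1,6});""".r

  // Escape the characters that are special in HTML text and attribute values.
  private def escapePlain(s: String): String = {
    val sb = new StringBuilder
    for (c <- s) c match {
      case '&'  => sb.append("&amp;")
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&#39;")
      case _    => sb.append(c)
    }
    sb.toString
  }

  // Copy numeric references through unchanged; escape the text between them.
  def escapeKeepingRefs(text: String): String = {
    val out = new StringBuilder
    var last = 0
    for (m <- NumericRef.findAllMatchIn(text)) {
      out.append(escapePlain(text.substring(last, m.start)))
      out.append(m.matched)
      last = m.end
    }
    out.append(escapePlain(text.substring(last)))
    out.toString
  }
}
```

Unlike secureEscape, this also escapes stray ampersands and quotes, so something like &bogus; cannot slip through as markup. A reference such as &#60; is kept as-is, which the browser renders as a literal < character, not as a tag.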

Really, the browser should be responsible for the correct UTF-8 encoding and escaping of characters (this appears to be happening). Your web framework should then handle the unescaping and decoding.

This can be a tricky business with several steps involved, all of which may have to be explicitly configured for correct UTF-8 operation, especially when working with older frameworks and servers, caching proxies, content delivery networks, etc.

The point is that, internally, you want to see the expected Unicode characters, not the entity refs. Likewise, you should output native Unicode and handle any required encoding at the boundary of your system. Preferably, your web framework will handle this automatically.
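If the framework does not do this for you, decoding at the input boundary might look something like the sketch below. The names DecodeRefs and decodeNumericRefs are hypothetical, and the sketch assumes only numeric references need handling (named entities such as &amp; are left untouched).

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of any library.
object DecodeRefs {
  // Group 1 captures a decimal code point, group 2 a hexadecimal one.
  private val NumericRef: Regex =
    """&#(?:([0-9]{1,7})|[xX]([0-9a-fA-F]{1,6}));""".r

  // Replace each numeric character reference with the character it names.
  def decodeNumericRefs(text: String): String =
    NumericRef.replaceAllIn(text, m => {
      val cp =
        if (m.group(1) != null) m.group(1).toInt
        else Integer.parseInt(m.group(2), 16)
      val isSurrogate = cp >= 0xD800 && cp <= 0xDFFF
      val decoded =
        if (cp != 0 && !isSurrogate && Character.isValidCodePoint(cp))
          new String(Character.toChars(cp))
        else m.matched // leave references to unusable code points as they are
      // Quote the result so '$' and '\' are not treated as regex replacement syntax.
      Regex.quoteReplacement(decoded)
    })
}
```

After decoding, the string holds ordinary characters, including any < that arrived as &#60;, so it still has to be escaped when written back into HTML.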

To give you the correct solution, it's necessary to know what software stack(s) you're using and how the form is being submitted (e.g. GET/POST/AJAX+JSON).

Browsers only encode input characters as numeric character references when the character is outside the character set the page was served in. Save yourself a lot of trouble and serve your pages in UTF-8, properly labeled as UTF-8. Scala, Java and JavaScript string processing is all in Unicode, and restricting your web pages to ISO-8859-1 invites conversion problems like this in all directions. If your existing content is ASCII, the conversion should be painless.
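The behaviour described above can be imitated on the JVM to see where the entity refs come from. This is only a rough model of what a browser does on form submission, and the names SubmissionModel and submitAs are invented for the illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical model, not an actual browser algorithm implementation.
object SubmissionModel {
  // Keep code points the charset can encode; replace the rest with
  // decimal numeric character references.
  def submitAs(text: String, charset: Charset): String = {
    val encoder = charset.newEncoder()
    val sb = new StringBuilder
    var i = 0
    while (i < text.length) {
      val cp = text.codePointAt(i)
      val s = new String(Character.toChars(cp))
      if (encoder.canEncode(s)) sb.append(s)
      else sb.append("&#").append(cp).append(';')
      i += Character.charCount(cp)
    }
    sb.toString
  }
}
```

With StandardCharsets.ISO_8859_1, a character such as や falls outside the charset and comes out as &#12420;, while with StandardCharsets.UTF_8 the text passes through unchanged.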
