How do I handle Unicode user input in Scala safely (especially XML entities)?
On my website I have a form that takes in some textual user input. All works fine for "normal" characters. However, when Unicode characters are input... well, the plot thickens.
User inputs something like
This comes in to the server as text containing XML entity refs
Now, when I want to serve this back to the client in HTML, how do I do it?
If I simply output the string as it is, there could be a chance for a script attack. If I try to encode it with scala.xml.Text
it gets converted to:
Is there a better ready-made solution in Scala which can detect entity refs and not escape them, yet escape XML tags?
Parse the string containing entity references as a fragment of XML. To safely output the Unicode characters in XML, you can be paranoid and use XML entity references for them, as the function escape below does:
scala> import xml.parsing.ConstructingParser
import xml.parsing.ConstructingParser
scala> import io.Source
import io.Source
scala> val d = ConstructingParser.fromSource(Source.fromString("<dummy>&#x3084;</dummy>"), true).document
d: scala.xml.Document = <dummy>や</dummy>
scala> val t = d(0).text
t: String = や
scala> import xml._
import xml._
scala> def escape(xmlText: String): NodeSeq = {
|   def escapeChar(c: Char): xml.Node =
|     if (c > 0x7F || Character.isISOControl(c))
|       xml.EntityRef("#" + Integer.toString(c, 10))
|     else
|       xml.Text(c.toString)
|   new xml.Group(xmlText.map(escapeChar(_)))
| }
escape: (xmlText: String)scala.xml.NodeSeq
scala> <foo>{escape(t)}</foo>
res3: scala.xml.Elem = <foo>&#12420;</foo>
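One caveat with escaping one Char at a time: a character outside the Basic Multilingual Plane is stored as a surrogate pair, so it would be emitted as two separate (invalid) references. A minimal sketch of a code-point-aware variant follows. It works on plain strings rather than scala.xml nodes, and the names NumericRefEscape and escapeToAscii are made up for illustration, not part of any library.

```scala
// Hypothetical helper (not from scala.xml): escapes markup characters and
// turns every non-ASCII or control code point into one numeric reference.
object NumericRefEscape {
  def escapeToAscii(text: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < text.length) {
      // Read a full code point so surrogate pairs stay together
      val cp = text.codePointAt(i)
      if (cp == '<') out.append("&lt;")
      else if (cp == '>') out.append("&gt;")
      else if (cp == '&') out.append("&amp;")
      else if (cp == '"') out.append("&quot;")
      else if (cp > 0x7F || Character.isISOControl(cp))
        out.append("&#").append(cp).append(';') // decimal reference
      else out.append(cp.toChar)                // printable ASCII passes through
      i += Character.charCount(cp)
    }
    out.toString
  }
}
```

Note that this treats the input as raw text, so any & already in it is escaped too. It is meant for text that has been decoded first, as with d(0).text above.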
Ok, I am trying this simple hack. Comments welcome:
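def secureEscape(text: String): String = {
  val s = new StringBuilder()
  // Escape only the tag delimiters; everything else passes through unchanged
  for (c <- text) c match {
    case '<' => s.append("&lt;")
    case '>' => s.append("&gt;")
    case _ => s.append(c)
  }
  s.toString
}
This will basically escape < and > (as &lt; and &gt;).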
I then use this function to parse the incoming form input and then dish it out without further processing to the client.
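A slightly stricter take on the same idea, as a rough sketch only: besides < and >, it also escapes quotes and any & that does not start a well-formed numeric character reference, so references such as &#x3084; survive while stray ampersands do not. The names PreserveRefsEscape and escapeKeepingRefs are hypothetical, and the choice to preserve only numeric references (so a named entity like &lt; in the input gets escaped again) is an assumption about what the form actually receives.

```scala
// Hypothetical helper: escapes markup but leaves numeric character
// references (decimal or hex) in the input untouched.
object PreserveRefsEscape {
  def escapeKeepingRefs(text: String): String =
    text
      // Escape '&' first, unless it begins something like &#12420; or &#x3084;
      .replaceAll("&(?!#[0-9]{1,7};|#[xX][0-9a-fA-F]{1,6};)", "&amp;")
      .replace("<", "&lt;")
      .replace(">", "&gt;")
      .replace("\"", "&quot;")
      .replace("'", "&#39;")
}
```

For example, escapeKeepingRefs("<b>&#x3084;</b> & co") gives "&lt;b&gt;&#x3084;&lt;/b&gt; &amp; co". A preserved reference such as &#60; is rendered by the browser as a literal character in text content, not parsed as markup.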
Really, the browser should be responsible for the correct UTF-8 encoding and escaping of characters (this appears to be happening). Your web framework should then handle the unescaping and decoding.
This can be a tricky business, with several steps involved, all of which may have to be explicitly configured for correct UTF-8 operation, especially when working with older frameworks and servers, caching proxies, content delivery networks, etc.
The point being that, internally, you want to be seeing the expected Unicode characters, not the entity refs. Likewise, you should be outputting native Unicode and handling any required encoding at the boundary of your system. Preferably this will be handled automatically by your choice of web framework.
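If the framework does not do this for you, the "decode at the boundary" step could look roughly like the sketch below, which turns numeric character references in incoming text back into the characters they stand for. DecodeRefs and decodeNumericRefs are illustrative names, not part of any framework, and the sketch deliberately ignores named entities.

```scala
import java.util.regex.Matcher
import scala.util.matching.Regex

// Hypothetical helper: decodes &#NNN; and &#xHHH; into real characters.
object DecodeRefs {
  // Group 1 captures hex digits, group 2 captures decimal digits
  private val NumericRef: Regex =
    "&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));".r

  def decodeNumericRefs(text: String): String =
    NumericRef.replaceAllIn(text, m => {
      val cp =
        if (m.group(1) != null) Integer.parseInt(m.group(1), 16)
        else Integer.parseInt(m.group(2), 10)
      // Leave out-of-range values and lone surrogates as they were
      val valid = Character.isValidCodePoint(cp) && (cp < 0xD800 || cp > 0xDFFF)
      val decoded = if (valid) new String(Character.toChars(cp)) else m.matched
      // Quote so '$' and '\' in the result are not read as group references
      Matcher.quoteReplacement(decoded)
    })
}
```

The result is raw text, so it still has to be HTML-escaped when it is written back into a page.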
In order to give you the correct solution, it's necessary to know what software stack(s) you're using and how the form is being submitted (i.e. GET/POST/AJAX+JSON).
Browsers only encode input characters as numeric character references when the character is outside the character set the page was served in. Save yourself a lot of trouble and serve your pages in UTF-8, properly tagged as UTF-8. Scala, Java and JavaScript string processing is all in Unicode, and restricting your web pages to ISO-8859-1 invites conversion problems like this in all directions. If your existing content is ASCII, the conversion should be painless.
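To see why the page charset matters, here is a small sketch using the JDK's charset encoders to check whether a string can be represented in a given charset at all. CharsetCheck and its method names are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper: checks whether text fits a charset without loss.
object CharsetCheck {
  def encodableIn(text: String, charset: Charset): Boolean =
    charset.newEncoder().canEncode(text)

  // Encodes the text as UTF-8 bytes, which can represent any Unicode text
  def utf8Bytes(text: String): Array[Byte] =
    text.getBytes(StandardCharsets.UTF_8)
}
```

With this, encodableIn("や", StandardCharsets.ISO_8859_1) is false while the UTF-8 check is true, which matches the situation where a browser has to fall back to a numeric reference on an ISO-8859-1 page.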