Random text from /dev/random raises an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes
To test my web app, I am pasting random characters from /dev/random into my web frontend. The html5lib.parse call below throws an error; the repr of the input and the full traceback follow:
print repr(comment)
import html5lib
print html5lib.parse(comment, treebuilder="lxml")
'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'
Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
result = g.send(result)
File "/home/work/random/social/social/item.py", line 389, in _new
convId, conv = yield plugin.create(request)
File "/home/work/random/social/social/logging.py", line 47, in wrapper
ret = func(*args, **kwargs)
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 1014, in unwindGenerator
return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
--- <exception caught here> ---
File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
result = g.send(result)
File "/home/work/random/social/twisted/plugins/status.py", line 63, in create
print html5lib.parse(comment, treebuilder="lxml")
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 38, in parse
return p.parse(doc, encoding=encoding)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 211, in parse
parseMeta=parseMeta, useChardet=useChardet)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 111, in _parse
self.mainLoop()
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 174, in mainLoop
self.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 572, in processCharacters
self.parser.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 611, in processCharacters
self.parser.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 652, in processCharacters
self.parser.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 711, in processCharacters
self.parser.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 804, in processCharacters
self.parser.phase.processCharacters(token)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 948, in processCharacters
self.tree.insertText(token["data"])
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/_base.py", line 288, in insertText
parent.insertText(data)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
builder.Element.insertText(self, data, insertBefore)
File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree.py", line 114, in insertText
self._element.text += data
File "lxml.etree.pyx", line 821, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:33308)
File "apihelpers.pxi", line 646, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:15287)
File "apihelpers.pxi", line 1295, in lxml.etree._utf8 (src/lxml/lxml.etree.c:20212)
exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes
Before committing a user-entered string, I do this:
comment.decode('utf-8').encode('utf-8', "replace")
but this does not seem to help in this case.
-- Abhi
The problem is that text in XML cannot include certain characters, mainly control characters with code points below 32 (0x20) other than tab, line feed, and carriage return. The XML 1.0 Recommendation defines a Char as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
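/dev/random can produce bytes that fall outside this set, e.g. control characters (the sample above contains \x14 and \x17) and some multi-byte characters.
So you have to filter these characters out before handing the string to html5lib/lxml. Decoding and re-encoding as UTF-8 does not remove them, because a control character such as \x14 is still valid UTF-8.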
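As a rough illustration of that filtering step, here is a minimal sketch that strips every character not allowed by the Char production quoted above. The helper name strip_invalid_xml_chars and its replacement parameter are made up for this example, and the code is written for Python 3; on the Python 2.6 setup in the question the same idea should apply, but the astral range in the regex would need a wide Unicode build.

```python
import re

# Matches every code point that the XML 1.0 Char production does NOT allow:
# anything other than tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    u"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(data, replacement=u""):
    """Hypothetical helper: return text with XML-incompatible characters removed."""
    if isinstance(data, bytes):
        # Undecodable byte sequences become U+FFFD, which XML does allow.
        data = data.decode("utf-8", "replace")
    return _INVALID_XML_CHARS.sub(replacement, data)


# A shortened version of the sample from the question, including \x14 and \x17.
comment = b"a\xef\xbf\xbd2 \x14\xc7\xbey\x17"
cleaned = strip_invalid_xml_chars(comment)
```

The cleaned text can then be passed to html5lib.parse(cleaned, treebuilder="lxml") in place of the raw comment. Dropping the characters silently is a design choice; passing a visible replacement such as u"\ufffd" instead keeps a marker where input was removed.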