
Issue parsing an XHTML page using Python

Hello, I am trying to parse an XHTML page with Python, but I get this error:

**xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0**

[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] mod_wsgi (pid=9156): Exception occurred processing WSGI script '/home/hidura/webapps/karinapp/Suite/Gate.py'.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] Traceback (most recent call last):
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/Gate.py", line 32, in application
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     response = assistant(buildReq.extrctEnv(environ, location))#Here the assistant takes the parameters and begins the work
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 114, in __init__
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     self.websearch()#Finding the web.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 364, in websearch
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     websource = self.manage.string2parse(result[0][1])#Transforming the web page into a tree.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Writer/tagsmanip.py", line 56, in string2parse
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     self.doc = parseString(newData)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/minidom.py", line 1937, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     return expatbuilder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 940, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     return builder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 223, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     parser.Parse(string, True)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0

This is the code of the page:

<HTML xmlns:fb="http://www.facebook.com/2008/fbml"><HEAD><TITLE id="ttl">KarinApp(Karina application web maker)</TITLE><LINK id="css_front_1" type="text/css" href="http://www.karinapp.com/modules/front/css/main.css" rel="stylesheet"/><SCRIPT type="text/javascript" id="jQuery-front" src="/modules/general/scripts/jQuery.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="gnrlScrpt" src="/modules/general/scripts/general.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="ctchScrpt" src="/modules/general/scripts/Catcher.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdloadScr" src="/modules/general/scripts/loadPage.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdLoader">window.onload = function(){postLoad();
        }
function __init__(){main();}</SCRIPT><LINK id="link1" href="/modules/front/css/jquery-ui-1.8.10.custom.css" type="text/css" rel="stylesheet"/><SCRIPT id="script5" src="/modules/front/scripts/ui/jquery.ui.core.js"><!--empty--></SCRIPT><SCRIPT id="script6" src="/modules/front/scripts/ui/jquery.ui.widget.js"><!--empty--></SCRIPT><SCRIPT id="script8" src="/modules/front/scripts/ui/jquery.ui.button.js"><!--empty--></SCRIPT><SCRIPT id="script10" src="/modules/front/scripts/main.js"><!--empty--></SCRIPT><SCRIPT id="script9"><!--empty--></SCRIPT><SCRIPT id="script11" type="text/javascript" src="http://connect.facebook.net/en_US/all.js#appId=150388711687556&amp;amp;xfbml=1"><!--empty--></SCRIPT></HEAD><BODY id="body"><IMG id="logo" father="@body" src="/modules/front/image/logo.png"/><DIV id="comments" father="@body"><!--Comment--><DIV id="fbK" father="@comments"><IFRAME src="http://www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fpages%2FKarinapp%2F150388711687556&amp;width=295&amp;colorscheme=light&amp;show_faces=false&amp;stream=true&amp;header=false&amp;height=300" scrolling="no" frameborder="1" style="border:none; overflow:hidden; width:295px; height:300px;" allowtransparency="false">&amp;lt;!--empty--&amp;gt;</IFRAME>

<LIKE-BOX href="http://www.facebook.com/pages/Karinapp/150388711687556" width="295" show_faces="false" stream="true" header="false"><!--empty--></LIKE-BOX></DIV></DIV><DIV id="head" father="@body"><!--Comment--></DIV><A id="fb" father="@body" href="http://www.facebook.com/karinapp#!/pages/Karinapp/150388711687556" border="0"><IMG src="/modules/front/image/fb.png" father="@fb"/></A><A id="tw" father="@body" href="http://www.twitter.com/#!/karinappm" border="0"><IMG src="/modules/front/image/tw.png" father="@tw"/></A><DIV id="div4" father="@body"><DIV id="fb-root"><!--empty--></DIV>
<FB:LOGIN-BUTTON xmlns:fb="http://www.facebook.com/2008/fbml" show-faces="true" width="250" max-rows="1"/></DIV></BODY></HTML>

Thanks in advance!


The problem is that the document binds the namespace to the lowercase prefix fb (via xmlns:fb), but the tag is written as FB:LOGIN-BUTTON. XML is case-sensitive, so expat treats FB as a different prefix that was never declared and reports it as unbound. The XHTML specification likewise requires all element and attribute names to be lowercase.
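The mismatch can be reproduced with a two-element document. This is a minimal sketch, not the page from the question: the markup strings and the try_parse helper are made up for illustration, and only the namespace URI is taken from the original page.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

FBML_NS = "http://www.facebook.com/2008/fbml"

# Prefix declared as lowercase "fb" but used as uppercase "FB".
mismatched = '<HTML xmlns:fb="%s"><FB:LOGIN-BUTTON/></HTML>' % FBML_NS
# Same document with the prefix written consistently.
consistent = '<HTML xmlns:fb="%s"><fb:LOGIN-BUTTON/></HTML>' % FBML_NS


def try_parse(markup):
    """Return the root tag name, or the expat error message on failure.

    Hypothetical helper, used only for this demonstration.
    """
    try:
        return parseString(markup).documentElement.tagName
    except ExpatError as exc:
        return str(exc)


mismatched_result = try_parse(mismatched)
consistent_result = try_parse(consistent)
print(mismatched_result)   # an "unbound prefix" message
print(consistent_result)   # the root tag name
```

Only the case of the prefix differs between the two strings, and the namespace URI is never fetched in either case; it is just an identifier.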

I tried your document using the lxml XML parser and it auto-converted the prefixes to lowercase. Perhaps you can switch to a different parser:

import lxml.etree

# Read the page as bytes so lxml can handle the encoding itself.
with open('fb.xhtml', 'rb') as f:
    data = f.read()
tree = lxml.etree.fromstring(data)
ns_map = {'fb': 'http://www.facebook.com/2008/fbml'}
print(tree.xpath('.//fb:LOGIN-BUTTON', namespaces=ns_map))

Output:

[<Element {http://www.facebook.com/2008/fbml}LOGIN-BUTTON at 1011fa260>]
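If switching parsers is not an option, another route is to lowercase the prefix on element names before handing the markup to minidom. The following is a rough sketch under that assumption: lowercase_tag_prefixes and its regular expression are hypothetical and not part of any library, and the sample markup is a shortened stand-in for the real page.

```python
import re
from xml.dom.minidom import parseString

# Matches the start of an opening or closing tag that carries a prefix,
# for example "<FB:" or "</FB:".
PREFIXED_TAG = re.compile(r"(</?)([A-Za-z_][\w.-]*):")


def lowercase_tag_prefixes(markup):
    """Lowercase namespace prefixes on element names only (hypothetical helper)."""
    return PREFIXED_TAG.sub(
        lambda m: m.group(1) + m.group(2).lower() + ":", markup)


raw = ('<HTML xmlns:fb="http://www.facebook.com/2008/fbml">'
       '<FB:LOGIN-BUTTON width="250"/></HTML>')

doc = parseString(lowercase_tag_prefixes(raw))
button = doc.documentElement.firstChild
print(button.tagName, button.namespaceURI)
```

A regular expression does not understand XML, so this would also rewrite anything inside comments or CDATA sections that happens to look like a prefixed tag. Fixing the code that generates the uppercase tag is the cleaner solution.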


I think the problem is that http://www.facebook.com/2008/fbml returns a "page not found" error.
