Issue parsing an XHTML page using Python
Hello, I am trying to parse an XHTML page with Python, but I receive this error:
**xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0**
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] mod_wsgi (pid=9156): Exception occurred processing WSGI script '/home/hidura/webapps/karinapp/Suite/Gate.py'.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] Traceback (most recent call last):
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/home/hidura/webapps/karinapp/Suite/Gate.py", line 32, in application
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] response = assistant(buildReq.extrctEnv(environ, location))#Here the assistant takes the parameters and begins the work
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 114, in __init__
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] self.websearch()#Finding the web.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 364, in websearch
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] websource = self.manage.string2parse(result[0][1])#Transforming the web page into a tree.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/home/hidura/webapps/karinapp/Suite/wsgi/Writer/tagsmanip.py", line 56, in string2parse
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] self.doc = parseString(newData)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/usr/local/lib/python3.1/xml/dom/minidom.py", line 1937, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] return expatbuilder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 940, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] return builder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 223, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] parser.Parse(string, True)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0
This is the code of the page:
<HTML xmlns:fb="http://www.facebook.com/2008/fbml"><HEAD><TITLE id="ttl">KarinApp(Karina application web maker)</TITLE><LINK id="css_front_1" type="text/css" href="http://www.karinapp.com/modules/front/css/main.css" rel="stylesheet"/><SCRIPT type="text/javascript" id="jQuery-front" src="/modules/general/scripts/jQuery.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="gnrlScrpt" src="/modules/general/scripts/general.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="ctchScrpt" src="/modules/general/scripts/Catcher.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdloadScr" src="/modules/general/scripts/loadPage.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdLoader">window.onload = function(){postLoad();
}
function __init__(){main();}</SCRIPT><LINK id="link1" href="/modules/front/css/jquery-ui-1.8.10.custom.css" type="text/css" rel="stylesheet"/><SCRIPT id="script5" src="/modules/front/scripts/ui/jquery.ui.core.js"><!--empty--></SCRIPT><SCRIPT id="script6" src="/modules/front/scripts/ui/jquery.ui.widget.js"><!--empty--></SCRIPT><SCRIPT id="script8" src="/modules/front/scripts/ui/jquery.ui.button.js"><!--empty--></SCRIPT><SCRIPT id="script10" src="/modules/front/scripts/main.js"><!--empty--></SCRIPT><SCRIPT id="script9"><!--empty--></SCRIPT><SCRIPT id="script11" type="text/javascript" src="http://connect.facebook.net/en_US/all.js#appId=150388711687556&amp;xfbml=1"><!--empty--></SCRIPT></HEAD><BODY id="body"><IMG id="logo" father="@body" src="/modules/front/image/logo.png"/><DIV id="comments" father="@body"><!--Comment--><DIV id="fbK" father="@comments"><IFRAME src="http://www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fpages%2FKarinapp%2F150388711687556&width=295&colorscheme=light&show_faces=false&stream=true&header=false&height=300" scrolling="no" frameborder="1" style="border:none; overflow:hidden; width:295px; height:300px;" allowtransparency="false">&lt;!--empty--&gt;</IFRAME>
<LIKE-BOX href="http://www.facebook.com/pages/Karinapp/150388711687556" width="295" show_faces="false" stream="true" header="false"><!--empty--></LIKE-BOX></DIV></DIV><DIV id="head" father="@body"><!--Comment--></DIV><A id="fb" father="@body" href="http://www.facebook.com/karinapp#!/pages/Karinapp/150388711687556" border="0"><IMG src="/modules/front/image/fb.png" father="@fb"/></A><A id="tw" father="@body" href="http://www.twitter.com/#!/karinappm" border="0"><IMG src="/modules/front/image/tw.png" father="@tw"/></A><DIV id="div4" father="@body"><DIV id="fb-root"><!--empty--></DIV>
<FB:LOGIN-BUTTON xmlns:fb="http://www.facebook.com/2008/fbml" show-faces="true" width="250" max-rows="1"/></DIV></BODY></HTML>
Thanks in advance!
The problem is that the namespace declaration binds the lowercase prefix fb, but the tag is written as FB:LOGIN-BUTTON. XML is case-sensitive, so expat treats FB as a different prefix and reports it as unbound. The XHTML specification notes that all HTML elements and attributes must be lowercase for the same reason.
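The case mismatch can be reproduced in isolation with the same minidom call that appears in the traceback. This is only a minimal sketch: the two one-line documents and the try_parse helper are made up for illustration and are not taken from the page in the question.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

FBML_NS = "http://www.facebook.com/2008/fbml"


def try_parse(markup):
    """Return the root tag name if parsing succeeds, else the expat error text."""
    try:
        return parseString(markup).documentElement.tagName
    except ExpatError as exc:
        return "error: " + str(exc)


# Hypothetical documents: the declaration binds "fb" in both cases,
# only the case of the prefix used on the element differs.
bad = '<html xmlns:fb="%s"><FB:login-button/></html>' % FBML_NS
good = '<html xmlns:fb="%s"><fb:login-button/></html>' % FBML_NS

bad_result = try_parse(bad)
good_result = try_parse(good)
print(bad_result)
print(good_result)
```

The first call should report an unbound prefix error, while the second parses and returns the root tag name, which suggests that the prefix case alone is enough to trigger the failure.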
I tried your document using the lxml XML parser and it auto-converted the prefixes to lowercase. Perhaps you can switch to a different parser:
import lxml.etree

# Read the page as bytes so lxml can handle the encoding itself.
with open('fb.xhtml', 'rb') as f:
    data = f.read()
tree = lxml.etree.fromstring(data)
ns_map = {'fb': 'http://www.facebook.com/2008/fbml'}
print(tree.xpath('.//fb:LOGIN-BUTTON', namespaces=ns_map))
Output:
[<Element {http://www.facebook.com/2008/fbml}LOGIN-BUTTON at 1011fa260>]
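If switching parsers is not an option, another possibility is to lowercase the element names before handing the string to minidom. The following is a rough sketch under stated assumptions: the lowercase_tag_names helper, the TAG_NAME pattern, and the shortened sample string are hypothetical, and a regular expression like this can also touch text inside scripts that happens to look like a tag, so it is not a general-purpose fix.

```python
import re
from xml.dom.minidom import parseString

FBML_NS = "http://www.facebook.com/2008/fbml"

# Matches "<" or "</" followed by an element name (optionally prefixed).
# Comments ("<!--") and declarations are skipped because "!" is not a letter.
TAG_NAME = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9:_.-]*)")


def lowercase_tag_names(markup):
    """Lowercase every element name (and its prefix); attributes are left alone."""
    return TAG_NAME.sub(lambda m: "<" + m.group(1) + m.group(2).lower(), markup)


# Shortened, hypothetical stand-in for the page in the question.
sample = ('<HTML xmlns:fb="%s"><BODY id="body">'
          '<FB:LOGIN-BUTTON show-faces="true"/></BODY></HTML>' % FBML_NS)

doc = parseString(lowercase_tag_names(sample))
button = doc.getElementsByTagNameNS(FBML_NS, "login-button")[0]
print(button.tagName)
```

Because the prefix on the element now matches the declared xmlns:fb, expat can bind it, and the element can be looked up by namespace URI and local name.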
I think the problem is that http://www.facebook.com/2008/fbml returns a "not found" page.