Unable to read HTML data - Python

I am trying to parse HTML from a website using BeautifulSoup in Python. However, neither urllib2 nor mechanize returns the complete HTML. The returned data is:

<html>
<head>
    <title>
    EC 4.1.2.13 - Fructose-bisphosphate aldolase    </title>
    <meta name="description" content="Information on EC 4.1.2.13 - Fructose-bisphosphate aldolase">
    <meta name="keywords" content="EC,Number,Enzyme,Pathway,Reaction,Organism,Substrate,Cofactor,Inhibitor,Compound,KM Value,KI Value,IC50 Value,pi Value,Turnover Number,pH,Temperature,Optimum,Range,Source Tissue,BLAST,Subunits,Modification,Crystallization,Stability,Purification">
</head>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<frameset cols="190,*" border="0">
    <frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
    <frameset rows="110,*" border="0">
            <frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no">

        <frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">

    </frameset>
</frameset>
<noframes>
<body>
<h1>EC 4.1.2.13 - Fructose-bisphosphate aldolase </h1>

<a href="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475">More detailed information on the enzyme EC 4.1.2.13 - Fructose-bisphosphate aldolase</a>

Sorry, but your browser doesn't support frames. Please use another browser!
</body>
</noframes>
</html>

When I open the website manually in Internet Explorer, the whole HTML can be read. Is there any way to work around this using urllib2, mechanize, or BeautifulSoup?


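That's because the content is in the frames. The page you fetched is only a frameset; each <frame> loads a separate document, and urllib2 or mechanize will not follow those for you. You can either parse the page and look for the src attribute of the main <frame> element (here, the one named "flat"), or request the frame's URL directly. In most browsers, you can right-click inside the frame and choose something like "Frame Properties" to get its URL.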
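As a rough illustration of the first option, the sketch below collects the src of every <frame> tag and resolves it against the page's own URL. It uses only the Python 3 standard library so that it runs without extra packages; BeautifulSoup's find_all('frame') would serve the same purpose. The names FrameCollector and frame_urls, and the example.org base URL, are made up for this example, since the question does not state the real page address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the name and src of every <frame> tag in a frameset page."""

    def __init__(self):
        super().__init__()
        self.frames = {}  # maps frame name -> raw src attribute

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "frame":
            attr = dict(attrs)
            src = attr.get("src")
            if src:
                self.frames[attr.get("name", "")] = src


def frame_urls(page_html, base_url):
    """Return {frame name: absolute URL} for each frame in page_html.

    base_url is the address the frameset page was fetched from
    (hypothetical here; substitute the real one).
    """
    collector = FrameCollector()
    collector.feed(page_html)
    return {
        # The src values in the question contain raw spaces, so encode
        # them before joining with the base URL.
        name: urljoin(base_url, src.replace(" ", "%20"))
        for name, src in collector.frames.items()
    }


if __name__ == "__main__":
    sample = (
        '<frameset cols="190,*">'
        '<frame name="header" src="flat_head.php4?ecno=4.1.2.13">'
        '<frame name="flat" '
        'src="flat_result.php4?organism_list=Mycobacterium tuberculosis">'
        "</frameset>"
    )
    # Assumed base URL, for illustration only.
    for name, url in frame_urls(sample, "http://example.org/php/result.php4").items():
        print(name, url)
```

The URL found for the "flat" frame can then be fetched like any other page, for example with urllib2.urlopen on Python 2 or urllib.request.urlopen on Python 3, and the result passed to BeautifulSoup.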
