Python Parsing: Pull data from an HTML text file with a non-standard layout

Posted: 2023-02-07 01:28

I need help parsing an HTML text file. Its layout is one I'm not sure how to work through, and I could really use the help.

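Code so far:

import re
from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

# 'output.txt' holds the raw HTML of the saved page.
with open('output.txt', 'r') as urlfile:
    soup = BeautifulSoup(urlfile.read(), 'html.parser')

print(soup.prettify())

# The id belongs to a <span>, not a <table>, so match the country spans
# by id pattern (e.g. dgProducts__ctl12_lblCountry), one per data row.
spans = soup.find_all('span', id=re.compile(r'^dgProducts__ctl\d+_lblCountry$'))

for span in spans:
    tr = span.find_parent('tr')  # the table row that contains this span
    if tr is None:
        continue
    # Print every cell of the row, separated by "|".
    for td in tr.find_all('td'):
        print(td.get_text(strip=True) + "|", end=" ")
    print()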

What I'm Trying to Do: I'm looking to extract the data from the .html text file and have it presented in the following format:

Header Row:  Country Company Name  Company Product Name       Status
Data Row(s): 1        Ace           Desktop      Ace Vision    Gold

Abbreviated .html file data structure:

</tr><tr bgcolor="White">
  <td><font color="#330099" size="1">
         <span><font size="2">
           <input id="dgProducts__ctl12_ckCompare" type="checkbox" name="dgProducts:_ctl12:ckCompare" onclick="checkSelected(this.form, this);" />
           </font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblModel1"><font size="2">  
          <a href='ProductDisplay.aspx?return=pm&action=view&search=true&productid=4592&ProductType=1&epeatcountryid=1'>Ace Vision 7HS</a></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblCountry">United States</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblProductCategory1"><font size="2">Desktops</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblRating1"><font size="2">Gold</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblPoints1">18</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblEnergyStar">5.0</span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblMonitorType1"><font size="2"></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblMonitorSize"><font size="2"></font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblListingDate1"><font size="2">3/16/2010</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblStatus"><font size="2">Active</font></span>
        </font></td><td><font color="#330099" size="1">
         <span id="dgProducts__ctl12_lblExceptions" align="center"><a href='#' onclick=ShowExceptions('Exceptions.aspx?id=4592');>    
          <img src='http://www.epeat.net/Images/inform.gif' title='Click to view exceptions' alt='Click to view exceptions' border='0'></a></span>
        </font></td>

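I'd recommend trying the standard-library module xml.dom.minidom. It makes it easy to parse XML, and it can handle HTML as long as the markup is well-formed (XHTML-style). The fragment above has unquoted attribute values and bare & characters, so it would need cleaning up first.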
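
For markup like the fragment in the question, which is not well-formed XML, a tolerant parser may be easier to work with. Here is a minimal sketch using the standard library's html.parser, which collects the text of every span whose id looks like dgProducts__ctlNN_lblField. The names ProductSpanParser, extract_rows, FIELDS and SAMPLE are made up for this illustration, SAMPLE is shortened from the markup in the question, and the choice of columns is only a guess at the layout the asker wants.

```python
import re
from html.parser import HTMLParser

# Matches ids such as "dgProducts__ctl12_lblCountry" -> row "12", field "Country".
SPAN_ID = re.compile(r"^dgProducts__ctl(\d+)_lbl(\w+)$")


class ProductSpanParser(HTMLParser):
    """Collects the text inside labelled <span> elements, grouped by row number."""

    def __init__(self):
        super().__init__()
        self.rows = {}        # row number -> {field name: text}
        self._current = None  # (row, field) of the labelled span being read
        self._depth = 0       # <span> nesting depth inside that labelled span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1  # a nested span inside a labelled one
            return
        match = SPAN_ID.match(dict(attrs).get("id") or "")
        if match:
            row, field = match.groups()
            self._current = (row, field)
            self._depth = 1
            self.rows.setdefault(row, {})[field] = ""

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            row, field = self._current
            self.rows[row][field] += data


def extract_rows(html, fields):
    """Return one list of stripped field values per row found in the HTML."""
    parser = ProductSpanParser()
    parser.feed(html)
    parser.close()
    return [[values.get(f, "").strip() for f in fields]
            for values in parser.rows.values()]


# Assumed column selection; the field names are the id suffixes in the markup.
FIELDS = ["Country", "Model1", "ProductCategory1", "Rating1", "Status"]

# Shortened version of the fragment in the question.
SAMPLE = """
</tr><tr bgcolor="White">
  <td><font color="#330099" size="1">
    <span id="dgProducts__ctl12_lblModel1"><font size="2">
      <a href='ProductDisplay.aspx?productid=4592'>Ace Vision 7HS</a></font></span>
  </font></td><td><font color="#330099" size="1">
    <span id="dgProducts__ctl12_lblCountry">United States</span>
  </font></td><td><font color="#330099" size="1">
    <span id="dgProducts__ctl12_lblProductCategory1"><font size="2">Desktops</font></span>
  </font></td><td><font color="#330099" size="1">
    <span id="dgProducts__ctl12_lblRating1"><font size="2">Gold</font></span>
  </font></td><td><font color="#330099" size="1">
    <span id="dgProducts__ctl12_lblStatus"><font size="2">Active</font></span>
  </font></td>
"""

print(" | ".join(FIELDS))
for row in extract_rows(SAMPLE, FIELDS):
    print(" | ".join(row))  # United States | Ace Vision 7HS | Desktops | Gold | Active
```

To run it against the real file, pass the contents of output.txt to extract_rows in place of SAMPLE. Matching on the span ids rather than on cell position means the empty cells (monitor type, monitor size) and the checkbox cell don't shift the columns.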

Tags: html-parsing, python, text, text-parsing
