Python BeautifulSoup: automatically tracking content table rows and columns

First, let me say that I am new to Stack and to Python; I just started working with it last week. I am, however, a seasoned PHP/C++/Pascal/ADA/B/Forth (showing my age) programmer.

I have written a script that pulls product pages from a website and stores them in my local MySQL database. I did this so that I can crawl the site late at night when the load is light. I now need to sort through the HTML of each page and get the product descriptions. These are placed in tables. However, each page may have the needed values in different rows/columns.

The things I can be sure of are:

  • Each table has a heading that defines the data in the rows/columns below it.
  • The heading text is consistent for each value i.e. 'Part' always describes the part type and 'Part No.' always describes a part number.
  • Not all pages will contain all the desired data, so if a value is not found the script must still save whatever it does find.

In the approach below, it is the second part, getting the data values, that I am having trouble with. How do I select the nth column from a row?

My current approach is:

To Get Desired Columns

  • Get html doc from db
  • Grab the table (my table is always contained in the only div on the page).
  • Grab all the rows (really only need to do this for the first row).
  • For each row, record the row and column indexes where I find a desired field name.

To Get Data Values

  • For each row:
      • Skip the row if it is a header (saving the row numbers of those that contain header fields).
      • For each column, grab the text value.
  • Save the values to the db.
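The two phases above can be sketched without committing to a parser, assuming each table row has already been reduced to a list of cell strings. The names `WANTED`, `find_columns`, and `extract_records` are hypothetical, and the sketch treats any row that contains one of the heading texts as a header row, which would misfire if a data cell ever held exactly the same text as a heading.

```python
# Heading texts to track (hypothetical list taken from the sample table).
WANTED = ("Item", "Description", "Part No.", "Color")


def find_columns(header_cells, wanted=WANTED):
    """Map each wanted heading found in a row to its column index."""
    columns = {}
    for index, text in enumerate(header_cells):
        text = text.strip()
        if text in wanted and text not in columns:
            columns[text] = index
    return columns


def extract_records(rows, wanted=WANTED):
    """Return one dict per data row, keeping only the fields that were found."""
    columns = {}
    records = []
    for cells in rows:
        found = find_columns(cells, wanted)
        if found:
            # Header row: remember where each field lives, then skip it.
            columns = found
            continue
        record = {
            name: cells[i].strip()
            for name, i in columns.items()
            if i < len(cells) and cells[i].strip()
        }
        if record:  # spacer rows produce an empty record and are ignored
            records.append(record)
    return records


rows = [
    ["", "Item", "", "Description", "", "Part No.", "", "Color", ""],
    [""] * 9,
    ["", "Toaster", "", "2-Slice", "", "#25713", "", "Chorme", ""],
]
print(extract_records(rows))
```

Because a record only contains the headings that were actually located, a page that lacks, say, the 'Color' column still yields the other fields, which matches the "save what it finds" requirement.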

The important part of my page looks like this:

<div>
   ... 
   <table>
      <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
   </table>
   ...
</div>

A big thank you to anyone who responds.


Here's how I'd tackle it:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

doc = '''<div>
   <table>
      <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
   </table>
</div>'''

soup = BeautifulSoup(doc, "html.parser")
# find the table element in the HTML document
table = soup.find("table")
# collect the <tr> elements; table.contents would also include whitespace text nodes
rows = table.find_all("tr")
# the first row holds the headings; use it to count the columns
first_row = rows[0]
number_of_columns = len(first_row.find_all("td"))
# print every cell of the remaining rows
for row in rows[1:]:
    cells = row.find_all("td")
    for x in range(number_of_columns):
        print("column data: %s" % cells[x].string)

That finds the first table element in the document, counts the columns in its first row, and then loops through the remaining rows, printing the data in each cell.
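To select the nth column by its heading rather than by a fixed position, one option is to reduce every row to a list of cell texts and look the heading up in the first row. This is only a sketch, assuming BeautifulSoup 4 with the built-in "html.parser" backend and that the first row of the table is the heading row; `table_rows` and `part_no_col` are hypothetical names.

```python
from bs4 import BeautifulSoup

doc = """<div><table>
<tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td>
    <td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
    <td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td>
    <td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
</table></div>"""


def table_rows(html):
    """Return the first table as a list of rows, each a list of stripped cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    return [
        # get_text(strip=True) drops the &nbsp; padding and flattens <b> tags
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


rows = table_rows(doc)
header = rows[0]
# Position of the wanted heading; header.index raises ValueError if it is absent.
part_no_col = header.index("Part No.")
for cells in rows[1:]:
    if any(cells):  # skip spacer rows that contain only empty cells
        print(cells[part_no_col])
```

Pages where a heading may be missing would need a membership check (`"Part No." in header`) before calling `header.index`.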

Useful link to BS docs: http://www.crummy.com/software/BeautifulSoup/documentation.html


Here is how you do it with HTQL:

import htql
doc = '''<div>     <table>
    <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;        </td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
    <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
    <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
  </table>  </div>'''

query = "<div>.<table>.<tr>{item=<td (th='Item')>&tx; desc=<td (th='Description')>&tx | item<>'Item'}"

for item, desc in htql.HTQL(doc, query):
    print(item, desc)