Python BeautifulSoup: automatically tracking content table rows and columns

2023-02-22 01:31 Q&A author:

First, let me say that I am new to Stack and to Python; I just started working with it last week. I am, however, a seasoned PHP/C++/Pascal/ADA/B/Forth (showing my age) programmer.

I have written a script that pulls product pages from a website and stores them in my local MySQL database. I did this so that I can crawl the site late at night when the load is light. I now need to sort through the HTML of each page and get the product descriptions. These are placed in tables. However, each page may have the needed values in different rows/columns.

The things I can be sure of are:

Each table has a heading that defines the data in the rows/columns below it.
The heading text is consistent for each value, i.e. 'Part' always describes the part type and 'Part No.' always describes a part number.
Not all pages will contain all the data desired, so if a value is not found the script must still save whatever it does find.

In the approach below, it is the second part, getting the data values, that I am having trouble with. How do I select the nth column from a row?

My current approach is:

To Get Desired Columns

Get the HTML doc from the db.
Grab the table (my table is always contained in the only div on the page).
Grab all the rows (I really only need to do this for the first row).
For each row, record the row and column indexes wherever I find a desired field name.
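The heading lookup described above could be sketched roughly as follows. This is only an illustration: it assumes BeautifulSoup 4 (the `bs4` package) with the built-in `html.parser`, and the names `WANTED` and `find_header_columns` are hypothetical, not part of any library.

```python
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

# Heading texts to look for; an illustrative list taken from the sample page.
WANTED = ["Item", "Description", "Part No.", "Color"]

def find_header_columns(table, wanted=WANTED):
    """Return (row_index, {heading: column_index}) for the first row that
    contains any wanted heading, or (None, {}) if no row does."""
    for row_index, row in enumerate(table.find_all("tr")):
        columns = {}
        for col_index, cell in enumerate(row.find_all("td", recursive=False)):
            text = cell.get_text(strip=True)  # also strips the &nbsp; padding
            if text in wanted:
                columns[text] = col_index
        if columns:
            return row_index, columns
    return None, {}

html = """<div><table>
<tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
</table></div>"""

table = BeautifulSoup(html, "html.parser").find("table")
header_row, columns = find_header_columns(table)
print(header_row, columns)
```

Because the lookup is keyed on the heading text rather than a fixed position, a page that moves 'Part No.' to another column, or omits it, still yields a usable mapping for whatever headings are present.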

To Get Data Values

For each row:
Skip the row if it is a header (save the row numbers of those with header fields).
For each column, grab the text value.
Save the values to the db.
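A rough sketch of the value-extraction step, again assuming BeautifulSoup 4. The column indexes are hard-coded here to match the sample page, the function name `extract_records` is hypothetical, and the database write is left out.

```python
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

def extract_records(table, columns, header_rows=(0,)):
    """Return one dict per data row. `columns` maps a heading to its column
    index; a value that is empty or missing from a row is stored as None."""
    records = []
    for row_index, row in enumerate(table.find_all("tr")):
        if row_index in header_rows:
            continue  # skip rows already identified as headings
        cells = [td.get_text(strip=True)
                 for td in row.find_all("td", recursive=False)]
        if not any(cells):
            continue  # spacer row that holds only &nbsp; cells
        record = {}
        for name, index in columns.items():
            value = cells[index] if index < len(cells) else ""
            record[name] = value or None
        records.append(record)
    return records

html = """<div><table>
<tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
</table></div>"""

table = BeautifulSoup(html, "html.parser").find("table")
# Column indexes assumed to be known already for this sample page.
columns = {"Item": 1, "Description": 3, "Part No.": 5, "Color": 7}
records = extract_records(table, columns)
print(records)
```

Each record is a plain dict, so it can be handed to whatever database layer is already in use.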

The important part of my page looks like this:

<div>
   ... 
   <table>
      <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
   </table>
   ...
</div>

A big thank you to anyone who responds.

Here's how I'd tackle it:

# uses BeautifulSoup 4 (the bs4 package) and Python 3
from bs4 import BeautifulSoup

doc = '''<div>
   <table>
      <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
      <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
   </table>
</div>'''

soup = BeautifulSoup(doc, "html.parser")
# find the table element in the HTML document
table = soup.find("table")
# collect the <tr> elements (table.contents would also return whitespace text nodes)
rows = table.find_all("tr")
# grabs the top row
firstRow = rows[0]
# find how many columns there are
numberOfColumns = len(firstRow.find_all("td"))
restOfRows = rows[1:]
for row in restOfRows:
    cells = row.find_all("td")
    for x in range(numberOfColumns):
        print("column data: %s" % cells[x].string)

That finds the first table element in the document, then works out the number of columns from the first row. Finally, it loops through the rest of the rows, printing out the data in each cell.
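To pull out a single column by position rather than walking every cell, one option is a CSS selector. The sketch below assumes BeautifulSoup 4.7 or later, where `select()` is backed by the soupsieve package; the helper name `column_values` is hypothetical.

```python
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4.7+ (select() uses soupsieve)

def column_values(table, n):
    """Return the text of the n-th <td> (1-based) of every row in `table`.
    Rows with fewer than n cells contribute nothing to the result."""
    selector = "tr > td:nth-of-type(%d)" % n
    return [td.get_text(strip=True) for td in table.select(selector)]

html = """<div><table>
<tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;</td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
</table></div>"""

table = BeautifulSoup(html, "html.parser").find("table")
part_numbers = column_values(table, 6)  # the sixth cell holds 'Part No.'
print(part_numbers)
```

Note that the selector counts from 1, whereas indexing into `row.find_all("td")` counts from 0, and the spacer row shows up as an empty string that the caller may want to filter out.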

Useful link to the BeautifulSoup docs: http://www.crummy.com/software/BeautifulSoup/documentation.html

Here is how you do it with HTQL:

import htql
doc = '''<div>     <table>
    <tr><td>&nbsp;</td><td><b>Item</b></td><td>&nbsp;</td><td><b>Description</b></td><td>&nbsp;        </td><td><b>Part No.</b></td><td>&nbsp;</td><td><b>Color</b></td><td>&nbsp;</td></tr>
    <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
    <tr><td>&nbsp;</td><td>Toaster</td><td>&nbsp;</td><td>2-Slice</td><td>&nbsp;</td><td>#25713</td><td>&nbsp;</td><td>Chorme</td><td>&nbsp;</td></tr>
  </table>  </div>'''

query = "<div>.<table>.<tr>{item=<td (th='Item')>&tx; desc=<td (th='Description')>&tx | item<>'Item'}"

for item, desc in htql.HTQL(doc, query):
    print(item, desc)
