开发者

Extracting text fields from HTML using Python?

what is the best way to extract data from this HTML file and put it into MySQL database with company phone number, company name and email with a primary key as phone number?

</tr><tr class="tableRowOdd">
                <td>"JSC company inc. 00" &lt;email@email.com&gt;</td>
                <td>1231231234</td>
            </tr><tr class="tableRowEven">
                <td>"JSC company inc. 01" &lt;email01@email.com&gt;</td>
                <td>234234234234234</td>
            </tr><tr class="tableRowOdd">
                <td>"JSC company inc. 02" &lt;email2@email.com&gt;</td>
                <td>32423234234</td>
            </tr><tr class="tableRowEven">
                <td>"JSC company inc. 03" &lt;email3@email.com&gt;</td>
                <td>23423424324</td>
            </tr><tr class="tableRowOdd">
                <td>"JSC company inc. 04" &lt;email4@email.com&gt;</td>
                <td>2342342323242开发者_运维百科44</td>
            </tr>   <tr>


For extracting and general HTML munging look at

http://www.crummy.com/software/BeautifulSoup/

For the MySQL I suggest googling on: MySQL tutorial python


Here is how you get the td contents into a python list using BeautifulSoup:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup, SoupStrainer

def find_rows(data):
    table_rows = SoupStrainer('tr')
    rows = [tag for tag in BeautifulSoup(data, parseOnlyThese=table_rows)]
    return rows

def cell_data(row):
    cells = [tag.string for tag in row.contents]
    return cells

if __name__ == "__main__":
    f = open("testdata.html", "r")
    data = f.read()
    rows = find_rows(data)
    for row in rows:
        print cell_data(row)

Save your html file as testdata.html, and run this script from the same directory. With the data you posted here, the output is

[u'\n', u'"JSC company inc. 00" &lt;email@email.com&gt;', u'\n', u'1231231234', u'\n']
[u'\n', u'"JSC company inc. 01" &lt;email01@email.com&gt;', u'\n', u'234234234234234', u'\n']
[u'\n', u'"JSC company inc. 02" &lt;email2@email.com&gt;', u'\n', u'32423234234', u'\n']
[u'\n', u'"JSC company inc. 03" &lt;email3@email.com&gt;', u'\n', u'23423424324', u'\n']
[u'\n', u'"JSC company inc. 04" &lt;email4@email.com&gt;', u'\n', u'234234232324244', u'\n']


For the parsing, I definitely also recommend Beautiful Soup.

To put the text in a database, I recommend a good Python ORM. My top suggestion is to use the ORM from Django, if you can. With Django, you not only get an ORM, you also get a web interface that lets you browse through your database with a web browser; you can even enter data into the database using the web browser.

If you can't use Django, I recommend SQLAlchemy.

Good luck.


With lxml you can do it almost as easily as you could do it with jQuery.

from lxml import html

doc = html.parse('test.html').getroot()
for row in doc.cssselect('tr'):
    name, phone_number = row.cssselect('td')[:2]
    print name.text_content()
    print phone_number.text_content()


+1 for BeautifulSoup

Now that you've got the data, you need to put it into MySQL. If you want a pure python solution, you'll also need the MySQL-Python binding.

Otherwise, the SQL you'll need to generate is relatively painless. We'll hijack gnuds example. Add to the top of the file:

 import re

Then at the bottom:

exp = r'\"(.*)\" &lt;(.*)&gt;'
for row in rows:
 matcher = re.match(exp, row[1])
 name, email = matcher.groups()
 phone = row[3]

 sql = "INSERT INTO company (email, name, phone) VALUES ('%s','%s','%s')" % (email, name, phone)
 print sql

Which gives you output like:

INSERT INTO company (email, name, phone) VALUES ('email@email.com','JSC company inc. 00','1231231234');
INSERT INTO company (email, name, phone) VALUES ('email01@email.com','JSC company inc. 01','234234234234234');
INSERT INTO company (email, name, phone) VALUES ('email2@email.com','JSC company inc. 02','32423234234');
INSERT INTO company (email, name, phone) VALUES ('email3@email.com','JSC company inc. 03','23423424324');
INSERT INTO company (email, name, phone) VALUES ('email4@email.com','JSC company inc. 04','234234232324244');
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜