Extracting text fields from HTML using Python?
what is the best way to extract data from this HTML file and put it into MySQL database with company phone number, company name and email with a primary key as phone number?
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 00" <email@email.com></td>
<td>1231231234</td>
</tr><tr class="tableRowEven">
<td>"JSC company inc. 01" <email01@email.com></td>
<td>234234234234234</td>
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 02" <email2@email.com></td>
<td>32423234234</td>
</tr><tr class="tableRowEven">
<td>"JSC company inc. 03" <email3@email.com></td>
<td>23423424324</td>
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 04" <email4@email.com></td>
<td>2342342323242开发者_运维百科44</td>
</tr> <tr>
For extracting and general HTML munging look at
http://www.crummy.com/software/BeautifulSoup/
For the MySQL I suggest googling on: MySQL tutorial python
Here is how you get the td
contents into a python list using BeautifulSoup
:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup, SoupStrainer
def find_rows(data):
table_rows = SoupStrainer('tr')
rows = [tag for tag in BeautifulSoup(data, parseOnlyThese=table_rows)]
return rows
def cell_data(row):
cells = [tag.string for tag in row.contents]
return cells
if __name__ == "__main__":
f = open("testdata.html", "r")
data = f.read()
rows = find_rows(data)
for row in rows:
print cell_data(row)
Save your html file as testdata.html
, and run this script from the same directory.
With the data you posted here, the output is
[u'\n', u'"JSC company inc. 00" <email@email.com>', u'\n', u'1231231234', u'\n']
[u'\n', u'"JSC company inc. 01" <email01@email.com>', u'\n', u'234234234234234', u'\n']
[u'\n', u'"JSC company inc. 02" <email2@email.com>', u'\n', u'32423234234', u'\n']
[u'\n', u'"JSC company inc. 03" <email3@email.com>', u'\n', u'23423424324', u'\n']
[u'\n', u'"JSC company inc. 04" <email4@email.com>', u'\n', u'234234232324244', u'\n']
For the parsing, I definitely also recommend Beautiful Soup.
To put the text in a database, I recommend a good Python ORM. My top suggestion is to use the ORM from Django, if you can. With Django, you not only get an ORM, you also get a web interface that lets you browse through your database with a web browser; you can even enter data into the database using the web browser.
If you can't use Django, I recommend SQLAlchemy.
Good luck.
With lxml you can do it almost as easily as you could do it with jQuery.
from lxml import html
doc = html.parse('test.html').getroot()
for row in doc.cssselect('tr'):
name, phone_number = row.cssselect('td')[:2]
print name.text_content()
print phone_number.text_content()
+1 for BeautifulSoup
Now that you've got the data, you need to put it into MySQL. If you want a pure python solution, you'll also need the MySQL-Python binding.
Otherwise, the SQL you'll need to generate is relatively painless. We'll hijack gnuds example. Add to the top of the file:
import re
Then at the bottom:
exp = r'\"(.*)\" <(.*)>'
for row in rows:
matcher = re.match(exp, row[1])
name, email = matcher.groups()
phone = row[3]
sql = "INSERT INTO company (email, name, phone) VALUES ('%s','%s','%s')" % (email, name, phone)
print sql
Which gives you output like:
INSERT INTO company (email, name, phone) VALUES ('email@email.com','JSC company inc. 00','1231231234');
INSERT INTO company (email, name, phone) VALUES ('email01@email.com','JSC company inc. 01','234234234234234');
INSERT INTO company (email, name, phone) VALUES ('email2@email.com','JSC company inc. 02','32423234234');
INSERT INTO company (email, name, phone) VALUES ('email3@email.com','JSC company inc. 03','23423424324');
INSERT INTO company (email, name, phone) VALUES ('email4@email.com','JSC company inc. 04','234234232324244');
精彩评论