python - A problem happens when I try to fetch documents from a website
I tried to download documents from this page: Securities Class Action Filings.
I want all 25 documents listed on the page. I thought it would be simple, and here's my code:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
import os

if __name__ == "__main__":
    pre_url = "http://securities.stanford.edu"
    url = "http://securities.stanford.edu/fmi/xsl/SCACPUDB/recordlist.xsl?-db=SCACPUDB&-lay=Search&FIC_DateFiled_Quater=Q1&FIC_DateFiled_Year=2011&-sortfield.1=FIC_DateFiled&-sortfield.2=LitigationName&-sortorder.1=ascend&-max=25&-find"
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read()).findAll('tr')
    url_list = []
    for s in soup[8:]:
        url_list.append(pre_url + s.a['href'])
    for x in url_list:
        name = x.split("/")[4]
        context = urllib2.urlopen(x).read()
        soup = BeautifulSoup(context)
        file = open(name + ".txt", "w")
        file.write(soup.prettify())
    print "DONE"
The script runs and downloads all 25 files, but 10 of them are full of garbage characters! How come? Can anyone help me?
Thanks a lot, and I'm sorry for my poor English.
Update: This is one of the pages that the script downloads incorrectly: http://securities.stanford.edu/1046/BWEN00_01/
The sample page is encoded in UTF-16, but the response headers don't say so: Content-Type is just text/html, with no charset.
>>> page = urllib2.urlopen( "http://securities.stanford.edu/1046/BWEN00_01" )
>>> page.info().headers
['Date: Mon, 22 Aug 2011 13:13:56 GMT\r\n', 'Server: Apache/1.3.33 (Darwin) mod_jk/1.2.2 DAV/1.0.3\r\n', 'Cache-Control: max-age=60\r\n', 'Expires: Mon, 22 Aug 2011 13:14:56 GMT\r\n', 'Last-Modified: Thu, 21 Jul 2011 22:06:51 GMT\r\n', 'ETag: "18b9a6e-9af6-4e28a2fb"\r\n', 'Accept-Ranges: bytes\r\n', 'Content-Length: 39670\r\n', 'Connection: close\r\n', 'Content-Type: text/html\r\n']
Try page.read().decode('utf-16')
to get the page as proper Unicode text instead of raw bytes. (The response object itself has no decode method; you have to read the body first.)
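Since only some of the pages are UTF-16, hard-coding one codec for every download would break the others. A minimal sketch of one way around that, assuming the UTF-16 pages start with a byte-order mark (I have not checked that for this site): look at the first bytes of the body and pick the codec from them. The helper name decode_page and the UTF-8 fallback are my own choices for illustration, not something the server tells you.

```python
import codecs


def decode_page(raw, fallback="utf-8"):
    """Decode raw response bytes, using a byte-order mark when one is present.

    Hypothetical helper: `fallback` is an assumed default for pages without a BOM.
    """
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM, picks the byte order and drops the BOM.
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the UTF-8 BOM while decoding.
        return raw.decode("utf-8-sig")
    # No BOM: fall back, replacing undecodable bytes instead of raising.
    return raw.decode(fallback, "replace")


# Stand-in data instead of a real download, so the sketch runs offline.
sample = u"<html>caf\u00e9</html>"
utf16_bytes = sample.encode("utf-16")  # the encoder prepends a BOM
utf8_bytes = sample.encode("utf-8")

print(decode_page(utf16_bytes) == sample)
print(decode_page(utf8_bytes) == sample)
```

In the script you would pass it the result of urllib2.urlopen(x).read(). If a page turns out to be UTF-16 without a BOM, this check won't catch it and you would have to decode that page explicitly.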
open(name + ".txt", "w")
It's possible that your problem is that you're opening the files in text mode, but they're being downloaded in binary mode. Replace the above expression with
open(name + ".txt", "wb")
and see if it improves things.
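The reason binary mode can matter: on Windows, text mode rewrites each "\n" byte as "\r\n" on write, and in UTF-16 data that extra byte shifts everything after it. A small sketch of the round trip in binary mode, using a temporary file and made-up sample data rather than the real download (save_bytes is a hypothetical helper name):

```python
import os
import tempfile


def save_bytes(path, data):
    """Write bytes exactly as given; "wb" disables newline translation."""
    with open(path, "wb") as f:
        f.write(data)


# Made-up UTF-16 payload containing newlines, standing in for a downloaded page.
payload = u"line one\nline two\n".encode("utf-16")

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
save_bytes(path, payload)

# Read back in binary mode to compare the bytes on disk with what was written.
with open(path, "rb") as f:
    round_trip = f.read()

print(round_trip == payload)
```

The with block also closes the file, which the original loop never does.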