How to apply a "catch-all" exception clause to a complex Python web-scraping script?

2022-12-08 06:42 Q&A author:

I have a list of 100 websites in CSV format. All of the sites share the same general layout, including a large table with 7 columns. I wrote the script below to extract the data from the 7th column on each site and write it to a file. The script only partially works: after a run, the output file shows just 98 writes, so something is being skipped (the script also clearly raises a number of exceptions). Guidance on how to implement a catch-all exception handler in this context would be much appreciated. Thank you!

import csv, urllib2, re
def replace(variab): return variab.replace(",", " ")

urls = csv.reader(open('input100.txt', 'rb'))  #access list of 100 URLs
for url in urls:
    html = urllib2.urlopen(url[0]).read()  #get HTML starting with the first URL
    col7 = re.findall('td7.*?td', html)  #use regex to get data from column 7
    string = str(col7)  #stringify data
    neat = re.findall('div3.*?div', string)  #use regex to get target text  
    result = map(replace, neat)  #apply function to remove','s from elements
    string2 = ", ".join(result)  #separate list elements with ', ' for export to csv
    output = open('output.csv', 'ab') #open file for writing 
    output.write(string2 + '\n') #append output to file and create new line
    output.close()

Return:

Traceback (most recent call last):
 File "C:\Python26\supertest3.py", line 6, in <module>
  html = urllib2.urlopen(url[0]).read()
 File "C:\Python26\lib\urllib2.py", line 124, in urlopen
  return _opener.open(url, data, timeout)
 File "C:\Python26\lib\urllib2.py", line 383, in open
  response = self._open(req, data)
 File "C:\Python26\lib\urllib2.py", line 401, in _open
  '_open', req)
 File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
  result = func(*args)
 File "C:\Python26\lib\urllib2.py", line 1130, in http_open
  return self.do_open(httplib.HTTPConnection, req)
 File "C:\Python26\lib\urllib2.py", line 1103, in do_open
  r = h.getresponse()
 File "C:\Python26\lib\httplib.py", line 950, in getresponse
  response.begin()
 File "C:\Python26\lib\httplib.py", line 390, in begin
  version, status, reason = self._read_status()
 File "C:\Python26\lib\httplib.py", line 354, in _read_status
  raise BadStatusLine(line)
BadStatusLine
>>>

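Wrap the body of your for loop in a try/except block (and add import sys at the top of the script, since the handler writes to sys.stderr):

for url in urls:
  try:
    ...the body you have now...
  except Exception, e:
    print>>sys.stderr, "Url %r not processed: error (%s)" % (url, e)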

(Or, use logging.error instead of the goofy print>>, if you're already using the logging module of the standard library [and you should;-)]).
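For readers who want to see the logging variant, here is a minimal sketch. It is written in Python 3 syntax (except Exception as exc) rather than the Python 2 syntax used in the question. The names scrape_all and fake_fetch are hypothetical and introduced only for illustration: fake_fetch stands in for the real download step so the example runs without network access.

```python
import logging

logging.basicConfig(level=logging.ERROR)


def scrape_all(urls, fetch):
    """Call fetch(url) for every URL; log failures and keep going."""
    results, failed = [], []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception as exc:
            # Catch-all: one bad site must not stop the whole loop.
            logging.error("Url %r not processed: error (%s)", url, exc)
            failed.append(url)
    return results, failed


def fake_fetch(url):
    # Hypothetical stand-in for the real download call; no network access.
    if "bad" in url:
        raise IOError("simulated bad status line")
    return "<html>%s</html>" % url


results, failed = scrape_all(
    ["http://ok.example", "http://bad.example"], fake_fetch
)
```

Collecting the failed URLs in a list, as this sketch does, also makes it easy to see afterwards which of the 100 sites were skipped.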

I'd recommend reading the Errors and Exceptions Python documentation, especially section 8.3 -- Handling Exceptions.
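That section of the tutorial also covers naming specific exception types and using an else clause, which can be a useful refinement once you know which errors to expect. The sketch below assumes Python 3, where the BadStatusLine seen in the traceback lives in http.client (it was httplib in Python 2). The helper name classify and its return convention are hypothetical.

```python
import http.client
import urllib.error


def classify(fetch, url):
    """Return (html, None) on success or (None, reason) on a known failure."""
    try:
        html = fetch(url)
    except http.client.BadStatusLine as exc:
        # The server answered with a status line that could not be parsed.
        return None, "bad status line: %r" % (exc.line,)
    except urllib.error.URLError as exc:
        # Covers unreachable hosts and, via the HTTPError subclass, HTTP errors.
        return None, "url error: %s" % exc.reason
    else:
        # Runs only when fetch() raised nothing.
        return html, None
```

Any exception type not listed here still propagates, so an unexpected bug is not silently swallowed the way it would be with a bare catch-all.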
