
Replace BeautifulSoup with another (standard) HTML parsing module in this Python script

I have made a script with BeautifulSoup which works fine and is very readable, but I want to redistribute it some day, and BeautifulSoup is an external dependency I would like to avoid, especially considering Windows use.

Here is the code. It gets every user map link from a given Google Maps user. The lines marked with #### are the ones using BeautifulSoup:

# coding: utf-8

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)  ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))  #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):  ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1

            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')


    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break

As you can see, there are just three lines of code using BSoup, but I am not a programmer and I had a lot of difficulty trying to use other standard HTML and XML parsing tools, probably because I tried the wrong way, I guess.

EDIT: This question is more about replacing the three lines of code in this script than about finding a way to solve generic HTML parsing problems.

Any help will be much appreciated, thanks for reading!


Unfortunately, Python does not have useful HTML parsing in the standard library, so the only reasonable way to parse HTML is to use a third-party module like lxml.html or BeautifulSoup. This does not mean you need a separate dependency: these modules are free software, and if you do not want an external dependency you are welcome to bundle them with your code, which makes them no more of a dependency than the code you write yourself.


To parse HTML code, I see three solutions:

  • use simple string search (.find(), ...), which is fast
  • use regular expressions (aka regex)
  • use HTMLParser


I tried the code below, and it prints a list of links. I don't have BeautifulSoup installed and would rather not install it, so it is difficult for me to check the results against what your code gives. The "pure" Python code without any "soup" is even shorter and more readable. Anyway, here it is. Tell me what you think! Friendly, Louis.

#coding: utf-8

import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
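    # Pull out each complete <a ... maptitle ...>...</a> link and handle the links
    # one at a time. Searching the whole page inside "while True" never advanced
    # and never stopped. Assumes "maptitle" appears inside the opening <a> tag.
    for link in re.findall(r'<a\b[^>]*\bmaptitle\b[^>]*>.*?</a>', source, re.S):
        mapid = re.search(uid+'\.([^"]*)', link).group(1)
        mapname = re.search('>(.*)</a>', link, re.S).group(1).strip()[:-3]
        print shown, mapid, '\t', mapname
        shown += 1
        urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) + '&msa=0&output=kml', mapname + '.kml')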

    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
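
The HTMLParser option from the list above can also be sketched with only the standard library. This is a minimal sketch, not a drop-in replacement: MapLinkParser, find_maps and SAMPLE are hypothetical names, the sample markup is invented, and the code assumes the map links are <a class="maptitle"> tags nested inside an element whose id is "map" followed by digits, as the BeautifulSoup calls in the question imply. It does not reproduce the [:-3] trimming of the map name, since that depends on the real page markup.

```python
import re

try:                                    # Python 3
    from html.parser import HTMLParser
except ImportError:                     # Python 2, as used in the script above
    from HTMLParser import HTMLParser


class MapLinkParser(HTMLParser):
    """Collect (href, text) for each <a class="maptitle"> found inside an
    element whose id looks like map0, map1, ... (hypothetical helper)."""

    MAP_ID = re.compile(r'^map[0-9]+$')

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []          # finished (href, text) pairs
        self._container = None   # tag name of the map element we are inside
        self._depth = 0          # nesting depth of that same tag name
        self._href = None        # href of the maptitle link being read
        self._text = []          # text pieces of that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._container is None:
            # Not inside a map element yet: only look for one opening.
            if self.MAP_ID.match(attrs.get('id') or ''):
                self._container, self._depth = tag, 1
            return
        if tag == self._container:
            self._depth += 1
        elif tag == 'a' and 'maptitle' in (attrs.get('class') or '').split():
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_endtag(self, tag):
        if self._container is None:
            return
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        elif tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._container = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def find_maps(source, uid):
    """Return (mapid, mapname) pairs, using the same uid regex as the question."""
    parser = MapLinkParser()
    parser.feed(source)
    parser.close()
    found = []
    for href, name in parser.links:
        match = re.search(re.escape(uid) + r'\.([^"]*)', href)
        if match:
            found.append((match.group(1), name))
    return found


# Invented markup, only to show the call; the real page may differ.
SAMPLE = '''
<table id="map0"><tr><td>
  <a class="maptitle" href="/maps/ms?msa=0&amp;msid=123.abc">First map</a>
  <a class="other" href="/maps/ms?msid=123.zzz">ignored</a>
</td></tr></table>
<a class="maptitle" href="/maps/ms?msid=123.outside">outside any map element</a>
'''
for mapid, mapname in find_maps(SAMPLE, '123'):
    print('%s\t%s' % (mapid, mapname))
```

In the script, the three BeautifulSoup lines would become a single loop over find_maps(source, uid). On Python 3, urlopen(url).read() returns bytes, so the page would need to be decoded before being passed to feed().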