
Using urllib to import a formatted text file with lines out of column

I'm trying to use urllib to fetch a text file from a website and pull in the data. I have handled other files without trouble because their text is laid out in fixed-width columns, but this one is throwing me: the line for Southern Illinois-Edwardsville is too long and pushes the second score and the location out of their columns.

import urllib

file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

for line in file:
    game_month = line[0:1].rstrip()
    game_day   = line[2:4].rstrip()
    game_year  = line[5:9].rstrip()
    team1      = line[11:37].rstrip()
    team1_scr  = line[38:40].rstrip()
    team2      = line[42:68].rstrip()
    team2_scor = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()

The Southern Illinois-Edwardsville line imports 'il' as team2_scor and ' 4 @Central Arkansas' as extra_info.


The simplest solution is to request the CSV format instead: http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch will give you a nice CSV file, no dark magic needed.
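A minimal sketch of reading that CSV output with the csv module, written for Python 3. The column order used here (date, team, score, team, score, location) is an assumption carried over from the text format, not something checked against the site's actual CSV, and the sample string is a hypothetical stand-in for the downloaded body.

```python
import csv
import io

# Assumed column order, mirroring the text format; verify against the real file.
FIELDS = ("date", "team1", "team1_score", "team2", "team2_score", "extra_info")

def parse_csv_scores(text):
    """Parse CSV text into a list of dicts keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    games = []
    for row in reader:
        # Scores arrive as strings; convert them to integers.
        row["team1_score"] = int(row["team1_score"])
        row["team2_score"] = int(row["team2_score"])
        games.append(row)
    return games

# Hypothetical sample standing in for the downloaded file contents.
sample = (
    "2/18/2011,Central Arkansas,5,Southern Illinois-Edwardsville,4,@Central Arkansas\n"
    "2/18/2011,Central Florida,11,Siena,1,@Central Florida\n"
)

games = parse_csv_scores(sample)
print(games[0]["team2"], games[0]["team2_score"])
```

Because the fields are comma-delimited, a long team name can no longer spill into the neighbouring column.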


Do you want something like this?

def get_row(row):
    # Split on whitespace; multi-word team names become several tokens
    row = row.split()
    # Record the positions of the tokens that parse as integers (the scores)
    num_pos = []
    for i in range(len(row)):
        try:
            int(row[i])
            num_pos.append(i)
        except ValueError:
            pass
    # Exactly two scores are expected per line
    assert len(num_pos) == 2
    ans = []
    ans.append(row[0])                                    # date
    ans.append(" ".join(row[1:num_pos[0]]))               # team 1
    ans.append(int(row[num_pos[0]]))                      # team 1 score
    ans.append(" ".join(row[num_pos[0]+1:num_pos[1]]))    # team 2
    ans.append(int(row[num_pos[1]]))                      # team 2 score
    ans.append(" ".join(row[num_pos[1]+1:]))              # location
    return ans


row1="2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas"
row2="2/18/2011  Central Florida           11  Siena                      1  @Central Florida"

print(get_row(row1))
print(get_row(row2))

output:

['2/18/2011', 'Central Arkansas', 5, 'Southern Illinois-Edwardsville', 4, '@Central Arkansas']
['2/18/2011', 'Central Florida', 11, 'Siena', 1, '@Central Florida']


You just need to split on runs of two or more spaces. Unfortunately the csv module only allows a single-character delimiter, but re.sub can help by turning each run into a tab first. I would recommend something like this:

import urllib2
import csv
import re

u = urllib2.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

# Replace each run of two or more spaces with a tab, then parse as tab-delimited
reader = csv.DictReader(
    (re.sub(' {2,}', '\t', line) for line in u),
    delimiter='\t',
    fieldnames=('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info'))

for i, row in enumerate(reader):
    if i == 5: break  # Only do five (otherwise you don't need ``enumerate()``)
    print row

This produces results like this:

{'team1': 'Air Force', 'team2': 'Missouri State', 'date': '2/18/2011', 'team2_score': '2', 'team1_score': '7', 'extra_info': '@neutral'}
{'team1': 'Akron', 'team2': 'Lamar', 'date': '2/18/2011', 'team2_score': '1', 'team1_score': '2', 'extra_info': '@neutral'}
{'team1': 'Alabama', 'team2': 'Alcorn State', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '11', 'extra_info': '@Alabama'}
{'team1': 'Alabama State', 'team2': 'Tuskegee', 'date': '2/18/2011', 'team2_score': '5', 'team1_score': '9', 'extra_info': '@Alabama State'}
{'team1': 'Appalachian State', 'team2': 'Maryland-Eastern Shore', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '4', 'extra_info': '@Appalachian State'}

Or if you prefer, just use a csv.reader and get lists rather than dicts:

reader = csv.reader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t')

print reader.next()
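For anyone on Python 3, here is a rough sketch of the same re.sub plus csv idea that runs without a network connection. The helper name read_scores and the sample_lines list are made up for illustration; the two sample rows are the ones quoted elsewhere in this thread.

```python
import csv
import re

FIELDS = ("date", "team1", "team1_score", "team2", "team2_score", "extra_info")

def read_scores(lines):
    """Return a DictReader that treats runs of 2+ spaces as column breaks."""
    # Strip each line first so trailing padding or the newline cannot
    # turn into an empty extra column.
    tabbed = (re.sub(r" {2,}", "\t", line.strip()) for line in lines)
    return csv.DictReader(tabbed, delimiter="\t", fieldnames=FIELDS)

# Hypothetical stand-in for the lines of the fetched text file.
sample_lines = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]

for row in read_scores(sample_lines):
    print(row["team2"], row["team2_score"])
```

Note that the scores stay strings here. When reading from urllib.request.urlopen() in Python 3, the lines arrive as bytes and need to be decoded to str before they are passed in.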


Say that s contains one row of your table. Then you could split it with a compiled pattern from the re (regular expressions) module:

import re
rexp = re.compile('  +')  # Match two or more spaces
cols = rexp.split(s)

...and cols is now a list of strings, each a column in your table row. This assumes that table columns are separated by at least two spaces, and nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.

Recall that iterating over a file in Python yields its lines one at a time. Therefore, all you have to do is loop over your file and apply rexp.split() to each line, stripping the trailing newline first so it does not end up in the last column.

For an even nicer solution, check out the built-in map() function and try using that instead of a for-loop.
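The loop and the map() variant might look roughly like this. The helper split_row and the lines list are hypothetical names; lines stands in for the open file, using the two rows quoted in this thread.

```python
import re

rexp = re.compile(r"  +")  # Match two or more spaces

def split_row(line):
    """Split one table row into a list of column strings."""
    # strip() drops the trailing newline and any edge padding
    return rexp.split(line.strip())

# Hypothetical stand-in for the file object.
lines = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]

# for-loop version
rows_loop = []
for line in lines:
    rows_loop.append(split_row(line))

# map() version; list() is needed in Python 3, where map() is lazy
rows_map = list(map(split_row, lines))

print(rows_map[0])
```

Both versions produce the same list of rows; map() just removes the explicit accumulator.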
