Using urllib to import a formatted text file with lines out of column

Posted 2023-03-14 02:14

I'm trying to use urllib to fetch a text file from a website and pull in its data. I have been able to handle other files, which are text formatted in columns, but this one is throwing me: the line for Southern Illinois-Edwardsville pushes the second score and the location out of their columns.

file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

for line in file:
    game_month = line[0:1].rstrip()
    game_day   = line[2:4].rstrip()
    game_year  = line[5:9].rstrip()
    team1      = line[11:37].rstrip()
    team1_scr  = line[38:40].rstrip()
    team2      = line[42:68].rstrip()
    team2_scor = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()

The Southern Illinois-Edwardsville line imports 'il' as team2_scor and ' 4 @Central Arkansas' as extra_info.

Want the simplest solution? Request the CSV format instead of the text format: http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch gives you a proper CSV file, no dark magic needed.
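As a rough illustration of why the CSV format is easier to consume, here is a minimal sketch that parses CSV text with the standard csv module. The sample rows and the column order (date, team1, team1_score, team2, team2_score, extra_info) are assumptions modelled on the text format; the source does not confirm what the CSV URL actually returns, so check the real response first. The snippets in this thread are Python 2; this sketch is written for Python 3.

```python
import csv
import io

# Hypothetical sample of what the CSV response might look like; the real
# column order and quoting are not confirmed by the source.
sample_csv = (
    "2/18/2011,Central Arkansas,5,Southern Illinois-Edwardsville,4,@Central Arkansas\n"
    "2/18/2011,Central Florida,11,Siena,1,@Central Florida\n"
)

# Assumed field names, mirroring the columns of the text format.
fieldnames = ("date", "team1", "team1_score", "team2", "team2_score", "extra_info")


def parse_scores_csv(text):
    """Return a list of dicts, one per CSV row, with both scores as ints."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), fieldnames=fieldnames):
        row["team1_score"] = int(row["team1_score"])
        row["team2_score"] = int(row["team2_score"])
        rows.append(row)
    return rows


games = parse_scores_csv(sample_csv)
print(games[0]["team2"], games[0]["team2_score"])
```

Because the fields are comma-delimited, a long team name can no longer shift the columns that follow it.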

Do you want something like this?

def get_row(row):
    # Split on whitespace, then find the two tokens that parse as integers (the scores).
    row = row.split()
    num_pos = []
    for i in range(len(row)):
        try:
            int(row[i])
            num_pos.append(i)
        except ValueError:
            pass
    # Exactly two scores are expected per line.
    assert len(num_pos) == 2
    ans = []
    ans.append(row[0])                                      # date
    ans.append(" ".join(row[1:num_pos[0]]))                 # team 1
    ans.append(int(row[num_pos[0]]))                        # team 1 score
    ans.append(" ".join(row[num_pos[0]+1:num_pos[1]]))      # team 2
    ans.append(int(row[num_pos[1]]))                        # team 2 score
    ans.append(" ".join(row[num_pos[1]+1:]))                # location
    return ans


row1="2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas"
row2="2/18/2011  Central Florida           11  Siena                      1  @Central Florida"

print get_row(row1)
print get_row(row2)

Output:

['2/18/2011', 'Central Arkansas', 5, 'Southern Illinois-Edwardsville', 4, '@Central Arkansas']
['2/18/2011', 'Central Florida', 11, 'Siena', 1, '@Central Florida']

Clearly you just need to split on multiple spaces. Unfortunately the csv module only allows a single-character delimiter, but re.sub can help. I would recommend something like this:

import urllib2
import csv
import re

u = urllib2.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

reader = csv.DictReader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t', fieldnames=('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info'))

for i, row in enumerate(reader):
    if i == 5: break  # Only do five (otherwise you don't need ``enumerate()``)
    print row

This produces results like this:

{'team1': 'Air Force', 'team2': 'Missouri State', 'date': '2/18/2011', 'team2_score': '2', 'team1_score': '7', 'extra_info': '@neutral'}
{'team1': 'Akron', 'team2': 'Lamar', 'date': '2/18/2011', 'team2_score': '1', 'team1_score': '2', 'extra_info': '@neutral'}
{'team1': 'Alabama', 'team2': 'Alcorn State', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '11', 'extra_info': '@Alabama'}
{'team1': 'Alabama State', 'team2': 'Tuskegee', 'date': '2/18/2011', 'team2_score': '5', 'team1_score': '9', 'extra_info': '@Alabama State'}
{'team1': 'Appalachian State', 'team2': 'Maryland-Eastern Shore', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '4', 'extra_info': '@Appalachian State'}

Or if you prefer, just use a csv.reader and get lists rather than dicts:

reader = csv.reader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t')

print reader.next()
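To try the same idea without hitting the network, here is a small sketch that runs the re.sub and csv.reader combination over two sample lines quoted in this thread. It is written for Python 3, and the helper name to_tab_separated is only illustrative. It assumes that every column is separated by at least two spaces and that no field contains two consecutive spaces.

```python
import csv
import re

# Sample lines copied from the rows quoted in this thread.
lines = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]


def to_tab_separated(line):
    """Collapse every run of two or more spaces into a single tab."""
    return re.sub(" {2,}", "\t", line.rstrip("\n"))


# csv.reader accepts any iterable of strings, so a generator works here.
reader = csv.reader((to_tab_separated(line) for line in lines), delimiter="\t")
rows = list(reader)

for row in rows:
    print(row)
```

Each row comes back as a list of six strings; the scores still need int() if you want numbers.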

Say that s contains one row of your table. Then you could use the split() method of the re (regular expressions) library:

import re
rexp = re.compile('  +')  # Match two or more spaces
cols = rexp.split(s)

...and cols is now a list of strings, each a column in your table row. This assumes that table columns are separated by at least two spaces, and nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.
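As a quick check of the splitting described here, the sketch below applies the pattern to one sample row from this thread. The second pattern, any_whitespace, is a hypothetical variant showing how the argument to re.compile() could be widened if columns were separated by tabs or mixed whitespace; whether the real file needs that is an assumption to verify.

```python
import re

# One sample row, taken from the example rows quoted in this thread.
s = "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas"

two_spaces = re.compile(r" {2,}")       # two or more spaces only
any_whitespace = re.compile(r"\s{2,}")  # hypothetical variant: spaces, tabs, etc.

# strip() removes the trailing newline a line read from a file would carry.
cols = two_spaces.split(s.strip())
print(cols)
```

Single spaces inside a team name are left alone, which is why 'Southern Illinois-Edwardsville' stays in one piece.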

Recall that Python considers a file a sequence of lines, separated by newline characters. Therefore, all you have to do is to for-loop over your file, applying .split() to each line.

For an even nicer solution, check out the built-in map() function and try using that instead of a for-loop.
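A minimal sketch of both approaches, using a list of sample lines in place of the opened URL (any iterable of lines behaves the same way). The helper name split_columns is illustrative. Note that in Python 3 map() returns a lazy iterator, so it is wrapped in list() here; in Python 2 it returns a list directly.

```python
import re

rexp = re.compile(r" {2,}")  # two or more spaces


def split_columns(line):
    """Split one table row into its columns, ignoring the trailing newline."""
    return rexp.split(line.strip())


# Stand-in for the file object; sample lines quoted in this thread.
lines = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]

# Explicit for-loop over the lines.
rows_loop = []
for line in lines:
    rows_loop.append(split_columns(line))

# The same thing with map().
rows_map = list(map(split_columns, lines))

print(rows_map[1])
```

Both versions build the same list of rows; map() just removes the bookkeeping of the loop.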
