Python data structure recommendation?

I currently have a structure that is a dict: each value is a list of numeric values. Each of these numeric lists has what (to borrow a SQL idiom) you could call a primary key made up of three values: a year, a player identifier, and a team identifier. This is the key for the dict.

So you can get a unique row by passing in a value for the year, player ID, and team ID, like so:

statline = stats[(2001, 'SEA', 'suzukic01')]

Which yields something like

[305, 20, 444, 330, 45]

I'd like to alter this data structure so it can be quickly summed by any of these three keys: you could easily slice the totals for a given index in the numeric lists by passing in ONE of year, player ID, and team ID, and then the index. I want to be able to do something like

hr_total = stats[year=2001, idx=3]

Where that idx of 3 corresponds to the third column in the numeric list(s) that would be retrieved.

Any ideas?


Read up on Data Warehousing. Any book.

Read up on Star Schema Design. Any book. Seriously.

You have several dimensions: Year, Player, Team.

You have one fact: score

You want to have a structure like this.

You then want to create a set of dimension indexes like this.

import collections

years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)

Your fact table can be a collections.namedtuple. Alternatively, you can use something like this.

class ScoreFact(object):
    def __init__(self, year, player, team, score):
        self.year = year
        self.player = player
        self.team = team
        self.score = score
        # Register this fact in each dimension index.
        years[self.year].append(self)
        players[self.player].append(self)
        teams[self.team].append(self)

Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.

years[2001] are all scores for the given year.

teams['SEA'] are all scores for the given team.

etc. You can simply use sum() over the score attribute of each item to add them up. A multi-dimensional query is something like this.

[x for x in teams['SEA'] if x.year == 2001]
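
A minimal, self-contained sketch of the dimension-index idea above, assuming each fact carries the whole stat line rather than a single score. The names StatFact, add_fact, and total are hypothetical, and every row except the one from the question is made up for illustration.

```python
import collections

# Hypothetical fact record: three dimension values plus the numeric stat line.
StatFact = collections.namedtuple("StatFact", ["year", "player", "team", "values"])

# One index per dimension: dimension value -> list of facts.
years = collections.defaultdict(list)
players = collections.defaultdict(list)
teams = collections.defaultdict(list)


def add_fact(year, player, team, values):
    """Create a fact and register it in every dimension index."""
    fact = StatFact(year, player, team, tuple(values))
    years[year].append(fact)
    players[player].append(fact)
    teams[team].append(fact)
    return fact


def total(facts, idx):
    """Sum column idx over an iterable of facts."""
    return sum(f.values[idx] for f in facts)


add_fact(2001, "suzukic01", "SEA", [305, 20, 444, 330, 45])  # row from the question
add_fact(2001, "playerx01", "SEA", [100, 5, 200, 150, 10])   # made-up row
add_fact(2002, "suzukic01", "SEA", [200, 10, 300, 250, 30])  # made-up row

# Single-dimension total: column 3 over everything in 2001.
year_total = total(years[2001], 3)

# Two-dimension total: start from one index, filter on the other dimension.
player_2001 = total((f for f in players["suzukic01"] if f.year == 2001), 3)
```

Lookups by a single dimension only touch the facts attached to that value; the cost is paid once, when each fact is registered in all three indexes.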


Put your data into SQLite, and use its relational engine to do the work. You can create an in-memory database and not even have to touch the disk.
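
As a rough sketch of what that could look like with the standard-library sqlite3 module: the table layout, the column names c0..c4, and the helper column_total are assumptions made for illustration, and the second sample row is made up.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats ("
    " year INTEGER, team TEXT, player TEXT,"
    " c0 INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER,"
    " PRIMARY KEY (year, team, player))"
)

# Same shape as the dict in the question.
stats = {
    (2001, "SEA", "suzukic01"): [305, 20, 444, 330, 45],
    (2001, "SEA", "playerx01"): [100, 5, 200, 150, 10],  # made-up row
}
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [key + tuple(values) for key, values in stats.items()],
)

COLUMNS = ("c0", "c1", "c2", "c3", "c4")
DIMENSIONS = ("year", "team", "player")


def column_total(conn, dimension, value, idx):
    """Sum column idx over all rows where the given dimension equals value."""
    # Identifiers cannot be bound as SQL parameters, so they are taken
    # from fixed whitelists; only the value is passed as a parameter.
    if dimension not in DIMENSIONS:
        raise ValueError("unknown dimension: %r" % (dimension,))
    query = "SELECT SUM({}) FROM stats WHERE {} = ?".format(COLUMNS[idx], dimension)
    return conn.execute(query, (value,)).fetchone()[0]


hr_total = column_total(conn, "year", 2001, 3)
```

If the per-dimension lookups turn out to be slow on a large table, adding an index on each dimension column is the usual next step.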


The syntax stats[year=2001, idx=3] is invalid Python and there is no way you can make it work with those square brackets and "keyword arguments"; you'll need to have a function or method call in order to accept keyword arguments.

So, say we make it a function, to be called like wells(stats, year=2001, idx=3). I imagine the idx argument is mandatory (which is very peculiar given the call, but you give no indication of what it could possibly mean to omit idx) and exactly one of year, playerid, and teamid must be there.

With your current data structure, wells can already be implemented:

def wells(stats, year=None, playerid=None, teamid=None, idx=None):
    if idx is None:
        raise ValueError('idx must be specified')
    # Keep (key position, value) for each of the three specifiers that was given.
    specifiers = [(i, x) for i, x in enumerate((year, playerid, teamid)) if x is not None]
    if len(specifiers) != 1:
        raise ValueError('Exactly one of year, playerid, teamid, must be given')
    ikey, keyv = specifiers[0]
    return sum(v[idx] for k, v in stats.items() if k[ikey] == keyv)

Of course, this is O(N) in the size of stats -- it must examine every entry in it. Please measure correctness and performance with this simple implementation as a baseline. An alternative solution (much speedier in use, but requiring much time for preparation) is to put three dicts of lists (one each for year, playerid, teamid) to the side of stats, each entry indicating (or copying, but I think indicating by full key may suffice) all entries of stats that match that ikey / keyv pair. But it's not clear at this time whether this implementation may not be premature, so please try first with the simple-minded idea!-)
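
For reference, the side-index alternative could be sketched roughly like this. The helper names build_indexes and indexed_total are hypothetical, the second sample row is made up, and the key positions simply follow whatever order the keys of stats use.

```python
import collections


def build_indexes(stats):
    """Return one dict per key position, mapping a key component
    to the list of full keys of stats that contain it."""
    indexes = [collections.defaultdict(list) for _ in range(3)]
    for key in stats:
        for position, component in enumerate(key):
            indexes[position][component].append(key)
    return indexes


def indexed_total(stats, indexes, position, value, idx):
    """Sum column idx over the entries whose key has value at position."""
    # .get avoids creating an empty entry for values that were never seen.
    matching_keys = indexes[position].get(value, ())
    return sum(stats[key][idx] for key in matching_keys)


stats = {
    (2001, "SEA", "suzukic01"): [305, 20, 444, 330, 45],
    (2002, "SEA", "suzukic01"): [200, 10, 300, 250, 30],  # made-up row
}
indexes = build_indexes(stats)  # one O(N) pass, done up front

total_2001 = indexed_total(stats, indexes, 0, 2001, 3)
total_sea = indexed_total(stats, indexes, 1, "SEA", 3)
```

The indexes store full keys rather than copies of the rows, so they must be rebuilt or updated whenever entries are added to or removed from stats.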


def getSum(d, year, idx):
    total = 0  # avoid shadowing the built-in sum()
    for key, values in d.items():
        if key[0] == year:
            total += values[idx]
    return total

This should get you started. This code assumes that ONLY year will be asked for, but it should be easy enough to adapt it to check for the other parameters as well.

Cheers
