开发者

How can I split a 2D array into an array with unique values and a dictionary?

I'm trying to split a 2D array into a specific format and can't figure out the last step. A sample of my data is structured as follows:

# Original Data
fileListCode = [['Seq3.xls', 'B08524_057'], 
                ['Seq3.xls', 'B08524_053'], 
                ['Seq3.xls', 'B08524_054'],
                ['Seq98.xls', 'B25034_001'], 
                ['Seq98.xls', 'D25034_002'], 
                ['Seq98.xls', 'B25034_003']]

I am trying to split it up so that it looks like this:

# split into [['Seq3.xls', {'B08524_057':1,'B08524_053':2, 'B08524_054':3},
#             ['Seq98.xls',{'B25034_001':1,'D25034_002':2, 'B25034_003':3}]

The dictionary keys 1,2,3 are based on the original position of the entry, starting from the first time that the filename appears. To do this, I've first made an array to get all the unique file names (anything that is .xls is a filename)

tmpFileList = []
tmpCodeList = []
arrayListDict = []

# store unique filelist in a tempprary array:
for i in range( len(fileListCode)):
    if fileListCode[i][0] not in tmpFileList:
        开发者_开发知识库tmpFileList.append( fileListCode[i][0]  )

However, I'm struggling with the next step. I can't figure out a good way of pulling out the codenames (B08524_052 for example), and converting them into a dictionary with an index based on their position.

# make array to store filelist, and codes with dictionary values
for i in range( len(tmpFileList)):
    arrayListDict.append([tmpFileList[i], {}])

This code just produces [['Seq3.xls', {}], ['Seq98.xls', {}]] ; I'm not sure whether I should first produce the structure and then try and add the code and dictionary values in, or whether there is a better way.

-- EDIT: I just made sample a little more clear by changing the values in fileListCode


With, itertools.groupby this process will be much simplier:

>>> key = operator.itemgetter(0)
>>> grouped = itertools.groupby(sorted(fileListCode, key=key), key=key)
>>> [(i, {k[1]: n for n, k in enumerate(j, 1)}) for i, j in grouped]
[('Seq3.xls', {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3}),
 ('Seq98.xls', {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3})]

For old Python versions:

>>> [(i, dict((k[1], n) for n, k in enumerate(j, 1))) for i, j in grouped]
[('Seq3.xls', {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3}),
 ('Seq98.xls', {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3})]

But I think using dict would be better:

>>> {i: {k[1]: n for n, k in enumerate(j, 1)} for i, j in grouped}
{'Seq3.xls': {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3},
 'Seq98.xls': {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3}}


You've confused lists and dictonaries.

It would make far more sense to do something more like this:

file_list_code = [['Seq3.xls', 'B08524_052'],
                  ['Seq3.xls', 'B08524_053'],                  
                  ['Seq3.xls', 'B08524_054'],                 
                  ['Seq98.xls', 'B25034_001'],                  
                  ['Seq98.xls', 'B25034_002'],                  
                  ['Seq98.xls', 'B25034_003']] 

file_codes = {}
for name, code in file_list_code:
    if name not in file_codes:
        file_codes[name] = []
    file_codes[name].append(code)

This yields:

{'Seq3.xls': ['B08524_052', 'B08524_053', 'B08524_054'], 
'Seq98.xls': ['B25034_001', 'B25034_002', 'B25034_003']}

This could be further simplifed by using a defaultdict. It's arguably overkill for something this simple, but it's good to know about. Here's an example:

import collections

file_list_code = [['Seq3.xls', 'B08524_052'],
                  ['Seq3.xls', 'B08524_053'],                  
                  ['Seq3.xls', 'B08524_054'],                 
                  ['Seq98.xls', 'B25034_001'],                  
                  ['Seq98.xls', 'B25034_002'],                  
                  ['Seq98.xls', 'B25034_003']] 

file_codes = collections.defaultdict(list)
for name, code in file_list_code:
    file_codes[name].append(code)


fileListCode = [['Seq3.xls', 'B08524_052'],
                ['Seq3.xls', 'B08524_053'],
                ['Seq3.xls', 'B08524_054'],
                ['Seq98.xls', 'B25034_001'],
                ['Seq98.xls', 'B25034_002'],
                ['Seq98.xls', 'B25034_003']]

dico = {}
li = []
for a,b in fileListCode:

    if a in dico:
        li[dico[a]][1][b] = len( li[dico[a]][1] ) + 1


    else:
        dico[a] = len(li)
        li.append([a,{b:1}])


print '\n'.join(map(str,li))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜