How can I split a 2D array into an array with unique values and a dictionary?
I'm trying to split a 2D array into a specific format and can't figure out the last step. A sample of my data is structured as follows:
# Original Data
fileListCode = [['Seq3.xls', 'B08524_057'],
['Seq3.xls', 'B08524_053'],
['Seq3.xls', 'B08524_054'],
['Seq98.xls', 'B25034_001'],
['Seq98.xls', 'D25034_002'],
['Seq98.xls', 'B25034_003']]
I am trying to split it up so that it looks like this:
# split into [['Seq3.xls', {'B08524_057':1,'B08524_053':2, 'B08524_054':3}],
#             ['Seq98.xls',{'B25034_001':1,'D25034_002':2, 'B25034_003':3}]]
The dictionary values 1, 2, 3 are based on the original position of the entry, counting from the first time that the filename appears. To do this, I've first made an array to get all the unique file names (anything that is .xls is a filename):
tmpFileList = []
tmpCodeList = []
arrayListDict = []
# store unique filelist in a temporary array:
for i in range( len(fileListCode)):
    if fileListCode[i][0] not in tmpFileList:
        tmpFileList.append( fileListCode[i][0] )
However, I'm struggling with the next step. I can't figure out a good way of pulling out the codenames (B08524_052 for example), and converting them into a dictionary with an index based on their position.
# make array to store filelist, and codes with dictionary values
for i in range( len(tmpFileList)):
    arrayListDict.append([tmpFileList[i], {}])
This code just produces [['Seq3.xls', {}], ['Seq98.xls', {}]]; I'm not sure whether I should first produce the structure and then try to add the codes and dictionary values in, or whether there is a better way.
--
EDIT: I just made the sample a little clearer by changing the values in fileListCode.
With itertools.groupby, this process becomes much simpler:
>>> import itertools, operator
>>> key = operator.itemgetter(0)
>>> grouped = itertools.groupby(sorted(fileListCode, key=key), key=key)
>>> [(i, {k[1]: n for n, k in enumerate(j, 1)}) for i, j in grouped]
[('Seq3.xls', {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3}),
('Seq98.xls', {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3})]
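For old Python versions without dict comprehensions (re-create grouped first, because the groupby iterator is exhausted after one pass):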
>>> [(i, dict((k[1], n) for n, k in enumerate(j, 1))) for i, j in grouped]
[('Seq3.xls', {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3}),
('Seq98.xls', {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3})]
But I think using a dict for the outer structure would be better (again starting from a fresh grouped):
>>> {i: {k[1]: n for n, k in enumerate(j, 1)} for i, j in grouped}
{'Seq3.xls': {'B08524_052': 1, 'B08524_053': 2, 'B08524_054': 3},
'Seq98.xls': {'B25034_001': 1, 'B25034_002': 2, 'B25034_003': 3}}
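As a self-contained sketch of the approach above, the grouping can be wrapped in a function so that a fresh groupby iterator is built on every call. The function name group_codes is hypothetical, and the data is the sample from the question. Because sorted is stable, codes keep their original relative order within each file; the filenames themselves come out in sorted order rather than in order of first appearance.

```python
import itertools
import operator

file_list_code = [
    ['Seq3.xls', 'B08524_057'],
    ['Seq3.xls', 'B08524_053'],
    ['Seq3.xls', 'B08524_054'],
    ['Seq98.xls', 'B25034_001'],
    ['Seq98.xls', 'D25034_002'],
    ['Seq98.xls', 'B25034_003'],
]


def group_codes(rows):
    """Return [[filename, {code: position}], ...] (hypothetical helper)."""
    key = operator.itemgetter(0)
    # groupby only merges adjacent rows, so sort by filename first.
    grouped = itertools.groupby(sorted(rows, key=key), key=key)
    # Number the codes of each file from 1, in their original order.
    return [[name, {row[1]: n for n, row in enumerate(group, 1)}]
            for name, group in grouped]


result = group_codes(file_list_code)
```

If the filenames must appear in order of first occurrence instead, sorting is the wrong tool and a plain loop over the rows is probably the simpler choice.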
You've confused lists and dictionaries.
It would make far more sense to do something more like this:
file_list_code = [['Seq3.xls', 'B08524_052'],
['Seq3.xls', 'B08524_053'],
['Seq3.xls', 'B08524_054'],
['Seq98.xls', 'B25034_001'],
['Seq98.xls', 'B25034_002'],
['Seq98.xls', 'B25034_003']]
file_codes = {}
for name, code in file_list_code:
    if name not in file_codes:
        file_codes[name] = []
    file_codes[name].append(code)
This yields:
{'Seq3.xls': ['B08524_052', 'B08524_053', 'B08524_054'],
'Seq98.xls': ['B25034_001', 'B25034_002', 'B25034_003']}
This could be further simplified by using a defaultdict. It's arguably overkill for something this simple, but it's good to know about. Here's an example:
import collections
file_list_code = [['Seq3.xls', 'B08524_052'],
['Seq3.xls', 'B08524_053'],
['Seq3.xls', 'B08524_054'],
['Seq98.xls', 'B25034_001'],
['Seq98.xls', 'B25034_002'],
['Seq98.xls', 'B25034_003']]
file_codes = collections.defaultdict(list)
for name, code in file_list_code:
    file_codes[name].append(code)
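The lists built this way hold the codes in their original order, so the position dictionaries from the question can be derived from them with enumerate. The following is only a sketch under that assumption; the name file_positions is made up for illustration.

```python
import collections

file_list_code = [
    ['Seq3.xls', 'B08524_052'],
    ['Seq3.xls', 'B08524_053'],
    ['Seq3.xls', 'B08524_054'],
    ['Seq98.xls', 'B25034_001'],
    ['Seq98.xls', 'B25034_002'],
    ['Seq98.xls', 'B25034_003'],
]

# Collect the codes for each filename, in order of appearance.
file_codes = collections.defaultdict(list)
for name, code in file_list_code:
    file_codes[name].append(code)

# Turn each list of codes into {code: position}, counting from 1.
file_positions = {
    name: {code: n for n, code in enumerate(codes, 1)}
    for name, codes in file_codes.items()
}
```

Note that if the same code appears twice under one filename, the later position overwrites the earlier one in the dictionary.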
fileListCode = [['Seq3.xls', 'B08524_052'],
['Seq3.xls', 'B08524_053'],
['Seq3.xls', 'B08524_054'],
['Seq98.xls', 'B25034_001'],
['Seq98.xls', 'B25034_002'],
['Seq98.xls', 'B25034_003']]
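dico = {}   # filename -> index of its entry in li
li = []     # [[filename, {code: position}], ...]
for a,b in fileListCode:
    if a in dico:
        # filename already seen: give the code the next position
        li[dico[a]][1][b] = len( li[dico[a]][1] ) + 1
    else:
        # first time this filename appears
        dico[a] = len(li)
        li.append([a,{b:1}])
print('\n'.join(map(str,li)))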
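On Python 3.7 and later, where plain dicts preserve insertion order, the separate index dictionary can probably be dropped. This is a minimal sketch of the same single-pass idea, not the answer's own code; the function name split_by_file is hypothetical.

```python
def split_by_file(rows):
    """Return [[filename, {code: position}], ...] in order of first appearance."""
    groups = {}  # filename -> {code: position}; relies on dict insertion order
    for name, code in rows:
        codes = groups.setdefault(name, {})
        # The next position is one more than the number of codes stored so far.
        codes[code] = len(codes) + 1
    return [[name, codes] for name, codes in groups.items()]


rows = [
    ['Seq98.xls', 'B25034_001'],
    ['Seq3.xls', 'B08524_052'],
    ['Seq98.xls', 'B25034_002'],
    ['Seq3.xls', 'B08524_053'],
]
result = split_by_file(rows)
```

Unlike a sort-based approach, this keeps the filenames in the order they first occur and also handles rows for one file that are not adjacent.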