How can I read data into dictionary keys in a way that makes sense?
I have about a thousand files that are named in a semi-sensible way like the following:
aaa.ba.ca.01
aaa.ba.ca.02
aaa.ba.ca.03
aaa.ba.da.01
aaa.ba.da.02
aaa.ba.da.03
and so on. Let's say each file contains 2 columns of numbers which I need to read in to the dictionaries: wavelength, flux. The reading in part is easy for me, the hard part is that I need to load these dictionaries so that they store the information like:
wavelength['aaa.ba.ca.01'] (which is the wavelengths of that one file)
wavelength['aaa.ba.ca'] (which is the wavelengths of all subfiles ie ...ca.01, ...ca.02, and ...ca.03 -- in order)
wavelength['aaa.ba'] (which also includes all wavelengths of all "subfiles" as well -- again in order).
and so on. The filenames are well-behaved (the sections are separated by periods, the grouping hierarchy always runs in the same direction, etc.), but a filename can be anywhere from 4 to 8 sections long.
My question: is there some sensible way to have python glob the names of the files and, by parsing strings or some other magic, get the data into these dictionaries? I've hit a brick wall.
A simple, but not efficient, way to do this is to subclass Python's dictionary so that, when given an incomplete key, it concatenates the contents of all matching keys in alphabetical order.
There could be more efficient designs: the one major drawback of this one is that it sorts and checks every existing dictionary key on each partial-key request. Otherwise, it is so simple to implement that it is worth a try:
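class MultiDict(dict):
    def __getitem__(self, key):
        # An exact key behaves like a normal dictionary lookup.
        if key in self:
            return dict.__getitem__(self, key)
        # Otherwise concatenate the values of every key that starts
        # with the partial key, in alphabetical order.
        result = []
        for complete_key in sorted(self.keys()):
            if complete_key.startswith(key):
                result.extend(dict.__getitem__(self, complete_key))
        return result

# example
a = MultiDict()
a["a0"] = [1]
a["a1"] = [2]
print(a["a"])
# prints [1, 2]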
As for getting the data into the dictionary, just iterate over all the files with glob or os.listdir, and read the desired contents, as a list, into a MultiDict item using the filename as the key.
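A minimal sketch of that loading step, assuming each file holds two whitespace-separated numeric columns as described in the question. The function name load_columns and its parameters are hypothetical; you can pass in any dict-like objects (for example two MultiDict instances) to be filled.

```python
import glob
import os


def load_columns(directory, pattern="*", wavelength=None, flux=None):
    """Read every two-column file matching pattern into two mappings keyed by file name.

    load_columns is a hypothetical helper; wavelength and flux may be any
    dict-like objects, and plain dicts are created when they are omitted.
    """
    wavelength = {} if wavelength is None else wavelength
    flux = {} if flux is None else flux
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        if not os.path.isfile(path):
            continue  # ignore sub-directories that happen to match
        waves, fluxes = [], []
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if len(fields) < 2:
                    continue  # skip blank or malformed lines
                waves.append(float(fields[0]))
                fluxes.append(float(fields[1]))
        name = os.path.basename(path)  # key on the bare file name
        wavelength[name] = waves
        flux[name] = fluxes
    return wavelength, flux
```

Keying on the bare file name rather than the full path keeps lookups such as wavelength['aaa.ba.ca'] independent of where the files live.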
What you want does not sound like a dictionary at all. In many ways, I'd say that this is a data structure comparable to a tree. So instead of using a dictionary you're going to want to make a data structure wherein you've got a first node:
                     Root
      'ba'        'ca'      'cd'     'fg'
     /  |  \      /  \      /  \      |
    /   |   \    /    \    /    \     |
 'aa' 'di' '30' '34' '45' 'ac' 'ty'  '01'
and then perform a depth-first search in which the name you pass determines where the search starts: 'ba.aa' would only return things from the 'ba'->'aa' leaf, while 'ba' would return 'ba'->'aa', 'ba'->'di', and 'ba'->'30'.
I'd make each "level" of nesting its own dictionary. That way you could map quickly to the wavelengths and such.
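A rough sketch of that nested-dictionary tree, with the depth-first collection done recursively. The helper names insert, collect, and lookup are made up for illustration, and the sketch assumes no file name is itself a prefix of another file name (otherwise a node would have to be a leaf and a branch at once).

```python
def insert(tree, name, data):
    """Store data under the dotted name, using one nested dict per section."""
    parts = name.split(".")
    node = tree
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = data


def collect(node):
    """Depth-first walk that concatenates every leaf list below node, in sorted key order."""
    if not isinstance(node, dict):
        return list(node)  # a leaf: the data of one file
    result = []
    for key in sorted(node):
        result.extend(collect(node[key]))
    return result


def lookup(tree, prefix):
    """Return the combined data for a full name or for any dotted prefix of it."""
    node = tree
    for part in prefix.split("."):
        node = node[part]  # raises KeyError for an unknown section
    return collect(node)


wavelength = {}
insert(wavelength, "aaa.ba.ca.01", [1.0, 2.0])
insert(wavelength, "aaa.ba.ca.02", [3.0])
insert(wavelength, "aaa.ba.da.01", [4.0])

print(lookup(wavelength, "aaa.ba.ca.01"))  # [1.0, 2.0]
print(lookup(wavelength, "aaa.ba.ca"))     # [1.0, 2.0, 3.0]
print(lookup(wavelength, "aaa.ba"))        # [1.0, 2.0, 3.0, 4.0]
```

Walking down the sections first means a lookup only touches the part of the tree below the requested prefix, instead of scanning every file name.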
If you only have 1000 files, a linear search to look them up is probably fine. On my machine one lookup took about 250 µs. You can then use itertools.chain to combine the data from multiple files.
import itertools

class DataGlob(object):
    def __init__(self):
        self.files = []
        self.wavedata = dict()
        self.fluxdata = dict()

    def add(self, filename):
        wlist = []
        flist = []
        with open(filename) as f:
            for l in f:
                (wlen, flux) = map(float, l.split())
                wlist.append(wlen)
                flist.append(flux)
        # Remember the filename so find_keys can search it later.
        self.files.append(filename)
        self.wavedata[filename] = wlist
        self.fluxdata[filename] = flist

    def find_keys(self, prefix):
        # Linear search over every filename added so far.
        return [f for f in self.files if f.startswith(prefix)]

    def wavelength(self, fileprefix):
        validkeys = self.find_keys(fileprefix)
        return itertools.chain.from_iterable(self.wavedata[k] for k in validkeys)

    def flux(self, fileprefix):
        validkeys = self.find_keys(fileprefix)
        return itertools.chain.from_iterable(self.fluxdata[k] for k in validkeys)
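One thing to watch with a plain startswith test is that the prefix 'aaa.ba.ca' would also match a name such as 'aaa.ba.cab.01'. The sketch below is one possible way to match whole dot-separated sections only and to chain the results in sorted order; matching_keys and combined are hypothetical helpers, and the sample data is invented.

```python
import itertools


def matching_keys(keys, prefix):
    """Return keys equal to prefix or extending it by whole dot-separated sections, sorted."""
    return sorted(k for k in keys if k == prefix or k.startswith(prefix + "."))


def combined(data, prefix):
    """Lazily chain the lists stored under every matching key."""
    return itertools.chain.from_iterable(data[k] for k in matching_keys(data, prefix))


# Invented sample data standing in for the per-file wavelength lists.
wavedata = {
    "aaa.ba.ca.01": [1.0, 2.0],
    "aaa.ba.ca.02": [3.0],
    "aaa.ba.cab.01": [9.0],
}
print(list(combined(wavedata, "aaa.ba.ca")))  # [1.0, 2.0, 3.0]
```

Because chain.from_iterable returns an iterator, wrap the result in list() if you need to index it or iterate over it more than once.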