Organizing XML data into dictionaries
I'm trying to organize my data into a dictionary format from XML data. This will be used to run Monte Carlo simulations.
Here is an example of what a couple of entries in the XML look like:
<retirement>
<item>
<low>-0.34</low>
<high>-0.32</high>
<freq>0.0294117647058824</freq>
<variable>stock</variable>
<type>historic</type>
</item>
<item>
<low>-0.32</low>
<high>-0.29</high>
<freq>0</freq>
<variable>stock</variable>
<type>historic</type>
</item>
</retirement>
My current data sets have only two variables, and the type can be one of three or possibly four discrete types. Hard-coding two variables isn't a problem, but I would like to start working with data that has many more variables and automate this process. My goal is to import this XML data into a dictionary automatically so that I can manipulate it later without having to hard-code the array titles and the variables.
Here is what I have:
# Import XML Parser
import xml.etree.ElementTree as ET
# Parse XML directly from the file path
tree = ET.parse('xmlfile')
# Create iterable item list
Items = tree.findall('item')
# Create Master Dictionary
masterDictionary = {}
# Assign variables to dictionary
for Item in Items:
thisKey = Item.find('variable').text
if thisKey in masterDictionary == False:
masterDictionary[thisKey] = []
else:
pass
thisList = masterDictionary[thisKey]
newDataPoint = DataPoint(float(Item.find('low').text), float(Item.find('high').text), float(Item.find('freq').text))
thisSublist.append(newDataPoint)
I'm getting a KeyError at the line thisList = masterDictionary[thisKey]
I am also trying to create a class to deal with some of the other elements of the XML:
# Define a class for each data point that contains low, high and freq attributes
class DataPoint:
def __init__(self, low, high, freq):
self.low = low
self.high = high
self.freq = freq
Would I then be able to check a value with something like:
masterDictionary['stock'][0].freq
Any and all help is appreciated
UPDATE
Thanks for the help John. The indentation issues are sloppiness on my part. It's my first time posting on Stack and I just didn't get the copy/paste right. The part after the else: is in fact indented to be a part of the for loop and the class is indented with four spaces in my code--just a bad posting here. I'll keep the capitalization convention in mind. Your suggestion indeed worked and now with the commands:
print masterDictionary.keys()
print masterDictionary['stock'][0].low
yields:
['inflation', 'stock']
-0.34
Those are indeed my two variables, and the value matches the XML listed at the top.
UPDATE 2
Well, I thought I had figured this one out, but I was careless again and it turns out that I hadn't quite fixed the issue. The previous solution ended up writing all of the data to my two dictionary keys so that I have two equal lists of all the data assigned to two different dictionary keys. The idea is to have distinct sets of data assigned from the XML to the matching dictionary key. Here is the current code:
# Import XML Parser
import xml.etree.ElementTree as ET
# Parse XML directly from the file path
tree = ET.parse('xmlfile')
# Create iterable item list
items = tree.findall('item')
# Create class for historic variables
class DataPoint:
    def __init__(self, low, high, freq):
        self.low = low
        self.high = high
        self.freq = freq
# Create Master Dictionary and variable list for historic variables
masterDictionary = {}
thisList = []
# Loop to assign variables as dictionary keys and associate their values with them
for item in items:
    thisKey = item.find('variable').text
    masterDictionary[thisKey] = thisList
    if thisKey not in masterDictionary:
        masterDictionary[thisKey] = []
    newDataPoint = DataPoint(float(item.find('low').text), float(item.find('high').text), float(item.find('freq').text))
    thisList.append(newDataPoint)
When I input:
print masterDictionary['stock'][5].low
print masterDictionary['inflation'][5].low
print len(masterDictionary['stock'])
print len(masterDictionary['inflation'])
the results are identical for both keys ('stock' and 'inflation'):
-.22
-.22
56
56
There are 27 items with the stock tag in the XML file and 29 tagged with inflation. How can I make each list assigned to a dictionary key only pull the particular data in the loop?
UPDATE 3
It seems to work with 2 loops, but I have no idea how and why it won't work in 1 single loop. I managed this accidentally:
# Import XML Parser
import xml.etree.ElementTree as ET
# Parse XML directly from the file path
tree = ET.parse('xmlfile')
# Create iterable item list
items = tree.findall('item')
# Create class for historic variables
class DataPoint:
    def __init__(self, low, high, freq):
        self.low = low
        self.high = high
        self.freq = freq
# Create Master Dictionary and variable list for historic variables
masterDictionary = {}
# Loop to assign variables as dictionary keys and associate their values with them
for item in items:
    thisKey = item.find('variable').text
    thisList = []
    masterDictionary[thisKey] = thisList
for item in items:
    thisKey = item.find('variable').text
    newDataPoint = DataPoint(float(item.find('low').text), float(item.find('high').text), float(item.find('freq').text))
    masterDictionary[thisKey].append(newDataPoint)
I have tried a large number of permutations to make it happen in one single loop but no luck. I can get all of the data listed into both keys--identical arrays of all the data (not very helpful), or the data sorted correctly into 2 distinct arrays for both keys, but only the last single data entry (the loop overwrites itself each time leaving you with only one entry in the array).
You have a serious indentation problem after the (unnecessary) else: pass. Fix that and try again. Does the problem occur with your sample input data? Other data? The first time around the loop? What is the value of thisKey that is causing the problem [hint: it's reported in the KeyError error message]? What are the contents of masterDictionary just before the error happens [hint: sprinkle a few print statements around your code]?
Other remarks not relevant to your problem:
Instead of if thisKey in masterDictionary == False: consider using if thisKey not in masterDictionary: ... comparisons against True or False are almost always redundant and/or a bit of a "code smell".
Python convention is to reserve names with an initial capital letter (like Item) for classes.
Using only one space per indentation level makes code almost illegible and is severely deprecated. Use 4 always (unless you have a good reason -- but I've never heard of one).
Update: I was wrong. thisKey in masterDictionary == False is worse than I thought: because in is a relational operator, chained evaluation is used (like a <= b < c), so you have (thisKey in masterDictionary) and (masterDictionary == False), which will always evaluate to False, and thus the dictionary is never updated. The fix is as I suggested: use if thisKey not in masterDictionary:
Also, it looks like thisList (initialised but not used) should be thisSublist (used but not initialised).
Change:
if thisKey in masterDictionary == False:
to
if thisKey not in masterDictionary:
That seems to be why you were getting that error. Also, you need to assign something to thisSublist before you try to append to it. Try:
thisSublist = []
thisSublist.append(newDataPoint)
You have an error in your if-statement inside the for-loop. Instead of
if thisKey in masterDictionary == False:
write
if (thisKey in masterDictionary) == False:
Given the rest of your original code, you will be able to access data like so:
masterDictionary['stock'][0].freq
John Machin makes some valid points regarding style and smell (and you should think about his suggested changes), but those things will come with time and experience.
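On the single-loop question from UPDATE 2: that code binds every key to the same thisList object, so both keys end up showing all 56 points. One possible way to do the grouping in a single pass is to create a fresh list the first time each key is seen, for example with dict.setdefault. The sketch below is an assumption-laden illustration, not the asker's final code: group_by_variable and SAMPLE_XML are hypothetical names, the XML is parsed from an inline string instead of a file, and the inflation item's values are invented.

```python
import xml.etree.ElementTree as ET

# Inline sample in the same shape as the question's XML.
# The inflation item's numbers are invented for illustration.
SAMPLE_XML = """
<retirement>
  <item>
    <low>-0.34</low><high>-0.32</high><freq>0.0294117647058824</freq>
    <variable>stock</variable><type>historic</type>
  </item>
  <item>
    <low>0.01</low><high>0.02</high><freq>0.1</freq>
    <variable>inflation</variable><type>historic</type>
  </item>
  <item>
    <low>-0.32</low><high>-0.29</high><freq>0</freq>
    <variable>stock</variable><type>historic</type>
  </item>
</retirement>
"""


class DataPoint:
    """One histogram bin: low and high bounds plus a frequency."""

    def __init__(self, low, high, freq):
        self.low = low
        self.high = high
        self.freq = freq


def group_by_variable(root):
    """Return {variable name: [DataPoint, ...]} built in a single loop."""
    master = {}
    for item in root.findall('item'):
        key = item.find('variable').text
        point = DataPoint(
            float(item.find('low').text),
            float(item.find('high').text),
            float(item.find('freq').text),
        )
        # setdefault stores a new empty list only when the key is missing,
        # so each variable gets its own list object.
        master.setdefault(key, []).append(point)
    return master


root = ET.fromstring(SAMPLE_XML.strip())
master_dictionary = group_by_variable(root)

print(sorted(master_dictionary))
print(len(master_dictionary['stock']), len(master_dictionary['inflation']))
print(master_dictionary['stock'][0].low)
```

collections.defaultdict(list) would be another way to get the same effect; the important part is that no list object is shared between keys.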