memory usage of a probabilistic parser
I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I've written a prototype [1] in Python, and it seems to work well when I simulate a treebank of a couple dozen sentences, but the memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here's some data (n is the number of sentences the grammar is based on):
n     mem
 9    173M
18    486M
36    836M
This growth pattern is to be expected given the best-first algorithm, but the amount of overhead is what concerns me. The memory usage according to heapy is a factor of ten smaller than these numbers, and valgrind reported something similar. What causes this discrepancy, and is there anything I can do about it in Python (or Cython)? Perhaps it's due to fragmentation? Or maybe it is the overhead of Python dictionaries?
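(For reference, the heapy figures come from a session along these lines; this is just a sketch, using the guppy package that provides heapy:)

from guppy import hpy

h = hpy()
# ... run the parser so the chart and agenda are populated ...
print h.heap()  # per-type breakdown of live Python objects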
Some background: the two important data structures are the agenda, mapping edges to probabilities, and the chart, a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and a heapq list). The agenda is frequently inserted into and removed from; the chart only gets insertions and lookups. I represent edges with tuples like this:
(("S", 111), ("NP", 010), ("VP", 100, 001))
The strings are the nonterminal labels from the grammar; the positions are encoded as bitmasks. There can be multiple positions when a constituent is discontinuous. So this edge could represent an analysis of "is Mary happy", where "is" and "happy" both belong to the VP. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation, in the hope that it would save memory due to reuse:
(("S", "NP", "VP), (111, 100, 011))
I figured that Python would store the first part only once if it occurred in combination with different positions, although I'm not actually sure this is true. In any case, it didn't seem to make any difference.
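To make the background concrete, here is a minimal sketch of the agenda and chart as described above (the 0b literals spell out the bitmasks; keeping a list of edge tails under each head item is my assumption):

from heapdict import heapdict

# Agenda: maps edges to priorities; popitem() returns the entry with the
# lowest priority value, which is what drives the best-first order.
agenda = heapdict()
edge = (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001))
agenda[edge] = 0.5                      # insert, or update the priority

# Chart: indexed by the head item of the edge, ("S", 0b111) here.
chart = {}
head, tail = edge[0], edge[1:]
chart.setdefault(head, []).append(tail)

best_edge, priority = agenda.popitem()  # pop the most promising edge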
So basically what I am wondering is whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from the ground up in C++ is the only viable option.
UPDATE: After some improvements I no longer have issues with memory usage. I'm working on an optimized Cython version. I'll award the bounty to the most useful suggestion for increasing the efficiency of the code. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html
[1] https://github.com/andreasvc/disco-dop/ -- run test.py to parse some sentences. Requires Python 2.6, NLTK, and heapdict.
"I figured that Python would store the first part only once if it would occur in combination with different positions"

Not necessarily:
>>> ("S", "NP", "VP") is ("S", "NP", "VP")
False
You might want to intern all strings referring to non-terminals, since you seem to be creating a lot of these in rcgrules.py. If you want to intern a tuple, turn it into a string first:
>>> intern("S NP VP") is intern(' '.join(('S', 'NP', 'VP')))
True
Otherwise, you'll have to "copy" the tuples instead of constructing them afresh.
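For example, interning could be applied where the edge tuples are built (make_item is a hypothetical helper; intern is a builtin in Python 2, which the project targets):

def make_item(label, positions):
    # intern() returns a canonical shared string object, so every edge
    # mentioning "NP" points at one "NP" instead of a fresh copy.
    return (intern(label),) + tuple(positions)

edge = (make_item("S", [0b111]),
        make_item("NP", [0b010]),
        make_item("VP", [0b100, 0b001]))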
(If you're new to C++, then rewriting such an algorithm in it is unlikely to provide much of a memory benefit. You'd have to evaluate various hash table implementations first and learn about the copying behavior of their containers. I've found boost::unordered_map to be quite wasteful with lots of small hash tables.)
Have you tried running your application with PyPy rather than CPython?
PyPy is a lot smarter than CPython about noticing commonalities and avoiding the memory overhead associated with duplicating things unnecessarily.
It's worth trying, anyway: http://pypy.org/
The first thing to do in these cases is always to profile:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
15147/297 0.032 0.000 0.041 0.000 tree.py:102(__eq__)
15400/200 0.031 0.000 0.106 0.001 tree.py:399(convert)
1 0.023 0.023 0.129 0.129 plcfrs_cython.pyx:52(parse)
6701/1143 0.022 0.000 0.043 0.000 heapdict.py:45(_min_heapify)
18212 0.017 0.000 0.023 0.000 plcfrs_cython.pyx:38(__richcmp__)
10975/10875 0.017 0.000 0.035 0.000 tree.py:75(__init__)
5772 0.016 0.000 0.050 0.000 tree.py:665(__init__)
960 0.016 0.000 0.025 0.000 plcfrs_cython.pyx:118(deduced_from)
46938 0.014 0.000 0.014 0.000 tree.py:708(_get_node)
25220/2190 0.014 0.000 0.016 0.000 tree.py:231(subtrees)
10975 0.013 0.000 0.023 0.000 tree.py:60(__new__)
49441 0.013 0.000 0.013 0.000 {isinstance}
16748 0.008 0.000 0.015 0.000 {hasattr}
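(For reference, a profile like the one above can be produced with the standard library's cProfile; for the .pyx functions to show up, the module has to be compiled with Cython's profile=True directive. main() stands in for the real entry point:)

# At the top of plcfrs_cython.pyx:
# cython: profile=True

import cProfile
import pstats

cProfile.run('main()', 'parser.prof')
pstats.Stats('parser.prof').sort_stats('time').print_stats(20)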
The first thing I noticed is that very few functions are from the Cython module itself; most of them come from the tree.py module, and maybe that is the bottleneck.

Focusing on the Cython side, I see the __richcmp__ function. We can optimize it simply by adding the types of the values in the method declaration:
def __richcmp__(ChartItem self, ChartItem other, int op):
    ....
This brings the time down:
ncalls tottime percall cumtime percall filename:lineno(function)
....
18212 0.011 0.000 0.015 0.000 plcfrs_cython.pyx:38(__richcmp__)
Using an elif chain instead of single if statements enables Cython's switch optimization (the dispatch on the int op argument compiles down to a C switch):
if op == 0: return self.label < other.label or self.vec < other.vec
elif op == 1: return self.label <= other.label or self.vec <= other.vec
elif op == 2: return self.label == other.label and self.vec == other.vec
elif op == 3: return self.label != other.label or self.vec != other.vec
elif op == 4: return self.label > other.label or self.vec > other.vec
elif op == 5: return self.label >= other.label or self.vec >= other.vec
obtaining:
17963 0.002 0.000 0.002 0.000 plcfrs_cython.pyx:38(__richcmp__)
Trying to figure out where that tree.py:399 (convert) comes from, I found out that this function inside dopg.py takes all that time:
def removeids(tree):
    """ remove unique IDs introduced by the Goodman reduction """
    result = Tree.convert(tree)
    for a in result.subtrees(lambda t: '@' in t.node):
        a.node = a.node.rsplit('@', 1)[0]
    if isinstance(tree, ImmutableTree): return result.freeze()
    return result
Now, I am not sure whether each node in the tree is a ChartItem, and whether the __getitem__ value is being used somewhere else, but adding these changes:
cdef class ChartItem:
    cdef public str label
    cdef public str root
    cdef public long vec
    cdef int _hash
    __slots__ = ("label", "vec", "_hash")
    def __init__(ChartItem self, label, int vec):
        self.label = intern(label) #.rsplit('@', 1)[0])
        self.root = intern(label.rsplit('@', 1)[0])
        self.vec = vec
        self._hash = hash((self.label, self.vec))
    def __hash__(self):
        return self._hash
    def __richcmp__(ChartItem self, ChartItem other, int op):
        if op == 0: return self.label < other.label or self.vec < other.vec
        elif op == 1: return self.label <= other.label or self.vec <= other.vec
        elif op == 2: return self.label == other.label and self.vec == other.vec
        elif op == 3: return self.label != other.label or self.vec != other.vec
        elif op == 4: return self.label > other.label or self.vec > other.vec
        elif op == 5: return self.label >= other.label or self.vec >= other.vec
    def __getitem__(ChartItem self, int n):
        if n == 0: return self.root
        elif n == 1: return self.vec
    def __repr__(self):
        # would need bitlen for proper padding
        return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])
and, inside mostprobableparse:
from libc.math cimport pow  # note: pow is declared in libc.math, not libc

def mostprobableparse(...):
    ...
    cdef dict parsetrees = <dict>defaultdict(float)  # defaultdict from collections
    cdef float prob
    m = 0
    for n, (a, prob) in enumerate(derivations):
        parsetrees[a] += pow(e, prob)  # e is math.e, imported elsewhere
        m += 1
I get:
189345 function calls (173785 primitive calls) in 0.162 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6701/1143 0.025 0.000 0.037 0.000 heapdict.py:45(_min_heapify)
1 0.023 0.023 0.120 0.120 plcfrs_cython.pyx:54(parse)
960 0.018 0.000 0.030 0.000 plcfrs_cython.pyx:122(deduced_from)
5190/198 0.011 0.000 0.015 0.000 tree.py:102(__eq__)
6619 0.006 0.000 0.006 0.000 heapdict.py:67(_swap)
9678 0.006 0.000 0.008 0.000 plcfrs_cython.pyx:137(concat)
So the next steps are to optimize _min_heapify and deduced_from.
deduced_from can be optimized a bit more:
cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
    cdef str I = Ih.label
    cdef int Ir = Ih.vec
    cdef list result = []
    cdef dict Cx = <dict>pyCx
    cdef dict unary = <dict>pyunary
    cdef dict lbinary = <dict>pylbinary
    cdef dict rbinary = <dict>pyrbinary
    cdef ChartItem Ilh
    cdef double z
    cdef double y
    cdef ChartItem I1h
    for rule, z in unary[I]:
        result.append((ChartItem(rule[0][0], Ir), ((x+z, z), (Ih,))))
    for rule, z in lbinary[I]:
        for I1h, y in Cx[rule[0][2]].items():
            if concat(rule[1], Ir, I1h.vec, bitlen):
                result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
    for rule, z in rbinary[I]:
        for I1h, y in Cx[rule[0][1]].items():
            if concat(rule[1], I1h.vec, Ir, bitlen):
                result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
    return result
I will stop here, although I am confident that we can keep optimizing as more insight is acquired into the problem.

A series of unit tests would be useful, to assert that each optimization doesn't introduce any subtle errors.
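A minimal sketch of such a test (the module name is taken from the profile output above; the expected values are illustrative):

import unittest
from plcfrs_cython import ChartItem

class TestChartItem(unittest.TestCase):
    def test_getitem_strips_goodman_id(self):
        # __getitem__(0) should return the label without the @id suffix
        self.assertEqual(ChartItem("NP@37", 0b010)[0], "NP")

    def test_richcmp_equality(self):
        # equal label and bitmask must still compare equal after the elif rewrite
        self.assertEqual(ChartItem("S", 0b111), ChartItem("S", 0b111))

if __name__ == '__main__':
    unittest.main()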
As a side note, try to use spaces instead of tabs.