memory usage of a probabilistic parser
I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I've written a prototype [1] in Python, and it seems to work well when I simulate a treebank of a couple dozen sentences, but the memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here's some data (n is the number of sentences the grammar is based on):
n     mem
 9    173M
18    486M
36    836M
This growth pattern is to be expected given the best-first algorithm, but the amount of overhead is what concerns me. The memory usage according to heapy is a factor of ten smaller than these numbers, and valgrind reported something similar. What causes this discrepancy, and is there anything I can do about it in Python (or Cython)? Perhaps it's due to fragmentation? Or maybe it is the overhead of Python dictionaries?
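(For reference, the heapy figures come from a session along these lines; this is just a sketch, using the guppy package that provides heapy:)

from guppy import hpy

h = hpy()
# ... run the parser so the chart and agenda are populated ...
print h.heap()  # per-type breakdown of live Python objects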
Some background: the two important data structures are the agenda, mapping edges to probabilities, and the chart, a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and a heapq list). The agenda is frequently inserted into and removed from; the chart only gets insertions and lookups. I represent edges with tuples like this:
(("S", 111), ("NP", 010), ("VP", 100, 001))
The strings are the nonterminal labels from the grammar; the positions are encoded as bitmasks. There can be multiple positions when a constituent is discontinuous. So this edge could represent an analysis of "is Mary happy", where "is" and "happy" both belong to the VP. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation, in the hope that it would save memory due to reuse:
(("S", "NP", "VP), (111, 100, 011))
I figured that Python would store the first part only once if it occurred in combination with different positions, although I'm not actually sure this is true. In any case, it didn't seem to make any difference.
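To make the background concrete, here is a minimal sketch of the agenda and chart as described above (the 0b literals spell out the bitmasks; keeping a list of edge tails under each head item is my assumption):

from heapdict import heapdict

# Agenda: maps edges to priorities; popitem() returns the entry with the
# lowest priority value, which is what drives the best-first order.
agenda = heapdict()
edge = (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001))
agenda[edge] = 0.5                      # insert, or update the priority

# Chart: indexed by the head item of the edge, ("S", 0b111) here.
chart = {}
head, tail = edge[0], edge[1:]
chart.setdefault(head, []).append(tail)

best_edge, priority = agenda.popitem()  # pop the most promising edge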
So basically what I am wondering is whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from the ground up in C++ is the only viable option.
UPDATE: After some improvements I no longer have issues with memory usage. I'm working on an optimized Cython version. I'll award the bounty to the most useful suggestion for increasing the efficiency of the code. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html
[1] https://github.com/andreasvc/disco-dop/ -- run test.py to parse some sentences. Requires Python 2.6, NLTK, and heapdict.
"I figured that Python would store the first part only once if it would occur in combination with different positions"

Not necessarily:
>>> ("S", "NP", "VP") is ("S", "NP", "VP")
False
You might want to intern all strings referring to non-terminals, since you seem to be creating a lot of these in rcgrules.py. If you want to intern a tuple, turn it into a string first:
>>> intern("S NP VP") is intern(' '.join(('S', 'NP', 'VP')))
True
Otherwise, you'll have to "copy" the tuples instead of constructing them afresh.
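For example, interning could be applied where the edge tuples are built (make_item is a hypothetical helper; intern is a builtin in Python 2, which the project targets):

def make_item(label, positions):
    # intern() returns a canonical shared string object, so every edge
    # mentioning "NP" points at one "NP" instead of a fresh copy.
    return (intern(label),) + tuple(positions)

edge = (make_item("S", [0b111]),
        make_item("NP", [0b010]),
        make_item("VP", [0b100, 0b001]))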
(If you're new to C++, then rewriting such an algorithm in it is unlikely to provide much of a memory benefit. You'd have to evaluate various hash table implementations first and learn about the copying behavior of their containers. I've found boost::unordered_map to be quite wasteful with lots of small hash tables.)
Have you tried running your application with PyPy rather than CPython?
PyPy is a lot smarter than CPython about noticing commonalities and avoiding the memory overhead associated with duplicating things unnecessarily.
It's worth trying, anyway: http://pypy.org/
The first thing to do in these cases is always to profile:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
15147/297 0.032 0.000 0.041 0.000 tree.py:102(__eq__)
15400/200 0.031 0.000 0.106 0.001 tree.py:399(convert)
1 0.023 0.023 0.129 0.129 plcfrs_cython.pyx:52(parse)
6701/1143 0.022 0.000 0.043 0.000 heapdict.py:45(_min_heapify)
18212 0.017 0.000 0.023 0.000 plcfrs_cython.pyx:38(__richcmp__)
10975/10875 0.017 0.000 0.035 0.000 tree.py:75(__init__)
5772 0.016 0.000 0.050 0.000 tree.py:665(__init__)
960 0.016 0.000 0.025 0.000 plcfrs_cython.pyx:118(deduced_from)
46938 0.014 0.000 0.014 0.000 tree.py:708(_get_node)
25220/2190 0.014 0.000 0.016 0.000 tree.py:231(subtrees)
10975 0.013 0.000 0.023 0.000 tree.py:60(__new__)
49441 0.013 0.000 0.013 0.000 {isinstance}
16748 0.008 0.000 0.015 0.000 {hasattr}
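(For reference, a profile like the one above can be produced with the standard library's cProfile; for the .pyx functions to show up, the module has to be compiled with Cython's profile=True directive. main() stands in for the real entry point:)

# At the top of plcfrs_cython.pyx:
# cython: profile=True

import cProfile
import pstats

cProfile.run('main()', 'parser.prof')
pstats.Stats('parser.prof').sort_stats('time').print_stats(20)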
The first thing I noticed is that very few functions are from the Cython module itself; most of them come from the tree.py module, and maybe that is the bottleneck.

Focusing on the Cython side, I see the __richcmp__ function. We can optimize it simply by adding the types of the values in the method declaration:
def __richcmp__(ChartItem self, ChartItem other, int op):
    ....
This brings the time down:
ncalls tottime percall cumtime percall filename:lineno(function)
....
18212 0.011 0.000 0.015 0.000 plcfrs_cython.pyx:38(__richcmp__)
Using an elif chain instead of single if statements enables Cython's switch optimization (the dispatch on the int op argument compiles down to a C switch):
if op == 0: return self.label < other.label or self.vec < other.vec
elif op == 1: return self.label <= other.label or self.vec <= other.vec
elif op == 2: return self.label == other.label and self.vec == other.vec
elif op == 3: return self.label != other.label or self.vec != other.vec
elif op == 4: return self.label > other.label or self.vec > other.vec
elif op == 5: return self.label >= other.label or self.vec >= other.vec
obtaining:
17963 0.002 0.000 0.002 0.000 plcfrs_cython.pyx:38(__richcmp__)
Trying to figure out where that tree.py:399 (convert) comes from, I found out that this function inside dopg.py takes all that time:
def removeids(tree):
    """ remove unique IDs introduced by the Goodman reduction """
    result = Tree.convert(tree)
    for a in result.subtrees(lambda t: '@' in t.node):
        a.node = a.node.rsplit('@', 1)[0]
    if isinstance(tree, ImmutableTree): return result.freeze()
    return result
Now, I am not sure whether each node in the tree is a ChartItem, and whether the __getitem__ value is being used somewhere else, but adding these changes:
cdef class ChartItem:
    cdef public str label
    cdef public str root
    cdef public long vec
    cdef int _hash
    __slots__ = ("label", "vec", "_hash")
    def __init__(ChartItem self, label, int vec):
        self.label = intern(label) #.rsplit('@', 1)[0])
        self.root = intern(label.rsplit('@', 1)[0])
        self.vec = vec
        self._hash = hash((self.label, self.vec))
    def __hash__(self):
        return self._hash
    def __richcmp__(ChartItem self, ChartItem other, int op):
        if op == 0: return self.label < other.label or self.vec < other.vec
        elif op == 1: return self.label <= other.label or self.vec <= other.vec
        elif op == 2: return self.label == other.label and self.vec == other.vec
        elif op == 3: return self.label != other.label or self.vec != other.vec
        elif op == 4: return self.label > other.label or self.vec > other.vec
        elif op == 5: return self.label >= other.label or self.vec >= other.vec
    def __getitem__(ChartItem self, int n):
        if n == 0: return self.root
        elif n == 1: return self.vec
    def __repr__(self):
        # would need bitlen for proper padding
        return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])
and, inside mostprobableparse:
from libc.math cimport pow  # note: pow is declared in libc.math, not libc

def mostprobableparse(...):
    ...
    cdef dict parsetrees = <dict>defaultdict(float)  # defaultdict from collections
    cdef float prob
    m = 0
    for n, (a, prob) in enumerate(derivations):
        parsetrees[a] += pow(e, prob)  # e is math.e, imported elsewhere
        m += 1
I get:
189345 function calls (173785 primitive calls) in 0.162 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
6701/1143 0.025 0.000 0.037 0.000 heapdict.py:45(_min_heapify)
1 0.023 0.023 0.120 0.120 plcfrs_cython.pyx:54(parse)
960 0.018 0.000 0.030 0.000 plcfrs_cython.pyx:122(deduced_from)
5190/198 0.011 0.000 0.015 0.000 tree.py:102(__eq__)
6619 0.006 0.000 0.006 0.000 heapdict.py:67(_swap)
9678 0.006 0.000 0.008 0.000 plcfrs_cython.pyx:137(concat)
So the next steps are to optimize _min_heapify and deduced_from.
deduced_from can be optimized a bit more:
cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
    cdef str I = Ih.label
    cdef int Ir = Ih.vec
    cdef list result = []
    cdef dict Cx = <dict>pyCx
    cdef dict unary = <dict>pyunary
    cdef dict lbinary = <dict>pylbinary
    cdef dict rbinary = <dict>pyrbinary
    cdef ChartItem Ilh
    cdef double z
    cdef double y
    cdef ChartItem I1h
    for rule, z in unary[I]:
        result.append((ChartItem(rule[0][0], Ir), ((x+z, z), (Ih,))))
    for rule, z in lbinary[I]:
        for I1h, y in Cx[rule[0][2]].items():
            if concat(rule[1], Ir, I1h.vec, bitlen):
                result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
    for rule, z in rbinary[I]:
        for I1h, y in Cx[rule[0][1]].items():
            if concat(rule[1], I1h.vec, Ir, bitlen):
                result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
    return result
I will stop here, although I am confident that we can keep optimizing as more insight is acquired into the problem.

A series of unit tests would be useful, to assert that each optimization doesn't introduce any subtle errors.
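A minimal sketch of such a test (the module name is taken from the profile output above; the expected values are illustrative):

import unittest
from plcfrs_cython import ChartItem

class TestChartItem(unittest.TestCase):
    def test_getitem_strips_goodman_id(self):
        # __getitem__(0) should return the label without the @id suffix
        self.assertEqual(ChartItem("NP@37", 0b010)[0], "NP")

    def test_richcmp_equality(self):
        # equal label and bitmask must still compare equal after the elif rewrite
        self.assertEqual(ChartItem("S", 0b111), ChartItem("S", 0b111))

if __name__ == '__main__':
    unittest.main()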
As a side note, try to use spaces instead of tabs.