开发者

Why is this Python method leaking memory?

This method iterate over a list of terms in the data base, check if the terms are in a the text passed as argument, and if one is, replace it with a link to the search page with the term as parameter.

The number of terms is high (about 100000), so the process is pretty slow, but this is Ok since it is performed as a cron job. However, it causes the script memory consumtion to skyrocket and I can't find why:

class SearchedTerm(models.Model):

[...]

@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URLS schemas, we don't
        provides direct links, but we inject the {% url %} tag 
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespace as
        well, we enter a generic 'namespace' and delegate to the 
        template to change it with the proper one as well.

        If you have a batch process to do, you can pass a query set
        that will be used instead of getting all searched term at
        each calls.
    """

    found = 0

    terms = queryset or cls.on_site.all()

    # to avoid duplicate searched terms to be replaced twice 
    # keep a list of already linkified content
    # added words we are going to insert with the link so they won't match
    # in case of multi passes
    processed = set((u'video', u'streaming', u'title', 
                     u'search', u'namespace', u'href', u'title', 
                     u'url'))

    for term in terms:

        text = term.text.lower()

        # no small word and make
        # quick check to avoid all the rest of the matching
        if len(text) < 3 or text not in string:
            continue

        if found and cls._is_processed(text, processed):
            continue

        # match the search word with accent, for any case
        # ensure this is not part of a word by including 
        # two 'non-letter' character on both ends of the word
        pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % text, 
                            re.UNICODE|re.IGNORECASE)

        if re.search(pattern, string):
            found += 1

            # create the link string
            # replace the word in the description 
            # use back references (\1, \2, etc) to preserve the original
            # formatin
            # use raw unicode strings (ur"string" notation) to avoid
            # problems with accents and escaping

            query = '-'.join(term.text.split())
            url = ur'{%% url namespace:static-search "%s" %%}' % query
            replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

            string = re.sub(pattern, replace_with, string)

            processed.add(text)

            if found >= 3:
                break

    return string

You'll probably want this code as well:

class SearchedTerm(models.Model):

[...]

@classmethod
def _is_processed(cls, text, processed):
    """
        Check if the text if part of the already processed string
        we don't use `in` the set, but `in ` each strings of the set
        to avoid subtring matching that will destroy the tags.

        This is mainly an utility function so you probably won't use
        it directly.
    """
    if text in processed:
        return True

    return any(((text in string) for string in processed))

I really have only two objects with references that could be the suspects here: terms and processed. But I can't see any reason for them to not being garbage collected.

EDIT:

I think I should say that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:

class Video(models.Model):

[...]

def update_html_description(self, links=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        description. If they exist, turn them into links to the search
        engine. Put the reset into `html_description`.

        This use `add_search_link_to_text` and has therefor, the same 
        limitations.

        It DOESN'T call save().
    """
    queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
    text = self.description 开发者_C百科or self.title
    self.html_description = SearchedTerm.add_search_links_to_text(text, 
                                                                  links, 
                                                                  queryset)

I can imagine that the automatic Python regex caching eats up some memory. But it should do it only once and the memory consumtion goes up at every call of update_html_description.

The problem is not just that it consumes a lot of memory, the problem is that it does not release it: every calls take about 3% of the ram, eventually filling it up and crashing the script with 'cannot allocate memory'.


The whole queryset is loaded into memory once you call it, that is what will eat up your memory. You want to get chunks of results if the resultset is that large, it might be more hits on the database but it will mean a lot less memory consumption.


I was complety unable to find the cause of the problem, but for now I'm by passing this by isolating the infamous snippet by calling a script (using subprocess) that containes this method call. The memory goes up but of course, goes back to normal after the python process dies.

Talk about dirty.

But that's all I got for now.


make sure that you aren't running in DEBUG.


I think I should say that this method is called inside a Django model method itself.

@classmethod

Why? Why is this "class level"

Why aren't these ordinary methods that can have ordinary scope rules and -- in the normal course of events -- get garbage collected?

In other words (in the form of an answer)

Get rid of @classmethod.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜