How to delete entities not found in feed on GAE

I am updating and adding items from a feed (which can have about 40,000 items) to the datastore, 200 items at a time. The problem is that the feed can change, and some items might be deleted from it. I have this code:

class FeedEntry(db.Model):
    name = db.StringProperty(required=True)

def updateFeed(offset, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name)
        )
    db.put(feedEntriesToAdd)

How do I find out which items are no longer in the feed and delete them from the datastore? I thought about building a list of the items in the datastore, removing every item I updated from it, and deleting whatever is left, but that seems rather slow.

PS: Every item.id is unique to its feed item and stays the same between fetches.


If you add a DateTimeProperty with auto_now=True, it will record the last modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
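As a rough illustration of the timestamp idea, here is a minimal sketch that uses a plain dict in place of the datastore. The names store, put_batch and delete_stale are hypothetical, and the explicit now argument stands in for what auto_now=True would record on each put.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the datastore: key_name -> entity dict.
store = {}

def put_batch(items, now):
    """Upsert (id, name) pairs, stamping each with `now`.

    On App Engine the stamp would come from a DateTimeProperty(auto_now=True).
    """
    for item_id, name in items:
        store[item_id] = {"name": name, "modified": now}

def delete_stale(refresh_started):
    """Remove every entity last written before the refresh began."""
    stale = [k for k, e in store.items() if e["modified"] < refresh_started]
    for k in stale:
        del store[k]
    return sorted(stale)

# Previous refresh wrote a, b and c.
t0 = datetime(2011, 1, 1, 12, 0, 0)
put_batch([("a", "Alpha"), ("b", "Beta"), ("c", "Gamma")], t0)

# Next refresh: the feed no longer contains "b".
refresh_started = t0 + timedelta(hours=1)
put_batch([("a", "Alpha"), ("c", "Gamma v2")],
          refresh_started + timedelta(seconds=5))

removed = delete_stale(refresh_started)
```

The important detail is that refresh_started is captured before the first batch is written, so every entity touched during the refresh compares as newer.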

Xavier's generation counter is just as good - all we need is something guaranteed to increase between refreshes, and never decrease during a refresh.

Not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64 bit integer, so they might be the same size, or it may be that DateTimeProperty stores several integers. A group post suggests maybe it's 10 bytes as opposed to 8.

But remember that by adding an extra property that you do queries on, you're adding another index anyway, so the difference in size of the field is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/G/month.
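To put a number on that, here is a back-of-the-envelope calculation. It takes the 10-byte versus 8-byte sizes from the group post at face value and assumes "G" means 10^9 bytes; both are assumptions, and index overhead is ignored.

```python
# All inputs are taken from the discussion above or assumed, not measured.
ITEMS = 40000                 # approximate feed size
DATETIME_BYTES = 10           # size suggested by the group post (unverified)
INTEGER_BYTES = 8             # 64-bit integer
PRICE_PER_GB_MONTH = 0.24     # dollars, assuming 1 GB = 10**9 bytes

# Extra storage if every entity carries a datetime instead of an integer.
extra_bytes = ITEMS * (DATETIME_BYTES - INTEGER_BYTES)

# Monthly cost of that extra storage.
monthly_cost_usd = extra_bytes / 1e9 * PRICE_PER_GB_MONTH

print("extra bytes: %d" % extra_bytes)
print("monthly cost: $%.7f" % monthly_cost_usd)
```

Under these assumptions the difference is 80,000 bytes, which costs a small fraction of a cent per month.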

With either a generation or a datetime, you don't necessarily have to delete the data immediately. Your other queries could instead filter on the date or generation of the most recent refresh. If the feed (or your parsing of it) goes funny and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Whether that is worth having depends entirely on the app.
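A minimal sketch of that "filter instead of delete" idea, using a list of dicts in place of the datastore. The functions current_entries and safe_to_purge, and the min_ratio threshold, are hypothetical names chosen for illustration only.

```python
# Hypothetical in-memory entities; on App Engine these would be FeedEntry rows.
entries = [
    {"id": "a", "generation": 7},
    {"id": "b", "generation": 6},   # missing from the latest refresh
    {"id": "c", "generation": 7},
]

def current_entries(entries, latest_generation):
    """Return only the entities written by the most recent refresh."""
    return [e for e in entries if e["generation"] == latest_generation]

def safe_to_purge(previous_count, new_count, min_ratio=0.5):
    """Hypothetical guard: skip the purge if the feed shrank suspiciously."""
    return previous_count == 0 or new_count >= previous_count * min_ratio

# Readers only see generation 7; "b" stays stored as a backup.
visible = [e["id"] for e in current_entries(entries, 7)]
```

With a guard like this, stale entities are only purged when the new refresh looks plausible, which keeps the previous refresh available if the feed or parser misbehaves.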


I would add a generation counter:

from google.appengine.ext import db

class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    # Number of the refresh that last wrote this entity.
    generation = db.IntegerProperty(required=True)

def updateFeed(offset, generation, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name, generation=generation)
        )
    db.put(feedEntriesToAdd)

def deleteOld(generation):
    # Find entities that the current refresh did not rewrite.
    q = db.GqlQuery("SELECT * FROM FeedEntry " +
            "WHERE generation != :1", generation)
    # Delete the query results, not the generation number.
    # This removes at most 200 entities; call again until none are left.
    db.delete(q.fetch(200))
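To show how the offset and the generation fit together across a whole refresh, here is a minimal sketch that pages through a feed and then sweeps old generations. It uses a dict instead of the datastore, and fetch_page and refresh are hypothetical helpers; on App Engine each batch would more likely run in its own request or task.

```python
def fetch_page(feed, offset, number):
    """Hypothetical stand-in for fetchFeed plus parseFeed: one slice of the feed."""
    return feed[offset:offset + number]

def refresh(store, feed, generation, number=200):
    """Write the whole feed in batches, then drop entities from older generations."""
    offset = 0
    while True:
        page = fetch_page(feed, offset, number)
        if not page:
            break
        for item_id, name in page:
            store[item_id] = {"name": name, "generation": generation}
        offset += number
    # Sweep: anything not rewritten in this generation is gone from the feed.
    stale = [k for k, e in store.items() if e["generation"] != generation]
    for k in stale:
        del store[k]
    return len(stale)

store = {}
feed_v1 = [("item%d" % i, "name%d" % i) for i in range(450)]
refresh(store, feed_v1, generation=1)

# Two items disappear from the feed before the next refresh.
feed_v2 = [pair for pair in feed_v1 if pair[0] not in ("item3", "item250")]
deleted = refresh(store, feed_v2, generation=2)
```

The sweep must only run after the last batch has been written; running it earlier would delete items that simply have not been reached yet.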