开发者

Python Datastructure Suggestion

i am building a small search engine to search a collection of pdfs. From each pdf i extract a set of tokens and store it in database. I do not want to store duplicate tokens in database, instead i want to store count of each token in the database开发者_开发问答. Does python has any special datastructure that do not store duplicates but stores the counts of each token?


Python >=2.7 has the Counter.


I'd suggest to use a simple dictionary to store the count like

storage = {} # initialize
# ...
if !storage.has_key(token):
  storage[token] = 1
else:
  storage[token] += 1

EDIT

That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter class ...


The collections package has defaultdict which can be used as a key-value storage with a counter:

>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]

Just so notice: This is not a databse, it's a pure in memory storage. You would have to save this data somehow!


You could always implement an object for every file, giving it a number of methods, like open and display and etc etc. You could then define __hash__ and __eq__ for the object, this would allow you to store items in a set, causing the duplicates to just update a single instance inside the set.

This is just another way of doing something by no means is it the best method.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜