C : Store and index a HUGH amount of info! School Project
i need to do a school project in C (I'm really don't know c++开发者_开发知识库 as well).
I need a data struct to index each word of about 34k documents, its a lot of words, and need to do some ranking about the words, i already did this project about 2 years ago (i'm pause in the school and back this year) and a i use a hash table of binary tree, but i got a small grade cause my project took about 2hours to index all words. I need something a little fast... any sugestions?
Tkz Roberto
If you have the option, I'd strongly recommend using a database engine (MSSQL, MySQL, etc.) as that's exactly the sort of datasets and operations these are written for. Best not to reinvent the wheel.
Otherwise, why use a btree at all? From what you've described (and I realise we're probably not getting the full story...) a straight up hash table with the word as a key and its rank/count of occurences should be useful?
bogofilter (the spam filter) has to keep word counts. It uses dbm as a backend, since it needs persistent storage of the word -> count map. You might want to look at the code for inspiration. Or not, since you need to implement the db part of it for the school project, not so much the spam filter part.
Minimize the amount of pointer chasing you have to do. Data-dependent memory-load operations are slow, esp. on a large working set where you will have cache misses. So make sure your hash table is big enough that you don't need a big tree in each bucket. And maybe check that your binary trees are dense, not degenerate linked lists, when you do get more than one value in a hash bucket.
If it's slow, profile it, and see if your problem is one slow function, or if it's cache misses, or if it's branch mispredictions.
精彩评论