开发者

Fastest C++ Container: Unique Values

I am writing an email application that interfaces with a MySQL database. I have two tables that are sourcing my data, one of which contains unsubscriptions, the other of which is a standard user table. As of now, I'm creating a vector of pointers to email objects, and storing all of the unsubscribed emails in it, initially. I then have a standard SQL loop in which I'm checking to see if the email is not in the unsubscribe vector, then adding it to the global send email vector. My question, is, is there a more efficient way of doing this? I have to search the unsub vector for every single email in my system, up to 50K different ones. Is there a better structure for searching? And, a better structure for 开发者_高级运维maintaining a unique collection of values? Perhaps one that would simply discard the value if it already contains it?


If your C++ Standard Library implementation supports it, consider using a std::unordered_set or a std::hash_set.

You can also use std::set, though its overhead might be higher (it depends on the cost of generating a hash for the object versus the cost of comparing two of the objects several times).

If you do use a node based container like set or unordered_set, you also get the advantage that removal of elements is relatively cheap compared to removal from a vector.


  1. Tasks like this (set manipulations) are better left to what is MEANT to execute them - the database!

    E.g. something along the lines of:

     SELECT email FROM all_emails_table e WHERE NOT EXISTS (
         SELECT 1 FROM unsubscribed u where e.email=u.email
     )
    
  2. If you want an ALGORITHM, you can do this fast by retrieving both the list of emails AND a list of unsubscriptions as ORDERED lists. Then you can go through the e-mail list (which is ordered), and as you do it you glide along the unsubscribe list. The idea is that you move 1 forward in whichever list has the "biggest" current" element. This algo is O(M+N) instead of O(M*N) like your current one

  3. Or, you can do a hash map which maps from unsubscribed e-mail address to 1. Then you do find() calls on that map whcih for correct hash implementations are O(1) for each lookup. Unfortunately, there's no Hash Map standard in C++ - please see this SO question for existing implementations (couple of ideas there are SGI's STL hash_map and Boost and/or TR1 std::tr1::unordered_map).

    One of the comments on that post indicates it will be added to the standard: "With this in mind, the C++ Standard Library Technical Report introduced the unordered associative containers, which are implemented using hash tables, and they have now been added to the Working Draft of the C++ Standard."


Store your email adresses in a std::set or use std::set_difference().


The best way to do this is within MySQL, I think. You can modify your users table schema with another column, a BIT column, for "is unsubscribed". Better yet: add a DATETIME column for "date deleted" with a default value of NULL.

If using a BIT column, your query becomes something like:

SELECT * FROM `users` WHERE `unsubscribed` <> 0b1;

If using a DATETIME column, your query becomes something like:

SELECT * FROM `users` WHERE `date_unsubscribed` IS NULL;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜