Improving performance for string matching
I am working for a startup that is building a iphone app. And i would like to ask a few questions to improve an algorithm we use for string matching.
We have a database that has a huge list of phone numbers along with th开发者_如何学Ce name of the user who owns the phone number. Lets say that the database looks like this
name phonenum
hari 1234
abc 3873
....
This database has large number of rows (around 1 million). When the user opens the app, the app gets the list of phone numbers from the person's phone contacts and matches it with the database. We return all the phone numbers that are present in the database. Right now, what we do is very very inefficient. We send the phone numbers from phone contacts in sets of 20. And we match it with the database. This will lead to a complexity of num of phone contacts * O(n).
I thought of some improvements like having the database rows sorted by phone numbers so that we can do binary search. In addition to that, we can have a hash table containing some 10,000 phone numbers in the cache memory and we can search against this cache memory initially. Only if there is a miss, we will access the database and search the database with complexity of O(log n) using binary search.
Also, there is the issue of sending phone numbers for matching. do i send them as such or send them as a hashed value ? will that matter in terms of improving performance?
Is there any other way of doing this thing?
I explained the whole scenario so that you can have a better understanding of my need
thanks
If you already have an SQL Server database, let it take care of that. Create an index on the phone number column (if you don't have it already). Send all the numbers in the contact list in one go (no need to split them by 20) and match them against the database. The SQL server probably uses much better indexing than anything you could come up with, so it's going to be pretty fast.
Alternatively, you can try to insert the numbers into a temporary table and query against that, but I have no idea whether that would be faster.
If you can represent phone numbers as numeric values instead of strings, then you can put an index on your database field that will make lookup operations very fast. Even if you have to represent them as strings, an index on the database field will be make looking up values fast enough to be a non-issue in the grand scheme of things.
Your biggest performance problem is going to be with all the round trips between the application and your database. That is a performance bottleneck with any web-enabled program. If you are unlikely to have a high rate of success (maybe 2% of the user's contacts are in your database), you'll probably be better off sending the whole list of phone numbers at once, since you'll just be getting data back for a few of them.
If the purpose is to update the user's contact data with the data found in your database, you could create a hash out of the appropriate fields and send that along with the phone number. Have the database keep a hash of those fields on its side and do a comparison. If the hash matches, then you don't have to send any data back because the local and remote versions are the same.
A successful caching strategy would require a good understanding of how the data will be used, so I can't provide much guidance based on the information given. For example, if 90% of the phones using your app will have all of the phone numbers matched in a small group of the numbers in the database, then by all means, put that small group into a Hashtable. But if users are likely to have any phone numbers that aren't in that small group, you're going to have to do a database round-trip. The key will be to construct a query that allows the database to return all of the data you need in one trip.
I'd split the phone number up into three parts
example 777.777.7777
Each section can be stored into and int and used as a hash tag.
This would mean that your data store becomes a series of hash tables.
Or you could force the whole number into an int and then use that as your hash key. But for fast results you'd need more buckets.
Cheers
精彩评论