
I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.

The question went something like this -

Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?

Why is using Maps a bad idea for such a scenario?

What changes would you make to sell your solution to Amazon?


A map is a good choice when insert/remove/lookup operations are interleaved: every operation is amortized O(log n).

In your example you are only doing search operations. You can assume that database updates (inserting/removing a product) won't happen very often, so the interviewer probably wants you to pick the best data structure for lookup operations.

In this case I can see only a few candidates, as already proposed in other answers:

  • Sorted array (doing a binary search)
  • Hash map
  • Trie

With a trie, if product IDs do not share a common prefix, there is a good chance of finding the product description by looking only at the first character of the ID (or only the very first characters). For instance, take this product ID list, with 125 products:

  • "1"
  • "2"
  • "3"
    ...
  • "123"
  • "124"
  • "1234567"

Let's assume you are looking for the product with ID "1234567" in your trie, looking only at the first characters: "1", then "2", then "3", then "4" leads to the right product description. No need to read the rest of the ID, as there is no other possibility. Considering the product ID length as n, your lookup will be in O(n), but as the example above shows, it can be even faster to retrieve the product description.

Since the product ID is limited in size (20 characters), the trie height is limited to 20 levels. That means you can consider the lookup to never exceed a constant amount of work, as your search will never go deeper than the trie height => O(1). Meanwhile any BST lookup is at best amortized O(log N), N being the number of items in your tree.
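A minimal sketch of such a decimal trie in Java (class and field names are illustrative, not from the answer): each node has ten child slots, one per digit, and the description is stored at the node where an ID ends.

```java
// Sketch of a trie keyed on decimal product IDs, assuming IDs are
// strings of digits '0'-'9' of length at most 20.
class DigitTrie {
    private static class Node {
        Node[] children = new Node[10]; // one slot per decimal digit
        String description;             // non-null only where an ID ends
    }

    private final Node root = new Node();

    void put(String productId, String description) {
        Node n = root;
        for (char c : productId.toCharArray()) {
            int d = c - '0';
            if (n.children[d] == null) n.children[d] = new Node();
            n = n.children[d];
        }
        n.description = description;
    }

    String get(String productId) {
        Node n = root;
        for (char c : productId.toCharArray()) {
            n = n.children[c - '0'];
            if (n == null) return null; // unknown ID
        }
        return n.description;
    }
}
```

Because the walk stops after at most 20 nodes regardless of how many products are stored, the lookup cost is bounded by a constant, as argued above.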

A hashmap, by contrast, could give you slower lookups: you need to compute an index with a hash function that probably reads the whole product ID, plus walk a list in case of collisions with other product IDs.

With a binary search on a sorted array, lookup performance will depend on the number of items in your database.


A B-Tree in my opinion. Does that still count as a Map?

Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.


Consecutive integer IDs are a perfect fit for a hash map, with one problem: it does not provide multithreaded access by default. Also, since Amazon was mentioned in your question, you probably need to take concurrency and RAM-limitation issues into account.

What you might do in response to such a question is to explain that, since you are disallowed from using any built-in data storage scheme, all you can do is "emulate" one.

So, let's say you have M = 10^20 products with their numbers and descriptions. You can partition this set into subsets of N elements each. Then you have M/N containers, each with a significantly reduced number of elements. Applying this idea recursively gives you a way to store the whole set in containers with the property that accessing them has acceptable performance.

To illustrate this idea, consider a smaller example of only 20 elements. Imagine a file system with directories "1", "2", "3", "4". In each directory you store the product descriptions as files in the following way:

folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20

Then your search needs only two steps to find the file. First, you find the correct folder by dividing the ID by 5 (your N). Then, you use the ID to read the product description stored in the corresponding file.
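The two-step lookup can be sketched with a bit of arithmetic. This is an illustration only: the bucket size N = 5 and the helper names are my own, and IDs are assumed to run 1..M as in the example.

```java
// Sketch of the folder/file scheme above: map an ID to its bucket
// ("folder") and its position inside the bucket ("file").
class BucketScheme {
    static int bucketOf(long id, long n) {
        return (int) ((id - 1) / n) + 1;   // folder "1", "2", ...
    }

    static long slotOf(long id, long n) {
        return (id - 1) % n + 1;           // file position inside the folder
    }
}
```

For example, ID 17 with N = 5 lands in folder 4, matching the layout above where folder 4 holds files 16 to 20.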

This is just a very rough description, however, the idea is very intuitive. So, perhaps this is what your interviewer wanted to hear.

As for myself, when I face such questions in an interview, even if I fail to answer correctly (which is the worst case :)), I always try to get the correct answer out of the interviewer.


Best/efficient for what? Would have been my answer.

E.g. for storing them, probably the fastest thing to do is two arrays with 20 elements each: one for the IDs, one for the descriptions. Iterating over those is pretty fast too, and it is memory-efficient.
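A minimal sketch of the two-parallel-array idea (method and variable names are illustrative; IDs are kept as strings since 20 decimal digits can overflow a long):

```java
// Linear scan over two parallel arrays: ids[i] describes descriptions[i].
class ParallelArrays {
    static String findDescription(String[] ids, String[] descriptions,
                                  String wanted) {
        for (int i = 0; i < ids.length; i++) {
            if (ids[i].equals(wanted)) return descriptions[i]; // match found
        }
        return null; // not found
    }
}
```

For 20 entries the scan touches at most 20 elements, so the lack of any index structure costs essentially nothing.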

Of course the solution is pretty useless for any real application, but so is the question.


There is an interesting alternative to B-Tree: Radix Tree


I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.

If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.

You could use the product ID as an address, casting it to a pointer, and then read that byte, which might be an offset into another memory region holding the actual data.

I wouldn't do it this way, but perhaps that is the answer they were looking for.

Asaf


I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
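A rough sketch of such an inverted index in Java, using a TreeMap (a red-black tree) to stand in for the "binary tree" keyword index mentioned above. The class and method names are illustrative, not a known API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Inverted index: each keyword in a description maps to the list of
// product IDs whose descriptions contain that keyword.
class InvertedIndex {
    private final TreeMap<String, List<String>> index = new TreeMap<>();

    void addProduct(String id, String description) {
        for (String word : description.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
        }
    }

    List<String> idsFor(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(),
                                  Collections.emptyList());
    }
}
```

A query like "toaster" then returns every product whose description mentions that word, which is the "reverse lookup" use case described above.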

Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.


Hashmaps work really well if the hash function gives you a very uniform distribution of hash values over the existing keys. With a really bad hash function it can happen that the hash values of your 20 keys are all the same, which pushes retrieval time to O(n). Binary search, on the other hand, guarantees O(log n), but inserting data is more expensive.

All of this is a matter of degree: the bigger your dataset, the lower the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on small datasets the difference between O(n) and O(log n) is not much to worry about.


If the size is limited sometimes it's faster to use a sorted list.

When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then call equals on all elements in the bucket. So it all adds up.

On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
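A sketch of that sorted-list approach (the product IDs below are made up for illustration): sort once with Collections.sort, then look elements up with Collections.binarySearch, which returns a non-negative index when the element is found.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sorted ArrayList with binary-search lookup: O(n log n) to sort once,
// O(log n) per lookup afterwards.
class SortedLookup {
    private final List<String> ids = new ArrayList<>();

    void addAllAndSort(List<String> productIds) {
        ids.addAll(productIds);
        Collections.sort(ids); // must be sorted before binarySearch
    }

    boolean contains(String id) {
        return Collections.binarySearch(ids, id) >= 0;
    }
}
```

Note that binarySearch's result is undefined if the list is not sorted, so any later insertions must keep the list in order.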

But as Artyom has pointed out maybe a simple linear search would be much faster in this case.

On the other hand, from a maintainability point of view, I would normally use a HashMap (or LinkedHashMap) here, and would only do something special when a profiler told me to. Also, collections of 20 have a tendency to become collections of 20,000 over time, and all this optimization would be wasted.


There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.

Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product IDs, whose format is 20 decimal characters"... because if there are only 20 of them, there's no value in considering the algorithm. If you can't use hashing or B-trees, code a linear search and move on. If you like, sort your array and use a binary search.

But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)

Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.


20 decimal PRODUCT IDs, along with Product Description

Simple linear search would be very good...

I would create one simple array with the IDs, and another array with the descriptions.

Linear search over a small number of keys (20!) is much more efficient than any binary tree or hash.


I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.

They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.


Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.


I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.

If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.

Unfortunately 2^64 ≈ 1.8 × 10^19, just slightly below 10^20. Coincidence?

log2(10^20) = 66.43.

Here's a slightly evil proposal.

OK, 2^64 bits can fit inside a memory space.

Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?) Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.

Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.

Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)

Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.

Of course paging will start to be a bitch at some point!

Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.

