Fingerprint tree generation
There is group of people [let's say 1874 of them], all representing different companies [lets say 236 of them] in the world. My task is to best identify what company each person works for. The trick is that I cannot simply ask a person "Where do you work for" and get the answer, but what I do have is a questionnaire with a number of question [lets say 290 questions] and the exact responses I should expect for employees of each company. Some compa开发者_运维百科nies might have identical answers, so at the end, even if I can not determine exactly what company a person works for, I should be able to narrow it down and say that he/she must work for one of these companies.
Using multi-value maps, and some other data structures, I've gone as far as determining all the companies that I can identify with 1 question [query]. Using these queries to represent the root of a tree data-structure, I need to build out the rest of the tree using other queries/questions as branches to identify the rest.
Any advice/help/suggestion ?
Based on your answer in the comments, I feel that you may as well just have each level of your tree represent a question, and the branches/subnodes of the nodes on that level representing the answers. This would technically be a trie, as mentioned by btilly.
A more efficient (though not necessarily space-wise) solution would possibly involve using a hashtable and a hash function that acts on the answer choices to create its hash, but I think a trie is the best way to go given your requirements and the don't-cares.
Oh, right: depending on how the answer choices are laid out, it's possible you may have a series of answers on particular branches where there aren't any sub-branches/trees for a few levels; in such a case, you could potentially collapse those singular branch sections into individual nodes. http://en.wikipedia.org/wiki/Trie#Compressing_tries might also provide some tips.
Based on your response to my initial answer, here's my idea:
Keep an array of nodes for the questions and their answer choices, with each answer choice being associated with a hash table (or whatever data structure you'd wish to use; I suggested a hash table due to using Python a lot and being used to Python's set
data structure, which is implemented as a type of hash table) containing pointers to each company, or a pointer to a single company if a given answer for a given question will indicate the company to begin with.
The first time you check an answer to a specific question, and there are multiple companies associated with that answer choice, make a temporary copy of the data in that first answer's hash table as a linked list or something. As more questions are answered, check the elements of the list against the hash table of each new answer, and remove companies that are not present in each new answer's hash table from the list. Repeat the question-asking process until 1) only one company is left in the list, 2) no companies are left in the list, or 3) you've asked all the questions.
If 1), that is the question-answerer's employer.
If 2), the employee isn't employed by any of the companies to check for, and/or there's an error somewhere.
If 3), the companies remaining in the linked list are the possible companies that question-answerer is employed by.
There's probably a more efficient method of doing this, as my implementation would require a minimum of 580 hash tables (one for each answer, with a minimum of 2 answers per question), but I can't really think of anything right now.
Build the tree recursively, starting at the root. At each step, you'll have an active set of questionnaires, which initially will be all of the questionnaires. From the active questionnaires, select a question that has about as many yes answers as no answers. Make a tree node for this question. Create a yes subtree (recursively), using the subset questionnaires that answered yes for the question you selected at this node. Also create a no subtree, using the subset of questionnaires that answered no for the question you selected.
Simple Example:
Suppose we're trying to guess the animal, and we have questionnaires from a bear, a zebra, a salmon, and a crocodile.
We look at the questionnaires and see that about half of them said "yes" to "Are you a mammal?", so we'll make that the root of the tree.
Now we take just the questionnaires that said yes to that question. In our example, they are the ones from the bear and the zebra. We select the question "Do you have stripes?", since about half of them say yes and half say no. Since there's only one questionnaire for each of those answers, you create leaf nodes that guess zebra and bear appropriately.
Now we backtrack to the root node and repeat the process for the "no" branch. That is, we look at the questionnaires for the salmon and the crocodile and select a question that distinguishes that set into separate groups. "Do you like to smile?" fits the bill.
The final tree looks like this:
Ask: "Are you a mammal?"
|
+- yes -> Ask: "Do you have stripes?"
| |
| +- yes -> Guess: Zebra
| |
| +- no --> Guess: Bear
|
+- no --> Ask: "Do you like to smile?"
|
+- yes -> Guess: Crocodile
|
+- no --> Guess: Salmon
Building an Expert system using Prolog is one possible solution. Have you considered this option ?
By doing so, you may even add some Natural Language Processing capabilities easing interaction with users.
精彩评论