How to perform a sorting according to rules but with repetition of items to solve circular references?
To explain in a clearer way my question I will start by explaining the real-life case I am facing.
I am building a physical panel with many words on it that can be selectively lit, in order to compose sentences. This is my situation:
- I know all the sentences that I want to display
- I want to find out [one of] the shortest set of ORDERED words that allows me to display all the sentences
Example:
SENTENCES:
"A dog is on t开发者_如何学编程he table"
"A cat is on the table"
SOLUTIONS:
"A dog cat is on the table"
"A cat dog is on the table"
I tried to approach this problem with "positional rules" finding for each UNIQUE word in the set of ALL the words used in ALL the sentences, what words should be at the left or at the right of it. In the example above, the ruleset for the "on" word would be "left(A, dog, cat, is) + right(the, table).
This approach worked for trivial cases, but my real-life situation has two additional difficulties that got me stuck and that have both to do with the need for repeating words:
- In-sentence repetitions: "the cat is on the table" has two "the".
- Circular references: In a set of three sentences "A red cat" + "My cat is on the table" + "That table is red", the rules would state that RED should be at the left of CAT, CAT should be at the left of TABLE and TABLE should be at the left of RED.
MY QUESTION THEREFORE IS:
What is the class of algorithms (or even better: what is the specific algorithm) that studies and solves this kind of problems? Could you post some reference or a code example of it?
EDIT: Level of complexity
From the first round of answers it appears the actual level of complexity (i.e. how different are the sentences one from the other) is an important factor. So, here comes some info on that:
- I have about 1500 sentences I want to represent.
- All of the sentences are essentially modifications of a restricted pool of ~10 sentences where only a few words change. Building on the previous example, it's a bit like all my sentences would speak about either "somebody's pet's position relative to a piece of furniture" or "a physical description of somebody's furniture".
- The number of unique words used to build all the sentences is <100.
- Sentences are 8 words long at most.
For this project I am using python, but any language reasonably readable (eg: NOT obfuscated perl!) will be fine.
Thank you in advance for your time!
If I understand you correctly, this is equivalent to the shortest common supersequence problem. This problem is NP-complete, but there exists approximation algorithms. Google turns up a few papers, including this one.
The problem can be solved with a simple DP algorithm in the case of two input sequences, but this doesn't generalize to multiple sequences since each sequence essentially requires you to add a dimension to the DP table which results in the exponential blow-up.
I'm a bioinformatician, and this sounds like it could be solved by doing a global multiple sequence alignment of all the sentences with infinite mismatch penalties (i.e. disallow mismatches entirely) and modest gap penalties (i.e. allow gaps, but prefer fewer gaps), and then reading off the gapless consensus sequence.
If this formulation is equivalent to your problem, then that means your problem is indeed NP-complete, since multiple sequence alignment is NP-complete, although there are many heuristic algorithms that run in reasonable time. Unfortunately, most MSA algorithms are designed to work on characters of DNA or protein sequences, not words of English.
Example
Here is an example of the kind of alignment that I describe, using the set of three sentences given by the OP. I don't know if the alignment that I give is optimal, but it is one possible solution. Gaps are indicated by a series of dashes.
Sentence 1: ---- -- A red cat -- -- --- ----- -- --- Sentence 2: ---- My - --- cat is on the table -- --- Sentence 3: That -- - --- --- -- -- --- table is red Consensus: That My A red cat is on the table is red
One advantage of this method is that the alignment not only gives you the full sequence of words, but shows which words belong in which sentences.
I have no actual proof, but I would be very surprised if your problem was not either NP-complete or NP-hard. I can find an algorithm which is guaranteed to give the best answer. But with even a small number of sentences of reasonable length (eg 16 sentences with 8 words each) it will bog down and not run in any reasonable time. Therefore I would advise you not to look for exact answers, but instead to look for good heuristics that give reasonable algorithms.
The heuristic that I would suggest would be to take pairs of sentences, solve the http://en.wikipedia.org/wiki/Longest_common_subsequence_problem for them (a solution for which is in the Python standard library, look up difflib), resolve the conflicted sections randomly, and replace the pair with that one sentence. Do this over all of your sentences, and you'll have an OK solution. Not a great one, but an OK one.
If you want to get clever, you could look for incremental improvements, run this multiple times, try to be smart about how you randomly resolve those differences, etc. But the simple approach will probably give reasonably good answers without too much effort.
After a week of so of coding (this is hobby project) I decided to answer my own question. This is not because the answers I previously got weren't good enough, but rather because I used the three of them to code the solution I wanted, and it felt wrong to give credit to only one of the responders, as I truly used input by the three of them to come up with a satisfactory solution.
Let's start from the end: the heuristic I came up gives very satisfactory results (at least for the real-life case I am using it for). I had 1440 sentences with an average of ~6 words each and using a set of 70 unique words. The program takes about 1 minute to run and provides me with a supersequence of just 76 words (10% more than the "physical" [but not "logical"] lower limit of the problem).
The heuristic is really tailored around my real-life case, in particular around the fact that most of the sentences are constructed around 10 or so "prototypes" (see point #2 of my edit in the question) and is composed of four successive steps:
- Isomorphic shrinking
- Similarity shrinking
- Coarse redundancy optimisation
- Fine redundancy optimisation
Isomorphic shrinking
I defined as "isomorphic" two sentences A and B such than transforming A in B and B in A would require exactly the same steps. Example:
'A dog is on the carpet'
'A cat is under the table'
The transformation always require to change words in position 1, 3, 5 of the first string with words in position 1, 3, 5 of the second.
Grouping sentences in "isomorphic families" allows to easily create optimised superstrings by simply inserting in the common root "A __ is __ the __"
the list of variable elements for position 1, 3, 5.
Similarity shrinking
At this stage of the process the number of sentences has dramatically lowered (there is normally a mixture of about 50 supersequences from isomorphic families and orphan sentences that were not isomorphic to any other in the pool). The program analyses this new pool and merges together the two most similar strings, then repeating the procedure on the new set composed by the result of the merge and the strings that haven't been merged yet, and so on until everything has been merged into one supersequence.
Coarse redundancy optimisation
At this stage we already have a valid supersequence to all the original 1440 sentences, but such supersequence is grossly suboptimal as many terms are repeated without the need for it. This optimisation step remove the bulk of the redundant terms by simply using the supersequence to formulate all the sentences, and then removing from it all the terms that haven't been used.
Fine redundancy optimisation
The result of the previous optimisation is pretty good already, but sometimes is possible to trim out a word of two via this last step. The optimisation here works by finding words that are repeated more than once, and checking if it possible to make two successive repetitions to converge towards the same location in the supersequence and still formulate all the sentences. Example: given the supersequence
aaa xxx bbb ccc ddd eee xxx fff
the program will try to shift the two xxx
words towards each other:
aaa bbb xxx ccc ddd eee xxx fff
aaa bbb xxx ccc ddd xxx eee fff
aaa bbb ccc xxx ddd xxx eee fff
if the supersequence reaches:
aaa bbb ccc xxx xxx ddd eee fff
and no sentence uses both occurrences of xxx
at the same time, then one of the two xxx
can go.
Of course this last passage could be optimised by shifting xxx
of more than one position at a time (think of shell sorting vs. bubble sorting), but the general structure of the program is such, and the gain in speed so little, that I preferred this "less optimised" procedure, as this "shifting procedure" is used elsewhere too.
Again, many thanks to all the original responder: your input was paramount to make me think of this solution. Your comments to this solution are also very welcomed!
FINAL NOTE: As soon as the program will be complete / the project finished (a couple of months at worst), I will release it under GPL and add the link to the code repo in the original question. I believe that marking the question as "favourite" should notify the marker of any edit.... just in case you are interested in the actual code, of course!
精彩评论