Algorithm problem: select two stories per topic so that the same story is never selected for two different topics
In my workplace I have stumbled upon the following problem that I am asked to solve. A solution is preferred although not absolutely required.
There is a database with a set of stories, and each story has a set of topics associated with it. The topics are stored in a separate table of the form (storyid, topicid).
The question is how to select ideally 5 topics (or at least 2, if 5 is impossible) such that each topic has 2 stories (or 1, if 2 is impossible) that are not repeated in any of the other selected topics. The algorithm must also return which exactly are the "proper" stories associated with each topic.
Is this actually an NP-complete problem that has no efficient solution that goes beyond simple enumeration of all possibilities or does it have an efficient solution?
I开发者_如何学编程f it does not have an efficient solution, please try to prove it (although not absolutely necessary).
If it does have an efficient solution, please be kind and post it.
A more general version of the problem is to select for all topics (or at least as many as possible) two stories so that the same story is never selected for two different topics.
Mark the stories with S1...Sm, and the topics with T1...Tn. Duplicate each topic, that is, introduce the new stories T'1...T'n, where T'i contains Sj if and only if Ti contains it. The problem can now be rephrased this way: select a different story for all topics (or as many as possible). Once you have your topic-story pairs, you can join the duplicated topics again, and each topic will have two stories.
The problem of finding the largest number of pairs possible while never selecting any element twice is called the maximum bipartite matching problem. (You can think of it as selecting the maximum number of non-connected edges from a bipartite graph.) There is a simple algorithm called augmenting path which solves it in O((m+n)e) steps (e being the number of edges) and some more sophisticated ones (such as the Hopcroft–Karp algorithm) which solve it in around O((m+n)^2.5) steps. The augmenting path algorithm consists of searching for "alternating" paths in the graph where the first edge of the path is not selected, the second is, the third isn't and so on, than inverting the selections on the path. This can be probably adapted to your case without actually doing the splitting and joining of topics.
This is a bit of an overkill because it will return two stories for all the topics, not only five; you can probably do a lot better when you only need to find stories for a limited number of topics. (Except for some edge cases, you can just select the five topics which have the largest number of stories, discard the stories not contained by them, and run the algorithm then.) At any rate it shows that the problem is far from being NP-hard.
Suppose we have a SQL table TopicStory that relates TopicID and StoryID, with that pair of columns forming a compound primary key.
The goal is to find a certain kind of set of TopicID's. At most five in a set are required by the Question posted. A depth-first search for these is outlined below.
With the depth of search bounded at five, the stated problem cannot be worse than polynomial complexity. However a generalized problem that asks for the largest set of topics that could be found with like constraints (each topic chosen has at least two stories not related to any of the other chosen topics) is probably NP-complete.
The use of the word "search" suggests a backtracking algorithm. Below we have effected backtracking through nested loops, where each loop is defined and parameterized by the loops that are outer to it.
Before we give the gritty details, it may be motivational to describe a "brute force" approach, next to which the more complicated approach may be more easily appreciated.
BRUTE_FORCE:
Generate all possible subsets of five topics.
Test each of these sets for feasibility (each topic has
at least two stories unrelated to any of the other topics).
Our sketch of depth-first search assumes topics have a total ordering, e.g. ordered by integer values for TopicID. This allows sets of topics to be generated without repetition (due to permutation of topics).
NESTED_LOOPS:
(Loop_1) Select into List_1 all topics with at least two stories.
Iterate through List_1, choosing the first topic %T1%.
PASS control into Loop_2.
CONTINUE in Loop_1.
If the end of List_1 is reached, EXIT with failure.
(Loop_2) Select into List_2 all topics > %T1% with at least two
stories unrelated to %T1%.
Iterate through List_2, choosing the second topic %T2%.
If topic %T1% still has at least two stories unrelated
to %T2%, PASS control into Loop_3.
CONTINUE in Loop_2.
If the end of List_2 is reached, go BACK to Loop_1.
(Loop 3) Select into List_3 all topics > %T2% with at least two
stories unrelated to %T1% or %T2%.
Iterate through List_3, choosing the third topic %T3%.
If topic %T1% still has at least two stories unrelated
to %T2% or %T3%,
and topic %T2% still has at least two stories unrelated
to %T1% or %T3%, PASS control into Loop_4.
CONTINUE in Loop_3.
If the end of List_3 is reached, go BACK to Loop_2.
(Loop 4) Select into List_4 all topics > %T3% with at least two
stories unrelated to %T1%, %T2%, or %T3%.
Iterate through List_4, choosing the fourth topic %T4%.
If topic %T1% still has at least two stories unrelated
to %T2%, %T3%, or %T4%,
and topic %T2% still has at least two stories unrelated
to %T1%, %T3%, or %T4%,
and topic %T3% still has at least two stories unrelated
to %T1%, %T2%, or %T4%, PASS control into Loop_5.
CONTINUE in Loop_4.
If the end of List_4 is reached, go BACK to Loop_3.
(Loop 5) Select into List_5 all topics > %T4% with at least two
stories unrelated to %T1%, %T2%, %T3%, or %T4%.
Iterate through List_5, choosing the fifh topic %T5%.
If topic %T1% still has at least two stories unrelated
to %T2%, %T3%, %T4%, or %T5%,
and topic %T2% still has at least two stories unrelated
to %T1%, %T3%, %T4%, or %T5%,
and topic %T3% still has at least two stories unrelated
to %T1%, %T2%, %T4%, or %T5%,
and topic %T4% still has at least two stories unrelated
to %T1%, %T2%, %T3%, or %T5%, EXIT with success
returning five topics %T1%, %T2%, %T3%, %T4%, and %T5%.
CONTINUE in Loop_5.
If the end of List_5 is reached, go BACK to Loop_4.
The use of "select" at the opening of each nested loop is meant to evoke the possibility of SQL queries to implement much of the logic. For example the outermost loop is basically just getting the result set for this query:
SELECT TS1.TopicID, Count(*)
From TopicStory TS1
Group By TS1.TopicID
Having Count(*) > 1
The corresponding lists of the inner loops can be constructed similarly by SQL queries depending on parametric values of topics chosen in the outer loops. To illustrate without unnecessary repetition let's jump right to the innermost loop and give an appropriate query for List_5:
SELECT TS5.TopicID, Count(*)
From TopicStory TS5
Where TS5.TopicID > %T4%
and NOT EXISTS ( SELECT *
From TopicStory TSX
Where TSX.TopicID in (%T1%,%T2%,%T3%,%T4%)
and TSX.StoryID = TS5.StoryID
)
Group By TS5.TopicID
Having Count(*) > 1
This would be followed by checking that %T5% in List_5 produces a count of at least two stories left for topic %T1%:
SELECT Count(*)
From TopicStory TZ1
Where TZ1.TopicID = %T1%
and NOT EXISTS ( SELECT *
From TopicStory TX1
Where TX1.StoryID = TZ1.StoryID
and TX1.TopicID in (%T2%,%T3%,%T4%,TS5.TopicID)
)
and mutatis mutandi for the other prior topic choices.
Although it might slow performance unacceptably, the additional logic for restricting topics related to %T5% (so that earlier topic choices still retain at least two story choices) could be pushed into one query. It would look like this:
/*
Given %T1%, %T2%, %T3$, and %T4% from queries above, find all topics %T5% > %T4%
with at least 2 stories not related to %T1%, %T2%, %T3%, or %T4% and such that
%T1% still has at least 2 stories not related to %T2%, %T3%, %T4%, or %T5% and
%T2% still has at least 2 stories not related to %T1%, %T3%, %T4%, or %T5% and
%T3% still has at least 2 stories not related to %T1%, %T2%, %T4%, or %T5% and
%T4% still has at least 2 stories not related to %T1%, %T2%, %T3%, or %T5%
*/
SELECT TS5.TopicID, Count(*)
From TopicStory TS5
Where TS5.TopicID > %T4%
and NOT EXISTS ( SELECT *
From TopicStory TSX
Where TSX.TopicID in (%T1%,%T2%,%T3%,%T4%)
and TSX.StoryID = TS5.StoryID
)
and ( SELECT Count(*)
From TopicStory TZ1
Where TZ1.TopicID = %T1%
and NOT EXISTS ( SELECT *
From TopicStory TX1
Where TX1.StoryID = TZ1.StoryID
and TX1.TopicID in (%T2%,%T3%,%T4%,TS5.TopicID)
)
) > 1
and ( SELECT Count(*)
From TopicStory TZ2
Where TZ2.TopicID = %T2%
and NOT EXISTS ( SELECT *
From TopicStory TX2
Where TX2.StoryID = TZ2.StoryID
and TX2.TopicID in (%T1%,%T3%,%T4%,TS5.TopicID)
)
) > 1
and ( SELECT Count(*)
From TopicStory TZ3
Where TZ3.TopicID = %T3%
and NOT EXISTS ( SELECT *
From TopicStory TX3
Where TX3.StoryID = TZ3.StoryID
and TX3.TopicID in (%T1%,%T2%,%T4%,TS5.TopicID)
)
) > 1
and ( SELECT Count(*)
From TopicStory TZ4
Where TZ4.TopicID = %T4%
and NOT EXISTS ( SELECT *
From TopicStory TX3
Where TX3.StoryID = TZ3.StoryID
and TX3.TopicID in (%T1%,%T2%,%T3%,TS5.TopicID)
)
) > 1
Group By TS5.TopicID
Having Count(*) > 1
The feature set of MySQL is a work in progress, so conceivably an efficient implementation in stored procedures is possible, where cursors can take the role of topic lists. However I'd be confident of good performance if the "cursors" are externally managed lists (e.g. in PHP) and database queries are kept as simple as possible.
Try adapting this to suit your needs:
SELECT topic, story
FROM story_topic
WHERE story IN (SELECT story FROM story_topic GROUP BY story HAVING COUNT(*) = 1);
The key here is to know what stories only occur in only one topic. You might want to pre-compute the number of topics to eradicate the subselect.
How about this? (If I understood your question)
(I haven't actually run it - just a thought - so... could be errors, or I could have blatantly missed something. But - for the moment, my tired head thinks it would work :)
$num_topics = 5;
$stories_per = 5;
$stories = array(); //array to store story ids
//select 5 topics
$query = mysql_query("SELECT * FROM topics ORDER BY RAND() LIMIT ".$num_topics);
//run repeat as many times as you want stories
for($i=0; $i<$stories_per; $i++) {
//repeat through each selected topic
while($row = mysql_fetch_array($query)) {
$q_addon = "";
foreach($stories as $value) {
$q_addon .= "id <> '".$value."' AND ";
}
//find a story not yet chosen for each topic
$q = mysql_query("SELECT storyid FROM stories_topics WHERE ".$q_addon." topicid='".$row['id']."' LIMIT 1");
//add that id to your $stories array
$tmp_id = mysql_result($q,0,'storyid');
array_push($stories, $tmp_id);
}
}
if you wona choose similar 5 topics similar to different stories as I understood: so you can make a join for the 2 tables and use top 5 in your query with where condition topic title="topic you want "
if that doesn't help kindly make it clear for me >>>
精彩评论